# SPACY
https://spacy.io/

In [1]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
doc=nlp(u'Tesla is looking at buying U.S. startup for $6 million') #unicode string

In [3]:
for token in doc:
    print(token.text,token.pos_,token.pos)#pos_ gives the part of speech

Tesla PROPN 96
is AUX 87
looking VERB 100
at ADP 85
buying VERB 100
U.S. PROPN 96
startup VERB 100
for ADP 85
$ SYM 99
6 NUM 93
million NUM 93


 Pipeline

When we run nlp, our text enters a processing pipeline that first breaks down the text and then performs a series of operations to tag, parse and describe the data. Image source:https://spacy.io/usage/spacy-101#pipelines

In [4]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x2357c0c6880>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2357c0c6b20>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2357bfd1eb0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2357c269100>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2357c25ff80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2357bfd1dd0>)]

In [5]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Tokenization

The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information.

In [6]:
doc2 = nlp(u"Tesla isn't looking into startups anymore.")

In [7]:
for token in doc2:
    print(token.text,token.pos_)

Tesla PROPN
is AUX
n't PART
looking VERB
into ADP
startups NOUN
anymore ADV
. PUNCT


In [8]:
doc2[0].pos_

'PROPN'

In [9]:
# Lemmas (the base form of the word):
print(doc2[3].text)
print(doc2[3].lemma_)

looking
look


Spans

Large Doc objects can be hard to work with at times. A span is a slice of Doc object in the form Doc[start:stop].

In [10]:
doc3 = nlp(u'Although commmonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [11]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [12]:
type(life_quote)

spacy.tokens.span.Span

Sentences

Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through Doc.sents.We can write our own segmentation rules.

In [13]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [14]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [15]:
doc4[6].is_sent_start

True

In [16]:
doc4[8].is_sent_start #it means that sentence doesn't starts from it

False

# Tokenization
The first step in creating a Doc object is to break down the incoming text into component pieces or "tokens"

In [17]:
#string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [18]:
# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

Prefixes, Suffixes and Infixes¶

spaCy will isolate punctuation that does not form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token

In [19]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t,"    ",t.pos_)

We      PRON
're      AUX
here      ADV
to      PART
help      VERB
!      PUNCT
Send      VERB
snail      NOUN
-      PUNCT
mail      NOUN
,      PUNCT
email      NOUN
support@oursite.com      X
or      CCONJ
visit      VERB
us      PRON
at      ADP
http://www.oursite.com      X
!      PUNCT


In [20]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


Exceptions

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [21]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [22]:
len(doc4) #count of tokens

11

Counting Vocab Entries

Vocab objects contain a full library of items!

In [23]:
len(doc4.vocab)

843

Tokens can be retrieved by index position and slice

Doc objects can be thought of as lists of token objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [24]:
doc5 = nlp(u'It is better to give than to receive.') #No reassingments can haapen when doc is created

# Retrieve the third token:
doc5[2]

better

In [25]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give

In [26]:
# Retrieve the last four tokens:
doc5[-4:]

than to receive.

Named Entities

Going a step beyond tokens, named entities add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the ents property of a Doc object.

In [27]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [28]:

for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [29]:
len(doc8.ents)

3

Noun Chunks  https://spacy.io/usage/linguistic-features#noun-chunks

Similar to Doc.ents, Doc.noun_chunks are another object property. Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun

In [30]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [31]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk)

Red cars
higher insurance rates


Built-in Visualizers

spaCy includes a built-in visualization tool called displaCy. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

In [32]:
from spacy import displacy

In [33]:
doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})

In [34]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
#returns the entity with its type
displacy.render(doc, style='ent', jupyter=True)

# Stemming

Stemming is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the stem for [boat, boater, boating, boats].

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision here. We discuss the virtues of lemmatization in the next section.

Instead, we'll use another popular NLP tool called nltk, which stands for Natural Language Toolkit. For more information on nltk visit https://www.nltk.org/

In [35]:
# Import the toolkit and the full Porter Stemmer library
import nltk

from nltk.stem.porter import PorterStemmer

In [36]:
p_stemmer = PorterStemmer()

In [37]:
words = ['run','runner','running','ran','runs','easily','fairly']

In [38]:
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli


Also, the adverbs "easily" and "fairly" are stemmed to the unusual root "easili" and "fairli"



In [39]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [40]:
words = ['run','runner','running','ran','runs','easily','fairly']

In [41]:
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


Stemming has its drawbacks. If given the token saw, stemming might always return saw, whereas lemmatization would likely return either see or saw depending on whether the use of the token was as a verb or a noun

In [42]:
phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I --> i
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet


In [43]:
for word in phrase.split():
    print(word+' --> '+s_stemmer.stem(word))

I --> i
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet


# Lemmatization

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma

In [44]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 4690420944186131903 	 I
am 	 AUX 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 4690420944186131903 	 I
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
I 	 PRON 	 4690420944186131903 	 I
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


In [45]:
def show_lemmas(text): #formatting text to see a proper output
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [46]:
doc2 = nlp(u"I saw eighteen mice today!")

show_lemmas(doc2)

I            PRON   4690420944186131903    I
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !


In [47]:
doc4 = nlp(u"That's an enormous automobile")

show_lemmas(doc4)

That         PRON   4380130941430378203    that
's           AUX    10382539506755952630   be
an           DET    15099054000809333061   an
enormous     ADJ    17917224542039855524   enormous
automobile   NOUN   7211811266693931283    automobile


# Stop Words

Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered from the text to be processed. spaCy holds a built-in list of some 326 English stop words.

In [48]:
# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)

{'eight', 'three', '‘d', 'just', 'due', 'sometime', 'thru', 'put', 'twenty', 'over', 'top', 'nor', 'before', 'seeming', 'until', 'whereby', 'nobody', 'so', 'thereby', 'least', 'move', 'everywhere', 'why', 'again', 'otherwise', 'eleven', 'sixty', 'among', 'down', 'therein', 'ten', 'onto', 'make', 'other', 'indeed', 'first', 'using', 'somewhere', 'unless', 'you', 'had', 'whom', 'amongst', 'even', 'could', 'always', 'whenever', 'up', 'everything', 'get', 'hereafter', 'nevertheless', "'ll", 'anyhow', 'several', 'doing', 'together', 'everyone', '’m', 'it', 'afterwards', 'thereafter', 'these', 'under', '’re', 'me', "n't", 'however', 'such', 'being', 'full', 'him', 'two', 'would', 'a', 'their', 'which', 'but', 'sometimes', 'now', 'the', 'above', 'mine', "'s", 'perhaps', 'go', 'some', 'hence', 'this', 'those', 'at', 'beforehand', 'his', 'whatever', 'whether', 'please', 'whole', 'also', 'most', 'our', "'m", 'therefore', 'towards', 'made', 'whither', 'six', 'behind', 'when', 'herself', 'serious'

To add a stop word

There may be times when you wish to add a stop word to the default set. Perhaps you decide that 'btw' (common shorthand for "by the way") should be considered a stop word.

In [49]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

In [50]:
len(nlp.Defaults.stop_words)

327

When adding stop words, always use lowercase. Lexemes are converted to lowercase before being added to vocab.

To remove a stop word

Alternatively, you may decide that 'beyond' should not be considered a stop word

In [51]:


# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

In [52]:
nlp.vocab['beyond'].is_stop

False

# Vocabulary and Matching

Rule-based Matching https://spacy.io/usage/linguistic-features#section-rule-based-matching

spaCy offers a rule-matching tool called Matcher that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [53]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [54]:
pattern1 = [{'LOWER': 'solarpower'}] #SolarPower
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}] #Solar power
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}] #Solar-power

matcher.add('SolarPower', [pattern1, pattern2, pattern3],on_match=None)

In [55]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [56]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


matcher returns a list of tuples. Each tuple contains an ID for the match, with start & end tokens that map to the span doc[start:end]

In [57]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


PhraseMatcher

In the above section we used token patterns to perform rule-based matching. An alternative - and often more efficient - method is to match on terminology lists. In this case we use PhraseMatcher to create a Doc object from a list of phrases, and pass that into matcher instead.

In [58]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [59]:
with open('C:\\Users\\hp\\Desktop\\UPDATED_NLP_COURSE\\UPDATED_NLP_COURSE\\TextFiles\\reaganomics.txt') as f:
    doc3 = nlp(f.read())

In [60]:
# create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [61]:
phrase_patterns=[]
for text in phrase_list:
    phrase_patterns.append((nlp(text)))

In [62]:
# Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)

In [63]:
# Build a list of matches:
matches = matcher(doc3)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2987, 2991)]

Viewing Matches

There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:

In [64]:
doc3[665:685]  # Note that the fifth match starts at doc3[673]

same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian

Another way is to first apply the sentencizer to the Doc, then iterate through the sentences to the match point:

In [65]:
sents=[]
for sent in doc3.sents:
    sents.append(sent)

In [66]:
print(sents[0].start, sents[0].end)

0 35


# Part of Speech Basics

Part-of-speech (POS) tagging is a popular Natural Language Processing process which refers to categorizing words in a text (corpus) in correspondence with a particular part of speech, depending on the definition of the word and its context.

View token tags

To view the coarse POS tag use token.pos_

To view the fine-grained tag use token.tag_

To view the description of either type of tag use spacy.explain(tag)

Note that `token.pos` and `token.tag` return integer hash values; by adding the underscores we get the text equivalent that lives in **doc.vocab**.

In [67]:
# Create a simple Doc object
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

In [68]:
print(doc[4])
print(doc[4].pos_)

jumped
VERB


In [69]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

The        DET      DT     determiner
quick      ADJ      JJ     adjective (English), other noun-modifier (Chinese)
brown      ADJ      JJ     adjective (English), other noun-modifier (Chinese)
fox        NOUN     NN     noun, singular or mass
jumped     VERB     VBD    verb, past tense
over       ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
lazy       ADJ      JJ     adjective (English), other noun-modifier (Chinese)
dog        NOUN     NN     noun, singular or mass
's         PART     POS    possessive ending
back       NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer


Counting POS Tags

The Doc.count_by() method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

In [70]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts

{90: 2, 84: 3, 92: 3, 100: 1, 85: 1, 94: 1, 97: 1}

In [71]:
doc.vocab[90].text

'DET'

In [72]:
doc.vocab[84].text #means we have 3 adjectives are in the doc file

'ADJ'

In [73]:
for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 3
85. ADP  : 1
90. DET  : 2
92. NOUN : 3
94. PART : 1
97. PUNCT: 1
100. VERB : 1


In [74]:
# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k,v in sorted(TAG_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')

74. POS : 1
1292078113972184607. IN  : 1
10554686591937588953. JJ  : 3
12646065887601541794. .   : 1
15267657372422890137. DT  : 2
15308085513773655218. NN  : 3
17109001835818727656. VBD : 1


In [75]:
len(doc.vocab)

2308

In [76]:
# Count the different dependencies:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for k,v in sorted(DEP_counts.items()):
    print(f'{k}. {doc.vocab[k].text:>{5}}: {v}')

402.  amod: 3
415.   det: 2
429. nsubj: 1
439.  pobj: 1
440.  poss: 1
443.  prep: 1
445. punct: 1
8110129090154140942.  case: 1
8206900633647566924.  ROOT: 1


# Named Entity Recognition (NER)
spaCy has an 'ner' pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the ents property of a Doc object.

In [77]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+" - "+ent.label_ + " - "+str(spacy.explain(ent.label_)))
    else:
        print("No entities found")

In [78]:
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

show_ents(doc)

Washington - GPE - Countries, cities, states
DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [79]:
doc=nlp(u'Hi, how are you?')
show_ents(doc)

No entities found


In [80]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')

show_ents(doc)

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [81]:
from spacy.tokens import Span

# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG'] 
ORG

383

In [82]:
# Create a Span for the new entity
new_ent = Span(doc, 0,1, label=ORG)
new_ent

Tesla

In [83]:
# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]

In [84]:
show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


Adding Named Entities to All Matching Spans

In [85]:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
          u'If successful, the vacuum-cleaner will be our best in show.')

show_ents(doc)

No entities found


In [86]:
from spacy.matcher import PhraseMatcher

In [87]:
matcher = PhraseMatcher(nlp.vocab)

In [88]:
phrase_list=['vacuum cleaner','vacuum-cleaner']

In [89]:
phrase_patterns=[]
for text in phrase_list:
    phrase_patterns.append(nlp(text))

In [90]:
matcher.add('newproduct',None,*phrase_patterns) #adding multiple terms

In [91]:
found_matches=matcher(doc)

In [92]:
found_matches

[(2689272359382549672, 7, 9), (2689272359382549672, 14, 17)]

In [93]:
from spacy.tokens import Span
# Here we create Spans from each match, and create named entities from them:

In [94]:
PROD = doc.vocab.strings[u'PRODUCT']

In [95]:
new_ents = [Span(doc, match[1],match[2],label=PROD) for match in found_matches]

doc.ents = list(doc.ents) + new_ents

In [96]:
show_ents(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum-cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)


Counting Entities

While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:

In [97]:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')

show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit


In [98]:
count=0
for ent in doc.ents:
    if ent.label_=='MONEY':
        count+=1
print(count)

2


# Visualizing Named Entities

Besides viewing Part of Speech dependencies with style='dep', displaCy offers a style='ent' visualizer:

In [99]:
from spacy import displacy

In [100]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc, style='ent', jupyter=True)

In [101]:
doc2 = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, my kids sold a lot of lemonade.')

In [102]:
for sent in doc.sents:
    displacy.render(nlp(sent.text),style='ent',jupyter=True)

Viewing Specific Entities

You can pass a list of entity types to restrict the visualization:

In [103]:
options = {'ents': ['ORG', 'PRODUCT']}

displacy.render(doc, style='ent', jupyter=True, options=options)

Customizing Colors and Effects

You can also pass background color and gradient options:

In [104]:
colors = {'ORG': 'red', 'PRODUCT': 'green'}

options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(doc, style='ent', jupyter=True, options=options)

# Sentence Segmentation

Sentence segmentation is the problem of dividing a string of written language into its component sentences

In [105]:
# From Spacy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [106]:
for sent in doc.sents: #doc.sents is a generator
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


Doc.sents is a generator

It is important to note that doc.sents is a generator. That is, a Doc is not segmented until doc.sents is called. This means that, where you could print the second Doc token with print(doc[1]), you can't call the "second Doc sentence" with print(doc.sents[1]):

In [107]:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter Drucker


In [108]:
doc4[0:-1] #its taking only till peter

"Management is doing things right; leadership is doing the right things." -Peter

ADD A NEW RULE TO THE PIPELINE

In [109]:
#Adding new rule based on ";"
from spacy.language import Language #This has to be done in spacy 3.0+
@Language.component("component")
def set_custom(doc):

    for token in doc[:-1]:#we are doing cuz :1 leaves the last word
        if token.text==';':
            doc[token.i+1].is_sent_start=True
    return doc

In [110]:
nlp.add_pipe("component", before='parser')

<function __main__.set_custom(doc)>

In [111]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'component',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [112]:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker


Changing the Rules

In some cases we want to replace spaCy's default sentencizer with our own set of rules. In this section we'll see how the default sentencizer breaks on periods. We'll then replace this behavior with a sentencizer that breaks on linebreaks.

In [113]:
nlp = spacy.load('en_core_web_sm')  # reset to the original

In [114]:
mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."
print(mystring)

This is a sentence. This is another.

This is a 
third sentence.


In [115]:
# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

for sent in doc.sents:
    print(sent) #here we can have issue if we want to have \n to be new seegmenter of line

This is a sentence.
This is another.


This is a 
third sentence.


In [116]:
#need to see on sentence segmentizer

# TEXT CLASSIFICATION

Feature Extraction from Text

In the Scikit-learn Primer lecture we applied a simple SVC classification model to the SMSSpamCollection dataset. We tried to predict the ham/spam label based on message length and punctuation counts. In this section we'll actually look at the text of each message and try to perform a classification based on content. We'll take advantage of some of scikit-learn's https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction tools.

In [117]:
import numpy as np
import pandas as pd

In [118]:
df=pd.read_csv("E:/spam.csv")

In [119]:
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [120]:
df['Length']=df['Message'].apply(len)

In [121]:
df.head()

Unnamed: 0,Category,Message,Length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


In [122]:
df.isna().sum()

Category    0
Message     0
Length      0
dtype: int64

In [123]:
df.Category.value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

In [124]:
from sklearn.model_selection import train_test_split

X = df['Message']  # this time we want to look at the text
y = df['Category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Scikit-learn's CountVectorizer

Text preprocessing, tokenizing and the ability to filter out stopwords are all included in CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors.

In [125]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)

In [126]:
from sklearn.svm import LinearSVC


In [127]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) 
X_train_tfidf.shape

(3733, 7081)

In [128]:
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()

We can covert all above steps into a single pipeline

Build a Pipeline

Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html class that behaves like a compound classifier.

In [129]:
from sklearn.pipeline import Pipeline

In [130]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

In [131]:
text_clf.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

Test the classifier and display results

In [132]:
predictions = text_clf.predict(X_test)

In [133]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1586    7]
 [  12  234]]


In [134]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [135]:
metrics.accuracy_score(y_test,predictions)

0.989668297988037

In [136]:
text_clf.predict(['Hi Atul, how are you?'])

array(['ham'], dtype=object)

In [137]:
text_clf.predict(['Hi Atul, how r u?'])

array(['ham'], dtype=object)

# Semantics and Word Vectors

Word Vectors

Word vectors - also called word embeddings - are mathematical descriptions of individual words such that words that appear frequently together in the language will have similar values. In this way we can mathematically derive context. As mentioned above, the word vector for "lion" will be closer in value to "cat" than to "dandelion".

In [138]:
import spacy
nlp = spacy.load('en_core_web_md')

In [139]:
nlp(u'lion').vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

What's interesting is that Doc and Span objects themselves have vectors, derived from the averages of individual token vectors.
This makes it possible to compare similarities between whole documents.

In [140]:
doc = nlp(u'The quick brown fox jumped over the lazy dogs.')

doc.vector.shape

(300,)

Identifying similar vectors

The best way to expose vector relationships is through the .similarity() method of Doc tokens.

In [141]:
tokens = nlp(u'lion cat pet')

for t1 in tokens:
    for t2 in tokens:
        print(t1.text,t2.text,t1.similarity(t2))

lion lion 1.0
lion cat 0.5265437364578247
lion pet 0.39923766255378723
cat lion 0.5265437364578247
cat cat 1.0
cat pet 0.7505456209182739
pet lion 0.39923766255378723
pet cat 0.7505456209182739
pet pet 1.0


Opposites are not necessarily different

Words that have opposite meaning, but that often appear in the same context may have similar vectors.

In [142]:
tokens = nlp(u'like love hate')


for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

like like 1.0
like love 0.6579039692878723
like hate 0.6574652194976807
love like 0.6579039692878723
love love 1.0
love hate 0.6393098831176758
hate like 0.6574652194976807
hate love 0.6393098831176758
hate hate 1.0


In [143]:
nlp.vocab.vectors.shape

(20000, 300)

Vector norms

It's sometimes helpful to aggregate 300 dimensions into a Euclidian (L2) norm, computed as the square root of the sum-of-squared-vectors. This is accessible as the .vector_norm token attribute. Other helpful attributes include .has_vector and .is_oov or out of vocabulary.

In [144]:
tokens = nlp(u'dog cat hskjdgbfkszdhb')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
hskjdgbfkszdhb False 0.0 True


Vector arithmetic

Believe it or not, we can actually calculate new vectors by adding & subtracting related vectors. A famous example suggests

"king" - "man" + "woman" = "queen"

In [145]:
#writing cosine similarity

from scipy import spatial
cosine_sim=lambda vec1,vec2: 1- spatial.distance.cosine(vec1,vec2)

In [146]:
king=nlp.vocab['king'].vector
man=nlp.vocab['man'].vector
woman=nlp.vocab['woman'].vector

In [147]:
new_vector=king-man+woman

In [148]:
computed_sim=[]

for word in nlp.vocab:
    if word.has_vector:
        if word.is_lower:
            if word.is_alpha:
                similarity=cosine_sim(new_vector,word.vector)
                computed_sim.append((word,similarity))

In [149]:
computed_sim=sorted(computed_sim,key=lambda item: -item[1])#for descending order

In [150]:
print([w[0].text for w in computed_sim[:10]])

['king', 'woman', 'she', 'lion', 'who', 'fox', 'brown', 'when', 'dare', 'cat']


# Sentiment Analysis
Now that we've seen word vectors we can start to investigate sentiment analysis. The goal is to find commonalities between documents, with the understanding that similarly combined vectors should correspond to similar sentiments.

While the scope of sentiment analysis is very broad, we will focus our work in two ways.

# Polarity classification

We won't try to determine if a sentence is objective or subjective, fact or opinion. Rather, we care only if the text expresses a positive, negative or neutral opinion.

# Document level scope

We'll also try to aggregate all of the sentences in a document or paragraph, to arrive at an overall opinion.

# Coarse analysis

We won't try to perform a fine-grained analysis that would determine the degree of positivity/negativity. That is, we're not trying to guess how many stars a reviewer awarded, just whether the review was positive or negative.

# Broad Steps:

First, consider the text being analyzed. A model trained on paragraph-long movie reviews might not be effective on tweets. Make sure to use an appropriate model for the task at hand.
Next, decide the type of analysis to perform. In the previous section on text classification we used a bag-of-words technique that considered only single tokens, or unigrams. Some rudimentary sentiment analysis models go one step further, and consider two-word combinations, or bigrams. In this section, we'd like to work with complete sentences, and for this we're going to import a trained NLTK lexicon called VADER.

# NLTK's VADER module
VADER is an NLTK module that provides sentiment scores based on words used ("completely" boosts a score, while "slightly" reduces it), on capitalization & punctuation ("GREAT!!!" is stronger than "great."), and negations (words like "isn't" and "doesn't" affect the outcome).
To view the source code visit https://www.nltk.org/_modules/nltk/sentiment/vader.html

In [151]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [152]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

VADER's SentimentIntensityAnalyzer() takes in a string and returns a dictionary of scores in each of four categories:

negative, 
neutral, 
positive, 
compound (computed by normalizing the scores above)

In [153]:
a='This is a good movie'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

In [154]:
a = 'This was the best, most awesome movie EVER MADE!!!'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

In [155]:
a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)

{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

# Topic Modelling

# Latent Dirichlet Allocation

In [156]:
import pandas as pd

In [157]:
npr=pd.read_csv("C:/Users/hp/Desktop/UPDATED_NLP_COURSE/UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv")

In [158]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


Preprocessing

In [159]:
from sklearn.feature_extraction.text import CountVectorizer

max_df: float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df: float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

In [160]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [161]:
dtm = cv.fit_transform(npr['Article'])

In [162]:
from sklearn.decomposition import LatentDirichletAllocation

In [163]:
LDA = LatentDirichletAllocation(n_components=10,random_state=42)

In [164]:
LDA.fit(dtm)

LatentDirichletAllocation(random_state=42)

Showing Stored Words

In [165]:
len(cv.get_feature_names())

54777

In [166]:
len(LDA.components_)

10

In [167]:
import random

In [168]:
for i in range(10):
    random_word_id = random.randint(0,54776)
    print(cv.get_feature_names()[random_word_id])

jain
shanties
fringes
themes
chieftains
subservience
cig
spinraza
webinar
interplay


In [169]:
for i in range(10):
    random_word_id = random.randint(0,54776)
    print(cv.get_feature_names()[random_word_id])

vietnam
oxidized
fought
metformin
literature
gorbachev
glasse
timeline
sleeps
wallets


Showing Top Words Per Topic

In [170]:
len(LDA.components_[0])

54777

In [171]:
single_topic = LDA.components_[0]

In [172]:
# Returns the indices that would sort this array.
single_topic.argsort() #from least to greatest index position

array([18302,  2475, 44967, ..., 10425, 42561, 42993], dtype=int64)

In [173]:
# Word least representative of this topic
single_topic[18302]

0.10000000000053799

In [174]:
# Top 10 greatest indexes of words for this topic:
single_topic.argsort()[-10:]

array([    1, 18349, 33390, 32089, 10421, 31464, 22673, 10425, 42561,
       42993], dtype=int64)

In [175]:
top_word_indices = single_topic.argsort()[-10:]

In [176]:
for index in top_word_indices:
    print(cv.get_feature_names()[index])
    

000
federal
new
money
companies
million
health
company
said
says


In [177]:
top_word_indices = single_topic.argsort()[-20:]
for index in top_word_indices:
    print(cv.get_feature_names()[index])
    

industry
tax
business
percent
pay
people
care
government
year
insurance
000
federal
new
money
companies
million
health
company
said
says


In [178]:
type(LDA.components_)

numpy.ndarray

In [179]:
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')

    
#Probability of words belonging to a particular topic

THE TOP 15 WORDS FOR TOPIC #0
['people', 'care', 'government', 'year', 'insurance', '000', 'federal', 'new', 'money', 'companies', 'million', 'health', 'company', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['npr', 'intelligence', 'security', 'new', 'told', 'russian', 'campaign', 'obama', 'news', 'white', 'russia', 'house', 'president', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['know', 'little', 'home', 'make', 'way', 'day', 'water', 'time', 'years', 'people', 'food', 'new', 'just', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['don', 'food', 'work', 'day', 'life', 'time', 'family', 'children', 'years', 'just', 'women', 'world', 'like', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['supreme', 'order', 'city', 'states', 'federal', 'country', 'president', 'rights', 'government', 'people', 'law', 'state', 'said', 'court', 'says']


THE TOP 15 WORDS FOR TOPIC #5
['going', 've', 'story', 'life', 'don', 'new', 'way', 'time', 'really', 'know', 'think', 'music', 'people', '

In [180]:
topic_results = LDA.transform(dtm)

In [181]:
#Probability of a docment belonging to a particular topic
topic_results[0]

array([8.78101114e-03, 9.11263140e-01, 1.57269537e-04, 1.57265808e-04,
       1.57268730e-04, 1.57266519e-04, 1.57271636e-04, 1.57262374e-04,
       7.88549762e-02, 1.57267682e-04])

In [182]:
topic_results[0].round(2)

array([0.01, 0.91, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.08, 0.  ])

In [183]:
topic_results[0].argmax() #shows highest probaility index position

1

In [184]:
npr['Topic']=topic_results.argmax(axis=1)

In [185]:
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",6


# Non-Negative Matric Factorization

In [186]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [187]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [188]:
dtm = tfidf.fit_transform(npr['Article'])

In [189]:
from sklearn.decomposition import NMF

In [190]:
nmf_model = NMF(n_components=10,random_state=42)

In [191]:
nmf_model.fit(dtm)



NMF(n_components=10, random_state=42)

In [192]:
len(tfidf.get_feature_names())

54777

In [193]:
for i in range(10):
    random_word_id = random.randint(0,54776)
    print(tfidf.get_feature_names()[random_word_id])

gourmets
scheming
parochialism
unkind
silk
refinery29
infringed
prototypes
sexless
strident


In [194]:
single_topic = nmf_model.components_[0]

In [195]:
single_topic.argsort()

array([27388, 27031, 27030, ..., 19307, 36283, 42993], dtype=int64)

In [196]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]

array([26752, 10425, 47218, 33390, 36310, 28659, 53152, 19307, 36283,
       42993], dtype=int64)

In [197]:
top_word_indices = single_topic.argsort()[-10:]

In [198]:
for index in top_word_indices:
    print(tfidf.get_feature_names()[index])

just
company
study
new
percent
like
water
food
people
says


In [199]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['year', 'university', 'workers', '000', 'years', 'just', 'company', 'study', 'new', 'percent', 'like', 'water', 'food', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['administration', 'cruz', 'election', 'pence', 'gop', 'presidential', 'obama', 'house', 'white', 'republican', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['patients', 'repeal', 'law', 'act', 'republicans', 'tax', 'people', 'plan', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['assad', 'iran', 'iraq', 'north', 'china', 'aleppo', 'war', 'korea', 'said', 'forces', 'russia', 'military', 'syrian', 'syria', 'isis']


THE TOP 15 WORDS FOR TOPIC #4
['cruz', 'election', 'primary', 'democrats', 'percent', 'party', 'vote', 'state', 'delegates', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['book', 'love', 'women', 'way', 'time', 'life'

In [200]:
topic_results = LDA.transform(dtm)

In [201]:
topic_results[0]

array([0.00591734, 0.94690124, 0.00589809, 0.00589737, 0.00589752,
       0.0058973 , 0.0058976 , 0.00589714, 0.005899  , 0.00589739])

In [202]:
topic_results[0].round(2)

array([0.01, 0.95, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])

In [203]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 9, 8, 4], dtype=int64)

In [204]:
npr['Topic'] = topic_results.argmax(axis=1)

In [205]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,9
6,With a who has publicly supported the debunk...,9
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",2
9,"Eighteen years ago, on New Year’s Eve, David F...",2


In [206]:
my_topic={0:'Infrastructure',1:'Politics',2:'Health Care',3:'Military',4:'Election',5:'Music',6:'Education',7:'Infection',8:'Police-Enforcement',9:'Court'}
npr['Topic_Name']=npr['Topic'].map(my_topic)

In [207]:
npr.head(20)

Unnamed: 0,Article,Topic,Topic_Name
0,"In the Washington of 2016, even when the polic...",1,Politics
1,Donald Trump has used Twitter — his prefe...,1,Politics
2,Donald Trump is unabashedly praising Russian...,1,Politics
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,Politics
4,"From photography, illustration and video, to d...",2,Health Care
5,I did not want to join yoga class. I hated tho...,9,Court
6,With a who has publicly supported the debunk...,9,Court
7,"I was standing by the airport exit, debating w...",2,Health Care
8,"If movies were trying to be more realistic, pe...",2,Health Care
9,"Eighteen years ago, on New Year’s Eve, David F...",2,Health Care


# Keras Basics

In [208]:
import numpy as np

In [209]:
from sklearn.datasets import load_iris

In [210]:
iris = load_iris()

In [211]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [212]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [213]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [214]:
X = iris.data

In [215]:
y = iris.target

In [216]:
from keras.utils import to_categorical

In [217]:
y = to_categorical(y)

In [218]:
y.shape

(150, 3)

In [219]:
y

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0

In [220]:
from sklearn.model_selection import train_test_split

In [221]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Standardizing the Data

Usually when using Neural Networks, you will get better performance when you standardize the data. Standardization just means normalizing the values to all fit between a certain range, like 0-1, or -1 to 1.

In [222]:
from sklearn.preprocessing import MinMaxScaler

In [223]:
scaler_object = MinMaxScaler()

In [224]:
scaler_object.fit(X_train)

MinMaxScaler()

In [225]:
scaled_X_train = scaler_object.transform(X_train)

In [226]:
scaled_X_test = scaler_object.transform(X_test)

# Building the Network with Keras

In [227]:
from keras.models import Sequential
from keras.layers import Dense

In [228]:
model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [229]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 8)                 40        
                                                                 
 dense_1 (Dense)             (None, 8)                 72        
                                                                 
 dense_2 (Dense)             (None, 3)                 27        
                                                                 
Total params: 139
Trainable params: 139
Non-trainable params: 0
_________________________________________________________________


In [230]:
model.fit(scaled_X_train,y_train,epochs=150, verbose=2)

Epoch 1/150
4/4 - 2s - loss: 1.1175 - accuracy: 0.3400 - 2s/epoch - 460ms/step
Epoch 2/150
4/4 - 0s - loss: 1.1136 - accuracy: 0.3400 - 7ms/epoch - 2ms/step
Epoch 3/150
4/4 - 0s - loss: 1.1101 - accuracy: 0.3500 - 6ms/epoch - 2ms/step
Epoch 4/150
4/4 - 0s - loss: 1.1071 - accuracy: 0.3500 - 6ms/epoch - 1ms/step
Epoch 5/150
4/4 - 0s - loss: 1.1036 - accuracy: 0.3700 - 5ms/epoch - 1ms/step
Epoch 6/150
4/4 - 0s - loss: 1.1004 - accuracy: 0.4100 - 5ms/epoch - 1ms/step
Epoch 7/150
4/4 - 0s - loss: 1.0968 - accuracy: 0.4300 - 5ms/epoch - 1ms/step
Epoch 8/150
4/4 - 0s - loss: 1.0932 - accuracy: 0.4700 - 5ms/epoch - 1ms/step
Epoch 9/150
4/4 - 0s - loss: 1.0893 - accuracy: 0.5100 - 6ms/epoch - 1ms/step
Epoch 10/150
4/4 - 0s - loss: 1.0851 - accuracy: 0.5700 - 5ms/epoch - 1ms/step
Epoch 11/150
4/4 - 0s - loss: 1.0809 - accuracy: 0.6300 - 4ms/epoch - 1ms/step
Epoch 12/150
4/4 - 0s - loss: 1.0766 - accuracy: 0.6500 - 5ms/epoch - 1ms/step
Epoch 13/150
4/4 - 0s - loss: 1.0724 - accuracy: 0.6500 - 6m

<keras.callbacks.History at 0x23503ec5b50>

In [231]:
y_pred=model.predict(scaled_X_test)



In [232]:
predict_x=model.predict(scaled_X_test) 
classes_x=np.argmax(predict_x,axis=1)
classes_x



array([2, 0, 2, 2, 2, 0, 1, 2, 2, 1, 2, 0, 0, 0, 0, 2, 2, 2, 2, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 2, 1, 2, 0,
       0, 1, 2, 2, 2, 2], dtype=int64)

# Evaluating Model Performance

In [233]:
model.metrics_names

['loss', 'accuracy']

In [234]:
model.evaluate(x=scaled_X_test,y=y_test)



[0.3938922584056854, 0.7799999713897705]

In [235]:
from sklearn.metrics import confusion_matrix,classification_report

In [236]:
y_test.argmax(axis=1)

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0,
       0, 1, 2, 2, 1, 2], dtype=int64)

In [237]:
confusion_matrix(y_test.argmax(axis=1),classes_x)

array([[19,  0,  0],
       [ 0,  4, 11],
       [ 0,  0, 16]], dtype=int64)

In [238]:
print(classification_report(y_test.argmax(axis=1),classes_x))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.27      0.42        15
           2       0.59      1.00      0.74        16

    accuracy                           0.78        50
   macro avg       0.86      0.76      0.72        50
weighted avg       0.87      0.78      0.74        50

