# Fundamental Methods of NLP for Building Chatbots

In [7]:
import spacy

In [8]:
from prettytable import PrettyTable

## POS Tagging

Example 1:

In [9]:
nlp = spacy.load('en')  # Load the English model ('en' is a spaCy 2.x shortcut link)
doc = nlp(u'I am learning how to build chatbots')  # Create a Doc object
for token in doc:
    print(token.text, token.pos_)  # Print each token's text and POS tag

I PRON
am VERB
learning VERB
how ADV
to PART
build VERB
chatbots NOUN


* When we iterate over the `Doc` object returned by `nlp` (a container for accessing the annotations), each token carries the POS tag assigned to the corresponding word in the sentence.

* These tags are properties of a word that describe how it is used in a grammatically correct sentence. We can use them as word features in information filtering and similar tasks.
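
As a rough illustration of using POS tags as features for filtering, the sketch below keeps only the nouns and verbs from Example 1. The `tagged` list is hard-coded from the output above rather than produced by spaCy, and `keep_pos` and `content_words` are hypothetical names introduced here; in a real pipeline the pairs would come from `(token.text, token.pos_)`.

```python
# (text, POS) pairs copied from the output of Example 1 above.
tagged = [
    ("I", "PRON"), ("am", "VERB"), ("learning", "VERB"), ("how", "ADV"),
    ("to", "PART"), ("build", "VERB"), ("chatbots", "NOUN"),
]

# Hypothetical filter: the tags we choose to treat as content-bearing.
keep_pos = {"NOUN", "VERB"}

# Keep only the words whose POS tag is in the chosen set.
content_words = [text for text, pos in tagged if pos in keep_pos]
print(content_words)
```

Which tags to keep is a design choice that depends on the task; nouns and verbs are only one plausible starting point.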

Example 2:

In [10]:
doc = nlp(u'I am learning how to build chatbots')
t = PrettyTable(['Text', 'Lemma','POS','TAG','DEP','SHAPE','ALPHA','Stop'])
for token in doc:
    t.add_row([token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop])
print(t)

+----------+---------+------+-----+--------+-------+-------+-------+
|   Text   |  Lemma  | POS  | TAG |  DEP   | SHAPE | ALPHA |  Stop |
+----------+---------+------+-----+--------+-------+-------+-------+
|    I     |  -PRON- | PRON | PRP | nsubj  |   X   |  True | False |
|    am    |    be   | VERB | VBP |  aux   |   xx  |  True |  True |
| learning |  learn  | VERB | VBG |  ROOT  |  xxxx |  True | False |
|   how    |   how   | ADV  | WRB | advmod |  xxx  |  True |  True |
|    to    |    to   | PART |  TO |  aux   |   xx  |  True |  True |
|  build   |  build  | VERB |  VB | xcomp  |  xxxx |  True | False |
| chatbots | chatbot | NOUN | NNS |  dobj  |  xxxx |  True | False |
+----------+---------+------+-----+--------+-------+-------+-------+


* The table below explains each attribute printed by the code.

|Attribute|Meaning|
|---------|-------|
|TEXT|Actual text or word being processed|
|LEMMA|Root form of the word being processed|
|POS|Part of speech of the word|
|TAG|Part of speech (e.g., VERB) plus some morphological information (e.g., that the verb is past tense)|
|DEP|Syntactic dependency (i.e., the relation between tokens)|
|SHAPE|Shape of the word (e.g., capitalization, punctuation, digit format)|
|ALPHA|Does the token consist of alphabetic characters?|
|Stop|Is the word a stop word, i.e., part of a stop list?|
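
To make the SHAPE column more concrete, here is a minimal sketch of how a word-shape feature can be computed. It is an approximation written for illustration, not spaCy's own implementation: `word_shape` and `max_run` are hypothetical names, and the rule (uppercase becomes `X`, lowercase becomes `x`, digits become `d`, runs of the same symbol are cut off after four) is assumed from the shapes shown in the table above.

```python
def word_shape(text, max_run=4):
    """Approximate a word-shape feature for a single token string."""
    shape = []
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation and other characters are kept as-is
        # Drop the symbol if the last `max_run` symbols are already identical to it.
        if len(shape) >= max_run and all(s == symbol for s in shape[-max_run:]):
            continue
        shape.append(symbol)
    return "".join(shape)


for word in ["I", "learning", "how", "U.K."]:
    print(word, word_shape(word))
```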

## Stemming and Lemmatization

In [11]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmatizer('chuckles', 'NOUN')

['chuckle']

In [12]:
lemmatizer('blazing', 'VERB')

['blaze']

In [13]:
lemmatizer('ran', 'VERB')

['run']

## Named-Entity Recognition

In [14]:
my_string = "Google has its headquarters in Mountain View, California having revenue amounted to 109.65 billion US dollars"
doc = nlp(my_string)
for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
Mountain View GPE
California GPE
109.65 billion US dollars MONEY


In [15]:
my_string = u"Mark Zuckerberg born May 14, 1984 in New York is an American technology entrepreneur and philanthropist best known for co-founding and leading Facebook as its chairman and CEO."
doc = nlp(my_string)
for ent in doc.ents:
    print(ent.text, ent.label_)

Mark Zuckerberg PERSON
May 14, 1984 DATE
New York GPE
American NORP
Facebook ORG


## Stop Words

In [16]:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)

{'your', 'first', 'often', 'those', 'around', 'already', 'it', 'at', 'doing', 'more', 'over', 'full', 'whereupon', 'mine', 'latter', 'never', 'in', 'now', 'whom', 'between', 'can', 'but', 'well', 'quite', 'about', 'always', 'several', 'every', 'ten', 'along', 'only', 'somehow', 'whereafter', 'wherein', 'from', 'ourselves', 'hereupon', 'get', 'above', 'five', 'whither', 'whence', 'together', 'show', 'whoever', 'beside', 'thereafter', 'before', 'one', 'because', 'seeming', 'give', 'side', 'done', 'same', 'not', 'move', 'should', 'whereas', 'namely', 'nowhere', 'nothing', 'hers', 'everywhere', 'who', 'how', 'become', 'elsewhere', 'within', 'you', 'and', 'nobody', 'yourselves', 'afterwards', 'that', 'out', 'cannot', 'ca', 'perhaps', 'thus', 'anyone', 'behind', 'call', 'even', 'under', 'yet', 'someone', 'below', 'any', 'there', 'anyhow', 'keep', 'made', 'bottom', 'towards', 'once', 'much', 'unless', 'on', 'again', 'what', 'without', 'hundred', 'many', 'through', 'such', 'whatever', 'please'

In [17]:
nlp.vocab[u'is'].is_stop

True

In [18]:
nlp.vocab[u'with'].is_stop

True

In [19]:
nlp.vocab[u'alphaalpha'].is_stop

False
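
A common use of a stop list is to drop low-information words before further processing. The sketch below shows the idea with a small hand-picked set; `SMALL_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the set is only an assumption for illustration, not spaCy's `STOP_WORDS`.

```python
# Hand-picked subset for illustration only; a real list would be much larger.
SMALL_STOP_WORDS = {"i", "am", "how", "to", "is", "with", "the", "a"}


def remove_stop_words(sentence):
    """Split on whitespace and drop words found in the stop list (case-insensitive)."""
    return [word for word in sentence.split() if word.lower() not in SMALL_STOP_WORDS]


print(remove_stop_words("I am learning how to build chatbots"))
```

With spaCy available, the same filter could be expressed over tokens with `token.is_stop` instead of a manual set lookup.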

## Dependency Parsing

In [20]:
doc = nlp(u'Book me a flight from Bangalore to Goa')
blr, goa = doc[5], doc[7]
list(blr.ancestors)

[from, flight, Book]

The output above tells us that the user wants to book a flight from Bangalore.

In [21]:
list(goa.ancestors)

[to, flight, Book]

This output tells us that the user wants to book a flight to Goa.
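
The idea behind `ancestors` can be sketched without spaCy by following head pointers up the tree. The `heads` mapping below is hard-coded from the two ancestor lists shown above; `ancestors` and `route_role` are hypothetical helpers, and treating the nearest `from`/`to` ancestor as the marker of source or destination is an assumption made for this example.

```python
# Child -> head, reconstructed from the ancestor lists printed above.
heads = {
    "Bangalore": "from",
    "from": "flight",
    "Goa": "to",
    "to": "flight",
    "flight": "Book",
    "Book": None,  # root of the sentence
}


def ancestors(word):
    """Walk the head pointers from a word up to the root."""
    chain = []
    parent = heads[word]
    while parent is not None:
        chain.append(parent)
        parent = heads[parent]
    return chain


def route_role(word):
    """Classify a place name by the first 'from' or 'to' among its ancestors."""
    for ancestor in ancestors(word):
        if ancestor == "from":
            return "source"
        if ancestor == "to":
            return "destination"
    return None


print(ancestors("Bangalore"), route_role("Bangalore"))
print(ancestors("Goa"), route_role("Goa"))
```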

## Noun Chunks

In [22]:
doc = nlp(u"Boston Dynamics is gearing up to produce thousands of robot dogs")
list(doc.noun_chunks)

[Boston Dynamics, thousands, robot dogs]

In [23]:
doc = nlp(u"Deep learning cracks the code of messenger RNAs and protein-­ coding potential")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

Deep learning learning nsubj cracks
the code code dobj cracks
messenger RNAs RNAs pobj of
potential potential dobj coding


## Finding Similarity

In [24]:
doc = nlp(u'How are you doing today?')
for token in doc:
    print(token.text, token.vector[:5])

How [-0.2974264   0.7393958  -0.04001491  0.44034028  2.8967488 ]
are [-0.23435098 -1.6145049   1.0197449   0.9928163   0.2822708 ]
you [ 0.1025213 -3.5647118  2.4822786  4.2825     3.590246 ]
doing [-0.6240919 -2.0210211 -0.9101493  2.7051926  4.189254 ]
today [ 3.5409102 -0.6218593  2.6274266  2.050488   0.2019196]
? [ 2.8914995  -0.25079104  3.3764172   1.6942688   1.9849054 ]


The raw vector values above are not meaningful on their own, but they form the basis for measuring the similarity between two words.

In [25]:
hello_doc = nlp(u"hello")
hi_doc = nlp(u"hi")
hella_doc = nlp(u"hella")
print(hello_doc.similarity(hi_doc))
print(hello_doc.similarity(hella_doc))

0.7879069561264538
0.41934263641148983


Notice that hello is scored as more similar to hi than to hella, even though
hello and hella differ by only one character.
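
By default, spaCy's `similarity` is a cosine similarity over vectors. The sketch below shows that computation on its own. The three vectors are made-up toy values chosen for illustration, not real embeddings, and `cosine_similarity` is a hypothetical helper written here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values): "hi" points roughly the same way as "hello",
# while "hella" points elsewhere despite the similar spelling.
hello_vec = [1.0, 2.0, 0.5]
hi_vec = [0.9, 2.1, 0.4]
hella_vec = [-1.0, 0.5, 2.0]

print(cosine_similarity(hello_vec, hi_vec))
print(cosine_similarity(hello_vec, hella_vec))
```

Because the score depends only on the direction of the vectors, spelling plays no part in it, which is why a one-character difference does not imply a high similarity.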

In [31]:
example_doc = nlp(u"car truck google")
for t1 in example_doc:
    for t2 in example_doc:
        similarity_perc = int(t1.similarity(t2) * 100)
        print("Word {} is {}% similar to word {}".format(t1.text, similarity_perc, t2.text))

Word car is 100% similar to word car
Word car is 71% similar to word truck
Word car is 24% similar to word google
Word truck is 71% similar to word car
Word truck is 100% similar to word truck
Word truck is 36% similar to word google
Word google is 24% similar to word car
Word google is 36% similar to word truck
Word google is 100% similar to word google


## Tokenization

In [34]:
doc = nlp(u'Brexit is the impending withdrawal of the U.K. from the European Union.')
for token in doc:
    print(token.text)

Brexit
is
the
impending
withdrawal
of
the
U.K.
from
the
European
Union
.


## Regular Expressions

In [32]:
import re

sentence1 = "Book me a metro from Airport Station to Hong Kong Station."
sentence2 = "Book me a cab to Hong Kong Airport from AsiaWorld-Expo."

# One pattern per word order: "... from X to Y" and "... to Y from X"
from_to = re.compile(r'.* from (.*) to (.*)')
to_from = re.compile(r'.* to (.*) from (.*)')

# Try both patterns against the second sentence
from_to_match = from_to.match(sentence2)
to_from_match = to_from.match(sentence2)

if from_to_match:
    _from, _to = from_to_match.groups()
    print("from_to pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))
elif to_from_match:
    _to, _from = to_from_match.groups()
    print("to_from pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))

to_from pattern matched correctly. Printing values

From: AsiaWorld-Expo., To: Hong Kong Airport
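
Note that the captured source above still carries the sentence's trailing period. One possible refinement, sketched below, strips trailing punctuation before matching and uses named groups so that the word order of the sentence no longer matters to the caller. `FROM_TO`, `TO_FROM`, and `extract_route` are hypothetical names introduced for this example.

```python
import re

# Named groups make the result independent of which pattern matched.
FROM_TO = re.compile(r".*? from (?P<source>.+) to (?P<dest>.+)")
TO_FROM = re.compile(r".*? to (?P<dest>.+) from (?P<source>.+)")


def extract_route(sentence):
    """Return {'source': ..., 'dest': ...} or None if neither pattern matches."""
    cleaned = sentence.rstrip(" .!?")  # drop trailing punctuation first
    for pattern in (FROM_TO, TO_FROM):
        match = pattern.match(cleaned)
        if match:
            return match.groupdict()
    return None


print(extract_route("Book me a metro from Airport Station to Hong Kong Station."))
print(extract_route("Book me a cab to Hong Kong Airport from AsiaWorld-Expo."))
```

This remains a brittle approach: place names that themselves contain " to " or " from " would confuse the patterns, which is one reason to prefer the dependency-based approach where a parser is available.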
