# Natural Language Processing (NLP)


### Introduction

NLP applies computational algorithms to analyze natural human languages mathematically. To understand NLP, we first need to consider how humans comprehend language. NLP is useful for information retrieval, sentiment analysis, information extraction, machine translation and predicting likely text in search boxes. It is also used to analyze text produced by speech recognition.


<img src="../images/NLP.png" style="width: 350px;">

Here are the NLP operations that we are going to learn:

- Tokenization
- Stopword removal
- Stemming and lemmatization by recognition of root words.
- Part-of-speech tagging
- Named entity recognition

<img src="../images/nlp-flow.png" style="width: 350px;">


### Tokenization

Tokenization is the process of splitting a sentence or a phrase into smaller units (tokens) such as substrings, words and punctuation marks.

```python
>>> from nltk.tokenize import word_tokenize
>>> text = '''All extraterrestrial activity today is governed by a 50-year-old treaty drafted at the height of the Cold War. Will governments around the world agree on an update before the final frontier becomes the Wild West?'''
>>> word_tokenize(text)
['All', 'extraterrestrial', 'activity', 'today', 'is', 'governed', 'by', 'a', '50-year-old', 'treaty', 'drafted', 'at', 'the', 'height', 'of', 'the', 'Cold', 'War', '.', 'Will', 'governments', 'around', 'the', 'world', 'agree', 'on', 'an', 'update', 'before', 'the', 'final', 'frontier', 'becomes', 'the', 'Wild', 'West', '?']
```

Just as we tokenize words, we can tokenize sentences to split the text at the sentence level.

```python
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
['All extraterrestrial activity today is governed by a 50-year-old treaty drafted at the height of the Cold War.', 'Will governments around the world agree on an update before the final frontier becomes the Wild West?']
```

We can see that the text has been split into its two sentences. Tokenizing is typically the first step in the analysis.
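
To make the idea of splitting text into tokens concrete, here is a minimal sketch that uses only regular expressions from the standard library. It is a simplified stand-in, not how word_tokenize or sent_tokenize work internally, and the function names simple_word_tokenize and simple_sent_tokenize are made up for this illustration.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters (optionally joined by
    # hyphens or apostrophes) or a single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "The treaty is 50-year-old. Will it change?"
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

A rule this simple breaks on abbreviations such as "Dr." and on many other cases, which is one reason to prefer the nltk tokenizers on real text.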

### Stop Words

Most of the time we are interested only in the useful words of a sentence, either to recognize the text or to extract features from it. To that end, commonly occurring words such as 'the', 'is' and 'and' are not useful to us. Such words are called stop words. nltk ships a list of them, which you can print:
```python
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
```
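
Every entry in the list above is lowercase, so a plain membership test is case-sensitive and capitalized tokens such as 'All' are not matched. The sketch below shows one possible way to filter case-insensitively. It assumes a small hand-picked STOP_WORDS set and a helper named remove_stop_words, both invented here so that the snippet runs without the nltk data. The exercise that follows keeps the plain case-sensitive test.

```python
# A small, hand-picked subset of stop words, used only for illustration.
STOP_WORDS = {"all", "is", "by", "a", "the", "of", "will", "on", "an", "before", "at"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Lowercase each token before the lookup so that 'All' matches 'all'.
    # A set makes each membership test fast.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ['All', 'activity', 'today', 'is', 'governed', 'by', 'a', 'treaty', '.']
print(remove_stop_words(tokens))
# ['activity', 'today', 'governed', 'treaty', '.']
```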



## Exercise:

Given the text from the Wall Street Journal, tokenize it at the word level and remove the stop words.

- Assign the resulting list to the variable word_features and print it out.

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

wsj_text = '''All extraterrestrial activity today is governed by a 50-year-old treaty drafted at the height of the Cold War. Will governments around the world agree on an update before the final frontier becomes the Wild West?'''



<p>Use a list comprehension of the form [x for x in items if x not in excluded] to filter the data.</p>

In [None]:
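wsj_tokens = word_tokenize(wsj_text)
# Build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
word_features = [word for word in wsj_tokens if word not in stop_words]
print(word_features)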

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    import numpy as np
    
    wsj_tokens_ = word_tokenize(wsj_text)
    word_features_ = [word for word in wsj_tokens_ if word not in stopwords.words('english')]
    
    if np.all(word_features == word_features_):
      ref_assert_var = True
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var


<br/><br/><br/>
## Stemming


We often have to do more cleaning to extract features from a text before we can begin to use them. Removing stop words was the first step, but that isn't enough. The features extracted from the Wall Street Journal text were:
```python
['All', 'extraterrestrial', 'activity', 'today', 'governed', '50-year-old', 'treaty', 'drafted', 'height', 'Cold', 'War', '.', 'Will', 'governments', 'around', 'world', 'agree', 'update', 'final', 'frontier', 'becomes', 'Wild', 'West', '?']
```

Words such as 'governed' and 'drafted' are inflected past forms of 'govern' and 'draft', and 'becomes' is the third-person singular form of 'become'. Consider a scenario where another set of word features from a different class contains words such as 'governs', 'drafts' and 'became'. These words carry the same meaning but would be regarded as different features. We need a way to map such word forms to their root word. A word stemmer does this. There are many word stemmers available in nltk. Import the Porter Stemmer as:

```python
from nltk.stem.porter import PorterStemmer
```

Instantiate a stemmer and run it on the WSJ text. Note that the stemmer works at the word level, so it needs to be run on every word that remains after removing the stop words.

```python
stemmer = PorterStemmer()
root_words = [stemmer.stem(word) for word in word_features]
print(root_words)
['all', 'extraterrestri', 'activ', 'today', 'govern', '50-year-old', 'treati', 'draft', 'height', 'cold', 'war', '.', 'will', 'govern', 'around', 'world', 'agre', 'updat', 'final', 'frontier', 'becom', 'wild', 'west', '?']
```

A root word (stem) need not be the noun form of the word, or even a word in the dictionary: 'extraterrestri' and 'treati' above are not. It is simply a feature that maps the various forms of a word to a common key.
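
The following toy sketch illustrates the general idea of suffix stripping. It is deliberately naive and is not the Porter algorithm, which applies a much larger set of ordered rules. The function toy_stem and its suffix list are assumptions made for this illustration.

```python
def toy_stem(word):
    """Strip one common English suffix; a naive illustration only."""
    word = word.lower()
    # Longer suffixes are tried first so 'ments' wins over 's'.
    for suffix in ("ments", "ment", "ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["governed", "governs", "governments", "drafted", "becomes", "became"]:
    print(w, "->", toy_stem(w))
```

Here 'governed', 'governs' and 'governments' all collapse to 'govern', 'becomes' turns into the non-word 'becom', and the irregular form 'became' is left unchanged. Irregular forms are a typical limit of purely rule-based suffix stripping.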



## Exercise:

Another stemmer, often regarded as an improvement on the Porter Stemmer, is the Snowball Stemmer. It is available in the nltk.stem.snowball module. The stemmer supports many languages, so SnowballStemmer takes the language, "english", as an argument.

- Extract the root words from word_features using the SnowballStemmer and print them out.

In [None]:
from nltk.stem.snowball import SnowballStemmer



<p>It works similarly to PorterStemmer().</p>

In [None]:
stemmer = SnowballStemmer("english")
root_words = [stemmer.stem(word) for word in word_features]
print(root_words)

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    stemmer_ = SnowballStemmer("english")
    root_words_ = [stemmer_.stem(word) for word in word_features]
    
    import numpy as np
    
    if np.all(root_words == root_words_):
      ref_assert_var = True
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var


<br/><br/><br/>
## Lemmatization

Lemma is defined in the Google dictionary as "a heading indicating the subject or argument of a literary composition, an annotation, or a dictionary entry". In NLP, the relevant sense is the last one: the lemma is the dictionary (canonical) form of a word. Lemmatization is the algorithmic method of determining the lemma of a word form. The main difference from stemming is that lemmatization takes the context of the word into account, so it depends on the part of speech (POS) of the word. Unlike stemming, lemmatization yields real dictionary words. For example, lemmatizing the word "governed" with the POS set to verb and to noun yields different lemmas:

```python
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize("governed", pos="v")
'govern'

>>> lemmatizer.lemmatize("governed", pos="n")
'governed'
```
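
To see why the POS is part of the lookup, here is a toy sketch that keys a tiny table on (word, pos) pairs. It is not how WordNetLemmatizer works internally (that class looks words up in WordNet). LEMMA_TABLE and toy_lemmatize are hypothetical names, and the table holds only a few hand-written entries.

```python
# Hand-written entries for illustration; a real lemmatizer uses a full lexicon.
LEMMA_TABLE = {
    ("governed", "v"): "govern",
    ("becomes", "v"): "become",
    ("became", "v"): "become",
    ("governments", "n"): "government",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry exists for this POS.
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("governed", pos="v"))  # govern
print(toy_lemmatize("governed", pos="n"))  # governed
print(toy_lemmatize("became", pos="v"))    # become
```

Unlike suffix stripping, a lookup of this kind can map an irregular form such as 'became' to a real word, but only when the right POS is supplied.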



## Exercise:

- Lemmatize each word in the word_features list with pos set to verb and assign the result to the list lemmas.
- Print out the list lemmas.

In [None]:
from nltk.stem import WordNetLemmatizer



<p>Use it in the same way as the stemmers.</p>

In [None]:
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word, pos="v") for word in word_features]
print(lemmas)

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    import numpy as np
    
    lemmatizer_ = WordNetLemmatizer()
    lemmas_ = [lemmatizer_.lemmatize(word, pos="v") for word in word_features]
    
    if np.all(lemmas == lemmas_):
      ref_assert_var = True
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var


<br/><br/><br/>
## Parts-Of-Speech Tagging

We learnt that stemming generates root forms that are not necessarily real words, whereas lemmatization generates real root words given the POS. Since texts are usually large, it would be impractical to supply the POS for every single word by hand. We need a mechanism that generates the POS from the structure of the sentence. This can be done with a POS tagger. POS taggers look at the context and sentence structure and use machine learning algorithms to generate the best guesses for the parts of speech. This is performed in two steps:

### Tokenizing the Words

The text needs to be split into 'tokens': individual words, symbols and other textual elements. This is done with word_tokenize from nltk.tokenize (pos_tag is imported here as well, for the next step):

```python
from nltk import pos_tag
from nltk.tokenize import word_tokenize

wsj_tokens = word_tokenize(wsj_text)
```

### POS Tagging

POS tagging is done by running pos_tag, imported from nltk, on the tokens.

```python
>>> wsj_pos_tokens = pos_tag(wsj_tokens)
>>> wsj_pos_tokens
[('All', 'DT'), ('extraterrestrial', 'JJ'), ('activity', 'NN'), ('today', 'NN'), ('is', 'VBZ'), ('governed', 'VBN'), ('by', 'IN'), ('a', 'DT'), ('50-year-old', 'JJ'), ('treaty', 'NN'), ('drafted', 'VBN'), ('at', 'IN'), ('the', 'DT'), ('height', 'NN'), ('of', 'IN'), ('the', 'DT'), ('Cold', 'NNP'), ('War', 'NNP'), ('.', '.'), ('Will', 'MD'), ('governments', 'NNS'), ('around', 'IN'), ('the', 'DT'), ('world', 'NN'), ('agree', 'NN'), ('on', 'IN'), ('an', 'DT'), ('update', 'NN'), ('before', 'IN'), ('the', 'DT'), ('final', 'JJ'), ('frontier', 'NN'), ('becomes', 'VBZ'), ('the', 'DT'), ('Wild', 'NNP'), ('West', 'NNP'), ('?', '.')]
```

pos_tag returns a list of (word, tag) tuples.

<img src="../images/POS_tagging.png" style="width: 1000px;">
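
The tags above are Penn Treebank tags such as 'VBN' and 'NNS', while the lemmatizer expects a one-letter POS such as "v" or "n". A common way to bridge the two is to look at the first letter of the tag. The helper penn_to_wordnet below is a hypothetical name for this sketch, and the default of "n" is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet-style POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# A few (word, tag) pairs taken from the tagged output above.
tagged = [('governed', 'VBN'), ('treaty', 'NN'), ('final', 'JJ'), ('becomes', 'VBZ')]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# [('governed', 'v'), ('treaty', 'n'), ('final', 'a'), ('becomes', 'v')]
```

Each pair could then be passed to lemmatizer.lemmatize(word, pos=pos), so the POS no longer has to be entered by hand.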


## Exercise:

After performing POS tagging, build a dictionary that has each POS type as a key and the list of words with that type as its value.

- Assign the dictionary to the variable pos_word_map.

In [None]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

wsj_tokens = word_tokenize(wsj_text)
wsj_pos = pos_tag(wsj_tokens)

# Determine the (type2, words) pairs
pos_word_map = {}


<p>Check whether the type already exists in the dictionary; if it does, append the word to its list. If it doesn't, assign a new list containing the word to the key, type.</p>

In [None]:
pos_word_map = {}

for (word, type2) in wsj_pos:
    if type2 in pos_word_map:
        pos_word_map[type2].append(word)
    else:
        pos_word_map[type2] = [word]
 
print(pos_word_map)
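
The same grouping can also be written with collections.defaultdict, which creates the empty list the first time a key is seen. This is an alternative sketch rather than the expected exercise answer. It uses a short hand-copied sample and the made-up name pos_word_map_sketch so that it does not overwrite pos_word_map.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the tagged output shown earlier.
tagged = [('All', 'DT'), ('activity', 'NN'), ('today', 'NN'), ('is', 'VBZ'), ('the', 'DT')]

grouped = defaultdict(list)
for word, tag in tagged:
    # No existence check needed: a missing key starts as an empty list.
    grouped[tag].append(word)

pos_word_map_sketch = dict(grouped)  # convert back to a plain dict
print(pos_word_map_sketch)
# {'DT': ['All', 'the'], 'NN': ['activity', 'today'], 'VBZ': ['is']}
```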

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    pos_word_map_ = {}
    
    for (word, type2) in wsj_pos:
        if type2 in pos_word_map_.keys():
            pos_word_map_[type2].append(word)
        else:
            pos_word_map_[type2] = [word]
    
    if pos_word_map == pos_word_map_:
      ref_assert_var = True
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var

<br/><br/><br/>
## Information Retrieval (IR)

Information retrieval (IR) is the study of extracting relevant information from a corpus of information sources. In the context of NLP, it involves identifying information through search, for example by recognizing the entities that appear in a text.

A word cloud is one of many ways to quickly see what a text is about. It conveys the frequency, and hence the importance, of the words used in a text: the bigger the font, the more frequently the word was used. Below is a word cloud formed from all job descriptions for a big data engineer posted on stackoverflow.com.

<img src="../images/cloud-bigdataengineer.png" style="width: 700px;">
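
The quantity behind a word cloud is a word frequency count. The sketch below shows how such counts could be computed with collections.Counter. The word list is invented for illustration and is not taken from the job descriptions.

```python
from collections import Counter

# Hypothetical tokens, assumed to be already tokenized and stop-word filtered.
words = ["data", "spark", "data", "python", "hadoop", "data", "spark"]

freq = Counter(words)
# The most frequent words would be drawn in the largest font.
print(freq.most_common(2))
# [('data', 3), ('spark', 2)]
```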

<br>
'Named Entity Recognition' (NER) is another form of information retrieval. It classifies the named entities in a text into predetermined categories such as person names, organization names, etc. nltk has a feature to perform NER.


### Example:

```python
from nltk import word_tokenize, pos_tag, ne_chunk
 
wsj_text2 = "CEO Michael McGarry of PPG Industries, the world’s largest paint maker, is pursuing Dutch rival Akzo Nobel, in one of the boldest trans-Atlantic bids in recent memory."

ne_chunk(pos_tag(word_tokenize(wsj_text2)))
Tree('S', [Tree('ORGANIZATION', [('CEO', 'NN')]), Tree('PERSON', [('Michael', 'NNP'), ('McGarry', 'NNP')]), ('of', 'IN'), Tree('ORGANIZATION', [('PPG', 'NNP'), ('Industries', 'NNPS')]), (',', ','), ('the', 'DT'), ('world’s', 'NN'), ('largest', 'JJS'), ('paint', 'NN'), ('maker', 'NN'), (',', ','), ('is', 'VBZ'), ('pursuing', 'VBG'), Tree('GPE', [('Dutch', 'NNP')]), ('rival', 'NN'), Tree('PERSON', [('Akzo', 'NNP'), ('Nobel', 'NNP')]), (',', ','), ('in', 'IN'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('boldest', 'JJS'), ('trans-Atlantic', 'JJ'), ('bids', 'NNS'), ('in', 'IN'), ('recent', 'JJ'), ('memory', 'NN'), ('.', '.')])
```

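This is useful: the NER chunker identifies 'Michael McGarry' as a PERSON, 'PPG Industries' as an ORGANIZATION and 'Dutch' as a GPE (geo-political entity). It is not perfect, however: in the output above, the company 'Akzo Nobel' is labeled as a PERSON and the title 'CEO' as an ORGANIZATION.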
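
To pull just the entities out of such a tree, one approach is to walk its top-level children and keep the ones that carry a label. The sketch below assumes a minimal stand-in class, MiniTree, that mimics only the parts of the nltk tree used here (a list of children plus a label() method), so the snippet runs without the nltk models. extract_entities is likewise a made-up helper name.

```python
class MiniTree(list):
    """Stand-in for the nltk tree: a list of children with a label."""
    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

def extract_entities(tree):
    entities = []
    for node in tree:
        # Plain (word, tag) tuples have no label; entity chunks do.
        if hasattr(node, "label"):
            name = " ".join(word for word, tag in node)
            entities.append((name, node.label()))
    return entities

sample = MiniTree('S', [
    MiniTree('PERSON', [('Michael', 'NNP'), ('McGarry', 'NNP')]),
    ('of', 'IN'),
    MiniTree('ORGANIZATION', [('PPG', 'NNP'), ('Industries', 'NNPS')]),
])
print(extract_entities(sample))
# [('Michael McGarry', 'PERSON'), ('PPG Industries', 'ORGANIZATION')]
```

Since extract_entities only relies on iteration and label(), the tree returned by ne_chunk should work with it as well.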


## Exercise:

The tree output is the data structure returned by ne_chunk.

- Assign the leaves to the variable ner_leaves. Use the .leaves() method.
- Print out ner_leaves and analyze the information. What does it contain?

In [None]:
from nltk import word_tokenize, pos_tag, ne_chunk

wsj_text2 = "CEO Michael McGarry of PPG Industries, the world’s largest paint maker, is pursuing Dutch rival Akzo Nobel, in one of the boldest trans-Atlantic bids in recent memory."
ner_info = ne_chunk(pos_tag(word_tokenize(wsj_text2)))

# What are the leaves?

ner_leaves = ''



In [None]:
ner_leaves = ner_info.leaves()
print(ner_leaves)

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    import numpy as np
    
    ner_leaves_ = ner_info.leaves()
    
    if np.all(ner_leaves == ner_leaves_):
      ref_assert_var = True
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var