<a href="https://colab.research.google.com/github/arbi11/CompEM1/blob/master/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
sample_data = ["Today is a cold Sunday morning. I am at the Nashville School of Law. \
               I am here for PyTennessee where I can learn more about Python."]

## Sentence segmentation
Also called sentence tokenization or sentence boundary disambiguation, this step splits text into sentences by deciding where each sentence starts and ends. Challenges include recognizing ambiguous punctuation marks, such as a period that ends an abbreviation rather than a sentence. Use ```sent_tokenize``` from ```nltk.tokenize```. 

In [0]:
from nltk.tokenize import sent_tokenize

def get_sent_tokens(data):
    """Sentence tokenization"""
    sentences = []
    for sent in data:
        sentences.extend(sent_tokenize(sent))
    print(sentences)
    return sentences


In [0]:
sample_sentences = get_sent_tokens(sample_data)


['Today is a cold Sunday morning.', 'I am at the Nashville School of Law.', 'I am here for PyTennessee where I can learn more about Python.']
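To see why ambiguous punctuation makes segmentation hard, here is a minimal sketch of a deliberately naive splitter built on the standard library. The function name `naive_sent_split` and the second example sentence are assumptions for illustration; this is not how `sent_tokenize` works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (deliberately naive)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works on simple text
print(naive_sent_split("Today is a cold Sunday morning. I am at the Nashville School of Law."))

# Breaks on abbreviations: 'Dr.' is wrongly treated as a sentence of its own
print(naive_sent_split("Dr. Smith arrived late. He missed the keynote."))
```

A trained tokenizer is meant to handle cases like the abbreviation in the second call, which a fixed punctuation rule cannot.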


## Word tokenization
Similar to sentence tokenization, but splits each sentence into words and punctuation marks. Use ```word_tokenize``` from ```nltk.tokenize```. 

In [0]:
from nltk.tokenize import word_tokenize

def get_word_tokens(sentences):
    '''Word tokenization'''
    words = []
    for sent in sentences:
        words.extend(word_tokenize(sent))
    print(words)
    return words

In [0]:
sample_words = get_word_tokens(sample_sentences)


['Today', 'is', 'a', 'cold', 'Sunday', 'morning', '.', 'I', 'am', 'at', 'the', 'Nashville', 'School', 'of', 'Law', '.', 'I', 'am', 'here', 'for', 'PyTennessee', 'where', 'I', 'can', 'learn', 'more', 'about', 'Python', '.']
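The output above treats each period as its own token. A rough sketch of that idea with a regular expression, compared with plain `str.split`, might look like this. The helper name `simple_word_tokens` is an assumption, and the pattern is far simpler than what `word_tokenize` does.

```python
import re

def simple_word_tokens(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "I am at the Nashville School of Law."
print(sentence.split())              # whitespace split keeps 'Law.' together
print(simple_word_tokens(sentence))  # the period becomes a separate token
```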


## Frequency distribution
Counts how often each word occurs in the data. Use ```FreqDist``` from ```nltk.probability```, with ```matplotlib``` for plotting.

In [13]:
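import matplotlib  # FreqDist.plot needs matplotlib to be installed
from nltk.probability import FreqDist
# No backend is forced here: matplotlib.use('TkAgg') can raise ImportError
# in environments without Tk (for example, a hosted notebook).

def plot_freq_dist(words, num_words=20):
    fdist = FreqDist(words)
    fdist.plot(num_words, cumulative=False)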
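When plotting is not available, the counting itself can be sketched with `collections.Counter` from the standard library (`FreqDist` is a `Counter` subclass). The short word list below is a trimmed copy of the tokens produced earlier, used here only for illustration.

```python
from collections import Counter

words = ['I', 'am', 'at', 'the', 'Nashville', 'School', 'of', 'Law', '.',
         'I', 'am', 'here', 'for', 'PyTennessee', '.']

counts = Counter(words)

# Print the three most frequent tokens with their counts
for token, count in counts.most_common(3):
    print(f"{token}: {count}")
```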

## Cleaning the data
Real world data is often messy. You can do a bunch of preprocessing to ensure the data is clean, like:
- Removing special characters
- Removing stopwords


In [0]:
import re

def remove_special_characters(sentences, remove_digits=False):
    """Strip everything except letters, whitespace and (optionally) digits."""
    # Digits are kept unless remove_digits is True
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    clean_sentences = []
    for sent in sentences:
        clean_text = re.sub(pattern, '', sent)
        clean_sentences.append(clean_text)
    print(clean_sentences)
    return clean_sentences

In [15]:
remove_special_characters(sample_sentences)


['Today is a cold Sunday morning', 'I am at the Nashville School of Law', 'I am here for PyTennessee where I can learn more about Python']


['Today is a cold Sunday morning',
 'I am at the Nashville School of Law',
 'I am here for PyTennessee where I can learn more about Python']
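Stopword removal, the second cleaning step listed above, can be sketched as a simple filter. The `STOPWORDS` set here is a small hand-picked assumption for illustration; `nltk.corpus.stopwords.words('english')` offers a fuller list after downloading the `stopwords` package.

```python
# Small hand-picked stopword set for illustration only (an assumption)
STOPWORDS = {"i", "am", "is", "a", "at", "the", "of", "for",
             "where", "can", "about", "here", "more"}

def remove_stopwords(words, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [w for w in words if w.lower() not in stopwords]

print(remove_stopwords(['I', 'am', 'at', 'the', 'Nashville', 'School', 'of', 'Law']))
```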

## Text processing
Text processing approaches like stemming and lemmatization reduce the inflected forms of a word to a common base form. 
### Stemming
Stemming chops off the ends of words in the hope of arriving at the base form. Use ```PorterStemmer``` from ```nltk.stem```.

In [0]:
# Stemming
from nltk.stem import PorterStemmer

def get_stems(words):
    ps = PorterStemmer()
    stems = []
    for word in words:
        stems.append(ps.stem(word))
    print(stems)
    return stems

In [17]:
sample_stems = get_stems(sample_words)


['today', 'is', 'a', 'cold', 'sunday', 'morn', '.', 'I', 'am', 'at', 'the', 'nashvil', 'school', 'of', 'law', '.', 'I', 'am', 'here', 'for', 'pytennesse', 'where', 'I', 'can', 'learn', 'more', 'about', 'python', '.']
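The "cut off the ends" idea can be sketched with a toy suffix stripper. This is not the Porter algorithm; the `SUFFIXES` tuple and the `naive_stem` helper are assumptions made up for this example, and the real stemmer applies a much larger set of ordered rules.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, far smaller than Porter's rule set

def naive_stem(word, min_stem_len=3):
    """Lowercase the word and strip the first matching suffix,
    keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["morning", "learned", "schools", "is"]])
```

Like the Porter output above, the result for "morning" is a stem ("morn") rather than a dictionary word.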


### Lemmatization
Lemmatization groups the different inflected forms of a word so they can be mapped to the same base form (the lemma). 
It is more complex than stemming because the context of a word, such as its part of speech, is also taken into account. It uses WordNet, a lexical database of English. 
Use ```WordNetLemmatizer``` from ```nltk.stem```.


In [22]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

def get_lemma(words):
    wordnet_lemmatizer = WordNetLemmatizer()
    lemma = []
    for word in words:
        # Without a POS tag the lemmatizer treats every word as a noun,
        # so verb forms such as "becoming" are returned unchanged.
        lemma.append(wordnet_lemmatizer.lemmatize(word))
    print(lemma)
    return lemma

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [23]:
sample_lemma = get_lemma(sample_words)
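Why the part of speech matters for lemmatization can be sketched with a toy lookup table. The `LEMMAS` dictionary and `lookup_lemma` helper are hypothetical stand-ins for WordNet; with NLTK the equivalent step would be passing a `pos` argument to `WordNetLemmatizer.lemmatize`.

```python
# Tiny hypothetical lexicon keyed by (word, POS); WordNet plays this role in NLTK
LEMMAS = {
    ("becoming", "v"): "become",
    ("am", "v"): "be",
    ("is", "v"): "be",
    ("schools", "n"): "school",
}

def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return LEMMAS.get((word.lower(), pos), word)

print(lookup_lemma("becoming"))           # looked up as a noun: unchanged
print(lookup_lemma("becoming", pos="v"))  # looked up as a verb: 'become'
```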


## POS tagging
The English language is made up of different parts of speech (POS) like nouns, verbs, pronouns, adjectives, etc. POS tagging analyzes the words in a sentence and assigns each one a POS tag depending on how it is used. It is also called grammatical tagging or word-category disambiguation. Use ```nltk.pos_tag```. Several tagsets exist; the most common are the Penn Treebank tagset and the Universal tagset. 

![POS tagging illustration](https://i.stack.imgur.com/FhcKV.png)

In [24]:
nltk.download('averaged_perceptron_tagger')

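def get_pos_tags(words):
    tags=[]
    for word in words:
        # Each word is tagged in isolation here, so the tagger sees no context;
        # passing the whole token list to nltk.pos_tag lets it use neighboring words.
        tags.append(nltk.pos_tag([word]))
    print(tags)
    return tags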

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [25]:
sample_tags = get_pos_tags(sample_words)


[[('Today', 'NN')], [('is', 'VBZ')], [('a', 'DT')], [('cold', 'NN')], [('Sunday', 'NNP')], [('morning', 'NN')], [('.', '.')], [('I', 'PRP')], [('am', 'VBP')], [('at', 'IN')], [('the', 'DT')], [('Nashville', 'NNP')], [('School', 'NN')], [('of', 'IN')], [('Law', 'NN')], [('.', '.')], [('I', 'PRP')], [('am', 'VBP')], [('here', 'RB')], [('for', 'IN')], [('PyTennessee', 'NN')], [('where', 'WRB')], [('I', 'PRP')], [('can', 'MD')], [('learn', 'NN')], [('more', 'RBR')], [('about', 'IN')], [('Python', 'NN')], [('.', '.')]]
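The tags in the output above are Penn Treebank tags. As a rough sketch of how they relate to the coarser 12-tag Universal tagset, the snippet below uses a partial, hand-written mapping covering only the tags seen above. The `PENN_TO_UNIVERSAL` table and `to_universal` helper are assumptions for illustration; `nltk.pos_tag` also accepts a `tagset='universal'` argument when the corresponding mapping data is downloaded.

```python
# Partial, hand-written mapping from Penn Treebank tags to Universal tags
# (an illustrative subset, not the complete mapping)
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNP": "NOUN",
    "VBZ": "VERB", "VBP": "VERB", "MD": "VERB",
    "DT": "DET", "IN": "ADP", "PRP": "PRON",
    "RB": "ADV", "RBR": "ADV", "WRB": "ADV",
    ".": ".",
}

def to_universal(tagged_words):
    """Map (word, penn_tag) pairs to (word, universal_tag); unknown tags become 'X'."""
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_words]

print(to_universal([("I", "PRP"), ("am", "VBP"), ("at", "IN"),
                    ("the", "DT"), ("Nashville", "NNP")]))
```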


## Named entity recognition
Use NER to identify entities like person, organization, city, etc. Helpful in redacting PII. More details here: https://www.nltk.org/book/ch07.html
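As a minimal sketch of the redaction use case, the snippet below uses a dictionary (gazetteer) lookup rather than a trained NER model. The `GAZETTEER` entries, their labels, and both helper functions are assumptions made up for this example; statistical NER, as covered in the linked chapter, generalizes to names that are not in any list.

```python
# Hypothetical gazetteer: a hand-made lookup table, not a trained NER model
GAZETTEER = {
    "Nashville School of Law": "ORGANIZATION",
    "PyTennessee": "EVENT",
    "Python": "LANGUAGE",
}

def find_entities(text):
    """Return (entity, label, start_index) for each gazetteer entry found, ordered by position."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        if start != -1:
            found.append((name, label, start))
    return sorted(found, key=lambda item: item[2])

def redact(text):
    """Replace each detected entity with its label in square brackets."""
    for name, label, _ in find_entities(text):
        text = text.replace(name, f"[{label}]")
    return text

print(redact("I am at the Nashville School of Law."))
```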


## Bag of words
Bag of words is an approach for text feature extraction. Just imagine a bag of popcorn, 
where each popcorn kernel represents a word that is present in the text. Each sentence can be represented as a vector
over all the words in a vocabulary. In the simplest (binary) form, an entry is 1 if the word is present in the sentence and 0 otherwise; count-based variants such as ```CountVectorizer``` store the number of occurrences instead.

![Bag of words illustration](https://cdn-images-1.medium.com/max/1600/1*zMdHVQQ7HYv_mMZ5Ne-2yQ.png)
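The 1/0 representation described above can be sketched in a few lines of plain Python. The helper name `binary_bag_of_words` and the whitespace tokenization are simplifying assumptions for illustration.

```python
def binary_bag_of_words(sentences):
    """Return (sorted vocabulary, one 0/1 vector per sentence)."""
    tokenized = [sentence.lower().split() for sentence in sentences]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = [[1 if word in tokens else 0 for word in vocabulary]
               for tokens in tokenized]
    return vocabulary, vectors

vocab, vectors = binary_bag_of_words(["I am here", "I am at school"])
print(vocab)    # ['am', 'at', 'here', 'i', 'school']
print(vectors)  # [[1, 0, 1, 1, 0], [1, 1, 0, 1, 1]]
```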

## TF-IDF
Term frequency-inverse document frequency (TF-IDF) assigns a score to each word in a document. Words that occur in many of the documents receive a lower weight.
![TF-IDF illustration](http://www.bloter.net/wp-content/uploads/2016/09/td-idf-graphic.png)

In [0]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

def get_bag_of_words(sentences):
    vectorizer = CountVectorizer()
    print(vectorizer.fit_transform(sentences).todense())
    print(vectorizer.vocabulary_) 

In [27]:
get_bag_of_words(sample_data)


[[1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
{'today': 19, 'is': 7, 'cold': 4, 'sunday': 17, 'morning': 11, 'am': 1, 'at': 2, 'the': 18, 'nashville': 12, 'school': 16, 'of': 13, 'law': 8, 'here': 6, 'for': 5, 'pytennessee': 14, 'where': 20, 'can': 3, 'learn': 9, 'more': 10, 'about': 0, 'python': 15}
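For TF-IDF, a minimal sketch of the textbook formula, tf × log(N / df), is shown below. The `tf_idf` helper and the three example documents are assumptions for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ.

```python
import math

def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = {}
    for tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            doc_scores[term] = tf * math.log(n_docs / doc_freq[term])
        scores.append(doc_scores)
    return scores

docs = ["the food was great", "the service was slow", "the place closed"]
scores = tf_idf(docs)
for doc_scores in scores:
    print(doc_scores)
```

In this sketch "the" appears in every document and scores 0, while a word unique to one document, such as "food", scores highest within it.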


## Sentiment analysis



In [0]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer 
  
restaurant_reviews = ["Great place to visit in Nashville.",
"The food took too long to come, the service was slow.",
"Everything was amazing.",
"Place closed down a month ago.",
"Had to wait in line for an hour, but the food was worth the wait.",
]
  
sentiment_analyzer = SentimentIntensityAnalyzer()
for sentence in restaurant_reviews:
    print(sentence)
    sentiment_score = sentiment_analyzer.polarity_scores(sentence)
    for score in sentiment_score:
        print('{0}: {1},'.format(score, sentiment_score[score]), end='')
    print()
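`polarity_scores` returns a dictionary with `neg`, `neu`, `pos` and `compound` entries. One common way to turn the compound score into a label is a pair of thresholds. The sketch below assumes the frequently cited cutoffs of ±0.05; the helper name and the sample scores are made up for illustration and are not real VADER output.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores for illustration, not real VADER output
for compound in (0.62, -0.3, 0.0):
    print(compound, label_from_compound(compound))
```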

## Word embeddings - Word2Vec
Vector space model: represent words and sentences as vectors so that semantic relationships show up as geometric closeness between them. 

![Word2Vec illustration](http://www.flyml.net/wp-content/uploads/2016/11/w2v-3-samples.png)

We demonstrate the following:


1.   Train word embeddings using the Brown corpus; and
2.   Load a pre-trained model and perform simple tasks.





In [0]:
import gensim


In [34]:
import warnings
warnings.filterwarnings('ignore')
nltk.download('brown')
from nltk.corpus import brown
model = gensim.models.Word2Vec(brown.sents())

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [0]:
model.save('brown.embedding')
new_model = gensim.models.Word2Vec.load('brown.embedding')

In [37]:
len(new_model.wv['university'])  # word vectors are accessed through the .wv attribute


100

In [38]:
new_model.wv.similarity('university','school') > 0.3


True
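The similarity score used above is the cosine similarity between two word vectors. A minimal sketch in plain Python, with made-up 2-D vectors standing in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction: close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0
```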

In [40]:
nltk.download('word2vec_sample')
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


We pruned the model to only include the most common words (~44k words).



In [41]:
len(model.key_to_index)  # gensim 4 and later; older releases used len(model.vocab)



43981

Each word is represented in the space of 300 dimensions:



In [42]:
len(model['university'])


300

Finding the top n words that are similar to a target word is simple. The result is the list of n words with the score.



In [43]:
model.most_similar(positive=['university'], topn = 3)


[('universities', 0.7003918886184692),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013)]

In [44]:
model.doesnt_match('breakfast cereal dinner lunch'.split())


'cereal'

Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. For example, the vector 'King - Man + Woman' is close to 'Queen' and 'Germany - Berlin + Paris' is close to 'France'.

In [45]:
model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)


[('queen', 0.7118192911148071)]

In [46]:
model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)


[('France', 0.7884092330932617)]
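The vector arithmetic behind these analogy queries can be sketched with toy vectors. The 2-D `VECTORS` table (one axis for royalty, one for gender) and the `analogy` helper are assumptions made up for this example; `most_similar` in gensim normalizes vectors and searches a full vocabulary, so this is only an outline of the idea.

```python
import math

# Hypothetical 2-D embeddings: (royalty, gender). Real models learn hundreds of dimensions.
VECTORS = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(u, v):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, VECTORS[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, VECTORS[word])]
    candidates = [w for w in VECTORS if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy(positive=["woman", "king"], negative=["man"]))  # queen
```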