# NLTK Basics

NLTK is a widely used toolkit for preprocessing raw text data. This tutorial covers some of the basics of how NLTK can be a useful tool for NLP tasks. For the purpose of this tutorial, I have taken references from https://github.com/zelandiya/KiwiPyCon-NLP-tutorial

Credits : A Medelyan  

## Dependencies
* NLTK 
* movie_reviews corpus
* punkt tokenizer model

To install the Punkt model, typing *nltk.download()* will open the GUI below, where you can go to Models, select punkt, and click Download. Alternatively, *nltk.download('punkt')* fetches it directly.
![punkt_installer](punkt_install.jpg "Punkt Installer")


# Dataset 1: Movie Reviews

We will use the movie reviews dataset for the sake of understanding.

### Downloading a corpus

In [20]:
import nltk
nltk.download('movie_reviews')
nltk.download()

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jaley/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Path where NLTK searches

In [18]:
print nltk.data.path

['/home/jaley/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


### Getting the details of corpus

In [25]:
from nltk.corpus import movie_reviews
print 'No of documents in corpus  : ',len(movie_reviews.fileids())
print 'Categories of movie Review : ',movie_reviews.categories()
print '\nExample of filenames(pos)  : ',movie_reviews.fileids('pos')[:2]
print 'Example of filenames(neg)  : ',movie_reviews.fileids('neg')[:2]

#Words from sample files(pos)
print '\n{} Words from filenames(pos)  : '.format(len(movie_reviews.words('pos/cv000_29590.txt'))),
print movie_reviews.words('pos/cv000_29590.txt')

#Words from sample files(neg)
print '\n{} Words from filenames(neg)  : '.format(len(movie_reviews.words('neg/cv000_29416.txt'))),
print movie_reviews.words('neg/cv000_29416.txt')
print '\nExample of raw Text is : ',movie_reviews.raw('pos/cv000_29590.txt').split('.')[0]
print '\nMovie Review sentences : ',movie_reviews.sents('pos/cv000_29590.txt')



No of documents in corpus  :  2000
Categories of movie Review :  [u'neg', u'pos']

Example of filenames(pos)  :  [u'pos/cv000_29590.txt', u'pos/cv001_18431.txt']
Example of filenames(neg)  :  [u'neg/cv000_29416.txt', u'neg/cv001_19502.txt']

862 Words from filenames(pos)  :  [u'films', u'adapted', u'from', u'comic', u'books', ...]

879 Words from filenames(neg)  :  [u'plot', u':', u'two', u'teen', u'couples', u'go', ...]

Example of raw Text is :  films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before 

Movie Review sentences :  [[u'films', u'adapted', u'from', u'comic', u'books', u'have', u'had', u'plenty', u'of', u'success', u',', u'whether', u'they', u"'", u're', u'about', u'superheroes', u'(', u'batman', u',', u'superman', u',', u'spawn', u')', u',', u'or', u'geared', u'toward

### Most Frequent Words in Review

In [27]:
from nltk.probability import FreqDist
words = movie_reviews.words('pos/cv000_29590.txt')
words_by_frequency = FreqDist(words)
print 'Most frequent words in review'
print words_by_frequency.items()[:20]


Most frequent words in review
[(u'all', 3), (u'childs', 1), (u'steve', 1), (u'surgical', 1), (u'comments', 1), (u'go', 1), (u'certainly', 1), (u'to', 15), (u'watchmen', 1), (u'song', 1), (u'very', 1), (u'simpsons', 1), (u'novel', 1), (u'jack', 2), (u'surgeon', 1), (u'level', 1), (u'did', 1), (u'turns', 2), (u'michael', 1), (u'flashy', 1)]
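
In NLTK 3, `FreqDist` subclasses `collections.Counter`, so the way counts are ranked can be sketched with `Counter` alone. The token list below is made up for illustration and stands in for `movie_reviews.words(...)`.

```python
from collections import Counter

# Hypothetical token list, standing in for the words of one review
tokens = ['the', 'film', 'is', 'the', 'best', 'film', 'of', 'the', 'year']
counts = Counter(tokens)

# most_common(n) sorts by count, highest first
top_two = counts.most_common(2)
print(top_two)
```

With a real `FreqDist`, the equivalent call would be `words_by_frequency.most_common(2)`.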


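Note that `FreqDist.items()` does not sort by frequency in NLTK 3 (`FreqDist` is a `Counter` subclass), which is why the 20 entries printed for the single review are in arbitrary order; `words_by_frequency.most_common(20)` returns the top 20 by count. The same applies to the later cells that slice `items()`.

To look at word frequencies across both the *pos* and *neg* categories, run the code below.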

In [28]:
# Compare the most frequent words in both sets
print ''
for category in movie_reviews.categories():

    print 'Category', category
    all_words = movie_reviews.words(categories=category)
    all_words_by_frequency = FreqDist(all_words)
    print all_words_by_frequency.items()[:20]


Category neg
[(u'sonja', 1), (u'askew', 4), (u'woods', 54), (u'spiders', 1), (u'bazooms', 1), (u'hanging', 37), (u'francesca', 3), (u'comically', 5), (u'disobeying', 1), (u'hennings', 2), (u'canet', 1), (u'originality', 34), (u'caned', 1), (u'rickman', 4), (u'stipulate', 1), (u'rawhide', 1), (u'bringing', 25), (u'unsworth', 1), (u'liaisons', 8), (u'wooden', 27)]
Category pos
[(u'woods', 36), (u'spiders', 3), (u'hanging', 22), (u'woody', 100), (u'comically', 7), (u'localized', 1), (u'scold', 2), (u'originality', 24), (u'mutinies', 1), (u'rickman', 11), (u'slothful', 1), (u'wracked', 1), (u'capoeira', 1), (u'rawhide', 1), (u'bringing', 56), (u'liaisons', 1), (u'grueling', 1), (u'sommerset', 4), (u'wooden', 21), (u'wednesday', 5)]


### Removing Stopwords
Stopwords are very common words that carry little meaning on their own and occur many times in text, such as a, the, from, etc.
They include articles, helping verbs, etc.

In [34]:
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from string import punctuation
nltk.download('stopwords')
stop = stopwords.words('english')

# Strip stopwords from text
words = movie_reviews.words('pos/cv000_29590.txt')
print 'Words with stopwords    : ',words[:5]
no_stopwords = [word for word in words if word not in stop]
print 'Words without stopwords : ',no_stopwords[:5]

[nltk_data] Downloading package stopwords to /home/jaley/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Words with stopwords    :  [u'films', u'adapted', u'from', u'comic', u'books']
Words without stopwords :  [u'films', u'adapted', u'comic', u'books', u'plenty']
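
The filter above compares each word against the stopword list as is. A minimal sketch of two common refinements, lower-casing before the comparison and using a `set` for faster membership tests, is shown here. The mini stopword list and the word list are hypothetical; the real list comes from `stopwords.words('english')`.

```python
# Hypothetical mini stopword list (assumption, not the NLTK list)
stop_set = set(['a', 'the', 'from', 'of', 'have', 'had'])

# Hypothetical tokens, with one capitalised word to show the lower-casing step
words = ['Films', 'adapted', 'from', 'comic', 'books',
         'have', 'had', 'plenty', 'of', 'success']

# Keep a word only if its lower-cased form is not a stopword
no_stopwords = [w for w in words if w.lower() not in stop_set]
print(no_stopwords)
```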


## Filtering by parts of speech
In NLP, parts of speech are tags assigned to each word in a sentence based on the grammatical category it belongs to, such as noun, pronoun, or preposition. The structure is not linear but hierarchical: a sentence can be split into a noun phrase, a verb phrase, and so on, and these are further split into finer categories such as proper noun. See the Penn Treebank tag set and the explanation of POS tags in the YouTube link below.
https://www.youtube.com/watch?v=LivXkL2DO_w

The example below uses NLTK's averaged perceptron tagger. It is a trained model, so its output can be incorrect, because many words take different parts of speech depending on how they are used in a sentence.

In [38]:
nltk.download('averaged_perceptron_tagger')
print nltk.pos_tag('This is a sample text'.split(' '))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jaley/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('text', 'NN')]


Here DT is a determiner, VBZ is a third-person singular present verb, JJ is an adjective, and NN is a singular noun (NNS would be a plural noun).  
Suppose we are interested only in adjectives; we can then filter on the POS tag as below.

In [40]:
all_words = movie_reviews.words(categories='pos')[:1000]
pos_tagged = nltk.pos_tag(all_words)
# Keep words tagged exactly 'JJ'; note that ('JJ') is a plain string, not a tuple
all_filtered_words = [x[0] for x in pos_tagged if x[1] == 'JJ' and len(x[0]) > 1]
print 'Adjectives in list of words are ',all_filtered_words

Adjectives in list of words are  [u'comic', u'ghost', u'comic', u'whole', u'new', u'little', u'graphic', u'other', u'whole', u'comic', u'allen', u'ludicrous', u'violent', u'east', u'sooty', u'little', u'nervous', u'mysterious', u'surgical', u'first', u'robbie', u'johnny', u'prophetic', u'copious', u'mary', u'isn', u'gruesome', u'other', u'unique', u'interesting', u'comic', u'vertical', u'rafael', u'good', u'funny', u'capable', u'such', u'ghastly', u'electric', u'bleak', u'tim', u'victorian', u'flashy', u'crazy', u'twin', u'black', u'white', u'comic', u'original', u'solid', u'strong', u'british', u'great', u'big', u'graham', u'first', u'irish', u'bad', u'good', u'strong', u'suspect', u'critical', u'mtv', u'high', u'reese', u'current', u'simple', u'washington', u'high', u'student', u'reese', u'high', u'megalomaniac', u'popular']
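
The Penn Treebank tag set also has JJR (comparative) and JJS (superlative) adjectives. A minimal sketch of a filter that catches all three by tag prefix follows; the tagged pairs are written by hand for illustration and are not tagger output.

```python
# Hand-written (word, tag) pairs; an assumption standing in for nltk.pos_tag output
tagged = [('a', 'DT'), ('better', 'JJR'), ('film', 'NN'), ('than', 'IN'),
          ('the', 'DT'), ('best', 'JJS'), ('recent', 'JJ'), ('comedies', 'NNS')]

# startswith('JJ') matches JJ, JJR and JJS
adjectives = [word for word, tag in tagged if tag.startswith('JJ') and len(word) > 1]
print(adjectives)

# Pitfall: ('JJ') is just the string 'JJ', so `in` does a substring test
substring_test = 'J' in ('JJ')    # True
tuple_test = 'J' in ('JJ',)       # False, one-element tuple
print(substring_test)
print(tuple_test)
```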


## Extract NGrams and multi-word phrases
When we have to predict a missing word, we have a few choices: use the previous word to predict it, which is the bigram model, or use the previous two words, which is the trigram model. In the bigram model, we look at pairs of adjacent words to capture the order in which words occur in a sentence. Let us take an example to understand this better.

In [7]:
import nltk
from nltk.util import ngrams
sample_text='This is a sample sentence'
print 'Test sentence      : ',sample_text
tokenized_text=nltk.word_tokenize(sample_text)
print 'Tokenized sentence : ',tokenized_text
print 'Bigram pairs are   : ',[elem for elem in ngrams(tokenized_text,2)]


Test sentence      :  This is a sample sentence
Tokenized sentence :  ['This', 'is', 'a', 'sample', 'sentence']
Bigram pairs are   :  [('This', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'sentence')]
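
The sliding window behind these pairs can be sketched in plain Python. `simple_ngrams` is a hypothetical helper written for this illustration, not part of NLTK.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Zip the token list with copies of itself shifted by 1 .. n-1 positions
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = 'This is a sample sentence'.split()
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence of 5 tokens gives 4 bigrams and 3 trigrams, since each window needs n tokens.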


### Most Common BiGram
We will use a short test sentence to find the most common bigram (word pair); the same code applies to the tokens of the movie reviews dataset.

In [19]:
from nltk.corpus import movie_reviews
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.probability import FreqDist
test_sentence='Humpty Dumpty Humpty Dumpty sat on a wall'
test_tokens=test_sentence.split(' ')
print '2 most common BiGrams are',FreqDist(ngrams(test_tokens,2)).items()[:2]

2 most common BiGrams are [(('Humpty', 'Dumpty'), 2), (('Dumpty', 'sat'), 1)]


### Most Common N-Grams
Instead of looking at the frequency of word pairs (bigrams), we can also compare the frequency of word phrases (word groups) of variable length. The code below collects all co-occurring word groups of size 2 to 4 (`range(2,5)` excludes 5).

In [17]:
from nltk.corpus import movie_reviews
import nltk
from nltk.util import ngrams
from nltk.probability import FreqDist
test_sentence='Humpty Dumpty Humpty Dumpty sat on a wall'
test_tokens=test_sentence.split(' ')

gramList=[' '.join(gram) for n in range(2,5) for gram in ngrams(test_tokens,n)]
print 'Common N-Grams are ',FreqDist(gramList).items()[-3:]

Common N-Grams are  [('a wall', 1), ('on a', 1), ('Humpty Dumpty', 2)]
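
To make the sizes and the ranking explicit, here is a plain-Python sketch of the same counting using `collections.Counter`. The `simple_ngrams` helper is hypothetical and only mimics the sliding window.

```python
from collections import Counter

tokens = 'Humpty Dumpty Humpty Dumpty sat on a wall'.split(' ')

def simple_ngrams(tokens, n):
    """Hypothetical helper: all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sizes = list(range(2, 5))   # n-gram sizes 2, 3 and 4
gram_counts = Counter(' '.join(g) for n in sizes for g in simple_ngrams(tokens, n))

# Rank by count; only one phrase occurs more than once in this sentence
top = gram_counts.most_common(1)
print(top)
```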


### Overall Example
This example skips n-grams that start or end with a stopword or with a token that does not begin with a letter. It also ignores n-grams that span a punctuation boundary.

In [30]:
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.probability import FreqDist

test_sentence='Humpty Dumpty Dumpty Humpty Dumpty Dumpty Dumpty sat on a wall'
test_tokens=test_sentence.split(' ')

# Stopword list and punctuation tokens that an n-gram must not span
stop = stopwords.words('english')
boundaries = ['(', ')', '.', ',', ';', ':']


def acceptable(word):
    if word.lower() in stop:
        return False
    elif not word[0].isalpha():
        return False
    return True

def has_no_boundaries(my_gram):
    for my_word in my_gram:
        if my_word in boundaries:
            return False
    return True

gramList=[' '.join(gram) for n in range(2,5) for gram in ngrams(test_tokens,n) if acceptable(gram[-1]) and acceptable(gram[0]) and has_no_boundaries(gram) ]

print 'List of grams matching the proposed criteria are : ', gramList

print '\n Common N-Grams are : ',FreqDist(gramList).items()


List of grams matching the proposed criteria are :  ['Humpty Dumpty', 'Dumpty Dumpty', 'Dumpty Humpty', 'Humpty Dumpty', 'Dumpty Dumpty', 'Dumpty Dumpty', 'Dumpty sat', 'Humpty Dumpty Dumpty', 'Dumpty Dumpty Humpty', 'Dumpty Humpty Dumpty', 'Humpty Dumpty Dumpty', 'Dumpty Dumpty Dumpty', 'Dumpty Dumpty sat', 'Humpty Dumpty Dumpty Humpty', 'Dumpty Dumpty Humpty Dumpty', 'Dumpty Humpty Dumpty Dumpty', 'Humpty Dumpty Dumpty Dumpty', 'Dumpty Dumpty Dumpty sat', 'sat on a wall']

 Common N-Grams are :  [('Dumpty Dumpty sat', 1), ('sat on a wall', 1), ('Dumpty sat', 1), ('Humpty Dumpty Dumpty Humpty', 1), ('Dumpty Humpty', 1), ('Dumpty Dumpty Humpty', 1), ('Dumpty Humpty Dumpty Dumpty', 1), ('Dumpty Dumpty Humpty Dumpty', 1), ('Dumpty Humpty Dumpty', 1), ('Humpty Dumpty Dumpty Dumpty', 1), ('Humpty Dumpty Dumpty', 2), ('Dumpty Dumpty Dumpty sat', 1), ('Dumpty Dumpty', 3), ('Dumpty Dumpty Dumpty', 1), ('Humpty Dumpty', 2)]
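
The test sentence above contains no punctuation, so the boundary rule never fires. The sketch below applies the same three checks to a hypothetical token list that does contain a comma. The mini stopword list and the `ngrams_of` helper are assumptions made for this illustration.

```python
stop = set(['on', 'a', 'the', 'of'])                 # hypothetical mini stopword list
boundaries = set(['(', ')', '.', ',', ';', ':'])
tokens = ['great', 'cast', ',', 'weak', 'plot', 'of', 'the', 'film']

def ngrams_of(tokens, n):
    """Hypothetical helper: all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def acceptable(word):
    # Reject stopwords and tokens that do not start with a letter
    return word.lower() not in stop and word[0].isalpha()

def has_no_boundaries(gram):
    # Reject n-grams containing a punctuation token anywhere
    return not any(word in boundaries for word in gram)

kept = [' '.join(g) for n in (2, 3) for g in ngrams_of(tokens, n)
        if acceptable(g[0]) and acceptable(g[-1]) and has_no_boundaries(g)]
print(kept)
```

The trigram `cast , weak` starts and ends with acceptable words, so it is only the boundary check that removes it.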


## TF-IDF Scoring
Tf–idf (term frequency–inverse document frequency) is a measure of how important a word is to a document within a corpus.  
Mathematically, it is written as

$tfidf(w,d) = tf(w,d)\times idf(w)$

In other words, the score is the product of term frequency and inverse document frequency. Note that 'term' and 'word' mean the same thing in the current context.

### Term Frequency

$tf(w,d)=0.5+0.5\times\cfrac{f(w,d)}{max(f(w,d_i))}\hspace{10mm}\forall d_i\in D$

$f(w,d)$  = frequency of a word in document  
$max(f(w,d_i))$ = maximum frequency of word  amongst all documents  

Term frequency is a floating-point value between 0.5 and 1. The higher the value, the more frequent the word is in the document compared with the other documents, and the higher its importance. For the document with the highest frequency of the word it is 1, and for a document that does not contain the word it is 0.5.
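
A small worked example of the term frequency formula as stated in this section. The three documents are made up, and returning 0.5 for a word that appears in no document is an assumption added to avoid dividing by zero.

```python
# Hypothetical toy corpus: document id -> list of words
docs = {
    'd1': 'humpty dumpty humpty'.split(),
    'd2': 'humpty sat'.split(),
    'd3': 'a wall'.split(),
}

def tf(word, doc_id):
    f = docs[doc_id].count(word)                            # f(w, d)
    max_f = max(doc.count(word) for doc in docs.values())   # max over all documents
    if max_f == 0:
        return 0.5   # assumption: word absent from the whole corpus
    return 0.5 + 0.5 * f / max_f

tf_d1 = tf('humpty', 'd1')   # highest count of 'humpty' in the corpus
tf_d2 = tf('humpty', 'd2')   # half of that count
tf_d3 = tf('humpty', 'd3')   # does not contain the word
print(tf_d1)
print(tf_d2)
print(tf_d3)
```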

### Inverse Document Frequency

$idf(w)=log(\cfrac{N}{\sum(I(w,d_i))}) \hspace{10mm}\forall d_i\in D$

$N$ = total number of documents  
$I(w,d_i)$ = indicator function stating if the word is contained in document $d_i$  
$\sum(I(w,d_i))$ = total number of documents containing the word  

The inverse document frequency states the importance of a word based on how frequently it occurs across documents. If the word occurs in many documents, it has lower importance; if it is rare, the denominator decreases, which increases the inverse document frequency.
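
A small worked example of the inverse document frequency formula. The four documents are made up, and the natural logarithm is an assumption, since the formula above does not fix the base.

```python
import math

# Hypothetical toy corpus of four documents
docs = [
    ['the', 'humpty', 'dumpty'],
    ['the', 'humpty', 'sat'],
    ['the', 'wall'],
    ['the', 'king'],
]

def idf(word):
    # Number of documents that contain the word (sum of the indicator function)
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(float(len(docs)) / n_containing)

idf_the = idf('the')         # in every document
idf_humpty = idf('humpty')   # in half of the documents
idf_wall = idf('wall')       # in a single document
print(idf_the)
print(idf_humpty)
print(idf_wall)
```

A word that appears in every document gets an idf of 0, so its tf-idf score is 0 regardless of its term frequency.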

### Gensim Package
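Gensim is a very popular Python package for topic modelling. We will use it to find TF-IDF values. Note that Gensim's `TfidfModel` applies its own default weighting (raw term counts, a base-2 logarithm for idf, and normalization of each document vector to unit length), so its scores will not match the formulas above exactly.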

### A Few Important Terms

**Dictionary** : A unique mapping of a word to an integer value.  
**BOW Document Format** : The Bag of Words document format stores a document as a list of word ids with their associated counts.  
So for *'Humpty Dumpty Humpty'*, it will be [(word2id('Humpty'),2),(word2id('Dumpty'),1)],
where word2id returns the dictionary id of that word.  
In the above example, if the dictionary is :  
0 => Humpty  
1 => Dumpty  
Then the BOW document format will return :  
[(0,2),(1,1)]  
which follows the general pattern [(word1Id,word1Count),(word2Id,word2Count) . . .]


**Corpus** : Collection of Documents in BOW Format.
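
The three terms can be sketched in plain Python. `build_dictionary` and `to_bow` are hypothetical helpers written for this illustration; they only mimic the idea, and the ids Gensim assigns to words may differ.

```python
def build_dictionary(texts):
    """Map each distinct word to an integer id, in order of first appearance."""
    word2id = {}
    for text in texts:
        for word in text:
            if word not in word2id:
                word2id[word] = len(word2id)
    return word2id

def to_bow(text, word2id):
    """Convert a list of words to a sorted list of (word id, count) pairs."""
    counts = {}
    for word in text:
        word_id = word2id[word]
        counts[word_id] = counts.get(word_id, 0) + 1
    return sorted(counts.items())

texts = [['Humpty', 'Dumpty', 'Humpty'], ['Dumpty', 'sat']]
word2id = build_dictionary(texts)
corpus = [to_bow(text, word2id) for text in texts]
print(word2id)
print(corpus)
```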

In [38]:
from gensim import corpora, models
from nltk.corpus import movie_reviews

texts=[]
#List of word lists. 
#[[doc1word1,doc1word2,doc1word3],[doc2],[doc3]]
for fileid in movie_reviews.fileids():
    words = movie_reviews.words(fileid)
    texts.append(words)

# Create a dictionary from list of documents
dictionary = corpora.Dictionary(texts)

# Create a Corpus based on BOW Format.
corpus = [dictionary.doc2bow(text) for text in texts]
print '\nSubset of First document in corpus according to BOW : ',corpus[0][:5]

#Create a TFIDF Model for the corpus
tfidf = models.TfidfModel(corpus)

print '\nSubset of TFIDF value for words in first document : ',tfidf[corpus[0]][:5]
print '\n Above result with words instead of ids : ',[(dictionary.get(item[0]),item[1]) for item in tfidf[corpus[0]][:5]]



 
Subset of First document in corpus according to BOW :  [(0, 6), (1, 1), (2, 1), (3, 2), (4, 1)]

Subset of TFIDF value for words in first document :  [(0, 0.019263321498932638), (1, 0.044059324185709306), (2, 0.05605413673895676), (3, 0.02842430174654846), (4, 0.03518230106766029)]

 Above result with words instead of ids :  [(u'all', 0.019263321498932638), (u'concept', 0.044059324185709306), (u'skip', 0.05605413673895676), (u'go', 0.02842430174654846), (u'seemed', 0.03518230106766029)]
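
The scores above do not follow the 0.5-to-1 term frequency scaling described earlier. The sketch below assumes Gensim's default `TfidfModel` settings (raw count times a base-2 log of N over document frequency, then scaling each document vector to unit length) and reproduces that scheme in plain Python on a made-up BOW corpus. It is an approximation for illustration, not Gensim's actual code.

```python
import math

# Hypothetical BOW corpus: each document is a list of (word id, count)
bow_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(2, 3)]]
n_docs = len(bow_corpus)

# Document frequency: number of documents containing each word id
doc_freq = {}
for doc in bow_corpus:
    for word_id, _ in doc:
        doc_freq[word_id] = doc_freq.get(word_id, 0) + 1

def tfidf_vector(doc):
    # Raw count times log2(N / df) for each word in the document
    weights = [(word_id, count * math.log(float(n_docs) / doc_freq[word_id], 2))
               for word_id, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []   # every word occurs in all documents
    # Scale the vector to unit length
    return [(word_id, w / norm) for word_id, w in weights]

vec = tfidf_vector(bow_corpus[0])
print(vec)
```

Because of the final scaling step, the squared scores within one document sum to 1, which is why the values printed for the first review are all small.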
