# NLTK Basics

NLTK is a widely used toolkit for preprocessing raw text. This tutorial covers some of the basics of how NLTK can be useful for NLP tasks. For this tutorial, I have drawn on https://github.com/zelandiya/KiwiPyCon-NLP-tutorial

## Dependencies
* Python 2 (the code cells below use `print` statements)
* NLTK
* movie_reviews corpus
* punkt tokenizer model

To install the Punkt model, run *nltk.download()*, which opens the GUI shown below. Go to Models, select punkt, and click Download. Calling *nltk.download('punkt')* fetches the same model without the GUI.
![punkt_installer](punkt_install.jpg "Punkt Installer")


## Dataset 1: Movie Reviews

We will use the movie reviews dataset as a running example.

### Downloading a corpus

In [20]:
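import nltk
# Fetch the movie_reviews corpus (a no-op if it is already up to date)
nltk.download('movie_reviews')
# With no argument, download() opens the interactive downloader
nltk.download()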

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jaley/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Paths where NLTK searches for data

In [18]:
print nltk.data.path

['/home/jaley/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


### Getting the details of a corpus

In [25]:
from nltk.corpus import movie_reviews
print 'No of documents in corpus  : ',len(movie_reviews.fileids())
print 'Categories of movie Review : ',movie_reviews.categories()
print '\nExample of filenames(pos)  : ',movie_reviews.fileids('pos')[:2]
print 'Example of filenames(neg)  : ',movie_reviews.fileids('neg')[:2]

#Words from sample files(pos)
print '\n{} Words from filenames(pos)  : '.format(len(movie_reviews.words('pos/cv000_29590.txt'))),
print movie_reviews.words('pos/cv000_29590.txt')

#Words from sample files(neg)
print '\n{} Words from filenames(neg)  : '.format(len(movie_reviews.words('neg/cv000_29416.txt'))),
print movie_reviews.words('neg/cv000_29416.txt')
print '\nExample of raw Text is : ',movie_reviews.raw('pos/cv000_29590.txt').split('.')[0]
print '\nMovie Review sentences : ',movie_reviews.sents('pos/cv000_29590.txt')



No of documents in corpus  :  2000
Categories of movie Review :  [u'neg', u'pos']

Example of filenames(pos)  :  [u'pos/cv000_29590.txt', u'pos/cv001_18431.txt']
Example of filenames(neg)  :  [u'neg/cv000_29416.txt', u'neg/cv001_19502.txt']

862 Words from filenames(pos)  :  [u'films', u'adapted', u'from', u'comic', u'books', ...]

879 Words from filenames(neg)  :  [u'plot', u':', u'two', u'teen', u'couples', u'go', ...]

Example of raw Text is :  films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before 

Movie Review sentences :  [[u'films', u'adapted', u'from', u'comic', u'books', u'have', u'had', u'plenty', u'of', u'success', u',', u'whether', u'they', u"'", u're', u'about', u'superheroes', u'(', u'batman', u',', u'superman', u',', u'spawn', u')', u',', u'or', u'geared', u'toward

### Most Frequent Words in Review

In [27]:
from nltk.probability import FreqDist
words = movie_reviews.words('pos/cv000_29590.txt')
words_by_frequency = FreqDist(words)
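print 'Most frequent words in review'
# NOTE: items() is not sorted by count, so this slice is 20 arbitrary entries;
# words_by_frequency.most_common(20) returns the 20 most frequent ones.
print words_by_frequency.items()[:20]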


Most frequent words in review
[(u'all', 3), (u'childs', 1), (u'steve', 1), (u'surgical', 1), (u'comments', 1), (u'go', 1), (u'certainly', 1), (u'to', 15), (u'watchmen', 1), (u'song', 1), (u'very', 1), (u'simpsons', 1), (u'novel', 1), (u'jack', 2), (u'surgeon', 1), (u'level', 1), (u'did', 1), (u'turns', 2), (u'michael', 1), (u'flashy', 1)]
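
Note that the list above is not ordered by count: in NLTK 3, `FreqDist` subclasses `collections.Counter`, and `items()` does not sort by frequency. A minimal sketch of retrieving the genuinely most frequent tokens with `most_common` follows. It uses the standard-library `Counter` as a stand-in so that it runs without the corpus, and the token list is made up for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for movie_reviews.words(...)
tokens = ("the film is good , the cast is good , "
          "and the plot is the weak part").split()

counts = Counter(tokens)

# most_common(n) returns the n (token, count) pairs with the highest counts,
# ordered from most to least frequent
top_three = counts.most_common(3)
print(top_three)
```

Because `FreqDist` inherits this method, the same `most_common(20)` call should work on `words_by_frequency` directly.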


To compare word frequencies across the *pos* and *neg* categories, run the code below.

In [28]:
# Compare word frequencies in both categories
print ''
for category in movie_reviews.categories():
    print 'Category', category
    all_words = movie_reviews.words(categories=category)
    all_words_by_frequency = FreqDist(all_words)
    # items() is not sorted by count; use most_common(20) for the top 20.
    print all_words_by_frequency.items()[:20]


Category neg
[(u'sonja', 1), (u'askew', 4), (u'woods', 54), (u'spiders', 1), (u'bazooms', 1), (u'hanging', 37), (u'francesca', 3), (u'comically', 5), (u'disobeying', 1), (u'hennings', 2), (u'canet', 1), (u'originality', 34), (u'caned', 1), (u'rickman', 4), (u'stipulate', 1), (u'rawhide', 1), (u'bringing', 25), (u'unsworth', 1), (u'liaisons', 8), (u'wooden', 27)]
Category pos
[(u'woods', 36), (u'spiders', 3), (u'hanging', 22), (u'woody', 100), (u'comically', 7), (u'localized', 1), (u'scold', 2), (u'originality', 24), (u'mutinies', 1), (u'rickman', 11), (u'slothful', 1), (u'wracked', 1), (u'capoeira', 1), (u'rawhide', 1), (u'bringing', 56), (u'liaisons', 1), (u'grueling', 1), (u'sommerset', 4), (u'wooden', 21), (u'wednesday', 5)]


### Removing Stopwords
Stopwords are very common words, such as *a*, *the*, and *from*, that occur many times in a text but carry little meaning on their own.
They include articles, auxiliary (helping) verbs, and similar function words.

In [34]:
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from string import punctuation
nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set gives fast membership tests

# Strip stopwords from text
words = movie_reviews.words('pos/cv000_29590.txt')
print 'Words with stopwords    : ',words[:5]
no_stopwords = [word for word in words if word not in stop]
print 'Words without stopwords : ',no_stopwords[:5]

[nltk_data] Downloading package stopwords to /home/jaley/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Words with stopwords    :  [u'films', u'adapted', u'from', u'comic', u'books']
Words without stopwords :  [u'films', u'adapted', u'comic', u'books', u'plenty']
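
The cell above imports `punctuation` but never uses it, and its stopword test is case-sensitive. A minimal sketch of filtering out both stopwords and punctuation tokens might look like the following. The small `stop` set and the `words` list are hypothetical stand-ins for `stopwords.words('english')` and `movie_reviews.words(...)`, so the snippet runs without any downloads.

```python
from string import punctuation

# Hypothetical stand-ins; in the tutorial these would come from
# stopwords.words('english') and movie_reviews.words(...)
stop = {"a", "the", "from", "have", "had", "of"}
words = ["Films", "adapted", "from", "comic", "books", "have", "had",
         "plenty", "of", "success", ","]

PUNCT = set(punctuation)

def is_content_word(token):
    """True if token is not a stopword and not made up solely of punctuation."""
    only_punct = all(ch in PUNCT for ch in token)
    return token.lower() not in stop and not only_punct

filtered = [w for w in words if is_content_word(w)]
print(filtered)
```

Lowercasing before the lookup matters for text that has not already been lowercased; the movie reviews corpus is lowercase, so there it makes no difference.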


## Filtering by parts of speech
In NLP, parts of speech are tags assigned to each word in a sentence based on the grammatical category it belongs to, such as noun, pronoun, or preposition. The structure is hierarchical rather than linear: a sentence can be split into a noun phrase, a verb phrase, and so on, and these are in turn split into finer categories such as proper noun. See the Penn Treebank tag set and the explanation of POS tags in the YouTube video linked below.
https://www.youtube.com/watch?v=LivXkL2DO_w

The example below uses NLTK's averaged perceptron tagger. It is a statistical model, so its tags can be wrong, particularly because many words take different parts of speech depending on how they are used in a sentence.

In [38]:
nltk.download('averaged_perceptron_tagger')
print nltk.pos_tag('This is a sample text'.split(' '))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jaley/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('text', 'NN')]


Here DT is a determiner, VBZ is a third-person singular present-tense verb, JJ is an adjective, and NN is a singular noun (NNS is a plural noun).  
If we are interested only in adjectives, we can filter on the POS tag as below.

In [40]:
all_words = movie_reviews.words(categories='pos')[:1000]
pos_tagged = nltk.pos_tag(all_words)
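# Keep words tagged JJ (adjective) that are longer than one character.
# ('JJ') is just the string 'JJ', not a tuple, so compare with == instead of in.
all_filtered_words = [word for word, tag in pos_tagged if tag == 'JJ' and len(word) > 1]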
print 'Adjectives in list of words are ',all_filtered_words

Adjectives in list of words are  [u'comic', u'ghost', u'comic', u'whole', u'new', u'little', u'graphic', u'other', u'whole', u'comic', u'allen', u'ludicrous', u'violent', u'east', u'sooty', u'little', u'nervous', u'mysterious', u'surgical', u'first', u'robbie', u'johnny', u'prophetic', u'copious', u'mary', u'isn', u'gruesome', u'other', u'unique', u'interesting', u'comic', u'vertical', u'rafael', u'good', u'funny', u'capable', u'such', u'ghastly', u'electric', u'bleak', u'tim', u'victorian', u'flashy', u'crazy', u'twin', u'black', u'white', u'comic', u'original', u'solid', u'strong', u'british', u'great', u'big', u'graham', u'first', u'irish', u'bad', u'good', u'strong', u'suspect', u'critical', u'mtv', u'high', u'reese', u'current', u'simple', u'washington', u'high', u'student', u'reese', u'high', u'megalomaniac', u'popular']
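
One Python detail is worth spelling out for the filter above: `('JJ')` is a parenthesised string rather than a one-element tuple, so `in` performs a substring test on it. The sketch below shows the difference, along with a filter that also accepts the comparative (JJR) and superlative (JJS) adjective tags from the Penn Treebank set. The tagged pairs are hand-written for illustration, not real tagger output, so the snippet runs without NLTK.

```python
# Hand-written (word, tag) pairs for illustration; not real tagger output
pos_tagged = [("a", "DT"), ("better", "JJR"), ("film", "NN"),
              ("with", "IN"), ("the", "DT"), ("best", "JJS"),
              ("and", "CC"), ("funny", "JJ"), ("scenes", "NNS")]

# ('JJ') is the string 'JJ', so `in` is a substring test;
# ('JJ',) is a real tuple, so `in` compares whole tags
substring_match = "J" in ("JJ")
tuple_match = "J" in ("JJ",)
print(substring_match)
print(tuple_match)

# Accept every adjective tag, and skip one-character tokens
ADJECTIVE_TAGS = ("JJ", "JJR", "JJS")
adjectives = [word for word, tag in pos_tagged
              if tag in ADJECTIVE_TAGS and len(word) > 1]
print(adjectives)
```

The output above also contains names such as *allen* and *robbie*; since the corpus is lowercase, the tagger cannot use capitalisation as a cue, so some mis-tags of this kind are to be expected.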


## Extract NGrams and multi-word phrases