# NLP Basics

In [53]:
import nltk



Goal of this section is to get started with textual data. We will start preprocessing data and make use of the **Natural Language Toolkit (NLTK)**

At the end of this session you will be able to work with textual data and you will know some of the parts related to working with textual data and pre-processing it like:

- Cleaning
- Tokenizing
- Segementation
- Normalizing
- Stemming


For this part of of the Session we will make use of NKTK and use the data provided by NLTK. To get the relevant data use the download function from NLTK. And download the parts realted to the NLTK Book:
1. d (Download function) 
2. book (Download Content related to the NLTK book)
3. q ( Quit)

This will download all the data wen need for this section

In [39]:
nltk.download_shell()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> book
    Downloading collection 'book'
       | 
       | Downloading package abc to
       |     C:\Users\Patrick\AppData\Roaming\nltk_data...
       |   Package abc is already up-to-date!
       | Downloading package brown to
       |     C:\Users\Patrick\AppData\Roaming\nltk_data...
       |   Package brown is already up-to-date!
       | Downloading package chat80 to
       |     C:\Users\Patrick\AppData\Roaming\nltk_data...
       |   Package chat80 is already up-to-date!
       | Downloading package cmudict to
       |     C:\Users\Patrick\AppData\Roaming\nltk_data...
       |   Package cmudict is already up-to-date!
       | Downloading package conll2000 to
       |     C:\

# Cleaning Data 

Cleaning the data and preparing it for the next steps is our first task in many applications. We will perform the following steps:
1. Load the raw text.
2. Split into tokens.
3. Convert to lowercase.
4. Remove punctuation from each token.
5. Filter out remaining tokens that are not alphabetic.
6. Filter out tokens that are stop words.


In [32]:
#get some HTML data
url = "http://www.madrid-guide-spain.com/classic-madrid.html"

In [33]:
from urllib import request
html = request.urlopen(url).read()
html[:60]

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//E'

For cleaning of the HTML we will use a Python Library called Beatiful Soup. This is in general a very handy Library when working with HTML. More infromation is available at:
http://www.crummy.com/software/BeautifulSoup


In [35]:
#!conda install beautifulsoup4 -y

In [36]:
from bs4 import BeautifulSoup

In [37]:
raw = BeautifulSoup(html,"lxml").get_text()

In [38]:
raw

'\nClassic Madrid\n\n\n\n\n\n\n\n\n\n\n\nFIX.designMode="legacy";FIX.doEndOfHead();\n\n\n\n\n\n\n\n\n\n\n\n\n\nHomeGetting HereGetting AroundOnward TravelHotelsHostelsActivitiesWhat AttractionsExcursionsFor Kids!CultureSportRomanticShop ´til you Drop!Where to EatEssential InfoAtoZWork Here!BlogSitemapContact usOther Guides\n\n\n\n\n\n\n\n\n\nClassic Madrid  The Top 10 spots to experience Classic Madrid as per Madrid Guide Spain. Looking for some tips to make your visit special?Not sure exactly what to do now you are here?Try the following:  Top Ten: Classic Madrid   Museo de la Ciudad\rStudy the citys story from its inception through to the modern day. Who did what? Why? How did the civil war change the face of classic Madrid to the modern metroplois it is now? Discover the origins of the Madrileño!\nMuseo del Prado\nTHE gallery to see Picassos works. The largest collection of his work in one place however the gallery explores the works and influences of many other Spanish artists at t

## Tokenizing

Tokenizing will split our text first into sentences and then into word tokens

In [100]:
from nltk.tokenize import word_tokenize
from nltk import sent_tokenize
sentences = sent_tokenize(raw)
print(len(sentences))
print(sentences[5])


29
Discover the origins of the Madrileño!


In [134]:
tokenized_docs = [word_tokenize(sentence.lower()) for sentence in sentences] 
print(len(tokenized_docs))
print(tokenized_docs[27])

29
['...', 'or', 'check', 'out', 'some', 'of', 'madrids', 'other', 'attractions', '!']


## Stop Word removal

Stop words are highly frequent words like "the" or "to". Stopwords usually have little lexical content and in many cases we want to remove stop words from our text. NLTK comes with a stop words corpus conatining commonly agreed upon stop words for a variety of languages.

In [102]:
stopwords = nltk.corpus.stopwords.words('english')

In [141]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
stop_words=stopwords.words('english')
# getting rid of all the symbols
# stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']) 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [126]:
def removeStopWords(words):
    words = [w for w in words if not w in stop_words]
    return words

In [129]:
clean_text = [removeStopWords(sentence) for sentence in tokenized_docs]

In [135]:
print(clean_text[1])

['looking', 'tips', 'make', 'visit', 'special', '?', 'sure', 'exactly', '?', 'try', 'following', ':', 'top', 'ten', ':', 'classic', 'madrid', 'museo', 'de', 'la', 'ciudad', 'study', 'citys', 'story', 'inception', 'modern', 'day', '.']


An alternative to remove stopwords is to remove stopwords based on a POS Tagger, we will see this later

Another normalization task involves identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.

In [57]:
# removal of stopwords  alternative check POS TAGSET
#https://stackoverflow.com/questions/19130512/stopword-removal-with-nltk

## Stemming
NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer does not.

In [136]:
import nltk
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

In [137]:
print(raw)

DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony.


In [138]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
print([porter.stem(t) for t in tokens])

['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']


In [139]:
print([lancaster.stem(t) for t in tokens])

['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']


In [140]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t, pos="v") for t in tokens])

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'lie', 'in', 'ponds', 'distribute', 'swords', 'be', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derive', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


## Tagging

Part-of-speech tagging is one of the most important text analysis tasks used to classify words into their part-of-speech and label them accordingly.

In [147]:
review = """WOW! The best word that describes this movie is "wow"! 
Not only to say that this is the best Action movie of all time, 
this is probably one of the greatest movies ever made. 
The people in my country watched this film when there where limited VHS cassettes at all. 
And again, my favorite Director did an timeless epic-masterpiece. 
Yes, an epic. Every scene in this movie is beyond the perfection. The timeless plot. 
Groundbreaking effects. Unforgettable "Hasta la vista, baby." .

Perfect direction for a sci-fix action film.
When the action starts, you're in for the ride of your life. 
There never be the same movie like T2. 
What else I can say about this film? A Must see for everyone."""

In [148]:
text = word_tokenize(sent_tokenize(review)[1])
nltk.pos_tag(text)

[('The', 'DT'),
 ('best', 'JJS'),
 ('word', 'NN'),
 ('that', 'IN'),
 ('describes', 'VBZ'),
 ('this', 'DT'),
 ('movie', 'NN'),
 ('is', 'VBZ'),
 ('``', '``'),
 ('wow', 'JJ'),
 ("''", "''"),
 ('!', '.')]

All the labels are described in the help of NLTK, to get the description for all POS classes just call:

In [149]:
nltk.help.upenn_tagset('.*')

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

## NER - Named Entity Recognition

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities.

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

In [177]:
#http://www.bbc.com/future/story/20180126-meet-the-motorbike-racing-robot
sent="""Yamaha’s initial concept was a “humanoid robot that can ride a motorcycle autonomously” and the company teamed up with SRI International to achieve its vision. SRI, the Stanford Research Institute, as it was originally known, was founded in 1946 to be the cutting edge of innovation in Silicon Valley"""

In [178]:
sent=word_tokenize(sent)
print(sent_tokenized)
sent=nltk.pos_tag(sent)
print(sent)

['It', 'is', 'home', 'to', 'the', 'longest', 'automobile', 'race', 'in', 'the', 'United', 'States', ',', 'known', 'as', 'the', '25', 'Hours', 'of', 'Thunderhill', ',', 'and', 'last', 'September', ',', 'it', 'was', 'host', 'to', 'a', 'very', 'different', 'kind', 'of', 'race', ':', 'a', 'race', 'between', 'man', 'and', 'machine', '.']
[('Yamaha’s', 'NNP'), ('initial', 'JJ'), ('concept', 'NN'), ('was', 'VBD'), ('a', 'DT'), ('“humanoid', 'JJ'), ('robot', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('ride', 'VB'), ('a', 'DT'), ('motorcycle', 'NN'), ('autonomously”', 'NN'), ('and', 'CC'), ('the', 'DT'), ('company', 'NN'), ('teamed', 'VBD'), ('up', 'RP'), ('with', 'IN'), ('SRI', 'NNP'), ('International', 'NNP'), ('to', 'TO'), ('achieve', 'VB'), ('its', 'PRP$'), ('vision', 'NN'), ('.', '.'), ('SRI', 'NNP'), (',', ','), ('the', 'DT'), ('Stanford', 'NNP'), ('Research', 'NNP'), ('Institute', 'NNP'), (',', ','), ('as', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('originally', 'RB'), ('known', 'VBN'), (',', 

In [179]:
print(nltk.ne_chunk(sent, binary=False))

(S
  Yamaha’s/NNP
  initial/JJ
  concept/NN
  was/VBD
  a/DT
  “humanoid/JJ
  robot/NN
  that/WDT
  can/MD
  ride/VB
  a/DT
  motorcycle/NN
  autonomously”/NN
  and/CC
  the/DT
  company/NN
  teamed/VBD
  up/RP
  with/IN
  (ORGANIZATION SRI/NNP International/NNP)
  to/TO
  achieve/VB
  its/PRP$
  vision/NN
  ./.
  (ORGANIZATION SRI/NNP)
  ,/,
  the/DT
  (ORGANIZATION Stanford/NNP Research/NNP Institute/NNP)
  ,/,
  as/IN
  it/PRP
  was/VBD
  originally/RB
  known/VBN
  ,/,
  was/VBD
  founded/VBN
  in/IN
  1946/CD
  to/TO
  be/VB
  the/DT
  cutting/VBG
  edge/NN
  of/IN
  innovation/NN
  in/IN
  (GPE Silicon/NNP)
  Valley/NNP)


# PART II

We will start with some easy example for Supervised learning. We will build a classifier that tells us the gender of a name.

Lets start by loading the Names corpus and having a look at some attributes

In [1]:
import nltk

In [2]:
names = nltk.corpus.names

In [3]:
names.fileids()

['female.txt', 'male.txt']

In [4]:
male_names = names.words('male.txt')
female_names = names.words('female.txt')

We can also have a look at names that are ambiguous for gender:


In [5]:
print([w for w in male_names if w in female_names])

['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel', 'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', 'Barrie', 'Barry', 'Beau', 'Bennie', 'Benny', 'Bernie', 'Bert', 'Bertie', 'Bill', 'Billie', 'Billy', 'Blair', 'Blake', 'Bo', 'Bobbie', 'Bobby', 'Brandy', 'Brett', 'Britt', 'Brook', 'Brooke', 'Brooks', 'Bryn', 'Cal', 'Cam', 'Cammy', 'Carey', 'Carlie', 'Carlin', 'Carmine', 'Carroll', 'Cary', 'Caryl', 'Casey', 'Cass', 'Cat', 'Cecil', 'Chad', 'Chris', 'Chrissy', 'Christian', 'Christie', 'Christy', 'Clair', 'Claire', 'Clare', 'Claude', 'Clem', 'Clemmie', 'Cody', 'Connie', 'Constantine', 'Corey', 'Corrie', 'Cory', 'Courtney', 'Cris', 'Daffy', 'Dale', 'Dallas', 'Dana', 'Dani', 'Daniel', 'Dannie', 'Danny', 'Darby', 'Darcy', 'Darryl', 'Daryl', 'Deane', 'Del', 'Dell', 'Demetris', 'Dennie', 'Denny', 'Devin', 'Devon', 'Dion', 'Dionis', 'Dominique', 'Donnie', 'Donny', 'Do

It is well known that names ending in the letter a are almost always female. We can see this and some other patterns in a graph produced by the following code.

In [20]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
        for name in names.words(fileid))


In [42]:
cfd.plot()

The first step in building our classifier is to decide for the features and how to encode them. We will start by looking at the final letter.

In [61]:
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Patrick')

{'last_letter': 'k'}

In [62]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [63]:
import random
random.shuffle(labeled_names)

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier.

In [97]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

In [98]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [99]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [100]:
classifier.classify(gender_features('Patrick'))

'male'

In [101]:
classifier.classify(gender_features('Sohpie'))

'female'

We can systematically evaluate the classifier on a much larger quantity of unseen data:

In [102]:
 print(nltk.classify.accuracy(classifier, test_set))

0.808


We can also have a look at the best performing features

In [103]:
classifier.show_most_informative_features(20)

Most Informative Features
                last_two = 'na'           female : male   =     98.5 : 1.0
                last_two = 'la'           female : male   =     73.2 : 1.0
                last_two = 'ia'           female : male   =     40.9 : 1.0
             last_letter = 'a'            female : male   =     35.6 : 1.0
                last_two = 'sa'           female : male   =     35.1 : 1.0
                last_two = 'ta'           female : male   =     32.9 : 1.0
             last_letter = 'k'              male : female =     31.1 : 1.0
                last_two = 'us'             male : female =     29.5 : 1.0
                last_two = 'do'             male : female =     27.4 : 1.0
                last_two = 'ra'           female : male   =     27.0 : 1.0
                last_two = 'rd'             male : female =     24.8 : 1.0
              last_three = 'ana'          female : male   =     24.5 : 1.0
                last_two = 'rt'             male : female =     23.8 : 1.0

Your Turn: Modify the gender_features() function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.


In [96]:
def gender_features(word):
    return {'last_letter': word[-1],
            'first_two': word[:-2],
            'last_two': word[-2:],
            'last_three': word[-3:],
            'lenght': len(word),
            'first_letter': word[1],
           }

In [130]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.808


we can also try another classifier, for instance the SVM from sklearn

In [137]:
import nltk.classify
from sklearn.svm import LinearSVC

In [139]:
classifier = nltk.classify.SklearnClassifier(LinearSVC(C=0.1,class_weight="balanced"))
classifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.814


# Movie Review Data

lets have a look at more complex problem: Sentiment Analysis

Sentiment analysis can be seen as a part of Text classification. Goal is to detect if a text is positive or negative (or neutral)
NLTK comes already with a large set of textual documents and also has some data for Sentiment Analysis which we will use in the remainder of this session.

In [140]:
from nltk.corpus import movie_reviews
from collections import defaultdict

In [141]:
#create a list with movie reviews from the NLTK corpus
# every element in the list is a list with the words of the review and the sentiment (pos/neg)
documents = [(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)]
len(documents)

2000

overall we have a set of 2000 movie reviews

In [148]:
text = movie_reviews.raw()

In [149]:
text[0:1000]

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [151]:
documents = defaultdict(list)
for i in movie_reviews.fileids():
    documents[i.split('/')[0]].append(i)
print(documents['pos'][:10])# first ten pos reviws.)

print(documents['neg'][:10])# first ten neg reviews.)

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']


In [155]:
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')

documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]

In [147]:
#from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [273]:
from nltk.probability import FreqDist

all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(word for (word,count) in all_words.most_common(2000))

In [274]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [275]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt'))) 

{'contains(should)': False, 'contains(intelligent)': False, 'contains(hilarious)': True, 'contains(grant)': False, 'contains(russell)': False, 'contains(turns)': False, 'contains(hour)': False, 'contains(camera)': False, 'contains(possible)': False, 'contains(anne)': False, 'contains(he)': True, 'contains(tarzan)': False, 'contains(looking)': False, 'contains(audience)': False, 'contains(leading)': False, 'contains(country)': False, 'contains(max)': False, 'contains(cinematography)': False, 'contains(social)': False, 'contains(sounds)': False, 'contains(90)': False, 'contains(robert)': False, 'contains(third)': True, 'contains(once)': True, 'contains(says)': False, 'contains(a)': True, 'contains(had)': False, 'contains(look)': True, 'contains(low)': False, 'contains(mars)': False, 'contains(front)': False, 'contains(credit)': False, 'contains(simply)': False, 'contains(teenagers)': False, 'contains(earlier)': False, 'contains(horror)': False, 'contains(choice)': False, 'contains(headed

In [276]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [277]:
print(nltk.classify.accuracy(classifier, test_set))

0.88


In [278]:
classifier.show_most_informative_features(50)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.4 : 1.0
        contains(seagal) = True              neg : pos    =      8.7 : 1.0
         contains(mulan) = True              pos : neg    =      8.1 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.3 : 1.0
         contains(damon) = True              pos : neg    =      5.7 : 1.0
          contains(lame) = True              neg : pos    =      5.6 : 1.0
        contains(wasted) = True              neg : pos    =      5.6 : 1.0
         contains(awful) = True              neg : pos    =      5.4 : 1.0
         contains(flynt) = True              pos : neg    =      5.1 : 1.0
    contains(ridiculous) = True              neg : pos    =      5.1 : 1.0
        contains(poorly) = True              neg : pos    =      4.9 : 1.0
         contains(waste) = True              neg : pos    =      4.9 : 1.0
          contains(jedi) = True              pos : neg    =      4.8 : 1.0

The current version gets an accuracy of ~ 0.88, lets see how much better we can get by applying some of the learned methods, e.g. stemming or POS tagging.

## Lemmatizing

In [279]:
wnl = nltk.WordNetLemmatizer()

In [280]:
all_words = FreqDist(wnl.lemmatize(w.lower()) for w in movie_reviews.words())
word_features = list(word for (word,count) in all_words.most_common(2000))

In [281]:
def document_features(document):
    words=[wnl.lemmatize(word) for word in document]
    document_words = set(words)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [282]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [283]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [284]:
print(nltk.classify.accuracy(classifier, test_set))

0.88


## Using only adjectives

In [264]:
adjectives=[]
for sent in movie_reviews.sents():
    for word, pos in nltk.pos_tag(sent):
        if pos in ["JJ","JJR" "JJS"]: # feel free to add any other  tags
            adjectives.append(word)

In [265]:
adjectives

['fuck',
 'teen',
 'cool',
 'bad',
 'such',
 'good',
 'bad',
 't',
 'neat',
 'main',
 's',
 'normal',
 '"',
 'dead',
 'dead',
 'strange',
 'weird',
 'don',
 't',
 'same',
 'i',
 's',
 's',
 'big',
 'final',
 'sad',
 'half',
 'little',
 'bottom',
 'sure',
 'secret',
 'i',
 'melissa',
 'plain',
 'lazy',
 'different',
 'further',
 'pretty',
 'fuck',
 'i',
 'little',
 'good',
 'wes',
 'exact',
 'same',
 'american',
 'new',
 'own',
 'entire',
 's',
 'overall',
 't',
 's',
 'pretty',
 'pretty',
 'teen',
 's',
 'hot',
 's',
 'elm',
 'happy',
 's',
 'quick',
 's',
 'head',
 'deserted',
 'russian',
 'little',
 'few',
 'empty',
 'big',
 'flashy',
 'donald',
 's',
 's',
 'below',
 'likely',
 'halloween',
 's',
 'real',
 'stan',
 's',
 'schnazzy',
 'occasional',
 'good',
 's',
 's',
 'sunken',
 'jaded',
 'timex',
 'late',
 's',
 'same',
 'reformed',
 'wrong',
 'quick',
 'cool',
 's',
 'nice',
 'like',
 'cool',
 'first',
 'fifteen',
 'apparent',
 'complete',
 'nice',
 't',
 'long',
 'single',
 'spe

In [266]:
all_words = FreqDist(wnl.lemmatize(w.lower(),pos="a") for w in adjectives)
word_features = list(word for (word,count) in all_words.most_common(2000))

In [267]:
word_features

['s',
 '"',
 'good',
 't',
 'other',
 'i',
 'first',
 'bad',
 'little',
 'new',
 'many',
 'much',
 'great',
 'big',
 'such',
 'few',
 'real',
 'same',
 'old',
 'last',
 'own',
 'young',
 'original',
 'only',
 'funny',
 'high',
 'special',
 'american',
 'black',
 'hard',
 'interesting',
 'long',
 'different',
 'small',
 'next',
 'several',
 'whole',
 'true',
 'entire',
 'human',
 'll',
 'main',
 'dead',
 'comic',
 'final',
 'second',
 'full',
 'wrong',
 'sure',
 'able',
 'right',
 'nice',
 'white',
 'short',
 'major',
 'perfect',
 'obvious',
 'evil',
 'enough',
 'strong',
 'beautiful',
 'fine',
 'top',
 'possible',
 'romantic',
 'classic',
 'stupid',
 'deep',
 'simple',
 'single',
 'emotional',
 're',
 'wild',
 'poor',
 'easy',
 'serious',
 'important',
 'happy',
 'impressive',
 'recent',
 '*',
 'wonderful',
 'early',
 'dark',
 'local',
 'difficult',
 'hilarious',
 'certain',
 'popular',
 'effective',
 'late',
 'due',
 'dramatic',
 'personal',
 'entertaining',
 'sexual',
 'similar',
 'j

In [268]:
def document_features(document):
    words=[wnl.lemmatize(word.lower(),pos="a") for word in document]
    document_words = set(words)
    features = {}
    for word in word_features:
        features['contains({})'.format(word.lower())] = (word.lower() in document_words)
    return features

In [269]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [270]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [271]:
print(nltk.classify.accuracy(classifier, test_set))

0.87


In [272]:
classifier.show_most_informative_features(50)

Most Informative Features
     contains(insulting) = True              neg : pos    =     11.8 : 1.0
     contains(ludicrous) = True              neg : pos    =     11.0 : 1.0
   contains(outstanding) = True              pos : neg    =     10.4 : 1.0
    contains(incoherent) = True              neg : pos    =     10.0 : 1.0
      contains(seamless) = True              pos : neg    =      9.3 : 1.0
      contains(thematic) = True              pos : neg    =      9.3 : 1.0
        contains(feeble) = True              neg : pos    =      9.3 : 1.0
     contains(illogical) = True              neg : pos    =      9.3 : 1.0
        contains(seagal) = True              neg : pos    =      8.7 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.8 : 1.0
      contains(poignant) = True              pos : neg    =      7.6 : 1.0
       contains(abysmal) = True              neg : pos    =      7.0 : 1.0
        contains(shoddy) = True              neg : pos    =      7.0 : 1.0

# Just try other things 