## Getting Started With Natural Language Processing in Python

In [1]:
# > pip install nltk

However, this doesn't install everything quite yet. The pip command installs only the library itself; the corpora and trained models it relies on are downloaded separately. The following cell opens a GUI that lets you run a full installation or manually select the packages you want.

In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Now that we have the nltk package installed, let's go over some basic natural language processing vocabulary:

##### Corpus - 
Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals.
##### Lexicon - 
Words and their meanings. Example: English dictionary. Consider, however, that various fields will have different lexicons. 
##### Token - 
Each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello students, how are you doing today? The olympics are inspiring, and Python is awesome. You look nice today."

print(sent_tokenize(text))

['Hello students, how are you doing today?', 'The olympics are inspiring, and Python is awesome.', 'You look nice today.']


In [4]:
print(word_tokenize(text))

['Hello', 'students', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'olympics', 'are', 'inspiring', ',', 'and', 'Python', 'is', 'awesome', '.', 'You', 'look', 'nice', 'today', '.']
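To make "split up based on rules" concrete, here is a minimal sketch of rule-based tokenization using only Python's `re` module. The functions `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helpers written for this illustration; they are not how NLTK's tokenizers are implemented.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Hello students, how are you doing today? Python is awesome."
sentences = simple_sent_tokenize(sample)
tokens = simple_word_tokenize(sentences[0])

print(sentences)
print(tokens)
```

Rules this simple break easily: a sentence rule like this one would split "Mr. Speaker" in two, which is the kind of case a trained tokenizer is meant to handle.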


##### Stop Words with NLTK:

When using Natural Language Processing, our goal is to perform some analysis or processing so that a computer can respond to text appropriately. 

The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out data that adds little to the analysis. In natural language processing, very common words that carry little meaning on their own, such as "the", "is", and "a", are referred to as stop words.

In [5]:
from nltk.corpus import stopwords
print(set(stopwords.words('english')))

{'shan', 'did', 'his', 'it', 'by', 'about', 'up', 'same', "isn't", 's', 'on', 'yours', 'through', 'herself', 'as', 'hers', "you'll", 'only', 'aren', 'there', 'doing', 'yourselves', 'we', "you've", 'the', 'both', 'of', 'mustn', 'your', 'yourself', "wouldn't", 'which', 'hasn', "needn't", 'were', 'their', 'is', 'do', 'nor', 'd', "mightn't", 'some', 'very', 'had', 'isn', 'my', "she's", 'few', 'once', 'no', 'll', 'himself', 'wasn', "won't", "don't", "haven't", 'more', 'themselves', 'shouldn', "didn't", 'am', 'for', 'this', 'after', 'each', 'from', 'a', 'o', 'ourselves', 'into', 'myself', 'when', 'haven', 'in', 'what', 'y', 'between', 'itself', 'further', 'just', 'being', 'you', 'such', 'other', "it's", "doesn't", 'are', "should've", 'hadn', 'them', 'during', 'those', 'down', "hasn't", 'why', 'that', 'theirs', 'doesn', 'mightn', 'her', 'having', 'under', 'don', 'now', "weren't", 'has', 'our', 'm', 'then', 'over', 'an', 'against', 'how', "shan't", 'where', 'him', 'if', 'above', 'won', 'off', 

In [6]:
example_sent = "This is some sample text, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

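# Keep only the tokens that are not stop words (list-comprehension form)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# The same filter written as an explicit loop; both produce the same list
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)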

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'some', 'sample', 'text', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'text', ',', 'showing', 'stop', 'words', 'filtration', '.']
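Notice that 'This' survives the filter above: the stop word list is lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. `mini_stop_words` is a tiny hypothetical stand-in for `stopwords.words('english')`, used so the example runs without any downloads, and `remove_stop_words` is a helper name introduced here.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer
mini_stop_words = {"this", "is", "some", "off", "the"}

def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so "This" is treated the same as "this"
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ['This', 'is', 'some', 'sample', 'text', ',', 'showing', 'off',
          'the', 'stop', 'words', 'filtration', '.']
filtered = remove_stop_words(tokens, mini_stop_words)

print(filtered)
```

Whether to lowercase before filtering is a design choice: it removes sentence-initial stop words, but the original casing of the kept tokens is preserved because only the comparison is lowercased.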


##### Stemming Words with NLTK:

Stemming, which attempts to normalize words by reducing them to a common root, is another preprocessing step that we can perform.  In the English language, different variations of words and sentences often have the same meaning.  Stemming is a way to account for these variations; furthermore, it shortens the sentences and reduces the number of distinct words we have to look up.  For example, consider the following sentences:

* I was taking a ride on my horse.
* I was riding my horse.

These sentences mean the same thing, and both contain an -ing verb form; however, a computer does not automatically see "ride" and "riding" as related. To account for the variations of words in the English language, we can use the Porter stemmer, which has been around since 1979.

In [7]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["ride","riding","rider","rides"]

for w in example_words:
    print(ps.stem(w))

ride
ride
rider
ride


In [8]:
# Now let's try stemming an entire sentence!

new_text = "When riders are riding their horses, they often think of how cowboys rode horses."

words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

when
rider
are
ride
their
hors
,
they
often
think
of
how
cowboy
rode
hors
.
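Returning to the two example sentences from the start of this section, the sketch below stems each one and intersects the results to show that they share the stems that carry the meaning. It uses `str.split()` instead of `word_tokenize` so that no extra data is needed; `stem_set` is a helper name introduced here for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_set(sentence):
    # Lowercase, drop the final period, split on whitespace, then stem each word
    words = sentence.lower().rstrip(".").split()
    return {ps.stem(w) for w in words}

a = stem_set("I was taking a ride on my horse.")
b = stem_set("I was riding my horse.")

# Stems that both sentences have in common
shared = a & b
print(sorted(shared))
```

Both "ride" and "riding" reduce to the same stem, so the overlap includes it along with the stem for "horse".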


##### Part of Speech Tagging with NLTK

Part of speech tagging means labeling words as nouns, verbs, adjectives, etc. Even better, NLTK's tags also distinguish verb tenses, for example VBD for past tense and VBG for a gerund or present participle. While we're at it, we are also going to import a new sentence tokenizer (PunktSentenceTokenizer). This tokenizer is capable of unsupervised learning, so it can be trained on any body of text. 

In [9]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [10]:
# We can use documents from nltk.corpus.  As an example, let's load the Universal Declaration of Human Rights.
from nltk.corpus import udhr
print(udhr.raw('English-Latin1'))

Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, 

Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, 

Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, 

Whereas it is essential to promote the development of friendly relations between nations, 

Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in

In [11]:
# Let's import some sample and training text - George W. Bush's 2005 and 2006 State of the Union addresses. 

from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [12]:
# Now that we have some text, we can train the PunktSentenceTokenizer

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [13]:
# Now let's tokenize the sample_text using our trained tokenizer

tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [14]:
# This function will tag each tokenized word with a part of speech

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))

        
# The output is a list of tuples - each word with its part of speech
process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat
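Because the tagger returns plain `(word, tag)` tuples, ordinary Python is enough to work with them. As a small sketch, the snippet below takes the first ten tuples from the second sentence of the output above, counts the tags, and pulls out the proper nouns. The variable names are introduced here for illustration.

```python
from collections import Counter

# First ten (word, tag) pairs copied from the tagged output above
tagged = [('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'),
          ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','),
          ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP')]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# Only the words tagged as singular proper nouns
proper_nouns = [word for word, tag in tagged if tag == 'NNP']

print(tag_counts.most_common(3))
print(proper_nouns)
```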

##### Chunking with NLTK

Now that each word has been tagged with a part of speech, we can move on to chunking: grouping the words into meaningful clusters.  The main goal of chunking is to group words into "noun phrases", each of which is a noun together with any associated verbs, adjectives, or adverbs. 

The part of speech tags that were generated in the previous step will be combined with regular expressions, such as the following:

In [15]:
'''
+ = match 1 or more
? = match 0 or 1 repetitions.
* = match 0 or MORE repetitions	  
. = Any character except a new line
'''

'\n+ = match 1 or more\n? = match 0 or 1 repetitions.\n* = match 0 or MORE repetitions\t  \n. = Any character except a new line\n'

In [16]:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # draw the chunks with nltk
            # chunked.draw()     

    except Exception as e:
        print(str(e))

        
process_content()

The main line in question is:

In [17]:
'''
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
'''

'\nchunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""\n'

This line, broken down:

In [18]:
'''
<RB.?>* = "0 or more adverbs of any form (RB, RBR, RBS)," followed by:

<VB.?>* = "0 or more verbs of any form or tense (VB, VBD, VBG, ...)," followed by:

<NNP>+ = "one or more proper nouns," followed by:

<NN>? = "zero or one singular noun."

'''

'\n<RB.?>* = "0 or more adverbs of any form (RB, RBR, RBS)," followed by:\n\n<VB.?>* = "0 or more verbs of any form or tense (VB, VBD, VBG, ...)," followed by:\n\n<NNP>+ = "one or more proper nouns," followed by:\n\n<NN>? = "zero or one singular noun."\n\n'
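To see what each piece of the pattern accepts without running the tagger, here is a minimal sketch on a short hand-tagged token list. The sentence is made up for illustration (its tags follow the style of the tagger output in this notebook), and `chunk_parser` and `chunks` are names introduced here.

```python
import nltk

# Hand-tagged tokens, so no tagger model is required for this sketch
tagged = [('We', 'PRP'), ('have', 'VBP'), ('served', 'VBN'), ('America', 'NNP'),
          ('at', 'IN'), ('the', 'DT'), ('White', 'NNP'), ('House', 'NNP'),
          ('photo', 'NN'), ('.', '.')]

# Same grammar as in the cells above
chunk_parser = nltk.RegexpParser(r"Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}")
tree = chunk_parser.parse(tagged)

# Collect the words of every subtree labelled 'Chunk'
chunks = [[word for word, tag in st.leaves()]
          for st in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]

print(chunks)
```

The two verbs are absorbed by `<VB.?>*` ahead of the proper noun "America", while "photo" is picked up by the optional trailing `<NN>?`. "We", "at", "the", and the period match nothing and stay outside the chunks.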

In [19]:
# We can access the chunks, which are stored as an NLTK tree 

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            
            # draw the chunks with nltk
            # chunked.draw()     

    except Exception as e:
        print(str(e))

        
process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)
(Chunk Mr./NNP Speaker/NNP)
(Chunk Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk called/VBD America/NNP)
(Chunk Coretta/NNP Scott/NNP King/NNP)
(Chunk Applause/NNP)
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP)
(Chunk Tuesday/NNP)
(Chunk Jan/NNP)
(Chunk White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN)
(Chunk Capitol/NNP dome/NN)
(Chunk have/VBP served/VBN America/NNP)
(Chunk Tonight/NNP)
(Chunk Union/NNP)
(Chunk Applause/NNP)
(Chunk United/NNP)
(Chunk America/NNP)
(Chunk Applause/NNP)
(Chunk America/NNP)
(Chunk September/NNP)
(Chunk Dictatorships/NNP shelter/NN)
(Chunk Applause/NNP)
(Chunk Afghanistan/NNP)
(

##### Chinking with NLTK

Sometimes there are words in the chunks that we don't want; we can remove them using a process called chinking.

In [20]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # The main difference here is the }{ (chink) versus the {} (chunk). We first chunk everything,
            # then remove from the chunk any run of one or more verbs, prepositions, determiners, or the word 'to'.

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            # chunked.draw()

    except Exception as e:
        print(str(e))

        
process_content()

(Chunk 31/CD ,/, 2006/CD ./.)
(Chunk White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN I/PRP)
(Chunk invited/JJ)
(Chunk rostrum/NN ,/, I/PRP)
(Chunk privilege/NN ,/, and/CC mindful/NN)
(Chunk history/NN we/PRP)
(Chunk together/RB ./.)
(Chunk We/PRP)
(Chunk Capitol/NNP dome/NN)
(Chunk moments/NNS)
(Chunk national/JJ mourning/NN and/CC national/JJ achievement/NN ./.)
(Chunk We/PRP)
(Chunk America/NNP)
(Chunk one/CD)
(Chunk most/RBS consequential/JJ periods/NNS)
(Chunk our/PRP$ history/NN --/: and/CC it/PRP)
(Chunk my/PRP$ honor/NN)
(Chunk you/PRP ./.)
(Chunk system/NN)
(Chunk
  two/CD
  parties/NNS
  ,/,
  two/CD
  chambers/NNS
  ,/,
  and/CC
  two/CD
  elected/JJ
  branches/NNS
  ,/,
  there/EX
  will/MD
  always/RB)
(Chunk differences/NNS and/CC debate/NN ./.)
(Chunk But/CC even/RB tough/JJ debates/NNS can/MD)
(Chunk
  civil/JJ
  tone/NN
  ,/,
  and/CC
  our/PRP$
  differences/NNS
  can/MD
  not/RB)
(Chunk anger/NN ./.)
(Chunk great/JJ issues/NNS)
(Chunk us/PRP ,/, 

(Chunk way/NN ,/, we/PRP)
(Chunk responsible/JJ criticism/NN and/CC counsel/NN)
(Chunk members/NNS)
(Chunk Congress/NNP)
(Chunk parties/NNS ./.)
(Chunk year/NN ,/, I/PRP will/MD)
(Chunk out/RP and/CC)
(Chunk your/PRP$ good/JJ advice/NN ./.)
(Chunk Yet/RB ,/, there/EX)
(Chunk difference/NN)
(Chunk responsible/JJ criticism/NN that/WDT)
(Chunk success/NN ,/, and/CC defeatism/NN that/WDT)
(Chunk anything/NN but/CC failure/NN ./.)
(Chunk (/( Applause/NNP ./. )/))
(Chunk Hindsight/NNP alone/RB)
(Chunk not/RB wisdom/JJ ,/, and/CC second-guessing/NN)
(Chunk not/RB)
(Chunk strategy/NN ./.)
(Chunk (/( Applause/NNP ./. )/))
(Chunk so/RB much/JJ)
(Chunk balance/NN ,/,)
(Chunk us/PRP)
(Chunk public/JJ office/NN)
(Chunk duty/NN)
(Chunk candor/NN ./.)
(Chunk sudden/JJ withdrawal/NN)
(Chunk our/PRP$ forces/NNS)
(Chunk Iraq/NNP would/MD)
(Chunk our/PRP$ Iraqi/NNP allies/NNS)
(Chunk death/NN and/CC prison/NN ,/, would/MD)
(Chunk men/NNS)
(Chunk bin/NN Laden/NNP and/CC Zarqawi/NNP)
(Chunk charge/NN)
(Chu

(Chunk milestone/NN)
(Chunk more/JJR)
(Chunk personal/JJ crisis/NN --/: (/( laughter/NN )/) --/: it/PRP)
(Chunk national/JJ challenge/NN ./.)
(Chunk retirement/NN)
(Chunk baby/NN boom/NN generation/NN will/MD)
(Chunk unprecedented/JJ strains/NNS)
(Chunk federal/JJ government/NN ./.)
(Chunk 2030/CD ,/,)
(Chunk
  Social/NNP
  Security/NNP
  ,/,
  Medicare/NNP
  and/CC
  Medicaid/NNP
  alone/RB
  will/MD)
(Chunk almost/RB 60/CD percent/NN)
(Chunk entire/JJ federal/JJ budget/NN ./.)
(Chunk And/CC)
(Chunk will/MD)
(Chunk future/JJ Congresses/NNS)
(Chunk impossible/JJ choices/NNS --/:)
(Chunk
  tax/NN
  increases/NNS
  ,/,
  immense/JJ
  deficits/NNS
  ,/,
  or/CC
  deep/JJ
  cuts/NNS)
(Chunk category/NN)
(Chunk spending/NN ./.)
(Chunk Congress/NNP)
(Chunk not/RB)
(Chunk last/JJ year/NN)
(Chunk my/PRP$ proposal/NN)
(Chunk Social/NNP Security/NNP --/: (/( applause/NN )/) --/: yet/RB)
(Chunk cost/NN)
(Chunk entitlements/NNS)
(Chunk problem/NN that/WDT)
(Chunk not/RB)
(Chunk away/RB ./.)
(Chunk

(Chunk We/PRP)
(Chunk debris/NN and/CC repairing/NN highways/NNS and/CC)
(Chunk stronger/JJR levees/NNS ./.)
(Chunk We/PRP)
(Chunk business/NN loans/NNS and/CC housing/NN assistance/NN ./.)
(Chunk Yet/RB)
(Chunk we/PRP)
(Chunk immediate/JJ needs/NNS ,/, we/PRP must/MD also/RB)
(Chunk deeper/JJR challenges/NNS that/WDT)
(Chunk storm/NN)
(Chunk ./.)
(Chunk New/NNP Orleans/NNP and/CC)
(Chunk other/JJ places/NNS ,/, many/JJ)
(Chunk our/PRP$ fellow/JJ citizens/NNS)
(Chunk promise/NN)
(Chunk our/PRP$ country/NN ./.)
(Chunk answer/NN)
(Chunk
  not/RB
  only/RB
  temporary/JJ
  relief/NN
  ,/,
  but/CC
  schools/NNS
  that/WDT)
(Chunk child/NN ,/, and/CC job/NN skills/NNS)
(Chunk upward/JJ mobility/NN ,/, and/CC more/JJR opportunities/NNS)
(Chunk home/NN and/CC)
(Chunk business/NN ./.)
(Chunk we/PRP)
(Chunk disaster/NN ,/,)
(Chunk us/PRP also/RB work/NN)
(Chunk day/NN when/WRB)
(Chunk Americans/NNPS)
(Chunk justice/NN ,/, equal/JJ)
(Chunk hope/NN ,/, and/CC rich/JJ)
(Chunk opportunity/NN ./.)
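The effect of the chink rule is easier to follow on a single short sentence. The sketch below applies the same grammar to a hand-tagged token list (written by hand so that no tagger model is needed); the variable names are introduced here for illustration.

```python
import nltk

# Hand-tagged tokens for one clause; tags follow the style of the output above
tagged = [('We', 'PRP'), ('have', 'VBP'), ('served', 'VBN'), ('America', 'NNP'),
          ('through', 'IN'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'),
          ('most', 'RBS'), ('consequential', 'JJ'), ('periods', 'NNS')]

# Chunk everything, then chink (cut out) verbs, prepositions, determiners and 'to'
grammar = r"""Chunk: {<.*>+}
                     }<VB.?|IN|DT|TO>+{"""
chink_parser = nltk.RegexpParser(grammar)
tree = chink_parser.parse(tagged)

# Each removed run splits the original chunk into smaller pieces
chunks = [[word for word, tag in st.leaves()]
          for st in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]

print(chunks)
```

The single all-inclusive chunk is split wherever a chinked run ("have served", "through", "of the") is cut out, leaving four smaller chunks.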


##### Named Entity Recognition with NLTK

One of the most common forms of chunking in natural language processing is called "Named Entity Recognition." NLTK is able to identify people, organizations, locations, monetary figures, and more.

There are two major options with NLTK's named entity recognition: either recognize all named entities without distinguishing their type, or label each named entity with its respective type, such as person, organization, or location.

Here we pass binary=True, which means each chunk is either a named entity or not; no further detail is given.

In [21]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            # namedEnt.draw()
            
    except Exception as e:
        print(str(e))

        
process_content()
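The cell above builds the tree for each sentence but prints nothing. As a sketch of how the named entities could be pulled out, the snippet below walks a tree and joins the words of every subtree labelled `NE`, which is the label used when `binary=True`. The tree here is built by hand as a stand-in, so the entities shown are illustrative and not actual model output; `extract_entities` is a helper name introduced here.

```python
from nltk.tree import Tree

# Hand-built stand-in for a tree returned by nltk.ne_chunk(tagged, binary=True)
named_ent = Tree('S', [
    Tree('NE', [('White', 'NNP'), ('House', 'NNP')]),
    ('photo', 'NN'),
    ('by', 'IN'),
    Tree('NE', [('Eric', 'NNP'), ('Draper', 'NNP')]),
])

def extract_entities(tree, label='NE'):
    # Join the words of each subtree that carries the requested label
    return [" ".join(word for word, tag in st.leaves())
            for st in tree.subtrees(filter=lambda t: t.label() == label)]

entities = extract_entities(named_ent)
print(entities)
```

With `binary=False`, the subtrees carry type labels such as `PERSON` or `ORGANIZATION` instead, so the `label` argument would need to change accordingly.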

### Text Classification

##### Text classification using NLTK

Now that we have covered the basics of preprocessing for Natural Language Processing, we can move on to text classification using simple machine learning classification algorithms.

In [22]:
import random
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# shuffle the documents
random.shuffle(documents)

print('Number of Documents: {}'.format(len(documents)))
print('First Review: {}'.format(documents[1]))

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

print('Most common words: {}'.format(all_words.most_common(15)))
print('The word happy: {}'.format(all_words["happy"]))

Number of Documents: 2000
First Review: (['synopsis', ':', 'in', 'this', 'cultural', 'exploration', ',', 'a', 'chinese', 'american', 'computer', 'engineer', 'named', 'fang', '(', 'peter', 'wang', ')', 'decides', 'to', 'take', 'a', 'month', '-', 'long', 'vacation', 'to', 'visit', 'his', 'sister', 'mrs', '.', 'chao', '(', 'shen', 'guanglan', ')', ',', 'her', 'husband', 'mr', '.', 'chao', '(', 'hy', 'xiaoguang', ')', ',', 'and', 'their', 'teenage', 'daughter', 'lili', '(', 'li', 'qinqin', ')', 'in', 'beijing', 'after', '30', 'years', 'of', 'separation', '.', 'fang', 'brings', 'his', 'asian', 'american', 'wife', 'grace', '(', 'sharon', 'iwai', ')', 'and', 'his', 'college', '-', 'aged', 'son', 'paul', '(', 'kelvin', 'han', 'yee', ')', 'along', ',', 'both', 'of', 'whom', 'don', "'", 't', 'speak', 'chinese', '.', 'the', 'encounter', 'between', 'the', 'two', 'families', 'allows', 'the', 'audience', 'to', 'compare', 'the', 'eastern', 'and', 'western', 'cultures', 'as', 'well', 'as', 'the', 'amb

In [23]:
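# We'll use 4000 words as features. Note: keys() keeps insertion order (Python 3.7+), so this slice takes the
# first 4000 distinct words encountered, not the 4000 most frequent; all_words.most_common(4000) would rank by frequency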
print(len(all_words))
word_features = list(all_words.keys())[:4000]

39768
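One detail worth checking in the cell above: in NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so `keys()` comes back in insertion order and not in frequency order. The sketch below uses a plain `Counter` and a made-up word list to show the difference between slicing the keys and asking for the most common entries.

```python
from collections import Counter

# Made-up word stream for illustration
words = ["plot", "the", "movie", "the", "a", "the", "a", "bad"]
freq = Counter(words)

# Slicing the keys returns the first distinct words that were seen
first_seen = list(freq.keys())[:2]

# most_common ranks by count, highest first
most_frequent = [w for w, _ in freq.most_common(2)]

print(first_seen)
print(most_frequent)
```

If the intent is to use the most frequent words as features, a line such as `word_features = [w for w, _ in all_words.most_common(4000)]` would be one way to do it; the results reported later in this notebook were produced with the `keys()` slice.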


In [25]:
# The find_features function will determine which of the 4000 word features are contained in the review
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features


# Let's use an example from a negative review
features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for key, value in features.items():
    if value:
        print(key)

plot
:
two
teen
couples
go
to
a
church
party
,
drink
and
then
drive
.
they
get
into
an
accident
one
of
the
guys
dies
but
his
girlfriend
continues
see
him
in
her
life
has
nightmares
what
'
s
deal
?
watch
movie
"
sorta
find
out
critique
mind
-
fuck
for
generation
that
touches
on
very
cool
idea
presents
it
bad
package
which
is
makes
this
review
even
harder
write
since
i
generally
applaud
films
attempt
break
mold
mess
with
your
head
such
(
lost
highway
&
memento
)
there
are
good
ways
making
all
types
these
folks
just
didn
t
snag
correctly
seem
have
taken
pretty
neat
concept
executed
terribly
so
problems
well
its
main
problem
simply
too
jumbled
starts
off
normal
downshifts
fantasy
world
you
as
audience
member
no
going
dreams
characters
coming
back
from
dead
others
who
look
like
strange
apparitions
disappearances
looooot
chase
scenes
tons
weird
things
happen
most
not
explained
now
personally
don
trying
unravel
film
every
when
does
give
me
same
clue
over
again
kind
fed
up
after
while
biggest
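The shape of the feature dictionary is easier to see with a tiny vocabulary. The sketch below re-creates `find_features` against a hypothetical four-word `word_features` list and a made-up review, so it runs without the movie review corpus.

```python
# Hypothetical tiny vocabulary; the notebook uses 4000 words here
word_features = ["good", "bad", "plot", "boring"]

def find_features(document):
    # A set gives fast membership tests and ignores repeated words
    words = set(document)
    return {w: (w in words) for w in word_features}

review = ["the", "plot", "was", "bad", "bad", "bad"]
features = find_features(review)

print(features)
```

Each feature is a presence flag, so "bad" appearing three times counts the same as appearing once, and words outside the vocabulary ("the", "was") are ignored entirely.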


In [26]:
# Now let's do it for all the documents
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [27]:
# we can split the featuresets into training and testing datasets using sklearn
from sklearn import model_selection

# define a seed for reproducibility
seed = 1

# split the data into training and testing datasets
training, testing = model_selection.train_test_split(featuresets, test_size = 0.25, random_state=seed)

In [28]:
print(len(training))
print(len(testing))

1500
500


In [29]:
# We can use sklearn algorithms in NLTK
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

model = SklearnClassifier(SVC(kernel = 'linear'))

# train the model on the training data
model.train(training)

# and test on the testing dataset!
accuracy = nltk.classify.accuracy(model, testing)*100
print("SVC Accuracy: {}".format(accuracy))

SVC Accuracy: 81.6
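The accuracy figure is the fraction of test reviews whose predicted label matches the true label. As a minimal sketch of that calculation in plain Python, with made-up labels and a helper name (`accuracy`) introduced here for illustration:

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Made-up labels for five reviews
gold = ["pos", "neg", "pos", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neg", "pos"]

score = accuracy(predicted, gold) * 100
print("Accuracy: {:.1f}".format(score))
```

By the same arithmetic, 81.6 on the 500 test reviews above corresponds to 408 correctly classified reviews.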
