# NLP (Natural Language Processing)

#### TL; DR
To develop a deeper intuition with NLP or vectorization of words to do sentimental analysis

#### Packages for NLP
NLTK

### Reference

Sentdex [video1](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)

### Key Terms

- **Tokenizing**: grouping of text (2 types of separators: sentences and words)
- **Corporas**: body of text with same subject/theme
- **Lexicon**: words & their meanings
- **Stop Words**: "fluff" meaningless words that are typically removed
- **Stemming**: typically referred to as the process of removing the end of words that connote a different tense
- **Lemmatizing**: gets the root of the words in contrast to stemming
- **Tagging**: labeling words as nouns, verbs, adjectives, etc...
- **Chunking**: phrases of words that contain a noun surrounded by a verb, adverb that are related

*[Regular Expressions](https://pythonprogramming.net/regular-expressions-regex-tutorial-python-3/): own language/symbols 

- **Chinking**: a chink is a chunk that is removed ofrom a chunk

In [1]:
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import matplotlib.pyplot as plt
%matplotlib inline

### Grouping Sentences & Words

In [2]:
example_text = "Hello Mr Smith, how are you doing today? The weather is great, \
                and Python is awesome. The sky is pinkish-blue. \
                You shouldn't eat cardboard."

In [3]:
print(sent_tokenize(example_text))

['Hello Mr Smith, how are you doing today?', 'The weather is great,                 and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]


In [4]:
print(word_tokenize(example_text))

['Hello', 'Mr', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


In [5]:
for i in word_tokenize(example_text):
    print(i)

Hello
Mr
Smith
,
how
are
you
doing
today
?
The
weather
is
great
,
and
Python
is
awesome
.
The
sky
is
pinkish-blue
.
You
should
n't
eat
cardboard
.


### Stopwords

In [6]:
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
print(stop_words)

set([u'all', u'just', u"don't", u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'o', u'don', u'hadn', u'herself', u'll', u'had', u'should', u'to', u'only', u'won', u'under', u'ours', u'has', u"should've", u"haven't", u'do', u'them', u'his', u'very', u"you've", u'they', u'not', u'during', u'now', u'him', u'nor', u"wasn't", u'd', u'did', u'didn', u'this', u'she', u'each', u'further', u"won't", u'where', u"mustn't", u"isn't", u'few', u'because', u"you'd", u'doing', u'some', u'hasn', u"hasn't", u'are', u'our', u'ourselves', u'out', u'what', u'for', u"needn't", u'below', u're', u'does', u"shouldn't", u'above', u'between', u'mustn', u't', u'be', u'we', u'who', u"mightn't", u"doesn't", u'were', u'here', u'shouldn', u'hers', u"aren't", u'by', u'on', u'about', u'couldn', u'of', u"wouldn't", u'against', u's', u'isn', u'or', u'own', u'into', u'yourself', u'down', u"hadn't", u'mightn', u"couldn't", u'wasn', u'your', u"you're", u'from', u'her', u'their', u'aren', u"it's",

In [7]:
word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [8]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning",
                "pythoned", "pythonly"]

In [9]:
[ps.stem(w) for w in example_words]

['python', u'python', u'python', u'python', u'pythonli']

In [10]:
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [11]:
new_text = "It is important to by very pythonly while you are pythoning with python. \
All pythoners have pythoned poorly at least once."


In [12]:
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


### Tagging

In [13]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [14]:
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [15]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            for subtree in chunked.subtress(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

    except Exception as e:
        print(str(e))

In [16]:
process_content()

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)
'Tree' object has no attribute 'subtress'


### Chinking

In [17]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
#             chunked.draw()

    except Exception as e:
        print(str(e))

In [18]:
process_content()

(S (Chunk 31/CD ,/, 2006/CD ./.))
(S
  (Chunk White/NNP House/NNP photo/NN)
  by/IN
  (Chunk Eric/NNP DraperEvery/NNP time/NN I/PRP)
  'm/VBP
  (Chunk invited/JJ)
  to/TO
  this/DT
  (Chunk rostrum/NN ,/, I/PRP)
  'm/VBP
  humbled/VBN
  by/IN
  the/DT
  (Chunk privilege/NN ,/, and/CC mindful/NN)
  of/IN
  the/DT
  (Chunk history/NN we/PRP)
  've/VBP
  seen/VBN
  (Chunk together/RB ./.))
(S
  (Chunk We/PRP)
  have/VBP
  gathered/VBN
  under/IN
  this/DT
  (Chunk Capitol/NNP dome/NN)
  in/IN
  (Chunk moments/NNS)
  of/IN
  (Chunk
    national/JJ
    mourning/NN
    and/CC
    national/JJ
    achievement/NN
    ./.))
(S
  (Chunk We/PRP)
  have/VBP
  served/VBN
  (Chunk America/NNP)
  through/IN
  (Chunk one/CD)
  of/IN
  the/DT
  (Chunk most/RBS consequential/JJ periods/NNS)
  of/IN
  (Chunk our/PRP$ history/NN --/: and/CC it/PRP)
  has/VBZ
  been/VBN
  (Chunk my/PRP$ honor/NN)
  to/TO
  serve/VB
  with/IN
  (Chunk you/PRP ./.))
(S
  In/IN
  a/DT
  (Chunk system/NN)
  of/IN
  (Chunk
    t

(S
  Do/VBP
  (Chunk n't/RB)
  hesitate/VB
  to/TO
  honor/VB
  (Chunk and/CC)
  support/VB
  those/DT
  of/IN
  (Chunk us/PRP who/WP)
  have/VBP
  the/DT
  (Chunk honor/NN)
  of/IN
  protecting/VBG
  that/DT
  (Chunk which/WDT)
  is/VBZ
  (Chunk worth/JJ)
  protecting/VBG
  (Chunk ./. ''/''))
(S
  (Chunk
    Staff/NNP
    Sergeant/NNP
    Dan/NNP
    Clay/NNP
    's/POS
    wife/NN
    ,/,
    Lisa/NNP
    ,/,
    and/CC
    his/PRP$
    mom/NN
    and/CC
    dad/NN
    ,/,
    Sara/NNP
    Jo/NNP
    and/CC
    Bud/NNP
    ,/,)
  are/VBP
  with/IN
  (Chunk us/PRP)
  this/DT
  (Chunk evening/NN ./.))
(S (Chunk Welcome/NNP ./.))
(S (Chunk (/( Applause/NNP ./. )/)))
(S
  (Chunk Our/PRP$ nation/NN)
  is/VBZ
  (Chunk grateful/JJ)
  to/TO
  the/DT
  fallen/VBN
  (Chunk ,/, who/WP)
  live/VBP
  in/IN
  the/DT
  (Chunk memory/NN)
  of/IN
  (Chunk our/PRP$ country/NN ./.))
(S
  (Chunk We/PRP)
  're/VBP
  (Chunk grateful/JJ)
  to/TO
  all/DT
  (Chunk who/WP)
  volunteer/VBP
  to/TO
  wear/VB
 

(S
  (Chunk Congress/NNP)
  did/VBD
  (Chunk not/RB)
  act/VB
  (Chunk last/JJ year/NN)
  on/IN
  (Chunk my/PRP$ proposal/NN)
  to/TO
  save/VB
  (Chunk
    Social/NNP
    Security/NNP
    --/:
    (/(
    applause/NN
    )/)
    --/:
    yet/RB)
  the/DT
  rising/VBG
  (Chunk cost/NN)
  of/IN
  (Chunk entitlements/NNS)
  is/VBZ
  a/DT
  (Chunk problem/NN that/WDT)
  is/VBZ
  (Chunk not/RB)
  going/VBG
  (Chunk away/RB ./.))
(S (Chunk (/( Applause/NNP ./. )/)))
(S
  (Chunk And/CC)
  every/DT
  (Chunk year/NN we/PRP)
  fail/VBP
  to/TO
  act/VB
  (Chunk ,/,)
  the/DT
  (Chunk situation/NN)
  gets/VBZ
  (Chunk worse/JJR ./.))
(S
  (Chunk So/RB tonight/JJ ,/, I/PRP)
  ask/VBP
  (Chunk you/PRP)
  to/TO
  join/VB
  (Chunk me/PRP)
  in/IN
  creating/VBG
  a/DT
  (Chunk commission/NN)
  to/TO
  examine/VB
  the/DT
  (Chunk full/JJ impact/NN)
  of/IN
  (Chunk baby/NN boom/NN retirements/NNS)
  on/IN
  (Chunk
    Social/NNP
    Security/NNP
    ,/,
    Medicare/NNP
    ,/,
    and/CC
    Medica

  (Chunk America/NNP ./.))
(S (Chunk (/( Applause/NNP ./. )/)))
(S
  (Chunk Fellow/NNP citizens/NNS ,/, we/PRP)
  've/VBP
  been/VBN
  called/VBN
  to/TO
  (Chunk leadership/NN)
  in/IN
  a/DT
  (Chunk period/NN)
  of/IN
  (Chunk consequence/NN ./.))
(S
  (Chunk We/PRP)
  've/VBP
  entered/VBN
  a/DT
  (Chunk great/JJ ideological/JJ conflict/NN we/PRP)
  did/VBD
  (Chunk nothing/NN)
  to/TO
  invite/VB
  (Chunk ./.))
(S
  (Chunk We/PRP)
  see/VBP
  (Chunk great/JJ changes/NNS)
  in/IN
  (Chunk science/NN and/CC commerce/NN that/WDT will/MD)
  influence/VB
  all/DT
  (Chunk our/PRP$ lives/NNS ./.))
(S
  (Chunk Sometimes/RB it/PRP can/MD)
  seem/VB
  that/DT
  (Chunk history/NN)
  is/VBZ
  turning/VBG
  in/IN
  a/DT
  (Chunk wide/JJ arc/NN ,/,)
  toward/IN
  an/DT
  (Chunk unknown/JJ shore/NN ./.))
(S
  (Chunk Yet/CC)
  the/DT
  (Chunk destination/NN)
  of/IN
  (Chunk history/NN)
  is/VBZ
  determined/VBN
  by/IN
  (Chunk human/JJ action/NN ,/, and/CC)
  every/DT
  (Chunk great/JJ moveme

### Name Entity Recognition

In [19]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            print(namedEnt)
#             namedEnt.draw()
    except Exception as e:
        print(str(e))

In [20]:
process_content()

(S 31/CD ,/, 2006/CD ./.)
(S
  (NE White/NNP House/NNP)
  photo/NN
  by/IN
  (NE Eric/NNP)
  DraperEvery/NNP
  time/NN
  I/PRP
  'm/VBP
  invited/JJ
  to/TO
  this/DT
  rostrum/NN
  ,/,
  I/PRP
  'm/VBP
  humbled/VBN
  by/IN
  the/DT
  privilege/NN
  ,/,
  and/CC
  mindful/NN
  of/IN
  the/DT
  history/NN
  we/PRP
  've/VBP
  seen/VBN
  together/RB
  ./.)
(S
  We/PRP
  have/VBP
  gathered/VBN
  under/IN
  this/DT
  Capitol/NNP
  dome/NN
  in/IN
  moments/NNS
  of/IN
  national/JJ
  mourning/NN
  and/CC
  national/JJ
  achievement/NN
  ./.)
(S
  We/PRP
  have/VBP
  served/VBN
  (NE America/NNP)
  through/IN
  one/CD
  of/IN
  the/DT
  most/RBS
  consequential/JJ
  periods/NNS
  of/IN
  our/PRP$
  history/NN
  --/:
  and/CC
  it/PRP
  has/VBZ
  been/VBN
  my/PRP$
  honor/NN
  to/TO
  serve/VB
  with/IN
  you/PRP
  ./.)
(S
  In/IN
  a/DT
  system/NN
  of/IN
  two/CD
  parties/NNS
  ,/,
  two/CD
  chambers/NNS
  ,/,
  and/CC
  two/CD
  elected/JJ
  branches/NNS
  ,/,
  there/EX
  will/MD
 

(S
  We/PRP
  have/VBP
  killed/VBN
  or/CC
  captured/VBN
  many/JJ
  of/IN
  their/PRP$
  leaders/NNS
  --/:
  and/CC
  for/IN
  the/DT
  others/NNS
  ,/,
  their/PRP$
  day/NN
  will/MD
  come/VB
  ./.)
(S
  President/NNP
  (NE George/NNP)
  W./NNP
  Bush/NNP
  greets/VBZ
  members/NNS
  of/IN
  (NE Congress/NNP)
  after/IN
  his/PRP$
  State/NN
  of/IN
  the/DT
  (NE Union/NNP Address/NNP)
  at/IN
  the/DT
  (NE Capitol/NNP)
  ,/,
  Tuesday/NNP
  ,/,
  (NE Jan/NNP)
  ./.)
(S 31/CD ,/, 2006/CD ./.)
(S
  (NE White/NNP House/NNP)
  photo/NN
  by/IN
  (NE Eric/NNP Draper/NNP)
  We/PRP
  remain/VBP
  on/IN
  the/DT
  offensive/JJ
  in/IN
  (NE Afghanistan/NNP)
  ,/,
  where/WRB
  a/DT
  fine/JJ
  President/NNP
  and/CC
  a/DT
  (NE National/NNP Assembly/NNP)
  are/VBP
  fighting/VBG
  terror/NN
  while/IN
  building/VBG
  the/DT
  institutions/NNS
  of/IN
  a/DT
  new/JJ
  democracy/NN
  ./.)
(S
  We/PRP
  're/VBP
  on/IN
  the/DT
  offensive/JJ
  in/IN
  (NE Iraq/NNP)
  ,/,
  with/IN
 

(S
  Our/PRP$
  nation/NN
  is/VBZ
  grateful/JJ
  to/TO
  the/DT
  fallen/VBN
  ,/,
  who/WP
  live/VBP
  in/IN
  the/DT
  memory/NN
  of/IN
  our/PRP$
  country/NN
  ./.)
(S
  We/PRP
  're/VBP
  grateful/JJ
  to/TO
  all/DT
  who/WP
  volunteer/VBP
  to/TO
  wear/VB
  our/PRP$
  nation/NN
  's/POS
  uniform/NN
  --/:
  and/CC
  as/IN
  we/PRP
  honor/VBP
  our/PRP$
  brave/NN
  troops/NNS
  ,/,
  let/VB
  us/PRP
  never/RB
  forget/VBP
  the/DT
  sacrifices/NNS
  of/IN
  (NE America/NNP)
  's/POS
  military/JJ
  families/NNS
  ./.)
(S (/( (NE Applause/NNP) ./. )/))
(S
  Our/PRP$
  offensive/JJ
  against/IN
  terror/NN
  involves/VBZ
  more/JJR
  than/IN
  military/JJ
  action/NN
  ./.)
(S
  Ultimately/RB
  ,/,
  the/DT
  only/JJ
  way/NN
  to/TO
  defeat/VB
  the/DT
  terrorists/NNS
  is/VBZ
  to/TO
  defeat/VB
  their/PRP$
  dark/JJ
  vision/NN
  of/IN
  hatred/VBN
  and/CC
  fear/VBN
  by/IN
  offering/VBG
  the/DT
  hopeful/JJ
  alternative/NN
  of/IN
  political/JJ
  freedom/NN
 

(S
  The/DT
  terrorist/JJ
  surveillance/NN
  program/NN
  has/VBZ
  helped/VBN
  prevent/VB
  terrorist/JJ
  attacks/NNS
  ./.)
(S
  It/PRP
  remains/VBZ
  essential/JJ
  to/TO
  the/DT
  security/NN
  of/IN
  (NE America/NNP)
  ./.)
(S
  If/IN
  there/EX
  are/VBP
  people/NNS
  inside/IN
  our/PRP$
  country/NN
  who/WP
  are/VBP
  talking/VBG
  with/IN
  al/NN
  (NE Qaeda/NNP)
  ,/,
  we/PRP
  want/VBP
  to/TO
  know/VB
  about/IN
  it/PRP
  ,/,
  because/IN
  we/PRP
  will/MD
  not/RB
  sit/VB
  back/RB
  and/CC
  wait/NN
  to/TO
  be/VB
  hit/VBN
  again/RB
  ./.)
(S (/( (NE Applause/NNP) ./. )/))
(S
  In/IN
  all/PDT
  these/DT
  areas/NNS
  --/:
  from/IN
  the/DT
  disruption/NN
  of/IN
  terror/NN
  networks/NNS
  ,/,
  to/TO
  victory/NN
  in/IN
  (NE Iraq/NNP)
  ,/,
  to/TO
  the/DT
  spread/NN
  of/IN
  freedom/NN
  and/CC
  hope/NN
  in/IN
  troubled/JJ
  regions/NNS
  --/:
  we/PRP
  need/VBP
  the/DT
  support/NN
  of/IN
  our/PRP$
  friends/NNS
  and/CC
  allies/NNS
 

(S
  This/DT
  commission/NN
  should/MD
  include/VB
  members/NNS
  of/IN
  (NE Congress/NNP)
  of/IN
  both/DT
  parties/NNS
  ,/,
  and/CC
  offer/VBP
  bipartisan/JJ
  solutions/NNS
  ./.)
(S
  We/PRP
  need/VBP
  to/TO
  put/VB
  aside/RP
  partisan/JJ
  politics/NNS
  and/CC
  work/NN
  together/RB
  and/CC
  get/VB
  this/DT
  problem/NN
  solved/VBD
  ./.)
(S (/( (NE Applause/NNP) ./. )/))
(S
  Keeping/VBG
  (NE America/NNP)
  competitive/JJ
  requires/VBZ
  us/PRP
  to/TO
  open/VB
  more/JJR
  markets/NNS
  for/IN
  all/DT
  that/DT
  Americans/NNPS
  make/VBP
  and/CC
  grow/VB
  ./.)
(S
  One/CD
  out/NN
  of/IN
  every/DT
  five/CD
  factory/NN
  jobs/NNS
  in/IN
  (NE America/NNP)
  is/VBZ
  related/VBN
  to/TO
  global/JJ
  trade/NN
  ,/,
  and/CC
  we/PRP
  want/VBP
  people/NNS
  everywhere/RB
  to/TO
  buy/VB
  (NE American/NNP)
  ./.)
(S
  With/IN
  open/JJ
  markets/NNS
  and/CC
  a/DT
  level/JJ
  playing/NN
  field/NN
  ,/,
  no/DT
  one/NN
  can/MD
  out-produce

(S
  I/PRP
  urge/VBP
  you/PRP
  to/TO
  support/VB
  the/DT
  (NE American/JJ Competitiveness/NNP Initiative/NNP)
  ,/,
  and/CC
  together/RB
  we/PRP
  will/MD
  show/VB
  the/DT
  world/NN
  what/WP
  the/DT
  (NE American/JJ)
  people/NNS
  can/MD
  achieve/VB
  ./.)
(S
  (NE America/NNP)
  is/VBZ
  a/DT
  great/JJ
  force/NN
  for/IN
  freedom/NN
  and/CC
  prosperity/NN
  ./.)
(S
  Yet/RB
  our/PRP$
  greatness/NN
  is/VBZ
  not/RB
  measured/VBN
  in/IN
  power/NN
  or/CC
  luxuries/NNS
  ,/,
  but/CC
  by/IN
  who/WP
  we/PRP
  are/VBP
  and/CC
  how/WRB
  we/PRP
  treat/VBP
  one/CD
  another/DT
  ./.)
(S
  So/IN
  we/PRP
  strive/VBP
  to/TO
  be/VB
  a/DT
  compassionate/NN
  ,/,
  decent/NN
  ,/,
  hopeful/JJ
  society/NN
  ./.)
(S
  In/IN
  recent/JJ
  years/NNS
  ,/,
  (NE America/NNP)
  has/VBZ
  become/VBN
  a/DT
  more/RBR
  hopeful/JJ
  nation/NN
  ./.)
(S
  Violent/JJ
  crime/NN
  rates/NNS
  have/VBP
  fallen/VBN
  to/TO
  their/PRP$
  lowest/JJS
  levels/NNS
  si

(S
  We/PRP
  've/VBP
  entered/VBN
  a/DT
  great/JJ
  ideological/JJ
  conflict/NN
  we/PRP
  did/VBD
  nothing/NN
  to/TO
  invite/VB
  ./.)
(S
  We/PRP
  see/VBP
  great/JJ
  changes/NNS
  in/IN
  science/NN
  and/CC
  commerce/NN
  that/WDT
  will/MD
  influence/VB
  all/DT
  our/PRP$
  lives/NNS
  ./.)
(S
  Sometimes/RB
  it/PRP
  can/MD
  seem/VB
  that/DT
  history/NN
  is/VBZ
  turning/VBG
  in/IN
  a/DT
  wide/JJ
  arc/NN
  ,/,
  toward/IN
  an/DT
  unknown/JJ
  shore/NN
  ./.)
(S
  Yet/CC
  the/DT
  destination/NN
  of/IN
  history/NN
  is/VBZ
  determined/VBN
  by/IN
  human/JJ
  action/NN
  ,/,
  and/CC
  every/DT
  great/JJ
  movement/NN
  of/IN
  history/NN
  comes/VBZ
  to/TO
  a/DT
  point/NN
  of/IN
  choosing/NN
  ./.)
(S
  (NE Lincoln/NNP)
  could/MD
  have/VB
  accepted/VBN
  peace/NN
  at/IN
  the/DT
  cost/NN
  of/IN
  disunity/NN
  and/CC
  continued/JJ
  slavery/NN
  ./.)
(S
  (NE Martin/NNP Luther/NNP King/NNP)
  could/MD
  have/VB
  stopped/VBN
  at/IN
  (NE 

### Lemmatizing

In [21]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run


### Corpora

In [27]:
print(nltk.__file__)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/__init__.pyc


In [26]:
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
from nltk.corpus import gutenberg

# sample text
sample = gutenberg.raw("bible-kjv.txt")

tok = sent_tokenize(sample)

for x in range(5):
    print(tok[x])


[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep.
And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.


### Wordnet/ Similarity

In [29]:
from nltk.corpus import wordnet

In [30]:
syns = wordnet.synsets("program")

In [31]:
print(syns[0].name())

plan.n.01


In [32]:
print(syns[0].definition())

a series of steps to be carried out or goals to be accomplished


In [33]:
print(syns[0].examples())

[u'they drew up a six-step plan', u'they discussed plans for a new bond issue']


In [36]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        print("l:", l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
            
# print(set(synonyms))
# print('\n')
# print(set(antonyms))

('l:', Lemma('good.n.01.good'))
('l:', Lemma('good.n.02.good'))
('l:', Lemma('good.n.02.goodness'))
('l:', Lemma('good.n.03.good'))
('l:', Lemma('good.n.03.goodness'))
('l:', Lemma('commodity.n.01.commodity'))
('l:', Lemma('commodity.n.01.trade_good'))
('l:', Lemma('commodity.n.01.good'))
('l:', Lemma('good.a.01.good'))
('l:', Lemma('full.s.06.full'))
('l:', Lemma('full.s.06.good'))
('l:', Lemma('good.a.03.good'))
('l:', Lemma('estimable.s.02.estimable'))
('l:', Lemma('estimable.s.02.good'))
('l:', Lemma('estimable.s.02.honorable'))
('l:', Lemma('estimable.s.02.respectable'))
('l:', Lemma('beneficial.s.01.beneficial'))
('l:', Lemma('beneficial.s.01.good'))
('l:', Lemma('good.s.06.good'))
('l:', Lemma('good.s.07.good'))
('l:', Lemma('good.s.07.just'))
('l:', Lemma('good.s.07.upright'))
('l:', Lemma('adept.s.01.adept'))
('l:', Lemma('adept.s.01.expert'))
('l:', Lemma('adept.s.01.good'))
('l:', Lemma('adept.s.01.practiced'))
('l:', Lemma('adept.s.01.proficient'))
('l:', Lemma('adept.s.01.

In [39]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")

print(w1.wup_similarity(w2))

0.909090909091


In [40]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("car.n.01")

print(w1.wup_similarity(w2))

0.695652173913


In [44]:
w1 = wordnet.synset("cactus.n.01")
w2 = wordnet.synset("cat.n.01")

print(w1.wup_similarity(w2))

0.5


### Text Classification

In [45]:
import random
from nltk.corpus import movie_reviews

In [49]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

In [51]:
print(documents[1])

([u'alchemy', u'is', u'steeped', u'in', u'shades', u'of', u'blue', u'.', u'kieslowski', u"'", u's', u'blue', u',', u'that', u'is', u'.', u'with', u'its', u'examination', u'of', u'death', u',', u'isolation', u',', u'character', u'restoration', u',', u'and', u'recovery', u'from', u'loss', u',', u'suzanne', u'myers', u"'", u'new', u'independent', u'film', u'echoes', u'the', u'polish', u'director', u"'", u's', u'internationally', u'-', u'acclaimed', u'1993', u'release', u'.', u'language', u'aside', u',', u'the', u'principal', u'difference', u'between', u'the', u'films', u'is', u'that', u',', u'while', u'kieslowski', u'took', u'great', u'pains', u'to', u'draw', u'us', u'into', u'the', u'main', u'character', u"'", u's', u'world', u',', u'alchemy', u'keeps', u'its', u'viewers', u'at', u'arm', u"'", u's', u'length', u'.', u'as', u'a', u'result', u',', u'while', u'we', u"'", u're', u'able', u'to', u'appreciate', u'the', u'film', u"'", u's', u'intellectual', u'tapestry', u',', u'it', u'is', u'em

In [53]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

[(u',', 77717), (u'the', 76529), (u'.', 65876), (u'a', 38106), (u'and', 35576), (u'of', 34123), (u'to', 31937), (u"'", 30585), (u'is', 25195), (u'in', 21822), (u's', 18513), (u'"', 17612), (u'it', 16107), (u'that', 15924), (u'-', 15595)]
253
35


In [54]:
print(all_words["stupid"])
print(all_words['awesome'])

253
35


### Converting Words to Features w/ NLTK