## Text Preprocessing & Sentiment Analysis on Movie Reviews with Naive Bayes
- The Natural Language Toolkit (NLTK) is a Python library for processing text data, typically as a preprocessing step before text analysis.
- Naive Bayes is a probabilistic classification algorithm that is commonly used for text classification, including problems with more than two classes.
- Sentiment Analysis is a text classification technique that assigns sentiment labels to words, sentences or documents. Here we use two sentiment labels: positive and negative.
- Movie Reviews is a collection of movie reviews included in the NLTK corpora.
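
To make the Naive Bayes idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The four training reviews and the helper names `train_nb` and `predict_nb` are made up for illustration; they are not part of NLTK or of the Movie Reviews corpus.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Count words per label. labeled_docs is a list of (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Log prior: share of training documents carrying this label
        score = math.log(label_counts[label] / float(total_docs))
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood of the word given the label, smoothed by one
            score += math.log((word_counts[label][tok] + 1.0) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical, not taken from the NLTK corpus)
train = [
    ("a wonderful and moving film".split(), "pos"),
    ("great acting and a great story".split(), "pos"),
    ("a boring and predictable plot".split(), "neg"),
    ("terrible acting and a dull story".split(), "neg"),
]
model = train_nb(train)
prediction = predict_nb(model, "a great and moving story".split())
print(prediction)
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log likelihoods can simply be summed.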

In [1]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

### 1. Text Preprocessing
#### Tokenizing = splitting into parts
Separate a sentence into words

In [2]:
# Word Tokenizer
sentence = "Some people choose to see the ugliness in this world. The disarray. I Choose to see the beauty. You shouldn't eat vegetables and honey."
words = nltk.word_tokenize(sentence)
words

['Some',
 'people',
 'choose',
 'to',
 'see',
 'the',
 'ugliness',
 'in',
 'this',
 'world',
 '.',
 'The',
 'disarray',
 '.',
 'I',
 'Choose',
 'to',
 'see',
 'the',
 'beauty',
 '.',
 'You',
 'should',
 "n't",
 'eat',
 'vegetables',
 'and',
 'honey',
 '.']

##### Separate text into sentences

In [3]:
# Sentence Tokenizer
print(sent_tokenize(sentence))

['Some people choose to see the ugliness in this world.', 'The disarray.', 'I Choose to see the beauty.', "You shouldn't eat vegetables and honey."]
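
For intuition about what a tokenizer does, the following is a deliberately simplified regex-based sketch. It is not how `word_tokenize` or `sent_tokenize` work internally: for instance, it keeps "shouldn't" as one token instead of splitting off "n't", and an abbreviation such as "Dr." would wrongly end a sentence. The function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical.

```python
import re

def simple_word_tokenize(text):
    """Return words (keeping inner apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "The disarray. I Choose to see the beauty. You shouldn't eat vegetables and honey."
word_tokens = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)
print(word_tokens)
print(sentences)
```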


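### Attach a part-of-speech tag to each word
nltk.pos_tag uses the Penn Treebank tag set. Some common tags:
- DT, a determiner
- NN, a singular noun (NNS, a plural noun)
- VB, a verb in base form (VBP, present tense, not third person singular; VBZ, present tense, third person singular)
- JJ, an adjective
- RB, an adverb
- IN, a preposition or subordinating conjunction
- CC, a coordinating conjunction
- TO, the word "to"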

In [4]:
# Tag each word with its part of speech
tagged = nltk.pos_tag(words)
tagged[0:6]

[('Some', 'DT'),
 ('people', 'NNS'),
 ('choose', 'VBP'),
 ('to', 'TO'),
 ('see', 'VB'),
 ('the', 'DT')]
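
Once tokens are tagged, the tags can be used to count or filter words. A small sketch, using the six tagged tokens shown above as a hard-coded list (`tagged_sample`, copied by hand) so that it runs without NLTK:

```python
from collections import Counter

# The first six (word, tag) pairs from the output above, copied by hand
tagged_sample = [('Some', 'DT'), ('people', 'NNS'), ('choose', 'VBP'),
                 ('to', 'TO'), ('see', 'VB'), ('the', 'DT')]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_sample)

# Keep only verbs: every Penn Treebank verb tag starts with 'VB'
verbs = [word for word, tag in tagged_sample if tag.startswith('VB')]

print(tag_counts['DT'])
print(verbs)
```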

### Using NLTK resources
- corpora: large collections of texts from different domains
- lexicon: a list of words together with information about them, such as their meanings
- stopwords: very common words that carry little meaning for text analysis, such as "is", "at" and "to"

In [5]:
from nltk.corpus import stopwords
# Use a separate name so the imported stopwords corpus reader is not shadowed
stop_words = set(stopwords.words('english'))

# Remove stopwords from our sentence
filtered_sentence = [w for w in words if w not in stop_words]
"""
The above line of code is equivalent to:

filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
"""
print(filtered_sentence)

['Some', 'people', 'choose', 'see', 'ugliness', 'world', '.', 'The', 'disarray', '.', 'I', 'Choose', 'see', 'beauty', '.', 'You', "n't", 'eat', 'vegetables', 'honey', '.']
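
Note that 'Some', 'The', 'I' and 'You' survive the filter above, because the membership test is case-sensitive and the entries in the stopword list are lowercase. One possible fix is to lowercase each token for the lookup only. The sketch below uses a small hand-picked set (`toy_stop_words`, an assumption standing in for the NLTK list) so that it runs on its own:

```python
# A tiny stand-in for the NLTK stopword list (hypothetical subset)
toy_stop_words = {'some', 'the', 'i', 'you', 'to', 'in', 'this', 'and', 'should'}

tokens = ['Some', 'people', 'choose', 'to', 'see', 'the', 'ugliness', 'in', 'this', 'world', '.']

# Case-sensitive lookup: capitalised stopwords slip through
case_sensitive = [w for w in tokens if w not in toy_stop_words]

# Lowercase only for the lookup, so the surviving tokens keep their original case
case_insensitive = [w for w in tokens if w.lower() not in toy_stop_words]

print(case_sensitive)
print(case_insensitive)
```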


### Stemming and lemmatization reduce related word forms to a common base form
- Stemming: strips affixes by applying a set of rules. The result need not be a real word (e.g. "poorli").
- Lemmatization: applies morphological analysis with the help of a vocabulary (here WordNet) and returns the dictionary form of a word, its lemma.
    It gives better results when the part of speech of the word is known.

In [6]:
# Stemming 
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ['are', 'is', 'was', 'car','cars', 'went', 'important','to', 'poorly', 'by', 'goose', 'best','better']

for w in words:
    print(stemmer.stem(w))

are
is
wa
car
car
went
import
to
poorli
by
goos
best
better


In [7]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words_lem = ['are', 'is', 'was', 'car','cars', 'went', 'important','to', 'poorly', 'by', 'geese', 'best','better']
for w in words_lem:
    print(lemmatizer.lemmatize(w))

are
is
wa
car
car
went
important
to
poorly
by
goose
best
better


In [8]:
# Passing the part-of-speech tag to the lemmatizer gives more accurate results
print(lemmatizer.lemmatize("better", pos="a")) # a -> adjective
print(lemmatizer.lemmatize('loving'))
print(lemmatizer.lemmatize('loving', 'v')) # v->verb
print(lemmatizer.lemmatize("better", pos="a")) 

good
loving
love
good
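
Since `nltk.pos_tag` returns Penn Treebank tags while `lemmatize` expects WordNet part-of-speech letters, the two can be bridged with a small mapping. The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

sample = [('choose', 'VBP'), ('people', 'NNS'), ('better', 'JJR'), ('poorly', 'RB')]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in sample]
print(mapped)
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos)`.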


### Named Entity (NE) Chunking, or Partial Parsing
- Extracts short, well-formed phrases, or chunks, from a sentence
- Needs part-of-speech annotations in order to add NE labels to the sentence. The output is an nltk.Tree object.
- ne_chunk produces two-level trees:
  - Nodes on level 1: outside any chunk, e.g. and/CC
  - Nodes on level 2: inside a chunk, e.g. Mark/NNP, part of a PERSON chunk
  - The label of a chunk is the label of its subtree

In [9]:
from nltk import word_tokenize, pos_tag, ne_chunk
 
sentence = "Mark and John are working at Google." 
print(ne_chunk(pos_tag(word_tokenize(sentence))))

(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)
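
To pull the entities out of such a tree, one would normally walk the nltk.Tree object. As a dependency-free sketch of the same idea, the snippet below reads the bracketed text printed above and collects each chunk's label and words. The regex approach and the helper name `extract_chunks` are assumptions made for illustration, not an NLTK API.

```python
import re

# The printed tree from above, copied by hand
tree_text = """(S
  (PERSON Mark/NNP)
  and/CC
  (PERSON John/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)"""

def extract_chunks(text):
    """Return (label, phrase) pairs for every level-2 chunk in the printed tree."""
    chunks = []
    # Match "(LABEL word/TAG ...)" groups without nested parentheses;
    # the outer (S ...) never matches because its body contains parentheses
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", text):
        # Drop the "/TAG" suffix from each leaf to recover the word
        words = [leaf.rsplit('/', 1)[0] for leaf in body.split()]
        chunks.append((label, ' '.join(words)))
    return chunks

entities = extract_chunks(tree_text)
print(entities)
```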


### WordNet 
- Covers semantic and lexical relations between terms and their meanings, such as synonymy, hyponymy and polysemy.
- Synonyms for the word "success" found on Google: victory, triumph, prosperity, affluence, wealth, riches, fortune, opulence, luxury, comfort, benefit, gain, happiness

In [10]:
from nltk.corpus import wordnet
# Synsets (sets of synonyms) that contain the word 'success'
syns = wordnet.synsets("success")
syns

[Synset('success.n.01'),
 Synset('success.n.02'),
 Synset('success.n.03'),
 Synset('achiever.n.01')]

In [11]:
# Print the name of the first synset
print(syns[0].name())
# Print the name of that synset's first lemma, i.e. just the word
print(syns[0].lemmas()[0].name())

success.n.01
success


In [12]:
print(syns[0].definition())

an event that accomplishes its intended purpose


In [13]:
# Examples of the word in use
print(syns[0].examples())

[u"let's call heads a success and tails a failure", u'the election was a remarkable success for the Whigs']


#### Synonym vs Synset 
- synonym: a word or phrase with a meaning that is the same as, or very similar to, another word or phrase 
- synset: a set of one or more synonyms that are interchangeable in some context  

In [14]:
# Print synonyms & antonyms of 'good'
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            # Take just the first antonym lemma instead of the whole list
            antonyms.append(lemma.antonyms()[0].name())

print('synonyms({})'.format(set(synonyms)))
print('antonyms({})'.format(set(antonyms)))

synonyms(set([u'beneficial', u'right', u'secure', u'just', u'unspoilt', u'respectable', u'good', u'goodness', u'dear', u'salutary', u'ripe', u'expert', u'skillful', u'in_force', u'proficient', u'unspoiled', u'dependable', u'soundly', u'honorable', u'full', u'undecomposed', u'safe', u'adept', u'upright', u'trade_good', u'sound', u'in_effect', u'practiced', u'effective', u'commodity', u'estimable', u'well', u'honest', u'near', u'skilful', u'thoroughly', u'serious']))
antonyms(set([u'bad', u'badness', u'ill', u'evil', u'evilness']))


In [15]:
# Print 'similar to' synsets (none are listed for these noun synsets)
for ss in wordnet.synsets('success'):
    print(ss)
    for sim in ss.similar_tos():
        print('    {}'.format(sim))

Synset('success.n.01')
Synset('success.n.02')
Synset('success.n.03')
Synset('achiever.n.01')


In [16]:
# Semantic similarity (Wu-Palmer) between two synsets
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))

0.909090909091
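
`wup_similarity` computes the Wu-Palmer measure, which scores two concepts by how deep their lowest common ancestor (lcs) sits in the hypernym hierarchy: 2 * depth(lcs) / (depth(a) + depth(b)). The sketch below applies that formula to a small hand-made taxonomy. The `parent` dictionary is invented for illustration and is much shallower than WordNet's real hierarchy, so its score differs from the value above.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root)
parent = {
    'entity': None,
    'vehicle': 'entity',
    'vessel': 'vehicle',
    'ship': 'vessel',
    'boat': 'vessel',
    'car': 'vehicle',
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity on the toy taxonomy."""
    ancestors_a = path_to_root(a)
    # Walking up from b, the first node shared with a's path is the lowest common ancestor
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print(wup('ship', 'boat'))
print(wup('ship', 'car'))
```

Siblings that share a deep ancestor ('ship' and 'boat' under 'vessel') score higher than concepts that only meet near the root ('ship' and 'car' under 'vehicle').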
