# CHAPTER 7 Semantic and Sentiment Analysis


## Semantic Analysis


## Exploring WordNet


### Understanding Synsets


In [1]:
from nltk.corpus import wordnet as wn
import pandas as pd


term = 'fruit'
synsets = wn.synsets(term)

print 'Total Synsets:', len(synsets)


Total Synsets: 5


In [2]:
# synsets for fruit
for synset in synsets:
    print 'Synset:', synset
    print 'Part of speech:', synset.lexname()
    print 'Definition:', synset.definition()
    print 'Lemmas:', synset.lemma_names()
    print 'Examples:', synset.examples()
    print



Synset: Synset('fruit.n.01')
Part of speech: noun.plant
Definition: the ripened reproductive body of a seed plant
Lemmas: [u'fruit']
Examples: []

Synset: Synset('yield.n.03')
Part of speech: noun.artifact
Definition: an amount of a product
Lemmas: [u'yield', u'fruit']
Examples: []

Synset: Synset('fruit.n.03')
Part of speech: noun.event
Definition: the consequence of some effort or action
Lemmas: [u'fruit']
Examples: [u'he lived long enough to see the fruit of his policies']

Synset: Synset('fruit.v.01')
Part of speech: verb.creation
Definition: cause to bear fruit
Lemmas: [u'fruit']
Examples: []

Synset: Synset('fruit.v.02')
Part of speech: verb.creation
Definition: bear fruit
Lemmas: [u'fruit']
Examples: [u'the trees fruited early this year']



### Analyzing Lexical Semantic Relations


In [3]:
# entailments
for action in ['walk', 'eat', 'digest']:
    action_syn = wn.synsets(action, pos='v')[0]
    print action_syn, '-- entails -->', action_syn.entailments()


Synset('walk.v.01') -- entails --> [Synset('step.v.01')]
Synset('eat.v.01') -- entails --> [Synset('chew.v.01'), Synset('swallow.v.01')]
Synset('digest.v.01') -- entails --> [Synset('consume.v.02')]


In [4]:
# homonyms/homographs
for synset in wn.synsets('bank'):
    print synset.name(),'-',synset.definition()


bank.n.01 - sloping land (especially the slope beside a body of water)
depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
bank.n.03 - a long ridge or pile
bank.n.04 - an arrangement of similar objects in a row or in tiers
bank.n.05 - a supply or stock held in reserve for future use (especially in emergencies)
bank.n.06 - the funds held by a gambling house or the dealer in some gambling games
bank.n.07 - a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
savings_bank.n.02 - a container (usually with a slot in the top) for keeping money at home
bank.n.09 - a building in which the business of banking transacted
bank.n.10 - a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
bank.v.01 - tip laterally
bank.v.02 - enclose with a bank
bank.v.03 - do business with a bank or keep an account at 

In [5]:
# synonyms and antonyms
term = 'large'
synsets = wn.synsets(term)
adj_large = synsets[1]
adj_large = adj_large.lemmas()[0]
adj_large_synonym = adj_large.synset()
adj_large_antonym = adj_large.antonyms()[0].synset()

print 'Synonym:', adj_large_synonym.name()
print 'Definition:', adj_large_synonym.definition()
print 'Antonym:', adj_large_antonym.name()
print 'Definition:', adj_large_antonym.definition()
print



Synonym: large.a.01
Definition: above average in size or number or quantity or magnitude or extent
Antonym: small.a.01
Definition: limited or below average in number or quantity or magnitude or extent



In [6]:
term = 'rich'
synsets = wn.synsets(term)[:3]

for synset in synsets:
    rich = synset.lemmas()[0]
    rich_synonym = rich.synset()
    rich_antonym = rich.antonyms()[0].synset()
    print 'Synonym:', rich_synonym.name()
    print 'Definition:', rich_synonym.definition()
    print 'Antonym:', rich_antonym.name()
    print 'Definition:', rich_antonym.definition()
    print



Synonym: rich_people.n.01
Definition: people who have possessions and wealth (considered as a group)
Antonym: poor_people.n.01
Definition: people without possessions or wealth (considered as a group)

Synonym: rich.a.01
Definition: possessing material wealth
Antonym: poor.a.02
Definition: having little money or few possessions

Synonym: rich.a.02
Definition: having an abundant supply of desirable qualities or substances (especially natural resources)
Antonym: poor.a.04
Definition: lacking in specific resources, qualities or substances



In [7]:
# hyponyms and hypernyms
term = 'tree'
synsets = wn.synsets(term)
tree = synsets[0]

print 'Name:', tree.name()
print 'Definition:', tree.definition()


Name: tree.n.01
Definition: a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms


In [8]:
hyponyms = tree.hyponyms()
print 'Total Hyponyms:', len(hyponyms)
print 'Sample Hyponyms'
for hyponym in hyponyms[:10]:
    print hyponym.name(), '-', hyponym.definition()
    print


Total Hyponyms: 180
Sample Hyponyms
aalii.n.01 - a small Hawaiian tree with hard dark wood

acacia.n.01 - any of various spiny trees or shrubs of the genus Acacia

african_walnut.n.01 - tropical African timber tree with wood that resembles mahogany

albizzia.n.01 - any of numerous trees of the genus Albizia

alder.n.02 - north temperate shrubs or trees having toothed leaves and conelike fruit; bark is used in tanning and dyeing and the wood is rot-resistant

angelim.n.01 - any of several tropical American trees of the genus Andira

angiospermous_tree.n.01 - any tree having seeds and ovules contained in the ovary

anise_tree.n.01 - any of several evergreen shrubs and small trees of the genus Illicium

arbor.n.01 - tree (as opposed to shrub)

aroeira_blanca.n.01 - small resinous tree or shrub of Brazil



In [9]:
hypernyms = tree.hypernyms()
print hypernyms


[Synset('woody_plant.n.01')]


In [10]:
hypernym_paths = tree.hypernym_paths()
print 'Total Hypernym paths:', len(hypernym_paths)

print 'Hypernym Hierarchy'
print ' -> '.join(synset.name() for synset in hypernym_paths[0])


Total Hypernym paths: 1
Hypernym Hierarchy
entity.n.01 -> physical_entity.n.01 -> object.n.01 -> whole.n.02 -> living_thing.n.01 -> organism.n.01 -> plant.n.02 -> vascular_plant.n.01 -> woody_plant.n.01 -> tree.n.01


In [11]:
# holonyms and meronyms

# member holonyms
member_holonyms = tree.member_holonyms()    
print 'Total Member Holonyms:', len(member_holonyms)
print 'Member Holonyms for [tree]:-'
for holonym in member_holonyms:
    print holonym.name(), '-', holonym.definition()
    print


Total Member Holonyms: 1
Member Holonyms for [tree]:-
forest.n.01 - the trees and other plants in a large densely wooded area



In [12]:
# part meronyms
part_meronyms = tree.part_meronyms()
print 'Total Part Meronyms:', len(part_meronyms)
print 'Part Meronyms for [tree]:-'
for meronym in part_meronyms:
    print meronym.name(), '-', meronym.definition()
    print


Total Part Meronyms: 5
Part Meronyms for [tree]:-
burl.n.02 - a large rounded outgrowth on the trunk or branch of a tree

crown.n.07 - the upper branches and leaves of a tree or other plant

limb.n.02 - any of the main branches arising from the trunk or a bough of a tree

stump.n.01 - the base part of a tree that remains standing after the tree has been felled

trunk.n.01 - the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber



In [13]:
# substance meronyms
substance_meronyms = tree.substance_meronyms()    
print 'Total Substance Meronyms:', len(substance_meronyms)
print 'Substance Meronyms for [tree]:-'
for meronym in substance_meronyms:
    print meronym.name(), '-', meronym.definition()
    print


Total Substance Meronyms: 2
Substance Meronyms for [tree]:-
heartwood.n.01 - the older inactive central wood of a tree or woody plant; usually darker and denser than the surrounding sapwood

sapwood.n.01 - newly formed outer wood lying between the cambium and the heartwood of a tree or woody plant; usually light colored; active in water conduction



### Semantic Relationships and Similarity


In [14]:
# semantic relationships and similarities
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree, lion, tiger, cat, dog]
entity_names = [entity.name().split('.')[0] for entity in entities]
entity_definitions = [entity.definition() for entity in entities]

for entity, definition in zip(entity_names, entity_definitions):
    print entity, '-', definition
    print


tree - a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms

lion - large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male

tiger - large feline of forests in most of Asia having a tawny coat with black stripes; endangered

cat - feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats

dog - a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds



In [15]:
common_hypernyms = []
for entity in entities:
    # get pairwise lowest common hypernyms
    common_hypernyms.append([entity.lowest_common_hypernyms(compared_entity)[0]
                                            .name().split('.')[0]
                             for compared_entity in entities])
# build pairwise lowest common hypernym matrix
common_hypernym_frame = pd.DataFrame(common_hypernyms,
                                     index=entity_names, 
                                     columns=entity_names)
                                     
print common_hypernym_frame    


           tree       lion      tiger        cat        dog
tree       tree   organism   organism   organism   organism
lion   organism       lion    big_cat     feline  carnivore
tiger  organism    big_cat      tiger     feline  carnivore
cat    organism     feline     feline        cat  carnivore
dog    organism  carnivore  carnivore  carnivore        dog
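Each cell of the matrix above holds the deepest ancestor that two synsets share. The following is a minimal sketch of that lookup, assuming a hand-built, single-parent toy taxonomy (the `PARENT` dictionary is hypothetical and much shallower than the real WordNet hierarchy); it is not how NLTK implements `lowest_common_hypernyms`.

```python
# Toy single-parent taxonomy (child -> parent); a hypothetical simplification
# of the WordNet hierarchy, not data read from WordNet.
PARENT = {
    'lion': 'big_cat',
    'tiger': 'big_cat',
    'big_cat': 'feline',
    'cat': 'feline',
    'feline': 'carnivore',
    'dog': 'canine',
    'canine': 'carnivore',
    'carnivore': 'organism',
    'tree': 'woody_plant',
    'woody_plant': 'organism',
}

def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_hypernym(first, second):
    """Return the first ancestor of `first` that is also an ancestor of `second`."""
    second_ancestors = set(ancestors(second))
    for candidate in ancestors(first):
        if candidate in second_ancestors:
            return candidate
    return None

print(lowest_common_hypernym('lion', 'tiger'))
print(lowest_common_hypernym('lion', 'cat'))
print(lowest_common_hypernym('cat', 'dog'))
print(lowest_common_hypernym('tree', 'dog'))
```

Because a node counts as its own ancestor, comparing an entity with itself returns that entity, which is why the diagonal of the matrix repeats the entity names.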


In [16]:
similarities = []
for entity in entities:
    # get pairwise similarities
    similarities.append([round(entity.path_similarity(compared_entity), 2)
                         for compared_entity in entities])        
# build pairwise similarity matrix                             
similarity_frame = pd.DataFrame(similarities,
                                index=entity_names, 
                                columns=entity_names)
                                     
print similarity_frame 


       tree  lion  tiger   cat   dog
tree   1.00  0.07   0.07  0.08  0.13
lion   0.07  1.00   0.33  0.25  0.17
tiger  0.07  0.33   1.00  0.25  0.17
cat    0.08  0.25   0.25  1.00  0.20
dog    0.13  0.17   0.17  0.20  1.00
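The scores above follow from path lengths in the hypernym hierarchy: path similarity is 1 / (1 + number of edges on the shortest path between two synsets). Here is a small sketch of that calculation, assuming a hypothetical toy taxonomy for the carnivore branch only (`PARENT` is hand-written, not read from WordNet).

```python
# Toy taxonomy (child -> parent) for the carnivore branch only; a hypothetical
# simplification chosen so the path lengths mirror the scores printed above.
PARENT = {
    'lion': 'big_cat', 'tiger': 'big_cat', 'big_cat': 'feline',
    'cat': 'feline', 'feline': 'carnivore',
    'dog': 'canine', 'canine': 'carnivore',
}

def depth_map(node):
    """Map each ancestor of `node` (including itself) to its distance from `node`."""
    distances, steps = {}, 0
    while node is not None:
        distances[node] = steps
        node = PARENT.get(node)
        steps += 1
    return distances

def path_similarity(first, second):
    """Return 1 / (1 + number of edges on the shortest path between two nodes)."""
    first_up, second_up = depth_map(first), depth_map(second)
    shared = set(first_up) & set(second_up)
    if not shared:
        return None
    shortest = min(first_up[node] + second_up[node] for node in shared)
    return 1.0 / (1 + shortest)

for pair in [('lion', 'tiger'), ('lion', 'cat'), ('cat', 'dog'), ('lion', 'dog')]:
    print('{} ~ {}: {:.2f}'.format(pair[0], pair[1], path_similarity(*pair)))
```

Lion and tiger are two edges apart (via `big_cat`), giving 1/3, while lion and dog are five edges apart (via `carnivore`), giving 1/6. The much lower scores for `tree` reflect the long path up to `organism` and back down.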


## Word Sense Disambiguation


In [17]:
from nltk.wsd import lesk
from nltk import word_tokenize

samples = [('The fruits on that plant have ripened', 'n'),
           ('He finally reaped the fruit of his hard work as he won the race', 'n')]

word = 'fruit'
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print 'Sentence:', sentence
    print 'Word synset:', word_syn
    print 'Corresponding definition:', word_syn.definition()
    print


Sentence: The fruits on that plant have ripened
Word synset: Synset('fruit.n.01')
Corresponding definition: the ripened reproductive body of a seed plant

Sentence: He finally reaped the fruit of his hard work as he won the race
Word synset: Synset('fruit.n.03')
Corresponding definition: the consequence of some effort or action
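The Lesk approach picks the sense whose dictionary gloss overlaps most with the words around the ambiguous term. Below is a simplified sketch of that idea, assuming hard-coded glosses (copied from the noun synsets for 'fruit' listed earlier) and plain whitespace tokenization; `simple_lesk` is an illustrative helper, and NLTK's `lesk` may differ in details such as tie-breaking.

```python
# Glosses copied from the synset listing earlier in this chapter; the sense
# inventory is hard-coded so the sketch runs without WordNet.
GLOSSES = {
    'fruit.n.01': 'the ripened reproductive body of a seed plant',
    'yield.n.03': 'an amount of a product',
    'fruit.n.03': 'the consequence of some effort or action',
}

def simple_lesk(context_tokens, glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense in sorted(glosses):
        overlap = len(context & set(glosses[sense].split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap

tokens = 'the fruits on that plant have ripened'.split()
print(simple_lesk(tokens, GLOSSES))
```

For this sentence the gloss of `fruit.n.01` shares three words with the context ('the', 'plant', 'ripened'), more than either competing sense. Note that function words such as 'the' contribute to the count, so short or generic contexts can easily produce ties.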



In [18]:
samples = [('Lead is a very soft, malleable metal', 'n'),
           ('John is the actor who plays the lead in that movie', 'n'),
           ('This road leads to nowhere', 'v')]
word = 'lead'

for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print 'Sentence:', sentence
    print 'Word synset:', word_syn
    print 'Corresponding definition:', word_syn.definition()
    print



Sentence: Lead is a very soft, malleable metal
Word synset: Synset('lead.n.02')
Corresponding definition: a soft heavy toxic malleable metallic element; bluish white when freshly cut but tarnishes readily to dull grey

Sentence: John is the actor who plays the lead in that movie
Word synset: Synset('star.n.04')
Corresponding definition: an actor who plays a principal role

Sentence: This road leads to nowhere
Word synset: Synset('run.v.23')
Corresponding definition: cause something to pass or lead somewhere



## Named Entity Recognition


In [19]:
text = """
Bayern Munich, or FC Bayern, is a German sports club based in Munich, 
Bavaria, Germany. It is best known for its professional football team, 
which plays in the Bundesliga, the top tier of the German football 
league system, and is the most successful club in German football 
history, having won a record 26 national titles and 18 national cups. 
FC Bayern was founded in 1900 by eleven football players led by Franz John. 
Although Bayern won its first national championship in 1932, the club 
was not selected for the Bundesliga at its inception in 1963. The club 
had its period of greatest success in the middle of the 1970s when, 
under the captaincy of Franz Beckenbauer, it won the European Cup three 
times in a row (1974-76). Overall, Bayern has reached ten UEFA Champions 
League finals, most recently winning their fifth title in 2013 as part 
of a continental treble. 
"""


In [20]:
import nltk
from normalization import parse_document
import pandas as pd


In [21]:
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]


In [22]:
# nltk NER
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # only chunk subtrees carry a label; plain (token, tag) tuples do not
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))
                
named_entities = list(set(named_entities))
entity_frame = pd.DataFrame(named_entities, 
                            columns=['Entity Name', 'Entity Type'])
print entity_frame    


          Entity Name   Entity Type
0              Bayern        PERSON
1          Franz John        PERSON
2   Franz Beckenbauer        PERSON
3              Munich  ORGANIZATION
4            European  ORGANIZATION
5          Bundesliga  ORGANIZATION
6              German           GPE
7             Bavaria           GPE
8             Germany           GPE
9           FC Bayern  ORGANIZATION
10               UEFA  ORGANIZATION
11             Munich           GPE
12             Bayern           GPE
13            Overall           GPE


In [23]:
# set java path
import os
java_path = r'/usr/bin/java'
os.environ['JAVAHOME'] = java_path


In [24]:
from nltk.tag import StanfordNERTagger
sn = StanfordNERTagger('/pub/nltk-parser/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       path_to_jar='/pub/nltk-parser/stanford-ner/stanford-ner.jar')

In [25]:
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]


In [26]:
named_entities = []
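for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        if tag != 'O':
            # extend the entity currently being built
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)
        else:
            # an 'O' tag closes any entity in progress
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    # flush an entity that runs up to the last token of the sentence
    if temp_named_entity:
        named_entities.append(temp_named_entity)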

named_entities = list(set(named_entities))
entity_frame = pd.DataFrame(named_entities, 
                            columns=['Entity Name', 'Entity Type'])
print entity_frame                       

         Entity Name   Entity Type
0         Franz John        PERSON
1  Franz Beckenbauer        PERSON
2            Germany      LOCATION
3             Bayern  ORGANIZATION
4            Bavaria      LOCATION
5             Munich      LOCATION
6          FC Bayern  ORGANIZATION
7      Bayern Munich  ORGANIZATION
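The tagger labels one token at a time, so multi-word names such as 'Franz John' have to be reassembled from runs of consecutive tokens that share a tag. One compact way to sketch this, assuming the same `(token, tag)` format and using a hypothetical tagged sentence rather than real tagger output, is `itertools.groupby`:

```python
from itertools import groupby

# Hypothetical tagger output in the (token, tag) format used above.
tagged_sentence = [
    ('FC', 'ORGANIZATION'), ('Bayern', 'ORGANIZATION'), ('was', 'O'),
    ('founded', 'O'), ('by', 'O'), ('Franz', 'PERSON'), ('John', 'PERSON'),
]

def extract_entities(tagged_tokens):
    """Merge runs of consecutive tokens that share a non-'O' tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in group), tag))
    return entities

print(extract_entities(tagged_sentence))
```

Grouping on the tag has two useful side effects: an entity that ends the sentence is still emitted, and two adjacent entities of different types are kept apart instead of being merged into one name.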


## Analyzing Semantic Representations


In [27]:
import nltk
import pandas as pd
import os
# assign symbols and propositions
symbol_P = 'P'
symbol_Q = 'Q'
proposition_P = 'He is hungry'
proposition_Q = 'He will eat a sandwich'
# assign various truth values to the propositions
p_statuses = [False, False, True, True]
q_statuses = [False, True, False, True]
# assign the various expressions combining the logical operators
conjunction = '(P & Q)'
disjunction = '(P | Q)'
implication = '(P -> Q)'
equivalence = '(P <-> Q)'
expressions = [conjunction, disjunction, implication, equivalence]


results = []
for status_p, status_q in zip(p_statuses, q_statuses):
    dom = set([])
    val = nltk.Valuation([(symbol_P, status_p), 
                          (symbol_Q, status_q)])
    assignments = nltk.Assignment(dom)
    model = nltk.Model(dom, val)
    row = [status_p, status_q]
    for expression in expressions:
        result = model.evaluate(expression, assignments)
        row.append(result)
    results.append(row)
    
columns = [symbol_P, symbol_Q, conjunction, 
           disjunction, implication, equivalence]           
result_frame = pd.DataFrame(results, columns=columns)

print 'P:', proposition_P
print 'Q:', proposition_Q
print
print 'Expression Outcomes:-'
print result_frame   


P: He is hungry
Q: He will eat a sandwich

Expression Outcomes:-
       P      Q (P & Q) (P | Q) (P -> Q) (P <-> Q)
0  False  False   False   False     True      True
1  False   True   False    True     True     False
2   True  False   False    True    False     False
3   True   True    True    True     True      True
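The same truth table can be derived with plain Python booleans, which makes the meaning of each operator explicit. This is a sketch for cross-checking the table above, not a replacement for `nltk.Model`; the function name `truth_table` is illustrative.

```python
from itertools import product

def truth_table():
    """Return rows of (P, Q, P & Q, P | Q, P -> Q, P <-> Q)."""
    rows = []
    for p, q in product([False, True], repeat=2):
        conjunction = p and q
        disjunction = p or q
        implication = (not p) or q      # false only when P holds and Q does not
        equivalence = p == q            # true when both sides agree
        rows.append((p, q, conjunction, disjunction, implication, equivalence))
    return rows

for row in truth_table():
    print(' '.join(str(value).ljust(5) for value in row))
```

The implication column is the one that most often surprises: 'He is hungry -> He will eat a sandwich' is only falsified by the row where he is hungry and does not eat.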


In [28]:
# first order logic

read_expr = nltk.sem.Expression.fromstring
os.environ['PROVER9'] = r'/usr/local/bin/prover9'
prover = nltk.Prover9()
#prover = nltk.ResolutionProver()   


In [29]:
# set the rule expression
rule = read_expr('all x. all y. (jumps_over(x, y) -> -jumps_over(y, x))')
# set the event that occurred
event = read_expr('jumps_over(fox, dog)')
# set the outcome we want to evaluate -- the goal
test_outcome = read_expr('jumps_over(dog, fox)')

# get the result
prover.prove(goal=test_outcome, 
             assumptions=[event, rule],
             verbose=True)


[Found prover9: /usr/local/bin/prover9/bin/prover9]
Calling: /usr/local/bin/prover9/bin/prover9
Args: []
Input:
 assign(max_seconds, 60).

clear(auto_denials).
formulas(assumptions).
    jumps_over(fox,dog).
    all x all y (jumps_over(x,y) -> -(jumps_over(y,x))).
end_of_list.

formulas(goals).
    jumps_over(dog,fox).
end_of_list.

 

Return code: 2
stdout:
Prover9 (64) version 2009-11A, November 2009.
Process 23030 was started by cskim on ubuva,
Mon Jun 19 23:24:40 2017
The command was "/usr/local/bin/prover9/bin/prover9".

assign(max_seconds,60).
clear(auto_denials).

formulas(assumptions).
jumps_over(fox,dog).
(all x all y (jumps_over(x,y) -> -jumps_over(y,x))).
end_of_list.

formulas(goals).
jumps_over(dog,fox).
end_of_list.



% Formulas that are not ordinary clauses:
1 (all x all y (jumps_over(x,y) -> -jumps_over(y,x))) # label(non_clause).  [assumption].
2 jumps_over(dog,fox) # label(non_clause) # label(goal).  [goal].



% Clauses before input processing:

formulas(usable).
en

False

In [30]:
# set the rule expression                          
rule = read_expr('all x. (studies(x, exam) -> pass(x, exam))') 
# set the events and outcomes we want to determine
event1 = read_expr('-studies(John, exam)')  
test_outcome1 = read_expr('pass(John, exam)') 
event2 = read_expr('studies(Pierre, exam)')  
test_outcome2 = read_expr('pass(Pierre, exam)') 

prover.prove(goal=test_outcome1, 
             assumptions=[event1, rule],
             verbose=True)  


Calling: /usr/local/bin/prover9/bin/prover9
Args: []
Input:
 assign(max_seconds, 60).

clear(auto_denials).
formulas(assumptions).
    -(studies(John,exam)).
    all x (studies(x,exam) -> pass(x,exam)).
end_of_list.

formulas(goals).
    pass(John,exam).
end_of_list.

 

Return code: 2
stdout:
Prover9 (64) version 2009-11A, November 2009.
Process 23031 was started by cskim on ubuva,
Mon Jun 19 23:24:40 2017
The command was "/usr/local/bin/prover9/bin/prover9".

assign(max_seconds,60).
clear(auto_denials).

formulas(assumptions).
-studies(John,exam).
(all x (studies(x,exam) -> pass(x,exam))).
end_of_list.

formulas(goals).
pass(John,exam).
end_of_list.



% Formulas that are not ordinary clauses:
1 (all x (studies(x,exam) -> pass(x,exam))) # label(non_clause).  [assumption].
2 pass(John,exam) # label(non_clause) # label(goal).  [goal].



% Clauses before input processing:

formulas(usable).
end_of_list.

formulas(sos).
-studies(John,exam).  [assumption].
-studies(x,exam) | pass(x,exam)

False

In [31]:
prover.prove(goal=test_outcome2, 
             assumptions=[event2, rule],
             verbose=True)               


Calling: /usr/local/bin/prover9/bin/prover9
Args: []
Input:
 assign(max_seconds, 60).

clear(auto_denials).
formulas(assumptions).
    studies(Pierre,exam).
    all x (studies(x,exam) -> pass(x,exam)).
end_of_list.

formulas(goals).
    pass(Pierre,exam).
end_of_list.

 

Return code: 0
stdout:
Prover9 (64) version 2009-11A, November 2009.
Process 23032 was started by cskim on ubuva,
Mon Jun 19 23:24:40 2017
The command was "/usr/local/bin/prover9/bin/prover9".

assign(max_seconds,60).
clear(auto_denials).

formulas(assumptions).
studies(Pierre,exam).
(all x (studies(x,exam) -> pass(x,exam))).
end_of_list.

formulas(goals).
pass(Pierre,exam).
end_of_list.



% Formulas that are not ordinary clauses:
1 (all x (studies(x,exam) -> pass(x,exam))) # label(non_clause).  [assumption].
2 pass(Pierre,exam) # label(non_clause) # label(goal).  [goal].



% Clauses before input processing:

formulas(usable).
end_of_list.

formulas(sos).
studies(Pierre,exam).  [assumption].
-studies(x,exam) | pass(

True
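A result of `False` from the prover means the goal could not be derived from the assumptions, not that the goal is false. Knowing that John did not study tells us nothing about whether he passes, because the rule only runs from studying to passing. The sketch below illustrates this with naive forward chaining, a much simpler technique than the search Prover9 performs; the triple representation and the `entails` helper are illustrative assumptions and only handle rules of the form premise(x, y) -> conclusion(x, y).

```python
# Facts are (predicate, subject, object) triples; the rule mirrors
# all x. (studies(x, exam) -> pass(x, exam)). All names here are illustrative.
RULES = [('studies', 'pass')]          # premise predicate -> conclusion predicate

def entails(facts, goal, rules=RULES):
    """Return True if `goal` can be derived from `facts` by applying the rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, subject, obj in list(known):
                derived = (conclusion, subject, obj)
                if predicate == premise and derived not in known:
                    known.add(derived)
                    changed = True
    return goal in known

# Pierre studies, so the rule fires and the goal is derived.
print(entails([('studies', 'Pierre', 'exam')], ('pass', 'Pierre', 'exam')))
# -studies(John, exam) contributes no positive fact, so nothing can be derived.
print(entails([], ('pass', 'John', 'exam')))
```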

In [32]:
# define symbols (entities/functions) and their values
rules = """
    rover => r
    felix => f
    garfield => g
    alex => a
    dog => {r, a}
    cat => {g}
    fox => {f}
    runs => {a, f}
    sleeps => {r, g}
    jumps_over => {(f, g), (a, g), (f, r), (a, r)}
    """
val = nltk.Valuation.fromstring(rules)

print val


{'rover': 'r', 'runs': set([('f',), ('a',)]), 'alex': 'a', 'sleeps': set([('r',), ('g',)]), 'felix': 'f', 'fox': set([('f',)]), 'dog': set([('a',), ('r',)]), 'jumps_over': set([('a', 'g'), ('f', 'g'), ('a', 'r'), ('f', 'r')]), 'cat': set([('g',)]), 'garfield': 'g'}


In [33]:
dom = {'r', 'f', 'g', 'a'}
m = nltk.Model(dom, val)

print m.evaluate('jumps_over(felix, rover) & dog(rover) & runs(rover)', None)
print m.evaluate('jumps_over(felix, rover) & dog(rover) & -runs(rover)', None)
print m.evaluate('jumps_over(alex, garfield) & dog(alex) & cat(garfield) & sleeps(garfield)', None)


False
True
True


In [34]:
g = nltk.Assignment(dom, [('x', 'r'), ('y', 'f')])   
print m.evaluate('runs(y) & jumps_over(y, x) & sleeps(x)', g)   
print m.evaluate('exists y. (fox(y) & runs(y))', g)     



True
True


In [35]:
formula = read_expr('runs(x)')
print m.satisfiers(formula, 'x', g)  


set(['a', 'f'])


In [36]:
formula = read_expr('runs(x) & fox(x)')
print m.satisfiers(formula, 'x', g)              


set(['f'])
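Finding the satisfiers of an open formula amounts to testing every individual in the domain against it. The sketch below mirrors that with plain Python sets, assuming the unary predicates `runs` and `fox` from the valuation above; the `satisfiers` helper here is an illustrative stand-in, not the NLTK method.

```python
# Unary predicates from the valuation above, written as plain Python sets.
domain = {'r', 'f', 'g', 'a'}
runs = {'a', 'f'}
fox = {'f'}

def satisfiers(condition, domain):
    """Return every individual in `domain` for which `condition` holds."""
    return {individual for individual in domain if condition(individual)}

print(sorted(satisfiers(lambda x: x in runs, domain)))
print(sorted(satisfiers(lambda x: x in runs and x in fox, domain)))
```

For unary predicates, a conjunction such as `runs(x) & fox(x)` reduces to a set intersection, which is why only `f` (felix) survives the second query.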


## Sentiment Analysis


### Feature Extraction


In [37]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_feature_matrix(documents, feature_type='frequency',
                         ngram_range=(1, 1), min_df=0.0, max_df=1.0):

    feature_type = feature_type.lower().strip()  
    
    if feature_type == 'binary':
        vectorizer = CountVectorizer(binary=True, min_df=min_df,
                                     max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'frequency':
        vectorizer = CountVectorizer(binary=False, min_df=min_df,
                                     max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'tfidf':
        vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, 
                                     ngram_range=ngram_range)
    else:
        raise ValueError("Wrong feature type entered. Possible values: 'binary', 'frequency', 'tfidf'")

    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    
    return vectorizer, feature_matrix
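The three `feature_type` options differ only in how a term's presence in a document is weighted. The following pure-Python sketch computes all three for a toy corpus. It assumes whitespace tokenization (scikit-learn's default tokenizer behaves differently) and the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which matches `TfidfVectorizer`'s default settings as far as I am aware; the corpus and variable names are hypothetical.

```python
import math
from collections import Counter

docs = ['good movie', 'bad movie', 'good good plot']   # toy corpus (hypothetical)
tokenized = [doc.split() for doc in docs]
vocabulary = sorted(set(token for tokens in tokenized for token in tokens))

# frequency: raw term counts per document
frequency = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]
# binary: 1 if the term occurs at all, else 0
binary = [[int(count > 0) for count in row] for row in frequency]

# tf-idf: count * smoothed idf, then scale each row to unit Euclidean length
n_docs = len(docs)
doc_freq = [sum(row[i] for row in binary) for i in range(len(vocabulary))]
idf = [math.log((1.0 + n_docs) / (1.0 + df)) + 1.0 for df in doc_freq]

def l2_normalize(row):
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm else row

tfidf = [l2_normalize([count * weight for count, weight in zip(row, idf)])
         for row in frequency]

print(vocabulary)
print(frequency)
print(binary)
print([[round(value, 2) for value in row] for row in tfidf])
```

In the third document 'good' appears twice, so it counts 2 under `frequency` but 1 under `binary`; under `tfidf` the rarer term 'plot' gets a higher idf weight than 'good', which partly offsets the repeated count.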
    
    


### Model Performance Evaluation


In [38]:
from sklearn import metrics
import numpy as np
import pandas as pd

def display_evaluation_metrics(true_labels, predicted_labels, positive_class=1):
    print 'Accuracy:', np.round(metrics.accuracy_score(true_labels, predicted_labels), 2)
    print 'Precision:', np.round(
        metrics.precision_score(
            true_labels, predicted_labels, pos_label=positive_class, average='binary'), 2)
    print 'Recall:', np.round(
        metrics.recall_score(
            true_labels, predicted_labels, pos_label=positive_class, average='binary'), 2)
    print 'F1 Score:', np.round(
        metrics.f1_score(
            true_labels, predicted_labels, pos_label=positive_class, average='binary'), 2)
                        
def display_confusion_matrix(true_labels, predicted_labels, classes=[1,0]):
    
    cm = metrics.confusion_matrix(y_true=true_labels, 
                                  y_pred=predicted_labels, 
                                  labels=classes)
    cm_frame = pd.DataFrame(data=cm, 
                            columns=pd.MultiIndex(levels=[['Predicted:'], classes], 
                                                  labels=[[0,0],[0,1]]), 
                            index=pd.MultiIndex(levels=[['Actual:'], classes], 
                                                labels=[[0,0],[0,1]])) 
    print cm_frame                            


def display_classification_report(true_labels, predicted_labels, classes=[1,0]):

    report = metrics.classification_report(y_true=true_labels, 
                                           y_pred=predicted_labels, 
                                           labels=classes) 
    print report

### Extract Review Data

In [39]:
import pandas as pd
import numpy as np
import os


In [40]:
labels = {'pos': 'positive', 'neg': 'negative'}

# collect the rows in a list and build the frame once, which avoids
# re-copying the DataFrame on every append
rows = []
for directory in ('test', 'train'):
    for sentiment in ('pos', 'neg'):
        path = r'/pub/data/aclImdb/{}/{}'.format(directory, sentiment)
        for review_file in os.listdir(path):
            with open(os.path.join(path, review_file), 'r') as input_file:
                review = input_file.read()
            rows.append([review, labels[sentiment]])
dataset = pd.DataFrame(rows)


In [41]:
dataset.columns = ['review', 'sentiment']

In [42]:
indices = dataset.index.tolist()
np.random.shuffle(indices)
indices = np.array(indices)


In [43]:
dataset = dataset.reindex(index=indices)

dataset.to_csv('movie_reviews.csv', index=False)

### Preparing Datasets


In [44]:
import pandas as pd
import numpy as np
from normalization import normalize_corpus
#from utils import build_feature_matrix


dataset = pd.read_csv(r'movie_reviews.csv')

print dataset.head()


                                              review sentiment
0  I have to totally disagree with the other comm...  positive
1  Things get dull early an often in this in this...  negative
2  I just wish I was eloquent enough to say how G...  positive
3  Notorious HK CATIII actor, Anthony Wong, is fo...  positive
4  What often threatens to turn into a soppy and ...  positive


In [45]:
train_data = dataset[:35000]
test_data = dataset[35000:]

train_reviews = np.array(train_data['review'])
train_sentiments = np.array(train_data['sentiment'])
test_reviews = np.array(test_data['review'])
test_sentiments = np.array(test_data['sentiment'])


sample_docs = [100, 5817, 7626, 7356, 1008, 7155, 3533, 13010]
sample_data = [(test_reviews[index],
                test_sentiments[index])
                  for index in sample_docs]

sample_data    


[("Got into this flick, just as it was beginning, on an afternoon where I was home with a touch of flu - otherwise I'd have missed it. That probably would have been best.<br /><br />I noticed the presence of Lindsay Crouse and Jay Thomas - both very good performers - and thought this might be worth a look. It proved to be to some extent, but only because it is one of those stories so awful it fascinates.<br /><br />Zoe McLellan has little to recommend her talents, except for her Jayne Mansfield- or Loni Anderson-like bosom. Unfortunately, her acting prowess - at least here - makes Mansfield and Anderson seem to be Garbo or Davis by comparison.<br /><br />The young nut case's white rat, the owner's cat, the young nut case having the owner evicted and restrained in her own home, and a bunch of doophus's (including the young nut case) running around a bio hazard facility, and the absurd conclusion. I kept waiting for at least some scene or plot element to contain at least a modicum of rea

## Supervised Machine Learning Technique


In [46]:
# normalization
norm_train_reviews = normalize_corpus(train_reviews,
                                      lemmatize=True,
                                      only_text_chars=True)


In [47]:
# feature extraction                                                                            
vectorizer, train_features = build_feature_matrix(documents=norm_train_reviews,
                                                  feature_type='tfidf',
                                                  ngram_range=(1, 1), 
                                                  min_df=0.0, max_df=1.0)                                      


In [48]:
from sklearn.linear_model import SGDClassifier
# build the model
svm = SGDClassifier(loss='hinge', n_iter=500)
svm.fit(train_features, train_sentiments)


SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=500, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
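With `loss='hinge'` and an L2 penalty, `SGDClassifier` fits a linear SVM by stochastic gradient descent. The sketch below shows the core update on a toy problem. It is a simplified illustration under stated assumptions: a fixed learning rate instead of scikit-learn's 'optimal' schedule, no shuffling, dense Python lists, and hypothetical two-dimensional features. The helper names are not part of any library.

```python
def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def train_hinge_sgd(samples, labels, epochs=50, learning_rate=0.1, alpha=0.0001):
    """Fit a linear classifier by SGD on the L2-regularized hinge loss.

    Labels must be +1 or -1.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            margin = label * (dot(weights, features) + bias)
            # gradient step for the L2 penalty: shrink the weights slightly
            weights = [w * (1.0 - learning_rate * alpha) for w in weights]
            if margin < 1.0:
                # the sample is inside the margin or misclassified: move toward it
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
    return weights, bias

def predict(weights, bias, features):
    return 1 if dot(weights, features) + bias >= 0 else -1

# toy features: [count of positive words, count of negative words] (hypothetical)
samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, -1, -1]
weights, bias = train_hinge_sgd(samples, labels)
print([predict(weights, bias, sample) for sample in samples])
```

Samples that already sit outside the margin contribute no hinge gradient, so once every training point has a margin of at least 1 the only remaining change is the slow shrinkage from the penalty term.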

In [49]:
# normalize reviews                        
norm_test_reviews = normalize_corpus(test_reviews,
                                     lemmatize=True,
                                     only_text_chars=True)  


In [50]:
# extract features                                     
test_features = vectorizer.transform(norm_test_reviews)         


for doc_index in sample_docs:
    print 'Review:-'
    print test_reviews[doc_index]
    print 'Actual Labeled Sentiment:', test_sentiments[doc_index]
    doc_features = test_features[doc_index]
    predicted_sentiment = svm.predict(doc_features)[0]
    print 'Predicted Sentiment:', predicted_sentiment
    print


Review:-
Got into this flick, just as it was beginning, on an afternoon where I was home with a touch of flu - otherwise I'd have missed it. That probably would have been best.<br /><br />I noticed the presence of Lindsay Crouse and Jay Thomas - both very good performers - and thought this might be worth a look. It proved to be to some extent, but only because it is one of those stories so awful it fascinates.<br /><br />Zoe McLellan has little to recommend her talents, except for her Jayne Mansfield- or Loni Anderson-like bosom. Unfortunately, her acting prowess - at least here - makes Mansfield and Anderson seem to be Garbo or Davis by comparison.<br /><br />The young nut case's white rat, the owner's cat, the young nut case having the owner evicted and restrained in her own home, and a bunch of doophus's (including the young nut case) running around a bio hazard facility, and the absurd conclusion. I kept waiting for at least some scene or plot element to contain at least a modicum 

In [51]:
predicted_sentiments = svm.predict(test_features)       

#from utils import display_evaluation_metrics, display_confusion_matrix, display_classification_report

display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=predicted_sentiments,
                           positive_class='positive')  


Accuracy: 0.89
Precision: 0.88
Recall: 0.91
F1 Score: 0.89


In [52]:
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=predicted_sentiments,
                         classes=['positive', 'negative'])


                 Predicted:         
                   positive negative
Actual: positive       6854      704
        negative        937     6505
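
The four summary figures printed earlier can be recovered from this confusion matrix by hand. The sketch below uses the standard definitions with 'positive' as the positive class; `binary_metrics` is a hypothetical helper written for illustration, and it is only an assumption that `display_evaluation_metrics` computes its values the same way, since that helper's body is not shown here.

```python
from __future__ import division, print_function

# Counts copied from the confusion matrix above ('positive' is the positive class).
tp, fn = 6854, 704   # actual positive: predicted positive / predicted negative
fp, tn = 937, 6505   # actual negative: predicted positive / predicted negative


def binary_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = binary_metrics(tp, fn, fp, tn)
print('Accuracy:', round(accuracy, 2))
print('Precision:', round(precision, 2))
print('Recall:', round(recall, 2))
print('F1 Score:', round(f1, 2))
```

Rounded to two decimals, these values agree with the accuracy, precision, recall and F1 score reported for the SVM model.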


In [53]:
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=predicted_sentiments,
                              classes=['positive', 'negative'])                         

             precision    recall  f1-score   support

   positive       0.88      0.91      0.89      7558
   negative       0.90      0.87      0.89      7442

avg / total       0.89      0.89      0.89     15000



## Unsupervised Lexicon-based Techniques


### AFINN Lexicon


pip install afinn

In [54]:
from afinn import Afinn
afn = Afinn(emoticons=True) 
print afn.score('I really hated the plot of this movie')

print afn.score('I really hated the plot of this movie :(')


-3.0
-5.0
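
The core idea behind a valence lexicon such as AFINN is to sum per-token scores. The following is a minimal sketch of that idea only: `TOY_LEXICON` and `toy_lexicon_score` are hypothetical names, the two entries are a toy excerpt chosen to agree with the scores printed above rather than the real word list, and the real library's tokenization and emoticon handling are more involved than a whitespace split.

```python
from __future__ import print_function

# Toy valence lexicon: a hypothetical excerpt, NOT the real AFINN word list.
# The two values are chosen to be consistent with the output shown above.
TOY_LEXICON = {'hated': -3, ':(': -2}


def toy_lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every whitespace-separated token found in the lexicon."""
    tokens = text.lower().split()
    return float(sum(lexicon.get(token, 0) for token in tokens))


print(toy_lexicon_score('I really hated the plot of this movie'))
print(toy_lexicon_score('I really hated the plot of this movie :('))
```

Tokens missing from the lexicon contribute zero, so the emoticon is the only thing that separates the two scores.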


### SentiWordNet


In [55]:
import nltk
from nltk.corpus import sentiwordnet as swn

# list() keeps the indexing valid when senti_synsets returns an iterator
good = list(swn.senti_synsets('good', 'n'))[0]
print 'Positive Polarity Score:', good.pos_score()
print 'Negative Polarity Score:', good.neg_score()
print 'Objective Score:', good.obj_score()


Positive Polarity Score: 0.5
Negative Polarity Score: 0.0
Objective Score: 0.5


In [56]:
from normalization import normalize_accented_characters, html_parser, strip_html

def analyze_sentiment_sentiwordnet_lexicon(review,
                                           verbose=False):
    # pre-process text
    review = normalize_accented_characters(review)
    review = html_parser.unescape(review)
    review = strip_html(review)
    # tokenize and POS tag text tokens
    text_tokens = nltk.word_tokenize(review)
    tagged_text = nltk.pos_tag(text_tokens)
    pos_score = neg_score = token_count = obj_score = 0
    # Penn Treebank tag fragments mapped to WordNet POS codes
    tag_to_wn_pos = (('NN', 'n'), ('VB', 'v'), ('JJ', 'a'), ('RB', 'r'))
    # get the first senti-synset for each token whose POS tag can be mapped
    for word, tag in tagged_text:
        ss_set = None
        for tag_fragment, wn_pos in tag_to_wn_pos:
            if tag_fragment in tag:
                # look up once; list() also handles an iterator return value
                senti_synsets = list(swn.senti_synsets(word, wn_pos))
                if senti_synsets:
                    ss_set = senti_synsets[0]
                break
        # if senti-synset is found
        if ss_set:
            # accumulate scores over all tokens that have a senti-synset
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1

    # aggregate final scores; a review with no scorable tokens would
    # otherwise raise ZeroDivisionError, so divide by at least 1
    final_score = pos_score - neg_score
    token_count = max(token_count, 1)
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score,
                                         norm_pos_score, norm_neg_score,
                                         norm_final_score]],
                                         columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Objectivity',
                                                                       'Positive', 'Negative', 'Overall']], 
                                                              labels=[[0,0,0,0,0],[0,1,2,3,4]]))
        print sentiment_frame
        
    return final_sentiment
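
The two mechanical steps in the function above, mapping a POS tag to a WordNet part of speech and averaging the polarity over scored tokens, can be isolated and tried without NLTK. This is a sketch under stated assumptions: `penn_to_wordnet_pos` and `aggregate_polarity` are hypothetical helpers, the score pairs are made up for illustration, and the tag mapping uses a prefix test, which is slightly stricter than the substring test in the function (a tag such as `WRB` contains `RB` but does not start with it).

```python
from __future__ import division, print_function


def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    for prefix, wn_pos in (('NN', 'n'), ('VB', 'v'), ('JJ', 'a'), ('RB', 'r')):
        if tag.startswith(prefix):
            return wn_pos
    return None


def aggregate_polarity(scored_tokens):
    """Average (positive - negative) over tokens; 0.0 when nothing was scored."""
    if not scored_tokens:
        return 0.0
    total = sum(pos - neg for pos, neg in scored_tokens)
    return round(total / len(scored_tokens), 2)


# Hypothetical (positive, negative) score pairs for three tokens.
scores = [(0.5, 0.0), (0.0, 0.625), (0.25, 0.0)]
norm_score = aggregate_polarity(scores)
label = 'positive' if norm_score >= 0 else 'negative'

print(penn_to_wordnet_pos('NNS'), penn_to_wordnet_pos('JJR'), penn_to_wordnet_pos('DT'))
print(norm_score, label)
```

Because the decision rule is `>= 0`, a review whose positive and negative scores cancel out exactly is labeled positive.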


In [57]:
for review, review_sentiment in sample_data:  
    print 'Review:'
    print review
    print
    print 'Labeled Sentiment:', review_sentiment    
    print    
    final_sentiment = analyze_sentiment_sentiwordnet_lexicon(review, verbose=True)
    print '-'*60                                                         


Review:
Got into this flick, just as it was beginning, on an afternoon where I was home with a touch of flu - otherwise I'd have missed it. That probably would have been best.<br /><br />I noticed the presence of Lindsay Crouse and Jay Thomas - both very good performers - and thought this might be worth a look. It proved to be to some extent, but only because it is one of those stories so awful it fascinates.<br /><br />Zoe McLellan has little to recommend her talents, except for her Jayne Mansfield- or Loni Anderson-like bosom. Unfortunately, her acting prowess - at least here - makes Mansfield and Anderson seem to be Garbo or Davis by comparison.<br /><br />The young nut case's white rat, the owner's cat, the young nut case having the owner evicted and restrained in her own home, and a bunch of doophus's (including the young nut case) running around a bio hazard facility, and the absurd conclusion. I kept waiting for at least some scene or plot element to contain at least a modicum o

In [58]:
sentiwordnet_predictions = [analyze_sentiment_sentiwordnet_lexicon(review)
                            for review in test_reviews]

#from utils import display_evaluation_metrics, display_confusion_matrix, display_classification_report

print 'Performance metrics:'
display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=sentiwordnet_predictions,
                           positive_class='positive')  


Performance metrics:
Accuracy: 0.6
Precision: 0.56
Recall: 0.92
F1 Score: 0.7


In [59]:
print '\nConfusion Matrix:'                           
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=sentiwordnet_predictions,
                         classes=['positive', 'negative'])



Confusion Matrix:
                 Predicted:         
                   positive negative
Actual: positive       6948      610
        negative       5407     2035


In [60]:
print '\nClassification report:'                         
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=sentiwordnet_predictions,
                              classes=['positive', 'negative'])  



Classification report:
             precision    recall  f1-score   support

   positive       0.56      0.92      0.70      7558
   negative       0.77      0.27      0.40      7442

avg / total       0.67      0.60      0.55     15000



### VADER Lexicon


In [61]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentiment_vader_lexicon(review, 
                                    threshold=0.1,
                                    verbose=False):
    # pre-process text
    review = normalize_accented_characters(review)
    review = html_parser.unescape(review)
    review = strip_html(review)
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold\
                                   else 'negative'
    if verbose:
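        # display detailed sentiment statistics
        # format proportions as whole percentages; this avoids float
        # artifacts when str() prints full precision (e.g. on Python 3)
        positive = '{:.0f}%'.format(scores['pos'] * 100)
        final = round(agg_score, 2)
        negative = '{:.0f}%'.format(scores['neg'] * 100)
        neutral = '{:.0f}%'.format(scores['neu'] * 100)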
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                        negative, neutral]],
                                        columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Polarity Score',
                                                                       'Positive', 'Negative',
                                                                       'Neutral']], 
                                                              labels=[[0,0,0,0,0],[0,1,2,3,4]]))
        print sentiment_frame
    
    return final_sentiment
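
The `compound` value thresholded above is a normalized score in the range -1 to 1. The sketch below assumes the normalization used in the reference VADER implementation, `x / sqrt(x^2 + alpha)` with `alpha = 15`, applied to the summed valence of the text; `normalize_compound` and `label_from_compound` are hypothetical names and the raw sums are made-up inputs, not values taken from any review.

```python
from __future__ import division, print_function
import math


def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


def label_from_compound(compound, threshold=0.1):
    """Apply the same decision rule as the function above."""
    return 'positive' if compound >= threshold else 'negative'


# Hypothetical raw valence sums.
for raw in (-4.0, 0.0, 0.3, 4.0):
    compound = normalize_compound(raw)
    print(raw, round(compound, 2), label_from_compound(compound))
```

With a threshold of 0.1, neutral and only faintly positive text falls on the negative side, which is worth keeping in mind when reading the evaluation results.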
        




In [62]:
for review, review_sentiment in sample_data:
    print 'Review:'
    print review
    print
    print 'Labeled Sentiment:', review_sentiment    
    print    
    final_sentiment = analyze_sentiment_vader_lexicon(review, threshold=0.1, verbose=True)
    print '-'*60                                                       


Review:
Got into this flick, just as it was beginning, on an afternoon where I was home with a touch of flu - otherwise I'd have missed it. That probably would have been best.<br /><br />I noticed the presence of Lindsay Crouse and Jay Thomas - both very good performers - and thought this might be worth a look. It proved to be to some extent, but only because it is one of those stories so awful it fascinates.<br /><br />Zoe McLellan has little to recommend her talents, except for her Jayne Mansfield- or Loni Anderson-like bosom. Unfortunately, her acting prowess - at least here - makes Mansfield and Anderson seem to be Garbo or Davis by comparison.<br /><br />The young nut case's white rat, the owner's cat, the young nut case having the owner evicted and restrained in her own home, and a bunch of doophus's (including the young nut case) running around a bio hazard facility, and the absurd conclusion. I kept waiting for at least some scene or plot element to contain at least a modicum o

In [63]:
vader_predictions = [analyze_sentiment_vader_lexicon(review, threshold=0.1)
                     for review in test_reviews] 

print 'Performance metrics:'
display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=vader_predictions,
                           positive_class='positive')  


Performance metrics:
Accuracy: 0.7
Precision: 0.65
Recall: 0.85
F1 Score: 0.74


In [64]:
print '\nConfusion Matrix:'                           
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=vader_predictions,
                         classes=['positive', 'negative'])



Confusion Matrix:
                 Predicted:         
                   positive negative
Actual: positive       6450     1108
        negative       3425     4017


In [65]:
print '\nClassification report:'                         
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=vader_predictions,
                              classes=['positive', 'negative']) 



Classification report:
             precision    recall  f1-score   support

   positive       0.65      0.85      0.74      7558
   negative       0.78      0.54      0.64      7442

avg / total       0.72      0.70      0.69     15000
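
The per-class rows and the `avg / total` row of this report can be reproduced from the VADER confusion matrix. The sketch assumes that `avg / total` is a support-weighted average of the per-class values, as in scikit-learn's `classification_report` of that era; `per_class_report` and `weighted_average` are hypothetical helpers, not part of the book's `utils` module.

```python
from __future__ import division, print_function

# Outer key: actual class, inner key: predicted class (counts from the matrix above).
confusion = {'positive': {'positive': 6450, 'negative': 1108},
             'negative': {'positive': 3425, 'negative': 4017}}
classes = ['positive', 'negative']


def per_class_report(confusion, classes):
    """Return {class: (precision, recall, f1, support)}."""
    report = {}
    for cls in classes:
        tp = confusion[cls][cls]
        support = sum(confusion[cls].values())                     # actual count
        predicted = sum(confusion[actual][cls] for actual in classes)
        precision = tp / predicted
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall)
        report[cls] = (precision, recall, f1, support)
    return report


def weighted_average(report, column):
    """Average one metric column, weighting each class by its support."""
    total = sum(row[3] for row in report.values())
    return sum(row[column] * row[3] for row in report.values()) / total


report = per_class_report(confusion, classes)
for cls in classes:
    precision, recall, f1, support = report[cls]
    print(cls, round(precision, 2), round(recall, 2), round(f1, 2), support)

averages = [round(weighted_average(report, col), 2) for col in range(3)]
print('avg / total', averages)
```

The averages are computed from unrounded per-class values, which is why the weighted precision comes out at 0.72 rather than the 0.71 obtained by averaging the rounded figures 0.65 and 0.78.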



### Pattern Lexicon


In [66]:
from pattern.en import sentiment, mood, modality

def analyze_sentiment_pattern_lexicon(review, threshold=0.1,
                                      verbose=False):
    # pre-process text
    review = normalize_accented_characters(review)
    review = html_parser.unescape(review)
    review = strip_html(review)
    # analyze sentiment for the text document
    analysis = sentiment(review)
    sentiment_score = round(analysis[0], 2)
    sentiment_subjectivity = round(analysis[1], 2)
    # get final sentiment
    final_sentiment = 'positive' if sentiment_score >= threshold\
                                   else 'negative'
    if verbose:
        # display detailed sentiment statistics
        sentiment_frame = pd.DataFrame([[final_sentiment, sentiment_score,
                                        sentiment_subjectivity]],
                                        columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                                      ['Predicted Sentiment', 'Polarity Score',
                                                                       'Subjectivity Score']], 
                                                              labels=[[0,0,0],[0,1,2]]))
        print sentiment_frame
        assessment = analysis.assessments
        assessment_frame = pd.DataFrame(assessment, 
                                        columns=pd.MultiIndex(levels=[['DETAILED ASSESSMENT STATS:'], 
                                                                      ['Key Terms', 'Polarity Score',
                                                                       'Subjectivity Score', 'Type']], 
                                                              labels=[[0,0,0,0],[0,1,2,3]]))
        print assessment_frame
        print
    
    return final_sentiment                                       


In [67]:
for review, review_sentiment in sample_data:
    print 'Review:'
    print review
    print
    print 'Labeled Sentiment:', review_sentiment    
    print    
    final_sentiment = analyze_sentiment_pattern_lexicon(review,
                                                        threshold=0.1,
                                                        verbose=True)
    print '-'*60            


Review:
Got into this flick, just as it was beginning, on an afternoon where I was home with a touch of flu - otherwise I'd have missed it. That probably would have been best.<br /><br />I noticed the presence of Lindsay Crouse and Jay Thomas - both very good performers - and thought this might be worth a look. It proved to be to some extent, but only because it is one of those stories so awful it fascinates.<br /><br />Zoe McLellan has little to recommend her talents, except for her Jayne Mansfield- or Loni Anderson-like bosom. Unfortunately, her acting prowess - at least here - makes Mansfield and Anderson seem to be Garbo or Davis by comparison.<br /><br />The young nut case's white rat, the owner's cat, the young nut case having the owner evicted and restrained in her own home, and a bunch of doophus's (including the young nut case) running around a bio hazard facility, and the absurd conclusion. I kept waiting for at least some scene or plot element to contain at least a modicum o

In [68]:
for review, review_sentiment in sample_data:
    print 'Review:'
    print review
    print 'Labeled Sentiment:', review_sentiment 
    print 'Mood:', mood(review)
    mod_score = modality(review)
    print 'Modality Score:', round(mod_score, 2)
    print 'Certainty:', 'Strong' if mod_score > 0.5 \
                                    else 'Medium' if mod_score > 0.35 \
                                                    else 'Low'
    print '-'*60            
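
The chained conditional that turns a modality score into a certainty label is easier to read and test as a small function. This is a sketch with the same cut-offs as the cell above; `certainty_label` is a hypothetical helper and the scores passed to it are made-up values.

```python
from __future__ import print_function


def certainty_label(mod_score):
    """Bucket a modality score using the cut-offs 0.5 and 0.35."""
    if mod_score > 0.5:
        return 'Strong'
    if mod_score > 0.35:
        return 'Medium'
    return 'Low'


# Hypothetical modality scores.
for score in (0.75, 0.5, 0.36, 0.1):
    print(score, certainty_label(score))
```

Both comparisons are strict, so a score of exactly 0.5 is labeled 'Medium' and a score of exactly 0.35 is labeled 'Low'.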


Review:
Got into this flick, just as it was beginning, on an afternoon where I was home with a touch of flu - otherwise I'd have missed it. That probably would have been best.<br /><br />I noticed the presence of Lindsay Crouse and Jay Thomas - both very good performers - and thought this might be worth a look. It proved to be to some extent, but only because it is one of those stories so awful it fascinates.<br /><br />Zoe McLellan has little to recommend her talents, except for her Jayne Mansfield- or Loni Anderson-like bosom. Unfortunately, her acting prowess - at least here - makes Mansfield and Anderson seem to be Garbo or Davis by comparison.<br /><br />The young nut case's white rat, the owner's cat, the young nut case having the owner evicted and restrained in her own home, and a bunch of doophus's (including the young nut case) running around a bio hazard facility, and the absurd conclusion. I kept waiting for at least some scene or plot element to contain at least a modicum o