# WordNet

WordNet is a large lexical database of the English language.

## Synsets

A synset (synonym set) is a collection of data entities, such as words, that are considered to be semantically similar.

## Repo
https://github.com/dipanjanS/text-analytics-with-python/tree/master/Old-First-Edition

In [1]:
from nltk.corpus import wordnet as wn
import pandas as pd

term = 'fruit'
synsets = wn.synsets(term)
print('Total Synsets:', len(synsets))

Total Synsets: 5


In [2]:
for synset in synsets:
    print('Synset:', synset)
    print('Part of Speech:', synset.lexname())
    print('Definition:', synset.definition())
    print('Lemmas:', synset.lemma_names())
    print('Examples:', synset.examples())
    print()

Synset: Synset('fruit.n.01')
Part of Speech: noun.plant
Definition: the ripened reproductive body of a seed plant
Lemmas: ['fruit']
Examples: []

Synset: Synset('yield.n.03')
Part of Speech: noun.artifact
Definition: an amount of a product
Lemmas: ['yield', 'fruit']
Examples: []

Synset: Synset('fruit.n.03')
Part of Speech: noun.event
Definition: the consequence of some effort or action
Lemmas: ['fruit']
Examples: ['he lived long enough to see the fruit of his policies']

Synset: Synset('fruit.v.01')
Part of Speech: verb.creation
Definition: cause to bear fruit
Lemmas: ['fruit']
Examples: []

Synset: Synset('fruit.v.02')
Part of Speech: verb.creation
Definition: bear fruit
Lemmas: ['fruit']
Examples: ['the trees fruited early this year']
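
The listing above shows that one lemma can belong to several synsets and that one synset can hold several lemmas. The following is a minimal sketch of that many-to-many relation without NLTK. The names `synset_lemmas` and `lemma_index` are hypothetical, and the data is hand-copied from the output above rather than read from WordNet.

```python
from collections import defaultdict

# Synset names and lemma lists copied from the output above.
synset_lemmas = {
    'fruit.n.01': ['fruit'],
    'yield.n.03': ['yield', 'fruit'],
    'fruit.n.03': ['fruit'],
    'fruit.v.01': ['fruit'],
    'fruit.v.02': ['fruit'],
}

# Invert the mapping: lemma -> names of the synsets that contain it.
lemma_index = defaultdict(list)
for synset_name, lemmas in synset_lemmas.items():
    for lemma in lemmas:
        lemma_index[lemma].append(synset_name)

print('Synsets containing "fruit":', len(lemma_index['fruit']))
print('Synsets containing "yield":', lemma_index['yield'])
```

In this toy index, looking up `fruit` returns all five synsets, while `yield` returns only the one synset it shares with `fruit`.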



# Entailments

An entailment is an event or action that logically involves, or is necessarily associated with, some other action or event that has taken place or will take place.

In [3]:
for action in ['walk', 'eat', 'digest']:
    action_syn = wn.synsets(action, pos='v')[0]
    print(action_syn, '-- entails -->', action_syn.entailments())

Synset('walk.v.01') -- entails --> [Synset('step.v.01')]
Synset('eat.v.01') -- entails --> [Synset('chew.v.01'), Synset('swallow.v.01')]
Synset('digest.v.01') -- entails --> [Synset('consume.v.02')]


# Homonyms and Homographs

Homonyms are words or terms that share the same written form or pronunciation but have different meanings. Homonyms are a superset of homographs, which are words with the same spelling that may differ in pronunciation and meaning.

In [4]:
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

bank.n.01 - sloping land (especially the slope beside a body of water)
depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
bank.n.03 - a long ridge or pile
bank.n.04 - an arrangement of similar objects in a row or in tiers
bank.n.05 - a supply or stock held in reserve for future use (especially in emergencies)
bank.n.06 - the funds held by a gambling house or the dealer in some gambling games
bank.n.07 - a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
savings_bank.n.02 - a container (usually with a slot in the top) for keeping money at home
bank.n.09 - a building in which the business of banking transacted
bank.n.10 - a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
bank.v.01 - tip laterally
bank.v.02 - enclose with a bank
bank.v.03 - do business with a bank or keep an account at a bank

# Synonyms and antonyms

Synonyms are words having similar meaning and context, and antonyms are words having opposite or contrasting meaning.

In [5]:
term = 'large'
synsets = wn.synsets(term)
adj_large = synsets[1]
adj_large = adj_large.lemmas()[0]
adj_large_synonym = adj_large.synset()
adj_large_antonym = adj_large.antonyms()[0].synset()

# Print synonym and antonym.
print('Synonym:', adj_large_synonym.name())
print('Definition:', adj_large_synonym.definition())
print('Antonym:', adj_large_antonym.name())
print('Definition:', adj_large_antonym.definition())

Synonym: large.a.01
Definition: above average in size or number or quantity or magnitude or extent
Antonym: small.a.01
Definition: limited or below average in number or quantity or magnitude or extent


In [6]:
term = 'rich'
synsets = wn.synsets(term)[:3]

# Print synonym and antonym for different synsets.
for synset in synsets:
    rich = synset.lemmas()[0]
    rich_synonym = rich.synset()
    rich_antonym = rich.antonyms()[0].synset()
    
    print('Synonym:', rich_synonym.name())
    print('Definition:', rich_synonym.definition())


    print('Antonym:', rich_antonym.name())
    print('Definition:', rich_antonym.definition())

Synonym: rich_people.n.01
Definition: people who have possessions and wealth (considered as a group)
Antonym: poor_people.n.01
Definition: people without possessions or wealth (considered as a group)
Synonym: rich.a.01
Definition: possessing material wealth
Antonym: poor.a.02
Definition: having little money or few possessions
Synonym: rich.a.02
Definition: having an abundant supply of desirable qualities or substances (especially natural resources)
Antonym: poor.a.04
Definition: lacking in specific resources, qualities or substances


# Hyponyms and Hypernyms

A hyponym is an entity or concept that is a subclass of a higher-order concept or entity and has a more specific sense or context than its superclass. Conversely, a hypernym is the more general superclass of a given concept.

In [7]:
term = 'tree'
synsets = wn.synsets(term)
tree = synsets[0]

# Print the entity and its meaning.
print('Name:', tree.name())
print('Definition:', tree.definition())

# Print total hyponyms and some sample hyponyms for 'tree'.
hyponyms = tree.hyponyms()
print('Total Hyponyms:', len(hyponyms))
print('Sample Hyponyms')
for hyponym in hyponyms[:10]:
    print(hyponym.name(), '-', hyponym.definition())

Name: tree.n.01
Definition: a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms
Total Hyponyms: 180
Sample Hyponyms
aalii.n.01 - a small Hawaiian tree with hard dark wood
acacia.n.01 - any of various spiny trees or shrubs of the genus Acacia
african_walnut.n.01 - tropical African timber tree with wood that resembles mahogany
albizzia.n.01 - any of numerous trees of the genus Albizia
alder.n.02 - north temperate shrubs or trees having toothed leaves and conelike fruit; bark is used in tanning and dyeing and the wood is rot-resistant
angelim.n.01 - any of several tropical American trees of the genus Andira
angiospermous_tree.n.01 - any tree having seeds and ovules contained in the ovary
anise_tree.n.01 - any of several evergreen shrubs and small trees of the genus Illicium
arbor.n.01 - tree (as opposed to shrub)
aroeira_blanca.n.01 - small resinous tree or shrub of Brazil


In [8]:
hypernyms = tree.hypernyms()
print(hypernyms)

[Synset('woody_plant.n.01')]


In [9]:
# Get total hierarchy pathways for tree.
hypernym_paths = tree.hypernym_paths()
print('Total Hypernym paths:', len(hypernym_paths))

Total Hypernym paths: 1


In [10]:
# Print the entire hypernym hierarchy.
print('Hypernym Hierarchy')
print(' -> '.join(synset.name() for synset in hypernym_paths[0]))

Hypernym Hierarchy
entity.n.01 -> physical_entity.n.01 -> object.n.01 -> whole.n.02 -> living_thing.n.01 -> organism.n.01 -> plant.n.02 -> vascular_plant.n.01 -> woody_plant.n.01 -> tree.n.01


# Holonyms and Meronyms


Holonyms are entities that contain a specific entity of interest. In other words, a holonym denotes the whole in a part-whole relationship, whereas a meronym denotes a specific part or member of that whole.

In [11]:
member_holonyms = tree.member_holonyms()
print('Total member holonyms:', len(member_holonyms))
print('Member holonyms for [tree]:-')
for holonym in member_holonyms:
    print(holonym.name(), '-', holonym.definition())

Total member holonyms: 1
Member holonyms for [tree]:-
forest.n.01 - the trees and other plants in a large densely wooded area


In [12]:
# Part based meronyms for tree.
part_meronyms = tree.part_meronyms()
print('Total Part Meronyms:', len(part_meronyms))
print('Part Meronyms for [tree]:-')
for meronym in part_meronyms:
    print(meronym.name(), '-', meronym.definition())

Total Part Meronyms: 5
Part Meronyms for [tree]:-
burl.n.02 - a large rounded outgrowth on the trunk or branch of a tree
crown.n.07 - the upper branches and leaves of a tree or other plant
limb.n.02 - any of the main branches arising from the trunk or a bough of a tree
stump.n.01 - the base part of a tree that remains standing after the tree has been felled
trunk.n.01 - the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber


In [13]:
# Substance based meronyms for tree.
substance_meronyms = tree.substance_meronyms()
print('Total substance meronyms:', len(substance_meronyms))
for meronym in substance_meronyms:
    print(meronym.name(), '-', meronym.definition())

Total substance meronyms: 2
heartwood.n.01 - the older inactive central wood of a tree or woody plant; usually darker and denser than the surrounding sapwood
sapwood.n.01 - newly formed outer wood lying between the cambium and the heartwood of a tree or woody plant; usually light colored; active in water conduction


# Semantic relationships and similarity

In [14]:
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

# Create entities and extract names and definitions.
entities = [tree, lion, tiger, cat, dog]
entity_names = [entity.name().split('.')[0] for entity in entities]
entity_definitions = [entity.definition() for entity in entities]

# Print entities and their definitions.
for entity, definition in zip(entity_names, entity_definitions):
    print(entity, '-', definition)


tree - a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms
lion - large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male
tiger - large feline of forests in most of Asia having a tawny coat with black stripes; endangered
cat - feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats
dog - a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


In [15]:
common_hypernyms = []
for entity in entities:
    # Get pairwise lowest common hypernyms.
    common_hypernyms.append([entity.lowest_common_hypernyms(compared_entity)[0].name().split('.')[0]
                             for compared_entity in entities])
    
# Build pairwise lowest common hypernym matrix.
common_hypernym_frame = pd.DataFrame(common_hypernyms,
                                     index=entity_names,
                                     columns=entity_names)

# Print the matrix.
common_hypernym_frame

| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | tree | organism | organism | organism | organism |
| lion | organism | lion | big_cat | feline | carnivore |
| tiger | organism | big_cat | tiger | feline | carnivore |
| cat | organism | feline | feline | cat | carnivore |
| dog | organism | carnivore | carnivore | carnivore | dog |


In [16]:
similarities = []
for entity in entities:
    # Get pairwise similarities.
    similarities.append([round(entity.path_similarity(compared_entity), 2)
                         for compared_entity in entities])

# Build pairwise similarity matrix.
similarity_frame = pd.DataFrame(similarities,
                                index=entity_names,
                                columns=entity_names)

similarity_frame

| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | 1.0 | 0.07 | 0.07 | 0.08 | 0.12 |
| lion | 0.07 | 1.0 | 0.33 | 0.25 | 0.17 |
| tiger | 0.07 | 0.33 | 1.0 | 0.25 | 0.17 |
| cat | 0.08 | 0.25 | 0.25 | 1.0 | 0.2 |
| dog | 0.12 | 0.17 | 0.17 | 0.2 | 1.0 |
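
Path similarity is defined as 1 / (1 + d), where d is the number of edges on the shortest path between two synsets in the hypernym hierarchy. The sketch below reproduces a few of the animal scores on a hand-copied fragment of that hierarchy. The `parent` dictionary and the helper names are assumptions made for illustration. Each node keeps only one parent, which is a simplification of WordNet, and `tree` is left out because its chain up to `organism` is long.

```python
# Hand-copied fragment of the noun hierarchy (child -> parent).
# Keeping a single parent per node is a simplification.
parent = {
    'lion': 'big_cat',
    'tiger': 'big_cat',
    'big_cat': 'feline',
    'cat': 'feline',
    'feline': 'carnivore',
    'dog': 'canine',
    'canine': 'carnivore',
}

def ancestors(node):
    """Return the chain [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def lowest_common_ancestor(a, b):
    """Return the first node on a's chain that also lies on b's chain."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    return None

def path_similarity(a, b):
    """Return 1 / (1 + edges on the shortest path between a and b)."""
    common = lowest_common_ancestor(a, b)
    if common is None:
        return None
    distance = ancestors(a).index(common) + ancestors(b).index(common)
    return 1 / (1 + distance)

for a, b in [('lion', 'tiger'), ('lion', 'cat'), ('cat', 'dog'), ('lion', 'dog')]:
    print(a, b, lowest_common_ancestor(a, b), round(path_similarity(a, b), 2))
```

On this fragment the rounded scores for the four pairs agree with the corresponding cells of the two matrices above.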


# Word sense disambiguation

In [17]:
from nltk.wsd import lesk
from nltk import word_tokenize

In [18]:
# Sample text and word to disambiguate.
samples = [('The fruits on that plant has ripened', 'n'),
           ('He finally reaped the fruit of his hard work as he won the race', 'n')]
word = 'fruit'

# Perform word sense disambiguation.
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print()

Sentence: The fruits on that plant has ripened
Word synset: Synset('fruit.n.01')
Corresponding definition: the ripened reproductive body of a seed plant

Sentence: He finally reaped the fruit of his hard work as he won the race
Word synset: Synset('fruit.n.03')
Corresponding definition: the consequence of some effort or action
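
The Lesk approach picks the sense whose dictionary gloss shares the most words with the surrounding context. The following is a rough sketch of that idea only, not NLTK's implementation. The glosses are copied from the synset listing earlier in this document, the function names are hypothetical, tokenization is plain whitespace splitting, and ties simply resolve to the first sense listed, so results can differ from `lesk`.

```python
# Noun glosses for 'fruit', copied from the synset listing earlier.
glosses = {
    'fruit.n.01': 'the ripened reproductive body of a seed plant',
    'yield.n.03': 'an amount of a product',
    'fruit.n.03': 'the consequence of some effort or action',
}

def overlap_scores(sentence, sense_glosses):
    """Count the distinct words each gloss shares with the sentence."""
    context = set(sentence.lower().split())
    return {sense: len(context & set(gloss.split()))
            for sense, gloss in sense_glosses.items()}

def simple_lesk(sentence, sense_glosses):
    """Pick the sense with the largest overlap (first one wins on ties)."""
    scores = overlap_scores(sentence, sense_glosses)
    return max(scores, key=scores.get)

# Same sample sentence as in the cell above.
sentence = 'The fruits on that plant has ripened'
print(overlap_scores(sentence, glosses))
print(simple_lesk(sentence, glosses))
```

For this sentence the plant sense wins because its gloss shares "the", "ripened" and "plant" with the context.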



In [19]:
# Sample text and word to disambiguate.
samples = [('Lead is a very soft, malleable metal', 'n'),
           ('John is the actor who plays the lead in that movie', 'n'),
           ('This road leads to nowhere', 'v')]
word = 'lead'

# Perform word sense disambiguation.
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print()

Sentence: Lead is a very soft, malleable metal
Word synset: Synset('lead.n.02')
Corresponding definition: a soft heavy toxic malleable metallic element; bluish white when freshly cut but tarnishes readily to dull grey

Sentence: John is the actor who plays the lead in that movie
Word synset: Synset('star.n.04')
Corresponding definition: an actor who plays a principal role

Sentence: This road leads to nowhere
Word synset: Synset('run.v.23')
Corresponding definition: cause something to pass or lead somewhere



# Named entity recognition
Named entity recognition (NER), also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments named entities and classifies them under various predefined classes.

In [20]:
text = """
Bayern Munich, or FC Bayern, is a German sports club based in Munich, 
Bavaria, Germany. It is best known for its professional football team, 
which plays in the Bundesliga, the top tier of the German football 
league system, and is the most successful club in German football 
history, having won a record 26 national titles and 18 national cups. 
FC Bayern was founded in 1900 by eleven football players led by Franz John. 
Although Bayern won its first national championship in 1932, the club 
was not selected for the Bundesliga at its inception in 1963. The club 
had its period of greatest success in the middle of the 1970s when, 
under the captaincy of Franz Beckenbauer, it won the European Cup three 
times in a row (1974-76). Overall, Bayern has reached ten UEFA Champions 
League finals, most recently winning their fifth title in 2013 as part 
of a continental treble. 
"""

In [21]:
import nltk
from module.normalization import parse_document
import pandas as pd

# Tokenize sentences.
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) 
                       for sentence in sentences]

# Tag sentences and use nltk's Named Entity Chunker.
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]


# Extract all named entities.
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # Extract only chunks having NE labels.
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves()) # Get NE name.
            entity_type = tagged_tree.label() # Get NE category.
            named_entities.append((entity_name, entity_type))
# Get unique named entities.
named_entities = list(set(named_entities))

# Store named entities in a data frame.
entity_frame = pd.DataFrame(named_entities, 
                            columns=['Entity Name', 'Entity Type'])
entity_frame

| | Entity Name | Entity Type |
|---|---|---|
| 0 | FC Bayern | ORGANIZATION |
| 1 | German | GPE |
| 2 | Germany | GPE |
| 3 | Munich | GPE |
| 4 | Munich | ORGANIZATION |
| 5 | Bayern | GPE |
| 6 | Bavaria | GPE |
| 7 | Franz John | PERSON |
| 8 | UEFA | ORGANIZATION |
| 9 | Overall | GPE |


# NOTE: SKIP Stanford NER Tagger

# Propositional Logic

In [22]:
import nltk
import pandas as pd
import os

In [23]:
# Assign symbols and propositions.
symbol_P = 'P'
symbol_Q = 'Q'
proposition_P = 'He is hungry'
proposition_Q = 'He will eat a sandwich'

# Assign various truth values to the proposition.
p_statuses = [False, False, True, True]
q_statuses = [False, True, False, True]

# Assign the various expressions combining the logical operators.
conjunction = '(P & Q)'
disjunction = '(P | Q)'
implication = '(P -> Q)'
equivalence = '(P <-> Q)'
expressions = [conjunction, disjunction, implication, equivalence]

# Evaluate each expression using propositional logic.
results = []

for status_p, status_q in zip(p_statuses, q_statuses):
    dom = set([])
    val = nltk.Valuation([(symbol_P, status_p), 
                          (symbol_Q, status_q)])
    assignments = nltk.Assignment(dom)
    model = nltk.Model(dom, val)
    row = [status_p, status_q]
    for expression in expressions:
        # Evaluate each expression based on proposition truth values.
        result = model.evaluate(expression, assignments)
        row.append(result)
    results.append(row)

# Build the result table.
columns = [symbol_P, symbol_Q, conjunction, 
           disjunction, 
           implication,
           equivalence]

result_frame = pd.DataFrame(results, columns=columns)

# Display results.

print('P:', proposition_P)
print('Q:', proposition_Q)
print()
print('Expression Outcomes:-')
result_frame

P: He is hungry
Q: He will eat a sandwich

Expression Outcomes:-


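| | P | Q | (P & Q) | (P \| Q) | (P -> Q) | (P <-> Q) |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | True | True |
| 1 | False | True | False | True | True | False |
| 2 | True | False | False | True | False | False |
| 3 | True | True | True | True | True | True |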
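
The same truth table can be cross-checked with plain Python booleans. This is a minimal sketch that does not use NLTK's logic module. It relies on the standard identities that `P -> Q` is equivalent to `(not P) or Q` and that `P <-> Q` holds when both values are equal. The function name `truth_table` is hypothetical.

```python
from itertools import product

def truth_table():
    """Return rows of (P, Q, P & Q, P | Q, P -> Q, P <-> Q)."""
    rows = []
    for p, q in product([False, True], repeat=2):
        conjunction = p and q
        disjunction = p or q
        implication = (not p) or q   # False only when P is True and Q is False.
        equivalence = p == q         # True when P and Q share the same value.
        rows.append((p, q, conjunction, disjunction, implication, equivalence))
    return rows

for row in truth_table():
    print(row)
```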


# Sentiment Analysis

In [24]:
%load_ext autoreload
%autoreload
from module.normalization import normalize_corpus
from module.utils import build_feature_matrix, display_evaluation_metrics, display_confusion_matrix, display_classification_report
import pandas as pd
import numpy as np

In [25]:
dataset = pd.read_csv('movie_reviews.csv')
dataset = dataset.head(10_000)
dataset.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [26]:
n = 1_000 # 35_000
train_data = dataset[:n]
test_data = dataset[n:n+n]

train_reviews = np.array(train_data['review'])
train_sentiments = np.array(train_data['sentiment'])
test_reviews = np.array(test_data['review'])
test_sentiments = np.array(test_data['sentiment'])

# Prepare sample dataset for experiments.
# sample_docs = [100, 5817, 7626, 7356, 1008, 7155, 3533, 13010]
sample_docs = [100, 581, 762, 735, 100, 715, 353, 130]
sample_data = [(test_reviews[index], test_sentiments[index])
               for index in sample_docs]

In [29]:
# Normalization.
norm_train_reviews = normalize_corpus(train_reviews, lemmatize=True, only_text_chars=True)

# Feature extraction.
vectorizer, train_features = build_feature_matrix(documents=norm_train_reviews,
                                                  feature_type='tfidf',
                                                  ngram_range=(1, 1),
                                                  min_df=0.0,
                                                  max_df=1.0)

In [30]:
from sklearn.linear_model import SGDClassifier

# Build the model.
svm = SGDClassifier(loss='hinge', max_iter=200)
svm.fit(train_features, train_sentiments)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=200, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [None]:
# Normalize reviews.
norm_test_reviews = normalize_corpus(test_reviews, lemmatize=True, only_text_chars=True)

# Extract features.
test_features = vectorizer.transform(norm_test_reviews)

# Predict sentiment for sample docs from test data.
for doc_index in sample_docs:
    print('Review:-')
    print(test_reviews[doc_index])
    print('Actual labeled sentiment:', test_sentiments[doc_index])
    doc_features = test_features[doc_index]
    predicted_sentiment = svm.predict(doc_features)[0]
    print('Predicted sentiment:', predicted_sentiment)
    print()

In [None]:
# Predict the sentiment for test dataset movie reviews.
predicted_sentiments = svm.predict(test_features)

# Evaluate model prediction performance.
# Show performance metrics.
display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=predicted_sentiments,
                           positive_class='positive')

# Show confusion matrix.
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=predicted_sentiments,
                         classes=['positive', 'negative'])

# Show detailed per-class classification report.
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=predicted_sentiments,
                              classes=['positive', 'negative'])
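
The `display_*` helpers come from the local `module.utils` package, whose implementation is not shown here. As a rough idea of the kind of numbers such an evaluation typically reports, the sketch below computes accuracy, precision, recall and F1 for one positive class in plain Python. The function `binary_metrics` and the sample labels are hypothetical and are not the helpers' actual code or real results.

```python
def binary_metrics(true_labels, predicted_labels, positive_class='positive'):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_class and p == positive_class for t, p in pairs)
    fp = sum(t != positive_class and p == positive_class for t, p in pairs)
    fn = sum(t == positive_class and p != positive_class for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    # Guard every ratio against an empty denominator.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Made-up labels, for illustration only.
sample_true = ['positive', 'positive', 'negative', 'negative', 'positive']
sample_pred = ['positive', 'negative', 'negative', 'positive', 'positive']
print(binary_metrics(sample_true, sample_pred))
```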

# Unsupervised lexicon-based techniques

- AFINN lexicon
- Bing Liu's lexicon
- MPQA subjectivity lexicon
- SentiWordNet
- VADER lexicon
- Pattern lexicon
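
What these lexicons have in common is that each maps words to sentiment values, which are then aggregated over a text without any training step. The sketch below shows that idea in its simplest form. The `toy_lexicon` and its valences are invented for illustration and are not taken from any of the lexicons listed above. The function names and the zero threshold are assumptions as well.

```python
# Hypothetical toy lexicon; these valences are invented for illustration
# and are not taken from AFINN or any other published lexicon.
toy_lexicon = {'love': 3, 'great': 3, 'good': 2,
               'bad': -2, 'boring': -2, 'hated': -3}

def lexicon_score(text, lexicon):
    """Sum the valence of every token that appears in the lexicon."""
    tokens = text.lower().split()
    # Strip trailing punctuation so that 'story!' matches 'story'.
    return sum(lexicon.get(token.strip('.,!?'), 0) for token in tokens)

def lexicon_sentiment(text, lexicon, threshold=0):
    """Label the text 'positive' if its score exceeds the threshold."""
    return 'positive' if lexicon_score(text, lexicon) > threshold else 'negative'

print(lexicon_score('I really hated the plot of this movie', toy_lexicon))
print(lexicon_sentiment('Great cast and a good story!', toy_lexicon))
```

Real lexicons differ in vocabulary size, value ranges and extra rules (for example for negation or emoticons), so this sketch only conveys the shared principle.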

# AFINN Lexicon

In [None]:
from afinn import Afinn

afn = Afinn(emoticons=True)
afn.score('I really hated the plot of this movie')

In [None]:
afn.score('I really hated the plot of this movie :(')

# SentiWordNet

In [None]:
import nltk
from nltk.corpus import sentiwordnet as swn

# Get synset for 'good'.
good = list(swn.senti_synsets('good', 'n'))[0]

# Print synset sentiment scores.
print('Positive polarity score:', good.pos_score())
print('Negative polarity score:', good.neg_score())
print('Objective score:', good.obj_score())

In [None]:
from module.normalization import normalize_accented_characters, strip_html
import html

def safe_list(l, i=0):
    return l[i] if len(l) > i else None

def analyze_sentiment_sentiwordnet_lexicon(review, verbose=False):
    # Pre-process text.
    review = normalize_accented_characters(review)
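    # Decode only when the input is bytes; a Python 3 str has no decode().
    review = html.unescape(review.decode('utf-8') if isinstance(review, bytes) else review)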
    review = strip_html(review)
    
    # Tokenize and POS tag text tokens.
    text_tokens = nltk.word_tokenize(review)
    tagged_text = nltk.pos_tag(text_tokens)
    pos_score = neg_score = token_count = obj_score = 0
    
    # Get word synsets based on POS tags.
    # Get sentiment scores if synsets are found.
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and swn.senti_synsets(word, 'n'):
            ss_set = safe_list(list(swn.senti_synsets(word, 'n')))
        elif 'VB' in tag and swn.senti_synsets(word, 'v'):
            ss_set = safe_list(list(swn.senti_synsets(word, 'v')))
        elif 'JJ' in tag and swn.senti_synsets(word, 'a'):
            ss_set = safe_list(list(swn.senti_synsets(word, 'a')))
        elif 'RB' in tag and swn.senti_synsets(word, 'r'):
            ss_set = safe_list(list(swn.senti_synsets(word, 'r')))
        
        if ss_set:
            # If senti-synset is found.
            # Add scores for all found synsets.
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
    # Aggregate final scores.
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score)/token_count, 2)
    final_sentiment = 'positive' if norm_final_score > 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        
        # To display results in a nice table.
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, 
                                         norm_pos_score,
                                         norm_neg_score,
                                         norm_final_score]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                                     ['Predicted Sentiment',
                                                                      'Objectivity',
                                                                      'Positive',
                                                                      'Negative',
                                                                      'Overall']],
                                                            codes=[[0,0,0,0,0], [0,1,2,3,4]]))
        print(sentiment_frame)
    return final_sentiment

In [None]:
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print()
    print('Labeled sentiment:', review_sentiment)
    print()
    final_sentiment = analyze_sentiment_sentiwordnet_lexicon(review, verbose=True)
    print('-' * 60)

In [None]:
# Predict sentiment for test movie reviews dataset.
sentiwordnet_predictions = [analyze_sentiment_sentiwordnet_lexicon(review)
                            for review in test_reviews]

# Get model performance statistics.
print('Performance metrics:')
display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=sentiwordnet_predictions,
                           positive_class='positive')
print()
print('Confusion Matrix:')
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=sentiwordnet_predictions,
                         classes=['positive', 'negative'])

print()
print('Classification Report:')
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=sentiwordnet_predictions,
                              classes=['positive', 'negative'])

# VADER Lexicon

VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It combines a sentiment lexicon with a rule-based analysis framework that was built specifically for analyzing sentiment in social media text.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentiment_vader_lexicon(review,
                                    threshold=0.1,
                                    verbose=False):
    # Pre-process text.
    review = normalize_accented_characters(review)
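    # Decode only when the input is bytes; a Python 3 str has no decode().
    review = html.unescape(review.decode('utf-8') if isinstance(review, bytes) else review)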
    review = strip_html(review)
    
    # Analyze the sentiment for review.
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    # Get aggregate scores and final sentiment.
    agg_score = scores['compound']
    
    final_sentiment = 'positive' if agg_score >= threshold else 'negative'
    
    if verbose:
        # Display detailed sentiment statistics.
        positive = str(round(scores['pos'], 2) * 100) + '%'
        final = round(agg_score, 2)
        negative = str(round(scores['neg'], 2) * 100) + '%'
        neutral = str(round(scores['neu'], 2) * 100) + '%'
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive, negative, neutral]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                                     ['Predicted Sentiment',
                                                                      'Polarity Score',
                                                                      'Positive',
                                                                      'Negative',
                                                                      'Neutral']],
                                                            codes=[[0, 0, 0, 0, 0], [0, 1, 2, 3, 4]]))
        print(sentiment_frame)
    return final_sentiment

In [None]:
# Get detailed sentiment statistics.
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print()
    print('Labeled Sentiment:', review_sentiment)
    print()
    final_sentiment = analyze_sentiment_vader_lexicon(review,
                                                      threshold=0.1,
                                                      verbose=True)
    print('-' * 60)

In [None]:
vader_predictions = [analyze_sentiment_vader_lexicon(review, threshold=0.1)
                     for review in test_reviews]

# Get model performance statistics.
print('Performance metrics:')
display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=vader_predictions,
                           positive_class='positive')

print('\nConfusion matrix:')
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=vader_predictions,
                         classes=['positive', 'negative'])

print('\nClassification report:')
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=vader_predictions,
                              classes=['positive', 'negative'])

# Pattern Lexicon

In [None]:
from pattern.en import sentiment, mood, modality

def analyze_sentiment_pattern_lexicon(review, threshold=0.1, verbose=False):
    # Pre-process text.
    review = normalize_accented_characters(review)
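    # Decode only when the input is bytes; a Python 3 str has no decode().
    review = html.unescape(review.decode('utf-8') if isinstance(review, bytes) else review)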
    review = strip_html(review)
    
    # Analyze sentiment for the text document.
    analysis = sentiment(review)
    sentiment_score = round(analysis[0], 2)
    sentiment_subjectivity = round(analysis[1], 2)
    
    # Get final sentiment.
    final_sentiment = 'positive' if sentiment_score >= threshold else 'negative'
    if verbose:
        # Display detailed sentiment statistics.
        sentiment_frame = pd.DataFrame([[final_sentiment, sentiment_score, sentiment_subjectivity]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                                     ['Predicted Sentiment',
                                                                      'Polarity Score',
                                                                      'Subjectivity Score']],
                                                             codes=[[0, 0, 0], [0, 1, 2]]))
        print(sentiment_frame)
        assessment = analysis.assessments
        assessment_frame = pd.DataFrame(assessment,
                                        columns=pd.MultiIndex(levels=[['DETAILED ASSESSMENT STATS:'],
                                                                      ['Key Terms', 'Polarity Score',
                                                                       'Subjectivity Score', 'Type']],
                                                              codes=[[0, 0, 0, 0], [0, 1, 2, 3]]))
        print(assessment_frame)
    return final_sentiment

In [None]:
# Get detailed sentiment statistics.
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print()
    print('Labeled sentiment:', review_sentiment)
    print()
    final_sentiment = analyze_sentiment_pattern_lexicon(review, threshold=0.1, verbose=True)
    print('-' * 60)

In [None]:
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print('Labeled sentiment:', review_sentiment)
    print('Mood:', mood(review))
    mod_score = modality(review)
    print('Modality score:', round(mod_score, 2))
    print('Certainty: ', 'Strong' if mod_score > 0.5 else 'Medium' if mod_score > 0.35 else 'Low')
    print('-' * 60)

In [None]:
# Predict sentiment for test movie reviews dataset.
pattern_predictions = [analyze_sentiment_pattern_lexicon(review, threshold=0.1) for review in test_reviews]

# Get model performance statistics.
print('Performance statistics:')
display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=pattern_predictions,
                           positive_class='positive')

print('\nConfusion matrix:')
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=pattern_predictions,
                         classes=['positive', 'negative'])

print('\nClassification report:')
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=pattern_predictions,
                              classes=['positive', 'negative'])