In [181]:
import nltk
from statistics import mean
from random import shuffle
from sklearn.naive_bayes import (
    BernoulliNB,
    ComplementNB,
    MultinomialNB,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

## NLTK Sentiment Analysis - Data being fetched from NLTK library

* names: A list of common English names compiled by Mark Kantrowitz
* stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions
* state_union: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens
* twitter_samples: A list of social media phrases posted to Twitter
* movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
* averaged_perceptron_tagger: A data model that NLTK uses to categorize words into their part of speech
* vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
* punkt: A data model created by Jan Strunk that NLTK uses to split full texts into word lists

In [3]:
# nltk.download([
#      "names",
#      "stopwords",
#      "state_union",
#      "twitter_samples",
#      "movie_reviews",
#      "averaged_perceptron_tagger",
#      "vader_lexicon",
#      "punkt",
# ])

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\names.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\state_union.zip.
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_

True

### Fetching script from state_union collection from NLTK

In [4]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

In [9]:
stopwords = nltk.corpus.stopwords.words("english")

### Stopwords occur too often in the sentences, thus leaving a negative impact on the sentiment analysis, thus we remove them from our scipts. Also converting all the script to lowecase to avoid mixed cases.

In [11]:
words = [w for w in words if w.lower() not in stopwords]

### Now we determine the frequency of words and create a dictionary of key->words and value->frequency and run analysis on it

In [15]:
fd = nltk.FreqDist(words)

In [23]:
fd.most_common(3)

[('must', 1568), ('people', 1291), ('world', 1128)]

In [27]:
fd.tabulate(6)

   must  people   world    year America      us 
   1568    1291    1128    1097    1076    1049 


In [29]:
lower_fd = nltk.FreqDist([w.lower() for w in fd]) # Converts all types of cases to lower case!

## Extracting Concordance and Collocations
In the context of NLP, a `concordance` is a collection of word locations along with their context. You can use concordances to find:

* How many times a word appears
* Where each occurrence appears
* What words surround each occurrence

In NLTK, you can do this by calling `concordance()` To use it, you need an instance of the `nltk.Text` class, which can also be constructed with a word list.

Another powerful feature of NLTK is its ability to quickly find `collocations` with simple function calls. Collocations are series of words that frequently appear together in a given text. In the State of the Union corpus, for example, you’d expect to find the words United and States appearing next to each other very often. Those two words appearing together is a collocation.

Collocations can be made up of two or more words. NLTK provides classes to handle several types of collocations:

* Bigrams: Frequent two-word combinations
* Trigrams: Frequent three-word combinations
* Quadgrams: Frequent four-word combinations

In [125]:
text = """Luck is not a matter of chance. Lucky You!
Thomas Jefferson once said, "I'm a great believer in luck, and I find the harder I work, the more I have of it." What, though, is luck? Webster's dictionary suggests that luck is the "events or circumstances that operate for or against an individual."
In truth, luck has nothing to do with something operating for or against you. Luck is not a matter of chance. It is a matter of being open to new experiences, perseverance, hard work, and positive thinking.
When seventeen-year-old Steven Spielberg spent some time with his cousin in the summer of 1965, they toured Universal pictures. The tram stopped at none of the sound stages. Spielberg snuck off on a bathroom break to watch a bit of the real action. When he encountered an unfamiliar face who demanded to know what he was doing, he told him his story. The man turned out to be the head of the editorial department. Spielberg got a pass to the lot for the very next day and showed a very impressed Chuck Silvers four of his eight-millimeter films. This event was the foot in the door Spielberg needed to start squatting on the lot, a decision that led to his first contract with Universal Studios."""

In [86]:
words: list[str] = nltk.word_tokenize(text)

In [87]:
words = [w for w in words if w.isalpha()]

In [88]:
words = [w for w in words if w.lower() not in stopwords]

In [89]:
text = nltk.Text(words)

In [90]:
fd = text.vocab() # same as FreqDist

In [91]:
fd.tabulate(3)

     luck Spielberg    matter 
        4         4         3 


In [94]:
concordance_list = text.concordance_list("luck")

In [95]:
for entry in concordance_list:
    print(entry.line)

 Luck matter chance Lucky Thomas Jefferson
Thomas Jefferson said great believer luck find harder work though luck Webster
eliever luck find harder work though luck Webster dictionary suggests luck eve
ugh luck Webster dictionary suggests luck events circumstances operate individ
rcumstances operate individual truth luck nothing something operating Luck mat
uth luck nothing something operating Luck matter chance matter open new experi


In [100]:
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

In [101]:
finder.ngram_fd.most_common(2)

[(('Luck', 'matter', 'chance'), 2), (('matter', 'chance', 'Lucky'), 1)]

In [102]:
finder.ngram_fd.tabulate(2)

 ('Luck', 'matter', 'chance') ('matter', 'chance', 'Lucky') 
                            2                             1 


## Using NLTK’s Pre-Trained Sentiment Analyzer
NLTK already has a built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

Since VADER is pretrained, you can get results more quickly than with many other analyzers. However, VADER is best suited for language used in social media, like short sentences with some slang and abbreviations. It’s less accurate when rating longer, structured sentences, but it’s often a good launching point.

In [103]:
from nltk.sentiment import SentimentIntensityAnalyzer

The negative, neutral, and positive scores are related: They all add up to 1 and can’t be negative. The compound score is calculated differently. It’s not just an average, and it can range from -1 to 1.

In [121]:
def is_positive(string: str):
    sia = SentimentIntensityAnalyzer()
    return sia.polarity_scores(string)["compound"] > 0

In [122]:
is_positive("The lion ate the cat!")

False

In [123]:
is_positive("She escaped successfully!")

True

## Customizing NLTK’s Sentiment Analysis - Selecting useful features!
NLTK offers a few built-in classifiers that are suitable for various types of analyses, including sentiment analysis. The trick is to figure out which properties of your dataset are useful in classifying each piece of data into your desired categories.

In the world of machine learning, these data properties are known as features, which you must reveal and select as you work with your data. While this tutorial won’t dive too deeply into feature selection and feature engineering, you’ll be able to see their effects on the accuracy of classifiers.


In [182]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

In [183]:
def extract_features(text):
    features = dict()
    wordcount = 0
    compound_scores = list()
    positive_scores = list()

    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            if word.isalpha():
                if word.lower() in positive_words:
                    wordcount += 1
        compound_scores.append(sia.polarity_scores(sentence)["compound"])
        positive_scores.append(sia.polarity_scores(sentence)["pos"])

    # Adding 1 to the final compound score to always have positive numbers
    # since some classifiers you'll use later don't work with negative numbers.
    features["mean_compound"] = mean(compound_scores) + 1
    features["mean_positive"] = mean(positive_scores)
    features["wordcount"] = wordcount

    return features

In [184]:
features = extract_features(text)

In [185]:
features

{'mean_compound': 1.26714,
 'mean_positive': 0.20626666666666668,
 'wordcount': 62}

In [186]:
features = [
    (extract_features(text), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (extract_features(text), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])

In [187]:
train_count = len(features) // 4
shuffle(features)
classifier = nltk.NaiveBayesClassifier.train(features[:train_count])
classifier.show_most_informative_features(10)

Most Informative Features
           mean_compound = 1.26714           neg : pos    =      1.0 : 1.0
           mean_positive = 0.20626666666666668    neg : pos    =      1.0 : 1.0
               wordcount = 62                neg : pos    =      1.0 : 1.0


In [188]:
nltk.classify.accuracy(classifier, features[train_count:])

0.49266666666666664

In [189]:
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "ComplementNB": ComplementNB(),
    "MultinomialNB": MultinomialNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression(),
    "MLPClassifier": MLPClassifier(max_iter=1000),
    "AdaBoostClassifier": AdaBoostClassifier(),
}

#### For each scikit-learn classifier, call nltk.classify.SklearnClassifier to create a usable NLTK classifier that can be trained and evaluated exactly like you’ve seen before with nltk.NaiveBayesClassifier and its other built-in classifiers. The .train() and .accuracy() methods should receive different portions of the same list of features.

In [190]:
train_count = len(features) // 4
shuffle(features)
for name, sklearn_classifier in classifiers.items():
    classifier = nltk.classify.SklearnClassifier(sklearn_classifier)
    classifier.train(features[:train_count])
    accuracy = nltk.classify.accuracy(classifier, features[train_count:])
    print(F"{accuracy:.2%} - {name}")

49.60% - BernoulliNB
49.60% - ComplementNB
49.60% - MultinomialNB
50.40% - KNeighborsClassifier
49.60% - DecisionTreeClassifier
49.60% - RandomForestClassifier
49.60% - LogisticRegression
49.60% - MLPClassifier
49.60% - AdaBoostClassifier


#### Highest score is with KNeighbours Classifier, thus trying with full feature set on KNN

In [197]:
train_count = int(len(features) *0.95) 
shuffle(features)

classifier = nltk.classify.SklearnClassifier(KNeighborsClassifier())
classifier.train(features[:train_count])
accuracy = nltk.classify.accuracy(classifier, features[len(features)-train_count:])
print(F"{accuracy:.2%} - KNN")

49.79% - KNN
