Source:

https://realpython.com/python-nltk-sentiment-analysis/

A quick way to download specific resources directly from the console is to pass a list to nltk.download():

In [1]:
import nltk

nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
])

[nltk_data] Downloading package names to /home/ali/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to /home/ali/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to /home/ali/nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/ali/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/ali/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ali/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/ali/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date

True

This will tell NLTK to find and download each resource based on its identifier.

Should NLTK require additional resources that you haven’t installed, you’ll see a helpful LookupError with details and instructions to download the resource:

In [7]:
# w = nltk.corpus.shakespeare.words()  # raises LookupError if the corpus isn't installed

The LookupError specifies which resource is necessary for the requested operation along with instructions to download it using its identifier.
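One way to handle a missing resource programmatically is to catch the LookupError and download the package on demand. The sketch below is an illustration rather than part of the original tutorial: `ensure_resource` is a hypothetical helper, and the finder and downloader are passed in as callables so the pattern can be tried without network access. With NLTK, you would likely pass `nltk.data.find` and `nltk.download`.

```python
from typing import Callable


def ensure_resource(
    resource_path: str,
    package_id: str,
    find: Callable[[str], object],
    download: Callable[[str], object],
) -> bool:
    """Return True if a download was triggered, False if already present.

    Hypothetical helper: `find` must raise LookupError when the resource
    is missing, which is how nltk.data.find behaves.
    """
    try:
        find(resource_path)
    except LookupError:
        download(package_id)
        return True
    return False


# Assumed usage with NLTK (not executed here):
#     import nltk
#     ensure_resource("corpora/stopwords", "stopwords",
#                     nltk.data.find, nltk.download)
```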

### Compiling Data



In [8]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
words

['PRESIDENT',
 'HARRY',
 'S',
 'TRUMAN',
 'S',
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 'April',
 'Mr',
 'Speaker',
 'Mr',
 'President',
 'Members',
 'of',
 'the',
 'Congress',
 'It',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 'my',
 'friends',
 'and',
 'colleagues',
 'in',
 'the',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 'Only',
 'yesterday',
 'we',
 'laid',
 'to',
 'rest',
 'the',
 'mortal',
 'remains',
 'of',
 'our',
 'beloved',
 'President',
 'Franklin',
 'Delano',
 'Roosevelt',
 'At',
 'a',
 'time',
 'like',
 'this',
 'words',
 'are',
 'inadequate',
 'The',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 'Yet',
 'in',
 'this',
 'decisive',
 'hour',
 'when',
 'world',
 'events',
 'are',
 'moving',
 'so',
 'rapidly',
 'our',
 'silence',
 'might',
 'be',
 'misunderstood',
 'and',
 'might',
 'give',
 'comfort',
 'to',
 'our',
 'enemies',
 'In',
 'His',
 'infi

NLTK provides a small corpus of stop words that you can load into a list:

In [9]:
stopwords = nltk.corpus.stopwords.words("english")

Now you can remove stop words from your original word list:

In [10]:
words = [w for w in words if w.lower() not in stopwords]
words

['PRESIDENT',
 'HARRY',
 'TRUMAN',
 'ADDRESS',
 'JOINT',
 'SESSION',
 'CONGRESS',
 'April',
 'Mr',
 'Speaker',
 'Mr',
 'President',
 'Members',
 'Congress',
 'heavy',
 'heart',
 'stand',
 'friends',
 'colleagues',
 'Congress',
 'United',
 'States',
 'yesterday',
 'laid',
 'rest',
 'mortal',
 'remains',
 'beloved',
 'President',
 'Franklin',
 'Delano',
 'Roosevelt',
 'time',
 'like',
 'words',
 'inadequate',
 'eloquent',
 'tribute',
 'would',
 'reverent',
 'silence',
 'Yet',
 'decisive',
 'hour',
 'world',
 'events',
 'moving',
 'rapidly',
 'silence',
 'might',
 'misunderstood',
 'might',
 'give',
 'comfort',
 'enemies',
 'infinite',
 'wisdom',
 'Almighty',
 'God',
 'seen',
 'fit',
 'take',
 'us',
 'great',
 'man',
 'loved',
 'beloved',
 'humanity',
 'man',
 'could',
 'possibly',
 'fill',
 'tremendous',
 'void',
 'left',
 'passing',
 'noble',
 'soul',
 'words',
 'ease',
 'aching',
 'hearts',
 'untold',
 'millions',
 'every',
 'race',
 'creed',
 'color',
 'world',
 'knows',
 'lost',
 'he
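Because the stop word list is a plain Python list, each `not in` test scans it from the start. A minimal sketch of the same filter with a set, using a small made-up sample of stop words rather than NLTK's actual list:

```python
# Hypothetical sample; in the notebook this would be
# set(nltk.corpus.stopwords.words("english")).
stop_set = {"a", "is", "it", "with", "that", "i", "before", "you"}

words = ["It", "is", "with", "a", "heavy", "heart", "that", "I", "stand"]

# Lowercase each word only for the comparison so the original casing survives.
filtered = [w for w in words if w.lower() not in stop_set]
print(filtered)  # ['heavy', 'heart', 'stand']
```

The result is the same as with a list; only the cost of each membership test changes, which tends to matter on corpora of this size.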

This is how `word_tokenize()` splits raw text into tokens:

In [11]:
from pprint import pprint

text = """
 For some quick analysis, creating a corpus could be overkill.
 If all you need is a word list,
 there are simpler ways to achieve that goal."""
pprint(nltk.word_tokenize(text), width=79, compact=True)

['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could',
 'be', 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list',
 ',', 'there', 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']


### Creating Frequency Distributions

To build a frequency distribution with NLTK, construct the nltk.FreqDist class with a word list:

In [26]:
text = nltk.Text(nltk.corpus.state_union.words())

In [27]:
# words: list[str] = nltk.word_tokenize(text)
words = [w for w in text if w.lower() not in stopwords]
words = [w for w in words if w.isalpha()]
fd = nltk.FreqDist(words)
fd

FreqDist({'must': 1568, 'people': 1291, 'world': 1128, 'year': 1097, 'America': 1076, 'us': 1049, 'new': 1049, 'Congress': 1014, 'years': 827, 'American': 784, ...})

After building the object, you can use methods like `.most_common()` and `.tabulate()` to start visualizing information:

In [28]:
fd.most_common(3)

[('must', 1568), ('people', 1291), ('world', 1128)]

In [29]:
fd.tabulate(3)

  must people  world 
  1568   1291   1128 


In [30]:
fd["America"]

1076

In [31]:
fd["america"]

0

In [32]:
fd["AMERICA"]

3

Try creating a new frequency distribution that’s based on the initial one but normalizes all words to lowercase:

In [33]:
lower_fd = nltk.FreqDist([w.lower() for w in fd])
lower_fd

FreqDist({'world': 3, 'year': 3, 'new': 3, 'congress': 3, 'peace': 3, 'federal': 3, 'program': 3, 'government': 3, 'war': 3, 'economic': 3, ...})

In [35]:
lower_fd.tabulate(3)

world  year   new 
    3     3     3 
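The counts of 3 above deserve a second look. Iterating over a frequency distribution yields each distinct word once, so `lower_fd` counts how many case variants of a word exist (such as `America` and `AMERICA`) rather than how often the word occurs. `nltk.FreqDist` subclasses `collections.Counter`, so the following sketch uses a plain `Counter` on a made-up word list to show one way to carry the original counts over:

```python
from collections import Counter

# Made-up sample standing in for the State of the Union word list.
words = ["America", "AMERICA", "America", "world", "World", "america"]
fd = Counter(words)

# Iterating over fd visits each distinct key once, so this counts case variants.
variants = Counter(w.lower() for w in fd)

# Summing the original counts per lowercased key preserves word frequencies.
lower_fd = Counter()
for word, count in fd.items():
    lower_fd[word.lower()] += count

print(variants)  # Counter({'america': 3, 'world': 2})
print(lower_fd)  # Counter({'america': 4, 'world': 2})
```

In the notebook, lowercasing the word list itself before building the distribution, as in `nltk.FreqDist(w.lower() for w in words)`, should have the same effect as the summing loop.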


### Extracting Concordance and Collocations

In [36]:
text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

Displaying 5 of 1079 matches:
 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
 to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to 


In [37]:
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
    print(entry.line)

 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace


In [38]:
words: list[str] = nltk.word_tokenize(
     """Beautiful is better than ugly.
     Explicit is better than implicit.
     Simple is better than complex."""
)
text = nltk.Text(words)
fd = text.vocab()  # Equivalent to fd = nltk.FreqDist(words)
fd.tabulate(3)

    is better   than 
     3      3      3 


In [39]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

In [40]:
finder.ngram_fd.most_common(2)

[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]

In [41]:
finder.ngram_fd.tabulate(2)

  ('the', 'United', 'States') ('the', 'American', 'people') 
                          294                           185 
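Conceptually, `ngram_fd` is a frequency distribution over every run of three consecutive words. A rough standard-library sketch of that counting step, on a made-up sentence and without any of the finder's scoring or filtering features:

```python
from collections import Counter

words = "the United States and the American people of the United States".split()

# zip over three staggered views of the list yields consecutive triples.
trigrams = Counter(zip(words, words[1:], words[2:]))

print(trigrams.most_common(1))
# [(('the', 'United', 'States'), 2)]
```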


### Using NLTK’s Pre-Trained Sentiment Analyzer

In [42]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")

{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

You’ll get back a dictionary of different scores. The negative, neutral, and positive scores are related: They all add up to 1 and can’t be negative. The compound score is calculated differently. It’s not just an average, and it can range from -1 to 1.
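For intuition on why the compound score stays between -1 and 1: VADER sums the valence of the words in the text and then squashes that sum with a normalization function. The sketch below approximates that step based on the VADER reference implementation. Treat the constant `alpha=15` as an assumption to verify against your installed version, and note that the input sums are made-up numbers, not real lexicon values.

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, norm))


for raw_sum in (-8.0, -1.0, 0.0, 1.0, 8.0):  # made-up valence sums
    print(f"{raw_sum:+.1f} -> {normalize(raw_sum):+.4f}")
```

Larger sums move the result closer to the bounds without reaching them, which is why strongly worded text saturates near -1 or 1.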

Now you’ll put it to the test against real data using two different corpora. First, load the twitter_samples corpus into a list of strings, making a replacement to render URLs inactive to avoid accidental clicks:

In [43]:
tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

In [44]:
from random import shuffle

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

> False RT @HouseOfTraitors: Why is @jo_coburn trying to argue with FARAGE on every question ?

I didn't see Dimbelby do that to Camoron, Clegg, Mi…
> False RT @LordAshcroft: Panelbase poll LAB 34% CON 32% UKIP 17% LDEM 8% GRNS 4%
> True RT @BBCPropaganda: Even Nick Robinson struggling to spin it as a Miliband win. Cameron must have won by a mile. #BBCNews
> True RT @earthygirl01: #bbbqt What a strong and commanding performance from Ed Miliband tonight Not like #ChickenDave @CCHQPress #HellYesEd http…
> True Especially for three of you, LASS :) w/ Aling http//t.co/aRwmTLsFZr
> False RT @Le_Figaro: David Cameron domine le dernier débat de la campagne électorale britannique http//t.co/9KXtl3ekWB
> True RT @edballsmp: Tonight confirmed it: David Cameron and the Tories will cut child benefit if they win next week #bbcqt http//t.co/IHePV7fFLJ
> False @SynergyFlying No!!! Why did you delete me?:(
> False @paul7day She can bloody stay in Scotland Fuck Labour and Milliband  the lying bastard
>

In [45]:
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

In [46]:
from statistics import mean

def is_positive(review_id: str) -> bool:
    """True if the average of all sentence compound scores is positive."""
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [
        sia.polarity_scores(sentence)["compound"]
        for sentence in nltk.sent_tokenize(text)
    ]
    return mean(scores) > 0

In [47]:
shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_review_ids:
            correct += 1
    else:
        if review_id in negative_review_ids:
            correct += 1

print(f"{correct / len(all_review_ids):.2%} correct")

64.00% correct
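The same accuracy check can be written as a comparison between each prediction and the known label. Here is a small sketch with a hypothetical `accuracy` helper and a stand-in predicate in place of `is_positive`, since VADER and the corpus aren't loaded in this example:

```python
from typing import Callable, Iterable


def accuracy(
    ids: Iterable[str],
    predict: Callable[[str], bool],
    positive_ids: set[str],
) -> float:
    """Fraction of ids whose predicted label matches the known label."""
    ids = list(ids)
    correct = sum(predict(i) == (i in positive_ids) for i in ids)
    return correct / len(ids)


# Made-up ids; the stand-in predicate calls anything starting with "pos" positive.
positive = {"pos/a.txt", "pos/b.txt"}
all_ids = ["pos/a.txt", "pos/b.txt", "neg/c.txt", "neg/d.txt"]
print(f"{accuracy(all_ids, lambda i: i.startswith('pos'), positive):.2%} correct")
```

Passing the positive IDs as a set keeps each membership test cheap. In the notebook, that would mean wrapping `positive_review_ids` in `set()`.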


### Customizing NLTK's Sentiment Analysis

#### Selecting Useful Features

In [49]:
unwanted = nltk.corpus.stopwords.words('english')
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

Now you’re ready to create the frequency distributions for your custom feature. Since many words are present in both positive and negative sets, begin by finding the common set so you can remove it from the distribution objects:

In [50]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
    del positive_fd[word]
    del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}
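To see what removing the common set does, here is a toy version using `collections.Counter` (which `nltk.FreqDist` subclasses) and made-up word lists:

```python
from collections import Counter

positive_fd = Counter(["great", "great", "fun", "plot", "moving"])
negative_fd = Counter(["dull", "plot", "dull", "fun", "boring"])

# Words seen in both classes carry little signal on their own.
common = set(positive_fd) & set(negative_fd)
for word in common:
    del positive_fd[word]
    del negative_fd[word]

print(sorted(common))       # ['fun', 'plot']
print(sorted(positive_fd))  # ['great', 'moving']
print(sorted(negative_fd))  # ['boring', 'dull']
```

Note that this drops a shared word entirely, even when it appears far more often in one class than in the other, so it's a deliberately coarse filter.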

In [51]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
    if w.isalpha() and w not in unwanted
])
negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
    if w.isalpha() and w not in unwanted
])

### Training and Using a Classifier

In [52]:
def extract_features(text):
    features = dict()
    wordcount = 0
    compound_scores = list()
    positive_scores = list()

    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            if word.lower() in top_100_positive:
                wordcount += 1
        # Score each sentence once and reuse the result for both features
        scores = sia.polarity_scores(sentence)
        compound_scores.append(scores["compound"])
        positive_scores.append(scores["pos"])

    # Adding 1 to the final compound score to always have positive numbers
    # since some classifiers you'll use later don't work with negative numbers
    features["mean_compound"] = mean(compound_scores) + 1
    features["mean_positive"] = mean(positive_scores)
    features["wordcount"] = wordcount

    return features

In order to train and evaluate a classifier, you’ll need to build a list of features for each text you’ll analyze:

In [53]:
features = [
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])

In [56]:
# Use 1/4 of the set for training
train_count = len(features) // 4
shuffle(features)
classifier = nltk.NaiveBayesClassifier.train(features[:train_count])
classifier.show_most_informative_features(10)

Most Informative Features
               wordcount = 2                 pos : neg    =      4.9 : 1.0
               wordcount = 0                 neg : pos    =      1.7 : 1.0
               wordcount = 1                 pos : neg    =      1.3 : 1.0


In [57]:
nltk.classify.accuracy(classifier, features[train_count:])

0.668
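Slicing a shuffled list is what separates the training data from the evaluation data here. A minimal sketch of that split as a hypothetical helper, with a seeded `random.Random` added (an assumption, not in the original) so the split is repeatable:

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_train_test(
    items: Sequence[T], train_fraction: float = 0.25, seed: int = 0
) -> tuple[list[T], list[T]]:
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train_count = int(len(shuffled) * train_fraction)
    return shuffled[:train_count], shuffled[train_count:]


train, test = split_train_test(range(2000))
print(len(train), len(test))  # 500 1500
```

Working on a copy leaves the caller's list in its original order, unlike the in-place `shuffle(features)` used in the notebook.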

### Comparing Additional Classifiers

In [64]:
from sklearn.naive_bayes import (
    BernoulliNB,
    ComplementNB,
    MultinomialNB,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [65]:
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "ComplementNB": ComplementNB(),
    "MultinomialNB": MultinomialNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression(),
    "MLPClassifier": MLPClassifier(max_iter=1000),
    "AdaBoostClassifier": AdaBoostClassifier(),
}


In [66]:
# Use 1/4 of the set for training
train_count = len(features) // 4
shuffle(features)
for name, sklearn_classifier in classifiers.items():
    classifier = nltk.classify.SklearnClassifier(sklearn_classifier)
    classifier.train(features[:train_count])
    accuracy = nltk.classify.accuracy(classifier, features[train_count:])
    print(f"{accuracy:.2%} - {name}")


67.67% - BernoulliNB
66.93% - ComplementNB
67.20% - MultinomialNB
70.47% - KNeighborsClassifier
63.07% - DecisionTreeClassifier
69.53% - RandomForestClassifier
73.60% - LogisticRegression
74.00% - MLPClassifier
69.67% - AdaBoostClassifier
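With several results in hand, sorting them makes the comparison easier to read. The sketch below hard-codes the accuracies printed above. A fresh run will give different numbers because the data is shuffled each time, so the ranking itself shouldn't be read as definitive.

```python
# Accuracies copied from the output above.
results = {
    "BernoulliNB": 0.6767,
    "ComplementNB": 0.6693,
    "MultinomialNB": 0.6720,
    "KNeighborsClassifier": 0.7047,
    "DecisionTreeClassifier": 0.6307,
    "RandomForestClassifier": 0.6953,
    "LogisticRegression": 0.7360,
    "MLPClassifier": 0.7400,
    "AdaBoostClassifier": 0.6967,
}

# Highest accuracy first.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{acc:.2%} - {name}")

best = max(results, key=results.get)
print("Best:", best)  # Best: MLPClassifier
```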
