# Sentiment Classification with NLTK

**Objective**: Split and prepare Amazon food product reviews so that the Natural Language Toolkit (NLTK) can train a sentiment classifier that distinguishes positive from negative reviews.

**Data Source**: Grocery and Gourmet Food section of [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/index.html) from Jianmo Ni, UCSD

**Overview**:
- Divide and label the data based on type (positive vs negative)
- Prepare the data for analysis through tokenization and removing stop words and punctuation
- Find frequency distribution of words
- Train and test a classifier to determine positive versus negative reviews
- Evaluate the classifier and check most informative features

In [1]:
import pandas as pd
import random
import string
import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords
from nltk import classify
from nltk import NaiveBayesClassifier

Import the main dataset for splitting. The file stores one JSON record per line, so we load it into a pandas dataframe with `lines=True`.

In [2]:
df = pd.read_json('data/Grocery_and_Gourmet_Food_5.json', lines=True)

In [3]:
df.head()

| | overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime | vote | style | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | True | 11 19, 2014 | A1QVBUH9E1V6I8 | 4639725183 | Jamshed Mathur | No adverse comment. | Five Stars | 1416355200 | | | |
| 1 | 5 | True | 10 13, 2016 | A3GEOILWLK86XM | 4639725183 | itsjustme | Gift for college student. | Great product. | 1476316800 | | | |
| 2 | 5 | True | 11 21, 2015 | A32RD6L701BIGP | 4639725183 | Krystal Clifton | If you like strong tea, this is for you. It mi... | Strong | 1448064000 | | | |
| 3 | 5 | True | 08 12, 2015 | A2UY1O1FBGKIE6 | 4639725183 | U. Kane | Love the tea. The flavor is way better than th... | Great tea | 1439337600 | | | |
| 4 | 5 | True | 05 28, 2015 | A3QHVBQYDV7Z6U | 4639725183 | The Nana | I have searched everywhere until I browsed Ama... | This is the tea I remembered! | 1432771200 | | | |


The main information we will use is the 'overall' score column and the 'reviewText' column. We will create a new dataframe with those first before splitting.

In [4]:
condensed_df = df[['overall','reviewText']].copy()

In [5]:
condensed_df.head()

| | overall | reviewText |
|---|---|---|
| 0 | 5 | No adverse comment. |
| 1 | 5 | Gift for college student. |
| 2 | 5 | If you like strong tea, this is for you. It mi... |
| 3 | 5 | Love the tea. The flavor is way better than th... |
| 4 | 5 | I have searched everywhere until I browsed Ama... |


In [6]:
condensed_df['overall'].unique()

array([5, 4, 3, 1, 2], dtype=int64)

The 'overall' column has five possible ratings. We will treat scores of 1 and 2 as negative reviews and scores of 4 and 5 as positive. Reviews with a score of 3 fall somewhere in between, so we omit them from classification.

In [7]:
pos_reviews = condensed_df.loc[condensed_df['overall'] > 3]

In [8]:
neg_reviews = condensed_df.loc[condensed_df['overall'] < 3]

Label each subset with its reaction before combining the two into one dataset. The label is used later to draw an evenly split sample.

In [9]:
pos_reviews.insert(2, 'reaction', 'positive')

In [10]:
neg_reviews.insert(2, 'reaction', 'negative')
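
The filter-then-label steps above amount to a mapping from star rating to sentiment label. A minimal sketch of that mapping as a plain function follows; the name `score_to_reaction` is hypothetical and not part of the notebook.

```python
def score_to_reaction(score):
    """Map a 1-5 star rating to a sentiment label; 3 is treated as neutral."""
    if score > 3:
        return 'positive'
    if score < 3:
        return 'negative'
    return None  # neutral reviews are left unlabeled


ratings = [5, 4, 3, 1, 2]
labels = [score_to_reaction(r) for r in ratings]
print(labels)
# → ['positive', 'positive', None, 'negative', 'negative']
```

With pandas, a function like this could be passed to `Series.map` on the 'overall' column, after which the unlabeled rows would be dropped.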

Create a combined dataframe of the positive and negative reviews.

In [11]:
combined_reviews = pd.concat([pos_reviews, neg_reviews], ignore_index=True)

In [12]:
combined_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1063154 entries, 0 to 1063153
Data columns (total 3 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   overall     1063154 non-null  int64 
 1   reviewText  1062768 non-null  object
 2   reaction    1063154 non-null  object
dtypes: int64(1), object(2)
memory usage: 24.3+ MB


In [13]:
combined_reviews['reviewText'].isnull().sum()

386

We can see that there are some reviews that do not contain text. We can remove those since they will not provide any information we can use for the classifying task.

In [14]:
reviews = combined_reviews[combined_reviews['reviewText'].notna()]

In [15]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1062768 entries, 0 to 1063153
Data columns (total 3 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   overall     1062768 non-null  int64 
 1   reviewText  1062768 non-null  object
 2   reaction    1062768 non-null  object
dtypes: int64(1), object(2)
memory usage: 32.4+ MB


*Optional:* Save the reviews dataframe to a CSV at this point, for further analysis outside this notebook.

In [16]:
#reviews.to_csv('data/combined_reviews.csv')

The combined dataframe has over 1 million records. We will sample 20,000 of them for the classifier: 10,000 positive and 10,000 negative reviews.

In [17]:
sample_df = reviews.groupby('reaction').apply(lambda x: x.sample(n=10000)).reset_index(drop = True)

In [18]:
sample_df['reaction'].value_counts()

negative    10000
positive    10000
Name: reaction, dtype: int64
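
The sample above is drawn without a fixed seed, so each run selects different reviews. The sketch below shows the same per-label sampling idea in plain Python with a seed, so the result is repeatable. The helper `balanced_sample` and the toy rows are assumptions for illustration. In pandas, `DataFrame.sample` accepts a `random_state` argument for the same purpose.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, n_per_label, seed=0):
    """Draw n_per_label rows per label without replacement, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for label in sorted(by_label):  # fixed order keeps the draw deterministic
        sample.extend(rng.sample(by_label[label], n_per_label))
    return sample


# Toy, deliberately imbalanced data: 50 positive rows and 8 negative rows.
rows = ([{'reviewText': f'good {i}', 'reaction': 'positive'} for i in range(50)]
        + [{'reviewText': f'bad {i}', 'reaction': 'negative'} for i in range(8)])

picked = balanced_sample(rows, 'reaction', n_per_label=5, seed=42)
print(len(picked))
# → 10
```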

In [19]:
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   overall     20000 non-null  int64 
 1   reviewText  20000 non-null  object
 2   reaction    20000 non-null  object
dtypes: int64(1), object(2)
memory usage: 468.9+ KB


From this data, we will create a list of positive reviews and another of the negative reviews.

In [20]:
pos_df = sample_df.loc[sample_df['reaction'] == 'positive']
pos_list = pos_df['reviewText'].tolist()

neg_df = sample_df.loc[sample_df['reaction'] == 'negative']
neg_list = neg_df['reviewText'].tolist()

Preview the first 10 records of each list.

In [21]:
pos_list[:10]

['Handy snack',
 "So happy to have finally found a shelf stable until opened replacement for half and half!. Actually love the flavor better than half and half now too.  I've also tried the Vanilla Nut Pods...very nice and pleasing vanilla without being overbearing.  You do not need as much Nut Pods to cream as real half and half so if you judge if you've added enough don't worry it looks darker than the same amount of half and half.  I'm organic but I require good flavor plus healthy products.",
 'Always a great product and well we enjoy them',
 "These are fabulous.  Not a diet food, just  a healthy delicious snack. It's  scrumptious.",
 'Exactly as described!',
 'Love this yogurt! Become my favorite!',
 'good product for a fair price and conveniently delivered.',
 "I can bake a lot of things with this product and the taste it's really good!! I like the texture and flavor combined with sucralose and my hubby love it too, considering that he is really picky to eat sweet things.\nI'll b

In [22]:
neg_list[:10]

['not a fan of this kind of taste.',
 "This is good cocoa, but you'd almost have better luck just opening the K-cup and stirring it in. I have found that if I make it as per the box directions, I'll have about 1/4 of the cocoa still inside the cup.\n\nAlso, you do not want to do this on the large cup size - it will be so diluted that will not want to drink it. I make it using the middle sized cup and use 2 k-cups.\n\nI got this on sale for $7, so it wasn't too bad of a price, but I think in the future I will just stick to buying a container of hot cocoa.",
 'I bought three different colors of candy melts and this one did not melt. It remained clumpy and thick. Needless to say it was a waste for our Easter treats!',
 "in my struggle to bake with less carbs, these mixes upset my stomach and I've not returned to them after my first disasterous attempts.  The sugar free cakes look like regular cakes when baked, and they taste good and go down well, but I found they cause me gastric distres

To make things cleaner for the classifier, we will lowercase all the words in the lists.

In [23]:
# Each list element is a full review, not a single word.
pos_list_lowered = [review.lower() for review in pos_list]
neg_list_lowered = [review.lower() for review in neg_list]

In [24]:
pos_list_lowered[:1]

['handy snack']

In [25]:
neg_list_lowered[:1]

['not a fan of this kind of taste.']

We will use a 'bag of words' approach to find indicator words for our classifier: the words from all reviews of a given class are pooled, ignoring their order, and then used to train the classifier. The reviews are currently stored as lists, so we first join each list into a single string.

In [26]:
# Join all reviews of each class into one space-separated string.
pos_list_to_string = ' '.join(pos_list_lowered)
neg_list_to_string = ' '.join(neg_list_lowered)

Next, we'll remove stop words and punctuation. The set of tokens to drop combines NLTK's English stop words with the characters in `string.punctuation`.

In [27]:
stop = set(stopwords.words('english') + list(string.punctuation))

To tokenize the data, we will use NLTK's `WhitespaceTokenizer`. It splits only on whitespace, which preserves any contractions used in the reviews.

In [28]:
tokenizer = WhitespaceTokenizer()
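
To illustrate what splitting on whitespace keeps and what it leaves behind, the sketch below uses Python's built-in `str.split()` as a stand-in for the tokenizer. That the two behave the same on this input is an assumption, and the sample sentence is made up.

```python
text = "i've tried it, and it doesn't taste good!"

# Splitting on whitespace keeps contractions whole,
# but punctuation stays attached to the neighbouring word.
tokens = text.split()
print(tokens)
# → ["i've", 'tried', 'it,', 'and', 'it', "doesn't", 'taste', 'good!']
```

Tokens such as 'it,' and 'good!' will not match entries in a stop-word set, which is why a later cleanup step is needed.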

In [29]:
filtered_pos_list = [w for w in tokenizer.tokenize(pos_list_to_string) if w not in stop]
filtered_neg_list = [w for w in tokenizer.tokenize(neg_list_to_string) if w not in stop]

In [30]:
filtered_pos_list[:30]

['handy',
 'snack',
 'happy',
 'finally',
 'found',
 'shelf',
 'stable',
 'opened',
 'replacement',
 'half',
 'half!.',
 'actually',
 'love',
 'flavor',
 'better',
 'half',
 'half',
 'too.',
 "i've",
 'also',
 'tried',
 'vanilla',
 'nut',
 'pods...very',
 'nice',
 'pleasing',
 'vanilla',
 'without',
 'overbearing.',
 'need']

Strip punctuation attached to the start or end of each token with `str.strip`. Punctuation inside a token, as in 'pods...very', is left untouched.

In [31]:
filtered_pos_list2 = [w.strip(string.punctuation) for w in filtered_pos_list]

In [32]:
filtered_pos_list2[:30]

['handy',
 'snack',
 'happy',
 'finally',
 'found',
 'shelf',
 'stable',
 'opened',
 'replacement',
 'half',
 'half',
 'actually',
 'love',
 'flavor',
 'better',
 'half',
 'half',
 'too',
 "i've",
 'also',
 'tried',
 'vanilla',
 'nut',
 'pods...very',
 'nice',
 'pleasing',
 'vanilla',
 'without',
 'overbearing',
 'need']

In [33]:
filtered_neg_list2 = [w.strip(string.punctuation) for w in filtered_neg_list]
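
Two side effects of this cleanup are worth seeing on a small example: `str.strip` only trims the ends of a token, and stop words that carried punctuation (such as 'it.') were not caught by the earlier filter. The sketch below contrasts the notebook's approach with one possible alternative that splits on punctuation runs and filters stop words afterwards. The tiny stop list and the token list are illustrative assumptions.

```python
import re
import string

# Small illustrative stop list; the notebook uses NLTK's English list instead.
stop = {'it', 'too', 'the'}

tokens = ['half!.', 'too.', 'pods...very', 'it.', "i've", '!!']

# str.strip only trims the ends, so interior punctuation survives
# and stop words freed from their punctuation slip through.
stripped = [t.strip(string.punctuation) for t in tokens]
print(stripped)
# → ['half', 'too', 'pods...very', 'it', "i've", '']

# Alternative: split on runs of characters that are neither word characters
# nor apostrophes, then drop empty strings and stop words after cleaning.
cleaned = []
for t in tokens:
    for part in re.split(r"[^\w']+", t):
        part = part.strip("'")
        if part and part not in stop:
            cleaned.append(part)
print(cleaned)
# → ['half', 'pods', 'very', "i've"]
```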

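Use NLTK's `FreqDist` to preview the most common words in each list. The two lists overlap considerably: 'like', 'taste', 'flavor', 'good', and 'product' rank highly in both. The stop word 'it' also appears, most likely because tokens such as 'it.' passed the stop-word filter before their punctuation was stripped. One option for reducing the overlap is to restrict indicator words to a specific part of speech, such as adjectives.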

In [34]:
fd_pos = nltk.FreqDist(filtered_pos_list2)
fd_neg = nltk.FreqDist(filtered_neg_list2)

In [35]:
fd_pos.most_common(15)

[('good', 2960),
 ('great', 2787),
 ('like', 2191),
 ('love', 1931),
 ('taste', 1742),
 ('flavor', 1675),
 ('tea', 1464),
 ('coffee', 1411),
 ('one', 1286),
 ('use', 1229),
 ('product', 1216),
 ('it', 1113),
 ('price', 953),
 ('really', 952),
 ('best', 879)]

In [36]:
fd_neg.most_common(15)

[('like', 4318),
 ('taste', 3671),
 ('flavor', 2324),
 ('product', 2080),
 ('good', 2011),
 ('one', 1866),
 ('it', 1729),
 ('would', 1694),
 ('coffee', 1513),
 ('tea', 1401),
 ('much', 1278),
 ('buy', 1183),
 ('really', 1162),
 ('get', 1127),
 ('even', 1113)]
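
Raw counts hide which words actually separate the two classes. One possible way to surface them is to compare each word's share of one list against its share of the other. The sketch below uses `collections.Counter` and toy word lists; the helper `distinctive` is hypothetical, and the add-one adjustment is just one simple choice for avoiding division by zero.

```python
from collections import Counter


def distinctive(counts_a, counts_b, min_count=2, top=3):
    """Rank words by how much larger their share is in counts_a than in counts_b."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {}
    for word, n in counts_a.items():
        if n < min_count:
            continue  # skip rare words, whose ratios are unreliable
        share_a = n / total_a
        # Add one so words absent from counts_b do not divide by zero.
        share_b = (counts_b.get(word, 0) + 1) / (total_b + 1)
        scores[word] = share_a / share_b
    return sorted(scores, key=scores.get, reverse=True)[:top]


pos = Counter('great great love taste good good like'.split())
neg = Counter('like like taste taste awful good would'.split())

print(distinctive(pos, neg))
# → ['great', 'good']
```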

NLTK classifiers expect each training instance as a dictionary of features, so we'll convert the words in each set into feature dictionaries.

In [37]:
def word_features(words):
    """Return a feature dict mapping each whitespace-separated word to True."""
    return {word: True for word in words.split()}

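Label the two sets of word features and combine them into one list for training and testing. Because the filtered lists are flat lists of tokens, each token becomes its own single-word training instance.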

In [38]:
positive_features = [(word_features(f), 'pos') for f in filtered_pos_list2]

negative_features = [(word_features(f), 'neg') for f in filtered_neg_list2]

labeledwords = positive_features + negative_features

In [39]:
print(negative_features[10])

({'stirring': True}, 'neg')
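
As the printed example shows, each training instance here holds a single word. A common alternative is to build one feature dictionary per review, so the classifier sees words in combination. The sketch below illustrates that variant on toy token lists. It is a different design from the notebook's and would change the results; `review_features` and the sample data are hypothetical.

```python
def review_features(tokens):
    """Bag-of-words features for one review: each distinct token maps to True."""
    return {token: True for token in tokens if token}


# Toy data: one list of cleaned tokens per review.
pos_reviews_tokens = [['handy', 'snack'], ['love', 'yogurt', 'favorite']]
neg_reviews_tokens = [['fan', 'kind', 'taste'], ['melt', 'clumpy', 'waste']]

labeled_reviews = (
    [(review_features(t), 'pos') for t in pos_reviews_tokens]
    + [(review_features(t), 'neg') for t in neg_reviews_tokens]
)

print(labeled_reviews[0])
# → ({'handy': True, 'snack': True}, 'pos')
```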


In [40]:
type(negative_features)

list

In [41]:
len(positive_features)

196362

In [42]:
len(negative_features)

272944

Randomly shuffle the labeled words, then split them: the first 500 instances form the test set and everything from index 2000 onward forms the training set. Instances 500 through 1999 are used in neither set.

In [43]:
random.shuffle(labeledwords)

In [44]:
train_set, test_set = labeledwords[2000:], labeledwords[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
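
The slices above leave indices 500 through 1999 unused. A split that uses every instance exactly once can be written with a single cut point, as in the sketch below. The helper `split_train_test` is hypothetical, and a seeded generator is used so the shuffle is repeatable.

```python
import random


def split_train_test(items, test_size, seed=0):
    """Shuffle a copy of items and split it into disjoint train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


data = list(range(100))
train, test = split_train_test(data, test_size=20, seed=1)
print(len(train), len(test))
# → 80 20
```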

Provide examples to test the classifier.

In [45]:
print(classifier.classify(word_features('I hate this product, it tasted weird')))

neg


In [46]:
print(classifier.classify(word_features('I love this product, it tasted great')))

pos


Calculate the accuracy of the classifier on the held-out test set.

In [47]:
print(nltk.classify.accuracy(classifier, test_set))

0.634
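
To put this figure in context, it helps to compare it with a majority-class baseline. The labeled instances are not balanced (196,362 positive versus 272,944 negative), so a classifier that always answers 'neg' would already be right more than half the time. The sketch below works out that baseline, assuming the shuffled 500-item test set reflects the overall label proportions.

```python
# Instance counts reported earlier in the notebook.
n_pos = 196362
n_neg = 272944

# Accuracy of always predicting the more frequent label.
baseline = max(n_pos, n_neg) / (n_pos + n_neg)
print(round(baseline, 3))
# → 0.582
```

Under that assumption, the measured accuracy of 0.634 is only a modest improvement over always predicting the majority label.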


Check to see what the most informative features are for the sets.

In [48]:
classifier.show_most_informative_features(15)

Most Informative Features
              disgusting = True              neg : pos    =     47.9 : 1.0
                   awful = True              neg : pos    =     40.2 : 1.0
                returned = True              neg : pos    =     30.5 : 1.0
           disappointing = True              neg : pos    =     29.8 : 1.0
                   threw = True              neg : pos    =     28.6 : 1.0
                horrible = True              neg : pos    =     27.9 : 1.0
                dextrose = True              neg : pos    =     26.6 : 1.0
                     yum = True              pos : neg    =     26.4 : 1.0
            maltodextrin = True              neg : pos    =     23.5 : 1.0
               tasteless = True              neg : pos    =     21.5 : 1.0
                   trash = True              neg : pos    =     21.3 : 1.0
               returning = True              neg : pos    =     19.9 : 1.0
                     ugh = True              neg : pos    =     19.4 : 1.0
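
Each ratio in this table compares how likely a feature is under one label versus the other. The sketch below shows that calculation with made-up counts. The smoothing shown (adding 0.5 to each count) is an assumption about the estimator and may not match NLTK's internals exactly; `likelihood_ratio` is a hypothetical helper.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio of smoothed P(feature | a) to P(feature | b).

    Each feature is treated as having two outcomes (present or absent),
    so the smoothing constant is added twice to each denominator.
    """
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b


# Hypothetical counts: a word seen in 95 of 1000 'neg' instances
# and in 1 of 1000 'pos' instances.
ratio = likelihood_ratio(count_a=95, total_a=1000, count_b=1, total_b=1000)
print(round(ratio, 1))
# → 63.7
```

Without smoothing, a word that never appears under one label would produce a division by zero, which is why some adjustment of this kind is needed.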