Zachary Proom

EN.605.646.81: Natural Language Processing

# Lab #3

## a

First, I split the training data in train.tsv into two groups by class: 2-star (negative) reviews and 4-star (positive) reviews.

In [47]:
import pandas as pd

In [48]:
# Load in train.tsv.
training_data = pd.read_csv('train.tsv', sep = '\t', header = None) 

# Add column names.
training_data.columns = ["stars", "docid", "text"]

# Split into two groups based on class.
training_data_negative = training_data.loc[training_data["stars"] == 2]
training_data_positive = training_data.loc[training_data["stars"] == 4]

I chose ten words that indicate positive or negative sentiment. The table below shows, for each word, the percentage of reviews in each class that contain it. The first five words indicate positive sentiment, and the last five indicate negative sentiment. Note that str.contains() matches substrings, so a word such as "bad" is also counted when it appears inside a longer word such as "badly".

In [49]:
words = ["amazing", "awesome", "great", "incredible", "fantastic", "terrible", "horrible", "worst", "awful", "bad"]

rows = []
for word in words:
    # Percent of reviews in each class whose text contains the word (substring match).
    negative_freq = training_data_negative['text'].str.contains(word).mean() * 100
    positive_freq = training_data_positive['text'].str.contains(word).mean() * 100
    rows.append({"word": word, "positive": positive_freq, "negative": negative_freq})

# Build the table once instead of concatenating onto an empty DataFrame in the loop.
relative_freqs = pd.DataFrame(rows, columns = ['word', 'positive', 'negative'])
relative_freqs

         word  positive  negative
0     amazing       5.2       1.8
1     awesome       5.7       2.2
2       great      35.9      17.3
3  incredible       0.4       0.1
4   fantastic       3.7       1.0
5    terrible       0.3       3.2
6    horrible       0.6       3.1
7       worst       0.7       4.7
8       awful       0.6       2.5
9         bad       7.1      16.7
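
Because str.contains() matches substrings, a short word such as "bad" can be counted inside longer words. The sketch below shows one possible way to restrict the match to whole words using a regular-expression word boundary. The toy reviews and the helper name pct_containing are hypothetical and not part of the lab data, and the percentages in the table above were not computed this way.

```python
import re

import pandas as pd

# Hypothetical toy reviews, used only to illustrate the difference.
toy_reviews = pd.Series([
    "this place is bad",
    "i lost my badge",
    "not bad at all",
    "great food",
])


def pct_containing(texts, word, whole_word=True):
    """Percent of texts containing `word`, optionally as a whole word only."""
    escaped = re.escape(word)
    pattern = rf"\b{escaped}\b" if whole_word else escaped
    return texts.str.contains(pattern, regex=True).mean() * 100


substring_pct = pct_containing(toy_reviews, "bad", whole_word=False)
whole_word_pct = pct_containing(toy_reviews, "bad", whole_word=True)
print(substring_pct, whole_word_pct)
```

On real reviews the gap between the two numbers depends on how often the word appears inside longer words, so it is worth checking for short words in particular.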


After reading a few of the reviews, I noticed that many of them mix positive and negative language. Take this review as an example:

2	YT9tezwopYagEjTxIzN2dg	i am just not a fan of this kind of pizza. i hate the sweet sauce, i hate that the ingredients are under the cheese so they don't get crunchy and crispy, the pepperoni is floppy. just not for me. the crust was kinda gooey like.   my delivery was 30 minutes late but its ok bc i wasn't in a hurry. they were really nice i just can't stand this kind of pizza.    if you like the sweet sauce and toppings under the cheese then go for it bc you'll probably love it!

The review belongs to the negative class, but it contains several positive words such as "love", "nice", and "like". The same is true of many of the reviews I read: they are balanced rather than polarized, which makes classifying sentiment more challenging.

## b

Below I train a Multinomial Naive Bayes model (a "bag of words" model) on the training set using sklearn (scikit-learn), a popular open-source machine learning package. I build the model with sklearn's make_pipeline() function, which chains its arguments so that each step's output feeds the next. Here, the pipeline first applies CountVectorizer() to convert the training reviews into a matrix of token counts, and then fits MultinomialNB() on that matrix. The only inputs I provide are the reviews and ratings in the training data. I don't change any of MultinomialNB()'s defaults; in particular, alpha=1.0, which corresponds to Laplace (add-one) smoothing.
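
To make the role of alpha concrete, here is a small from-scratch sketch of multinomial Naive Bayes with add-alpha smoothing, where the likelihood of word w in class c is (count(w, c) + alpha) / (total words in c + alpha * |V|). It is an illustration on made-up toy documents, not sklearn's implementation; the function names train_nb and predict_nb and the whitespace tokenization are assumptions for the example.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and add-alpha smoothed log likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_doc_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_doc_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_lik = {}
    for c in class_doc_counts:
        # Denominator: all word tokens in class c plus alpha for every vocabulary word.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_lik


def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: 4 = positive, 2 = negative.
toy_docs = ["great food great service", "awful food", "great place", "awful awful service"]
toy_labels = [4, 2, 4, 2]

log_prior, log_lik = train_nb(toy_docs, toy_labels)
toy_prediction = predict_nb("great service", log_prior, log_lik)
print(toy_prediction)
```

With alpha=1.0, a word that never occurs in a class (such as "awful" in the positive toy documents) still gets a small nonzero probability, so a single unseen word cannot drive a class's score to zero.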

In [50]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn import metrics

labels = [4, 2] # 4 for positive, 2 for negative
X_train = training_data["text"]
y_train = training_data["stars"]
multinomialnb_model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model.
multinomialnb_model.fit(X_train, y_train)

multinomialnb_model[1].n_features_in_

11468

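The total number of features is shown above: 11,468. This is the size of the vocabulary that CountVectorizer() builds from the training data: every unique token after lowercasing, where by default a token is a run of two or more word characters (letters, digits, or underscores).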

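Next, I print a feature representation for the first document in the dev set. I fit a separate CountVectorizer() on that document alone, so the table lists only the tokens that occur in it (sorted by frequency) rather than all 11,468 features of the trained model.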

In [51]:
# Load in dev data.
dev_data = pd.read_csv('dev.tsv', sep = '\t', header = None) 

# Add column names.
dev_data.columns = ["stars", "docid", "text"]

# Create CountVectorizer instance.
vectorizer = CountVectorizer()

# Fit and transform the first document in the dev set.
X = vectorizer.fit_transform([dev_data["text"][0]])

feature_rep = pd.DataFrame({"feature": vectorizer.get_feature_names_out(), "frequency": X.toarray()[0].tolist()})
feature_rep = feature_rep.sort_values(by = ['frequency'], ascending = False)
feature_rep = feature_rep.reset_index(drop=True)

# Print all rows.
print(feature_rep.to_string())

       feature  frequency
0          the          6
1          and          6
2         menu          3
3          not          3
4        their          2
5         good          2
6           to          2
7           it          2
8         have          2
9         were          2
10       china          2
11          on          2
12    original          2
13          is          2
14      really          1
15      recent          1
16  restaurant          1
17      return          1
18         she          1
19       price          1
20     portion          1
21        site          1
22        size          1
23          so          1
24       again          1
25       steak          1
26       tasty          1
27       phone          1
28        this          1
29        told          1
30      trying          1
31         tso          1
32       under          1
33      wanted          1
34         web          1
35        what          1
36        when          1
37       yea

Next, I print the docid and prediction (separated by a tab) for the first 10 documents in the dev file.

In [62]:
X_test = dev_data["text"][0:10]

predictions = multinomialnb_model.predict(X_test)

for i in range(0, 10):
    print(dev_data["docid"][i] + "\t" + str(predictions[i]))

ZSJnW6faaNFQoqq4ALqYg	4
Rcbv11hm5AYEwZyqYwAvg	2
rkRTjhu5szaBggeFVcVJlA	4
dhmeDsQGUS1FXMLs49SWjQ	4
z9zfIMYmRRCE4ggfOIieEw	4
Xtb3pGSh39bqcozkBECw	2
DOUflAGzxLsXG6xOmR1w	2
0RxCEWURe08CTcZt95F4AQ	2
MzUg5twEcCyd0X6lBMP2Lg	2
uNlw2D5CYKk0wjNxLtYw	4


Finally, I make predictions for the full dev and test partitions and write them to two files: dev_predictions_multinomialnb.tsv and test_predictions_multinomialnb.tsv.

In [70]:
# Make predictions for the full dev data.
X_test = dev_data["text"]
predictions = multinomialnb_model.predict(X_test)

# Write predictions to a file.
predictions_df = pd.DataFrame({"docid": dev_data["docid"], "prediction": predictions})
predictions_df.to_csv("dev_predictions_multinomialnb.tsv", sep = "\t")

# Make predictions for the test data.
# Load in test data and add column names.
test_data = pd.read_csv('test.tsv', sep = '\t', header = None) 
test_data.columns = ["stars", "docid", "text"]
X_test = test_data["text"]
predictions = multinomialnb_model.predict(X_test)

# Write predictions to a file.
predictions_df = pd.DataFrame({"docid": test_data["docid"], "prediction": predictions})
predictions_df.to_csv("test_predictions_multinomialnb.tsv", sep = "\t")
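
With its default settings, to_csv() also writes a header row and the DataFrame's row index as an extra first column. If the expected file format is just docid and prediction separated by a tab (an assumption; the lab's required format is not stated here), a sketch like the following would produce it. The toy DataFrame is hypothetical, and calling to_csv() without a path returns the text instead of writing a file.

```python
import pandas as pd

# Hypothetical toy predictions.
toy_predictions = pd.DataFrame({"docid": ["doc_a", "doc_b"], "prediction": [4, 2]})

# Default behavior: header row plus an unnamed index column.
with_index = toy_predictions.to_csv(sep="\t")

# Only the docid and prediction columns, with no header row.
without_index = toy_predictions.to_csv(sep="\t", index=False, header=False)

print(with_index)
print(without_index)
```

Passing a file name as the first argument, as in the cell above, writes the same text to disk.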

## c