# Final Presentation: Sentiment Prediction
Matthew Landry, Alexey Solganik, Michael Klisiwecz

# Preprocessing
Our project is sentiment prediction on sentiment-labelled reviews from Yelp, IMDb, and Amazon. Before we can train a model to predict sentiment, we need to preprocess the sentences in the example dataset so they can be represented as vectors. First, we lowercase each sentence and strip unnecessary symbols (punctuation, numbers, extra spaces), then tokenize it. Initially, we tried removing a full list of stop words as part of preprocessing, but this also removed words like "not", which change the sentiment of a sentence. We therefore filter only a short hand-picked list of words that do not affect sentiment (ex. a, the, is), defined in the code below. We used PorterStemmer from the gensim library to reduce each word to its stem, because we want as many instances as possible of a given word in our training data, so it is best to treat all forms of a word (ex. large, larger, largest) as the same word.
Note: stemming is commented out because it is only used for models whose word embeddings are trained on the sample data; it causes worse performance with the pre-trained embeddings.
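
To make the effect of a short filter list concrete, here is a minimal, dependency-free sketch of the cleaning steps described above. It uses Python's standard library as a stand-in for the gensim helpers, and the names `EXAMPLE_FILTER` and `preprocess` are hypothetical, chosen only for this illustration.

```python
import re
import string

# Hypothetical short filter list; note that "not" is deliberately absent
EXAMPLE_FILTER = {"a", "an", "the", "is", "are", "was", "were"}

def preprocess(sentence):
    """Lowercase, strip punctuation and digits, tokenize, and drop filtered words."""
    sentence = sentence.lower()
    # replace punctuation with spaces (stand-in for gensim's strip_punctuation)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    sentence = sentence.translate(table)
    # remove digits (stand-in for gensim's strip_numeric)
    sentence = re.sub(r"[0-9]+", "", sentence)
    # split() also collapses repeated whitespace
    return [tok for tok in sentence.split() if tok not in EXAMPLE_FILTER]

tokens = preprocess("The movie was NOT good, 2 stars!")
print(tokens)
```

The negation survives the filter, so "not good" stays distinguishable from "good" in the token list.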

## Review Dataset (IMDb, Yelp, Amazon)

In [None]:
import gensim
import pandas as pd
from gensim.parsing.preprocessing import strip_numeric, strip_punctuation, strip_multiple_whitespaces
from gensim.parsing.porter import PorterStemmer
pstem = PorterStemmer()

# Short hand-picked list of words to drop; it deliberately leaves out negations such as "not"
filter_words = ['a', 'an', 'the', 'is', 'are', 'were', 'was', 'will', 'be', 'in', 'to', 'on', 'at', 'and', 'or']

def load_labelled(path):
    """Read one review per line (text followed by a 0/1 label); return token lists and labels."""
    sentences = []
    labels = []
    with open(path) as dat:
        for line in dat:
            # the label is the last character before the trailing newline
            labels.append(int(line[-2:-1]))
            line = line.lower()
            line = strip_punctuation(line)
            line = strip_multiple_whitespaces(line)
            line = strip_numeric(line)

            tokenized_line = [token for token in line.split() if token not in filter_words]
            #stem_line = lambda x: [pstem.stem(token) for token in x]
            #tokenized_line = stem_line(tokenized_line)

            sentences.append(tokenized_line)
    return sentences, labels

imdb_data, imdb_results = load_labelled("imdb_labelled.txt")
amazon_data, amazon_results = load_labelled("amazon_cells_labelled.txt")
yelp_data, yelp_results = load_labelled("yelp_labelled.txt")
        


imdb_df = pd.DataFrame({"Phrases" : imdb_data, "Labels" : imdb_results})
amazon_df = pd.DataFrame({"Phrases" : amazon_data, "Labels" : amazon_results})
yelp_df = pd.DataFrame({"Phrases" : yelp_data, "Labels" : yelp_results})

print(imdb_df[:10])
print(amazon_df[:10])
print(yelp_df[:10])
imdb_df["Labels"].value_counts()
amazon_df["Labels"].value_counts()
yelp_df["Labels"].value_counts()




                                             Phrases  Labels
0  [very, very, very, slow, moving, aimless, movi...       0
1  [not, sure, who, more, lost, flat, characters,...       0
2  [attempting, artiness, with, black, white, cle...       0
3         [very, little, music, anything, speak, of]       0
4  [best, scene, movie, when, gerardo, trying, fi...       1
5  [rest, of, movie, lacks, art, charm, meaning, ...       0
6                               [wasted, two, hours]       0
7  [saw, movie, today, thought, it, good, effort,...       1
8                                 [bit, predictable]       0
9  [loved, casting, of, jimmy, buffet, as, scienc...       1
                                             Phrases  Labels
0  [so, there, no, way, for, me, plug, it, here, ...       0
1                     [good, case, excellent, value]       1
2                              [great, for, jawbone]       1
3  [tied, charger, for, conversations, lasting, m...       0
4                       

1    500
0    500
Name: Labels, dtype: int64

The above output shows examples of tokenized sentences from our example dataset of user reviews from IMDb, Amazon, and Yelp. Each sentence has been tokenized and stripped of unnecessary symbols.

## IMDB Big Dataset
Below is the preprocessing for a larger dataset of labeled IMDb reviews (25000 samples). This section does not include stemming or filtering: using this dataset was not one of the primary goals of our project, and we did not have a chance to do as much with it as we had planned.

In [None]:
data = pd.read_csv("IMDB_data.txt", sep='\t')

In [None]:
tokenized_lines = []
for line in data["review"]:
  line = line.lower()
  line = strip_punctuation(line)
  line = strip_multiple_whitespaces(line)
  line = strip_numeric(line)

  tokenized_lines.append(line.split())

tokenized_data = pd.DataFrame({"id":data["id"], "sentiment":data["sentiment"], "review":tokenized_lines})

In [None]:
print(data.shape)
print(tokenized_data.shape)

(25000, 3)
(25000, 3)


# Word2Vec
To convert text to a vector representation so that we can train a binary classifier for sentiment prediction, we ran the gensim library's word2vec model on the tokenized text, representing each word as a vector with 100 features. Because we only have 1000 samples from each website, we combined the data from all three websites to maximize the amount of text the word embeddings are trained on. We used the skip-gram variant of word2vec, which predicts the probability of the context words given a word.

We also used Google's pretrained word2vec model, which has a vocabulary of around 3 million words and 300 features for each, so it is far more comprehensive than our own trained word2vec model. We wanted to see which would provide better word embeddings for our classification models. While Google's word2vec is far larger, our own word2vec is trained on the same contexts in which those words appear in our training data, so it was unclear which would perform better.
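
As a rough illustration of what "context given a word" means, the sketch below lists the (center, context) training pairs that a skip-gram model would see for one tokenized sentence. The function name `skipgram_pairs` and the fixed window are assumptions made for this example; gensim's actual implementation differs in its details.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["food", "was", "not", "good"], window=1)
print(pairs)
```

Each pair is one training example: the model is asked to assign high probability to the context word given the center word.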

In [None]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
EMBEDDING_FILE = '/root/input/GoogleNews-vectors-negative300.bin.gz'



--2021-12-15 02:53:27--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.248.54
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.248.54|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [None]:
from gensim.models import word2vec
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

In [None]:
all_tokenized_sentences = pd.concat([imdb_df, yelp_df, amazon_df])
all_tokenized_sentences = all_tokenized_sentences.reset_index(drop=True)
all_tokenized_sentences = all_tokenized_sentences.reset_index()
all_tokenized_sentences = all_tokenized_sentences.rename(columns = {"index": "UID"})

#sg=1 selects the skip-gram variant; hs=0 with negative=10 selects negative sampling with 10 noise words
model_w2v = gensim.models.Word2Vec(all_tokenized_sentences["Phrases"],size=100, window=5, min_count=1, sg = 1, hs = 0, negative = 10)
#the constructor above already trains on the corpus; we add a high number of epochs (iterations over the corpus) because of the small number of words we are training on
model_w2v.train(all_tokenized_sentences["Phrases"], total_examples= len(all_tokenized_sentences["Phrases"]), epochs=20)


(470488, 573820)

With word2vec, the cosine similarity between the embedding vectors of two words approximates how close the words are in meaning. `most_similar` returns the 10 words in the model's vocabulary that are most similar to a given word, and we used it to check how well our word2vec model worked. While the results here are not entirely random and suggest that our model does capture some semantic meaning, they are far worse than those of Google's pretrained word2vec model (below).
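
For reference, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below does this with the standard library; the three-dimensional vectors are made up for illustration and are not taken from either model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only
toy = {
    "dinner": [1.0, 2.0, 0.0],
    "supper": [2.0, 4.0, 0.0],
    "update": [0.0, 0.0, 3.0],
}
sim_close = cosine_similarity(toy["dinner"], toy["supper"])
sim_far = cosine_similarity(toy["dinner"], toy["update"])
print(sim_close, sim_far)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.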

In [None]:
print(model_w2v.wv.most_similar("dinner", topn=10))

[('lange', 0.9288738369941711), ('become', 0.9264037609100342), ('bachi', 0.9216042757034302), ('hostess', 0.918390154838562), ('marrow', 0.9169269800186157), ('update', 0.9166852235794067), ('putting', 0.914053201675415), ('bed', 0.9140514135360718), ('regular', 0.9127137660980225), ('ignored', 0.9120527505874634)]


In [None]:
print(word_vectors.most_similar("dinner", topn=10))

[('dinners', 0.7902063131332397), ('brunch', 0.790051281452179), ('Dinner', 0.7639397382736206), ('supper', 0.7596100568771362), ('luncheon', 0.7099571228027344), ('banquet', 0.7032414674758911), ('breakfast', 0.7007028460502625), ('buffet_dinner', 0.6914125680923462), ('meal', 0.6843624114990234), ('lunch', 0.6815704703330994)]


For our model, the data to be classified is a sentence, not individual words. So now that we have trained word2vec and can embed each word as a vector, we take the average of the vectors for all tokens in a sentence to get a vector representation of the entire sentence. 
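
The averaging step can be sketched without any libraries as follows. The embedding table `toy_embeddings` and the function name `sentence_vector` are hypothetical; the point is only that the word vectors are summed first and divided by the number of known words once, at the end.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens found in `embeddings`; unknown tokens are skipped."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue
        total = [t + x for t, x in zip(total, vec)]
        count += 1
    if count == 0:
        # no known words: return the zero vector
        return total
    return [t / count for t in total]

# Made-up 2-dimensional embeddings, for illustration only
toy_embeddings = {"not": [1.0, 0.0], "good": [0.0, 2.0], "food": [2.0, 4.0]}
sent_vec = sentence_vector(["food", "not", "good", "zzz"], toy_embeddings, dim=2)
print(sent_vec)
```

Because the result is a plain mean, word order is lost: "not good" and "good not" map to the same sentence vector.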

In [None]:
import numpy as np

count = 0
sentence_embeddings = np.zeros((3000, 100))
num = 0
for phrase in all_tokenized_sentences["Phrases"]:
  vec = np.zeros(100).reshape((1, 100))
  for word in phrase:
    try:
      vec += model_w2v.wv[word].reshape((1, 100))
      count += 1
    except KeyError:
      #skip words that are not in the vocabulary
      continue
  #divide once, after summing, so that vec is the mean of the word vectors
  if count != 0:
    vec /= count
  sentence_embeddings[num]= vec
  num += 1
  count = 0

embeddings_df = pd.DataFrame(sentence_embeddings)

#we previously tried normalizing vectors
"""
#print(embeddings_df.mean().mean())
#print(embeddings_df.std().std())

embeddings_df1 = embeddings_df.sub(embeddings_df.mean(axis=1),axis = 'rows')
embeddings_df2 = embeddings_df1.div(embeddings_df.std(axis=1),axis = 'rows')
embeddings_df = embeddings_df2

embeddings_df["Labels"] = all_tokenized_sentences["Labels"]
embeddings_df = embeddings_df.dropna(how='any',axis=0) 
"""
print(embeddings_df)






            0         1         2   ...        97        98        99
0     0.046459  0.023194  0.000837  ... -0.027057  0.028209  0.000286
1     0.015237  0.023584  0.009458  ... -0.017994 -0.013727 -0.017573
2     0.017664  0.007595  0.006733  ... -0.009008  0.004559  0.004932
3     0.083060  0.009398 -0.001145  ... -0.051467  0.027076 -0.053451
4     0.037847  0.009748  0.012589  ... -0.012116 -0.014882  0.003329
...        ...       ...       ...  ...       ...       ...       ...
2995  0.032723  0.014882  0.015116  ... -0.014789 -0.010481 -0.000783
2996  0.032856  0.053026  0.049837  ...  0.011872 -0.043454  0.019117
2997  0.112321  0.061408  0.055865  ... -0.075427 -0.013113  0.031067
2998  0.047290  0.017800  0.017361  ... -0.021853  0.001053  0.015029
2999  0.042475  0.025446  0.032964  ... -0.019595 -0.051928  0.029802

[3000 rows x 100 columns]


This is the same process of converting word2vec word embeddings into a single vector for the entire sentence, this time with Google's word2vec model.

In [None]:
pt_sentence_embeddings = np.zeros((3000, 300))
num = 0
count = 0
for phrase in all_tokenized_sentences["Phrases"]:
  vec = np.zeros(300).reshape((1, 300))
  for word in phrase:
    try:
      vec += word_vectors[word].reshape((1, 300))
      count += 1
    except KeyError:
      #skip words that are not in the pretrained vocabulary
      continue
  #divide once, after summing, so that vec is the mean of the word vectors
  if count != 0:
    vec /= count
  pt_sentence_embeddings[num]= vec
  num += 1
  count = 0

pt_embeddings_df = pd.DataFrame(pt_sentence_embeddings)


Again, same process but for the big dataset.

In [None]:
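#bigger dataset

big_embeddings = np.zeros((25000, 300))
num = 0
count = 0
#iterate over the token lists, not the raw review strings (iterating over a string yields single characters)
for phrase in tokenized_data["review"]:
  vec = np.zeros(300).reshape((1, 300))
  for word in phrase:
    try:
      vec += word_vectors[word].reshape((1, 300))
      count += 1
    except KeyError:
      #skip words that are not in the pretrained vocabulary
      continue
  #divide once, after summing, so that vec is the mean of the word vectors
  if count != 0:
    vec /= count
  big_embeddings[num]= vec
  num += 1
  count = 0

big_embeddings_df = pd.DataFrame(big_embeddings)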


# SVM Model
Our first approach was to use a Support Vector Machine (SVM) for binary classification. The two classes are 1 for positive sentiment and 0 for negative sentiment. We trained the model on the word2vec vector representation of each sentence. After a hyperparameter sweep with grid search, we settled on a linear kernel and a regularization parameter of C = 1. Below are this model's F1 score, accuracy score, and confusion matrix. The scores shown below are slightly lower than those we obtained with stemming: stemming merges the different forms of a word, so each stem has more training instances and the word2vec embeddings trained on the sample review data are better. With stemming, the accuracy of this model is about 4% higher.

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix 

#embeddings_df_no_labels  = embeddings_df.loc[:,embeddings_df.columns != 'Labels']
xtrain, xtest, ytrain, ytest = train_test_split(embeddings_df, all_tokenized_sentences['Labels'], random_state=42, test_size=0.25)

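#gamma is ignored by the linear kernel
svc = svm.SVC(kernel='linear', C=1, gamma = 1, probability=True).fit(xtrain, ytrain) 
prediction = svc.predict_proba(xtest) 
#label a sentence positive when the predicted probability of class 1 is at least 0.5
prediction_int = prediction[:,1] >= 0.5
prediction_int = prediction_int.astype(int) 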

print(f1_score(ytest, prediction_int))
print(accuracy_score(ytest, prediction_int))
print(confusion_matrix(ytest, prediction_int))
TN, FP, FN, TP = confusion_matrix(ytest, prediction_int).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))



0.6639676113360324
0.668
[[255 114]
 [135 246]]
TP = 246
FP = 114
FN = 135
TN = 255
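
The two scores printed above follow directly from the four confusion-matrix counts. As a check, the sketch below recomputes them by hand; the helper name `accuracy_and_f1` is hypothetical, and the counts are the ones reported in the output above.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 (for the positive class) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Counts taken from the SVM output above
acc, f1 = accuracy_and_f1(tp=246, fp=114, fn=135, tn=255)
print(acc, f1)
```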


SVM model with Google's word2vec. Interestingly, when we attempted to normalize the vectors, we got better scores with this model (5% higher accuracy). However, normalizing the vectors lowered the accuracy of the SVM model that uses our own word embeddings. It's possible that this is because Google's word embeddings contain very small numbers, so scaling them up was better for our small sample.

In [None]:

xtrain_pt, xtest_pt, ytrain_pt, ytest_pt = train_test_split(pt_embeddings_df, all_tokenized_sentences['Labels'], random_state=42, test_size=0.25)

svc_pt = svm.SVC(kernel='linear', C=1, probability=True).fit(xtrain_pt, ytrain_pt) 


In [None]:
prediction_pt = svc_pt.predict_proba(xtest_pt) 
prediction_int_pt = prediction_pt[:,1] >= 0.5
prediction_int_pt = prediction_int_pt.astype(int) 
print(f1_score(ytest_pt, prediction_int_pt))
print(accuracy_score(ytest_pt, prediction_int_pt))
print(confusion_matrix(ytest_pt, prediction_int_pt))
TN, FP, FN, TP = confusion_matrix(ytest_pt, prediction_int_pt).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

0.6344086021505376
0.6358768406961178
[[239 133]
 [139 236]]
TP = 236
FP = 133
FN = 139
TN = 239


SVM model for the big dataset (unfinished). Accuracy hovered around 50%, so we must have either a lingering bug or a mistaken assumption in this model.

In [None]:
#Big dataset svm
#big_embeddings_df holds only the 300 embedding columns; the labels are in data["sentiment"]
xtrain_big, xtest_big, ytrain_big, ytest_big = train_test_split(big_embeddings_df, data["sentiment"], random_state=42, test_size=0.25)
print(xtrain_big[:10])
print(ytrain_big[:10])

svc_big = svm.SVC(kernel='linear', C=1, probability=True).fit(xtrain_big, ytrain_big) 
prediction_big = svc_big.predict_proba(xtest_big) 
prediction_int_big = prediction_big[:,1] >= 0.5
prediction_int_big = prediction_int_big.astype(int) 
print(f1_score(ytest_big, prediction_int_big))
print(accuracy_score(ytest_big, prediction_int_big))
print(confusion_matrix(ytest_big, prediction_int_big))
TN, FP, FN, TP = confusion_matrix(ytest_big, prediction_int_big).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

## Hyperparameter Search
Grid search for the best SVM parameters, looking at the regularization parameter, the kernel type, and the kernel coefficient (for rbf).
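
A grid search simply tries every combination of the listed values. The sketch below enumerates the combinations for a grid of this shape using only the standard library, to show how quickly the number of candidates grows; it is an illustration of the idea, not scikit-learn's implementation.

```python
from itertools import product

param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "linear"],
}

# Cartesian product of all value lists: one dict per candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]
print(len(candidates))
print(candidates[0])
```

Each candidate is then fit once per cross-validation fold, which is why the search takes much longer than training a single model.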

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf','linear']}

grid = GridSearchCV(svm.SVC(probability=True), param_grid, refit = True, verbose = 3)
 
# fitting the model for grid search
grid.fit(xtrain, ytrain)

print(grid.best_params_)
 
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)

In [None]:
grid_predictions = grid.predict_proba(xtest)
#here a sentence is labelled positive when the predicted probability of class 1 is at least 0.3
grid_prediction_int = grid_predictions[:,1] >= 0.3
grid_prediction_int = grid_prediction_int.astype(int) 
print(f1_score(ytest, grid_prediction_int))
print(accuracy_score(ytest, grid_prediction_int))

#fit an SVM with an rbf kernel, C=1000 and gamma=1, and score it with the same 0.3 threshold
svcbest = svm.SVC(kernel='rbf', C=1000, gamma= 1, probability=True).fit(xtrain, ytrain) 
prediction2 = svcbest.predict_proba(xtest) 
prediction_int2 = prediction2[:,1] >= 0.3
prediction_int2 = prediction_int2.astype(int) 
print(f1_score(ytest, prediction_int2))
print(accuracy_score(ytest, prediction_int2))

# MLP Model
A Multilayer Perceptron (MLP) is a feed-forward neural network. We chose this model for its simplicity. The solver we used for weight optimization is limited-memory BFGS (L-BFGS, `lbfgs` in scikit-learn), an optimizer in the family of quasi-Newton methods. While the BFGS method maintains a full approximation of the inverse of the Hessian matrix, L-BFGS keeps only a history of the past several states and gradients, so it uses less memory and can optimize over many variables. We tested different hidden layer options and arrived at `hidden_layer_sizes=(5, 2)` for the Google embeddings and `hidden_layer_sizes=(15, 3)` for our embeddings. In scikit-learn, each entry is the number of units in one hidden layer, so these are two hidden layers of 5 and 2 units and two hidden layers of 15 and 3 units, respectively. As the method did not converge with the default maximum number of iterations, we increased that number to 10,000.

Among our methods, this one attained the highest accuracy and the best metrics overall. 

For the MLP classifier with embeddings from our own word2vec, directly below, the accuracy is again around 3-4% lower than the best accuracy we achieved when stemming was also used.
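
To give a sense of how small these networks are, the sketch below counts the weights and biases implied by a `hidden_layer_sizes` tuple. The helper `count_mlp_parameters` is hypothetical, and it assumes a single output unit for binary classification.

```python
def count_mlp_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights plus biases of a fully connected network (assumes one output unit)."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # each layer has fan_in * fan_out weights and fan_out biases
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

ours = count_mlp_parameters(100, (15, 3))        # 100-feature embeddings
pretrained = count_mlp_parameters(300, (5, 2))   # 300-feature embeddings
print(ours, pretrained)
```

Almost all of the parameters sit in the first layer, so the input dimension matters more than the depth for networks of this size.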

In [None]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                 hidden_layer_sizes=(15, 3), random_state=1,max_iter=10000)
clf.fit(xtrain, ytrain)
pred = clf.predict(xtest)
print(f1_score(ytest, pred))
print(accuracy_score(ytest,pred))
TN, FP, FN, TP = confusion_matrix(ytest, pred).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

0.6790450928381963
0.6773333333333333
TP = 256
FP = 117
FN = 125
TN = 252


MLP with Google's word2vec embeddings

In [None]:
clf_pt = MLPClassifier(solver='lbfgs', alpha=1e-5,
                 hidden_layer_sizes=(5, 2), random_state=1,max_iter=10000)
clf_pt.fit(xtrain_pt, ytrain_pt)
pred_ptc = clf_pt.predict(xtest_pt)
print(f1_score(ytest_pt, pred_ptc))
print(accuracy_score(ytest_pt,pred_ptc))
TN, FP, FN, TP = confusion_matrix(ytest_pt, pred_ptc).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

0.7032967032967034
0.6746987951807228
TP = 288
FP = 156
FN = 87
TN = 216


# COMPARING VADER RESULTS TO BINARY LABELS
VADER is a rule-based sentiment algorithm. It can be helpful for establishing a baseline for sentiment. Its most common use, and where it has been found to have significant success, is with tweets. VADER generates a negative, neutral, and positive score for each sentence and then computes a compound score. We can make this score binary by choosing a threshold that best divides negative and positive data. At first I separated out the sentences to which VADER gave no negative or positive weight. On this data, VADER had a 65-70% base accuracy, which could be improved to around 80% by recognizing that neutral answers are more likely to be negative. For the code exploring this, see the attached FinalProjectVADER Jupyter notebook.
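
The thresholding idea can be sketched as follows. The compound scores here are made up, and the threshold of 0.0 and the name `compound_to_label` are assumptions for illustration; the threshold actually used is in the separate notebook.

```python
def compound_to_label(compound, threshold=0.0):
    """Map a compound score in [-1, 1] to 1 (positive) or 0 (negative)."""
    # scores at or below the threshold, including neutral ones, count as negative
    return 1 if compound > threshold else 0

# Made-up compound scores; 0.0 stands for a sentence scored as neutral
scores = [0.64, 0.0, -0.42, 0.05]
labels = [compound_to_label(s) for s in scores]
print(labels)
```

With a strict inequality, a neutral sentence falls on the negative side, which matches the observation that neutral answers are more likely to be negative.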


# Conclusion
We found that the MLP classifier performed better than the SVM classifier for sentiment prediction. As for word embeddings, models using Google's word2vec performed about the same as or slightly better than models using our word2vec when our samples were not preprocessed with stemming and stopword filtering. With these preprocessing techniques, the models using our word2vec outperformed the ones using Google's word2vec.

We would like to see what changes when a larger dataset is used to train the word2vec model and the classifier, as we had concerns throughout about the small size of our training data. We would also like to try additional preprocessing steps on our sentences: there are many more words that do not affect sentiment and could possibly be filtered out, and even the limited filtering that we did improved model performance.