# Final Presentation: Sentiment Prediction
Matthew Landry, Alexey Solganik, Michael Klisiwecz

## Preprocessing
Our project is sentiment prediction on sentiment-labelled reviews from Yelp, IMDb, and Amazon. Before we can train a model to predict sentiment, we need to preprocess the sentences in the example dataset so that they can be represented as vectors. First, we lowercase each sentence, strip unnecessary symbols (punctuation, numbers, extra spaces), and tokenize it. Initially, we tried to remove stop words as part of preprocessing, but it turned out that this would remove words like "not", which change the sentiment of a sentence. We therefore dropped the general stop-word step and filter only a short hand-picked list of function words (ex. a, the, is), even though this leaves in other words that do not affect sentence sentiment. We also set up PorterStemmer from the gensim library to reduce words to their stems, because we want as many instances as possible of a given word in our training data, so it is best to treat all forms of a word (ex. large, larger, largest) as the same word; the stemming lines are commented out in the cell below.
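
The effect of stop-word removal on negations can be seen in a small standalone sketch. It uses a regular expression as a stand-in for the gensim helpers, and the names `CUSTOM_FILTER` and `preprocess` are illustrative assumptions rather than part of the project code.

```python
import re

# Hypothetical short filter list; negations such as "not" are deliberately absent.
CUSTOM_FILTER = {"a", "an", "the", "is", "are", "was", "were"}


def preprocess(sentence, stop_words=CUSTOM_FILTER):
    """Lowercase, replace every run of non-letters with a space, split, and filter."""
    cleaned = re.sub(r"[^a-z]+", " ", sentence.lower())
    return [tok for tok in cleaned.split() if tok not in stop_words]


example = "This is NOT a good movie!!! 10/10"
kept_negation = preprocess(example)
lost_negation = preprocess(example, stop_words=CUSTOM_FILTER | {"not"})

print(kept_negation)  # ['this', 'not', 'good', 'movie']
print(lost_negation)  # ['this', 'good', 'movie']
```

With "not" in the filter, a negative review becomes indistinguishable from a positive one, which is the reason for keeping the filter list short.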

In [None]:
import gensim
import pandas as pd
from gensim.parsing.preprocessing import strip_numeric, strip_punctuation, strip_multiple_whitespaces
from gensim.parsing.porter import PorterStemmer

pstem = PorterStemmer()

# Hand-picked function words to drop. Negations such as "not" are deliberately kept,
# which is why gensim's remove_stopwords() is not used here.
STOP_FILTER = {'a', 'an', 'the', 'is', 'are', 'were', 'was', 'will', 'be',
               'in', 'to', 'on', 'at', 'and', 'or'}


def load_labelled(path):
    """Read one labelled review file and return (token lists, integer labels)."""
    phrases, labels = [], []
    with open(path) as dat:
        for line in dat:
            # The label is the single character just before the trailing newline.
            labels.append(int(line[-2:-1]))
            line = line.lower()
            line = strip_punctuation(line)
            line = strip_multiple_whitespaces(line)
            line = strip_numeric(line)  # also removes the label digit from the text

            tokens = [token for token in line.split() if token not in STOP_FILTER]
            # Stemming is currently disabled:
            # tokens = [pstem.stem(token) for token in tokens]
            phrases.append(tokens)
    return phrases, labels


imdb_data, imdb_results = load_labelled("imdb_labelled.txt")
amazon_data, amazon_results = load_labelled("amazon_cells_labelled.txt")
yelp_data, yelp_results = load_labelled("yelp_labelled.txt")

imdb_df = pd.DataFrame({"Phrases" : imdb_data, "Labels" : imdb_results})
amazon_df = pd.DataFrame({"Phrases" : amazon_data, "Labels" : amazon_results})
yelp_df = pd.DataFrame({"Phrases" : yelp_data, "Labels" : yelp_results})

print(imdb_df[:10])
print(amazon_df[:10])
print(yelp_df[:10])
# Only the last expression in a notebook cell is displayed automatically.
imdb_df["Labels"].value_counts()
amazon_df["Labels"].value_counts()
yelp_df["Labels"].value_counts()




                                             Phrases  Labels
0  [very, very, very, slow, moving, aimless, movi...       0
1  [not, sure, who, more, lost, flat, characters,...       0
2  [attempting, artiness, with, black, white, cle...       0
3         [very, little, music, anything, speak, of]       0
4  [best, scene, movie, when, gerardo, trying, fi...       1
5  [rest, of, movie, lacks, art, charm, meaning, ...       0
6                               [wasted, two, hours]       0
7  [saw, movie, today, thought, it, good, effort,...       1
8                                 [bit, predictable]       0
9  [loved, casting, of, jimmy, buffet, as, scienc...       1
                                             Phrases  Labels
0  [so, there, no, way, for, me, plug, it, here, ...       0
1                     [good, case, excellent, value]       1
2                              [great, for, jawbone]       1
3  [tied, charger, for, conversations, lasting, m...       0
4                       

1    500
0    500
Name: Labels, dtype: int64

In [None]:
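from gensim.parsing.preprocessing import STOPWORDS

# Display gensim's built-in stop-word list; it contains negations such as 'no' and 'cannot'.
STOPWORDS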


frozenset({'ourselves', 'so', 'must', 'now', 'has', 'sixty', 'me', 'who', 'many', 'are', 'above', 'thus', 'these', 'name', 'become', 'may', 'further', 'sometime', 'con', 'everyone', 'your', 'make', 'off', 'no', 'onto', 'in', 'due', 'behind', 'for', 'quite', 'thereby', 'namely', 'wherein', 'somehow', 'since', 'upon', 'because', 'those', 'herein', 'through', 'might', 'out', 'noone', 'whether', 'on', 'give', 'hereby', 'cannot', 'could', 'by', 'whatever', 'per', 'enough', 'indeed', 'of', 'us', 'almost', 'etc', 'therefore', 'still', 'doing', 'also', 'when', 'why', 'de', 'always', 'her', 'all', 'eight', 'whereafter', 'everywhere', 'hereafter', 'they', 'next', 'unless', 'whenever', 'seeming', 'around', 'he', 'sincere', 'much', 'three', 'nowhere', 'to', 'up', 'mine', 'don', 'anyhow', 'otherwise', 'over', 'fifteen', 'thru', 'well', 'nobody', 'seems', 'itself', 'last', 'really', 'top', 'if', 'whose', 'done', 'beside', 'former', 'go', 'six', 'each', 'four', 'my', 'keep', 'towards', 'down', 'was',

The DataFrame previews above show tokenized sentences from our example dataset of user reviews from IMDb, Amazon, and Yelp. Each sentence has been tokenized and stripped of unnecessary symbols.

In [None]:
data = pd.read_csv("IMDB_data.txt", sep='\t')

In [None]:
print(data[:5])
tokenized_lines = []
for line in data["review"]:
  line = line.lower()
  line = strip_punctuation(line)
  line = strip_multiple_whitespaces(line)
  line = strip_numeric(line)

  tokenized_lines.append(line.split())

tokenized_data = pd.DataFrame({"id":data["id"], "sentiment":data["sentiment"], "review":tokenized_lines})
print(tokenized_data[:5])

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...
       id  sentiment                                             review
0  5814_8          1  [with, all, this, stuff, going, down, at, the,...
1  2381_9          1  [the, classic, war, of, the, worlds, by, timot...
2  7759_3          0  [the, film, starts, with, a, manager, nicholas...
3  3630_4          0  [it, must, be, assumed, that, those, who, prai...
4  9495_8          1  [superbly, trashy, and, wondrously, unpretenti...


In [None]:
print(data.shape)
print(tokenized_data.shape)

(25000, 3)
(25000, 3)


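## Word2Vec
To convert text to a vector representation so that we can train a sentiment classifier, we ran word2vec on the tokenized text, representing each word as a vector with 100 features. We combined the data from all three websites to maximize the amount of data the word embeddings are trained on. We used the skip-gram variant, in which the model learns a vector for each word by trying to predict the words that appear within a fixed window around it. Instead of computing a softmax over the whole vocabulary for every prediction, negative sampling trains the model to tell each true (word, context) pair apart from a handful of randomly drawn "noise" words. Words that occur in similar contexts end up with similar vectors, and we measure that similarity with cosine similarity, the cosine of the angle between two vectors.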
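
Two of the ideas named above, skip-gram and cosine similarity, can be illustrated without any library. This is a minimal sketch under stated assumptions: the helper names `skipgram_pairs` and `cosine` are hypothetical, and the block shows only how training pairs are formed and how similarity is measured, not how gensim trains the vectors.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Return the (center, context) pairs a skip-gram model would be trained on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


pairs = skipgram_pairs(["good", "case", "excellent", "value"], window=1)
print(pairs)

# Vectors pointing the same way score about 1, orthogonal vectors score 0.
print(cosine([1.0, 2.0], [2.0, 4.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Because cosine similarity ignores vector length, two words can be judged similar even if one occurs far more often than the other.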

In [None]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
EMBEDDING_FILE = '/root/input/GoogleNews-vectors-negative300.bin.gz'



--2021-12-14 20:45:24--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.146.69
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.146.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/root/input/GoogleNews-vectors-negative300.bin.gz’


2021-12-14 20:46:01 (42.3 MB/s) - ‘/root/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [None]:
from gensim.models import word2vec
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

In [None]:

#import zipfile 

#!unzip /content/GoogleNews-vectors-negative300.bin.gz

#import gzip
#f=gzip.open("/content/GoogleNews-vectors-negative300.bin.gz",'rb')
#file_content=f.read()
#print(file_content)



all_tokenized_sentences = pd.concat([imdb_df, yelp_df, amazon_df])
all_tokenized_sentences = all_tokenized_sentences.reset_index(drop=True)
all_tokenized_sentences = all_tokenized_sentences.reset_index()
all_tokenized_sentences = all_tokenized_sentences.rename(columns = {"index": "UID"})
print(all_tokenized_sentences   )

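model_w2v = gensim.models.Word2Vec(
            all_tokenized_sentences["Phrases"],
            size=100, # embedding dimension (gensim 3.x name; gensim 4 renamed it to vector_size)
            window=5, # context window size
            min_count=1, # keep every word; words with total frequency below min_count are ignored
            sg = 1, # 1 for skip-gram model
            hs = 0, # no hierarchical softmax, so negative sampling is used
            negative = 10, # number of noise words drawn per positive example
            workers= 32, # number of worker threads
            seed = 34
)
# The constructor already trains on the sentences it is given; this call trains for 20 more epochs.
model_w2v.train(all_tokenized_sentences["Phrases"], total_examples= len(all_tokenized_sentences["Phrases"]), epochs=20)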
#print(all_tokenized_sentences['Phrases'][0])



       UID                                            Phrases  Labels
0        0  [very, very, very, slow, moving, aimless, movi...       0
1        1  [not, sure, who, more, lost, flat, characters,...       0
2        2  [attempting, artiness, with, black, white, cle...       0
3        3         [very, little, music, anything, speak, of]       0
4        4  [best, scene, movie, when, gerardo, trying, fi...       1
...    ...                                                ...     ...
2995  2995  [screen, does, get, smudged, easily, because, ...       0
2996  2996  [what, piece, of, junk, i, lose, more, calls, ...       0
2997  2997                  [item, does, not, match, picture]       0
2998  2998  [only, thing, that, disappoint, me, infra, red...       0
2999  2999  [you, can, not, answer, calls, with, unit, nev...       0

[3000 rows x 3 columns]


(470924, 573820)

In [None]:
print(model_w2v.wv.most_similar("dinner", topn=10))

[('lange', 0.9288738369941711), ('become', 0.9264037609100342), ('bachi', 0.9216042757034302), ('hostess', 0.918390154838562), ('marrow', 0.9169269800186157), ('update', 0.9166852235794067), ('putting', 0.914053201675415), ('bed', 0.9140514135360718), ('regular', 0.9127137660980225), ('ignored', 0.9120527505874634)]


For our model, the data to be classified is a sentence, not individual words. So now that we have trained word2vec and can embed each word as a vector, we take the average of the vectors for all tokens in a sentence to get a vector representation of the entire sentence. 
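
The averaging step can be sketched on its own with a toy vocabulary. The two-dimensional vectors below are made up for illustration, and `average_vector` is a hypothetical helper, not the notebook's code; words without a vector are skipped, and the sum is divided once at the end.

```python
def average_vector(tokens, vectors, dim):
    """Mean of the vectors of all known tokens; all zeros if none are known."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:  # out-of-vocabulary word
            continue
        total = [t + x for t, x in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Made-up 2-dimensional "embeddings", for illustration only.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}

sentence_vec = average_vector(["good", "movie", "unseen"], toy_vectors, dim=2)
print(sentence_vec)  # [0.5, 0.5]
```

Averaging discards word order, so "not good" and "good not" map to the same sentence vector.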

In [None]:
import numpy as np

EMBED_DIM = 100  # must match the `size` passed to Word2Vec above

# One row per sentence; a row stays all-zero if none of its words has a vector.
sentence_embeddings = np.zeros((len(all_tokenized_sentences), EMBED_DIM))
for num, phrase in enumerate(all_tokenized_sentences["Phrases"]):
  vec = np.zeros(EMBED_DIM)
  count = 0
  for word in phrase:
    try:
      vec += model_w2v.wv[word]
      count += 1
    except KeyError:  # word not in the vocabulary
      continue
  # Divide once, after summing, so that vec is the mean of the word vectors.
  if count != 0:
    vec /= count
  sentence_embeddings[num] = vec

embeddings_df = pd.DataFrame(sentence_embeddings)
print(embeddings_df.mean().mean())
print(embeddings_df.std().std())

# Standardize each sentence vector (row) to zero mean and unit variance.
# All-zero rows have zero standard deviation and become NaN; they are dropped below.
embeddings_df = embeddings_df.sub(embeddings_df.mean(axis=1), axis='rows')
embeddings_df = embeddings_df.div(embeddings_df.std(axis=1), axis='rows')

embeddings_df["Labels"] = all_tokenized_sentences["Labels"]
print(embeddings_df.min().min())
print(embeddings_df.max().max())
embeddings_df = embeddings_df.dropna(how='any', axis=0)
print(embeddings_df)






-0.008380801579187643
0.018725636737784185
-3.6054464807495363
3.787876349855741
             0         1         2  ...        98        99  Labels
0    -0.730128 -0.632606 -1.525628  ... -1.862945  0.888169       0
1     1.285443  0.039475  0.841172  ... -1.438221  0.726320       0
2    -1.282141  0.145933 -1.961896  ... -2.412739  1.485084       0
3    -1.035003  0.707328 -0.677356  ... -1.224561  0.653487       0
4    -1.709333 -0.248560 -1.861558  ... -1.914453  1.256154       1
...        ...       ...       ...  ...       ...       ...     ...
2995 -1.401742 -0.548999 -0.516806  ... -1.951500  0.535924       0
2996 -1.638751  0.376282 -1.264383  ... -1.258569 -0.270860       0
2997 -1.386679  0.205948 -1.421735  ... -1.992234  0.121745       0
2998 -1.266277  0.409864 -1.896123  ... -2.386592  1.662854       0
2999 -0.882469 -0.401198 -0.305942  ... -1.309577  1.700173       0

[2998 rows x 101 columns]


-3.985183726251882e-06
0.00043072569659024243


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
v_apple = word_vectors["apple"] 
v_mango = word_vectors["mango"]
print(v_apple.shape)
print(v_mango.shape)
print(v_mango.reshape(1,300).shape)
print(cosine_similarity([v_mango],[v_apple]))
# word_vectors is a KeyedVectors object, which exposes most_similar directly.
print(word_vectors.most_similar("dinner", topn=10))

NameError: ignored

In [None]:
EMBED_DIM_PT = 300  # dimension of the pretrained GoogleNews vectors

pt_sentence_embeddings = np.zeros((len(all_tokenized_sentences), EMBED_DIM_PT))
for num, phrase in enumerate(all_tokenized_sentences["Phrases"]):
  vec = np.zeros(EMBED_DIM_PT)
  count = 0
  for word in phrase:
    try:
      vec += word_vectors[word]
      count += 1
    except KeyError:  # word missing from the pretrained vocabulary
      continue
  # Divide once, after summing, so that vec is the mean of the word vectors.
  if count != 0:
    vec /= count
  pt_sentence_embeddings[num] = vec

print(pt_sentence_embeddings[:2])
pt_embeddings_df = pd.DataFrame(pt_sentence_embeddings)

# Standardize each sentence vector (row) to zero mean and unit variance.
pt_embeddings_df = pt_embeddings_df.sub(pt_embeddings_df.mean(axis=1), axis='rows')
pt_embeddings_df = pt_embeddings_df.div(pt_embeddings_df.std(axis=1), axis='rows')

pt_embeddings_df["Labels"] = all_tokenized_sentences["Labels"]

# All-zero rows have zero standard deviation and become NaN; count them, then drop them.
print(pt_embeddings_df.isnull().sum().sum())
pt_embeddings_df = pt_embeddings_df.dropna(how='any', axis=0)
print(pt_embeddings_df)


[[ 2.79852732e-02  1.34874607e-02  2.42976590e-03 -6.91709925e-03
   7.12158108e-03 -3.74082028e-03 -1.72039271e-02  1.31641859e-04
   1.09635908e-02  7.62919531e-04  3.16937715e-03 -2.06804650e-02
  -7.23117613e-03 -1.28787153e-02 -9.71366916e-03  2.25327974e-03
  -8.96228498e-03  7.22456788e-03  1.76048754e-03 -7.91555433e-03
  -1.01615373e-02  9.26663545e-04  1.91333605e-03 -2.24893464e-02
   5.42205690e-03 -1.74922017e-02 -6.83247888e-03  2.36653970e-02
   3.20949703e-02 -3.73824448e-03  1.35545921e-02  1.17330842e-02
  -5.17623773e-03 -1.57497262e-03 -7.14972374e-03  2.06626339e-02
   2.04240735e-02 -1.11550336e-02  5.67605353e-03  2.81117341e-02
   2.55269738e-03 -3.05485013e-03  1.78088184e-02 -1.44712809e-03
   1.96807916e-02 -1.12643858e-02 -7.18977912e-03  6.97322324e-04
  -4.29980348e-03  2.51486602e-03 -1.06795131e-02  1.09748586e-02
   4.23014165e-03 -1.54960489e-02  7.52638772e-03 -7.89925451e-03
  -5.42811469e-03 -4.21271170e-03  4.23682644e-03 -6.84809244e-04
   1.52150

In [None]:
#bigger dataset

big_embeddings = np.zeros((len(tokenized_data), 300))
# Use the tokenized reviews: iterating over a raw string in data["review"]
# would walk through it character by character instead of word by word.
for num, phrase in enumerate(tokenized_data["review"]):
  vec = np.zeros(300)
  count = 0
  for word in phrase:
    try:
      vec += word_vectors[word]
      count += 1
    except KeyError:  # word missing from the pretrained vocabulary
      continue
  # Divide once, after summing, so that vec is the mean of the word vectors.
  if count != 0:
    vec /= count
  big_embeddings[num] = vec

big_embeddings_df = pd.DataFrame(big_embeddings)
# Standardize each review vector (row) to zero mean and unit variance.
big_embeddings_df = big_embeddings_df.sub(big_embeddings_df.mean(axis=1), axis='rows')
big_embeddings_df = big_embeddings_df.div(big_embeddings_df.std(axis=1), axis='rows')

big_embeddings_df["Labels"] = data['sentiment']

print(big_embeddings_df.isnull().sum().sum())
big_embeddings_df = big_embeddings_df.dropna(how='any', axis=0)
print(big_embeddings_df)

0
              0         1         2  ...       298       299  Labels
0     -2.212821  1.450381 -0.226155  ... -0.866486  2.194698       1
1     -2.439197  1.342694  0.309873  ...  0.850287  2.751547       1
2     -0.159698  1.826075 -0.425596  ... -1.454917  0.942753       0
3     -0.980355  1.041724 -0.314449  ... -0.221568 -1.120664       0
4     -0.287194  0.973776 -0.035684  ... -0.984885  1.643056       1
...         ...       ...       ...  ...       ...       ...     ...
24995 -2.438734  1.343486  0.311355  ...  0.850847  2.751584       0
24996 -1.977057  1.216293 -1.733094  ...  0.237154  0.582399       0
24997  0.830364 -0.661113  0.556532  ... -0.313325  1.510859       0
24998 -0.287257  0.975322 -0.035872  ... -0.985168  1.643689       0
24999 -0.527713  1.411506  0.478683  ... -0.851087  0.268782       1

[25000 rows x 301 columns]



## SVM Model
Our first approach was to use an SVM model for binary classification. The two classes are 1 for positive sentiment and 0 for negative sentiment. We trained the model on the vector representation of each sentence.

It was in this step that we hit our first major hurdle. When we looked at our model's F1 score and accuracy, we got exactly the same numbers regardless of which parameters we modified. These scores were also very low: in our initial attempt, accuracy was below 50%. The confusion matrix for this model showed that every single prediction was either a true positive or a false positive, that is, the model predicted the positive class for every sentence, so there was clearly something fundamentally wrong with our model. Our guess was that the word embeddings trained on the dataset were not useful because the dataset is so small. Even when combining all 3 example datasets, we had only 3000 example sentences, and many of them were four words or fewer. To get useful embeddings, which are intended to encode meaning and context, we would need a much larger dataset. To remedy this, we decided to use a pretrained version of word2vec, trained on an enormous quantity of sentences, in our preprocessing, and to retrain our model on these new sentence embeddings.
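
Identical scores across parameter settings are what one would expect from a classifier that always predicts the positive class. The sketch below recomputes the metrics from the counts in the all-positive confusion matrix printed later in this notebook (TP = 374, FP = 376, FN = 0, TN = 0); the helper `binary_metrics` is a hypothetical name introduced for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


precision, recall, f1, accuracy = binary_metrics(tp=374, fp=376, fn=0, tn=0)
print(round(precision, 4), round(recall, 4))  # 0.4987 1.0
print(round(f1, 4), round(accuracy, 4))       # 0.6655 0.4987
```

Recall is perfect because no positive is ever missed, so the F1 score looks moderate even though accuracy on the balanced test set is no better than chance. This is why the confusion matrix is more informative than F1 alone here.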

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix 

#train_w2v = embeddings_df.iloc[:2400,:]
#test_w2v = embeddings_df.iloc[2400:,:]

embeddings_df_no_labels  = embeddings_df.loc[:,embeddings_df.columns != 'Labels']
xtrain, xtest, ytrain, ytest = train_test_split(embeddings_df_no_labels , embeddings_df['Labels'], random_state=42, test_size=0.25)
print(xtrain[:10])
print(ytrain[:10])

svc = svm.SVC(kernel='linear', C=1, gamma = 1, probability=True).fit(xtrain, ytrain)  # gamma is ignored by the linear kernel
prediction = svc.predict_proba(xtest)
prediction_int = prediction[:,1] >= 0.5  # threshold the positive-class probability
prediction_int = prediction_int.astype(int)
#prediction = svc.predict(xtrain) 
print(f1_score(ytest, prediction_int))
print(accuracy_score(ytest, prediction_int))
print(confusion_matrix(ytest, prediction_int))
TN, FP, FN, TP = confusion_matrix(ytest, prediction_int).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))



            0         1         2   ...        97        98        99
1066 -1.134841 -0.269029 -1.297176  ...  0.394902 -1.825627  1.860990
663  -1.511112  0.227453 -1.663059  ... -0.089047 -2.112329  2.020554
481  -1.314494  0.242423 -1.921818  ...  0.482243 -2.306766  1.840990
2138 -1.405848  0.369150 -2.150020  ...  0.671640 -2.063225  1.369380
2703 -1.384383  0.619844 -1.763493  ...  0.977955 -2.440726  1.279768
1338 -0.481294  0.508624 -0.697093  ...  0.896794 -1.591527  1.697029
2981 -1.339690  1.803891 -1.933986  ...  0.186324 -1.978891  1.125692
2048 -1.577000  1.192710 -2.178420  ...  0.151728 -1.690598  1.745137
1243 -1.782035 -0.174848 -2.155849  ...  0.342310 -2.173884  0.739187
2708 -1.503534  0.369664 -1.836620  ...  0.367803 -1.394882  1.766591

[10 rows x 100 columns]
1066    1
663     1
481     0
2138    0
2703    1
1338    1
2981    0
2048    1
1243    0
2708    0
Name: Labels, dtype: int64
0.6763540290620872
0.6733333333333333
[[249 120]
 [125 256]]
TP = 256
FP = 120

In [None]:

pt_embeddings_df_no_labels  = pt_embeddings_df.loc[:,pt_embeddings_df.columns != 'Labels']
xtrain_pt, xtest_pt, ytrain_pt, ytest_pt = train_test_split(pt_embeddings_df_no_labels, pt_embeddings_df['Labels'], random_state=42, test_size=0.25)
print(xtrain_pt[:10])
print(ytrain_pt[:10])

svc_pt = svm.SVC(kernel='linear', C=1, probability=True).fit(xtrain_pt, ytrain_pt) 


           0         1         2    ...       297       298       299
1761  0.848872  0.672286 -0.698902  ... -0.366251 -1.542368  0.880810
2795 -1.118913 -0.251022 -0.213904  ... -0.302753 -0.988999  0.999010
659   0.097256 -0.336943  0.372916  ... -1.496920 -0.318650 -0.543134
1194  0.663193 -0.296420 -0.446368  ... -0.875605 -1.009303  0.219543
1042  1.156703  0.905345 -0.363905  ...  0.953512  0.507924 -0.107910
2623  1.149816 -0.014923  0.411930  ...  0.406199 -0.198657 -0.123025
2587 -0.484713  1.161837  1.076292  ...  0.979939  0.333367  0.193056
2107 -1.012698  1.330769 -1.029547  ...  0.361821  1.017335 -0.356732
2433 -0.108252 -0.779854 -0.693216  ... -0.152409  0.291147 -0.689624
2257  0.530908  1.535280 -0.207971  ... -1.236642 -0.152173 -1.049447

[10 rows x 300 columns]
1761    0
2795    1
659     1
1194    0
1042    0
2623    0
2587    0
2107    1
2433    1
2257    1
Name: Labels, dtype: int64


In [None]:
prediction_pt = svc_pt.predict_proba(xtest_pt) 
prediction_int_pt = prediction_pt[:,1] >= 0.5
prediction_int_pt = prediction_int_pt.astype(int)
print(f1_score(ytest_pt, prediction_int_pt))
print(accuracy_score(ytest_pt, prediction_int_pt))
print(confusion_matrix(ytest_pt, prediction_int_pt))
TN, FP, FN, TP = confusion_matrix(ytest_pt, prediction_int_pt).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

0.6344086021505376
0.6358768406961178
[[239 133]
 [139 236]]
TP = 236
FP = 133
FN = 139
TN = 239


In [None]:
#print(xtest.loc[0])
print(svc)
prediction = svc.predict_proba([xtest.loc[7]])
print(prediction)
prediction_int = prediction[:,1] >= 0.5
print(prediction_int)
prediction_int = prediction_int.astype(int)
print(prediction_int)

SVC(C=1, gamma=1, kernel='linear', probability=True)
[[0.62835496 0.37164504]]
[False]
[0]


In [None]:
#Big dataset svm
big_embeddings_df_no_labels  = big_embeddings_df.loc[:,big_embeddings_df.columns != 'Labels']
xtrain_big, xtest_big, ytrain_big, ytest_big = train_test_split(big_embeddings_df_no_labels, big_embeddings_df['Labels'], random_state=42, test_size=0.25)
print(xtrain_big[:10])
print(ytrain_big[:10])

svc_big = svm.SVC(kernel='linear', C=1, probability=True).fit(xtrain_big, ytrain_big) 
prediction_big = svc_big.predict_proba(xtest_big) 
prediction_int_big = prediction_big[:,1] >= 0.5
prediction_int_big = prediction_int_big.astype(int)
print(f1_score(ytest_big, prediction_int_big))
print(accuracy_score(ytest_big, prediction_int_big))
print(confusion_matrix(ytest_big, prediction_int_big))
TN, FP, FN, TP = confusion_matrix(ytest_big, prediction_int_big).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

            0         1         2    ...       297       298       299
6920  -1.980397  1.218343 -1.731330  ... -0.912561  0.240662  0.587081
17926 -0.752291  0.355936  0.919886  ... -0.513939 -0.409460  1.195257
1123  -2.436951  1.343560  0.309208  ... -0.278999  0.851972  2.750618
4518  -0.289672  0.975285 -0.035760  ...  0.447473 -0.984770  1.645191
5576  -0.289698  0.973217 -0.034247  ...  0.448660 -0.985622  1.642261
14265 -1.978273  1.217246 -1.732899  ... -0.912768  0.239628  0.583597
8845  -0.752541  0.357346  0.919970  ... -0.513086 -0.409592  1.197406
2223  -0.527726  1.411375  0.478760  ... -0.773530 -0.850963  0.268689
23631 -0.161325  1.824720 -0.424586  ... -0.415728 -1.455227  0.941276
7315   0.830291 -0.660750  0.556271  ...  0.073973 -0.313184  1.511314

[10 rows x 300 columns]
6920     1
17926    1
1123     1
4518     1
5576     0
14265    0
8845     1
2223     1
23631    0
7315     0
Name: Labels, dtype: int64
0.4898921832884098
0.51552
[[1768 1324]
 [1704 1454]]
TP 

In [None]:
prediction_pt2 = svc_pt.predict(xtest_pt) 
print(f1_score(ytest_pt, prediction_pt2))
print(accuracy_score(ytest_pt, prediction_pt2))
print(confusion_matrix(ytest_pt, prediction_pt2))
TN, FP, FN, TP = confusion_matrix(ytest_pt, prediction_pt2).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

0.6654804270462633
0.49866666666666665
[[  0 376]
 [  0 374]]
TP = 374
FP = 376
FN = 0
TN = 0


### Hyperparameter Search

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf','linear']}

grid = GridSearchCV(svm.SVC(probability=True), param_grid, refit = True, verbose = 3)
 
# fitting the model for grid search
grid.fit(xtrain, ytrain)

print(grid.best_params_)
 
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.502 total time=   1.8s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time=   1.7s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time=   1.7s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time=   1.7s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time=   1.8s
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.502 total time=   1.0s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.500 total time=   1.0s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.500 total time=   1.1s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.500 total time=   1.0s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.500 total time=   1.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.502 total time=   1.7s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf
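
The log above reports 50 candidates, which is the full product of the three parameter lists. A short standalone sketch, using only the standard library, shows where that number comes from and how many of the settings are actually distinct, given that the linear kernel does not use `gamma`. The helper `effective_setting` is a hypothetical name introduced for illustration.

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

# Every combination of the three lists, as a grid search would enumerate them.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]


def effective_setting(params):
    """Collapse settings that differ only in a parameter the kernel ignores."""
    if params['kernel'] == 'linear':
        return ('linear', params['C'])
    return ('rbf', params['C'], params['gamma'])


distinct = {effective_setting(p) for p in candidates}
print(len(candidates))  # 50
print(len(distinct))    # 30
```

`GridSearchCV` also accepts a list of dictionaries as `param_grid`, so the linear and RBF kernels could each be given their own grid and the 20 redundant linear fits per fold avoided.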

In [None]:
grid_predictions = grid.predict_proba(xtest)
grid_prediction_int = grid_predictions[:,1] >= 0.3
grid_prediction_int = grid_prediction_int.astype(int)
print(f1_score(ytest, grid_prediction_int))
print(accuracy_score(ytest, grid_prediction_int))

svcbest = svm.SVC(kernel='rbf', C=1000, gamma= 1, probability=True).fit(xtrain, ytrain) 
prediction2 = svcbest.predict_proba(xtest) 
prediction_int2 = prediction2[:,1] >= 0.3
prediction_int2 = prediction_int2.astype(int)
print(f1_score(ytest, prediction_int2))
print(accuracy_score(ytest, prediction_int2))

0.6654804270462633
0.49866666666666665
0.6654804270462633
0.49866666666666665


In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

EOFError: ignored

Deep learning approach: our first attempt was equally flawed because of the same problem with the size of the training data for word2vec.

In [None]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                 hidden_layer_sizes=(15, 3), random_state=1,max_iter=10000)
clf.fit(xtrain, ytrain)
pred = clf.predict(xtest)
print(f1_score(ytest, pred))
print(accuracy_score(ytest,pred))
TN, FP, FN, TP = confusion_matrix(ytest, pred).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

0.7020725388601037
0.6933333333333334
TP = 271
FP = 120
FN = 110
TN = 249


In [None]:
clf_pt = MLPClassifier(solver='lbfgs', alpha=1e-5,
                 hidden_layer_sizes=(5, 2), random_state=1,max_iter=10000)
clf_pt.fit(xtrain_pt, ytrain_pt)
pred_ptc = clf_pt.predict(xtest_pt)
print(f1_score(ytest_pt, pred_ptc))
print(accuracy_score(ytest_pt,pred_ptc))
TN, FP, FN, TP = confusion_matrix(ytest_pt, pred_ptc).ravel()
print("TP = "+str(TP))
print("FP = "+str(FP))
print("FN = "+str(FN))
print("TN = "+str(TN))

0.7032967032967034
0.6746987951807228
TP = 288
FP = 156
FN = 87
TN = 216


## Comparing VADER Results to Binary Labels
VADER is a rule-based sentiment algorithm. It can be helpful for establishing a baseline for sentiment. VADER is most commonly used on tweets, where it has been found to have significant success. With this specific data, VADER had a base accuracy of 65-70%, which could be improved to around 80%. For the code exploring this, see the attached FinalProjectVADER Jupyter notebook.
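
VADER reports a continuous compound score between -1 and 1, so comparing it with 0/1 labels requires a cut-off. The sketch below shows one way this comparison could be set up. It does not call VADER itself: the scores and labels are made-up illustrative values, the threshold of 0 is an assumption, and `compound_to_label` and `accuracy` are hypothetical helpers, not code from the FinalProjectVADER notebook.

```python
def compound_to_label(compound, threshold=0.0):
    """Map a compound score to 1 (positive) or 0 (negative); the threshold is an assumption."""
    return 1 if compound > threshold else 0


def accuracy(predicted, actual):
    """Fraction of predictions that match the labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up compound scores and labels, for illustration only.
example_scores = [0.62, -0.48, 0.05, 0.31]
example_labels = [1, 0, 1, 0]

predicted = [compound_to_label(score) for score in example_scores]
print(predicted)                            # [1, 0, 1, 1]
print(accuracy(predicted, example_labels))  # 0.75
```

Moving the threshold changes which borderline sentences count as positive, which is one knob a baseline like this exposes.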
