## General Machine Learning Models

In this notebook we use general (classical) machine learning models to predict the sentiment of a tweet. Each model is trained on text features produced by either a `CountVectorizer` or a `TfidfVectorizer`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, classification_report, confusion_matrix

In [2]:
df = pd.read_csv('clean_tweets.csv')

In [3]:
df['clean_tweets'].isnull().sum()

38

In [4]:
df.dropna(inplace=True)

In [5]:
df[['clean_tweets','sentiment']]

|  | clean_tweets | sentiment |
|---|---|---|
| 0 | id respond go | neutral |
| 1 | sooo sad miss san diego | negative |
| 2 | boss bulli | negative |
| 3 | interview leav alon | negative |
| 4 | son whi couldnt put releas whatev alreadi bought | negative |
| ... | ... | ... |
| 24457 | wish whatev could come see denver husband lost... | negative |
| 24458 | ive wonder rake client ha made clear net onli ... | negative |
| 24459 | yay good enjoy break probabl need hectic weeke... | positive |
| 24460 | wa worth | positive |


## Baseline Model 

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer


## Creating X and y variables with CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,3), min_df=5)
X = vectorizer.fit_transform(df['clean_tweets'])
X = X.tocsc() 
y = df['sentiment']


In [7]:
from sklearn.dummy import DummyClassifier

#split dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.20, random_state=1)

dclf = DummyClassifier(strategy='most_frequent')
dclf.fit(X_train,y_train)
y_pred = dclf.predict(X_test)
print(dclf.score(X_train,y_train))
print(dclf.score(X_test, y_test))

0.3498643738164696
0.3572159672466735


In [8]:
# zero_division=0 reports 0.00 for the two never-predicted classes without raising a warning
print(classification_report(y_test, y_pred, zero_division=0))

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      1510
     neutral       0.00      0.00      0.00      1630
    positive       0.36      1.00      0.53      1745

    accuracy                           0.36      4885
   macro avg       0.12      0.33      0.18      4885
weighted avg       0.13      0.36      0.19      4885
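
The figures in this report follow directly from the class supports, because a `most_frequent` classifier predicts `positive` for every tweet. As a minimal sketch in plain Python (variable names are illustrative, and the supports are copied from the report above), the baseline scores can be rederived as follows:

```python
# Supports taken from the classification report above.
support = {"negative": 1510, "neutral": 1630, "positive": 1745}
total = sum(support.values())

# The dummy model predicts the majority class for every test tweet.
majority = max(support, key=support.get)

# Every prediction is the majority class, so its precision equals the accuracy,
# and its recall is 1.0 because no true member of that class is missed.
accuracy = support[majority] / total
precision = accuracy
recall = 1.0
f1 = 2 * precision * recall / (precision + recall)

# The other two classes are never predicted, so their F1 is reported as 0.
macro_f1 = f1 / len(support)
weighted_f1 = f1 * support[majority] / total

print(majority, round(accuracy, 4), round(f1, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

Any real model should therefore beat roughly 0.36 accuracy and 0.18 macro F1 to be considered useful on this split.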





## CountVectorizer Models

`CountVectorizer` implements a bag-of-words model: each document is first split into tokens, and the document is then encoded as a vector of counts, with one entry per term in the vocabulary. Word order is ignored apart from what the n-grams capture.
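
To make the encoding concrete, here is a small sketch on a made-up three-document corpus (the documents and the `toy_vectorizer` name are illustrative, not part of the tweet data). It uses unigrams and bigrams to keep the output short, whereas the cells below use `ngram_range=(1,3)` with `min_df=5`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from clean_tweets.csv).
docs = ["good day", "bad day", "good good day"]

# Count unigrams and bigrams in each document.
toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = toy_vectorizer.fit_transform(docs)

# List the learned terms in column order.
features = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(features)
print(counts.toarray())  # one row per document, one column per term
```

Each row is a document and each column is a unigram or bigram, so the third document has a count of 2 in the `good` column.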

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer


## Creating X and y variables with CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,3), min_df=5)
X = vectorizer.fit_transform(df['clean_tweets'])
X = X.tocsc() 
y = df['sentiment']


#split dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.20, random_state=1)

### Support Vector Classifier with Grid Search

In [10]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc = SVC()
param_grid = {
    'C': [1, 10, 20],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
}

gs = GridSearchCV(estimator=svc, param_grid=param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
y_pred = gs.predict(X_test)
print("Training Accuracy: {:.2f}".format(gs.score(X_train, y_train)))
print("Test Accuracy: {:.2f}".format(gs.score(X_test, y_test)))
gs.best_params_

Training Accuracy: 0.70
Test Accuracy: 0.70


{'C': 1, 'gamma': 'scale', 'kernel': 'sigmoid'}
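
In the cells above the vectorizer is fit on all tweets before `train_test_split`, so the vocabulary and the `min_df` filter also see the test tweets. One possible alternative, sketched here on made-up data, is to put the vectorizer and the classifier into a `Pipeline` so that the vocabulary is learned only from the training folds during the grid search. The toy texts, the reduced grid, and all variable names are assumptions for illustration; this is not the configuration that produced the results in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for df['clean_tweets'] and df['sentiment'] (hypothetical data).
toy_texts = ["good day", "great fun", "love it", "nice work",
             "happy now", "so good", "good fun", "love good day",
             "bad day", "sad news", "hate it", "poor work",
             "angry now", "so bad", "bad news", "hate bad day"]
toy_labels = ["positive"] * 8 + ["negative"] * 8

# Split the raw text first, so nothing is fit on the held-out tweets.
texts_train, texts_test, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=1, stratify=toy_labels)

# The vectorizer is refit inside every cross-validation fold.
pipe = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),
    ("svc", SVC()),
])
pipe_grid = {"svc__C": [1, 10], "svc__kernel": ["linear", "rbf"]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, scoring="accuracy")
pipe_search.fit(texts_train, y_tr)

toy_test_accuracy = pipe_search.score(texts_test, y_te)
print(pipe_search.best_params_, toy_test_accuracy)
```

With a pipeline, vectorizer settings such as `vec__min_df` could also be added to the grid and tuned together with the classifier.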

In [11]:
print(f1_score(y_test,y_pred, average='macro'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))

0.6982681042860944
              precision    recall  f1-score   support

    negative       0.75      0.62      0.68      1510
     neutral       0.60      0.72      0.65      1630
    positive       0.77      0.74      0.76      1745

    accuracy                           0.70      4885
   macro avg       0.71      0.70      0.70      4885
weighted avg       0.71      0.70      0.70      4885

[[ 942  428  140]
 [ 219 1175  236]
 [  87  363 1295]]


### Naive Bayes Multinomial Model

In [12]:
from sklearn.naive_bayes import MultinomialNB

## finding best value for alpha (the smoothing strength)
alphas = [.1, 1, 5, 10, 50]

for alpha in alphas:
    nbc = MultinomialNB(alpha=alpha)
    nbc.fit(X_train, y_train)
    print("Training Accuracy:", nbc.score(X_train, y_train),
          "Test Accuracy:", nbc.score(X_test, y_test), 'alpha:', alpha)

Training Accuracy: 0.7598648856133886 Test Accuracy: 0.6165813715455476 alpha: 0.1
Training Accuracy: 0.7469676032550284 Test Accuracy: 0.6327533265097236 alpha: 1
Training Accuracy: 0.7293106095501305 Test Accuracy: 0.6452405322415558 alpha: 5
Training Accuracy: 0.7165156865755669 Test Accuracy: 0.6499488229273286 alpha: 10
Training Accuracy: 0.6707098623266288 Test Accuracy: 0.6249744114636643 alpha: 50
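
The loop above compares the alphas by their test accuracy, so the test set doubles as the selection set and the score later reported for `alpha=10` may be slightly optimistic. A hedged alternative is to choose alpha by cross-validation on the training data only. The helper `select_alpha` and the toy data below are hypothetical; in the notebook the call would use `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB


def select_alpha(X_train, y_train, alphas, cv=3):
    """Pick alpha by cross-validated accuracy, using the training data only."""
    search = GridSearchCV(MultinomialNB(), {"alpha": list(alphas)},
                          cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    return search.best_params_["alpha"], search.best_score_


# Toy stand-in data (hypothetical); replace with X_train, y_train in the notebook.
toy_texts = ["good day", "great fun", "love it", "nice work", "happy now", "so good",
             "bad day", "sad news", "hate it", "poor work", "angry now", "so bad"]
toy_labels = ["positive"] * 6 + ["negative"] * 6
toy_X = CountVectorizer().fit_transform(toy_texts)

best_alpha, best_cv_accuracy = select_alpha(toy_X, toy_labels, [.1, 1, 5, 10, 50])
print(best_alpha, best_cv_accuracy)
```

The test set is then touched only once, to score the model refit with the selected alpha.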


In [13]:
## naive bayes

from sklearn.naive_bayes import MultinomialNB

#initiate classifier
nbc = MultinomialNB(alpha=10)
nbc.fit(X_train, y_train)

y_pred = nbc.predict(X_test)

print("Training Accuracy: {:.2f}".format(nbc.score(X_train, y_train)))
print("Test Accuracy: {:.2f}".format(nbc.score(X_test, y_test)))

Training Accuracy: 0.72
Test Accuracy: 0.65


In [14]:
print(f1_score(y_test,y_pred, average='macro'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))

0.6427851818665348
              precision    recall  f1-score   support

    negative       0.65      0.70      0.67      1510
     neutral       0.59      0.47      0.53      1630
    positive       0.69      0.77      0.73      1745

    accuracy                           0.65      4885
   macro avg       0.64      0.65      0.64      4885
weighted avg       0.64      0.65      0.64      4885

[[1064  276  170]
 [ 433  772  425]
 [ 146  260 1339]]


### RandomForestClassifier with Grid Search

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

param_grid = {
    'max_depth': [10, 20, 30],
    'max_features': ['sqrt', 'log2'],
    'n_estimators': [150, 200, 300],
    'criterion': ['gini', 'entropy'],
}

gs = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
y_pred = gs.predict(X_test)
print("Training Accuracy: {:.2f}".format(gs.score(X_train, y_train)))
print("Test Accuracy: {:.2f}".format(gs.score(X_test, y_test)))
gs.best_params_

Training Accuracy: 0.76
Test Accuracy: 0.68


{'criterion': 'gini',
 'max_depth': 30,
 'max_features': 'sqrt',
 'n_estimators': 200}

In [16]:
print(f1_score(y_test,y_pred, average='macro'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))

0.6759199263743797
              precision    recall  f1-score   support

    negative       0.76      0.56      0.65      1510
     neutral       0.56      0.74      0.63      1630
    positive       0.78      0.72      0.75      1745

    accuracy                           0.68      4885
   macro avg       0.70      0.67      0.68      4885
weighted avg       0.70      0.68      0.68      4885

[[ 848  537  125]
 [ 191 1200  239]
 [  73  416 1256]]


## TFIDFVectorizer Models

`TfidfVectorizer` differs from `CountVectorizer` in that it reweights the counts instead of using them directly. Each term's frequency in a document is scaled by its inverse document frequency, so terms that appear in many documents are down-weighted, while terms that appear in only a few documents are treated as more informative and receive higher weights.
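
As a small illustration on a made-up corpus (the documents and the `toy_tfidf` name are assumptions, not part of the tweet data), the sketch below checks scikit-learn's default smoothed weighting, $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, followed by L2 normalisation of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from clean_tweets.csv).
docs = ["good day", "bad day", "good good day"]

toy_tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
weights = toy_tfidf.fit_transform(docs).toarray()
terms = sorted(toy_tfidf.vocabulary_, key=toy_tfidf.vocabulary_.get)

# Smoothed inverse document frequency: ln((1 + n_docs) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = np.array([1, 3, 2])  # number of documents containing 'bad', 'day', 'good'
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(terms)
print(toy_tfidf.idf_)      # learned idf per term, in the same order as `terms`
print(weights.round(2))    # each row has unit L2 norm
```

Here `day` occurs in every document and gets the smallest idf, so in the second document the rarer term `bad` ends up with a larger weight than `day`.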

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfvec = TfidfVectorizer(ngram_range=(1,3),min_df=5)
X = tfvec.fit_transform(df['clean_tweets'])
X = X.tocsc()
y = df['sentiment']

from sklearn.model_selection import train_test_split


#split dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.20, random_state=1)


### Naive Bayes Multinomial Model 

In [18]:
from sklearn.naive_bayes import MultinomialNB

#initiate classifier
nbc = MultinomialNB(alpha=10)
nbc.fit(X_train, y_train)
y_pred = nbc.predict(X_test)

print("Training Accuracy: {:.2f}".format(nbc.score(X_train, y_train)))
print("Test Accuracy: {:.2f}".format(nbc.score(X_test, y_test)))

Training Accuracy: 0.72
Test Accuracy: 0.65


In [19]:
print(f1_score(y_test,y_pred, average='macro'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))

0.6448989374888093
              precision    recall  f1-score   support

    negative       0.69      0.65      0.67      1510
     neutral       0.59      0.49      0.54      1630
    positive       0.67      0.81      0.73      1745

    accuracy                           0.65      4885
   macro avg       0.65      0.65      0.64      4885
weighted avg       0.65      0.65      0.65      4885

[[ 978  325  207]
 [ 333  797  500]
 [ 106  226 1413]]


### Support Vector Classifier with Grid Search

In [20]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc = SVC()
param_grid = {
    'C': [1, 10, 20],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
}

gs = GridSearchCV(estimator=svc, param_grid=param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
y_pred = gs.predict(X_test)
print("Training Accuracy: {:.2f}".format(gs.score(X_train, y_train)))
print("Test Accuracy: {:.2f}".format(gs.score(X_test, y_test)))
gs.best_params_

Training Accuracy: 0.81
Test Accuracy: 0.69


{'C': 1, 'gamma': 'scale', 'kernel': 'linear'}

In [21]:
print(f1_score(y_test,y_pred, average='macro'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))

0.6932216271144324
              precision    recall  f1-score   support

    negative       0.70      0.66      0.68      1510
     neutral       0.60      0.67      0.63      1630
    positive       0.79      0.74      0.76      1745

    accuracy                           0.69      4885
   macro avg       0.70      0.69      0.69      4885
weighted avg       0.70      0.69      0.70      4885

[[ 994  402  114]
 [ 299 1099  232]
 [ 118  333 1294]]


### RandomForestClassifier with Grid Search

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier 

tfvec = TfidfVectorizer(ngram_range=(1,3),min_df=5)
X = tfvec.fit_transform(df['clean_tweets'])
X = X.tocsc()
y = df['sentiment']

from sklearn.model_selection import train_test_split

#split dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.20, random_state=1)

#initiate classifier

param_grid = {
    'max_depth': [10, 20, 30],
    'max_features': ['sqrt', 'log2'],
    'n_estimators': [150, 200, 300],
    'criterion': ['gini', 'entropy'],
}

rfc = RandomForestClassifier()

gs = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
y_pred = gs.predict(X_test)
print("Training Accuracy: {:.2f}".format(gs.score(X_train, y_train)))
print("Test Accuracy: {:.2f}".format(gs.score(X_test, y_test)))
gs.best_params_

Training Accuracy: 0.76
Test Accuracy: 0.67


{'criterion': 'gini',
 'max_depth': 30,
 'max_features': 'sqrt',
 'n_estimators': 300}

In [23]:
print(f1_score(y_test,y_pred, average='macro'))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))

0.6725121120809634
              precision    recall  f1-score   support

    negative       0.76      0.56      0.64      1510
     neutral       0.55      0.73      0.63      1630
    positive       0.77      0.72      0.74      1745

    accuracy                           0.67      4885
   macro avg       0.70      0.67      0.67      4885
weighted avg       0.70      0.67      0.67      4885

[[ 843  537  130]
 [ 195 1190  245]
 [  67  424 1254]]


# Model Selection Summary 

## Best Model 

The best-performing model was the Support Vector Classifier trained on CountVectorizer features. It reached a test accuracy of 0.70 (70%) and a macro F1 score of 0.70, compared with an accuracy of 0.36 for the majority-class baseline. The best hyperparameters found were `C` set to 1, `gamma` set to `scale`, and `kernel` set to `sigmoid`. Its training and test accuracy were both 0.70, and its macro-averaged precision (0.71) and recall (0.70) were close, although the per-class values differ (for example, 0.75 precision against 0.62 recall for `negative`). The confusion matrix shows that the class it struggled with most was `neutral`, which had the lowest precision (0.60) and F1 score (0.65): 428 negative and 363 positive tweets were predicted as neutral. In my opinion this makes sense, because even for a human it is difficult to judge whether someone's tone is neutral.
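
The reading of the confusion matrix can be checked by hand. The following sketch in plain Python copies the matrix printed for the CountVectorizer SVC (rows are true classes, columns are predicted classes, as in scikit-learn's `confusion_matrix`) and recomputes per-class precision and recall; the variable names are illustrative.

```python
# Confusion matrix of the CountVectorizer SVC, copied from the output above.
labels = ["negative", "neutral", "positive"]
cm = [[942, 428, 140],
      [219, 1175, 236],
      [87, 363, 1295]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

precision, recall = {}, {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum: everything predicted as this class
    truly_label = sum(cm[i])                        # row sum: everything that really is this class
    precision[label] = cm[i][i] / predicted_as_label
    recall[label] = cm[i][i] / truly_label
    print(f"{label:>8}  precision={precision[label]:.2f}  recall={recall[label]:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The `neutral` column sums to 1966 predictions, of which only 1175 are correct, which is why its precision is the lowest of the three classes.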