## Sentiment Prediction Modeling

### Using Supervised Learning classification models, Logistic Regression and Multinomial Naive Bayes models, to label posts in test set

Having gathered the posts by scraping, a sample of 1000 posts were labelled manually so as to train models to subsequently predict labels on the remainder of the dataset (around 9000 posts). *Textblob and VADER (unsupervised learning text classification models) were initially used to predict the sentiment of the posts, but the results were not accurate*

The labels are: 
- **0 for negative sentiment** 
- **1 for neutral sentiment**
- **2 for positive sentiment**

It should be noted that the labels were highly skewed, with around 80% of the posts classed as neutral, 15% as negative and 5% positive. This affected how well the model was able to predict the sentiment of unlabelled posts. 

The models chosen were classification models - Logistic Regression and Multinomial Naive Bayes, Long Short Term Memory Recurrent Neural Net model and BERT. 

It was found that the Multinomial Naive Bayes model was the most accurate in predicting sentiment. It had high Accuracy, ROC AUC and F1 scores, with little variance between the train and validation sets, as well as being more likely to assign minority classes, compared to the Logistic Regression Model. The performance of LSTM RNN and BERT models paled in comparison as well. For BERT, it could be that the pretrained model did not generalise well on the dataset, due to the nature of the local Singlish language (in terms of different words and sentence structures). The LSTM RNN model also did not do well, as it relied on learning words before and after a significant word, instead of standalone words, which did not work particularly well for this dataset.  

It should be noted that SMOTE was used to try to address the issue of unbalanced classes in the data, all of which did not perform as well as a model where the minority classes were not oversampled in the model. 

The process of assessing a production model for sentiment analysis is as follows:
1. **Train Multinomial NB model on all posts in train set, which have labelled sentiments (THIS NOTEBOOK)** 
2. **Predict label on posts in test set (THIS NOTEBOOK)** 
3. Check accuracy of predictions and make changes to incorrect labels 
4. Collate labelled posts in train and test sets to train final production model 

While I acknowledge that it is not ideal to use a regular supervised learning classification models to predict the sentiment of text posts, this was the best method for this particular dataset. I believe that with more data gathered, a LSTM RNN or BERT model can be sufficiently trained to better predict sentiment of posts from Singaporean forums.  

---
## Importing libraries and collating labelled posts in train set 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, classification_report

In [3]:
train = pd.read_csv('./labelled_posts/train_labelled_clean.csv')
train.head()

Unnamed: 0.1,Unnamed: 0,post,label,date,source,post_clean_for_rnn
0,0,There is no new cluster beside the dorm for tw...,2.0,2020-10-04 08:34:00,hardwarezone,there is no new cluster beside the dorm for tw...
1,1,">For consistency and accuracy, it could be eas...",0.0,2020-04-11 12:26:38,reddit,for consistency and accuracy it could be easie...
2,2,Only IQ lower than 86 will believe this CSB.Wh...,0.0,2020-04-23 10:42:00,hardwarezone,only iq lower than 86 will believe this csb wh...
3,3,I work nearby to the Westlite and Toh Guan Dor...,0.0,2020-04-06 20:55:47,reddit,i work nearby to the westlite and toh guan dor...
4,4,Ho seh liao,0.0,2020-09-04 21:43:00,hardwarezone,ho seh liao


In [4]:
train.shape

(8213, 6)

In [5]:
train.drop(columns = ['Unnamed: 0'], inplace=True)
train.fillna('nopost', inplace=True)

---
## Preprocessing of posts in train data for modeling 

- Tokenising
- Removing stopwords
- Lemmatising


In [6]:
#preprocess train['post']

#dealing with stopwords 

#first cut - removing words from stopwords that indicate sentiment 
remove_words = ["no", "not", "against", "don't", "should", "should've", "couldn", "couldn't",'didn', "didn't",
                   'doesn',"doesn't",'shouldn',"shouldn't",'wasn',"wasn't",'weren',"weren't",'won',"won't",
                   'wouldn',"wouldn't"]
stopwords = [word for word in stopwords.words('english') if word not in remove_words]
print(len(stopwords))

#also adding words that are either common words, singaporean slang or noisy words from forum posts
add_words = ['foreign', 'migrant', 'worker', 'workers', 'fw', 'dorm', 'dorms', 'dormitory', 'dormitories', 'covid', 
             '19', 'cases', 'virus', 'coronavirus', 'gagt', 'ah', 'liao', 'lah', 'trt', 'huawei', 'samsung',
            'xiaomi', 'l21a', '32']
stopwords.extend(add_words)
len(stopwords)

157


181

In [7]:
def preprocess(word):
     
    # tokenize and convert lower 
    # \w also removes punctuation - may need to add extra no punc if tokenizing does not do it
    token = RegexpTokenizer(r'\w+')
    tokens = token.tokenize(word.lower())
    
   #remove stopwords 
    no_stop = [word for word in tokens if word not in stopwords]

    no_stopword = (' '.join(no_stop))
        
    #lemmatize words
    lemmatizer = WordNetLemmatizer()
    lem = [lemmatizer.lemmatize(word) for word in no_stopword]
    
    #return words as a single string 
    return(''.join(lem))

In [8]:
post_clean = []

for p in train["post"]:
    post_clean.append(preprocess(p))

print(f"checking post_clean: \n{post_clean[0:5]}")

train['post_clean_nb_logreg'] = post_clean
train.head(10)

checking post_clean: 
['no new cluster beside two days', 'consistency accuracy could easier use data new moh situation report separates non foreigners citizen pr think count named clusters either linked construction sites live non cluster get categorized linked clusters pending investigations respectively moh situation report 28 3 10 4 ltp holders 545 linked clusters 126 linked clusters 141 pending investigations whereas estimate time period 702 construction related reason make graphs really 2 separate problems singapore circuit breaker slow growth general public stopping work construction sites circuit breaker doesn help construction site problem construction site problem tackled improving living conditions testing separating sick well conversely issues way many people exercising eating hawker centres sneakily meeting etc affect non problem yes testing situation worrisome not know prioritizing tests would political suicide moh admit either prioritizing sc pr first patriotism urgent se

Unnamed: 0,post,label,date,source,post_clean_for_rnn,post_clean_nb_logreg
0,There is no new cluster beside the dorm for tw...,2.0,2020-10-04 08:34:00,hardwarezone,there is no new cluster beside the dorm for tw...,no new cluster beside two days
1,">For consistency and accuracy, it could be eas...",0.0,2020-04-11 12:26:38,reddit,for consistency and accuracy it could be easie...,consistency accuracy could easier use data new...
2,Only IQ lower than 86 will believe this CSB.Wh...,0.0,2020-04-23 10:42:00,hardwarezone,only iq lower than 86 will believe this csb wh...,iq lower 86 believe csb maids pregnant nothing...
3,I work nearby to the Westlite and Toh Guan Dor...,0.0,2020-04-06 20:55:47,reddit,i work nearby to the westlite and toh guan dor...,work nearby westlite toh guan initial reports ...
4,Ho seh liao,0.0,2020-09-04 21:43:00,hardwarezone,ho seh liao,ho seh
5,I’m not saying we caused this spread among the...,0.0,2020-04-16 23:59:35,reddit,i m not saying we caused this spread among the...,not saying caused spread among agree oversight...
6,1. From healthy no wear mask to Mask mandatory...,0.0,2020-04-14 20:29:00,hardwarezone,1 from healthy no wear mask to mask mandatory ...,1 healthy no wear mask mask mandatory 2 many 9...
7,Exactly. People don’t even wanna let our publi...,2.0,2020-05-27 20:28:30,reddit,exactly people don t even wanna let our public...,exactly people even wanna let public servants ...
8,The current situation is beyond this woman. 24...,1.0,2020-06-06 14:40:00,sgtalk,the current situation is beyond this woman 24h...,current situation beyond woman 24hrs day not e...
9,"Iran, followed by China, India, Israel, Saudi ...",0.0,2020-04-23 17:52:00,hardwarezone,iran followed by china india israel saudi arab...,iran followed china india israel saudi arabia ...


---
## Preparing data for modeling

As stated earlier, I had tried oversampling the minority classes (0 and 2) with SMOTE. All of these methods did not perform as well as the model without oversampling. 
- Models with smote had lower accuracy, ROC AUC and f1-scores than models without SMOTE


In [9]:
#features matrix

X = train['post_clean_nb_logreg']
y = train['label']

In [10]:
#baseline 
y.value_counts(normalize=True)

# 0 - negative post
# 1 - neutral post 
# 2 - positive post 

1.0    0.857421
0.0    0.116888
2.0    0.025691
Name: label, dtype: float64

In [31]:
#train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    stratify=y, 
                                                    random_state = 42)

---
## Logistic Regression model

### Logreg model with no SMOTE


In [12]:
#logreg

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

pipe_params = {
    'cvec__max_features': [100, 200, 500],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [0.2, 0.3, 0.7],
    'cvec__ngram_range': [(1,1)],
    'lr__penalty': ['l1'],
    'lr__C': np.logspace(-5, 1, 10)
}

gscv_lr = GridSearchCV(pipe, pipe_params, cv=5, n_jobs =-1, verbose=1)
gscv_lr.fit(X_train, y_train)

Fitting 5 folds for each of 270 candidates, totalling 1350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   45.6s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 1350 out of 1350 | elapsed:  5.4min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('lr',
                                        LogisticRegression(solver='liblinear'))]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.2, 0.3, 0.7],
                         'cvec__max_features': [100, 200, 500],
                         'cvec__min_df': [2, 4, 6],
                         'cvec__ngram_range': [(1, 1)],
                         'lr__C': array([1.00000000e-05, 4.64158883e-05, 2.15443469e-04, 1.00000000e-03,
       4.64158883e-03, 2.15443469e-02, 1.00000000e-01, 4.64158883e-01,
       2.15443469e+00, 1.00000000e+01]),
                         'lr__penalty': ['l1']},
             verbose=1)

In [13]:
gscv_lr.best_params_

{'cvec__max_df': 0.3,
 'cvec__max_features': 500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'lr__C': 0.46415888336127725,
 'lr__penalty': 'l1'}

In [14]:
#fitting model with optimised params 
opt_gscv_lr = gscv_lr.best_estimator_
opt_gscv_lr.fit(X_train, y_train)

Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.3, max_features=500, min_df=2)),
                ('lr',
                 LogisticRegression(C=0.46415888336127725, penalty='l1',
                                    solver='liblinear'))])

In [15]:
#create dataframe of metrics based on optimised model 
opt_results_lr = pd.DataFrame()

opt_results_lr['model'] = ['cvec + logistic regression']
opt_results_lr['optimised_params'] = [gscv_lr.best_params_]
opt_results_lr['train_acc_score'] = opt_gscv_lr.score(X_train, y_train)
opt_results_lr['test_acc_score'] = opt_gscv_lr.score(X_test, y_test)

pred_proba = opt_gscv_lr.predict_proba(X_test)
opt_results_lr['roc_auc_score'] = roc_auc_score(y_test, pred_proba, multi_class="ovo", average = 'weighted')
opt_results_lr['train_f1_score'] = f1_score((opt_gscv_lr.predict(X_train)), y_train, average = 'weighted')
opt_results_lr['test_f1_score'] = f1_score((opt_gscv_lr.predict(X_test)), y_test, average = 'weighted')

display(opt_results_lr)

#confusion matrix for logreg

y_pred_lr = opt_gscv_lr.predict(X_test)
cm = confusion_matrix(y_test, y_pred_lr)
cm_df = pd.DataFrame(cm,
                     index = ['actual 0','actual 1','actual 2'], 
                     columns = ['pred 0','pred 1','pred 2'])
display(cm_df)

print(classification_report(y_test, y_pred_lr, digits=3))

Unnamed: 0,model,optimised_params,train_acc_score,test_acc_score,roc_auc_score,train_f1_score,test_f1_score
0,cvec + logistic regression,"{'cvec__max_df': 0.3, 'cvec__max_features': 50...",0.884779,0.863664,0.714364,0.909458,0.895506


Unnamed: 0,pred 0,pred 1,pred 2
actual 0,39,152,1
actual 1,26,1378,5
actual 2,7,33,2


              precision    recall  f1-score   support

         0.0      0.542     0.203     0.295       192
         1.0      0.882     0.978     0.927      1409
         2.0      0.250     0.048     0.080        42

    accuracy                          0.864      1643
   macro avg      0.558     0.410     0.434      1643
weighted avg      0.826     0.864     0.832      1643



### Logreg model with SMOTE

In [16]:
# Testing logreg model with SMOTE for unbalanced classes - did not perform well!!!

#cvec
cvec = CountVectorizer(max_df=0.2, max_features=500, min_df=6)
X_train = cvec.fit_transform(X_train).toarray()
X_test = cvec.transform(X_test).toarray()

#SMOTE for inbalanced classes 
sm = SMOTE()
X_train, y_train = sm.fit_resample(X_train, y_train)

#fit model 
lr = LogisticRegression(C=0.46415888336127725, penalty='l1', solver='liblinear')
lr.fit(X_train,y_train)

LogisticRegression(C=0.46415888336127725, penalty='l1', solver='liblinear')

In [17]:
model_logreg = pd.DataFrame()

model_logreg['model'] = ['cvec + smote + logreg']
model_logreg['train_acc_score'] = lr.score(X_train, y_train)
model_logreg['test_acc_score'] = lr.score(X_test, y_test)

pred_proba = lr.predict_proba(X_test)
model_logreg['roc_auc_score'] = roc_auc_score(y_test, pred_proba, multi_class="ovo", average = 'weighted')
model_logreg['train_f1_score'] = f1_score((lr.predict(X_train)), y_train, average = 'weighted')
model_logreg['test_f1_score'] = f1_score((lr.predict(X_test)), y_test, average = 'weighted')

print('Performance of cvec + smote + logreg model on train data')
display(model_logreg)

cm = confusion_matrix(y_test, lr.predict(X_test))
cm_df = pd.DataFrame(cm,
                     index = ['actual 0','actual 1','actual 2'], 
                     columns = ['pred 0','pred 1','pred 2'])
display(cm_df)
print(classification_report(y_test, lr.predict(X_test), digits=3))

Performance of cvec + smote + logreg model on train data


Unnamed: 0,model,train_acc_score,test_acc_score,roc_auc_score,train_f1_score,test_f1_score
0,cvec + smote + logreg,0.692112,0.580645,0.570307,0.69884,0.502009


Unnamed: 0,pred 0,pred 1,pred 2
actual 0,94,72,26
actual 1,270,851,288
actual 2,7,26,9


              precision    recall  f1-score   support

         0.0      0.253     0.490     0.334       192
         1.0      0.897     0.604     0.722      1409
         2.0      0.028     0.214     0.049        42

    accuracy                          0.581      1643
   macro avg      0.393     0.436     0.368      1643
weighted avg      0.799     0.581     0.659      1643



---
## Multinomial Naive Bayes model

### Multinomial NB without SMOTE

In [32]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])


pipe_params = {
    'cvec__max_features': [200, 300, 500],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [0.2, 0.3, 0.7],
    'cvec__ngram_range': [(1,1)]}


gscv_nb = GridSearchCV(pipe, pipe_params, cv=5, n_jobs =-1, verbose=1)
gscv_nb.fit(X_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:   40.1s finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('nb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.2, 0.3, 0.7],
                         'cvec__max_features': [200, 300, 500],
                         'cvec__min_df': [2, 4, 6],
                         'cvec__ngram_range': [(1, 1)]},
             verbose=1)

In [33]:
gscv_nb.best_params_

{'cvec__max_df': 0.3,
 'cvec__max_features': 300,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1)}

In [34]:
#fitting model with optimised params 
opt_gscv_nb = gscv_nb.best_estimator_
opt_gscv_nb.fit(X_train, y_train)

Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.3, max_features=300, min_df=2)),
                ('nb', MultinomialNB())])

In [35]:
#create dataframe of metrics based on optimised model 
opt_results_nb = pd.DataFrame()

opt_results_nb['model'] = ['tvec + multinomial nb']
opt_results_nb['optimised_params'] = [gscv_nb.best_params_]
opt_results_nb['train_acc_score'] = opt_gscv_nb.score(X_train, y_train)
opt_results_nb['test_acc_score'] = opt_gscv_nb.score(X_test, y_test)

pred_proba = opt_gscv_nb.predict_proba(X_test)
opt_results_nb['roc_auc_score'] = roc_auc_score(y_test, pred_proba, multi_class="ovo", average = 'weighted')
opt_results_nb['train_f1_score'] = f1_score((opt_gscv_nb.predict(X_train)), y_train, average = 'weighted')
opt_results_nb['test_f1_score'] = f1_score((opt_gscv_nb.predict(X_test)), y_test, average = 'weighted')

display(opt_results_nb)

#confusion matrix and classification report for multinomial nb 

y_pred_nb = opt_gscv_nb.predict(X_test)
cm = confusion_matrix(y_test, y_pred_nb)
cm_df = pd.DataFrame(cm,
                     index = ['actual 0','actual 1','actual 2'], 
                     columns = ['pred 0','pred 1','pred 2'])
display(cm_df)
print('')
print(classification_report(y_test, y_pred_nb, digits=3))

Unnamed: 0,model,optimised_params,train_acc_score,test_acc_score,roc_auc_score,train_f1_score,test_f1_score
0,tvec + multinomial nb,"{'cvec__max_df': 0.3, 'cvec__max_features': 30...",0.862709,0.846622,0.697404,0.874432,0.858131


Unnamed: 0,pred 0,pred 1,pred 2
actual 0,63,121,8
actual 1,77,1322,10
actual 2,7,29,6



              precision    recall  f1-score   support

         0.0      0.429     0.328     0.372       192
         1.0      0.898     0.938     0.918      1409
         2.0      0.250     0.143     0.182        42

    accuracy                          0.847      1643
   macro avg      0.526     0.470     0.490      1643
weighted avg      0.827     0.847     0.835      1643



### Multinomial NB with SMOTE

In [36]:
# Testing multinomial naive bayes model with SMOTE for unbalanced classes 

#cvec
cvec = CountVectorizer(max_df=0.3, max_features=300, min_df=2)
X_train = cvec.fit_transform(X_train).toarray()
X_test = cvec.transform(X_test).toarray()

#SMOTE for inbalanced classes 
sm = SMOTE()
X_train, y_train = sm.fit_resample(X_train, y_train)

#fit model 
nb = MultinomialNB()
nb.fit(X_train,y_train)

MultinomialNB()

In [37]:
#eval model performance 
model_nb = pd.DataFrame()

model_nb['model'] = ['cvec + smote + nb']
model_nb['train_acc_score'] = nb.score(X_train, y_train)
model_nb['test_acc_score'] = nb.score(X_test, y_test)

pred_proba = nb.predict_proba(X_test)
model_nb['roc_auc_score'] = roc_auc_score(y_test, pred_proba, multi_class="ovo", average = 'weighted')
model_nb['train_f1_score'] = f1_score((nb.predict(X_train)), y_train, average = 'weighted')
model_nb['test_f1_score'] = f1_score((nb.predict(X_test)), y_test, average = 'weighted')

print('Performance of cvec + smote + multinomial nb model on train data')
display(model_nb)

cm = confusion_matrix(y_test, nb.predict(X_test))
cm_df = pd.DataFrame(cm,
                     index = ['actual 0','actual 1','actual 2'], 
                     columns = ['pred 0','pred 1','pred 2'])
display(cm_df)
print(classification_report(y_test, y_pred_nb, digits=3))

Performance of cvec + smote + multinomial nb model on train data


Unnamed: 0,model,train_acc_score,test_acc_score,roc_auc_score,train_f1_score,test_f1_score
0,cvec + smote + nb,0.564294,0.549604,0.592483,0.563384,0.468342


Unnamed: 0,pred 0,pred 1,pred 2
actual 0,100,60,32
actual 1,406,796,207
actual 2,12,23,7


              precision    recall  f1-score   support

         0.0      0.429     0.328     0.372       192
         1.0      0.898     0.938     0.918      1409
         2.0      0.250     0.143     0.182        42

    accuracy                          0.847      1643
   macro avg      0.526     0.470     0.490      1643
weighted avg      0.827     0.847     0.835      1643



In [38]:
pred_proba[50:60]

array([[1.18869235e-02, 9.87847390e-01, 2.65686313e-04],
       [6.38742299e-02, 7.64853133e-01, 1.71272637e-01],
       [1.21534181e-02, 9.87004193e-01, 8.42388808e-04],
       [3.33333333e-01, 3.33333333e-01, 3.33333333e-01],
       [3.33333333e-01, 3.33333333e-01, 3.33333333e-01],
       [8.14507165e-01, 1.63370896e-01, 2.21219387e-02],
       [5.16419272e-01, 4.50330594e-01, 3.32501349e-02],
       [5.62692199e-01, 2.88031493e-03, 4.34427486e-01],
       [6.38847606e-01, 3.60259118e-01, 8.93275734e-04],
       [5.70538256e-04, 9.99372809e-01, 5.66531570e-05]])

In [39]:
nb.predict(X_test)[50:70]

array([1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 2., 0., 0., 1., 1.,
       1., 0., 2.])

In [40]:
#eval model performance WITH THRESHOLD ADJUSTMENT - no go!!!!

model_nb = pd.DataFrame()

model_nb['model'] = ['cvec + smote + nb']
model_nb['train_acc_score'] = nb.score(X_train, y_train)
model_nb['test_acc_score'] = nb.score(X_test, y_test)

pred_proba = nb.predict_proba(X_test)
model_nb['roc_auc_score'] = roc_auc_score(y_test, pred_proba, multi_class="ovo", average = 'weighted')
model_nb['train_f1_score'] = f1_score((nb.predict(X_train)), y_train, average = 'weighted')
model_nb['test_f1_score'] = f1_score((nb.predict(X_test)), y_test, average = 'weighted')

print('Performance of cvec + smote + multinomial model on train data with threshold adjustment')
display(model_nb)


THRESHOLD = .05

preds = np.where(nb.predict_proba(X_test)[:,1] > THRESHOLD, 1, 0)

cm = confusion_matrix(y_test, preds)
cm_df = pd.DataFrame(cm,
                     index = ['actual 0','actual 1','actual 2'], 
                     columns = ['pred 0','pred 1','pred 2'])
display(cm_df)
print(classification_report(y_test, preds, digits=3))

Performance of cvec + smote + multinomial model on train data with threshold adjustment


Unnamed: 0,model,train_acc_score,test_acc_score,roc_auc_score,train_f1_score,test_f1_score
0,cvec + smote + nb,0.564294,0.549604,0.592483,0.563384,0.468342


Unnamed: 0,pred 0,pred 1,pred 2
actual 0,42,150,0
actual 1,46,1363,0
actual 2,6,36,0


              precision    recall  f1-score   support

         0.0      0.447     0.219     0.294       192
         1.0      0.880     0.967     0.922      1409
         2.0      0.000     0.000     0.000        42

    accuracy                          0.855      1643
   macro avg      0.442     0.395     0.405      1643
weighted avg      0.807     0.855     0.825      1643



  _warn_prf(average, modifier, msg_start, len(result))


---
## Predicting sentiment of posts in test set 

In [42]:
#import unlabelled data 
test = pd.read_csv('./unlabelled_posts/test_posts.csv')
test.head()

Unnamed: 0,post,date,source,post_clean
0,Need more camp to house all foreign worker,2020-04-15 15:09:00,hardwarezone,need camp house
1,Im interested in the clusters esp new ones. Wh...,2020-04-29 23:07:00,hardwarezone,im interested clusters esp new ones see
2,did the virus make his kkj buay kia?,2020-04-23 21:22:00,hardwarezone,make kkj buay kia
3,I’m not saying there isn’t room for improvemen...,2020-04-08 11:23:23,reddit,not saying room improvement point choice nsfs ...
4,abnn good life no need work get paid and free ...,2020-04-05 13:04:00,hardwarezone,abnn good life no need work get paid free food...


In [43]:
#preprocessing test data 

test.isnull().sum()
test['post_clean'].fillna('nopost', inplace=True)

test_posts = test['post_clean']

In [44]:
#predict test with nb model without smote 
test_pred_nb = opt_gscv_nb.predict(test_posts)
test['preds_nb'] = test_pred_nb
test['preds_nb'].value_counts()

1.0    1855
0.0     167
2.0      33
Name: preds_nb, dtype: int64

In [45]:
#predict test with lr model without smote 
test_pred_lr = opt_gscv_lr.predict(test_posts)
test['preds_lr'] = test_pred_lr
test['preds_lr'].value_counts()

1.0    1946
0.0      99
2.0      10
Name: preds_lr, dtype: int64

In [46]:
#fit count vectorizer with optimised params 
# cvec = CountVectorizer(max_df=0.2, max_features=100, min_df=1)
cvec_test = cvec.transform(test_posts)

In [50]:
#predict test with nb model with smote 
test_pred_nb_smote = nb.predict(cvec_test)
test['preds_nb_smote'] = test_pred_nb_smote
test['preds_nb_smote'].value_counts()

1.0    1173
0.0     603
2.0     279
Name: preds_nb_smote, dtype: int64

In [52]:
test.to_csv('./labelled_posts/test_nb_lr_labelled.csv')