## Using supervised learning models (Multinomial Naive Bayes and Logistic Regression) to label sentiment of posts

After 1000 posts from the train set were labelled manually, they were used to train both Multinomial Naive Bayes and Logistic Regression models. The trained models were then used to label the remaining 7000 posts in the train set incrementally, 1000 posts at a time.

Process:
1. Manually label 1000 posts
2. Split the labelled posts into train and validation sets
3. Train the models
4. Evaluate the models on the validation set and keep the best performer
5. Use that model to predict labels for the next 1000 posts
6. Manually check the predictions and correct any wrong labels
7. Combine all labelled posts
8. Repeat steps 2-7 until all posts in the train set are labelled
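
The loop in steps 2-8 can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: `train_model`, `review_predictions`, and `incremental_label` are hypothetical names, and the "model" is a trivial majority-class predictor so that the example runs without any dependencies.

```python
from collections import Counter

def train_model(labelled):
    """Hypothetical stand-in for steps 2-4: returns a majority-class predictor."""
    majority = Counter(label for _, label in labelled).most_common(1)[0][0]
    return lambda post: majority

def review_predictions(posts, preds):
    """Hypothetical stand-in for step 6: a human would correct labels here."""
    return list(zip(posts, preds))

def incremental_label(labelled, unlabelled, batch_size=1000):
    """Grow the labelled pool one batch at a time (steps 2-8)."""
    labelled = list(labelled)
    remaining = list(unlabelled)
    while remaining:
        model = train_model(labelled)                      # steps 2-4
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        preds = [model(post) for post in batch]            # step 5
        labelled.extend(review_predictions(batch, preds))  # steps 6-7
    return labelled

seed = [("post a", 1), ("post b", 1), ("post c", 0)]
pool = incremental_label(seed, [f"post {i}" for i in range(5)], batch_size=2)
print(len(pool))
```

Each pass retrains on everything labelled so far, so corrections made in step 6 feed into the next batch's model.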

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

In [4]:
sample_posts = pd.read_csv('./labelled_posts/train_posts_sample_1000_manual_label.clean.csv')
sample_posts.head()

Unnamed: 0.1,Unnamed: 0,post,manual sentiment,date,source,post_clean
0,0,">For consistency and accuracy, it could be eas...",0,2020-04-11 12:26:38,reddit,consistency accuracy could easier use data new...
1,1,Only IQ lower than 86 will believe this CSB.Wh...,0,2020-04-23 10:42:00,hardwarezone,iq lower 86 believe csb maids pregnant nothing...
2,2,I work nearby to the Westlite and Toh Guan Dor...,0,2020-04-06 20:55:47,reddit,work nearby westlite toh guan initial reports ...
3,3,Ho seh liao,0,2020-09-04 21:43:00,hardwarezone,ho seh
4,4,I’m not saying we caused this spread among the...,0,2020-04-16 23:59:35,reddit,not saying caused spread among agree oversight...


In [5]:
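#drop the leftover index column from an earlier CSV export
sample_posts.drop(columns = ['Unnamed: 0'], inplace=True)
#replace missing values (e.g. posts left empty by cleaning) with a placeholder token
sample_posts.fillna('nopost', inplace=True)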

In [6]:
#features (cleaned post text) and target (manual sentiment label)

X = sample_posts['post_clean']
y = sample_posts['manual sentiment']

In [7]:
#baseline 
y.value_counts(normalize=True)

# 0 - negative post
# 1 - neutral post 
# 2 - positive post 

#classes are heavily imbalanced: most posts are neutral, few are positive
#possible remedy to try: oversample the minority classes (e.g. SMOTE)

1    0.738
0    0.205
2    0.057
Name: manual sentiment, dtype: float64
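
Because 73.8% of the labelled posts are neutral, a model that always predicts class 1 already reaches 0.738 accuracy, so accuracy alone says little here. One lightweight alternative to SMOTE is class re-weighting. The sketch below recovers the class counts from the proportions above (assuming exactly 1000 labelled posts) and applies the `n_samples / (n_classes * count)` heuristic that scikit-learn uses for `class_weight='balanced'`. Whether re-weighting actually helps on this data is untested.

```python
# Class counts implied by the proportions above, assuming exactly 1000 labelled posts.
counts = {0: 205, 1: 738, 2: 57}
n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of always predicting the most common class.
baseline_accuracy = max(counts.values()) / n_samples

# "Balanced" heuristic: rarer classes get proportionally larger weights.
balanced_weights = {label: n_samples / (n_classes * count)
                    for label, count in counts.items()}

print(f"baseline accuracy: {baseline_accuracy:.3f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```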

In [8]:
#stratified 80/20 split; the held-out 20% serves as the validation set
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    stratify=y, 
                                                    random_state = 42)
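
With `stratify=y`, the held-out set keeps roughly the same class mix as the full sample. A quick back-of-the-envelope check, assuming exactly 1000 labelled posts and simple rounding (scikit-learn's exact allocation may differ by a post or so in general):

```python
# Class counts implied by the baseline proportions, assuming 1000 labelled posts.
counts = {0: 205, 1: 738, 2: 57}
test_size = 0.2

# Expected number of posts per class in the held-out 20%.
expected_test = {label: round(count * test_size) for label, count in counts.items()}
# Whatever is left is available for training and cross-validation.
expected_train = {label: counts[label] - expected_test[label] for label in counts}

print(expected_test)
print(expected_train)
```

These expected counts line up with the `support` column of the classification reports further down. They also show how thin the positive class is: only about 46 positive posts are left for training, or roughly 9 per fold under 5-fold cross-validation.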

In [9]:
#tf-idf + logistic regression, tuned by grid search

pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

pipe_params = {
    'tvec__max_features': [100, 200, 500],
    'tvec__min_df': [2, 4, 6],
    'tvec__max_df': [0.2, 0.3, 0.7],
    'tvec__ngram_range': [(1,1)],
    'lr__penalty': ['l1', 'l2'],
    'lr__C': np.logspace(-5, 1, 10)
}

gscv_lr = GridSearchCV(pipe, pipe_params, cv=5, n_jobs =-1, verbose=1)
gscv_lr.fit(X_train, y_train)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   35.8s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 2700 out of 2700 | elapsed:  4.0min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('lr',
                                        LogisticRegression(solver='liblinear'))]),
             n_jobs=-1,
             param_grid={'lr__C': array([1.00000000e-05, 4.64158883e-05, 2.15443469e-04, 1.00000000e-03,
       4.64158883e-03, 2.15443469e-02, 1.00000000e-01, 4.64158883e-01,
       2.15443469e+00, 1.00000000e+01]),
                         'lr__penalty': ['l1', 'l2'],
                         'tvec__max_df': [0.2, 0.3, 0.7],
                         'tvec__max_features': [100, 200, 500],
                         'tvec__min_df': [2, 4, 6],
                         'tvec__ngram_range': [(1, 1)]},
             verbose=1)
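
The 540 candidates and 2700 fits reported in the log follow directly from the grid: the search tries every combination of the listed values and fits each one once per fold. A small sketch of the arithmetic, with the sizes read off `pipe_params` above:

```python
from math import prod

# Number of values tried per hyperparameter in pipe_params.
grid_sizes = {
    "tvec__max_features": 3,
    "tvec__min_df": 3,
    "tvec__max_df": 3,
    "tvec__ngram_range": 1,
    "lr__penalty": 2,
    "lr__C": 10,   # np.logspace(-5, 1, 10) yields 10 values
}

n_candidates = prod(grid_sizes.values())
n_fits = n_candidates * 5   # cv=5 folds per candidate

print(n_candidates, n_fits)
```

Adding a second n-gram range would double both numbers, which is worth keeping in mind given the 4-minute runtime above.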

In [10]:
gscv_lr.best_params_

{'lr__C': 10.0,
 'lr__penalty': 'l2',
 'tvec__max_df': 0.3,
 'tvec__max_features': 500,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 1)}

In [11]:
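#best_estimator_ has already been refit on X_train with the best params
#(GridSearchCV uses refit=True by default), so the explicit fit below is redundant but harmless
opt_gscv_lr = gscv_lr.best_estimator_
opt_gscv_lr.fit(X_train, y_train)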

Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.3, max_features=500, min_df=2)),
                ('lr', LogisticRegression(C=10.0, solver='liblinear'))])

In [12]:
#create dataframe of metrics based on optimised model 
opt_results_lr = pd.DataFrame()

opt_results_lr['model'] = ['tvec + logistic regression']
opt_results_lr['optimised_params'] = [gscv_lr.best_params_]
opt_results_lr['train_score'] = opt_gscv_lr.score(X_train, y_train)
opt_results_lr['test_score'] = opt_gscv_lr.score(X_test, y_test)

opt_results_lr

Unnamed: 0,model,optimised_params,train_score,test_score
0,tvec + logistic regression,"{'lr__C': 10.0, 'lr__penalty': 'l2', 'tvec__ma...",0.935,0.775


In [13]:
from sklearn.metrics import multilabel_confusion_matrix, classification_report

In [14]:
#confusion matrix for logreg

y_pred_lr = opt_gscv_lr.predict(X_test)
cm = confusion_matrix(y_test, y_pred_lr)
cm_df = pd.DataFrame(cm,
                     index = [0,1,2], 
                     columns = [0,1,2])
display(cm_df)
display(multilabel_confusion_matrix(y_test, y_pred_lr))

print(classification_report(y_test, y_pred_lr, digits=3))

Unnamed: 0,0,1,2
0,11,30,0
1,4,144,0
2,1,10,0


array([[[154,   5],
        [ 30,  11]],

       [[ 12,  40],
        [  4, 144]],

       [[189,   0],
        [ 11,   0]]])

              precision    recall  f1-score   support

           0      0.688     0.268     0.386        41
           1      0.783     0.973     0.867       148
           2      0.000     0.000     0.000        11

    accuracy                          0.775       200
   macro avg      0.490     0.414     0.418       200
weighted avg      0.720     0.775     0.721       200
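
The per-class figures in the report can be recomputed by hand from the confusion matrix, which makes it easier to see why class 2 scores zero: the model never predicts it, so there are no true positives and precision is undefined (reported as 0). A minimal sketch using the logistic regression matrix above; `per_class_metrics` is an illustrative helper, not part of scikit-learn.

```python
# Logistic regression confusion matrix from above (rows = true, columns = predicted).
cm = [
    [11, 30, 0],
    [4, 144, 0],
    [1, 10, 0],
]

def per_class_metrics(cm, k):
    """Precision, recall, and F1 for class k, treating every other class as negative."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, actually another class
    fn = sum(cm[k]) - tp                  # actually k, predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0   # no predictions of k: report 0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

metrics = [per_class_metrics(cm, k) for k in range(3)]
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)

for k, (p, r, f1) in enumerate(metrics):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro f1: {macro_f1:.3f}")
```

The macro average weights all three classes equally, which is why it sits far below the 0.775 accuracy.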





In [15]:
y_prob_lr = opt_gscv_lr.predict_proba(X_test)

macro_roc_auc_ovo = roc_auc_score(y_test, y_prob_lr, multi_class="ovo",
                                  average="macro")
weighted_roc_auc_ovo = roc_auc_score(y_test, y_prob_lr, multi_class="ovo",
                                     average="weighted")

print(f'OVO ROC AUC scores: {macro_roc_auc_ovo:.4f}(macro), {weighted_roc_auc_ovo:.4f}(weighted by prevalence)')

OVO ROC AUC scores: 0.6221(macro), 0.6385(weighted by prevalence)
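
For intuition, the one-vs-one (OVO) macro score averages a binary AUC over every pair of classes, looking only at the posts that belong to that pair. The sketch below is a dependency-free illustration of that scheme on made-up toy probabilities (not the notebook's data); `binary_auc` and `macro_ovo_auc` are hypothetical helpers, and class labels are assumed to be 0..K-1 so that they double as column indices.

```python
from itertools import combinations

def binary_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovo_auc(y_true, y_prob):
    """Unweighted mean of pairwise AUCs over all class pairs."""
    classes = sorted(set(y_true))
    pair_scores = []
    for a, b in combinations(classes, 2):
        # Keep only the samples whose true class is a or b.
        rows = [(y, p) for y, p in zip(y_true, y_prob) if y in (a, b)]
        auc_a = binary_auc([y == a for y, _ in rows], [p[a] for _, p in rows])
        auc_b = binary_auc([y == b for y, _ in rows], [p[b] for _, p in rows])
        pair_scores.append((auc_a + auc_b) / 2)
    return sum(pair_scores) / len(pair_scores)

# Toy example: 6 samples, 3 classes, one row of class probabilities per sample.
y_true = [0, 0, 1, 1, 2, 2]
y_prob = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.5, 0.1],
]
toy_auc = macro_ovo_auc(y_true, y_prob)
print(f"{toy_auc:.4f}")
```

A score of 0.5 corresponds to random ranking, so the 0.62 above indicates only modest separation between classes.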


In [16]:
#count vectorizer + multinomial naive bayes, tuned by grid search
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])


pipe_params = {
    'cvec__max_features': [100, 200, 500],
    'cvec__min_df': [2, 4, 6],
    'cvec__max_df': [0.2, 0.3, 0.7],
    'cvec__ngram_range': [(1,1)]}


gscv_nb = GridSearchCV(pipe, pipe_params, cv=5, n_jobs =-1, verbose=1)
gscv_nb.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done 128 out of 135 | elapsed:    8.0s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:    8.5s finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('nb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.2, 0.3, 0.7],
                         'cvec__max_features': [100, 200, 500],
                         'cvec__min_df': [2, 4, 6],
                         'cvec__ngram_range': [(1, 1)]},
             verbose=1)

In [17]:
gscv_nb.best_params_

{'cvec__max_df': 0.3,
 'cvec__max_features': 100,
 'cvec__min_df': 6,
 'cvec__ngram_range': (1, 1)}

In [18]:
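#best_estimator_ has already been refit on X_train with the best params
#(GridSearchCV uses refit=True by default), so the explicit fit below is redundant but harmless
opt_gscv_nb = gscv_nb.best_estimator_
opt_gscv_nb.fit(X_train, y_train)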

Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.3, max_features=100, min_df=6)),
                ('nb', MultinomialNB())])

In [24]:
#create dataframe of metrics based on optimised model 
opt_results_nb = pd.DataFrame()

opt_results_nb['model'] = ['cvec + multinomial nb']
opt_results_nb['optimised_params'] = [gscv_nb.best_params_]
opt_results_nb['train_score'] = opt_gscv_nb.score(X_train, y_train)
opt_results_nb['test_score'] = opt_gscv_nb.score(X_test, y_test)

opt_results_nb

Unnamed: 0,model,optimised_params,train_score,test_score
0,cvec + multinomial nb,"{'cvec__max_df': 0.3, 'cvec__max_features': 10...",0.78125,0.775


In [20]:
#confusion matrix for multinomial nb 

y_pred_nb = opt_gscv_nb.predict(X_test)
cm = confusion_matrix(y_test, y_pred_nb)
cm_df = pd.DataFrame(cm,
                     index = [0,1,2], 
                     columns = [0,1,2])
cm_df

Unnamed: 0,0,1,2
0,11,30,0
1,3,144,1
2,3,8,0


In [21]:
multilabel_confusion_matrix(y_test, y_pred_nb)

array([[[153,   6],
        [ 30,  11]],

       [[ 14,  38],
        [  4, 144]],

       [[188,   1],
        [ 11,   0]]])

In [22]:
print(classification_report(y_test, y_pred_nb, digits=3))

              precision    recall  f1-score   support

           0      0.647     0.268     0.379        41
           1      0.791     0.973     0.873       148
           2      0.000     0.000     0.000        11

    accuracy                          0.775       200
   macro avg      0.479     0.414     0.417       200
weighted avg      0.718     0.775     0.724       200



In [23]:
y_prob_nb = opt_gscv_nb.predict_proba(X_test)

macro_roc_auc_ovo = roc_auc_score(y_test, y_prob_nb, multi_class="ovo",
                                  average="macro")
weighted_roc_auc_ovo = roc_auc_score(y_test, y_prob_nb, multi_class="ovo",
                                     average="weighted")

print(f'OVO ROC AUC scores: {macro_roc_auc_ovo:.4f}(macro), {weighted_roc_auc_ovo:.4f}(weighted by prevalence)')

OVO ROC AUC scores: 0.5812(macro), 0.6026(weighted by prevalence)


In [30]:
#refit the multinomial NB pipeline on all 1000 labelled posts before predicting unlabelled data

opt_gscv_nb.fit(X,y)

Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.3, max_features=100, min_df=6)),
                ('nb', MultinomialNB())])

In [72]:
#import unlabelled data 
test = pd.read_csv('./unlabelled_posts/2000_2999_unlabelled.csv')
test.head()

Unnamed: 0,post,manual label,date,source,post_clean
0,Normal day to day probably but we're talking a...,,2020-04-09 22:12:56,reddit,normal day day probably talking food provided ...
1,Maybe they are not the main reason. But it's s...,,2020-05-08 09:27:47,reddit,maybe not main reason still big reason
2,Quite f up but there's really nothing much nor...,,2020-06-04 23:53:00,hardwarezone,quite f really nothing much normal citizens li...
3,Not everyone. Only Singapore Citizens and PR.\...,,2020-04-05 02:31:02,reddit,not everyone singapore citizens pr cna officia...
4,"Maybe too many liao, tmr might be a big spike ...",,2020-05-04 19:34:00,hardwarezone,maybe many tmr might big spike number no point...


In [66]:
#preprocessing unlabelled data

test.isnull().sum()
#assign the result back instead of calling inplace on a column selection
test['post_clean'] = test['post_clean'].fillna('nopost')

test_posts = test['post_clean']

In [67]:
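#no separate vectorizer is needed here: opt_gscv_nb is a pipeline whose own CountVectorizer
#step (fitted on the labelled posts) transforms raw text at predict time. Fitting a new
#CountVectorizer on the unlabelled data would learn a different vocabulary and overwrite `test`.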

In [68]:
test_pred = opt_gscv_nb.predict(test_posts)
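
The fitted pipeline takes raw text directly: its own `CountVectorizer` step converts strings into counts using the vocabulary learned during training. The toy sketch below (made-up posts and labels, assuming scikit-learn is installed) illustrates this, and also shows why refitting a vectorizer on new data does not work: the learned vocabulary differs, so the columns would no longer match what the classifier was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data for illustration only.
train_posts = ["bad terrible news", "good great news",
               "terrible bad day", "great good day"]
train_labels = [0, 2, 0, 2]

toy_pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])
toy_pipe.fit(train_posts, train_labels)

# Raw strings in, labels out: the pipeline vectorizes internally.
new_posts = ["terrible terrible outcome", "great outcome"]
toy_preds = toy_pipe.predict(new_posts)

# A vectorizer refit on the new posts learns a different vocabulary.
train_vocab = set(toy_pipe.named_steps["cvec"].vocabulary_)
refit_vocab = set(CountVectorizer().fit(new_posts).vocabulary_)

print(list(toy_preds))
print(train_vocab == refit_vocab)
```

Words the pipeline has never seen, such as "outcome" here, are simply ignored at prediction time.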

In [73]:
test['preds'] = test_pred

In [74]:
test.head()

Unnamed: 0,post,manual label,date,source,post_clean,preds
0,Normal day to day probably but we're talking a...,,2020-04-09 22:12:56,reddit,normal day day probably talking food provided ...,1
1,Maybe they are not the main reason. But it's s...,,2020-05-08 09:27:47,reddit,maybe not main reason still big reason,1
2,Quite f up but there's really nothing much nor...,,2020-06-04 23:53:00,hardwarezone,quite f really nothing much normal citizens li...,0
3,Not everyone. Only Singapore Citizens and PR.\...,,2020-04-05 02:31:02,reddit,not everyone singapore citizens pr cna officia...,1
4,"Maybe too many liao, tmr might be a big spike ...",,2020-05-04 19:34:00,hardwarezone,maybe many tmr might big spike number no point...,1


In [75]:
#index=False avoids writing an extra unnamed index column to the CSV
test.to_csv('./labelled_posts/2000_2999_nb_labelled.csv', index=False)