This is an assignment on detecting unhealthy conversations. The assignment description is at
https://ml.auc-computing.nl/assignment3.html. The repository I used for the assignment is https://github.com/conversationai/unhealthy-conversations. My model uses three files from it: train.csv, test.csv, and val.csv.

# Import libraries

In [62]:
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier, VotingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action="ignore")

In [77]:
data_dir = Path("corpus/")
train_data = pd.read_csv(data_dir / "train.csv")
test_data = pd.read_csv(data_dir / "test.csv")
val_data = pd.read_csv(data_dir / "val.csv")

I leave the 'trusted judgements', 'unit id', and 'comment' columns out of columns_use, and use the 'healthy' column as target_column.

In [78]:

columns_use = ['antagonise','antagonise:confidence','condescending','condescending:confidence',
              'dismissive','dismissive:confidence','generalisation','generalisation:confidence',
              'generalisation_unfair', 'generalisation_unfair:confidence','hostile',
               'hostile:confidence', 'sarcastic','sarcastic:confidence']
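target_column = 'healthy'  # a single column name, so each y is a 1-D Series

# .copy() gives independent frames that preprocess() can modify safely
X_train = train_data[columns_use].copy()
y_train = train_data[target_column]

X_test = test_data[columns_use].copy()
y_test = test_data[target_column]

X_val = val_data[columns_use].copy()
y_val = val_data[target_column]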

Create a function that combines each sub-attribute with its confidence score, and apply it to X_train, X_test, and X_val. The idea is to replace each binary attribute with the confidence that the attribute is true. If the attribute is 1, the new value is the confidence score itself; if the attribute is 0, the new value is 1 - score. Both cases are covered by a single expression, attribute*score + (1-attribute)*(1-score), which gives 1*score + (1-1)*(1-score) = score when the attribute is 1, and 0*score + (1-0)*(1-score) = 1-score when the attribute is 0. After the attribute columns have been overwritten with these values, we drop the confidence columns.

In [79]:
attributes = ['antagonise', 'condescending', 'dismissive', 'generalisation', 
             'generalisation_unfair', 'hostile', 'sarcastic']
confidence = ['antagonise:confidence','condescending:confidence', 
                         'dismissive:confidence', 'generalisation:confidence', 
                         'generalisation_unfair:confidence', 'hostile:confidence', 
                         'sarcastic:confidence']
    
def preprocess(X):
    # Overwrite each binary attribute with the confidence that it is present (modifies X in place)
    for attr, conf in zip(attributes, confidence):
        X[attr] = X[attr] * X[conf] + (1 - X[attr]) * (1 - X[conf])
        
preprocess(X_train)
X_train = X_train.drop(confidence, axis=1)

preprocess(X_test)
X_test = X_test.drop(confidence, axis=1)

preprocess(X_val)
X_val = X_val.drop(confidence, axis=1)
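
To make the formula concrete, here is a minimal sketch on a toy frame. The rows and the `hostile_score` column name are made up for illustration; only the expression itself comes from `preprocess` above.

```python
import pandas as pd

# Hypothetical toy rows: a binary label and the annotators' confidence in that label
toy = pd.DataFrame({
    "hostile": [1, 0, 1, 0],
    "hostile:confidence": [0.9, 0.9, 0.6, 0.6],
})

label = toy["hostile"]
conf = toy["hostile:confidence"]

# Same expression as in preprocess(): confidence that the attribute is present
toy["hostile_score"] = label * conf + (1 - label) * (1 - conf)

print(toy["hostile_score"].round(2).tolist())
```

A label of 1 keeps its confidence, while a label of 0 is mapped to 1 minus its confidence, so a confidently absent attribute ends up close to 0.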


Create a scoring function that prints the accuracy, precision, recall, and F1 score together with the confusion matrix.

In [35]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        print("Train Result:\n===========================================")
        print(f"accuracy score: {accuracy_score(y_train, pred):.4f}\n")
        print(f"Classification Report: \n \tPrecision: {precision_score(y_train, pred)}\n\tRecall Score: {recall_score(y_train, pred)}\n\tF1 score: {f1_score(y_train, pred)}\n")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")

    else:
        pred = clf.predict(X_test)
        print("Test Result:\n===========================================")        
        print(f"accuracy score: {accuracy_score(y_test, pred)}\n")
        print(f"Classification Report: \n \tPrecision: {precision_score(y_test, pred)}\n\tRecall Score: {recall_score(y_test, pred)}\n\tF1 score: {f1_score(y_test, pred)}\n")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

# Train some classifiers and aggregate them in voting classifier


I will train several classifiers, tune their hyper-parameters, and then combine them in a voting classifier. The first model is logistic regression.

In [18]:
log_model = LogisticRegression(max_iter=1000000)
log_model.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

The next model I will build is a **random forest**. A random forest is an ensemble model in itself, since it combines a collection of decision trees into a more accurate model. I will then use a grid search over n_estimators to find the optimal number of trees.

In [19]:
rf = RandomForestClassifier()

# create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}

# use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)

rf_gs.fit(X_train, y_train)

# save best model
rf_best = rf_gs.best_estimator_

# check best n_estimators value
print(rf_gs.best_params_)


{'n_estimators': 200}
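
`best_params_` only reports the winner. As a rough sketch of how to see how close the other candidates were, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a DataFrame. The data below is synthetic (`make_classification`) and the small grid is chosen only to keep the example fast; neither reflects the assignment data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 7 features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

gs = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50]},
    cv=3,
)
gs.fit(X_demo, y_demo)

# One row per candidate, with its mean and spread across the folds
results = pd.DataFrame(gs.cv_results_)[
    ["param_n_estimators", "mean_test_score", "std_test_score"]
]
print(results)
print(gs.best_params_)
```

If the gap between two candidates is smaller than their `std_test_score`, the cheaper setting is probably just as good.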


The final model I will build is an **SVM**. I will use GridSearchCV to find the optimal hyper-parameters here as well. The search selects C=10, gamma=1, and kernel='rbf'.

In [20]:
sv = SVC()

# defining parameter range 
param_sv = {'C': [0.1, 1, 10],  
              'gamma': [1, 0.1, 0.01], 
              'kernel': ['rbf']}  
  
#use gridsearch to test all values
sv_gs = GridSearchCV(sv, param_sv, refit = True, verbose = 3) 
  
# fitting the model for grid search 
sv_gs.fit(X_train, y_train) 

# save best model
sv_best = sv_gs.best_estimator_

# check best c, gamma, and kernel
print(sv_gs.best_params_)



Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.945, total=   3.6s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.6s remaining:    0.0s


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.946, total=   3.9s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    7.5s remaining:    0.0s


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.946, total=   3.6s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.940, total=   3.6s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.942, total=   3.9s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.941, total=   3.8s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.3s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.2s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] .

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  2.1min finished


{'C': 10, 'gamma': 1, 'kernel': 'rbf'}


**Hard voting**


In [21]:

voting_clf = VotingClassifier(
    estimators=[("lr", log_model), ("rf", rf_best), ("svc", sv_best)],
    voting="hard",
)
voting_clf.fit(X_train, y_train)

for clf in (log_model, rf_best, sv_best, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9428248587570621
RandomForestClassifier 0.9459887005649718
SVC 0.9459887005649718
VotingClassifier 0.9459887005649718
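
As a minimal illustration of what hard voting does, the sketch below takes a majority vote over made-up predictions from three classifiers. The prediction arrays are hypothetical and are not taken from the models above.

```python
import numpy as np

# Hypothetical predictions from three classifiers for five comments (1 = healthy)
preds = np.array([
    [1, 1, 0, 1, 0],  # e.g. logistic regression
    [1, 0, 0, 1, 1],  # e.g. random forest
    [1, 1, 0, 0, 0],  # e.g. SVM
])

# Hard voting: each comment gets the label chosen by more than half of the classifiers
votes_for_healthy = preds.sum(axis=0)
hard_vote = (votes_for_healthy > preds.shape[0] / 2).astype(int)

print(hard_vote.tolist())  # [1, 1, 0, 1, 0]
```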


In [70]:
print_score(voting_clf, X_train, y_train, X_test, y_test, train=True)
print_score(voting_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9522

Classification Report: 
 	Precision: 0.9573106283029947
	Recall Score: 0.9926327325864588
	F1 score: 0.9746517606265319

Confusion Matrix: 
 [[ 1201  1454]
 [  242 32606]]

Test Result:
accuracy score: 0.9459887005649718

Classification Report: 
 	Precision: 0.9554665409990575
	Recall Score: 0.9878197320341048
	F1 score: 0.9713738172236195

Confusion Matrix: 
 [[ 131  189]
 [  50 4055]]
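
The test metrics can be recomputed by hand from the test confusion matrix, which also shows how the model does on the minority (unhealthy) class. This sketch assumes scikit-learn's layout, with rows as true classes and columns as predicted classes, and uses only the four counts printed above.

```python
import numpy as np

# Test-set confusion matrix reported above (rows: true class, columns: predicted class)
cm = np.array([[131, 189],
               [50, 4055]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Extra context that the printed report does not include
recall_unhealthy = tn / (tn + fp)          # share of unhealthy comments that are caught
majority_baseline = (tp + fn) / cm.sum()   # accuracy of always predicting "healthy"

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"recall on unhealthy class={recall_unhealthy:.4f} majority baseline={majority_baseline:.4f}")
```

Because most comments are healthy, accuracy alone can look strong even when many unhealthy comments are missed, so the per-class numbers are worth reading alongside it.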



**Soft voting**

In [72]:
soft_svm_clf = SVC(probability=True, C=10, gamma=1, kernel='rbf')

soft_voting_clf = VotingClassifier(
    estimators=[("lr", log_model), ("rf", rf_best), ("svc", soft_svm_clf)],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)

for clf in (log_model, rf_best, sv_best, soft_voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9428248587570621
RandomForestClassifier 0.9455367231638419
SVC 0.9459887005649718
VotingClassifier 0.9446327683615819
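
Soft voting averages the predicted class probabilities instead of counting votes, so a single very confident classifier can outweigh two lukewarm ones. The sketch below uses made-up probabilities (not outputs of the models above) to show a case where the two schemes disagree.

```python
import numpy as np

# Hypothetical P(healthy) from three classifiers (rows) for three comments (columns)
probas = np.array([
    [0.55, 0.90, 0.40],
    [0.60, 0.80, 0.45],
    [0.10, 0.70, 0.95],
])

# Soft voting: average the probabilities, then threshold the mean at 0.5
mean_proba = probas.mean(axis=0)
soft_vote = (mean_proba > 0.5).astype(int)

# Hard voting on the same classifiers: threshold first, then take the majority
hard_vote = ((probas > 0.5).sum(axis=0) >= 2).astype(int)

print(soft_vote.tolist())  # [0, 1, 1]
print(hard_vote.tolist())  # [1, 1, 0]
```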


In [73]:
print_score(soft_voting_clf, X_train, y_train, X_test, y_test, train=True)
print_score(soft_voting_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9541

Classification Report: 
 	Precision: 0.9574723760954307
	Recall Score: 0.9945202143205065
	F1 score: 0.9756447205339944

Confusion Matrix: 
 [[ 1204  1451]
 [  180 32668]]

Test Result:
accuracy score: 0.9446327683615819

Classification Report: 
 	Precision: 0.9534774436090225
	Recall Score: 0.9885505481120584
	F1 score: 0.9706972850137543

Confusion Matrix: 
 [[ 122  198]
 [  47 4058]]



# Bagging

Bagging involves drawing multiple samples from the training data with replacement and training a model on each sample. The final prediction aggregates the predictions of the sub-models, by averaging or by majority vote. The models I will use are bagged decision trees, random forest, and extra trees.

In [36]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9459

Classification Report: 
 	Precision: 0.9565127538970241
	Recall Score: 0.9863309790550414
	F1 score: 0.9711930455635491

Confusion Matrix: 
 [[ 1182  1473]
 [  449 32399]]

Test Result:
accuracy score: 0.9441807909604519

Classification Report: 
 	Precision: 0.957977207977208
	Recall Score: 0.9829476248477467
	F1 score: 0.9703017915113622

Confusion Matrix: 
 [[ 143  177]
 [  70 4035]]



**Out-of-bag Evaluation**

In [39]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,
    random_state=40,
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9442018984311185
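
Out-of-bag evaluation is possible because each bootstrap sample leaves out part of the training set, and those unseen rows can serve as a validation set for that tree. A small simulation, assuming a bootstrap sample as large as the training set (the `BaggingClassifier` default used above), shows roughly how much is left out. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # arbitrary size for the simulation

# One bootstrap sample: draw n_samples row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct rows that made it into the sample, and the rest
in_bag = np.unique(bootstrap_idx).size / n_samples
out_of_bag = 1 - in_bag

print(f"in bag: {in_bag:.3f}, out of bag: {out_of_bag:.3f}")
```

The in-bag fraction should come out near 1 - 1/e (about 63%), which leaves about 37% of the rows for out-of-bag scoring of each tree.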

In [40]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9943

Classification Report: 
 	Precision: 0.9940676170586277
	Recall Score: 0.9998477837311252
	F1 score: 0.996949322324586

Confusion Matrix: 
 [[ 2459   196]
 [    5 32843]]

Test Result:
accuracy score: 0.9450847457627118

Classification Report: 
 	Precision: 0.9591060389919163
	Recall Score: 0.9827040194884288
	F1 score: 0.9707616411984117

Confusion Matrix: 
 [[ 148  172]
 [  71 4034]]



**Random forest**

In [42]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=16,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [43]:
print_score(rnd_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rnd_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9476

Classification Report: 
 	Precision: 0.9557581997352552
	Recall Score: 0.989131758402338
	F1 score: 0.9721586403961522

Confusion Matrix: 
 [[ 1151  1504]
 [  357 32491]]

Test Result:
accuracy score: 0.9455367231638419

Classification Report: 
 	Precision: 0.9560906515580736
	Recall Score: 0.9866017052375152
	F1 score: 0.971106581944611

Confusion Matrix: 
 [[ 134  186]
 [  55 4050]]



**Extra Trees**

In [75]:
ext_clf = ExtraTreesClassifier(n_estimators=1000, max_features=7, random_state=42,n_jobs=-1)
ext_clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features=7, max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
                     oob_score=False, random_state=42, verbose=0,
                     warm_start=False)

In [76]:
print_score(ext_clf, X_train, y_train, X_test, y_test, train=True)
print_score(ext_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9943

Classification Report: 
 	Precision: 0.9951466618133284
	Recall Score: 0.9987518265952265
	F1 score: 0.9969459849578363

Confusion Matrix: 
 [[ 2495   160]
 [   41 32807]]

Test Result:
accuracy score: 0.9385310734463277

Classification Report: 
 	Precision: 0.9588221211395739
	Recall Score: 0.9756394640682094
	F1 score: 0.967157691378894

Confusion Matrix: 
 [[ 148  172]
 [ 100 4005]]



# Boosting

Boosting algorithms train weak learners sequentially, with each new learner focusing on the errors of the previous ones. I will use AdaBoost and Gradient Boosting.
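
To see the sequential nature of boosting, the sketch below tracks test accuracy as learners are added, using `staged_score` on an `AdaBoostClassifier`. The data is synthetic (`make_classification`) and the settings are illustrative defaults, not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 7 features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The default weak learner is a depth-1 decision tree (a "stump")
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round
staged = list(ada.staged_score(X_te, y_te))
for n in sorted({1, max(1, len(staged) // 2), len(staged)}):
    print(f"{n:3d} learners: accuracy {staged[n - 1]:.3f}")
```

The first entry is the accuracy of a single stump; later entries show how much the additional rounds contribute.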

**AdaBoost**

In [53]:
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    algorithm="SAMME.R",
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(class_weight=None,
                                                         criterion='gini',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort=False,
                                                         random_state=None,
                             

In [54]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)
print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9459

Classification Report: 
 	Precision: 0.9558437730287399
	Recall Score: 0.9871833901607404
	F1 score: 0.9712608389618558

Confusion Matrix: 
 [[ 1157  1498]
 [  421 32427]]

Test Result:
accuracy score: 0.9414689265536723

Classification Report: 
 	Precision: 0.9548249763481551
	Recall Score: 0.9834348355663824
	F1 score: 0.96891875675027

Confusion Matrix: 
 [[ 129  191]
 [  68 4037]]



**Gradient Boosting**

In [67]:
grad_clf = GradientBoostingClassifier(max_depth=2,n_estimators=100, random_state=42)
grad_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=2,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [68]:
print_score(grad_clf, X_train, y_train, X_test, y_test, train=True)
print_score(grad_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9483

Classification Report: 
 	Precision: 0.9564065581491185
	Recall Score: 0.989162201656113
	F1 score: 0.9725086424926296

Confusion Matrix: 
 [[ 1174  1481]
 [  356 32492]]

Test Result:
accuracy score: 0.9450847457627118

Classification Report: 
 	Precision: 0.9552098066949553
	Recall Score: 0.987088915956151
	F1 score: 0.9708877440996766

Confusion Matrix: 
 [[ 130  190]
 [  53 4052]]



# Summary

In this notebook, I explored ensemble machine learning algorithms to improve the classification of healthy comments. I covered voting ensembles, which combine the predictions of arbitrary models; bagging ensembles, including bagged decision trees, random forest, and extra trees; and boosting ensembles, including AdaBoost and gradient boosting. My best model is the hard voting classifier, which combines logistic regression, a random forest classifier, and an SVM. On the test set it reaches accuracy 0.9460, precision 0.9555, recall 0.9878, and F1 score 0.9714; its test accuracy ties with that of the tuned SVM on its own. I think the hard voting classifier performs best because the random forest and the SVM inside it use hyper-parameters tuned by grid search, which improves the performance of the ensemble. Finally, I apply my best model to the validation set to see the final result.

In [81]:
pred = voting_clf.predict(X_val)
print("Val Result:\n===========================================")        
print(f"accuracy score: {accuracy_score(y_val, pred)}\n")
print(f"Classification Report: \n \tPrecision: {precision_score(y_val, pred)}\n\tRecall Score: {recall_score(y_val, pred)}\n\tF1 score: {f1_score(y_val, pred)}\n")
print(f"Confusion Matrix: \n {confusion_matrix(y_val, pred)}\n")

Val Result:
accuracy score: 0.944206008583691

Classification Report: 
 	Precision: 0.9496958352831072
	Recall Score: 0.9921779516010756
	F1 score: 0.9704722056186491

Confusion Matrix: 
 [[ 121  215]
 [  32 4059]]

