This is an assignment on detecting unhealthy conversations. The assignment description is at
https://ml.auc-computing.nl/assignment3.html. The repository I used for the assignment is https://github.com/conversationai/unhealthy-conversations. My models use three files from it: train.csv, test.csv, and val.csv.

# Import libraries

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier, VotingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier 

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action="ignore")

In [2]:
data_dir = Path("corpus/")
train_data= pd.read_csv(data_dir / "train.csv")
test_data = pd.read_csv(data_dir / "test.csv")
val_data = pd.read_csv(data_dir / "val.csv")

I leave 'trusted judgements', 'unit id', and 'comment' out of columns_use, and use the 'healthy' column as the target (target_column).

In [3]:

columns_use = ['antagonise','antagonise:confidence','condescending','condescending:confidence',
              'dismissive','dismissive:confidence','generalisation','generalisation:confidence',
              'generalisation_unfair', 'generalisation_unfair:confidence','hostile',
               'hostile:confidence', 'sarcastic','sarcastic:confidence']
target_column = ['healthy']

X_train = train_data[columns_use]
y_train = train_data[target_column]

X_test = test_data[columns_use]
y_test = test_data[target_column]

X_val = val_data[columns_use]
y_val = val_data[target_column]

Create a function that combines each sub-attribute with its confidence score, and apply it to X_train, X_test, and X_val. The idea is to turn each attribute into the confidence that the attribute is true. If the attribute is 1, the new value is the confidence score itself; if the attribute is 0, the new value is 1 - score. Both cases are covered by one expression, attribute*score + (1-attribute)*(1-score): for attribute = 1 it gives 1*score + (1-1)*(1-score) = score, and for attribute = 0 it gives 0*score + (1-0)*(1-score) = 1 - score. After overwriting the attribute columns with these values, we drop the confidence columns.
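
As a quick check of the formula, here is a minimal sketch on single values rather than DataFrame columns. The helper name `combine` and the sample confidence of 0.8 are illustrative assumptions, not values from the dataset.

```python
def combine(attribute, score):
    """Return the confidence that the attribute is true.

    attribute: 0 or 1 label; score: confidence in that label.
    """
    return attribute * score + (1 - attribute) * (1 - score)


# Hypothetical values: the same confidence attached to each possible label.
print(combine(1, 0.8))  # label is 1, so the value stays at the confidence
print(combine(0, 0.8))  # label is 0, so the value becomes 1 - confidence
```

A label of 0 held with high confidence therefore ends up close to 0, and a label of 1 held with high confidence ends up close to 1.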

In [4]:
attributes = ['antagonise', 'condescending', 'dismissive', 'generalisation', 
             'generalisation_unfair', 'hostile', 'sarcastic']
confidence = ['antagonise:confidence','condescending:confidence', 
                         'dismissive:confidence', 'generalisation:confidence', 
                         'generalisation_unfair:confidence', 'hostile:confidence', 
                         'sarcastic:confidence']
    
def preprocess(X):
    """Fold each attribute and its confidence into one score and drop the confidence columns."""
    # Work on a copy so we do not write into a slice of the original DataFrame.
    X = X.copy()
    for attr, conf in zip(attributes, confidence):
        X[attr] = X[attr] * X[conf] + (1 - X[attr]) * (1 - X[conf])
    return X.drop(columns=confidence)

X_train = preprocess(X_train)
X_test = preprocess(X_test)
X_val = preprocess(X_val)


Create a scoring function that prints the accuracy, precision, recall, F1 score, and confusion matrix for either the training set or the test set.

In [5]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print accuracy, precision, recall, F1, and the confusion matrix for one split."""
    if train:
        pred = clf.predict(X_train)
        print("Train Result:\n===========================================")
        print(f"accuracy score: {accuracy_score(y_train, pred):.4f}\n")
        print(f"Classification Report: \n \tPrecision: {precision_score(y_train, pred)}\n\tRecall Score: {recall_score(y_train, pred)}\n\tF1 score: {f1_score(y_train, pred)}\n")
        # Reuse pred instead of predicting a second time.
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")

    else:
        pred = clf.predict(X_test)
        print("Test Result:\n===========================================")
        print(f"accuracy score: {accuracy_score(y_test, pred)}\n")
        print(f"Classification Report: \n \tPrecision: {precision_score(y_test, pred)}\n\tRecall Score: {recall_score(y_test, pred)}\n\tF1 score: {f1_score(y_test, pred)}\n")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

# Train some classifiers and aggregate them in voting classifier


I will train several classifiers, tune their hyper-parameters, and then combine them in a voting classifier.

In [6]:
log_model = LogisticRegression(max_iter=1000000)
log_model.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

The next model I will build is a **random forest**. A random forest is considered an ensemble model in itself, since it is a collection of decision trees combined to make a more accurate model. I will then do hyper-parameter tuning to find the optimal number of trees.

In [7]:
rf = RandomForestClassifier()

# create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}

# use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)

rf_gs.fit(X_train, y_train)

# save best model
rf_best = rf_gs.best_estimator_

# check best n_estimators value
print(rf_gs.best_params_)


{'n_estimators': 200}


The final model I will build is an **SVM**. I will again use grid search to find the best hyper-parameters, which turn out to be C=10, gamma=1, and kernel='rbf'.
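
Grid search tries every combination of the listed values and cross-validates each one. A minimal sketch of that enumeration in plain Python, using the C, gamma, and kernel values searched in this notebook. The fold count of 3 is an assumption taken from the search log, and the sketch only counts fits rather than training anything.

```python
from itertools import product

# Same value lists as the SVM grid searched in this notebook.
param_grid = {"C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01], "kernel": ["rbf"]}
n_folds = 3  # assumption: matches the "3 folds" reported in the search log

# Every combination of values is one candidate setting.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")
```

The cost grows multiplicatively with each added hyper-parameter, which is why the grids here are kept small.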

In [8]:
sv = SVC()

# defining parameter range 
param_sv = {'C': [0.1, 1, 10],  
              'gamma': [1, 0.1, 0.01], 
              'kernel': ['rbf']}  
  
#use gridsearch to test all values
sv_gs = GridSearchCV(sv, param_sv, refit = True, verbose = 3) 
  
# fitting the model for grid search 
sv_gs.fit(X_train, y_train) 

# save best model
sv_best = sv_gs.best_estimator_

# check best c, gamma, and kernel
print(sv_gs.best_params_)



Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.945, total=   3.8s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.8s remaining:    0.0s


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.946, total=   3.7s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    7.5s remaining:    0.0s


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.946, total=   3.5s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.940, total=   3.5s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.942, total=   3.8s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.941, total=   3.7s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.2s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.2s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.1s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] .

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  2.0min finished


{'C': 10, 'gamma': 1, 'kernel': 'rbf'}


**Hard voting**


In [9]:

voting_clf = VotingClassifier(
    estimators=[("lr", log_model), ("rf", rf_best), ("svc", sv_best)],
    voting="hard",
)
voting_clf.fit(X_train, y_train)

for clf in (log_model, rf_best, sv_best, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9428248587570621
RandomForestClassifier 0.9466666666666667
SVC 0.9459887005649718
VotingClassifier 0.9459887005649718


In [10]:
print_score(voting_clf, X_train, y_train, X_test, y_test, train=True)
print_score(voting_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9522

Classification Report: 
 	Precision: 0.9573106283029947
	Recall Score: 0.9926327325864588
	F1 score: 0.9746517606265319

Confusion Matrix: 
 [[ 1201  1454]
 [  242 32606]]

Test Result:
accuracy score: 0.9459887005649718

Classification Report: 
 	Precision: 0.9552520018841263
	Recall Score: 0.9880633373934227
	F1 score: 0.9713806729732967

Confusion Matrix: 
 [[ 130  190]
 [  49 4056]]
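
The reported test metrics can be recovered by hand from the confusion matrix. The sketch below copies the four counts from the test output above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes). It also computes the recall of class 0, which `print_score` does not report.

```python
# Test-set confusion matrix of the hard-voting classifier, copied from the output above.
# Layout: [[TN, FP], [FN, TP]] with class 1 treated as the positive class.
tn, fp = 130, 190
fn, tp = 49, 4056

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# Recall of class 0, i.e. the share of class-0 comments that were predicted as 0.
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"specificity={specificity:.4f}")
```

Only 130 of the 320 class-0 comments are predicted as class 0, which the accuracy score hides because class 1 dominates the test set.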



**Soft voting**

In [11]:
soft_svm_clf = SVC(probability=True, C=10, gamma=1, kernel='rbf')

soft_voting_clf = VotingClassifier(
    estimators=[("lr", log_model), ("rf", rf_best), ("svc", soft_svm_clf)],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)

for clf in (log_model, rf_best, sv_best, soft_voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9428248587570621
RandomForestClassifier 0.9464406779661017
SVC 0.9459887005649718
VotingClassifier 0.9441807909604519


In [12]:
print_score(soft_voting_clf, X_train, y_train, X_test, y_test, train=True)
print_score(soft_voting_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9541

Classification Report: 
 	Precision: 0.9575565716965646
	Recall Score: 0.9945202143205065
	F1 score: 0.9756884296039662

Confusion Matrix: 
 [[ 1207  1448]
 [  180 32668]]

Test Result:
accuracy score: 0.9441807909604519

Classification Report: 
 	Precision: 0.9530295913574448
	Recall Score: 0.9885505481120584
	F1 score: 0.9704651440870502

Confusion Matrix: 
 [[ 120  200]
 [  47 4058]]



# Bagging

Bagging draws multiple samples from the training data with replacement and trains a model on each sample. The final prediction aggregates the predictions of the sub-models (a majority vote or averaged probabilities for classification). The models I will use are bagged decision trees, a random forest, and extra trees.
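
To make "sampling with replacement" concrete, here is a minimal sketch of a single bootstrap sample in plain Python. The training-set size of 10,000 and the seed are hypothetical, and the sketch assumes each sample is as large as the training set, which is not the case when `max_samples` is set to a smaller number.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
n = 10_000      # hypothetical training-set size
indices = range(n)

# One bootstrap sample: n draws with replacement.
sample = [random.choice(indices) for _ in range(n)]

in_bag = set(sample)
out_of_bag = [i for i in indices if i not in in_bag]

frac_in_bag = len(in_bag) / n
print(f"unique rows in the sample: {frac_in_bag:.3f}")
print(f"out-of-bag rows: {len(out_of_bag) / n:.3f}")
```

Roughly a third of the rows never appear in a given sample. These out-of-bag rows act as a held-out set for the model trained on that sample, which is what `oob_score=True` relies on.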

In [13]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9455

Classification Report: 
 	Precision: 0.9546176470588236
	Recall Score: 0.9880966877739893
	F1 score: 0.9710686931546194

Confusion Matrix: 
 [[ 1112  1543]
 [  391 32457]]

Test Result:
accuracy score: 0.9453107344632768

Classification Report: 
 	Precision: 0.9560802833530107
	Recall Score: 0.9863580998781973
	F1 score: 0.9709832134292566

Confusion Matrix: 
 [[ 134  186]
 [  56 4049]]



**Out-of-bag Evaluation**

In [32]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,
    random_state=40,
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9442018984311185

In [34]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9943

Classification Report: 
 	Precision: 0.9940676170586277
	Recall Score: 0.9998477837311252
	F1 score: 0.996949322324586

Confusion Matrix: 
 [[ 2459   196]
 [    5 32843]]

Test Result:
accuracy score: 0.9450847457627118

Classification Report: 
 	Precision: 0.9591060389919163
	Recall Score: 0.9827040194884288
	F1 score: 0.9707616411984117

Confusion Matrix: 
 [[ 148  172]
 [  71 4034]]



**Random forest**

In [62]:
rnd_clf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=16,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [63]:
print_score(rnd_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rnd_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9477

Classification Report: 
 	Precision: 0.9563284056776017
	Recall Score: 0.9886446663419386
	F1 score: 0.9722180642457264

Confusion Matrix: 
 [[ 1172  1483]
 [  373 32475]]

Test Result:
accuracy score: 0.9462146892655368

Classification Report: 
 	Precision: 0.9572002837550249
	Recall Score: 0.9861144945188794
	F1 score: 0.9714422846172306

Confusion Matrix: 
 [[ 139  181]
 [  57 4048]]



**Extra Trees**

In [40]:
ext_clf = ExtraTreesClassifier(n_estimators=1000, max_features=7, random_state=42,n_jobs=-1)
ext_clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features=7, max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
                     oob_score=False, random_state=42, verbose=0,
                     warm_start=False)

In [42]:
print_score(ext_clf, X_train, y_train, X_test, y_test, train=True)
print_score(ext_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9943

Classification Report: 
 	Precision: 0.9951466618133284
	Recall Score: 0.9987518265952265
	F1 score: 0.9969459849578363

Confusion Matrix: 
 [[ 2495   160]
 [   41 32807]]

Test Result:
accuracy score: 0.9385310734463277

Classification Report: 
 	Precision: 0.9588221211395739
	Recall Score: 0.9756394640682094
	F1 score: 0.967157691378894

Confusion Matrix: 
 [[ 148  172]
 [ 100 4005]]



**Voting**

I will use hard voting to ensemble the three models from the bagging section. A voting classifier combines several fitted models: with hard voting it predicts the class that receives the most votes from the individual models, and with soft voting it predicts the class with the highest average predicted probability.
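
The two voting rules can disagree. A minimal sketch in plain Python, using made-up probabilities from three hypothetical classifiers for a single comment; the function names and the 0.5 threshold are assumptions for illustration and are not scikit-learn's internals.

```python
from collections import Counter

# Hypothetical probabilities of class 1 from three classifiers for one comment.
proba_class1 = [0.45, 0.45, 0.90]


def hard_vote(probas, threshold=0.5):
    """Each model casts one vote; the most common label wins."""
    votes = [int(p >= threshold) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas, threshold=0.5):
    """Average the probabilities, then threshold the mean."""
    mean = sum(probas) / len(probas)
    return int(mean >= threshold)


print("hard:", hard_vote(proba_class1))  # two of three models vote 0
print("soft:", soft_vote(proba_class1))  # the mean of 0.6 favours class 1
```

Soft voting lets one very confident model outweigh two uncertain ones, while hard voting counts every model equally.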

In [64]:
bv_clf = VotingClassifier(
    estimators=[("bag", bag_clf), ("rnd", rnd_clf ), ("ext", ext_clf)],
    voting="hard",
)

bv_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('bag',
                              BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
                                                                                      criterion='gini',
                                                                                      max_depth=None,
                                                                                      max_features=None,
                                                                                      max_leaf_nodes=None,
                                                                                      min_impurity_decrease=0.0,
                                                                                      min_impurity_split=None,
                                                                                      min_samples_leaf=1,
                                                                                      min_samples_split=2,
   

In [65]:
print_score(bv_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bv_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9943

Classification Report: 
 	Precision: 0.9940676170586277
	Recall Score: 0.9998477837311252
	F1 score: 0.996949322324586

Confusion Matrix: 
 [[ 2459   196]
 [    5 32843]]

Test Result:
accuracy score: 0.9444067796610169

Classification Report: 
 	Precision: 0.9579871825302635
	Recall Score: 0.9831912302070646
	F1 score: 0.970425583072854

Confusion Matrix: 
 [[ 143  177]
 [  69 4036]]



## Boosting

A boosting algorithm trains weak learners sequentially, with each new learner focusing on the errors of the previous ones. I will use AdaBoost, gradient boosting, and XGBoost to see whether adding models to the ensemble sequentially improves performance.
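
To show what "sequentially" means in practice, here is a minimal sketch of one re-weighting round of the classic discrete AdaBoost update, on made-up labels and predictions. This is an illustration of the idea only: the SAMME.R variant used in this notebook works with class probabilities, so it is not the exact computation scikit-learn performs.

```python
import math

# Hypothetical toy round: labels and weak-learner predictions in {-1, +1}.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # the last sample is misclassified
weights = [1 / len(y_true)] * len(y_true)

# Weighted error of the weak learner and its vote strength (alpha).
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weights of mistakes, lower the weights of correct samples, renormalise.
updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(updated)
updated = [w / total for w in updated]

print(f"error={error:.2f} alpha={alpha:.3f}")
print([round(w, 3) for w in updated])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.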

**AdaBoost**

In [44]:
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    algorithm="SAMME.R",
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(class_weight=None,
                                                         criterion='gini',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort=False,
                                                         random_state=None,
                             

In [46]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)
print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9459

Classification Report: 
 	Precision: 0.9558437730287399
	Recall Score: 0.9871833901607404
	F1 score: 0.9712608389618558

Confusion Matrix: 
 [[ 1157  1498]
 [  421 32427]]

Test Result:
accuracy score: 0.9414689265536723

Classification Report: 
 	Precision: 0.9548249763481551
	Recall Score: 0.9834348355663824
	F1 score: 0.96891875675027

Confusion Matrix: 
 [[ 129  191]
 [  68 4037]]



**Gradient Boosting**

In [48]:
grad_clf = GradientBoostingClassifier(max_depth=2,n_estimators=100, random_state=42)
grad_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=2,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [50]:
print_score(grad_clf, X_train, y_train, X_test, y_test, train=True)
print_score(grad_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9483

Classification Report: 
 	Precision: 0.9564065581491185
	Recall Score: 0.989162201656113
	F1 score: 0.9725086424926296

Confusion Matrix: 
 [[ 1174  1481]
 [  356 32492]]

Test Result:
accuracy score: 0.9450847457627118

Classification Report: 
 	Precision: 0.9552098066949553
	Recall Score: 0.987088915956151
	F1 score: 0.9708877440996766

Confusion Matrix: 
 [[ 130  190]
 [  53 4052]]



**XGBoost**

In [52]:
xgb_clf = XGBClassifier(max_depth=2,n_estimators=100, random_state=42)
xgb_clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [54]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9509

Classification Report: 
 	Precision: 0.9581933887219374
	Recall Score: 0.9901059425231369
	F1 score: 0.9738883066327294

Confusion Matrix: 
 [[ 1236  1419]
 [  325 32523]]

Test Result:
accuracy score: 0.9455367231638419

Classification Report: 
 	Precision: 0.9563060935285782
	Recall Score: 0.9863580998781973
	F1 score: 0.9710996522364791

Confusion Matrix: 
 [[ 135  185]
 [  56 4049]]



**Voting**

I will again use hard voting to combine the three boosting models above.

In [56]:
eclf = VotingClassifier(
    estimators=[("ada", ada_clf), ("grad", grad_clf), ("xgb", xgb_clf)],
    voting="hard",
)

eclf.fit(X_train, y_train)


VotingClassifier(estimators=[('ada',
                              AdaBoostClassifier(algorithm='SAMME.R',
                                                 base_estimator=DecisionTreeClassifier(class_weight=None,
                                                                                       criterion='gini',
                                                                                       max_depth=1,
                                                                                       max_features=None,
                                                                                       max_leaf_nodes=None,
                                                                                       min_impurity_decrease=0.0,
                                                                                       min_impurity_split=None,
                                                                                       min_samples_leaf=1,
                                   

In [57]:
print_score(eclf, X_train, y_train, X_test, y_test, train=True)
print_score(eclf, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9487

Classification Report: 
 	Precision: 0.9570725671017354
	Recall Score: 0.9889186556259133
	F1 score: 0.9727350312177155

Confusion Matrix: 
 [[ 1198  1457]
 [  364 32484]]

Test Result:
accuracy score: 0.9450847457627118

Classification Report: 
 	Precision: 0.9558545797922569
	Recall Score: 0.9863580998781973
	F1 score: 0.9708668025416617

Confusion Matrix: 
 [[ 133  187]
 [  56 4049]]



The table below collects the test accuracy of all my models. The scores are very close: the random forest is highest (0.9462), followed by the SVM and the hard-voting ensemble of logistic regression, random forest, and SVM (both 0.9460). I will pick four of them (SVM, random forest, bagged decision trees, and XGBoost) and combine them in one more hard-voting classifier.

|               | Logistic regression | SVM    | Hard voting | Soft voting | Bagging (decision tree) | Random forest | Extra trees | Bag voting | AdaBoost | Gradient boosting | XGBoost | Boost voting |
|---------------|---------------------|--------|-------------|-------------|-------------------------|---------------|-------------|------------|----------|-------------------|---------|--------------|
| Test accuracy | 0.9428              | 0.9460 | 0.9460      | 0.9442      | 0.9453                  | 0.9462        | 0.9385      | 0.9444     | 0.9415   | 0.9451            | 0.9455  | 0.9451       |

In [69]:
eclf_final = VotingClassifier(
    estimators=[("svc", sv_best), ("rnd", rnd_clf), ("bag", bag_clf), ("xgb", xgb_clf)],
    voting="hard",
)

eclf_final.fit(X_train, y_train)

VotingClassifier(estimators=[('svc',
                              SVC(C=10, cache_size=200, class_weight=None,
                                  coef0=0.0, decision_function_shape='ovr',
                                  degree=3, gamma=1, kernel='rbf', max_iter=-1,
                                  probability=False, random_state=None,
                                  shrinking=True, tol=0.001, verbose=False)),
                             ('rnd',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto...
                                            max_delta_step=0, max_depth=2,
                                            min_child_weight=1, missing=nan,
                                     

In [70]:
print_score(eclf_final, X_train, y_train, X_test, y_test, train=True)
print_score(eclf_final, X_train, y_train, X_test, y_test, train=False)

Train Result:
accuracy score: 0.9529

Classification Report: 
 	Precision: 0.9599870164360117
	Recall Score: 0.9904103750608865
	F1 score: 0.9749614157064298

Confusion Matrix: 
 [[ 1299  1356]
 [  315 32533]]

Test Result:
accuracy score: 0.9484745762711865

Classification Report: 
 	Precision: 0.959033862183282
	Recall Score: 0.9866017052375152
	F1 score: 0.9726224783861672

Confusion Matrix: 
 [[ 147  173]
 [  55 4050]]



## Summary

In this notebook, I explored ensemble machine learning algorithms to improve the classification of healthy comments. I learned about voting ensembles for combining the predictions of arbitrary models, bagging ensembles (bagged decision trees, random forest, and extra trees), and boosting ensembles (AdaBoost, gradient boosting, and XGBoost). My best model is the hard-voting classifier that combines SVM, random forest, bagged decision trees, and XGBoost. On the test set it reaches accuracy 0.9485, precision 0.9590, recall 0.9866, and F1 score 0.9726. I chose these models because each performs well on its own and combining them gives slightly better predictions. Finally, I check performance on the val file. Note that the cell below evaluates `voting_clf`, the first hard-voting ensemble (logistic regression, random forest, and SVM); replacing it with `eclf_final` would evaluate the final ensemble instead.

In [73]:
pred = voting_clf.predict(X_val)
print("Val Result:\n===========================================")        
print(f"accuracy score: {accuracy_score(y_val, pred)}\n")
print(f"Classification Report: \n \tPrecision: {precision_score(y_val, pred)}\n\tRecall Score: {recall_score(y_val, pred)}\n\tF1 score: {f1_score(y_val, pred)}\n")
print(f"Confusion Matrix: \n {confusion_matrix(y_val, pred)}\n")

Val Result:
accuracy score: 0.944206008583691

Classification Report: 
 	Precision: 0.9496958352831072
	Recall Score: 0.9921779516010756
	F1 score: 0.9704722056186491

Confusion Matrix: 
 [[ 121  215]
 [  32 4059]]

