This is an assignment on detecting unhealthy conversations. The assignment description is at
https://ml.auc-computing.nl/assignment3.html. The repository I used for the assignment is https://github.com/conversationai/unhealthy-conversations. It provides the files I will use for my model: train.csv, test.csv, val.csv.

# Import libraries

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action="ignore")

In [2]:
data_dir = Path("corpus/")
train_data= pd.read_csv(data_dir / "train.csv")
test_data = pd.read_csv(data_dir / "test.csv")

I leave 'trusted judgements', 'unit id', and 'comment' out of columns_use, and the target column is the 'healthy' column.

In [3]:

columns_use = ['antagonise','antagonise:confidence','condescending','condescending:confidence',
              'dismissive','dismissive:confidence','generalisation','generalisation:confidence',
              'generalisation_unfair', 'generalisation_unfair:confidence','hostile',
               'hostile:confidence', 'sarcastic','sarcastic:confidence']
target_column = ['healthy']

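# copy the features so that preprocess() can safely modify them in place
X_train = train_data[columns_use].copy()
# flatten the target to 1-D, which is the shape scikit-learn estimators expect
y_train = train_data[target_column].to_numpy().ravel()

X_test = test_data[columns_use].copy()
y_test = test_data[target_column].to_numpy().ravel()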


Next, I create a function that combines each sub-attribute with its confidence score and apply it to X_train and X_test. The idea is to replace each attribute with the confidence that the attribute is true. If the attribute is 1, the new value is the confidence score itself; if the attribute is 0, the new value is 1 - score. Both cases are covered by one expression, attribute * score + (1 - attribute) * (1 - score), which gives 1 * score + (1 - 1) * (1 - score) = score when the attribute is 1, and 0 * score + (1 - 0) * (1 - score) = 1 - score when the attribute is 0. After overwriting the attribute columns with these values, I drop the confidence columns.

In [4]:
attributes = ['antagonise', 'condescending', 'dismissive', 'generalisation', 
             'generalisation_unfair', 'hostile', 'sarcastic']
confidence = ['antagonise:confidence','condescending:confidence', 
                         'dismissive:confidence', 'generalisation:confidence', 
                         'generalisation_unfair:confidence', 'hostile:confidence', 
                         'sarcastic:confidence']
    
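def preprocess(X):
    """Overwrite each attribute column, in place, with the confidence that the attribute is true."""
    for attr, conf in zip(attributes, confidence):
        # attribute == 1 -> score; attribute == 0 -> 1 - score
        X[attr] = X[attr] * X[conf] + (1 - X[attr]) * (1 - X[conf])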
        
preprocess(X_train)
X_train = X_train.drop(confidence, axis=1)

preprocess(X_test)
X_test = X_test.drop(confidence, axis=1)
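
To make the combination rule concrete, here is a small sketch on a made-up three-row frame. The values are hypothetical and not taken from train.csv; it applies the same expression as `preprocess` to a single attribute.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv).
toy = pd.DataFrame({
    "hostile": [1, 0, 0],
    "hostile:confidence": [0.8, 0.9, 0.6],
})

a = toy["hostile"]
c = toy["hostile:confidence"]

# a*c + (1-a)*(1-c): the score itself when a == 1, and 1 - score when a == 0.
toy["hostile"] = a * c + (1 - a) * (1 - c)
toy = toy.drop(columns=["hostile:confidence"])

print(toy["hostile"].round(2).tolist())  # [0.8, 0.1, 0.4]
```

A comment labelled not hostile with confidence 0.9 ends up with the value 0.1, so a single column now carries both the label and how sure the annotators were.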




# Train some classifiers and aggregate them in a voting classifier


First, I fit a logistic regression with max_iter=1000000.

In [5]:
log_model = LogisticRegression(max_iter=1000000)
log_model.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

The next model I will build is a **random forest**. A random forest is considered an ensemble model in itself, since it is a collection of decision trees combined to make a more accurate model. I will then do hyper-parameter tuning to find the optimal number of trees.

In [6]:
rf = RandomForestClassifier()

# create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}

# use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)

rf_gs.fit(X_train, y_train)

# save best model
rf_best = rf_gs.best_estimator_

# check best n_estimators value
print(rf_gs.best_params_)


{'n_estimators': 200}


The final model I will build is an **SVM**. I again use GridSearchCV to find the best hyper-parameters, which turn out to be C=10, gamma=1, and kernel='rbf'.

In [7]:
sv = SVC()

# defining parameter range 
param_sv = {'C': [0.1, 1, 10],  
              'gamma': [1, 0.1, 0.01], 
              'kernel': ['rbf']}  
  
# use gridsearch to test all values
sv_gs = GridSearchCV(sv, param_sv, refit=True, verbose=3)
  
# fitting the model for grid search 
sv_gs.fit(X_train, y_train) 

# save best model
sv_best = sv_gs.best_estimator_

# check best C, gamma, and kernel
print(sv_gs.best_params_)



Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.945, total=   3.6s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.6s remaining:    0.0s


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.946, total=   3.7s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    7.3s remaining:    0.0s


[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.946, total=   3.6s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.940, total=   3.6s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.942, total=   3.8s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.941, total=   4.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.4s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.1s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.925, total=   4.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] .

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  2.0min finished


{'C': 10, 'gamma': 1, 'kernel': 'rbf'}
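
The "9 candidates, totalling 27 fits" in the log above follows directly from the grid. The sketch below enumerates the combinations with plain `itertools`; it only illustrates the counting and is not how GridSearchCV is implemented internally. The variable `n_folds` is my own name for the fold count reported in the log.

```python
from itertools import product

param_sv = {"C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01], "kernel": ["rbf"]}
n_folds = 3  # number of folds reported in the log above

# Every combination of the listed values is one candidate setting.
candidates = [dict(zip(param_sv, values)) for values in product(*param_sv.values())]

print(len(candidates))            # 9
print(len(candidates) * n_folds)  # 27
```

Each extra value in the grid multiplies the number of fits, which is why I kept the grid small.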


**Hard voting**


In [8]:

voting_clf = VotingClassifier(
    estimators=[("lr", log_model), ("rf", rf_best), ("svc", sv_best)],
    voting="hard",
)
voting_clf.fit(X_train, y_train)

# refit each classifier and report its accuracy on the test set
for clf in (log_model, rf_best, sv_best, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9428248587570621
RandomForestClassifier 0.9466666666666667
SVC 0.9459887005649718
VotingClassifier 0.9455367231638419
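
As a reminder of what hard voting does, the following sketch takes a majority vote over made-up predictions of three classifiers. The prediction values are hypothetical; only the voting rule is the point.

```python
import numpy as np

# Hypothetical 0/1 predictions of three classifiers for five comments.
preds = np.array([
    [1, 1, 0, 0, 1],  # "lr"
    [1, 0, 0, 1, 1],  # "rf"
    [0, 1, 0, 1, 1],  # "svc"
])

# Majority vote per comment: with three binary voters, class 1 wins with at least two votes.
hard_vote = (preds.sum(axis=0) >= 2).astype(int)
print(hard_vote.tolist())  # [1, 1, 0, 1, 1]
```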


**Soft voting**

In [9]:
soft_svm_clf = SVC(probability=True, C=10, gamma=1, kernel='rbf')

soft_voting_clf = VotingClassifier(
    estimators=[("lr", log_model), ("rf", rf_best), ("svc", soft_svm_clf)],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)

# refit each classifier and report its accuracy on the test set
for clf in (log_model, rf_best, sv_best, soft_voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9428248587570621
RandomForestClassifier 0.9446327683615819
SVC 0.9459887005649718
VotingClassifier 0.944858757062147
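
Soft voting averages the predicted class probabilities instead of counting votes, which is why the SVC needs probability=True. The sketch below uses made-up probabilities to show a case where the two rules disagree; the numbers are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 for four comments.
proba = np.array([
    [0.90, 0.45, 0.40, 0.20],  # "lr"
    [0.80, 0.40, 0.45, 0.10],  # "rf"
    [0.70, 0.95, 0.90, 0.40],  # "svc"
])

# Soft voting: average the probabilities, then pick the more likely class.
soft_vote = (proba.mean(axis=0) > 0.5).astype(int)

# Hard voting on the same models: threshold each model first, then take the majority.
hard_vote = ((proba > 0.5).sum(axis=0) >= 2).astype(int)

print(soft_vote.tolist())  # [1, 1, 1, 0]
print(hard_vote.tolist())  # [1, 0, 0, 0]
```

In the second and third comment one very confident model outweighs two mildly unsure ones under soft voting, while hard voting ignores how confident each model is.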


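Above, my best individual classifier is the random forest with an accuracy of 0.9467 in the hard-voting run, and the second best is the SVC with 0.9460. Because the random forest is refitted without a fixed random_state, its score changes between runs (0.9446 in the soft-voting run). Neither the hard voting classifier (0.9455) nor the soft voting classifier (0.9449) beats the best individual classifier.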

# Bagging

In [10]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9453107344632768

**Out of bag**

In [11]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,
    random_state=40,
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9442018984311185

In [12]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9450847457627118

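Above, I trained a bagging classifier and then tried out-of-bag evaluation. With oob_score=True, the out-of-bag score (0.9442) is slightly lower than the accuracy on the test set (0.9451). This is a reminder that bagging samples with replacement: for any given tree, some training instances are drawn several times and others are not drawn at all, and the out-of-bag score is computed on the latter.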
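
As a rough check on how bootstrap sampling treats instances, the sketch below draws one bootstrap sample of the same size as a training set and counts how often each instance appears. The size `n` and the seed are arbitrary assumptions; this mirrors the second bagging classifier above, where max_samples is left at its default so each tree draws as many instances as the training set contains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical training-set size

# One bootstrap sample: draw n indices with replacement.
sample = rng.integers(0, n, size=n)
counts = np.bincount(sample, minlength=n)

frac_oob = np.mean(counts == 0)       # never drawn: out of bag for this tree
frac_repeated = np.mean(counts >= 2)  # drawn more than once

print(round(frac_oob, 3), round(frac_repeated, 3))
```

For large n the out-of-bag fraction approaches 1/e, roughly 37%, so each tree is evaluated on about a third of the training instances that it never saw.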