<a href="https://colab.research.google.com/github/chadi-aebi/DMML2021_Rolex/blob/main/code/Notebook_Rolex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> UNIL Team Rolex




In this notebook we will proceed as follows:

First we prepare the notebook by importing essential methods and components for text analytics. Then we will start with some preparations such as building a tokenizer with different possible features to apply this later in the classification.

Subsequently, we start the text analytics divided by the different classifiers starting with a baseline calculation.
Each classifier section starts with a classification without any data preprocessing or other features. Then we tune the hyperparameters for the classifier to find the best parameters. After that, we train models that also implement the preprocessing of data and in the end we try out dimensionality reduction and standardisation.

The notebook has the following chapters:



*   0.1.   Preparation to start working - impor necessary methods etc.
*   0.2.   Further preparation for classification


1.     Baseline calculation
2.     Logistic Regression
3.     kNN Classifier
4.     Decision Tree
5.     Random Forests





# 0.1 Preparation to start working - import necessary methods etc.

**Remarks from Slack:** Basically we want to have your baseline solutions in that table. So without any data cleaning and pre-processing, who would the models mentioned in the table would perform (for each model you are also supposed to do hyper-parameter optimization to find the best hyper-parameters). This will give you the baseline accuracies that you can try to improve further by doing data preprocessing/cleaning or by using other models

In [1]:
#Install and update spacy
!pip install -U spacy
#Download the french language model
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.2.0/fr_core_news_sm-3.2.0-py3-none-any.whl (17.4 MB)
[K     |████████████████████████████████| 17.4 MB 1.4 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('fr_core_news_sm')


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import string
import csv
import time

In [3]:
#Classifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree

#Other
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from spacy import displacy
from spacy.lang.fr.stop_words import STOP_WORDS
from spacy.lang.fr import French

# 0.2 Further preparations to starkt with classification

Set random_seed, Vectorizers without preprocessing and load the french language model



In [4]:
np.random_seed = 0

In [5]:
#Set TF-IDF and Count Vectorizer without any more specifications
tfidf_vector = TfidfVectorizer()
count_vector = CountVectorizer()
#with preprocessing
#tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)


In [6]:
#Load the french language model
nlp = spacy.load('fr_core_news_sm')

In [7]:
#Import stop words from french language model and puncutations
stop_words=spacy.lang.fr.stop_words.STOP_WORDS
punctuations = string.punctuation

In [8]:
#Create a tokenizer function that can be used for preprocessing the data for classification - we try out different combinations of the sentence features

def spacy_tokenizer(sentence):
    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = nlp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]
    ## alternative way
    # mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    #mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    # Remove punctuation
    #mytokens = [ word for word in mytokens if word not in punctuations ]

    # Remove anonymous dates and people
    #mytokens = [ word.replace('xx/', '').replace('xxxx/', '').replace('xx', '') for word in mytokens ]
    #mytokens = [ word for word in mytokens if word not in ["xxxx", "xx", ""]]

    # Return preprocessed list of tokens
    return mytokens

We found that stopword removal did not lead to better results. This is well possible because by removing frequent and rather simple words you might remove the majority of words that appear in sentences of A1/A2 difficulty. Without those words it will be difficult to differentiate between sentences that are more sophisticated and those that only stay at a very basic level. 

In [9]:
#Function for model evaluation
def evaluate(true, pred):
    precision = precision_score(true, pred, average= 'macro')
    recall = recall_score(true, pred, average = 'macro')
    f1 = f1_score(true, pred, average = 'macro')
    #print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

<h1>  Getting started - text analytics per classifier


# 1. Baseline

First, we start by calculating the baseline.

In [10]:
data=pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/training_data.csv', index_col='id')
X = data['sentence']
ylabels = data['difficulty']
print(ylabels.value_counts(normalize=True))

A1    0.169375
C2    0.168125
C1    0.166250
B1    0.165625
A2    0.165625
B2    0.165000
Name: difficulty, dtype: float64


# 2. Logistic Regression
<h2> 2.1 Logistic Regression without any data cleaning or tuning

In [10]:
lr_data=pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/training_data.csv', index_col='id')
lr_test_df = pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/unlabelled_test_data.csv', index_col='id')
lr_data.shape

(4800, 2)

In [11]:
X_lr = lr_data['sentence']
ylabels_lr = lr_data['difficulty']

X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(X_lr, ylabels_lr, test_size=0.2, random_state=0, stratify=ylabels_lr)

In [12]:
# Define classifier
lreg = LogisticRegression()

In [83]:
# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', lreg)])

# Fit model on training set
pipe.fit(X_train_lr, y_train_lr)

Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', LogisticRegression())])

In [84]:
# Predictions
y_pred_lr = pipe.predict(X_test_lr)

#accuracy_score(y_test_lr,y_pred_lr)
evaluate(y_test_lr, y_pred_lr)


ACCURACY SCORE:
0.4604
CLASSIFICATION REPORT:
	Precision: 0.4578
	Recall: 0.4595
	F1_Score: 0.4554


This was a first model without any other features. Let's have a look at some wrong predictions to find some hints what could be improved.

In [85]:
df = pd.DataFrame(X_test_lr, columns=["sentence"])
df["actual"] = y_test_lr
df["predicted"] = y_pred_lr

incorrect = df[df["actual"] != df["predicted"]]
incorrect.head()

Unnamed: 0_level_0,sentence,actual,predicted
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2003,Il est également connu pour ses publicités tél...,C1,B2
2585,"Edgar, étincelant de furie, dominait tous les ...",C1,B2
2302,Ils sont heureux.,A1,B2
2958,Les canons renversèrent d'abord à peu près six...,C1,B2
3862,Parce que la philosophie se trouve de plus en ...,C1,C2


In [86]:
submission_test_lr = pd.DataFrame(y_pred_lr, columns=['difficulty'])
submission_test_lr

Unnamed: 0,difficulty
0,C2
1,B2
2,C2
3,A1
4,B2
...,...
955,C1
956,B2
957,C2
958,A2


In [87]:
#submission_test_lr.to_csv('submission_21-12-12.csv')

<h2> 2.2 Logistic Regression with hyperparameters tuning

In [14]:
# Define parameters to test
grid_lr = {
    'C': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100],
    'penalty': ['none', 'l1', 'l2', 'elasticnet'],
    #'max_iter': list(range(100,800,100)),
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

# Define and fit model
lreg = LogisticRegression()
lreg_cv = GridSearchCV(lreg, grid_lr, cv=10)
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', lreg_cv)])

pipe.fit(X_train_lr, y_train_lr)

# Print results
print("Hyperparameters:", lreg_cv.best_params_)
y_pred_lr = pipe.predict(X_test_lr)
evaluate(y_test_lr, y_pred_lr)

  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
  "Setting penalty='none' will ignore the C and l1_ratio parameters"
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/prepro

Hyperparameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
ACCURACY SCORE:
0.4677
CLASSIFICATION REPORT:
	Precision: 0.4640
	Recall: 0.4667
	F1_Score: 0.4631


In [None]:
#safe best parameters
Hyperparameters= {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}

<h2> 2.3 Logistic Regression with preprocessing 

<h3> 2.3.1 Set tokenizer with preprocessing methods

In [15]:
tfidf_vec_lr = TfidfVectorizer(tokenizer=spacy_tokenizer)

In [18]:
# Define classifier
lreg = LogisticRegression(C=10, penalty = 'l2', solver = 'liblinear')

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vec_lr),
                 ('classifier', lreg)])

# Fit model on training set
pipe.fit(X_train_lr, y_train_lr)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x7f2fabbb5170>)),
                ('classifier', LogisticRegression(C=10, solver='liblinear'))])

In [19]:
# Predictions
y_pred_lr = pipe.predict(X_test_lr)

#accuracy_score(y_test_lr,y_pred_lr)
evaluate(y_test_lr, y_pred_lr)

ACCURACY SCORE:
0.4938
CLASSIFICATION REPORT:
	Precision: 0.4899
	Recall: 0.4927
	F1_Score: 0.4899


In [20]:
# Define classifier
lreg = LogisticRegression(C=10, penalty = 'l2', solver = 'liblinear')

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vec_lr),
                 ('classifier', lreg)])

# Fit model on whole data set
pipe.fit(X_lr, ylabels_lr)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x7f2fabbb5170>)),
                ('classifier', LogisticRegression(C=10, solver='liblinear'))])

In [21]:
y_pred_test=pipe.predict(lr_test_df['sentence'])

In [22]:
submission_test = pd.DataFrame(y_pred_test, columns=['difficulty'])
submission_test

Unnamed: 0,difficulty
0,C1
1,A2
2,A1
3,B1
4,C2
...,...
1195,B1
1196,A2
1197,C2
1198,B2


In [23]:
submission_test.to_csv('submission_21-12-15.csv')

<h3> 2.3.2 Word Embeddings

<h3> 2.3.3 Dimensionality Reduction

In [24]:
pca = PCA(n_components=900) #n_components can be varied to try out different models

In [21]:
X_train_vec_lr = tfidf_vector.fit_transform(X_train_lr).toarray()
X_test_vec_lr = tfidf_vector.transform(X_test_lr).toarray()
print(X_train_vec_lr.shape)
X_train_vec_lr

(3840, 12903)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [16]:
#build pipe without Scaler & PCA
scaler = StandardScaler()
pipe = Pipeline([
                 ('lreg', lreg),
                 ])

# Fit model
pipe.fit(X_train_vec_lr, y_train_lr)
print('Train Accuracy: ', round(pipe.score(X_train_vec_lr, y_train_lr), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_lr, y_test_lr), 4))

Train Accuracy:  0.8836
Test Accuracy:  0.4604


In [17]:
#build pipe with  StandardScaler
scaler = StandardScaler()
pipe = Pipeline([
                 ('scaler', scaler),
                 ('lreg', lreg),
                 ])

# Fit model
pipe.fit(X_train_vec_lr, y_train_lr)
print('Train Accuracy: ', round(pipe.score(X_train_vec_lr, y_train_lr), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_lr, y_test_lr), 4))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Train Accuracy:  0.9992
Test Accuracy:  0.3979


In [25]:
#build pipe with PCA 
pipe = Pipeline([
                 ('pca', pca),
                 ('lreg', lreg),
                 ])

# Fit model

pipe.fit(X_train_vec_lr, y_train_lr)
print('Train Accuracy: ', round(pipe.score(X_train_vec_lr, y_train_lr), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_lr, y_test_lr), 4))

Train Accuracy:  0.6768
Test Accuracy:  0.4344


In [26]:
#build pipe with PCA & StandardScaler

pipe = Pipeline([
                 ('scaler', scaler),
                 ('pca', pca),
                 ('lreg', lreg),
                 ])

# Fit model

pipe.fit(X_train_vec_lr, y_train_lr)
print('Train Accuracy: ', round(pipe.score(X_train_vec_lr, y_train_lr), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_lr, y_test_lr), 4))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Train Accuracy:  0.9078
Test Accuracy:  0.3781


# 3. kNN
<h2> 3.1 kNN without any data cleaning or tuning
 

In [27]:
knn_data=pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/training_data.csv', index_col='id')
knn_test_df = pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/unlabelled_test_data.csv', index_col='id')

In [29]:
X_knn = knn_data['sentence']
ylabels_knn = knn_data['difficulty']

X_train_knn, X_test_knn, y_train_knn, y_test_knn = train_test_split(X_knn, ylabels_knn, test_size=0.2, random_state=0, stratify=ylabels_knn)

In [None]:
# Define classifier
knn = KNeighborsClassifier()


In [None]:
# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', knn)])

# Fit model on training set
pipe.fit(X_train_knn, y_train_knn)

Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', KNeighborsClassifier())])

In [None]:
y_pred_knn = pipe.predict(X_test_knn)

accuracy_score(y_test_knn,y_pred_knn)
#evaluate(y_test_knn, y_pred_knn)

0.315625

<h2> 3.2 kNN with hyperparameter tuning


In [79]:
# Define parameters to test

#grid_knn = {'n_neighbors':np.arange(1,100),
 #       'p':np.arange(1,3),
 #       'weights':['uniform','distance']
 #      }

# Define and fit model

#knn = KNeighborsClassifier()
#knn_cv = GridSearchCV(knn, grid, cv=10)
#pipe = Pipeline([('vectorizer', tfidf_vector),
 #                ('classifier', knn_cv)])

#pipe.fit(X_train_knn, y_train_knn)

# Print results

#print("Hyperparameters:", knn_cv.best_params_)
#y_pred_knn = pipe.predict(X_test_knn)
#evaluate(y_test_knn, y_pred_knn)

IndentationError: ignored

We save the best parameters from the GridSearch to use it for further models:

In [None]:
bestparams_knn = {'n_neighbors': 82, 'p': 2, 'weights': 'distance'}

<h2> 3.3 kNN with preprocessing

<h3> 3.3.1 Set tokenizer with preprocessing methods

<h3> 3.3.2 Word Embeddings

<h3> 3.3.3 Dimensionality Reduction

In [None]:
pca = PCA(n_components=200) #n_components can be varied to try out different models

In [None]:
X_train_vec_knn = tfidf_vector.fit_transform(X_train_knn).toarray()
X_test_vec_knn = tfidf_vector.transform(X_test_knn).toarray()
print(X_train_vec_knn.shape)
X_train_vec_knn

(3840, 12903)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
#build pipe without Scaler & PCA
scaler = StandardScaler()
pipe = Pipeline([
                 ('knn', knn),
                 ])

# Fit model
pipe.fit(X_train_vec_knn, y_train_knn)
print('Train Accuracy: ', round(pipe.score(X_train_vec_knn, y_train_knn), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_knn, y_test_knn), 4))

Train Accuracy:  0.8836
Test Accuracy:  0.4604


In [None]:
#build pipe with  StandardScaler
scaler = StandardScaler()
pipe = Pipeline([
                 ('scaler', scaler),
                 ('knn', knn),
                 ])

# Fit model
pipe.fit(X_train_vec_knn, y_train_knn)
print('Train Accuracy: ', round(pipe.score(X_train_vec_knn, y_train_knn), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_knn, y_test_knn), 4))

In [None]:
#build pipe with PCA 
pipe = Pipeline([
                 ('pca', pca),
                 ('knn', knn),
                 ])


# Fit model
pipe.fit(X_train_vec_knn, y_train_knn)
print('Train Accuracy: ', round(pipe.score(X_train_vec_knn, y_train_knn), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_knn, y_test_knn), 4))

In [None]:
#build pipe with PCA & StandardScaler

pipe = Pipeline([
                 ('scaler', scaler),
                 ('pca', pca),
                 ('knn', knn),
                 ])

# Fit model
pipe.fit(X_train_vec_knn, y_train_knn)
print('Train Accuracy: ', round(pipe.score(X_train_vec_knn, y_train_knn), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_knn, y_test_knn), 4))

# 4. Decision Tree
<h2> 4.1 Decision Tree without any data cleaning

In [63]:
tree_data=pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/training_data.csv', index_col='id')
tree_test_df = pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/unlabelled_test_data.csv', index_col='id')

In [64]:
X_tree = tree_data['sentence']
ylabels_tree = tree_data['difficulty']

X_train_tree, X_test_tree, y_train_tree, y_test_tree = train_test_split(X_tree, ylabels_tree, test_size=0.2, random_state=0, stratify=ylabels_tree)

In [69]:
# Define classifier
tree = DecisionTreeClassifier()

In [70]:

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', tree)])

# Fit model on training set
pipe.fit(X_train_tree, y_train_tree)

Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', DecisionTreeClassifier())])

In [71]:
y_pred_tree = pipe.predict(X_test_tree)

accuracy_score(y_test_tree,y_pred_tree)
#evaluate(y_test_tree, y_pred_tree)

0.30104166666666665

<h2> 4.2 Decision Tree with hyperparameter tuning

In [53]:
# Grid Search - hyperparameter tuning


# Define parameters to test
#grid_tree ={"criterion" : ["gini", "entropy"],
            "splitter":["best","random"],
            "max_depth" : [1,5,20,50,100, None],
           "min_samples_leaf":[1,5,20,50,100, None],
           "max_features":["auto","log2","sqrt",None],
           "max_leaf_nodes":[None,10,50,100] }

# Define and fit model
#tree = tree = DecisionTreeClassifier()
#tree_cv = GridSearchCV(tree, grid_tree, cv=10)
#pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', tree_cv)])

#pipe.fit(X_train_tree, y_train_tree)

# Print results
#print("Hyperparameters:", tree_cv.best_params_)
#y_pred_tree = pipe.predict(X_test_tree)
#evaluate(y_test_tree, y_pred_tree)

3840 fits failed out of a total of 23040.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3840 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 241, in fit
    if not 0.0 < self.min_samples_leaf <= 0.5:
TypeError: '<' not supported between instances of 'float' and 'NoneType'



Hyperparameters: {'criterion': 'gini', 'max_depth': 20, 'max_features': None, 'max_leaf_nodes': 100, 'min_samples_leaf': 1, 'splitter': 'best'}
ACCURACY SCORE:
0.3281
CLASSIFICATION REPORT:
	Precision: 0.3274
	Recall: 0.3275
	F1_Score: 0.3133


We save the best parameters from the GridSearch to use it for further models:

In [None]:
bestparams_tree = {'criterion': 'gini', 'max_depth': 20, 'max_features': None, 'max_leaf_nodes': 100, 'min_samples_leaf': 1, 'splitter': 'best'}

<h2> 4.3 Decision Tree with preprocessing

<h3> 4.3.1 Set tokenizer with preprocessing methods


In [65]:
tfidf_vec_tree = TfidfVectorizer(tokenizer=spacy_tokenizer)

In [67]:
# Define classifier
tree = DecisionTreeClassifier(criterion = 'gini', max_depth = 20, max_features = None, max_leaf_nodes= 100, min_samples_leaf = 1, splitter = 'best')

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vec_tree),
                 ('classifier', tree)])

# Fit model on training set
pipe.fit(X_train_tree, y_train_tree)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x7fe32ad6b710>)),
                ('classifier',
                 DecisionTreeClassifier(max_depth=20, max_leaf_nodes=100))])

In [68]:
# Predictions
y_pred_tree = pipe.predict(X_test_tree)

evaluate(y_test_tree, y_pred_tree)

ACCURACY SCORE:
0.3281
CLASSIFICATION REPORT:
	Precision: 0.3392
	Recall: 0.3273
	F1_Score: 0.3224


<h3> 4.3.2 Word embedding


<h3> 4.3.3 Dimensionality Reduction

In [72]:
pca = PCA(n_components=200) #n_components can be varied to try out different models

In [73]:
X_train_vec_tree = tfidf_vector.fit_transform(X_train_tree).toarray()
X_test_vec_tree = tfidf_vector.transform(X_test_tree).toarray()
print(X_train_vec_tree.shape)
X_train_vec_tree

(3840, 12903)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [75]:
#build pipe without Scaler & PCA
scaler = StandardScaler()
pipe = Pipeline([
                 ('tree', tree),
                 ])

# Fit model
pipe.fit(X_train_vec_tree, y_train_tree)
print('Train Accuracy: ', round(pipe.score(X_train_vec_tree, y_train_tree), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_tree, y_test_tree), 4))

Train Accuracy:  0.9992
Test Accuracy:  0.3073


In [76]:
#build pipe with  StandardScaler
scaler = StandardScaler()
pipe = Pipeline([
                 ('scaler', scaler),
                 ('tree', tree),
                 ])
# Fit model
pipe.fit(X_train_vec_tree, y_train_tree)
print('Train Accuracy: ', round(pipe.score(X_train_vec_tree, y_train_tree), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_tree, y_test_tree), 4))

Train Accuracy:  0.9992
Test Accuracy:  0.3052


In [77]:
#build pipe with PCA 
pipe = Pipeline([
                 ('pca', pca),
                 ('tree', tree),
                 ])


# Fit model
pipe.fit(X_train_vec_tree, y_train_tree)
print('Train Accuracy: ', round(pipe.score(X_train_vec_tree, y_train_tree), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_tree, y_test_tree), 4))

Train Accuracy:  0.999
Test Accuracy:  0.2865


In [78]:
#build pipe with PCA & StandardScaler

pipe = Pipeline([
                 ('scaler', scaler),
                 ('pca', pca),
                 ('tree', tree),
                 ])

# Fit model
pipe.fit(X_train_vec_tree, y_train_tree)
print('Train Accuracy: ', round(pipe.score(X_train_vec_tree, y_train_tree), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_tree, y_test_tree), 4))

Train Accuracy:  0.9992
Test Accuracy:  0.2906


# 5. Random Forest
<h2> 5.1 Random Forest without any data cleaning


In [None]:
rf_data=pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/training_data.csv', index_col='id')
rf_test_df = pd.read_csv('https://raw.githubusercontent.com/chadi-aebi/DMML2021_Rolex/main/data/unlabelled_test_data.csv', index_col='id')
rf_test_df.head()

Unnamed: 0_level_0,sentence
id,Unnamed: 1_level_1
0,Nous dûmes nous excuser des propos que nous eû...
1,Vous ne pouvez pas savoir le plaisir que j'ai ...
2,"Et, paradoxalement, boire froid n'est pas la b..."
3,"Ce n'est pas étonnant, car c'est une saison my..."
4,"Le corps de Golo lui-même, d'une essence aussi..."


In [None]:
X_rf = rf_data['sentence']
ylabels_rf = rf_data['difficulty']

X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X_rf, ylabels_rf, test_size=0.2, random_state=0, stratify=ylabels_rf)

In [None]:
X_train_rf_df = pd.DataFrame(X_train_rf)

In [None]:
# Define classifier
rfc = RandomForestClassifier()

In [None]:
# Create pipeline with tfidf
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', rfc)])

# Fit model on training set
pipe.fit(X_train_rf, y_train_rf)

Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', RandomForestClassifier())])

In [None]:
y_pred_rf = pipe.predict(X_test_rf)

#accuracy_score(y_test_rf,y_pred_rf)
evaluate(y_test_rf, y_pred_rf)

0.39791666666666664

<h2> 5.2 Random Forest with Hyperparameter Tuning

In [None]:
#Tuning Hyperparameters with RandmonizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 500, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the grid
grid_rf = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)

{'n_estimators': [500, 666, 833, 1000, 1166, 1333, 1500, 1666, 1833, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}


In [None]:
#Crossvalidation with RandomizedSearchCV
rf_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid_rf, n_iter = 10, cv = 5, verbose=2, random_state=0, n_jobs = -1)

# Create pipeline with tfidf
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', rf_random)])

# Fit model on training set
pipe.fit(X_train_rf, y_train_rf)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


KeyboardInterrupt: ignored

In [None]:
#Save and display best parameters
best_param_1 = rf_random.best_params_
best_param_1

In [None]:
y_pred_rf = pipe.predict(X_test_rf)

evaluate(y_test_rf,y_pred_rf)

In [None]:
# Define and fit model with GridSearchCV
rfc = RandomForestClassifier()
rfc_cv = GridSearchCV(rfc, grid_rfc, cv=10)
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', rfc_cv)])

pipe.fit(X_train_rf, y_train_rf)

# Print results
print("Hyperparameters:", rfc_cv.best_params_)
y_pred_tree = pipe.predict(X_test_rf)
evaluate(y_test_tree, y_pred_rf)

<h2> 5.3 Random Forest with preprocessing

<h3> 5.3.1 Set tokenizer with preprocessing methods

In [None]:
tfidf_vec_rf = TfidfVectorizer(tokenizer=spacy_tokenizer)

In [None]:
# Define classifier with best params - 07.12.2021 #1
#rfc = RandomForestClassifier('bootstrap': False,
 #'max_depth': 80,
 #'max_features': 'auto',
 #'min_samples_leaf': 1,
 #'min_samples_split': 10,
 #'n_estimators': 916)

# Create pipeline with tfidf
#pipe = Pipeline([('vectorizer', count_vector),
                 #('classifier', rfc)])

# Fit model on training set
#pipe.fit(X_train_rf, y_train_rf)

In [None]:
# Define classifier with best params - 07.12.2021 #2
rfc = RandomForestClassifier(bootstrap=False,
 max_depth= 70,
 max_features= 'auto',
 min_samples_leaf= 1,
 min_samples_split= 10,
 n_estimators= 1166)

In [None]:
# Create pipeline with tfidf (Use whole dataset)
pipe = Pipeline([('vectorizer', tfidf_vec_rf),
                 ('classifier', rfc)])

# Fit model on training set
pipe.fit(X_rf, ylabels_rf)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x7f8ce5c10cb0>)),
                ('classifier',
                 RandomForestClassifier(bootstrap=False, max_depth=70,
                                        min_samples_split=10,
                                        n_estimators=1166))])

In [None]:
# Create pipeline with tfidf
pipe = Pipeline([('vectorizer', tfidf_vec_rf),
                 ('classifier', rfc)])

# Fit model on training set
pipe.fit(X_train_rf, y_train_rf)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x7f98feb1c950>)),
                ('classifier',
                 RandomForestClassifier(bootstrap=False, max_depth=70,
                                        min_samples_split=10,
                                        n_estimators=1166))])

In [None]:
y_pred_rf = pipe.predict(X_test_rf)

accuracy_score(y_test_rf,y_pred_rf)

0.465625

In [None]:
precision = precision_score(y_test_rf,y_pred_rf, average=None)
recall = recall_score(y_test_rf,y_pred_rf, average = None)
f1 = f1_score(y_test_rf,y_pred_rf, average = None)
print(precision)

[0.52991453 0.4527027  0.37951807 0.44604317 0.41176471 0.53246753]


In [None]:
#evaluate(y_test_rf, y_pred_rf)

ACCURACY SCORE:
0.4656


TypeError: ignored

In [None]:
y_pred_test=pipe.predict(rf_test_df['sentence'])


In [None]:
submission_test = pd.DataFrame(y_pred_test, columns=['difficulty'])
submission_test

Unnamed: 0,difficulty
0,C2
1,A2
2,A2
3,A2
4,C2
...,...
1195,B1
1196,A2
1197,C2
1198,A1


In [None]:

submission_test.to_csv('submission_21-12-13_2.csv')

<h3> 5.3.2 Word Embeddings

In [None]:
#Vectorizing - Word Embeddings
with nlp.disable_pipes():
    vectors = np.array([nlp(lang.sentence).vector for idx, lang in X_train_rf_df.iterrows()])
    
vectors.shape

(3840, 96)

In [None]:
#pipe = Pipeline([
                # ('rfc', rfc),
                # ])

# Fit model
#start = time.time()
#pipe.fit(vectors, y_train_rf)
#end = time.time()
#print('Time: ', round(end-start, 4))
#print('Train Accuracy: ', round(pipe.score(vectors, y_train_rf), 4))
#print('Test Accuracy: ', round(pipe.score(vectors, y_test_rf), 4))

In [None]:
#Doc2Vec
from gensim.models.doc2vec import TaggedDocument
sample_tagged = rf_data.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['sentence']), tags=[r.difficulty]), axis=1)

In [None]:
train_tagged_rf, test_tagged_rf = train_test_split(sample_tagged, test_size = 0.2, random_state = 0)

In [None]:
import multiprocessing
cores = multiprocessing.cpu_count()

In [None]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epoch=300)
model_dbow.build_vocab([x for x in train_tagged_rf.values])

In [None]:
# Train distributed Bag of Word model
model_dbow.train(train_tagged_rf, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

In [None]:
tagged

In [None]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=100)) for doc in sents])
    regressors = model.infer_vector(doc.words, steps=100)
    return targets, regressors

y_train_rf, X_train_rf = vec_for_learning(model_dbow, train_tagged_rf)
y_test_rf, X_test_rf = vec_for_learning(model_dbow, test_tagged_rf)

In [None]:
# Fit model on training set - same algorithm as before
rfc.fit(X_train_rf, y_train_rf)

# Predictions
y_pred_rf = rfc.predict(X_test_rf)

# Evaluate model
print(round(accuracy_score(y_test_rf, y_pred_rf), 4))
conf_mat = confusion_matrix(y_test_rf, y_pred_rf)
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

<h3> 5.3.3 Dimensionality Reduction

In [None]:
pca = PCA(n_components=200) #n_components can be varied to try out different models

In [None]:
X_train_vec_rf = tfidf_vector.fit_transform(X_train_rf).toarray()
X_test_vec_rf = tfidf_vector.transform(X_test_rf).toarray()
print(X_train_vec_rf.shape)
X_train_vec_rf

In [None]:
#build pipe without Scaler & PCA
scaler = StandardScaler()
pipe = Pipeline([
                 ('rfc', rfc),
                 ])

# Fit model
start = time.time()
pipe.fit(X_train_vec_rf, y_train_rf)
end = time.time()
print('Time: ', round(end-start, 4))
print('Train Accuracy: ', round(pipe.score(X_train_vec_rf, y_train_rf), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_rf, y_test_rf), 4))

In [None]:
#build pipe with  StandardScaler
scaler = StandardScaler()
pipe = Pipeline([
                 ('scaler', scaler),
                 ('rfc', rfc),
                 ])

# Fit model
start = time.time()
pipe.fit(X_train_vec_rf, y_train_rf)
end = time.time()
print('Time: ', round(end-start, 4))
print('Train Accuracy: ', round(pipe.score(X_train_vec_rf, y_train_rf), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_rf, y_test_rf), 4))

In [None]:
#build pipe with PCA 
pipe = Pipeline([
                 ('pca', pca),
                 ('rfc', rfc),
                 ])

# Fit model
start = time.time()
pipe.fit(X_train_vec_rf, y_train_rf)
end = time.time()
print('Time: ', round(end-start, 4))
print('Train Accuracy: ', round(pipe.score(X_train_vec_rf, y_train_rf), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_rf, y_test_rf), 4))

In [None]:
#build pipe with PCA & StandardScaler

pipe = Pipeline([
                 ('scaler', scaler),
                 ('pca', pca),
                 ('rfc', rfc),
                 ])

# Fit model
start = time.time()
pipe.fit(X_train_vec_rf, y_train_rf)
end = time.time()
print('Time: ', round(end-start, 4))
print('Train Accuracy: ', round(pipe.score(X_train_vec_rf, y_train_rf), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_rf, y_test_rf), 4))