# Data Mining and Machine Learning - Project

## Detecting Difficulty Level of French Texts

### Step by step guidelines

The following step-by-step guidelines will help you get started with your project for the Data Mining and Machine Learning class.
To test what you learned in the class, we will hold a competition. You will create a classifier that predicts the difficulty level (A1, ..., C2) of a French text. The team with the highest rank will get some goodies in the last class (souvenirs from tech companies: Amazon, LinkedIn, etc.).

**2 people per team**

Choose a team here:
https://moodle.unil.ch/mod/choicegroup/view.php?id=1305831


#### 1. 📂 Create a public GitHub repository for your team using this naming convention `DMML2022_[your_team_name]` with the following structure:
- data (folder) 
- code (folder) 
- documentation (folder)
- a readme file (.md): *mention team name, participants, brief description of the project, approach, summary of results table and link to the explanatory video (see below).*

All team members should contribute to the GitHub repository.

#### 2. 🇰 Join the competition on Kaggle using the invitation link we sent on Slack.

Under the Team tab, save your team name (`UNIL_your_team_name`) and make sure your team members join in as well. You can merge your user account with your teammates in order to create a team.

#### 3. 📓 Read the data into your colab notebook. There should be one code notebook per team, but all team members can participate and contribute code. 

You can either use the Kaggle API directly with your Kaggle credentials (as explained below; this is **entirely optional**), or download the data from Kaggle and upload it to your team's GitHub repository under the data subfolder.

#### 4. 💎 Train your models and upload the code under your team's GitHub repo. Set the `random_state=0`.
- baseline
- logistic regression with TFidf vectoriser (simple, no data cleaning)
- KNN & hyperparameter optimisation (simple, no data cleaning)
- Decision Tree classifier & hyperparameter optimisation (simple, no data cleaning)
- Random Forests classifier (simple, no data cleaning)
- another technique or combination of techniques of your choice

BE CREATIVE! You can use whatever method you want, in order to climb the leaderboard. The only rule is that it must be your own work. Given that, you can use all the online resources you want. 

#### 5. 🎥 Create a YouTube video (5-10 minutes) of your solution and embed it in your notebook. Explain the algorithms used and the evaluation of your solutions. *Select* projects will also be presented live by the group during the last class.


### Submission details (one per team)

1. Download a ZIP file of your team's repository and submit it on Moodle here. IMPORTANT: in the submission comment, insert a link to the repository on GitHub.
https://moodle.unil.ch/mod/assign/view.php?id=1305833



### Grading (one per team)
- 20% Kaggle Rank
- 50% code quality (using classes, splitting into proper files, documentation, etc)
- 15% GitHub quality (include link to video, table with progress over time, organization of code, images, etc)
- 15% video quality (good sound, good slides, interesting presentation).

## Some further details for points 3 and 4 above.

### 3. Read data into your notebook with the Kaggle API (optional but useful). 

You can also download the data from Kaggle and put it in the data folder of your team's repo.

In [1]:
# reading in the data via the Kaggle API

# mount your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
# Install and update spaCy
!pip install -U spacy

# Download the large French language model
!python -m spacy download fr_core_news_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fr-core-news-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.4.0/fr_core_news_lg-3.4.0-py3-none-any.whl (571.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_lg')


In [3]:
# install 
! pip install kaggle transformers sentencepiece Keras-Preprocessing

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### IMPORTANT
Log into your Kaggle account, go to Account > API > Create new API token. You will obtain a `kaggle.json` file. Save it at the top level of your Google Drive (not inside a folder), since the copy command below expects it at `/content/drive/MyDrive/kaggle.json`.

In [4]:
# -p avoids an error if the directory already exists
!mkdir -p ~/.kaggle


In [5]:
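# Read in your Kaggle credentials from Google Drive
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
# Restrict permissions so the Kaggle CLI does not warn that the key is readable by others
!chmod 600 ~/.kaggle/kaggle.json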

In [6]:
# -p avoids an error if the folder already exists
!mkdir -p data


In [7]:
# download the dataset from the competition page
! kaggle competitions download -c detecting-french-texts-difficulty-level-2022

detecting-french-texts-difficulty-level-2022.zip: Skipping, found more recently modified local copy (use --force to force download)


In [8]:
!unzip "detecting-french-texts-difficulty-level-2022.zip" -d data

Archive:  detecting-french-texts-difficulty-level-2022.zip
replace data/sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: data/sample_submission.csv  
  inflating: data/training_data.csv  
  inflating: data/unlabelled_test_data.csv  


In [9]:
import pandas as pd
import numpy as np
import json
import string
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from tqdm import tqdm, trange

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from keras_preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import CamembertTokenizer, CamembertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from collections import Counter
from sklearn import metrics

In [10]:
# read in data
df = pd.read_csv('/content/data/training_data.csv')
df_pred = pd.read_csv('/content/data/unlabelled_test_data.csv')
df_example_submission = pd.read_csv('/content/data/sample_submission.csv')

Have a look at the data on which to make predictions.

And this is the format for your submissions.

### 4. Train your models

Set your X and y variables and use `random_state=0` throughout.
Split the data into a train and test set using the following parameters: `train_test_split(X, y, test_size=0.2, random_state=0)`.

#### 4.1. Baseline
What is the baseline for this classification problem?

In [None]:
# Call the seeding function; assigning to np.random.seed would overwrite it instead
np.random.seed(0)

In [None]:
base_rate = np.max(df.difficulty.value_counts()/df.difficulty.shape[0]) 

print(f"Base rate:\n{base_rate:.4f}")

Base rate:
0.1694
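
The base rate is the accuracy a model would reach by always predicting the most frequent class. A minimal sketch of that computation, using a small made-up label list (these labels are for illustration only and are not taken from the training data):

```python
from collections import Counter

def base_rate(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical labels for illustration only
toy_labels = ["A1", "A2", "A1", "B1", "C2", "A1", "B2", "C1"]
majority_class, rate = base_rate(toy_labels)
print(f"Majority class: {majority_class}, base rate: {rate:.4f}")
```

With six nearly balanced classes the base rate sits close to 1/6 ≈ 0.1667, which is consistent with the 0.1694 printed above. scikit-learn's `DummyClassifier(strategy='most_frequent')` should give the same baseline.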


#### 4.2. Logistic Regression (without data cleaning)

Train a simple logistic regression model using a Tfidf vectoriser.

In [43]:
X = df['sentence'] 
y = df['difficulty'] 

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [44]:
# Define vectorizer
tfidf = TfidfVectorizer()

# Define classifier
classifier = LogisticRegression() #different solver? number iterations?

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)
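
To make the vectoriser step less of a black box, here is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, followed by L2 normalisation. The whitespace tokenisation and the toy sentences are simplifying assumptions; the real vectoriser uses its own token pattern.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalised TF-IDF vectors with smoothed IDF.

    Assumes lowercasing and whitespace tokenisation, which is simpler than
    TfidfVectorizer's default token pattern.
    """
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for doc in tokenised:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

docs = ["le chat dort", "le chien dort", "le chat mange"]  # toy sentences
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

A word that appears in every sentence (here `le`) ends up with a lower weight than a word that appears in only one sentence (here `chien`), which is the effect the classifier relies on.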

Calculate accuracy, precision, recall and F1 score on the test set.

In [45]:
# Evaluate the model
def evaluate(true, pred):
    """Print and return weighted precision, recall, F1 and accuracy."""
    precision = precision_score(true, pred, average='weighted') # other average options: None, 'micro', 'macro'
    recall = recall_score(true, pred, average='weighted')
    f1 = f1_score(true, pred, average='weighted')
    conf_mat = confusion_matrix(true, pred) # use the arguments, not the global y_test / y_pred
    accuracy = accuracy_score(true, pred)
    print(f"CONFUSION MATRIX:\n{conf_mat}")
    print(f"ACCURACY SCORE:\n{accuracy:.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")
    return precision, recall, f1, accuracy

# Evaluate and save results for table
log_reg_results = evaluate(y_test, y_pred)
log_reg_percision = log_reg_results[0]
log_reg_recall = log_reg_results[1]
log_reg_f1 = log_reg_results[2]
log_reg_accuracy = log_reg_results[3]

CONFUSION MATRIX:
[[93 31 21 10  4  2]
 [54 60 30  6  6  8]
 [12 38 64 17  9 20]
 [ 6  6 15 66 27 24]
 [ 4  4 10 37 73 45]
 [ 7  8  8 19 24 92]]
ACCURACY SCORE:
0.4667
CLASSIFICATION REPORT:
	Precision: 0.4656
	Recall: 0.4667
	F1_Score: 0.4640
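
The weighted scores above can be recomputed by hand from the confusion matrix, which helps to see what `average='weighted'` does: each class's precision, recall and F1 is weighted by the number of true samples in that class. The following sketch uses the matrix printed above; the function name `weighted_scores` is introduced here for illustration.

```python
def weighted_scores(conf_mat):
    """Accuracy and support-weighted precision, recall and F1.

    Rows are true classes, columns are predicted classes.
    """
    n_classes = len(conf_mat)
    total = sum(sum(row) for row in conf_mat)
    support = [sum(row) for row in conf_mat]  # true samples per class
    predicted = [sum(row[j] for row in conf_mat) for j in range(n_classes)]
    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = conf_mat[k][k]
        p_k = tp / predicted[k] if predicted[k] else 0.0
        r_k = tp / support[k] if support[k] else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support[k] / total
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    accuracy = sum(conf_mat[k][k] for k in range(n_classes)) / total
    return accuracy, precision, recall, f1

# Confusion matrix of the first logistic regression run
conf_mat = [
    [93, 31, 21, 10,  4,  2],
    [54, 60, 30,  6,  6,  8],
    [12, 38, 64, 17,  9, 20],
    [ 6,  6, 15, 66, 27, 24],
    [ 4,  4, 10, 37, 73, 45],
    [ 7,  8,  8, 19, 24, 92],
]
accuracy, precision, recall, f1 = weighted_scores(conf_mat)
print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
```

Note that support-weighted recall is always equal to accuracy, which is why the two values coincide in every report in this notebook.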


In [46]:
#Test out different tfidf configurations:

# Create list of configs
def build_configs():
    """Return every [ngram_range, min_df, max_df, analyzer] combination to test."""
    # Define config lists
    ngram_range = [(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)]
    min_df = [1]
    max_df = [1.0]
    analyzer = ['word', 'char']

    # Create config instances (same order as nested loops over the four lists)
    return [[n, i, j, a]
            for n in ngram_range
            for i in min_df
            for j in max_df
            for a in analyzer]

# The function keeps its own name so it is not overwritten and the cell can be re-run
configs = build_configs()

In [None]:
# Define list for result
result = []

for config in configs:

    # Redefine vectorizer
    tfidf_vector = TfidfVectorizer(ngram_range=config[0],
                                   min_df=config[1], max_df=config[2], analyzer=config[3])

    # Define classifier
    classifier = LogisticRegression() # default settings; raising max_iter (e.g. max_iter=1000) is one way to address a ConvergenceWarning

    # Create pipeline
    pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

    # Fit model on training set
    pipe.fit(X_train, y_train)

    # Predictions
    y_pred = pipe.predict(X_test)

    # Print accuracy on test set
    print("CONFIG: ", config)
    evaluate(y_test, y_pred)
    print("-----------------------")

    # Append to result
    result.append([config, accuracy_score(y_test, y_pred)])

CONFIG:  [(1, 1), 1, 1.0, 'word']
CONFUSION MATRIX:
[[93 31 21 10  4  2]
 [54 60 30  6  6  8]
 [12 38 64 17  9 20]
 [ 6  6 15 66 27 24]
 [ 4  4 10 37 73 45]
 [ 7  8  8 19 24 92]]
ACCURACY SCORE:
0.4667
CLASSIFICATION REPORT:
	Precision: 0.4656
	Recall: 0.4667
	F1_Score: 0.4640
-----------------------


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


CONFIG:  [(1, 1), 1, 1.0, 'char']
CONFUSION MATRIX:
[[84 31 17 15  5  9]
 [42 60 35 10  6 11]
 [16 35 62 18 12 17]
 [ 6  6 11 42 42 37]
 [ 5  3 13 45 43 64]
 [ 5  4  8 27 34 80]]
ACCURACY SCORE:
0.3865
CLASSIFICATION REPORT:
	Precision: 0.3888
	Recall: 0.3865
	F1_Score: 0.3846
-----------------------
CONFIG:  [(1, 2), 1, 1.0, 'word']
CONFUSION MATRIX:
[[ 87  32  23   9   3   7]
 [ 50  60  33   4   7  10]
 [ 17  36  63  11  11  22]
 [  6   3  13  56  25  41]
 [  2   4  13  19  67  68]
 [  6   5   8  12  19 108]]
ACCURACY SCORE:
0.4594
CLASSIFICATION REPORT:
	Precision: 0.4653
	Recall: 0.4594
	F1_Score: 0.4541
-----------------------
CONFIG:  [(1, 2), 1, 1.0, 'char']
CONFUSION MATRIX:
[[104  26  19   5   4   3]
 [ 49  68  32   7   1   7]
 [ 14  26  75  25   9  11]
 [  5   6   8  60  26  39]
 [  6   1  11  43  59  53]
 [  6   9   3  25  30  85]]
ACCURACY SCORE:
0.4698
CLASSIFICATION REPORT:
	Precision: 0.4723
	Recall: 0.4698
	F1_Score: 0.4670
-----------------------
CONFIG:  [(1, 3), 1, 1.0, 'word']
CONFUSION MATRIX:
[[ 84  33  23   8   3  10]
 [ 52  57  31   3   7  14]
 [ 18  31  58  11  10  32]
 [  4   3  10  47  20  60]
 [  2   3  13  16  59  80]
 [  5   6   6  11  17 113]]
ACCURACY SCORE:
0.4354
CLASSIFICATION REPORT:
	Precision: 0.4524
	Recall: 0.4354
	F1_Score: 0.4282
-----------------------
CONFIG:  [(1, 3), 1, 1.0, 'char']
CONFUSION MATRIX:
[[107  28  17   5   4   0]
 [ 50  69  37   6   1   1]
 [ 18  35  73  21   8   5]
 [  4   8   6  62  30  34]
 [  6   3   9  43  67  45]
 [  6   7   1  29  27  88]]
ACCURACY SCORE:
0.4854
CLASSIFICATION REPORT:
	Precision: 0.4855
	Recall: 0.4854
	F1_Score: 0.4828
-----------------------
CONFIG:  [(2, 2), 1, 1.0, 'word']
CONFUSION MATRIX:
[[75 38 24 17  4  3]
 [45 56 31 10 11 11]
 [25 31 57 19  9 19]
 [ 9 10 19 52 20 34]
 [ 4  4 15 40 59 51]
 [ 5  8 12 32 24 77]]
ACCURACY SCORE:
0.3917
CLASSIFICATION REPORT:
	Precision: 0.3970
	Recall: 0.3917
	F1_Score: 0.3913
-----------------------
CONFIG:  [(2, 2), 1, 1.0, 'char']
CONFUSION MATRIX:
[[105  29  15   5   6   1]
 [ 50  66  32   8   2   6]
 [ 18  24  77  23   9   9]
 [  4   8  10  61  28  33]
 [  8   3  11  40  63  48]
 [  6  10   4  25  32  81]]
ACCURACY SCORE:
0.4719
CLASSIFICATION REPORT:
	Precision: 0.4713
	Recall: 0.4719
	F1_Score: 0.4690
-----------------------
CONFIG:  [(2, 3), 1, 1.0, 'word']
CONFUSION MATRIX:
[[75 35 24 17  4  6]
 [46 55 31 11 11 10]
 [25 27 58 18 11 21]
 [ 9  8 20 51 18 38]
 [ 4  4 18 42 56 49]
 [ 5  9 10 29 24 81]]
ACCURACY SCORE:
0.3917
CLASSIFICATION REPORT:
	Precision: 0.3968
	Recall: 0.3917
	F1_Score: 0.3903
-----------------------
CONFIG:  [(2, 3), 1, 1.0, 'char']
CONFUSION MATRIX:
[[110  28  16   4   3   0]
 [ 54  69  33   6   1   1]
 [ 19  30  75  24   8   4]
 [  5   8   8  62  32  29]
 [  8   4   6  45  64  46]
 [  6   8   2  30  26  86]]
ACCURACY SCORE:
0.4854
CLASSIFICATION REPORT:
	Precision: 0.4865
	Recall: 0.4854
	F1_Score: 0.4823
-----------------------
CONFIG:  [(3, 3), 1, 1
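
To compare the runs without scrolling through the log, the accuracies can be ranked directly. The sketch below hardcodes the scores from the output above; the two `(3, 3)` runs are left out because the log is cut off before their scores appear. In the notebook itself, the `result` list built in the loop would hold the same information.

```python
# Accuracies copied from the log above: ((ngram_range, analyzer), accuracy)
logged = [
    (((1, 1), "word"), 0.4667),
    (((1, 1), "char"), 0.3865),
    (((1, 2), "word"), 0.4594),
    (((1, 2), "char"), 0.4698),
    (((1, 3), "word"), 0.4354),
    (((1, 3), "char"), 0.4854),
    (((2, 2), "word"), 0.3917),
    (((2, 2), "char"), 0.4719),
    (((2, 3), "word"), 0.3917),
    (((2, 3), "char"), 0.4854),
]

best_score = max(score for _, score in logged)
# Keep every configuration that reaches the best score, since ties are possible
best_configs = [cfg for cfg, score in logged if score == best_score]
print(best_score, best_configs)
```

Two character-level configurations tie on accuracy here, so a secondary metric such as F1, or a cross-validated score, could serve as a tie-breaker.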

Have a look at the confusion matrix and identify a few examples of sentences that are not well classified.

In [47]:
#re-evaluate:

# Define vectorizer
tfidf = TfidfVectorizer(ngram_range=(2,3), min_df = 1, max_df = 1.0, analyzer='char')

# Define classifier
classifier = LogisticRegression() #different solver? number iterations?

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluate and save results for table
log_reg_results = evaluate(y_test, y_pred)
log_reg_percision = log_reg_results[0]
log_reg_recall = log_reg_results[1]
log_reg_f1 = log_reg_results[2]
log_reg_accuracy = log_reg_results[3]

CONFUSION MATRIX:
[[110  28  16   4   3   0]
 [ 54  69  33   6   1   1]
 [ 19  30  75  24   8   4]
 [  5   8   8  62  32  29]
 [  8   4   6  45  64  46]
 [  6   8   2  30  26  86]]
ACCURACY SCORE:
0.4854
CLASSIFICATION REPORT:
	Precision: 0.4865
	Recall: 0.4854
	F1_Score: 0.4823


In [None]:
misclassified_samples = X_test[y_test != y_pred]
misclassified_samples

2255    C'est en décembre 1967, après bien des invecti...
608     Giscard va pourtant réussir à transformer ce r...
2856    Un choix difficile mais important : le public ...
1889    Le débat porte plutôt sur l'utilité d'une tell...
2250    Vous eussiez juré que les gens la voyaient, l'...
                              ...                        
1450    Le slogan "100 % de nos clients achètent nos p...
3944    En été, j'aime faire de la randonnée et vous, ...
891     Très présente dans l'alimentation antillaise, ...
1005    On réinvente le dimanche dans une perspective ...
1940    Pour les femmes surtout, nuancent Régine Lemoi...
Name: sentence, Length: 494, dtype: object
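
Besides reading individual sentences, it can help to list which pairs of levels are confused most often. The sketch below ranks the off-diagonal cells of the confusion matrix printed above. The label order A1 to C2 is an assumption based on scikit-learn sorting class labels alphabetically, and `top_confusions` is a helper name introduced here.

```python
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]  # assumed row/column order

# Confusion matrix of the re-evaluated logistic regression (rows = true level)
conf_mat = [
    [110, 28, 16,  4,  3,  0],
    [ 54, 69, 33,  6,  1,  1],
    [ 19, 30, 75, 24,  8,  4],
    [  5,  8,  8, 62, 32, 29],
    [  8,  4,  6, 45, 64, 46],
    [  6,  8,  2, 30, 26, 86],
]

def top_confusions(conf_mat, labels, n=3):
    """Return the n largest off-diagonal cells as (count, true, predicted)."""
    pairs = [(conf_mat[i][j], labels[i], labels[j])
             for i in range(len(labels))
             for j in range(len(labels))
             if i != j]
    return sorted(pairs, reverse=True)[:n]

top = top_confusions(conf_mat, labels)
for count, true_label, pred_label in top:
    print(f"{count} sentences of level {true_label} predicted as {pred_label}")
```

Under that label order, the most frequent errors fall between adjacent levels, which suggests looking first at misclassified sentences from those pairs.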

Generate your first predictions on `unlabelled_test_data.csv`. Make sure your predictions match the format of `sample_submission.csv`.

In [None]:
unseen_test = df_pred['sentence']
unseen_test

0       Nous dûmes nous excuser des propos que nous eû...
1       Vous ne pouvez pas savoir le plaisir que j'ai ...
2       Et, paradoxalement, boire froid n'est pas la b...
3       Ce n'est pas étonnant, car c'est une saison my...
4       Le corps de Golo lui-même, d'une essence aussi...
                              ...                        
1195    C'est un phénomène qui trouve une accélération...
1196    Je vais parler au serveur et voir si on peut d...
1197    Il n'était pas comme tant de gens qui par pare...
1198        Ils deviennent dangereux pour notre économie.
1199    Son succès a généré beaucoup de réactions néga...
Name: sentence, Length: 1200, dtype: object

In [None]:
unseen_test_data_pred = pipe.predict(unseen_test)
df_pred['difficulty'] = unseen_test_data_pred
df_pred_results = df_pred[[ "id","difficulty"]]
df_pred_results

Unnamed: 0,id,difficulty
0,0,C2
1,1,B1
2,2,A1
3,3,B1
4,4,C2
...,...,...
1195,1195,B1
1196,1196,A2
1197,1197,C2
1198,1198,A2
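
To turn such a table into a file for upload, the dataframe can be written as CSV without the pandas row index. This is a minimal sketch with a hypothetical three-row dataframe standing in for `df_pred_results`; it assumes the competition expects exactly the columns `id` and `difficulty`.

```python
import io
import pandas as pd

# Hypothetical predictions standing in for df_pred_results
df_pred_results = pd.DataFrame({"id": [0, 1, 2],
                                "difficulty": ["C2", "B1", "A1"]})

# index=False keeps the pandas row index out of the file, so the CSV
# contains only the id and difficulty columns
buffer = io.StringIO()
df_pred_results.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In the notebook, passing a file name (for example a hypothetical `submission.csv`) instead of the buffer writes the file to disk.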


#### 4.3. KNN (without data cleaning)

Train a KNN classification model using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [48]:
# Define classifier
classifier = KNeighborsClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation - test set
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[110  34   8   0   1   8]
 [ 65  55  12   6   3  23]
 [ 32  34  25   7   8  54]
 [  7   7   3  26  12  89]
 [  6   2   6   8  21 130]
 [  8   4   3   7  10 126]]
ACCURACY SCORE:
0.3781
CLASSIFICATION REPORT:
	Precision: 0.4124
	Recall: 0.3781
	F1_Score: 0.3390


(0.4123541248033652, 0.378125, 0.3390151728554922, 0.378125)

Try to improve it by tuning the hyperparameters (`n_neighbors`, `p`, `weights`).

In [None]:
# Grid Search - hyperparameter tuning

# Define parameters to test
grid = {'n_neighbors':np.arange(1,100),
        'p':np.arange(1,3),
        'weights':['uniform','distance']
       }

# Define and fit model
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, grid, cv=10)


# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', knn_cv)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Print results
print("Hyperparameters:", knn_cv.best_params_)

Hyperparameters: {'n_neighbors': 2, 'p': 2, 'weights': 'uniform'}
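
To see what the three tuned hyperparameters control, here is a small pure-Python sketch of a k-nearest-neighbours vote: `n_neighbors` is how many neighbours vote, `p` selects the Minkowski distance (1 = Manhattan, 2 = Euclidean), and `weights` decides whether closer neighbours count more. The points and labels are made up, and the handling of ties and zero distances is a simplification of what scikit-learn does.

```python
from collections import defaultdict

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_X, train_y, query, n_neighbors=2, p=2, weights="uniform"):
    """Predict the label of query from its nearest training points."""
    dists = sorted((minkowski(x, query, p), label)
                   for x, label in zip(train_X, train_y))
    votes = defaultdict(float)
    for dist, label in dists[:n_neighbors]:
        if weights == "uniform":
            votes[label] += 1.0
        else:  # "distance": closer neighbours get a larger vote
            votes[label] += 1.0 / (dist + 1e-12)
    # Simplification: ties go to the label that sorts first
    return max(sorted(votes), key=votes.get)

# Hypothetical 2-D points for illustration
train_X = [(2.1, 2.0), (0.0, 2.0), (2.0, 0.0), (10.0, 10.0)]
train_y = ["C2", "A1", "A1", "C2"]
query = (2.0, 2.0)

uniform_pred = knn_predict(train_X, train_y, query, n_neighbors=3, weights="uniform")
distance_pred = knn_predict(train_X, train_y, query, n_neighbors=3, weights="distance")
print(uniform_pred, distance_pred)
```

With uniform weights the two farther `A1` points outvote the single very close `C2` point, while distance weighting lets the close point win.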


In [49]:
# Fit optimal KNN model
knn = KNeighborsClassifier(n_neighbors=2, p=2, weights='uniform')

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', knn)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluate and save results for table
knn_results = evaluate(y_test, y_pred)
knn_percision = knn_results[0]
knn_recall = knn_results[1]
knn_f1 = knn_results[2]
knn_accuracy = knn_results[3]

CONFUSION MATRIX:
[[123  25  10   1   0   2]
 [ 75  57  15   6   3   8]
 [ 51  38  29   8  10  24]
 [ 11  20  16  39  13  45]
 [ 11   5   9  21  46  81]
 [ 13  15   4  15  22  89]]
ACCURACY SCORE:
0.3990
CLASSIFICATION REPORT:
	Precision: 0.4037
	Recall: 0.3990
	F1_Score: 0.3767


#### 4.4. Decision Tree Classifier (without data cleaning)

Train a Decision Tree classifier, using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [50]:
# Define classifier
classifier = DecisionTreeClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation - test set
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[66 44 23 12  9  7]
 [38 40 38 23 11 14]
 [15 31 43 26 23 22]
 [12  8 25 30 40 29]
 [ 8 11 24 37 53 40]
 [ 9 11 19 26 41 52]]
ACCURACY SCORE:
0.2958
CLASSIFICATION REPORT:
	Precision: 0.2989
	Recall: 0.2958
	F1_Score: 0.2970


(0.29894838075229746,
 0.29583333333333334,
 0.2969798534299308,
 0.29583333333333334)

Try to improve it by tuning the hyperparameter `max_depth` (the depth of the decision tree).

In [None]:
# Grid Search - tuning tree depth

# Define parameter to test
grid = {'max_depth':np.arange(1,7)}

# Define and fit model

tree = DecisionTreeClassifier()
tree_cv = GridSearchCV(tree, grid, cv=5)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', tree_cv)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Print results
print("Hyperparameters:", tree_cv.best_params_)

Hyperparameters: {'max_depth': 6}
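
An alternative layout, sketched here under assumptions, wraps the whole pipeline in `GridSearchCV` instead of placing the search inside the pipeline. The vectoriser is then refit within each cross-validation fold, and its parameters could be tuned in the same search. The toy corpus, labels and grid values below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy corpus for illustration only
texts = ["le chat dort", "le chien mange", "je suis ici", "tu es là",
         "nonobstant les circonstances atténuantes",
         "quoique nous eussions préféré",
         "la conjoncture économique demeure incertaine",
         "il eût fallu que nous partissions"]
labels = ["A1", "A1", "A1", "A1", "C2", "C2", "C2", "C2"]

pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                 ("classifier", DecisionTreeClassifier(random_state=0))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"classifier__max_depth": [1, 2, 3]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print("Hyperparameters:", search.best_params_)
```

In the run above, the selected depth of 6 is the largest value in the grid, so a wider range might be worth trying.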


In [51]:
# Fit optimal tree model
classifier = DecisionTreeClassifier(max_depth=6)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluate and save results for table
tree_results = evaluate(y_test, y_pred)
tree_percision = tree_results[0]
tree_recall = tree_results[1]
tree_f1 = tree_results[2]
tree_accuracy = tree_results[3]

CONFUSION MATRIX:
[[99 19 25 14  4  0]
 [59 39 29 25  6  6]
 [29 23 42 42 12 12]
 [ 9  6 20 52 40 17]
 [ 7  5 36 41 68 16]
 [ 9  3 25 41 54 26]]
ACCURACY SCORE:
0.3396
CLASSIFICATION REPORT:
	Precision: 0.3464
	Recall: 0.3396
	F1_Score: 0.3305


#### 4.5. Random Forest Classifier (without data cleaning)

Try a Random Forest Classifier, using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [52]:
# Define classifier
classifier = RandomForestClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation - test set
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[114  27  13   6   1   0]
 [ 63  61  25  10   2   3]
 [ 25  32  54  34   9   6]
 [  8   6  16  60  28  26]
 [  8   9  13  40  66  37]
 [  7  14   8  25  39  65]]
ACCURACY SCORE:
0.4375
CLASSIFICATION REPORT:
	Precision: 0.4362
	Recall: 0.4375
	F1_Score: 0.4317


(0.4362195592957977, 0.4375, 0.43168945231089956, 0.4375)

In [None]:
# Grid Search - tuning the depth of the trees in the forest

# Define parameter to test
grid = {'max_depth':np.arange(1,7)}

# Define and fit model

forest = RandomForestClassifier()
forest_cv = GridSearchCV(forest, grid, cv=5)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', forest_cv)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Print results
print("Hyperparameters:", forest_cv.best_params_)

Hyperparameters: {'max_depth': 6}


In [53]:
# Fit optimal forest model
classifier = RandomForestClassifier(max_depth=6, random_state=0)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluate and save results for table
forest_results = evaluate(y_test, y_pred)
forest_percision = forest_results[0]
forest_recall = forest_results[1]
forest_f1 = forest_results[2]
forest_accuracy = forest_results[3]

CONFUSION MATRIX:
[[130   9  17   1   3   1]
 [ 98  31  16  14   3   2]
 [ 38  29  37  34  12  10]
 [ 14   8   7  45  32  38]
 [ 16   6  10  26  53  62]
 [ 15   5   7  19  27  85]]
ACCURACY SCORE:
0.3969
CLASSIFICATION REPORT:
	Precision: 0.3886
	Recall: 0.3969
	F1_Score: 0.3723


#### 4.6. Any other technique, including data cleaning if necessary

Try to improve accuracy by training a better model using the techniques seen in class, or combinations of them.

As usual, show the accuracy, precision, recall and f1 score on the test set.

## Model: Data Augmentation and Ensemble Classifier

In [54]:
# Data augmentation (adding engineered features):

nlp = spacy.load('fr_core_news_lg')
punctuations = string.punctuation
stop_words = spacy.lang.fr.stop_words.STOP_WORDS

# Entity Recognition
def add_entity(sentence):
    # Run the spaCy pipeline on the sentence
    doc = nlp(sentence)
    # Return text and label for each named entity
    return [(i.text, i.label_) for i in doc.ents]

# Part of speech
def add_POS(sentence):
    # Run the spaCy pipeline on the sentence
    doc = nlp(sentence)
    # Return each token with its part-of-speech tag
    return [(i, i.pos_) for i in doc]

def entity_count(sentence: str):
    # Count entity labels in the sentence
    ner = add_entity(sentence)
    count = Counter([i[1] for i in ner])
    return count

def POS_count(sentence: str):
    # Count part-of-speech tags in the sentence
    pos = add_POS(sentence)
    count = Counter([i[1] for i in pos])
    return count

def data_augmentation(dataframe: pd.DataFrame):
    dataframe["num_words"] = dataframe["sentence"].apply(lambda x: len(x.split())) # number of words per sentence
    dataframe["avg_word_length"] = dataframe['sentence'].apply(lambda x: np.sum([len(w) for w in x.split()]) / len(x.split())) # average word length
    dataframe['num_stopwords'] = dataframe['sentence'].apply(lambda x: np.sum([1 for word in x.split(' ') if word in stop_words])) # number of stop words
    dataframe['ratio_of_stopwords'] = dataframe['num_stopwords'] / dataframe['num_words'] # share of stop words (stop words / words)

    # Iterate over each row of the dataframe passed in (not the global df)
    for index, row in dataframe.iterrows():
        # Part-Of-Speech: one column per tag
        counter_pos = POS_count(row['sentence'])
        for tag in counter_pos:
            dataframe.loc[index, tag] = counter_pos[tag]

        # Entity: one column per entity label
        counter_ner = entity_count(row['sentence'])
        for label in counter_ner:
            dataframe.loc[index, label] = counter_ner[label]

    return dataframe.fillna(0)


In [55]:
#remove stop words and punctuation:

# Define tokenizer function
def spacy_tokenizer(sentence):

    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = nlp(sentence)

    # Convert each token into lowercase
    mytokens = [ word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Return preprocessed list of tokens
    return mytokens

In [56]:
df_aug = data_augmentation(df) #add data augmentation to full data set

In [57]:
# tokenizer 
count = CountVectorizer(ngram_range=(1,2), tokenizer=spacy_tokenizer) 
df_bow = count.fit_transform(df_aug.sentence)

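# create the bow dataframe (get_feature_names_out replaces get_feature_names, which newer scikit-learn versions no longer provide)
df_bow = pd.DataFrame(df_bow.toarray(), columns = count.get_feature_names_out())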



In [58]:
# Split the bag-of-words dataframe along the same train/test indices as before.
# Note: only the bag-of-words columns are used here; the engineered columns in df_aug are not joined in.
X_train_bow = df_bow.loc[X_train.index]
X_test_bow = df_bow.loc[X_test.index]

In [59]:
#Logistic regression:

log_reg = LogisticRegression()
log_reg.fit(X_train_bow, y_train)

#prediction:
y_pred = log_reg.predict(X_test_bow)

# Evaluation - test set:
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[107  39  13   1   1   0]
 [ 78  63  20   1   1   1]
 [ 59  38  50   7   1   5]
 [ 41  21  24  47   5   6]
 [ 52  22  21  27  31  20]
 [ 41  17  17  12  17  54]]
ACCURACY SCORE:
0.3667
CLASSIFICATION REPORT:
	Precision: 0.4361
	Recall: 0.3667
	F1_Score: 0.3610


(0.4360684922903896,
 0.36666666666666664,
 0.3609991276306834,
 0.36666666666666664)

In [60]:
tree_clf = DecisionTreeClassifier(random_state=0)
log_clf = LogisticRegression(random_state=0)
knn_clf = KNeighborsClassifier()
forest_clf = RandomForestClassifier(random_state=0)

# Training, predicting, then evaluating the predictions
# of all four models

for clf in (tree_clf, log_clf, knn_clf, forest_clf):
    clf.fit(X_train_bow, y_train) # training
    y_pred = clf.predict(X_test_bow) # predicting
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred)) # evaluating

DecisionTreeClassifier 0.334375
LogisticRegression 0.36666666666666664
KNeighborsClassifier 0.17291666666666666
RandomForestClassifier 0.2989583333333333


In [61]:
# Combining the four models into an ensemble
from sklearn.ensemble import VotingClassifier

# The ensemble is a voting classifier that aggregates our four models
voting_clf = VotingClassifier(estimators=[('tree', tree_clf), ('log', log_clf), ('knn', knn_clf), ('forest', forest_clf)], 
                             voting='hard')

voting_clf.fit(X_train_bow, y_train) # training
y_pred_voting = voting_clf.predict(X_test_bow) # predicting
evaluate(y_test, y_pred_voting) # evaluating

CONFUSION MATRIX:
[[136  19   6   0   0   0]
 [119  36   8   0   0   1]
 [108  21  27   2   0   2]
 [ 82  12  12  25   5   8]
 [ 83  12  17  18  21  22]
 [ 79  10  10   6  10  43]]
ACCURACY SCORE:
0.3000
CLASSIFICATION REPORT:
	Precision: 0.4215
	Recall: 0.3000
	F1_Score: 0.2770


(0.4215050559580921, 0.3, 0.27695075736976005, 0.3)
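
Hard voting simply takes the most common prediction per sample. The sketch below illustrates this with made-up predictions from four models; it breaks ties by taking the label that sorts first, which mirrors how scikit-learn's `VotingClassifier` resolves ties in hard voting.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote per sample; ties go to the label that sorts first."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        counts = Counter(sample_preds)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted

# Hypothetical predictions from four models on three sentences
tree_preds   = ["A1", "B2", "C1"]
log_preds    = ["A2", "B2", "C2"]
knn_preds    = ["A1", "A1", "A1"]
forest_preds = ["A2", "B1", "C2"]

ensemble_preds = hard_vote([tree_preds, log_preds, knn_preds, forest_preds])
print(ensemble_preds)
```

With an even number of voters, ties are likely, and a weak member can pull the vote toward its preferred class. This may contribute to the ensemble scoring below its best individual model and to the heavy first column in the confusion matrix above.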

In [62]:
# Evaluate and save results for table
ensemble_results = evaluate(y_test, y_pred_voting)
ensemble_percision = ensemble_results[0]
ensemble_recall = ensemble_results[1]
ensemble_f1 = ensemble_results[2]
ensemble_accuracy = ensemble_results[3]

CONFUSION MATRIX:
[[136  19   6   0   0   0]
 [119  36   8   0   0   1]
 [108  21  27   2   0   2]
 [ 82  12  12  25   5   8]
 [ 83  12  17  18  21  22]
 [ 79  10  10   6  10  43]]
ACCURACY SCORE:
0.3000
CLASSIFICATION REPORT:
	Precision: 0.4215
	Recall: 0.3000
	F1_Score: 0.2770


## Pre-trained CamemBERT Model

In [12]:
#Use pre-trained CamemBERT model

from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df['LE_label'] = LE.fit_transform(df['difficulty'])
df.head()

Unnamed: 0,id,sentence,difficulty,LE_label
0,0,Les coûts kilométriques réels peuvent diverger...,C1,4
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",A1,0
2,2,Le test de niveau en français est sur le site ...,A1,0
3,3,Est-ce que ton mari est aussi de Boston?,A1,0
4,4,"Dans les écoles de commerce, dans les couloirs...",B1,2
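
The integer codes follow the sorted order of the level names, so A1 maps to 0 and C2 to 5. A pure-Python sketch of the same mapping and its inverse, which is needed later to turn model outputs back into level names for a submission. The first five labels match the rows shown above; the last three are added for illustration so that all six levels appear.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer following sorted order."""
    classes = sorted(set(labels))
    to_int = {label: i for i, label in enumerate(classes)}
    return classes, to_int

difficulties = ["C1", "A1", "A1", "A1", "B1", "A2", "B2", "C2"]
classes, to_int = fit_label_mapping(difficulties)

encoded = [to_int[d] for d in difficulties]   # level name -> integer
decoded = [classes[i] for i in encoded]       # integer -> level name
print(encoded)
print(decoded)
```

With the fitted `LabelEncoder`, `LE.inverse_transform` performs the decoding step.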


In [13]:
X = df['sentence'] 
y = df['LE_label'] 

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [14]:
X_train_list = X_train.values.tolist()
X_test_list = X_test.values.tolist()
y_train_list = y_train.values.tolist()
y_test_list = y_test.values.tolist()

X_full = df['sentence'].tolist() #this was our best model, so we will train on the entire dataset
y_full = df['LE_label'].tolist()

In [15]:
# Initialize CamemBERT tokenizer
tokenizer = CamembertTokenizer.from_pretrained("camembert-base",do_lower_case=True)

In [16]:
def preprocess(raw_texts, labels=None):
    """
    Tokenize a list of texts for the CamemBERT model.

    Input: list of texts and, when training, the list of encoded difficulty labels
    Returns: input_ids and attention_mask tensors, plus a labels tensor if labels were given
    """
    encoded_batch = tokenizer.batch_encode_plus(raw_texts,
                                                add_special_tokens=True,
                                                padding='longest', # pad every sentence to the length of the longest one (all rows of a tensor must have the same length)
                                                return_attention_mask=True,
                                                return_tensors='pt') # the CamemBERT model expects PyTorch tensors
    if labels is not None: # labels are passed for training/evaluation; unseen data has none
        labels = torch.tensor(labels)
        return encoded_batch['input_ids'], encoded_batch['attention_mask'], labels
    return encoded_batch['input_ids'], encoded_batch['attention_mask']
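
To make the effect of `padding='longest'` and `return_attention_mask=True` concrete, here is a minimal pure-Python sketch of the same idea. The helper `pad_to_longest`, the toy token ids, and the `pad_id` value are all hypothetical placeholders for illustration; the real tokenizer handles this internally.

```python
def pad_to_longest(sequences, pad_id):
    """Right-pad each list of token ids to the longest length and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # padded ids
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Two toy "sentences" of different lengths (made-up token ids).
toy_ids = [[5, 10, 11, 6], [5, 12, 6]]
ids, mask = pad_to_longest(toy_ids, pad_id=1)
print(ids)   # every row now has the same length
print(mask)  # zeros mark the positions the model should ignore
```

The attention mask is what lets the model ignore the padded positions, so the choice of padding value does not affect the prediction.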

In [17]:
#Combine input ids, labels, and attention masks for:

#training set
input_ids, attention_mask, labels_train = preprocess(X_train_list, y_train_list)
train_dataset = TensorDataset(input_ids, attention_mask, labels_train)

#full dataset (this is the final model, so it is trained on the full dataset)
input_ids, attention_mask, labels_full = preprocess(X_full, y_full)
full_dataset = TensorDataset(input_ids, attention_mask, labels_full)

In [32]:
batch_size = 4 # options: 4, 8, 16, 32 (32 was used for the Kaggle submission)

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, sampler = RandomSampler(train_dataset), batch_size = batch_size)
full_dataloader = DataLoader(full_dataset, sampler = SequentialSampler(full_dataset), batch_size = batch_size)

In [33]:
# Load pretrained camemBERT model 
model = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=6)

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForSequenceClassification: ['lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.dense.bias', 

In [34]:
optimizer = AdamW(model.parameters(),lr = 2e-5, eps = 1e-8 )

In [35]:
import gc 
gc.collect()
torch.cuda.empty_cache()

In [36]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
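
As a quick check of `flat_accuracy`, the sketch below applies it to a small batch of made-up logits. The function is repeated so the example runs on its own; the logits and labels are illustrative values, not model output.

```python
import numpy as np


def flat_accuracy(preds, labels):
    # Same logic as the notebook function: argmax over classes, then compare with the labels.
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)


# Four sentences, three classes (toy logits).
toy_logits = np.array([
    [2.0, 0.1, 0.3],   # predicted class 0
    [0.2, 1.5, 0.1],   # predicted class 1
    [0.1, 0.2, 3.0],   # predicted class 2
    [1.0, 0.9, 0.2],   # predicted class 0
])
toy_labels = np.array([0, 1, 2, 1])

toy_accuracy = flat_accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # three of the four predictions match the labels
```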

In [37]:
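#Train on the training split (swap in full_dataloader below to train on the full dataset for the Kaggle submission)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    torch.cuda.set_device(0) # only select a GPU when one is present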

model.to(device)

epochs = 10 #(set to 50 for kaggle submission)

# Training
for epoch in range(0, epochs):
    
    print("")
    print(f'########## Epoch {epoch} / {epochs} ##########')

    # Tracking variables for training
    train_loss = 0
    num_train_examples, num_train_steps = 0, 0
    # Put the model into training mode
    model.train()
    # For each batch of training data
    for step, batch in enumerate(train_dataloader): #full_dataloader (when training on full dataset for kaggle submission)

        input_id = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels = batch[2].to(device)

        # Clear old gradients before backward pass
        model.zero_grad()        
        # Forward pass
        outputs = model(input_id,token_type_ids=None, attention_mask=attention_mask, labels=labels)
        # Get loss value
        loss = outputs[0]
        # Backward pass
        loss.backward()
        # Clip the norm of the gradients to 1.0 to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters and take a step 
        optimizer.step()
        # Update tracking variables
        train_loss += loss.item()
        num_train_examples += input_id.size(0)
        num_train_steps += 1

    print("Train loss: {}".format(train_loss/num_train_steps))
    torch.save(model.state_dict(), "/content/drive/MyDrive/Colab Notebooks/DMML_project/camembert_model.pt")
    
 
    gc.collect()
    torch.cuda.empty_cache()

# Save the final weights first, then confirm
torch.save(model.state_dict(), "/content/drive/MyDrive/Colab Notebooks/DMML_project/camembert_model.pt")
print("Model saved!")


########## Epoch 0 / 10 ##########
Train loss: 1.3139491327727835

########## Epoch 1 / 10 ##########
Train loss: 0.9698459891757617

########## Epoch 2 / 10 ##########
Train loss: 0.723643769378153

########## Epoch 3 / 10 ##########
Train loss: 0.5597505398948367

########## Epoch 4 / 10 ##########
Train loss: 0.4430055876408005

########## Epoch 5 / 10 ##########
Train loss: 0.3514152300924858

########## Epoch 6 / 10 ##########
Train loss: 0.28636445532053284

########## Epoch 7 / 10 ##########
Train loss: 0.2299204265839459

########## Epoch 8 / 10 ##########
Train loss: 0.18814127410842046

########## Epoch 9 / 10 ##########
Train loss: 0.1963006818931414
Model saved!


In [38]:
#testing set
input_ids_test, attention_mask_test, labels_test_test = preprocess(X_test_list, y_test_list)

In [39]:
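# Apply the fine-tuned model (CamemBERT)
device = torch.device('cpu') 
model.to(device)
model.eval() # switch off dropout for inference (the training loop leaves the model in train mode)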

y_pred = []

with torch.no_grad():
    # Forward pass, calculate logit predictions
    outputs =  model(input_ids_test,token_type_ids=None, attention_mask=attention_mask_test)
    logits = outputs[0]
    logits = logits.detach().cpu().numpy() 
    y_pred.extend(np.argmax(logits, axis=1).flatten())

In [40]:
evaluate(y_test_list, y_pred)

CONFUSION MATRIX:
[[117  35   8   0   1   0]
 [ 39  90  28   7   0   0]
 [  9  43  91  15   2   0]
 [  0   1  16  88  31   8]
 [  0   2   5  54  81  31]
 [  0   0   4  11  39 104]]
ACCURACY SCORE:
0.5948
CLASSIFICATION REPORT:
	Precision: 0.5985
	Recall: 0.5948
	F1_Score: 0.5952


In [42]:
# Evaluate and save results for table
cb_results = evaluate(y_test_list, y_pred)
cb_percision = cb_results[0]
cb_recall = cb_results[1]
cb_f1 = cb_results[2]
cb_accuracy = cb_results[3]

CONFUSION MATRIX:
[[117  35   8   0   1   0]
 [ 39  90  28   7   0   0]
 [  9  43  91  15   2   0]
 [  0   1  16  88  31   8]
 [  0   2   5  54  81  31]
 [  0   0   4  11  39 104]]
ACCURACY SCORE:
0.5948
CLASSIFICATION REPORT:
	Precision: 0.5985
	Recall: 0.5948
	F1_Score: 0.5952
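
Because the difficulty levels are ordered, it can also help to ask how far off the wrong predictions are. The sketch below recomputes the exact accuracy from the confusion matrix above and adds a "within one level" accuracy; this second metric is an extra illustration, not one of the metrics used in the notebook, and the row/column layout (true level by predicted level) is assumed.

```python
import numpy as np

# CamemBERT confusion matrix, copied from the output above
# (assumed layout: rows = true level, columns = predicted level, order A1..C2).
cm = np.array([
    [117, 35,  8,  0,  1,   0],
    [ 39, 90, 28,  7,  0,   0],
    [  9, 43, 91, 15,  2,   0],
    [  0,  1, 16, 88, 31,   8],
    [  0,  2,  5, 54, 81,  31],
    [  0,  0,  4, 11, 39, 104],
])

levels = np.arange(cm.shape[0])
# distance[i, j] = number of levels between true level i and predicted level j
distance = np.abs(levels[:, None] - levels[None, :])

exact_accuracy = float(cm[distance == 0].sum() / cm.sum())
within_one_accuracy = float(cm[distance <= 1].sum() / cm.sum())

print(f"exact: {exact_accuracy:.4f}")
print(f"within one level: {within_one_accuracy:.4f}")
```

Most of the errors fall on the cells next to the diagonal, so the model mainly confuses neighbouring levels.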


## Kaggle Submission

In [24]:
#Before running make sure to train model on FULL dataset. This is the final output for submission to kaggle:
#test on unseen data:
sentences = df_pred['sentence'].to_list()

#tokenize (padding is applied in the next step, so none is requested here):
tokenized_sentences_ids = [tokenizer.encode(sentence, add_special_tokens=True) for sentence in sentences]
# Pad the encoded sentences with 0 up to the length of the longest sentence:
tokenized_sentences_ids = pad_sequences(tokenized_sentences_ids, dtype="long", truncating="post", padding="post")

# Create attention masks:
attention_masks = []
for seq in tokenized_sentences_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

#create pytorch inputs for model:
prediction_inputs = torch.tensor(tokenized_sentences_ids)
prediction_masks = torch.tensor(attention_masks)

In [25]:
# Apply the fine-tuned model (CamemBERT)
model.eval() # switch off dropout for inference
flat_pred = []
with torch.no_grad():
    # Forward pass, calculate logit predictions
    outputs =  model(prediction_inputs.to(device),token_type_ids=None, attention_mask=prediction_masks.to(device))
    logits = outputs[0]
    logits = logits.detach().cpu().numpy() 
    flat_pred.extend(np.argmax(logits, axis=1).flatten())

In [26]:
#convert results to a dataframe
d = {'sentence':sentences,'difficulty_label':flat_pred}
df_test = pd.DataFrame(d)

#relabel difficulty:
difficulty = {
    0:"A1",
    1:"A2",
    2:"B1",
    3:"B2", 
    4:"C1", 
    5:"C2"
}
#map categorical value back to difficulty rating:
df_test["difficulty"] = df_test['difficulty_label'].map(difficulty)
df_test.reset_index(inplace=True)
df_test = df_test.rename(columns = {'index':'id'})

df_pred_results = df_test[[ "id","difficulty"]]
df_pred_results

Unnamed: 0,id,difficulty
0,0,C2
1,1,A2
2,2,B1
3,3,A2
4,4,C2
...,...,...
1195,1195,B1
1196,1196,A2
1197,1197,C2
1198,1198,B2
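
The hard-coded dictionary above mirrors the alphabetical order that `LabelEncoder` assigns to the levels. A possible alternative, sketched below, is to let the encoder invert its own mapping. The encoder is refit here on the six level names only to keep the example self-contained; in the notebook, the already fitted `LE` object could presumably be reused instead.

```python
from sklearn.preprocessing import LabelEncoder

# Refit on the six CEFR levels (classes are sorted alphabetically: A1=0 ... C2=5).
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
encoder = LabelEncoder().fit(levels)

# First five predicted class ids, matching the first rows of the table above.
predicted_ids = [5, 1, 2, 1, 5]
predicted_levels = encoder.inverse_transform(predicted_ids).tolist()
print(predicted_levels)
```

This keeps the integer-to-level mapping in one place, so it cannot drift from the encoding used for training.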


In [None]:
df_pred_results.to_csv('/content/drive/MyDrive/Colab Notebooks/DMML_project/microsoft_submission11.csv', index=False) 

#### 4.7. Show a summary of your results

In [63]:
!pip install tabulate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [64]:
from tabulate import tabulate

In [65]:
results_df = [["Precision", log_reg_percision, knn_percision, tree_percision, forest_percision, ensemble_percision, cb_percision],
              ["Recall", log_reg_recall, knn_recall, tree_recall, forest_recall, ensemble_recall, cb_recall],
              ["F1-Score", log_reg_f1, knn_f1, tree_f1, forest_f1, ensemble_f1, cb_f1],
              ["Accuracy", log_reg_accuracy, knn_accuracy, tree_accuracy, forest_accuracy, ensemble_accuracy, cb_accuracy]]

col_names = ["Logistic Regression", "KNN", "Decision Tree", "Random Forest", "Ensemble + Data Augmentation", "CamemBERT"]
print(tabulate(results_df, headers=col_names))

             Logistic Regression       KNN    Decision Tree    Random Forest    Ensemble + Data Augmentation    CamemBERT
---------  ---------------------  --------  ---------------  ---------------  ------------------------------  -----------
Precision               0.486521  0.403741         0.346448         0.388571                        0.421505     0.598524
Recall                  0.485417  0.398958         0.339583         0.396875                        0.3          0.594792
F1-Score                0.482333  0.376673         0.330539         0.372308                        0.276951     0.595161
Accuracy                0.485417  0.398958         0.339583         0.396875                        0.3          0.594792
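
For comparing models, it can be easier to read the table with one model per row, sorted by accuracy. The sketch below rebuilds the summary with pandas from the values printed above; the variable names `scores` and `summary` are illustrative and not part of the notebook.

```python
import pandas as pd

metrics = ["Precision", "Recall", "F1-Score", "Accuracy"]

# Values copied from the summary table above.
scores = {
    "Logistic Regression":          [0.486521, 0.485417, 0.482333, 0.485417],
    "KNN":                          [0.403741, 0.398958, 0.376673, 0.398958],
    "Decision Tree":                [0.346448, 0.339583, 0.330539, 0.339583],
    "Random Forest":                [0.388571, 0.396875, 0.372308, 0.396875],
    "Ensemble + Data Augmentation": [0.421505, 0.3,      0.276951, 0.3],
    "CamemBERT":                    [0.598524, 0.594792, 0.595161, 0.594792],
}

# One row per model, one column per metric, best accuracy first.
summary = pd.DataFrame.from_dict(scores, orient="index", columns=metrics)
summary = summary.sort_values("Accuracy", ascending=False).round(4)
print(summary)
```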
