## Course Project: Text Classification with Rakuten France Product Data

The project focuses on large-scale text classification of product type codes, where the goal is to predict each product’s type code as defined in the catalog of Rakuten France. This project is derived from a data challenge proposed by Rakuten Institute of Technology, Paris. Details of the data challenge are [available at this link](https://challengedata.ens.fr/challenges/35).

The above data challenge focuses on multimodal product type code classification using text and image data. **For this project we will work with only the text part of the data.**

Please read carefully the description of the challenge provided in the above link. **You can disregard any information related to the image part of the data.**

### To obtain the data
You have to register [at this link](https://challengedata.ens.fr/challenges/35) to get access to the data.

For this project you will only need the text data. Download the training files `x_train` and `y_train`, which contain the item texts and the corresponding product type code labels, respectively.

### Pandas for handling the data
The files you obtained are in CSV format. We strongly suggest using the Python pandas package to load and inspect the data. [Here is a basic tutorial](https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/) on how to handle data in CSV files using pandas.

If you open the `x_train` dataset using pandas, you will find that it contains the following columns:
1. an integer ID for the product
2. **designation** - The product title
3. description
4. productid
5. imageid

For this project we will only need the integer ID and the designation. You can [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the other columns.
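
As a minimal sketch of this loading step, assuming the first (unnamed) column of the CSV holds the integer ID and the remaining columns are named as listed above. The two rows below are made-up stand-ins for the real file:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for X_train.csv (made-up rows, same column layout).
csv_text = (
    ",designation,description,productid,imageid\n"
    "0,Livre de poche,Roman policier,1001,5001\n"
    "1,Console de jeux,,1002,5002\n"
)

# index_col=0 turns the unnamed first column into the integer ID index.
x_train = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Keep only the title; drop the columns this project does not use.
x_train = x_train.drop(columns=["description", "productid", "imageid"])

print(x_train)
```

With the real data, `io.StringIO(csv_text)` would be replaced by the path to the downloaded file.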

The training output file `y_train.csv` contains the **prdtypecode**, the target/output variable for the classification task, for each integer id in the training input file `X_train.csv`.

### Task for the break
1. Register yourself and download the training and test files for the text data. You do not need the `supplementary files` for this project.
2. Load the data using pandas and disregard unnecessary columns as mentioned above.
3. On the **designation** column, apply the preprocessing techniques.
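
The preprocessing techniques themselves are those covered in the course. Purely as an illustration, a designation can be normalized with the standard library as sketched below. The function name `clean_designation` and the choice of steps (lowercasing, accent stripping, removal of non-alphanumeric characters) are assumptions, not the required pipeline:

```python
import re
import unicodedata


def clean_designation(text: str) -> str:
    """Lowercase, strip accents, and keep only letters, digits and spaces."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace every run of characters other than a-z and 0-9 with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


print(clean_designation("Écouteurs Bluetooth - Édition Limitée!"))
```

On a DataFrame, such a function could be applied with `df['designation'].apply(clean_designation)`.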

### Task for the end of the course
After this preprocessing step, you now have access to a TF-IDF matrix that constitutes the data set for the final evaluation project. The project guidelines are:
1. Apply all approaches taught in the course and practiced in lab sessions (Decision Trees, Bagging, Random forests, Boosting, Gradient Boosted Trees, AdaBoost, etc.) on this data set. The goal is to predict the target variable (prdtypecode).
2. Compare the performance of all these models in terms of their weighted F1 scores.
3. Conclude which approach is most appropriate for the predictive task on this data set.
4. Write a report in .tex format that addresses all these guidelines, with a maximum length of 5 pages (including figures, tables and references). We will take into account the quality of writing and presentation of the report.
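
For reference, the weighted F1 score averages the per-class F1 scores, weighting each class by its number of true samples. A minimal pure-Python sketch of that definition follows. The helper name `weighted_f1` and the toy labels are illustrative; in practice `sklearn.metrics.f1_score` with `average='weighted'` computes the same quantity:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n_true / total) * f1
    return score


# Toy labels, for illustration only.
y_true = [10, 10, 10, 40, 40, 50]
y_pred = [10, 10, 40, 40, 40, 10]
print(round(weighted_f1(y_true, y_pred), 4))
```

A class that is never predicted correctly contributes an F1 of zero but still counts in the weights, so rare classes lower the score in proportion to their frequency.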

In [1]:
!pip install spacy



In [2]:
!python -m spacy download fr_core_news_sm

✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')


Libraries

In [1]:
import numpy as np
import pandas as pd
import spacy
import fr_core_news_sm

# Choose number of samples tested
import math

# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# dimension reduction
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import SparsePCA

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# random subsampling of the training rows
import random 

# classifiers
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

# hyperparameters tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import validation_curve
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# Load spaCy for french
spacy_nlp = fr_core_news_sm.load()

In [3]:
Y_train = pd.read_csv('Y_train.csv')

## TF-IDF matrix

Construct the TF-IDF matrix from the pre-processed data. 

In [4]:
X_train = pd.read_csv('X_train_cleaned_.csv')
X_test = pd.read_csv('X_test_cleaned_.csv')

In [5]:
# create a list from the processed cells
doc_clean_train =  X_train['designation_cleaned'].astype('U').tolist()
doc_clean_test = X_test['designation_cleaned'].astype('U').tolist()

#doc_clean = doc_clean_train + doc_clean_test
doc_clean = doc_clean_train

In [6]:
# convert raw documents into TF-IDF matrix.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(doc_clean)

print("Shape of the TF-IDF Matrix:")
print(X_tfidf.shape)

Shape of the TF-IDF Matrix:
(84916, 61697)
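
To make the shape above concrete: each row is a document, each column a vocabulary term, and each entry a term count scaled by an inverse document frequency. The sketch below reproduces, in plain Python on a made-up three-document corpus, the weighting that `TfidfVectorizer` applies by default (smoothed idf `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The helper name `tfidf_rows` is hypothetical, and whitespace splitting is a simplification of the real tokenizer:

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-idf, L2-normalized weights."""
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


docs = ["livre de poche", "livre de cuisine", "console de jeux"]
vocab, rows = tfidf_rows(docs)
print(len(rows), len(vocab))  # number of documents, number of terms
```

A word that appears in every document (here `de`) receives the smallest idf, so it weighs less than a word specific to one document (here `poche`).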


### Use the TF-IDF matrix as the training set

If applicable, based on the applied transformations:
*   Define the X_train_T matrix as either the truncated (reduced) matrix or the full TF-IDF matrix
*   Split it back into train and test sets



In [7]:
# if no transformation is applied i.e. no PCA / truncated SVD: 
X_transformed = X_tfidf
# print(X_transformed)

In [8]:
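# X_tfidf was built from the training documents only, so the first slice
# holds all 84916 training rows and the second slice is empty.
X_train_T = X_transformed[:84916]
X_test_T = X_transformed[84916:]

print(X_train_T.shape) # 84916 training rows
print(X_test_T.shape) # 0 rows here; 13812 expected if the test documents were included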

(84916, 61697)
(0, 61697)
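
The empty second shape above follows from fitting the vectorizer on the training documents only: `X_tfidf` has no test rows to slice off. One common way to obtain a test matrix with the same columns is to call `transform` on the test documents, using the vectorizer fitted on the training set. A minimal sketch on made-up documents (the toy lists stand in for `doc_clean_train` and `doc_clean_test`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for doc_clean_train and doc_clean_test.
train_docs = ["livre de poche", "console de jeux", "livre de cuisine"]
test_docs = ["jeux de societe", "livre ancien"]

tfidf = TfidfVectorizer()
# Learn the vocabulary and idf weights from the training documents only.
X_train_T = tfidf.fit_transform(train_docs)
# Reuse that vocabulary for the test documents; unseen words are ignored.
X_test_T = tfidf.transform(test_docs)

print(X_train_T.shape)
print(X_test_T.shape)
```

Both matrices then share the same number of columns, which is what the classifiers require at prediction time.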


## Apply various models to predict the target variable
1. Decision Trees
2. Bagging
3. Random forests
4. Boosting
5. Gradient Boosted Trees
6. AdaBoost, etc.

### Subsample the dataset for faster training

A fraction of the training samples is chosen at random within the training set. The run shown below keeps 90% of the rows; the fraction can be lowered (e.g. to 20%) for quicker experiments. This assumes the data is uniformly distributed.

In [9]:
row, col = X_train.shape
N = math.ceil(row * 0.9) # 90% of the data
#N = 1000

In [10]:
# draw N distinct row indices from 0..row-1 (row 0 included)
rdsample = random.sample(range(row), N) 

X_train_sample = X_train_T[rdsample,]
Y_train_sample = Y_train.to_numpy()[rdsample,]
Y_train_sample = pd.DataFrame(Y_train_sample[:,1])

print(X_train_sample.shape)
print(Y_train_sample.shape)

(76425, 61697)
(76425, 1)
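
Because the sample is drawn without a fixed seed, each run trains on different rows. Below is a small sketch of a reproducible draw, together with a check of how well the sample preserves class proportions. The seed value, the helper name `sample_indices` and the toy labels are assumptions for illustration:

```python
import math
import random
from collections import Counter


def sample_indices(n_rows, fraction, seed=0):
    """Draw ceil(n_rows * fraction) distinct row indices, reproducibly."""
    n_keep = math.ceil(n_rows * fraction)
    rng = random.Random(seed)  # local generator: same seed, same sample
    return rng.sample(range(n_rows), n_keep)  # range starts at 0


# Toy labels: 80 items of class 10 and 20 items of class 40.
labels = [10] * 80 + [40] * 20
idx = sample_indices(len(labels), 0.5, seed=0)

sampled = Counter(labels[i] for i in idx)
print(len(idx), dict(sampled))
```

If the class counts in the sample drift far from those of the full set, a stratified split (for example `train_test_split` with its `stratify` argument) is a possible alternative.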


### 1. Decision trees - Ariel

1.   First parameters tried: `{'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15), 'splitter': ['best', 'random']}` - **Optimal:** `{'criterion': 'gini', 'max_depth': 14, 'splitter': 'best'}`
2.   Second parameters tried: `{'criterion': ['gini', 'entropy'], 'max_depth': np.arange(13, 30), 'splitter': ['best', 'random']}` - **Optimal:** `{'criterion': 'gini', 'max_depth': 27, 'splitter': 'random'}`
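
The cost of each search grows with the product of the option counts. As a rough check, the sketch below counts the candidates in the two grids listed above, using plain `range` in place of `np.arange`. The 5-fold figure is an assumption, matching the `cv = 5` used in the next cell:

```python
from itertools import product

first_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(3, 15)),
    "splitter": ["best", "random"],
}
second_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(13, 30)),
    "splitter": ["best", "random"],
}


def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search tries."""
    return sum(1 for _ in product(*grid.values()))


for name, grid in [("first", first_grid), ("second", second_grid)]:
    n = count_candidates(grid)
    print(name, n, "candidates,", n * 5, "fits with 5-fold CV")
```
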



In [13]:
parameters = {'criterion':['gini'],
              'max_depth': [None], 
              'splitter':['random']}

grid_dec_tree = GridSearchCV(tree.DecisionTreeClassifier(), 
                             parameters, 
                             cv = 5, 
                             scoring = 'f1_weighted', 
                             verbose = 10, 
                             n_jobs=-1)

result = grid_dec_tree.fit(X_train_sample, Y_train_sample)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   37.3s remaining:   55.9s
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:   37.3s remaining:   24.9s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   37.6s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   37.6s finished


In [14]:
# print results from grid search 

print(result.best_params_)
print(result.best_score_)

{'criterion': 'gini', 'max_depth': None, 'splitter': 'random'}
0.7168261191216211


### 2. Random forests - Camille

Source : https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [16]:
parameters = {'bootstrap': [True],
              'max_depth': [300],
              'max_features': ['sqrt'],  # same as the former 'auto' for classifiers
              'min_samples_leaf': [1],
              'min_samples_split': [5],
              'n_estimators': [1800]}

rand_forest = GridSearchCV(RandomForestClassifier(), 
                                 parameters, 
                                 cv = 5, 
                                 scoring = 'f1_weighted', 
                                 verbose = 10,
                                 n_jobs = -1)

result = rand_forest.fit(X_train_sample, Y_train_sample)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed: 27.2min remaining: 40.8min
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed: 27.2min remaining: 18.1min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 27.2min remaining:    0.0s


KeyboardInterrupt: 

In [None]:
# print results from grid search 
print(result.best_params_)
print(result.best_score_)

### 3. Boosting - Niels

In [17]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.0.2-py3-none-manylinux1_x86_64.whl (109.7 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.0.2


In [None]:
parameters = {
    'eta': [0.2], 
    'max_depth': [None], 
    "n_estimators" : [215],
    'objective': ['multi:softprob'], 
    'num_class' : [27], 
    'nthread' : [-1],
    'subsample' : [0.8],
    'colsample_bytree' : [0.8],
    'colsample_bylevel' : [1], 
    'tree_method' : ['hist'],
    } 

model = GridSearchCV(xgb.XGBClassifier(), 
                           parameters, 
                           cv = 3, 
                           scoring = 'f1_weighted', 
                           verbose = 10, 
                           n_jobs = -1)

result = model.fit(X_train_sample, Y_train_sample)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


In [17]:
parameters = {
    'eta': [0.2], 
    'max_depth': [None], 
    "n_estimators" : [215],
    'objective': ['multi:softprob'], 
    'num_class' : [27], 
    'nthread' : [-1],
    'subsample' : [0.8],
    'colsample_bytree' : [0.8],
    'colsample_bylevel' : [1], 
    'tree_method' : ['hist'],
    'verbosity' : [1],  # replaces the deprecated 'silent' flag
    'min_child_weight' : [1], 
    'max_delta_step' : [0],
    'booster' : ['gbtree']
    } 

model = GridSearchCV(xgb.XGBClassifier(), 
                           parameters, 
                           cv = 3, 
                           scoring = 'f1_weighted', 
                           verbose = 10, 
                           n_jobs = -1)

result = model.fit(X_train_sample, Y_train_sample)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  3.7min remaining:    0.0s


KeyboardInterrupt: 

In [None]:
print(result.best_params_)
print(result.best_score_)
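
One practical point when using `XGBClassifier`: recent XGBoost releases generally expect class labels to be the consecutive integers 0 to K-1, whereas product type codes are arbitrary integers. Below is a minimal sketch of mapping codes to consecutive indices and back, in plain Python with made-up codes (scikit-learn's `LabelEncoder` offers the same mapping). The helper name `fit_label_mapping` is hypothetical:

```python
def fit_label_mapping(codes):
    """Map each distinct code to an index in 0..K-1 (sorted order)."""
    classes = sorted(set(codes))
    code_to_index = {code: i for i, code in enumerate(classes)}
    return classes, code_to_index


# Made-up product type codes, for illustration only.
y_codes = [2280, 10, 40, 10, 2280, 1300]

classes, code_to_index = fit_label_mapping(y_codes)
y_encoded = [code_to_index[c] for c in y_codes]   # feed these to the model
y_decoded = [classes[i] for i in y_encoded]       # map predictions back

print(y_encoded)
print(y_decoded == y_codes)
```

The weighted F1 score is unaffected by this relabeling, since it depends only on which samples share a class.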

### 4. Gradient Boosted Trees - Camille

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting

https://towardsdatascience.com/understanding-gradient-boosting-machines-using-xgboost-and-lightgbm-parameters-3af1f9db9700

In [None]:
parameters = {
    "loss" : ['deviance'],  # named 'log_loss' in scikit-learn >= 1.1
    "max_depth": [10],
    'max_leaf_nodes': [4], 
    'min_samples_split': [2],
    "n_estimators": [200],
    "learning_rate": [0.1],
    "max_features" : ['sqrt']  # same as the former 'auto' for classifiers
}

grad_boost = GridSearchCV(GradientBoostingClassifier(), 
                                parameters, 
                                cv = 5, 
                                scoring = 'f1_weighted', 
                                verbose = 10, 
                                n_jobs = -1)

result = grad_boost.fit(X_train_sample, Y_train_sample)

In [None]:
# print results from grid search 

print(result.best_params_)
print(result.best_score_)
#print(grad_boost.cv_results_)

### 5. Bagging - Niels

In [None]:
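# In scikit-learn >= 1.2 the parameter is named `estimator`,
# so these keys become "estimator__criterion", etc.
parameters = {"base_estimator__criterion" : ["gini"],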
              "base_estimator__splitter" :   ["random"],
              "base_estimator__max_depth" :   [400],
              "n_estimators": [50, 100], 
              "max_samples" : [0.8]
             }

DTC = DecisionTreeClassifier(random_state = 11, 
                             max_features = 0.8)

BC = BaggingClassifier(base_estimator = DTC)

bagging = GridSearchCV(BC, 
                       parameters,  
                       cv = 5, 
                       scoring = 'f1_weighted', 
                       verbose = 10,
                       n_jobs = -1)

result = bagging.fit(X_train_sample, Y_train_sample)

In [None]:
# print results from grid search 

print(result.best_params_)
print(result.best_score_)

### 6. AdaBoost - Ariel

In [None]:
parameters = {"n_estimators" : [50], 
              "learning_rate" : [0.1]}

# DTC is defined here but not used: AdaBoost below wraps the random forest (RFC).
DTC = DecisionTreeClassifier(random_state = 11, 
                             max_features = "sqrt", 
                             class_weight = "balanced", 
                             max_depth = 100)

RFC = RandomForestClassifier(criterion = 'gini', 
                             n_estimators = 2000, 
                             min_samples_split = 10, 
                             min_samples_leaf = 2, 
                             max_features = 'sqrt', 
                             max_samples = 0.8,
                             max_depth = 100, 
                             bootstrap = True)

ABC = AdaBoostClassifier(base_estimator = RFC)

adaboost = GridSearchCV(ABC, 
                        parameters, 
                        cv = 3, 
                        scoring = 'f1_weighted', 
                        verbose = 20, 
                        n_jobs = -1)

result = adaboost.fit(X_train_sample, Y_train_sample)

In [None]:
# print results from grid search 
print(result.best_params_)
print(result.best_score_)