## Course Project: Text Classification with Rakuten France Product Data

The project focuses on large-scale product type code text classification, where the goal is to predict each product’s type code as defined in the catalog of Rakuten France. This project is derived from a data challenge proposed by Rakuten Institute of Technology, Paris. Details of the data challenge are [available at this link](https://challengedata.ens.fr/challenges/35).

The above data challenge focuses on multimodal product type code classification using text and image data. **For this project we will work with only the text part of the data.**

Please read the description of the challenge provided at the above link carefully. **You can disregard any information related to the image part of the data.**

### To obtain the data
You have to register [at this link](https://challengedata.ens.fr/challenges/35) to get access to the data.

For this project you will only need the text data. Download the training files `x_train` and `y_train`, which contain the item texts and the corresponding product type code labels.

### Pandas for handling the data
The files you obtained are in CSV format. We strongly suggest using the Python Pandas package to load and visualize the data. [Here is a basic tutorial](https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/) on how to handle data in a CSV file using Pandas.

If you open the `x_train` dataset using Pandas, you will find that it contains the following columns:
1. an integer ID for the product
2. **designation** - The product title
3. description
4. productid
5. imageid

For this project we will only need the integer ID and the designation. You can [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the other columns.

The training output file `y_train.csv` contains the **prdtypecode**, the target/output variable for the classification task, for each integer id in the training input file `X_train.csv`.
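
As a rough sketch of the loading and column selection described above, the snippet below reads two tiny in-memory CSV files instead of the real ones. The rows and label values are made up for illustration, and it assumes the integer ID sits in an unnamed first column, which is what the column list above suggests.

```python
import io
import pandas as pd

# Tiny stand-ins for X_train.csv / Y_train.csv (made-up rows, same column layout)
x_csv = io.StringIO(
    ",designation,description,productid,imageid\n"
    "0,Livre de cuisine,,111,1001\n"
    "1,Console de jeux,Une console,222,1002\n"
)
y_csv = io.StringIO(",prdtypecode\n0,10\n1,40\n")

# Use the unnamed first column (the integer ID) as the index
X = pd.read_csv(x_csv, index_col=0)
y = pd.read_csv(y_csv, index_col=0)

# Keep only the product title
X = X.drop(columns=["description", "productid", "imageid"])

# Align titles and labels on the shared integer ID
data = X.join(y)
print(data)
```

Joining on the index keeps each title next to its label, which makes it easy to check that inputs and targets stay aligned after later filtering steps.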

### Task for the break
1. Register yourself and download the training and test files for the text data. You do not need the `supplementary files` for this project.
2. Load the data using pandas and disregard unnecessary columns as mentioned above.
3. On the **designation** column, apply the preprocessing techniques.

### Task for the end of the course
After this preprocessing step, you now have access to a TF-IDF matrix that constitutes our data set for the final evaluation project. The project guidelines are:
1. Apply all approaches taught in the course and practiced in lab sessions (Decision Trees, Bagging, Random forests, Boosting, Gradient Boosted Trees, AdaBoost, etc.) on this data set. The goal is to predict the target variable (prdtypecode).
2. Compare performances of all these models in terms of the weighted-f1 scores you can output. 
3. Conclude about the most appropriate approach on this data set for the predictive task. 
4. Write a report in .tex format that addresses all these guidelines, with a maximum of 5 pages (including figures, tables and references). We will take into account the quality of writing and presentation of the report.
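
To make the weighted F1 score in guideline 2 concrete, here is a small pure-Python sketch: it computes the F1 score of each class and averages them with weights proportional to the number of true samples per class. The function name `weighted_f1` and the toy labels are illustrative only; in practice `sklearn.metrics.f1_score` with `average="weighted"` should give the same value.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score

# Toy labels (made-up class codes)
y_true = [10, 10, 10, 40, 40, 50]
y_pred = [10, 10, 40, 40, 40, 50]
print(round(weighted_f1(y_true, y_pred), 4))  # → 0.8333
```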

In [8]:
import numpy as np
import pandas as pd
import spacy

# Load spaCy for french
spacy_nlp = spacy.load("fr_core_news_sm")

In [17]:
%%capture
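# tqdm_notebook is deprecated; tqdm.notebook provides the same progress bar
from tqdm.notebook import tqdm
# register progress_apply on pandas objects
tqdm.pandas()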

In [2]:
# load data
X_train = pd.read_csv('data/X_train.csv')
Y_train = pd.read_csv('data/Y_train.csv')
X_test = pd.read_csv('data/X_test.csv')

In [93]:
# X_train.head()
# Y_train.head()
# X_test.head()

## Pre-processing

1. We only keep the designation and the id.
2. We put the text in lower case, normalize the accents, tokenize the text, and remove punctuation and stop words.

In [4]:
# keep only the integer id and the designation
def cleaning(X_train): 
    X_train = X_train.drop(['description', 'productid','imageid'], axis=1)
    X_train.columns = ['integer_id', 'designation']
    return X_train

In [5]:
def normalize_accent(string):
    string = string.replace('à', 'a')
    string = string.replace('á', 'a')
    string = string.replace('â', 'a')

    string = string.replace('é', 'e')
    string = string.replace('è', 'e')
    string = string.replace('ê', 'e')
    string = string.replace('ë', 'e')

    string = string.replace('î', 'i')
    string = string.replace('ï', 'i')

    string = string.replace('ö', 'o')
    string = string.replace('ô', 'o')
    string = string.replace('ò', 'o')
    string = string.replace('ó', 'o')

    string = string.replace('ù', 'u')
    string = string.replace('û', 'u')
    string = string.replace('ü', 'u')

    string = string.replace('ç', 'c')
    
    return string
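
The chain of `replace` calls above only handles the characters it lists explicitly. As a possible alternative (not the approach used in this notebook), the standard library's `unicodedata` module can strip accents generically by decomposing each character and dropping the combining marks. The function name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("à côté d'un château déjà fêté"))
# → a cote d'un chateau deja fete
```

Ligatures such as `œ` have no Unicode decomposition, so they pass through unchanged and would still need an explicit replacement if that matters for the vocabulary.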

In [6]:
def raw_to_tokens(raw_string, spacy_nlp):
    # Lower-case the text
    string = raw_string.lower()

    # Normalize the accents
    string = normalize_accent(string)

    # Tokenize with spaCy
    spacy_tokens = spacy_nlp(string)

    # Drop punctuation and stop-word tokens, keeping the raw token text
    string_tokens = [token.orth_ for token in spacy_tokens if not token.is_punct and not token.is_stop]

    # Join the tokens back into a single string
    clean_string = " ".join(string_tokens)
    
    return clean_string

### Apply pre-processing functions

In [26]:
# X_train - this step takes roughly 15:30 min; uncomment the line below to run the text processing
X_train = cleaning(X_train)
# X_train['designation_cleaned'] = X_train['designation'].progress_apply(lambda x: raw_to_tokens(x, spacy_nlp))

In [42]:
# X_test - this step takes roughly 2:20 min; uncomment the line below to run the text processing
X_test = cleaning(X_test)
# X_test['designation_cleaned'] = X_test['designation'].progress_apply(lambda x: raw_to_tokens(x, spacy_nlp))

In [44]:
# save the cleaned data to avoid re-processing every time
X_train.to_csv(r'data/X_train_cleaned.csv', index = False, header=True)
X_test.to_csv(r'data/X_test_cleaned.csv', index = False, header=True)

## TF-IDF matrix

Construct the TF-IDF matrix from the pre-processed data. 

In [47]:
X_train = pd.read_csv('data/X_train_cleaned.csv')
X_test = pd.read_csv('data/X_test_cleaned.csv')

In [60]:
# create a list from the processed cells
doc_clean =  X_train['designation_cleaned'].astype('U').tolist()

In [61]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# convert raw documents into TF-IDF matrix.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(doc_clean)

print("Shape of the TF-IDF Matrix:")
print(X_tfidf.shape)

Shape of the TF-IDF Matrix:
(84916, 79402)
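
When the test set is vectorized later, it has to be projected onto the vocabulary learned from the training set, otherwise the two matrices will not have the same columns. The sketch below illustrates this on a few made-up documents: `fit_transform` is called on the training documents only, and `transform` on the test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already cleaned documents
train_docs = ["console jeux video", "livre cuisine francaise", "jeux societe enfants"]
test_docs = ["livre jeux", "piscine gonflable"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights on the training documents only
X_train_tfidf = vectorizer.fit_transform(train_docs)
# Reuse that vocabulary for the test documents
X_test_tfidf = vectorizer.transform(test_docs)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Words that never appear in the training documents (here `piscine` and `gonflable`) are simply ignored, so the second test document becomes an all-zero row.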


### PCA of the TF-IDF matrix
We apply a PCA to the TF-IDF matrix to reduce its dimension. Given that the matrix is very sparse, this should improve the speed of the algorithms. We opt for **XY** principal components, corresponding to 85% of the variance explained. Since the matrix is very sparse, the [Sparse PCA model](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html) is used. Note that `SparsePCA` produces sparse components but still requires a dense input array, hence the `.toarray()` call.

In [104]:
from sklearn.decomposition import SparsePCA

transformer = SparsePCA(n_components=5, random_state=0)
transformer.fit(X_tfidf.toarray())

X_transformed = transformer.transform(X_tfidf)
X_transformed.shape

# most values in the components_ are zero (sparsity)
np.mean(transformer.components_ == 0)

MemoryError: Unable to allocate 50.2 GiB for an array with shape (84916, 79402) and data type float64

In [103]:

# SparsePCA did not fit in memory, so fall back to the full sparse TF-IDF matrix
X_transformed = X_tfidf
# print(X_transformed)
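
The `MemoryError` above comes from converting the TF-IDF matrix to a dense array. One option that avoids the conversion is `TruncatedSVD`, which accepts sparse input directly. This is a different technique from the `SparsePCA` attempted above: it does not center the data, so it is not exactly PCA. The sketch below runs on a small synthetic sparse matrix, and the sizes and number of components are arbitrary assumptions.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for X_tfidf (sizes are arbitrary)
X_sparse = sparse_random(200, 1000, density=0.01, format="csr", random_state=0)

# Works on the sparse matrix directly, no .toarray() needed
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_reduced.shape)
# Share of variance kept by the retained components
print(svd.explained_variance_ratio_.sum())
```

The cumulative sum of `explained_variance_ratio_` can be used to look for a number of components that reaches a target such as the 85% mentioned above.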

## Apply various models to predict the target variable
1. Decision Trees
2. Bagging
3. Random forests
4. Boosting
5. Gradient Boosted Trees
6. AdaBoost, etc.
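
As a minimal sketch of how these models could be compared on the weighted F1 score, the loop below cross-validates each of them on a small synthetic multi-class data set. The data, the hyperparameters and the number of folds are placeholder assumptions, not tuned choices for the Rakuten data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small synthetic 3-class problem standing in for the TF-IDF matrix
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "bagging": BaggingClassifier(n_estimators=20, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean weighted F1 over 3 folds for each model
scores = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=3, scoring="f1_weighted")
    scores[name] = cv_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```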

### 1. Decision trees

In [95]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Grid over the split criterion and the maximum tree depth
param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15)}
# The task is multi-class, so use the weighted F1 instead of the binary 'f1' scorer
grid_dec_tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring='f1_weighted')
# Fit on the label column only (Y_train also holds the integer id)
result = grid_dec_tree.fit(X_tfidf, Y_train['prdtypecode'])

# TODO: add tqdm to track the time taken by the algorithm

KeyboardInterrupt: 
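
The grid search above fits 2 criteria × 12 depths × 5 folds = 120 trees on the full matrix, which is likely why it had to be interrupted. One way to estimate the cost before committing to the full run is to time a smaller grid on a random subsample. The sketch below does this on synthetic data, and the subsample size and grid values are arbitrary assumptions.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the TF-IDF matrix and labels
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=4, random_state=0)

# Draw a random subsample of rows to search on
rng = np.random.default_rng(0)
subset = rng.choice(len(y), size=500, replace=False)

# Coarser grid than the full search: 2 criteria x 3 depths x 3 folds = 18 fits
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 6, 9]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=3, scoring="f1_weighted")

# Time the search
start = time.perf_counter()
search.fit(X[subset], y[subset])
elapsed = time.perf_counter() - start

print(search.best_params_, f"{elapsed:.2f}s")
```

`GridSearchCV` also accepts an `n_jobs` argument to run the fits in parallel, which may help once the grid is applied to the full data.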
