This notebook pre-processes the data before the model is fitted.

This includes:
* Fixing Clothing IDs that are shared by more than one Class.
* Removing articles with too few reviews for the recommendation probability to be estimated reliably.
* Adding a column with the recommendation probability (our modelling target).
* Pre-computing the NLP features for each review from the title and the text.
    * This saves time when fitting the model, especially during cross-validation.
    * We can still dynamically select which NLP columns to use in the pipeline.


Import requirements

In [1]:
import os
from pathlib import Path

import pandas as pd

# custom transformer to process text
from stylesense.text_transformers import TextProcessor

Load data

In [2]:
file_path = Path(os.path.abspath('')).parent / "data" / "reviews.csv"
df = pd.read_csv(file_path)

Update duplicated Clothing ID

In [3]:
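# Count how many distinct classes each Clothing ID appears under
clothing_class_counts = df.groupby('Clothing ID')['Class Name'].nunique()
multiple_class_ids = clothing_class_counts[clothing_class_counts > 1]

# Move the "General" division rows of each duplicated ID to a new, unused ID
for clothing_id in multiple_class_ids.index:
    next_id = df['Clothing ID'].max() + 1
    mask = (df['Clothing ID'] == clothing_id) & (df['Division Name'] == "General")
    df.loc[mask, 'Clothing ID'] = next_id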
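
The reassignment can be sanity-checked by confirming that every Clothing ID maps to a single Class afterwards. The sketch below runs the same logic on a toy frame; the rows are made up for illustration and are not taken from `reviews.csv`.

```python
import pandas as pd

# Toy data (hypothetical values): ID 1 appears under two classes.
toy = pd.DataFrame({
    "Clothing ID": [1, 1, 2, 2],
    "Class Name": ["Dresses", "Knits", "Pants", "Pants"],
    "Division Name": ["General", "General Petite", "General", "General"],
})

class_counts = toy.groupby("Clothing ID")["Class Name"].nunique()
duplicated_ids = class_counts[class_counts > 1].index

# Same reassignment as in the cell above, applied to the toy frame
for clothing_id in duplicated_ids:
    next_id = toy["Clothing ID"].max() + 1
    mask = (toy["Clothing ID"] == clothing_id) & (toy["Division Name"] == "General")
    toy.loc[mask, "Clothing ID"] = next_id

# After the fix, no ID should belong to more than one class
remaining = toy.groupby("Clothing ID")["Class Name"].nunique()
all_unique = bool((remaining <= 1).all())
```

The check only passes when the duplicated classes split cleanly along `Division Name == "General"`. If it fails on the real data, the remaining IDs need a different splitting rule.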

Remove article IDs with low number of reviews

In [4]:
# Keep only articles with at least `min_reviews` reviews
min_reviews = 10
num_reviews = df['Clothing ID'].value_counts()
df = df[df['Clothing ID'].isin(num_reviews[num_reviews >= min_reviews].index)]
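
The recommendation probability listed in the introduction is not computed in the cells shown here, so it may be handled elsewhere. One way to derive it is sketched below, assuming the data has a binary `Recommended IND` column (1 = recommended). The output column name `Recommendation Probability` is a placeholder chosen for this example.

```python
import pandas as pd

# Toy data (hypothetical values); "Recommended IND" is an assumed column name.
toy = pd.DataFrame({
    "Clothing ID": [10, 10, 10, 10, 20, 20],
    "Recommended IND": [1, 1, 0, 1, 0, 0],
})

# Share of positive recommendations per article, broadcast to every review row
toy["Recommendation Probability"] = (
    toy.groupby("Clothing ID")["Recommended IND"].transform("mean")
)
```

Using `transform("mean")` keeps one row per review, so the target sits next to the per-review NLP columns.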

Pre-process NLP

In [5]:
df = TextProcessor().transform(df)
# The raw text is no longer needed once the NLP columns exist
df = df.drop(columns=["Title", "Review Text"])
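
Because the NLP columns are pre-computed, a later pipeline can choose a subset of them by name. The sketch below shows one way to do that with plain pandas. The `nlp_title_` and `nlp_text_` prefixes and the helper `select_nlp_columns` are hypothetical; the actual column names depend on what `TextProcessor` produces.

```python
import pandas as pd

# Hypothetical processed frame; real column names depend on TextProcessor.
processed = pd.DataFrame({
    "Clothing ID": [1, 2],
    "nlp_title_n_tokens": [3, 5],
    "nlp_text_n_tokens": [40, 62],
    "nlp_text_sentiment": [0.8, -0.1],
})


def select_nlp_columns(frame, prefix):
    """Return the names of columns that start with the given prefix."""
    return [col for col in frame.columns if col.startswith(prefix)]


# Use only the features derived from the review text, not the title
text_features = select_nlp_columns(processed, "nlp_text_")
X = processed[text_features]
```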

Export CSV file

In [7]:
file_path = Path(os.path.abspath('')).parent / "data" / "reviews_processed.csv"
df.to_csv(file_path, index=False)