# Final Project of Python Course
#### Koc University, Istanbul, Turkey
#### Author: Amir Ranjouriheravi
#### May 2020


## Description of the project

### The description below is quoted from the following source:
https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

### Context
This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

### Content
This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

- Clothing ID: Integer categorical variable that refers to the specific piece being reviewed.
- Age: Positive integer variable of the reviewer's age.
- Title: String variable for the title of the review.
- Review Text: String variable for the review body.
- Rating: Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- Positive Feedback Count: Positive integer documenting the number of other customers who found this review positive.
- Division Name: Categorical name of the product's high-level division.
- Department Name: Categorical name of the product department.
- Class Name: Categorical name of the product class.

### Acknowledgements
Anonymous but real source

### Inspiration
I look forward to some quality NLP! There are also some great opportunities for feature engineering and multivariate analysis.

### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from catboost import CatBoostClassifier, Pool

### A first look at the data

In [2]:
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


### Preprocessing the dataset

In [3]:
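# Drop the unnamed first column (a row index stored in the CSV).
df.drop(df.columns[[0]], axis = 1, inplace = True)
# Missing titles and reviews are not filled; the string cast below turns them into the literal 'nan'.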
df[['Title', 'Review Text', 'Division Name', 'Department Name', 'Class Name', 'Clothing ID']] =\
df[['Title', 'Review Text', 'Division Name', 'Department Name', 'Class Name', 'Clothing ID']].astype(str)
text_col = ['Title', 'Review Text']
num_col = ['Positive Feedback Count']
cat_col = ['Division Name', 'Department Name', 'Class Name', 'Clothing ID']
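
To make the effect of the string cast concrete: a missing cell is a floating-point NaN, and casting it to a string yields the literal text `nan`, which a text feature would then treat as an ordinary word. The sketch below shows this in plain Python. The helper `clean_text_cell` is hypothetical (it is not part of this notebook) and illustrates one possible alternative, mapping missing cells to an empty string.

```python
import math

def clean_text_cell(value):
    """Hypothetical helper: return '' for missing cells, str(value) otherwise."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

missing = float("nan")
raw_cast = str(missing)             # what a plain string cast produces
cleaned = clean_text_cell(missing)  # what the hypothetical helper produces

# A tiny stand-in for the 'Title' column with two missing entries.
titles = [clean_text_cell(t) for t in [missing, "My favorite buy!", None]]
print(repr(raw_cast), repr(cleaned), titles)
```

Whether an empty string or the token `nan` works better for the model is an empirical question; the results reported in this notebook were obtained without such a helper.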

### Defining the target and feature variables

In [4]:
X = df.drop(columns=['Recommended IND', 'Rating'])
y_recom = df['Recommended IND']
y_star = df['Rating']

## Text preprocessing options (JSON-style dictionary)

In [5]:
txt_pre = [{"text_processing_options" : {
    "tokenizers" : [{
        "tokenizer_id" : "Sense",
        "lowercasing" : "True",
        "separator_type" : "BySense",
        "token_types" : ["Word"],
        "number_process_policy" : "Skip",
        "sub_tokens_policy" : "SeveralTokens",
        "languages" : ["english"]
    }],

    "dictionaries" : [{
        "dictionary_id" : "BiGram",
        "gram_order" : "2"
    }, {
        "dictionary_id" : "Word",
        "gram_order" : "1"
    }],

    "feature_processing" : {
        "default" : [{
            "dictionaries_names" : ["Word", "BiGram"],
            "feature_calcers" : ["BoW", "NaiveBayes", "BM25"],
            "tokenizers_names" : ["Sense"]
        }]
    }
}
}]
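
As a rough illustration of what the `Word` (gram order 1) and `BiGram` (gram order 2) dictionaries above count, the following sketch lowercases a review, keeps word tokens only, and tallies unigrams and bigrams. It is a simplified stand-in written in plain Python, not CatBoost's tokenizer; the regular expression and the functions `tokenize` and `ngrams` are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens; digits are skipped."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, order):
    """Return all runs of `order` consecutive tokens as tuples."""
    return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]

review = "Love this dress! Love this color, bought 2."
tokens = tokenize(review)

word_counts = Counter(ngrams(tokens, 1))    # analogous to gram_order 1
bigram_counts = Counter(ngrams(tokens, 2))  # analogous to gram_order 2

print(tokens)
print(bigram_counts.most_common(1))
```

A bag-of-words feature calcer then works from counts of this kind, so a repeated phrase such as "love this" contributes more than a phrase seen once.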

# Training on customer recommendations
### Splitting into train and test datasets

In [6]:
Xtr, Xts, ytr, yts = train_test_split(X, y_recom, test_size = 0.20, random_state = 123)
train_pool = Pool(data = Xtr, label = ytr, cat_features = cat_col, text_features = text_col)
test_pool = Pool(data = Xts, label = yts, cat_features = cat_col, text_features = text_col)
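
The sizes produced by an 80/20 split of the 23486 rows can be checked with a small standard-library sketch. The function `split_indices` is hypothetical and only mimics the idea of a seeded shuffle followed by a cut; it uses Python's own random generator, so the selected rows differ from those chosen by `train_test_split`, and only the set sizes are comparable. The test size is rounded up, which matches the support of 4698 in the reports in this notebook.

```python
import math
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle the row indices with a fixed seed and cut off the test share."""
    n_test = math.ceil(n_rows * test_size)  # the test share is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(23486, 0.20, seed=123)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, which is what allows the two targets in this notebook to be evaluated on the same held-out rows.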

### Fitting the model

In [7]:
clf = CatBoostClassifier(iterations = 20, logging_level = 'Silent', random_seed = 123, cat_features = cat_col,\
                         text_features = text_col, text_processing = txt_pre, task_type = 'GPU')
clf.fit(train_pool)

<catboost.core.CatBoostClassifier at 0x1a6d8b22080>

### Calculating the accuracy of the model

In [8]:
train_accuracy = clf.score(train_pool)
test_accuracy = clf.score(test_pool)
print ('The train accuracy and test accuracy are ',\
       '{:.1%}'.format(train_accuracy),' and ',\
       '{:.1%}'.format(test_accuracy), 'respectively.')
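ypred = clf.predict(Xts)
print(classification_report(yts, ypred))
# ypred holds hard 0/1 labels, so this value equals the mean of the two class recalls
# (balanced accuracy); a ranking-based AUC would need clf.predict_proba(Xts)[:, 1].
auc = roc_auc_score(yts, ypred)
print('Also, the area under ROC curve equals ', '{:.2}'.format(auc),'. \n', sep = '')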

The train accuracy and test accuracy are  94.0%  and  91.5% respectively.
              precision    recall  f1-score   support

           0       0.74      0.75      0.75       781
           1       0.95      0.95      0.95      3917

    accuracy                           0.92      4698
   macro avg       0.85      0.85      0.85      4698
weighted avg       0.92      0.92      0.92      4698

Also, the area under ROC curve equals 0.85. 
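
The reported area of 0.85 coincides with the mean of the two class recalls in the report, (0.75 + 0.95) / 2. This is expected whenever the ROC area is computed from hard 0/1 predictions rather than from predicted probabilities. The sketch below illustrates the effect on made-up toy data (the labels and probabilities are not taken from the model) with a small pairwise AUC function written for this example.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration only.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.45, 0.6, 0.3, 0.55, 0.8, 0.9]
hard = [int(p >= 0.5) for p in probs]

# Recall of each class from the hard predictions.
recall_pos = sum(h for y, h in zip(labels, hard) if y == 1) / labels.count(1)
recall_neg = sum(1 - h for y, h in zip(labels, hard) if y == 0) / labels.count(0)
balanced_accuracy = (recall_pos + recall_neg) / 2

auc_hard = auc(labels, hard)    # equals the balanced accuracy
auc_probs = auc(labels, probs)  # uses the full ranking of the scores

# The same identity applied to the rounded recalls in the report.
report_value = (0.75 + 0.95) / 2
print(auc_hard, auc_probs, round(report_value, 2))
```

On the toy data the probability-based value is higher than the hard-label value, because thresholding discards the ordering of examples within each predicted class.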



# Training on the number of stars given
### Splitting into train and test datasets

In [9]:
Xtr, Xts, ytr, yts = train_test_split(X, y_star, test_size = 0.20, random_state = 123)
train_pool = Pool(data = Xtr, label = ytr, cat_features = cat_col, text_features = text_col)
test_pool = Pool(data = Xts, label = yts, cat_features = cat_col, text_features = text_col)

### Fitting the model

In [10]:
clf = CatBoostClassifier(iterations = 20, logging_level = 'Silent', random_seed = 123, cat_features = cat_col,\
                         text_features = text_col, text_processing = txt_pre, task_type = 'GPU')
clf.fit(train_pool)

<catboost.core.CatBoostClassifier at 0x1a6d8b0ac88>

### Calculating the accuracy of the model

In [11]:
train_accuracy = clf.score(train_pool)
test_accuracy = clf.score(test_pool)
print ('The train accuracy and test accuracy are ',\
       '{:.1%}'.format(train_accuracy),' and ',\
       '{:.1%}'.format(test_accuracy), 'respectively.')
ypred = clf.predict(Xts)
print(classification_report(yts, ypred))

The train accuracy and test accuracy are  74.3%  and  66.5% respectively.
              precision    recall  f1-score   support

           1       0.45      0.25      0.32       163
           2       0.32      0.25      0.28       285
           3       0.40      0.50      0.44       549
           4       0.51      0.30      0.38      1042
           5       0.78      0.91      0.84      2659

    accuracy                           0.66      4698
   macro avg       0.49      0.44      0.45      4698
weighted avg       0.64      0.66      0.64      4698
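
The macro and weighted averages in the report can be reproduced from the per-class rows, which also shows why they differ: the macro average treats the five ratings equally, while the weighted average is dominated by the 5-star class. The sketch below uses the rounded recall and support values printed above, so its results are approximate. The majority-class baseline (always predicting 5 stars) is added here as a point of comparison and is not part of the original report.

```python
# Rounded per-class recall and support, copied from the report above.
recall = {1: 0.25, 2: 0.25, 3: 0.50, 4: 0.30, 5: 0.91}
support = {1: 163, 2: 285, 3: 549, 4: 1042, 5: 2659}

total = sum(support.values())

# Macro average: plain mean over classes.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by support.
# For recall this is the overall accuracy.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Accuracy of always predicting the most frequent rating.
majority_baseline = max(support.values()) / total

print(f"{macro_recall:.2f} {weighted_recall:.2f} {majority_baseline:.3f}")
```

Seen against this baseline, the 66.5% test accuracy is a moderate gain, and the low macro average shows that most of the errors fall on the rarer ratings.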

