# Kaggle ecommerce challenge (Improvisation)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import os
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import pandas_profiling
%matplotlib inline

In [2]:
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [3]:
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
import nltk
from nltk.corpus import stopwords
from nltk import sent_tokenize, word_tokenize
import re

In [4]:
top_N = 100
# Concatenate all reviews into one lowercase string

a = df['Review Text'].str.lower().str.cat(sep=' ')

# Replace every run of non-letter characters (punctuation, digits) with a space
b = re.sub('[^A-Za-z]+', ' ', a)

# Merge the stop-word lists from stop_words and NLTK into one set for fast lookups
stop_words = set(get_stop_words('en'))
stop_words.update(stopwords.words('english'))

word_tokens = word_tokenize(b)
# Drop stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Drop tokens of one or two characters
without_single_chr = [word for word in filtered_sentence if len(word) > 2]

# Drop purely numeric tokens (a safeguard; the regex above already removed digits)
cleaned_data_title = [word for word in without_single_chr if not word.isnumeric()]

In [5]:
df=df.dropna(axis=0,how='any')
rating_class = df[(df['Rating'] == 1) | (df['Rating'] == 5)]
X_review=rating_class['Review Text']
y=rating_class['Rating']

In [6]:
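import string

# Build the stop-word set once instead of reloading the list for every word
ENGLISH_STOPWORDS = set(stopwords.words('english'))

def text_process(review):
    # Drop punctuation characters, then split on whitespace and drop stop words
    nopunc = ''.join(ch for ch in review if ch not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in ENGLISH_STOPWORDS]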

## There are two things I would like to try here in order to increase the in-sample accuracy and the precision and recall

1) Perform hyperparameter tuning. The only hyperparameter that comes to mind when tuning a Naive Bayes classifier is the smoothing factor alpha, which defaults to 1 (add-one smoothing). With no smoothing at all (alpha = 0) the model overfits, because any word never seen with a class gets a zero probability for that class. So the value of alpha has to be chosen carefully

2) Look into a different vectorization technique, called TF-IDF vectorization
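
To make the effect of alpha concrete, here is a minimal sketch of additive smoothing on a made-up class with three word occurrences. The vocabulary, the token list and the helper name `smoothed_likelihoods` are illustrative assumptions and not part of this notebook's pipeline; the estimate used is the usual (count + alpha) / (total + alpha * vocabulary size).

```python
from collections import Counter

def smoothed_likelihoods(tokens, vocabulary, alpha):
    """Estimate P(word | class) with additive smoothing.

    tokens: every word occurrence observed for one class (all in vocabulary)
    alpha:  smoothing factor; 1.0 is add-one (Laplace) smoothing
    """
    counts = Counter(tokens)
    denominator = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

# Hypothetical toy data: words seen in 1-star reviews
vocabulary = ["love", "perfect", "cheap", "returned"]
one_star_tokens = ["cheap", "returned", "returned"]

probs_default = smoothed_likelihoods(one_star_tokens, vocabulary, alpha=1.0)
probs_tuned = smoothed_likelihoods(one_star_tokens, vocabulary, alpha=0.3)
probs_unsmoothed = smoothed_likelihoods(one_star_tokens, vocabulary, alpha=0.0)

for name, probs in [("alpha=1.0", probs_default),
                    ("alpha=0.3", probs_tuned),
                    ("alpha=0.0", probs_unsmoothed)]:
    print(name, {word: round(p, 3) for word, p in probs.items()})
```

With alpha = 0 the unseen word "love" gets probability exactly zero, so a single occurrence of it would rule out the class entirely. A smaller positive alpha keeps unseen words possible while staying closer to the observed counts.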

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
bow_transformer=CountVectorizer(analyzer=text_process).fit(X_review)
X_review = bow_transformer.transform(X_review)

## Multinomial Naive Bayes Classifier 

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)
nb = MultinomialNB()
nb.fit(X_train, y_train) 
nb.score(X_train,y_train)

0.9706828302820386

In [9]:
predict = nb.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  65  134]
 [  14 3252]]


             precision    recall  f1-score   support

          1       0.82      0.33      0.47       199
          5       0.96      1.00      0.98      3266

avg / total       0.95      0.96      0.95      3465



### Hyperparameter tuning for Naive Bayes (Laplace smoothing factor)

In [10]:
from sklearn.model_selection import GridSearchCV
parameters = {'alpha':[i/10 for i in range(1,10)]}
gridsearch = GridSearchCV(nb,param_grid = parameters,scoring='accuracy',cv=10)
gridsearch = gridsearch.fit(X_train, y_train) 
best_accuracy = gridsearch.best_score_
best_paramters = gridsearch.best_params_

In [11]:
best_paramters

{'alpha': 0.3}

In [12]:
nb = MultinomialNB(alpha=best_paramters['alpha'])
nb.fit(X_train, y_train) 
nb.score(X_train,y_train)

0.989732805541811

In [13]:
predict = nb.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[ 122   77]
 [  63 3203]]


             precision    recall  f1-score   support

          1       0.66      0.61      0.64       199
          5       0.98      0.98      0.98      3266

avg / total       0.96      0.96      0.96      3465



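#### Hence, weighted-average precision increases by 1% (0.95 to 0.96), and recall for the 1-star class rises from 0.33 to 0.61, at the cost of lower 1-star precision (0.82 to 0.66)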
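
The figures in the two classification reports above can be re-derived from the printed confusion matrices. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes); the helper names are made up for illustration.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, support)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this class
        actual = sum(cm[i])                    # row sum: everything that truly is this class
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        metrics[label] = (precision, recall, actual)
    return metrics

def weighted_precision(metrics):
    """Support-weighted average precision, as in the 'avg / total' row."""
    total = sum(support for _, _, support in metrics.values())
    return sum(precision * support for precision, _, support in metrics.values()) / total

labels = [1, 5]
cm_default = [[65, 134], [14, 3252]]   # alpha = 1.0
cm_tuned = [[122, 77], [63, 3203]]     # alpha = 0.3

m_default = per_class_metrics(cm_default, labels)
m_tuned = per_class_metrics(cm_tuned, labels)

for name, m in [("default", m_default), ("tuned", m_tuned)]:
    precision, recall, _ = m[1]
    print(name, "class 1 precision", round(precision, 2), "recall", round(recall, 2),
          "weighted precision", round(weighted_precision(m), 2))
```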

### TF IDF Vectorization technique with Multinomial Naive Bayes 

In [14]:
top_N = 100
# Concatenate all reviews into one lowercase string

a = df['Review Text'].str.lower().str.cat(sep=' ')

# Replace every run of non-letter characters (punctuation, digits) with a space
b = re.sub('[^A-Za-z]+', ' ', a)

# Merge the stop-word lists from stop_words and NLTK into one set for fast lookups
stop_words = set(get_stop_words('en'))
stop_words.update(stopwords.words('english'))

word_tokens = word_tokenize(b)
# Drop stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Drop tokens of one or two characters
without_single_chr = [word for word in filtered_sentence if len(word) > 2]

# Remove numbers
cleaned_data_title = [word for word in without_single_chr if not word.isnumeric()]  
df=df.dropna(axis=0,how='any')
rating_class = df[(df['Rating'] == 1) | (df['Rating'] == 5)]
X_review=rating_class['Review Text']
y=rating_class['Rating']
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
bow_transformer=TfidfVectorizer(analyzer=text_process).fit(X_review)
X_review = bow_transformer.transform(X_review)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)
nb = MultinomialNB()
nb.fit(X_train, y_train) 
nb.score(X_train,y_train)

0.9391390400791687

In [16]:
predict = nb.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[   0  199]
 [   0 3266]]


             precision    recall  f1-score   support

          1       0.00      0.00      0.00       199
          5       0.94      1.00      0.97      3266

avg / total       0.89      0.94      0.91      3465



In [17]:
from sklearn.model_selection import GridSearchCV
parameters = {'alpha':[i/10 for i in range(1,10)]}
gridsearch = GridSearchCV(nb,param_grid = parameters,scoring='accuracy',cv=10)
gridsearch = gridsearch.fit(X_train, y_train) 
best_accuracy = gridsearch.best_score_
best_paramters = gridsearch.best_params_

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)
nb = MultinomialNB(alpha = best_paramters['alpha'])
nb.fit(X_train, y_train) 
nb.score(X_train,y_train)

0.9701880257298368

In [19]:
predict = nb.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  44  155]
 [   9 3257]]


             precision    recall  f1-score   support

          1       0.83      0.22      0.35       199
          5       0.95      1.00      0.98      3266

avg / total       0.95      0.95      0.94      3465



#### Hence, even at its best smoothing parameter, this technique reaches only 95% weighted precision and recall, which is lower than the 96% we got with the Count Vectorizer technique above

#### Frankly speaking, TF-IDF vectorization is not a natural match for Naive Bayes: the multinomial NB classifier models features as word counts that are independent of each other given the class, and TF-IDF replaces those counts with fractional, normalized weights
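
To see what TF-IDF does to the raw counts, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalization of each row). The three toy reviews and the function name `tfidf_rows` are made up for illustration, and tokenization is a plain whitespace split rather than `text_process`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["love this dress", "love this fabric", "love it"]
rows = tfidf_rows(docs)
print({term: round(weight, 3) for term, weight in rows[0].items()})
```

In the first toy review every word has a raw count of 1, yet "love" (present in all three reviews) ends up with the smallest weight and "dress" (present in one) with the largest. The resulting values are fractions rather than counts.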

## (1) Linear Support Vector Classifier with TF IDF Vectorized data

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train) 
svc.score(X_train,y_train)

0.9909698169223157

In [21]:
predict = svc.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  98  101]
 [   8 3258]]


             precision    recall  f1-score   support

          1       0.92      0.49      0.64       199
          5       0.97      1.00      0.98      3266

avg / total       0.97      0.97      0.96      3465



### Let's perform hyperparameter tuning for the penalty factor of the support vector classifier and then use the best penalty factor (C) for our prediction

In [22]:
from sklearn.model_selection import GridSearchCV
parameters = {'C':[i/10 for i in range(1,10)]}
gridsearch = GridSearchCV(svc,param_grid = parameters,scoring='accuracy',cv=10)
gridsearch = gridsearch.fit(X_train, y_train) 
best_accuracy = gridsearch.best_score_
best_paramters = gridsearch.best_params_

In [23]:
best_paramters

{'C': 0.9}

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)
from sklearn.svm import SVC
svc = SVC(kernel='linear',C=best_paramters['C'])
svc.fit(X_train, y_train) 
svc.score(X_train,y_train)

0.9889905987135081

##### Training accuracy stays at about 0.99, essentially the same as with the default C, so the tuning introduced no overfitting as such

In [25]:
predict = svc.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  91  108]
 [   8 3258]]


             precision    recall  f1-score   support

          1       0.92      0.46      0.61       199
          5       0.97      1.00      0.98      3266

avg / total       0.97      0.97      0.96      3465



#### We get 2% better precision and 1% better recall than the untuned Multinomial Naive Bayes classifier on count vectors (0.97 / 0.97 versus 0.95 / 0.96)

#### I tried count vectorization with SVM and was getting precision and recall of 96%. So, TF-IDF with SVM is a better match for this data

#### Grid search picked C = 0.9, the largest value in the grid; the default C = 1.0 was not among the candidates. Dropping the penalty by 0.1 left the weighted averages unchanged (recall on the 1-star class moved from 0.49 to 0.46), so the default penalty was not noticeably restricting the classifier
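
For reference, the role of the penalty factor C can be read off the standard soft-margin objective, written here as a sketch with labels recoded to y_i in {-1, +1} and slack variables ξ_i (the symbols are the textbook ones, not names from this notebook):

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0 .
```

A larger C makes margin violations more expensive, which tends to narrow the margin; a smaller C tolerates more violations in exchange for a wider margin.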

## (2) Decision Trees for Classification with Count Vectorization

In [26]:
top_N = 100
# Concatenate all reviews into one lowercase string

a = df['Review Text'].str.lower().str.cat(sep=' ')

# Replace every run of non-letter characters (punctuation, digits) with a space
b = re.sub('[^A-Za-z]+', ' ', a)

# Merge the stop-word lists from stop_words and NLTK into one set for fast lookups
stop_words = set(get_stop_words('en'))
stop_words.update(stopwords.words('english'))

word_tokens = word_tokenize(b)
# Drop stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Drop tokens of one or two characters
without_single_chr = [word for word in filtered_sentence if len(word) > 2]

# Remove numbers
cleaned_data_title = [word for word in without_single_chr if not word.isnumeric()]  
df=df.dropna(axis=0,how='any')
rating_class = df[(df['Rating'] == 1) | (df['Rating'] == 5)]
X_review=rating_class['Review Text']
y=rating_class['Rating']
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer=CountVectorizer(analyzer=text_process).fit(X_review)
X_review = bow_transformer.transform(X_review)

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion="entropy",max_depth=10,min_samples_split=4)
dtc.fit(X_train, y_train) 
dtc.score(X_train,y_train)

0.9651162790697675

In [28]:
predict = dtc.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  66  133]
 [  52 3214]]


             precision    recall  f1-score   support

          1       0.56      0.33      0.42       199
          5       0.96      0.98      0.97      3266

avg / total       0.94      0.95      0.94      3465



### An important thing to note here: I am restricting how far this tree can grow by keeping max_depth = 10

### If I don't do this, the model will overfit. But in order to make an informed decision about max_depth, let's tune this hyperparameter using grid search cross-validation

In [29]:
from sklearn.model_selection import GridSearchCV
parameters = {'max_depth':[i for i in range(10,30)]}
gridsearch = GridSearchCV(dtc,param_grid = parameters,scoring='accuracy',cv=10)
gridsearch = gridsearch.fit(X_train, y_train) 
best_accuracy = gridsearch.best_score_
best_paramters = gridsearch.best_params_

In [30]:
best_paramters

{'max_depth': 15}

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X_review, y, test_size=0.3, random_state=101)
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion="entropy",max_depth=best_paramters['max_depth'],
                             min_samples_split=3)
dtc.fit(X_train, y_train) 
dtc.score(X_train,y_train)
# Just have to make sure that this tuning didn't result into overfitting

0.9768678871845621

In [32]:
predict = dtc.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  68  131]
 [  55 3211]]


             precision    recall  f1-score   support

          1       0.55      0.34      0.42       199
          5       0.96      0.98      0.97      3266

avg / total       0.94      0.95      0.94      3465



### Since decision trees have low bias and high variance, they can be enhanced using bootstrap aggregating
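
As a rough illustration of bootstrap aggregating, the sketch below trains several one-split "stumps" on bootstrap samples of a tiny made-up data set and combines them by majority vote. The data, the single numeric feature and all function names are hypothetical; `BaggingClassifier` does the same kind of resampling and voting around the decision tree in the next cell.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(sample):
    """Pick the threshold on x that misclassifies the fewest points in the sample.

    The stump predicts rating 5 when x >= threshold and rating 1 otherwise.
    """
    best_errors, best_threshold = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((5 if x >= threshold else 1) != y for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold

def bagged_predict(thresholds, x):
    """Majority vote over the stumps."""
    votes = [5 if x >= t else 1 for t in thresholds]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical data: (sentiment score, rating), with one noisy point at 0.35
data = [(0.1, 1), (0.3, 1), (0.35, 5), (0.4, 1),
        (0.45, 5), (0.6, 5), (0.8, 5), (0.9, 5)]

rng = random.Random(101)
# An odd number of stumps avoids tied votes between the two classes
thresholds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(11)]

print(sorted(thresholds))
print(bagged_predict(thresholds, 0.95), bagged_predict(thresholds, 0.05))
```

Each stump sees a slightly different sample, so the learned thresholds differ; voting averages out that variation, which is the variance reduction bagging is meant to provide.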

In [33]:
from sklearn.ensemble import BaggingClassifier
# Note: scikit-learn 1.2 and later name this parameter estimator instead of base_estimator
bg = BaggingClassifier(base_estimator = dtc, n_estimators = 10)
bg.fit(X_train, y_train) 
bg.score(X_train,y_train)

0.9748886689757545

In [34]:
predict = bg.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  63  136]
 [  55 3211]]


             precision    recall  f1-score   support

          1       0.53      0.32      0.40       199
          5       0.96      0.98      0.97      3266

avg / total       0.93      0.94      0.94      3465



### Important Points to Note

1) Bagging did not improve on our tuned decision tree (weighted precision and recall of 0.93 / 0.94 versus 0.94 / 0.95), and neither model gave a better result than Naive Bayes

2) I tried Adaptive Boost with the decision tree as base model, but not much changed

3) Random forest (a type of bagging technique) was of no help either

4) One serious problem with decision trees and ensemble methods (simple bagging, random forest and adaptive boosting) is that they are prone to overfitting, so hyperparameter tuning based on accuracy can cause major issues when the model is used on out-of-sample data
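
Point 4 is sharper on imbalanced data like this (199 one-star versus 3266 five-star test reviews), because accuracy is dominated by the majority class. The sketch below compares plain accuracy with balanced accuracy (the mean of the per-class recalls) on the AdaBoost and Random Forest confusion matrices printed in cells In [35] and In [36]; the helper names are made up for illustration.

```python
def accuracy(cm):
    """Fraction of all samples on the diagonal (rows: true class, columns: predicted)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def balanced_accuracy(cm):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)

cm_ada = [[71, 128], [49, 3217]]   # AdaBoost, cell In [35]
cm_rf = [[44, 155], [10, 3256]]    # Random Forest, cell In [36]

for name, cm in [("AdaBoost", cm_ada), ("Random Forest", cm_rf)]:
    print(name, "accuracy", round(accuracy(cm), 3),
          "balanced accuracy", round(balanced_accuracy(cm), 3))
```

Random Forest has the slightly higher accuracy, yet its balanced accuracy is lower because it recovers fewer of the 1-star reviews. A grid search scored on accuracy alone would not see that difference.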

In [35]:
from sklearn.ensemble import AdaBoostClassifier
# Note: scikit-learn 1.2 and later name this parameter estimator instead of base_estimator
ada = AdaBoostClassifier(base_estimator = dtc, n_estimators = 10)
ada.fit(X_train, y_train) 
predict = ada.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  71  128]
 [  49 3217]]


             precision    recall  f1-score   support

          1       0.59      0.36      0.45       199
          5       0.96      0.98      0.97      3266

avg / total       0.94      0.95      0.94      3465



In [36]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 10,random_state = 101)
rfc.fit(X_train, y_train) 
predict = rfc.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predict))
print('\n')
print(classification_report(y_test, predict))

[[  44  155]
 [  10 3256]]


             precision    recall  f1-score   support

          1       0.81      0.22      0.35       199
          5       0.95      1.00      0.98      3266

avg / total       0.95      0.95      0.94      3465



#### Among these ensemble methods, Random Forest gives the best weighted precision and recall (0.95 / 0.95), although its recall on the 1-star class (0.22) is the lowest of the three

#### However, I will not delve into hyperparameter tuning of random forests

#### The reason is that there are multiple parameters, and tuning them too closely to this data carries a risk of overfitting


## Conclusion 

1) Hyperparameter tuning of Laplace smoothing (alpha = 0.3) in Multinomial Naive Bayes increases the weighted precision by 1% (0.95 to 0.96) and raises recall on the 1-star class from 0.33 to 0.61

2) The Linear Support Vector Classifier gave precision and recall of 97%, with no sign of overfitting on the training set. Grid search cross-validation selected C = 0.9, which is 0.1 below the default penalty parameter

3) Random Forest gives the same precision as the untuned Multinomial Naive Bayes but 1% lower recall. That said, from my experience I can say that ensemble methods are great (be it bagging or boosting).

4) Another thing to note is that Count Vectorization goes well with Multinomial Naive Bayes (MNB) because it supplies the word counts that the multinomial model is built around; MNB also assumes that features are independent of each other given the class. SVM makes no such assumption: it classifies using a separating hyperplane, and TF-IDF reduces the weight of words that appear in many documents. This makes it easier for SVM to look at the bigger picture while classifying.

5) There is a compelling reason why discriminative classifiers (such as logistic regression and SVM) are often preferred over generative classifiers (such as Naive Bayes). Discriminative classifiers learn the decision boundary (or the posterior probability of the class given the features) directly, without modeling the class-conditional distribution of the features and the prior probability as an intermediate step. This is likely the reason that SVM worked really well here. This is something that Vapnik suggested, and Andrew Ng co-authored a paper with a comparative analysis.

This was the basic motivation for using SVM and Random Forest, because they are discriminative models. Discriminative classifiers usually beat generative classifiers when the data set is large.