## Women's E-Commerce Clothing Reviews

Prepared by Debora Callegari

### Imports

In [1]:
#Importing the necessary packages:

# Basic libraries
import numpy as np
import pandas as pd

# Graphs
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# WordCloud
from os import path
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Text preprocessing
import nltk
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Model and evaluation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from math import sqrt

# Regularization of Linear Model
from sklearn.linear_model import Ridge

# Ignoring the warnings
import warnings
warnings.filterwarnings('ignore')

### Step 6: Train - test split

As mentioned before, the goal is to predict the rating of women's clothing items from the text of their reviews.

To start, I will use only the review text rather than all the features. Once I have baseline results, I will consider feature selection and evaluate the models again.

In [61]:
clean_data.head(2)

Unnamed: 0,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,class_name,division_name_General,division_name_General Petite,division_name_Intimates,department_name_Bottoms,department_name_Dresses,department_name_Intimate,department_name_Jackets,department_name_Tops,department_name_Trend,lower_review
0,767,33,Null,Absolutely wonderful - silky and sexy and comf...,4,1,0,Intimates,0,0,1,0,0,1,0,0,0,absolutely wonderful - silky and sexy and comf...
1,1080,34,Null,Love this dress! it's sooo pretty. i happene...,5,1,4,Dresses,1,0,0,0,1,0,0,0,0,love this dress! it's sooo pretty. i happened ...


In [62]:
df = pd.DataFrame(clean_data[['lower_review', 'rating']])

In [63]:
df.head()

Unnamed: 0,lower_review,rating
0,absolutely wonderful - silky and sexy and comf...,4
1,love this dress! it's sooo pretty. i happened ...,5
2,i had such high hopes for this dress and reall...,3
3,"i love, love, love this jumpsuit. it's fun, fl...",5
4,this shirt is very flattering to all due to th...,5


In [64]:
# Applying the function 'text_clean'
df['clean_review'] = df['lower_review'].apply(text_clean)

In [65]:
df.head()

Unnamed: 0,lower_review,rating,clean_review
0,absolutely wonderful - silky and sexy and comf...,4,absolutely wonderful silky sexy comfortable
1,love this dress! it's sooo pretty. i happened ...,5,love dress sooo pretty happened find store im ...
2,i had such high hopes for this dress and reall...,3,high hopes dress really wanted work initially ...
3,"i love, love, love this jumpsuit. it's fun, fl...",5,love love love jumpsuit fun flirty fabulous ev...
4,this shirt is very flattering to all due to th...,5,shirt flattering due adjustable front tie perf...


In [66]:
# Separating the dataset into X and y for prediction
X = df['clean_review']
y = df['rating']

In [67]:
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (22628,)
y shape: (22628,)


In [68]:
# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

In [69]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (18102,)
y_train shape: (18102,)
X_test shape: (4526,)
y_test shape: (4526,)


### Step 7: Converting Text to Numbers 

As we know, machines, unlike humans, do not understand raw text.

Therefore, to proceed with this project I need to convert the text into numbers. There are various approaches for converting text into a numerical form. For this project, I will use Bag of Words.

#### 7.1. Bag of Words
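As a quick illustration of what Bag of Words produces, here is a minimal sketch on a two-sentence toy corpus. The sentences and the `toy_` names are hypothetical and are not taken from the review dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
toy_reviews = [
    "love dress love fit",
    "dress runs small",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)  # sparse matrix of word counts

# Recover the vocabulary in column order (columns are sorted alphabetically)
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(vocab)
print(toy_counts.toarray())
```

Each row is one review and each column counts how often one vocabulary word appears in it, so word order is discarded.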

In [70]:
# Creating CountVectorizer object
count_vect = CountVectorizer()

In [71]:
X_train_bow = count_vect.fit_transform(X_train)
X_train_bow = X_train_bow.toarray()

In [72]:
X_train_bow.shape

(18102, 17051)

In [73]:
X_test_bow = count_vect.transform(X_test).toarray()

In [74]:
X_test_bow.shape

(4526, 17051)

#### 7.1.1. Finding Term Frequency - Inverse Document Frequency (TF-IDF)

Below, I transform the bag-of-words matrix X_train_bow into TF-IDF weights.
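To show the effect of this transformation on a small scale, here is a minimal sketch using a hypothetical 3 x 3 count matrix (the `toy_` names and values are assumptions for illustration). It relies on the default `TfidfTransformer` settings, which down-weight terms that occur in many documents and scale every row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns).
# Term 0 occurs in every document; terms 1 and 2 occur in one document each.
toy_counts = np.array([
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
])

toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

# With the default settings, each row has unit L2 norm
row_norms = np.linalg.norm(toy_tfidf, axis=1)

print(toy_tfidf.round(3))
print(row_norms)
```

In the second row, terms 0 and 2 both have a count of 1, but the rarer term 2 receives the larger weight.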

In [75]:
# Creating TfidfTransformer object
tfidf_transformer = TfidfTransformer()

In [76]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_bow)
X_train_tfidf = X_train_tfidf.toarray()

In [77]:
X_train_tfidf.shape

(18102, 17051)

In [78]:
X_train_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [79]:
X_test_tfidf = tfidf_transformer.transform(X_test_bow).toarray()

In [80]:
X_test_tfidf.shape

(4526, 17051)

In [81]:
X_test_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### Step 8: Fitting Linear Regression model

After the steps above, the text reviews are represented as vectors, so we can train a regression model to predict the rating. 

I'll be using scikit-learn here, choosing the Linear Regression to start with.

In [83]:
lm_tfidf = LinearRegression()

In [84]:
# Fitting linear regression model into the training data
lm_tfidf.fit(X_train_tfidf, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [85]:
# The bias term (intercept)
bias_tfidf = lm_tfidf.intercept_

# coef_ is an array with one entry per feature; [0] keeps only the first coefficient
coefficient_tfidf = lm_tfidf.coef_[0]

print('Bias:', bias_tfidf)
print('Coefficients:', coefficient_tfidf)

Bias: 3.928981229305636
Coefficients: 2.083512892576648


In [86]:
lm_tfidf.intercept_ # value of beta_0

3.928981229305636

In [87]:
lm_tfidf.coef_ # values of beta_1 ... beta_p, one per feature

array([ 2.08351289e+00, -1.19742940e+00,  4.34722861e-01, ...,
       -2.33509820e+13, -1.10697478e+13,  1.14238739e+13])

In [88]:
print("Score on train data model was fitted to:", lm_tfidf.score(X_train_tfidf, y_train))
print("Score on test data (not used for fitting):", lm_tfidf.score(X_test_tfidf, y_test))

Score on train data model was fitted to: 0.821626730136805
Score on test data (not used for fitting): -5.293842132604825e+25


In [89]:
# Evaluating the model
y_pred_train_tfidf = lm_tfidf.predict(X_train_tfidf)
y_pred_test_tfidf = lm_tfidf.predict(X_test_tfidf)

In [90]:
# Regression Evaluation Metrics
print('MAE:', metrics.mean_absolute_error(y_train, y_pred_train_tfidf))
print('MSE:', metrics.mean_squared_error(y_train, y_pred_train_tfidf))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_tfidf)))

MAE: 0.3526187669308382
MSE: 0.22203490900896689
RMSE: 0.4712058032420302


In [91]:
# Regression Evaluation Metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_test_tfidf))
print('MSE:', metrics.mean_squared_error(y_test, y_pred_test_tfidf))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_tfidf)))

MAE: 2256636674713.957
MSE: 6.600894467516365e+25
RMSE: 8124588892686.426


The output above shows that the MSE, one of the three evaluation metrics reported, is around 0.22 on the training data but about 6.6e+25 on the test data. Likewise, the R-squared score is around 0.82 on the training data but a very large negative number (about -5.3e+25) on the test data. The model fits the training data well but fails to generalize, which means it is severely overfitting. The coefficients of order 1e+13 in `lm_tfidf.coef_` point the same way.
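To make the relationship between these metrics concrete, here is a minimal sketch that computes RMSE and R-squared by hand on hypothetical ratings and predictions (all values and names below are assumptions for illustration, not results from this project). It shows how wildly wrong predictions push R-squared far below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ratings and predictions, for illustration only
y_true = np.array([5, 4, 3, 5, 1])
y_good = np.array([4.5, 4.0, 3.5, 4.5, 2.0])
y_bad = y_true + 1e6  # predictions that are off by a huge amount

def rmse_and_r2(y, y_hat):
    """Return (RMSE, R-squared) computed from their definitions."""
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    rmse = np.sqrt(ss_res / len(y))         # RMSE is the square root of MSE
    return rmse, 1 - ss_res / ss_tot

rmse_good, r2_good = rmse_and_r2(y_true, y_good)
rmse_bad, r2_bad = rmse_and_r2(y_true, y_bad)

print(rmse_good, r2_good, r2_score(y_true, y_good))
print(rmse_bad, r2_bad)
```

R-squared is bounded above by 1 but has no lower bound, so a model whose errors dwarf the variance of the target can score arbitrarily far below zero.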

### Step 9: Regularization of Linear Regression

I will now apply regularization to the previous Linear Regression model.

#### 9.1. Ridge Regression

For this project, I will use Ridge Regression, where the loss function is modified to penalize model complexity by adding a term proportional to the sum of the squared coefficients (an L2 penalty). 

The strength of this penalty is controlled by a parameter called alpha, which I need to tune. A low alpha value can lead to over-fitting, while a high alpha value can lead to under-fitting.
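The effect of alpha can be sketched on synthetic data. The example below is an illustration under assumed settings (random data with more features than samples, and three arbitrary alpha values); it is not the project's data. It records the size of the coefficient vector for each alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Synthetic data with more features than samples (hypothetical)
X_toy = rng.normal(size=(30, 50))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=30)

coef_norms = {}
for alpha in [0.01, 1, 100]:
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    # L2 norm of the fitted coefficients
    coef_norms[alpha] = np.linalg.norm(model.coef_)

print(coef_norms)
```

A larger alpha shrinks the coefficients more strongly toward zero, which reduces variance at the cost of a less flexible fit.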

In [92]:
ridge_reg = Ridge(alpha = 0.01)
ridge_reg.fit(X_train_tfidf, y_train)

pred_train_ridge_reg = ridge_reg.predict(X_train_tfidf)

print(np.sqrt(mean_squared_error(y_train, pred_train_ridge_reg)))
print(r2_score(y_train, pred_train_ridge_reg))

pred_test_ridge_reg = ridge_reg.predict(X_test_tfidf)

print(np.sqrt(mean_squared_error(y_test,pred_test_ridge_reg))) 
print(r2_score(y_test, pred_test_ridge_reg))

0.39228574296508856
0.8763729341035184
1.0326981403220232
0.14470686438078995


In [102]:
ridge_reg = Ridge(alpha = 5)
ridge_reg.fit(X_train_tfidf, y_train)

pred_train_ridge_reg = ridge_reg.predict(X_train_tfidf)

print(np.sqrt(mean_squared_error(y_train, pred_train_ridge_reg)))
print(r2_score(y_train, pred_train_ridge_reg))

pred_test_ridge_reg = ridge_reg.predict(X_test_tfidf)

print(np.sqrt(mean_squared_error(y_test,pred_test_ridge_reg))) 
print(r2_score(y_test, pred_test_ridge_reg))

0.7195771424728874
0.5840288300922226
0.7848992537444788
0.5059214168090347


Considering the output above for alpha equal to 5, the RMSE and R-squared values for the Ridge Regression model on the training data are 0.72 and 58.40 percent, respectively. For the test data, these metrics are 0.78 and 50.59 percent, respectively. With alpha equal to 0.01, the training R-squared was higher (87.64 percent) but the test R-squared was only 14.47 percent, so the larger alpha generalizes better.

### Step 10: Comparing the results

Based on the previous results, the performance of the Linear Regression and Ridge Regression models is summarized below:

1. Linear Regression Model: Test set RMSE of about 8.12e+12 (MSE of 6.600894467516365e+25) and R-squared of -5.293842132604825e+25.

2. Ridge Regression Model (alpha = 5): Test set RMSE of 0.78 and R-squared of 50.59 percent.

The Linear Regression model performs far worse on the test set. The Ridge Regression model performs much better, with a positive test R-squared and a test RMSE below one rating point.

### Step 11: Feature Selection

In [106]:
from sklearn import feature_selection

In [107]:
clean_data.head(3)

Unnamed: 0,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,class_name,division_name_General,division_name_General Petite,division_name_Intimates,department_name_Bottoms,department_name_Dresses,department_name_Intimate,department_name_Jackets,department_name_Tops,department_name_Trend,lower_review
0,767,33,Null,Absolutely wonderful - silky and sexy and comf...,4,1,0,Intimates,0,0,1,0,0,1,0,0,0,absolutely wonderful - silky and sexy and comf...
1,1080,34,Null,Love this dress! it's sooo pretty. i happene...,5,1,4,Dresses,1,0,0,0,1,0,0,0,0,love this dress! it's sooo pretty. i happened ...
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,Dresses,1,0,0,0,1,0,0,0,0,i had such high hopes for this dress and reall...


In [108]:
clean_data.columns

Index(['clothing_id', 'age', 'title', 'review_text', 'rating',
       'recommended_ind', 'positive_feedback_count', 'class_name',
       'division_name_General', 'division_name_General Petite',
       'division_name_Intimates', 'department_name_Bottoms',
       'department_name_Dresses', 'department_name_Intimate',
       'department_name_Jackets', 'department_name_Tops',
       'department_name_Trend', 'lower_review'],
      dtype='object')

In [109]:
features_to_include = ['age','recommended_ind', 'positive_feedback_count', 'division_name_General', 
                       'division_name_General Petite','division_name_Intimates', 'department_name_Bottoms',
                       'department_name_Dresses', 'department_name_Intimate',
                       'department_name_Jackets', 'department_name_Tops','department_name_Trend']

In [110]:
feature_importances = feature_selection.mutual_info_regression(clean_data[features_to_include], clean_data['rating'])

In [111]:
feat_importance_df = pd.DataFrame(list(zip(features_to_include,feature_importances)), 
                                  columns=['feature','importance'])

In [112]:
feat_importance_df.sort_values(by='importance', ascending=False, inplace=True)

In [113]:
top_feats = [x for x in feat_importance_df['feature'][0:5]]
top_feats

['recommended_ind',
 'department_name_Dresses',
 'department_name_Trend',
 'division_name_General Petite',
 'department_name_Jackets']

In [114]:
feature_importances

array([0.        , 0.33393926, 0.        , 0.00244024, 0.00402106,
       0.        , 0.        , 0.00879674, 0.        , 0.00356997,
       0.        , 0.00471641])

In [115]:
feat_importance_df

Unnamed: 0,feature,importance
1,recommended_ind,0.333939
7,department_name_Dresses,0.008797
11,department_name_Trend,0.004716
4,division_name_General Petite,0.004021
9,department_name_Jackets,0.00357
3,division_name_General,0.00244
0,age,0.0
2,positive_feedback_count,0.0
5,division_name_Intimates,0.0
6,department_name_Bottoms,0.0


#### 11.1. Fitting Linear Regression Model

In [116]:
linear_reg = LinearRegression()

In [117]:
regfit = linear_reg.fit(clean_data[top_feats], clean_data['rating'])

In [118]:
regfit.coef_

array([ 2.29471354, -0.01806347, -0.1797236 ,  0.00814313,  0.03335014])

In [119]:
pd.DataFrame(list(zip(top_feats,regfit.coef_)), columns=['feature', 'coef'])

Unnamed: 0,feature,coef
0,recommended_ind,2.294714
1,department_name_Dresses,-0.018063
2,department_name_Trend,-0.179724
3,division_name_General Petite,0.008143
4,department_name_Jackets,0.03335


In [120]:
regfit.score(clean_data[top_feats], clean_data['rating'])

0.6283646728902714

In [121]:
y_fit = regfit.predict(clean_data[top_feats])

In [122]:
r2_score(clean_data['rating'], y_fit)

0.6283646728902714

### Future approaches

Considering all the previous results, regularization brought some improvement in this preliminary project, but there is still room to do better.

Hence, I believe the following approaches could provide better results and expand the goals of this project:

- Analyze the importance of each feature: Examine the coefficients, the importance of each feature for the model, and the correlation (direct or indirect) between each feature and the target variable, then see how the model performs with this information. I believe it is important to add features based on the feature selection in Step 11, because they carry relevant information about the target and would likely improve the final results.


- Cross-validation: Another technique that can be tried, along with feature selection, to improve the results.


- Trying some approaches for imbalanced data: The dataset shows imbalanced classes for some features, so I believe it is important to try a couple of options for handling imbalanced classes. Even though information could be lost in the process, trying different techniques and comparing the final results is another approach to take.


- Tuning the regularization parameter: In this case, I tried alpha equal to 0.01 and 5. It would be interesting to try other values to improve model performance. A better way to do this is a systematic hyperparameter search to arrive at the optimal alpha value.
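The cross-validation and alpha-tuning ideas above could be combined with scikit-learn's `GridSearchCV`. The following is a minimal sketch on synthetic data; the data, the `toy_` names, the alpha grid, and the choice of 5 folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(101)

# Synthetic stand-in for the TF-IDF features and ratings (hypothetical)
X_toy = rng.normal(size=(200, 20))
y_toy = X_toy[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=200)

# Candidate alpha values (an arbitrary grid for illustration)
alpha_grid = {"alpha": [0.01, 0.1, 1, 5, 10, 100]}

# 5-fold cross-validation; scikit-learn maximizes scores, hence the negated MSE
search = GridSearchCV(Ridge(), alpha_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["alpha"]
best_cv_rmse = np.sqrt(-search.best_score_)

print(best_alpha, best_cv_rmse)
```

In this notebook, the same search would be fitted on `X_train_tfidf` and `y_train`, keeping the test set aside for a final check of the chosen alpha.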