# Predicting Reviewer Birth Decade from E-Commerce Reviews
by Akhila Ashokan, November 2020 

***

**Task:** Predict the reviewer’s birth decade (1980s, 1990s, 2000s, etc.) using only text features. This task is a form of author profiling, in which text data is used to infer demographic information about the author (e.g., age, gender, or personality).

**Data Set:** Kaggle's [Women's E-Commerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) Data Set

**Programming Language:** Python 3.5 in the Jupyter Notebook Environment


*** 


In [57]:
# General libraries
import pandas as pd

# Data Visualization 
import plotly.express as px

# ML Libraries 
from sklearn import preprocessing 

# Warnings
import warnings
warnings.filterwarnings('ignore')

## Data Exploration 

The first step in any ML task is to get to know the data set. Here, I explore the data set to get a sense of its size, find its missing values, and visualize its features.

In [36]:
# import data set 
data_set = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
print("Data Set has " + str(data_set.shape[0]) + " rows and " + str(data_set.shape[1]) + " columns.")
data_set.head()

Data Set has 23486 rows and 11 columns.


| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [37]:
# check data type and non-null count of each column
data_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


Now, I will check what percentage of each field is missing.
Note that Title is missing 16.2% of its values and Review Text is missing 3.6%.

In [38]:
# percentage of missing values in each column
data_set.isnull().sum() / len(data_set) * 100

Unnamed: 0                  0.000000
Clothing ID                 0.000000
Age                         0.000000
Title                      16.222430
Review Text                 3.597888
Rating                      0.000000
Recommended IND             0.000000
Positive Feedback Count     0.000000
Division Name               0.059610
Department Name             0.059610
Class Name                  0.059610
dtype: float64
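
The same percentages can also be computed as the mean of the boolean missing-value mask. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real data) to show the idea:

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    "Title": ["Love it!", None, "Great fit", None],
    "Review Text": ["Nice dress", "Runs small", None, "Soft fabric"],
    "Age": [33, 34, 60, 50],
})

# isna() gives a boolean frame; the mean of each column is the fraction missing
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```

Both forms should give the same numbers; the mean-based version just avoids the explicit division by the row count.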

Convert age to birth year and then to birth decade. Note: the data set has no review date column, so to determine each reviewer's birth year I assume that all reviews were collected within a single year (2018 in the code below) rather than over several decades.

In [39]:
# convert age into birth decade 
data_set['Year'] = 2018
data_set['Birth Year'] = data_set['Year'] - data_set['Age']
data_set['Birth Decade'] = data_set['Birth Year'] - (data_set['Birth Year']%10)
data_set.head()

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name | Year | Birth Year | Birth Decade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates | 2018 | 1985 | 1980 |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses | 2018 | 1984 | 1980 |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses | 2018 | 1958 | 1950 |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants | 2018 | 1968 | 1960 |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses | 2018 | 1971 | 1970 |
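
The conversion can also be wrapped in a small helper, which makes the reference-year assumption explicit. This is a minimal sketch: `birth_decade` and its `reference_year` parameter are hypothetical names, and 2018 is the same assumed collection year as above. Keep in mind that an age only pins the birth year down to within one year (someone aged 33 in 2018 was born in 1984 or 1985), so reviewers near a decade boundary may land in the neighboring decade.

```python
import pandas as pd

def birth_decade(age, reference_year=2018):
    """Map an age to a birth decade, assuming the review was written in reference_year."""
    birth_year = reference_year - age
    # integer-divide by 10 and scale back up to drop the last digit of the year
    return birth_year // 10 * 10

# ages of the first five reviewers shown above
ages = pd.Series([33, 34, 60, 50, 47])
decades = birth_decade(ages)
print(decades.tolist())  # [1980, 1980, 1950, 1960, 1970]
```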


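The defined task requires that I exclude the numerical features from the model, so I will remove columns such as Clothing ID, Rating, Recommended IND, and Positive Feedback Count from the dataframe before I continue exploring this data set. I am keeping Age, Birth Year, and Birth Decade because they define the prediction target, along with the categorical fields Division Name, Department Name, and Class Name.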

In [40]:
# exclude numerical features; copy so that columns added later do not write to a slice
data_set = data_set[['Age', 'Birth Year', 'Birth Decade', 'Title', 'Review Text', 'Division Name', 'Department Name', 'Class Name']].copy()
data_set.dtypes

Age                 int64
Birth Year          int64
Birth Decade        int64
Title              object
Review Text        object
Division Name      object
Department Name    object
Class Name         object
dtype: object

In [41]:
# summarize all information about data set 
data_set.describe(include='all')  

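| | Age | Birth Year | Birth Decade | Title | Review Text | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| count | 23486.0 | 23486.0 | 23486.0 | 19676 | 22641 | 23472 | 23472 | 23472 |
| unique | NaN | NaN | NaN | 13993 | 22634 | 3 | 6 | 20 |
| top | NaN | NaN | NaN | Love it! | Perfect fit and i've gotten so many compliment... | General | Tops | Dresses |
| freq | NaN | NaN | NaN | 136 | 3 | 13850 | 10468 | 6319 |
| mean | 43.198544 | 1974.801456 | 1970.314656 | NaN | NaN | NaN | NaN | NaN |
| std | 12.279544 | 12.279544 | 12.573437 | NaN | NaN | NaN | NaN | NaN |
| min | 18.0 | 1919.0 | 1910.0 | NaN | NaN | NaN | NaN | NaN |
| 25% | 34.0 | 1966.0 | 1960.0 | NaN | NaN | NaN | NaN | NaN |
| 50% | 41.0 | 1977.0 | 1970.0 | NaN | NaN | NaN | NaN | NaN |
| 75% | 52.0 | 1984.0 | 1980.0 | NaN | NaN | NaN | NaN | NaN |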


Next, I visualize the data to see what patterns emerge. First, I want to see which birth decade is the most common.

In [53]:
px.histogram(data_set, x = 'Birth Decade', marginal = 'box')

In [51]:
data_set['Birth Decade'].unique()

array([1980, 1950, 1960, 1970, 1990, 1930, 1940, 1920, 1910, 2000],
      dtype=int64)

In [85]:
# word count of each review (NaN where the review text is missing)
data_set['Word Count in Review'] = data_set['Review Text'].str.split().str.len()
data_set['Word Count in Review'].unique()

array([  8.,  62.,  98.,  22.,  36., 101.,  97.,  34.,  72.,  66.,  91.,
        69.,  96.,  95.,  73.,  58.,  33.,  57.,  60., 105.,  85.,  41.,
        94.,  47.,  32.,  84.,  17.,  18.,  79.,  52.,  29.,  70.,  16.,
        23.,  26.,  50.,  48.,  39.,  10.,  15.,  49.,  37.,   9.,  86.,
        30., 100.,  51., 103.,  21.,  12.,  43.,  55.,  89.,  99., 109.,
        87.,  93., 102.,  77.,  82.,  81.,  nan,  45.,  13.,  54.,   7.,
        88.,  40.,  92.,  64.,  24.,  71.,  80.,  76.,  75.,  56.,  61.,
       104.,  25.,  90.,  63.,  46.,  67.,  14.,  38.,  11.,  31.,  42.,
        78.,  35.,  44.,  83.,  19.,  53., 106., 107.,  74.,  65.,  59.,
        27.,  68.,  20., 108.,  28.,   6.,   2., 111.,   4., 110.,   5.,
         3., 115., 113., 112., 114.])

In [87]:
# word count of each title (NaN where the title is missing)
data_set['Word Count in Title'] = data_set['Title'].str.split().str.len()
data_set['Word Count in Title'].unique()

array([nan,  4.,  3.,  2.,  5.,  1.,  8.,  6., 10.,  7., 11.,  9., 12.])
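
The `nan` entries above come from missing titles: the `.str` methods propagate missing values instead of counting them as zero words. The sketch below, on a small made-up frame (`toy` is a hypothetical stand-in), shows that behavior and one possible way to summarize title length per birth decade:

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    "Birth Decade": [1980, 1980, 1960, 1970],
    "Title": ["Love it!", None, "My favorite buy!", "Flattering shirt"],
})

# .str methods propagate missing values, so a missing title yields NaN rather than 0
toy["Word Count in Title"] = toy["Title"].str.split().str.len()

# mean() skips NaN by default, so missing titles do not pull the average down
mean_by_decade = toy.groupby("Birth Decade")["Word Count in Title"].mean()
print(mean_by_decade)
```

Whether a missing title should count as zero words or be left out is a modeling choice; leaving it as NaN keeps the two cases distinguishable.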

In [81]:
# violin plot (kernel density estimate) of birth year for each title word count
px.violin(data_set, x="Birth Year", y="Word Count in Title", orientation="h")

In [None]:
# scatter plot for birth decade and division + pearson correlation coefficient 


In [None]:
# scatter plot for birth decade and department + pearson correlation coefficient 
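
Because Division Name and Department Name are categorical, the Pearson coefficient does not apply to them directly without first encoding the categories as numbers. One alternative, sketched here on made-up rows (`toy` is a hypothetical stand-in, and this is not necessarily the analysis planned for the cells above), is a contingency table showing each department's share within a birth decade:

```python
import pandas as pd

# Toy rows standing in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    "Birth Decade": [1980, 1980, 1950, 1960, 1970, 1980],
    "Department Name": ["Intimate", "Dresses", "Dresses", "Bottoms", "Tops", "Tops"],
})

# normalize="index" turns counts into row proportions, so each decade's row sums to 1
dept_share = pd.crosstab(toy["Birth Decade"], toy["Department Name"], normalize="index")
print(dept_share.round(2))
```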

## Preprocessing

Now, I want to clean up this data set into a form that the models I'll train can easily consume. Here are the steps:

1) Handle missing values

2) Split data into train and test sets. 
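
The two steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the final pipeline. Dropping rows without review text, filling missing titles with an empty string, the 70/30 split, and the stratification by birth decade are all assumptions. Stratifying requires at least two rows per decade, so very rare decades in the real data might need to be merged or dropped first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows standing in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    "Birth Decade": [1980, 1980, 1980, 1980, 1970, 1970, 1970, 1970],
    "Title": ["Love it!", None, "Great", "So chic!", None, "Nice", "Comfy", "Cute"],
    "Review Text": ["Fits well", "Runs small", None, "Very soft", "Too long",
                    "Good color", "Itchy fabric", "Perfect length"],
})

# Step 1 (assumption): drop rows with no review text, treat a missing title as empty
cleaned = toy.dropna(subset=["Review Text"]).copy()
cleaned["Title"] = cleaned["Title"].fillna("")

# Step 2: hold out a test set, keeping decade proportions similar in both parts
train_df, test_df = train_test_split(
    cleaned, test_size=0.3, random_state=0, stratify=cleaned["Birth Decade"]
)
print(len(train_df), len(test_df))
```

A validation set could be carved out the same way with a second call to `train_test_split` on `train_df`.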

In [78]:
# look at the data set once again
data_set.sample(3)

| | Age | Birth Year | Birth Decade | Title | Review Text | Division Name | Department Name | Class Name | Word Count | Word List | Word Count in Review | Word Count in Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11312 | 59 | 1959 | 1950 | Love the shape! | This is an awesome shirt! gorgeous quality and... | General | Tops | Knits | 49.0 | [This, is, an, awesome, shirt!, gorgeous, qual... | 49.0 | 3.0 |
| 23201 | 54 | 1964 | 1960 | So chic! | This is a wow! so chic...reminds of something ... | General Petite | Tops | Blouses | 48.0 | [This, is, a, wow!, so, chic...reminds, of, so... | 48.0 | 2.0 |
| 7200 | 42 | 1976 | 1970 | Beautiful, comfy, but odd skirt | I tried this dress on in the store, and nearly... | General | Dresses | Dresses | 97.0 | [I, tried, this, dress, on, in, the, store,, a... | 97.0 | 5.0 |


In [None]:
# remove numeric columns 
processed_df = data_set[['Birth Decade', 'Title', 'Review Text', 'Division Name', 'Department Name']]

In [None]:
# clean up data set 

In [None]:
# split data set into train, validation, and test sets 


## Training

## Prediction 

## Evaluation 

## Areas of Improvement and Future Explorations