In [99]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [100]:
reviews = pd.read_csv(r"C:\Users\Carlos\Documents\Python Projects\Women's E-Commerce Clothing\reviews.csv")

In [101]:
reviews

Unnamed: 0,clothing_id,age,review_title,review_text,recommended,division_name,department_name,review_date,rating
0,1095,39,"Cute,looks like a dress on",If you are afraid of the jumpsuit trend but li...,True,General,Dresses,2019-07-08,Liked it
1,1095,28,"So cute, great print!",I love fitted top dresses like this but i find...,True,General,Dresses,2019-05-17,Loved it
2,699,37,So flattering!,"I love these cozy, fashionable leggings. they ...",True,Initmates,Intimate,2019-06-24,Loved it
3,1072,36,Effortless,"Another reviewer said it best, ""i love the way...",True,General Petite,Dresses,2019-12-06,Loved it
4,1094,32,You need this!,Rompers are my fav so i'm biased writing this ...,True,General,Dresses,2019-10-04,Loved it
...,...,...,...,...,...,...,...,...,...
4995,918,38,Unique sweater,I tried it in the store but was not the true s...,True,General Petite,Tops,2019-05-26,Loved it
4996,950,33,The brown/gray version is cropped,"The photos don't look it, but the one i receiv...",False,General,Tops,2019-10-21,Hated it
4997,1086,36,,"Simple, classic, figure flattering, and reason...",True,General Petite,Dresses,2019-10-18,Loved it
4998,1033,28,"If you have a big booty, get these jeans now",I have a tough time shopping for jeans because...,True,General Petite,Bottoms,2019-11-24,Loved it


In [102]:
reviews.columns

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating'],
      dtype='object')

In [103]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   clothing_id      5000 non-null   int64 
 1   age              5000 non-null   int64 
 2   review_title     4174 non-null   object
 3   review_text      4804 non-null   object
 4   recommended      5000 non-null   bool  
 5   division_name    4996 non-null   object
 6   department_name  4996 non-null   object
 7   review_date      5000 non-null   object
 8   rating           5000 non-null   object
dtypes: bool(1), int64(2), object(6)
memory usage: 317.5+ KB


Transform the recommended feature. Start by printing the feature’s .value_counts().




In [104]:
reviews['recommended'].value_counts()

True     4166
False     834
Name: recommended, dtype: int64

Since this is a True/False feature, we want to transform it to 1 for True and 0 for False.

To do this, create a dictionary called binary_dict where:

The keys are what is currently in the recommended feature.
The values are what we want in the new column (0s and 1s).


In [105]:
binary_dict = {True:1, False:0}

Using binary_dict, transform the recommended column so that it will now be binary. Print the results using .value_counts() to confirm the transformation.




In [106]:
reviews['recommended'] = reviews['recommended'].map(binary_dict)

In [107]:
reviews['recommended'].value_counts()

1    4166
0     834
Name: recommended, dtype: int64
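
As a small self-contained check of this mapping step, the sketch below applies the same dictionary to a made-up boolean Series (the toy values are an assumption, not rows from reviews.csv). It also compares the result with .astype(int), which should give the same values for a boolean column.

```python
import pandas as pd

# Toy boolean column standing in for reviews['recommended'] (made-up values).
toy = pd.Series([True, False, True, True], name="recommended")

binary_dict = {True: 1, False: 0}
mapped = toy.map(binary_dict)

# For a boolean column, casting to int is an equivalent shortcut.
cast = toy.astype(int)

print(mapped.tolist())
print(bool((mapped == cast).all()))
```

The dictionary version makes the encoding explicit, which helps when the original labels are strings rather than booleans.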

Let’s run through a similar process to transform the rating feature. Because rating is ordinal, the numeric encoding should preserve the order of the categories. Again, start by printing the .value_counts().



In [108]:
reviews['rating'].value_counts()

Loved it     2798
Liked it     1141
Was okay      564
Not great     304
Hated it      193
Name: rating, dtype: int64

Create a dictionary called rating_dict where the keys are what is currently in the feature and the values are what we want in the new column. You can use the hierarchy listed above to make your dictionary.




In [109]:
rating_dict = {'Loved it':5, 'Liked it':4, 'Was okay':3, 'Not great':2, 'Hated it':1}

Using rating_dict, transform the rating column so it contains numerical values. Print the results using .value_counts() to confirm the transformation.




In [110]:
reviews['rating'] = reviews['rating'].map(rating_dict)

In [111]:
reviews['rating'].value_counts()

5    2798
4    1141
3     564
2     304
1     193
Name: rating, dtype: int64
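
Series.map returns NaN for any value that has no matching dictionary key, and .value_counts() drops NaN by default, so a misspelled key shows up only as a missing category. The sketch below is one way to check for that, using made-up ratings and a deliberately incomplete dictionary (both are assumptions for illustration).

```python
import pandas as pd

# Made-up ratings; 'Liked it' is deliberately missing from the dictionary below.
toy = pd.Series(["Loved it", "Liked it", "Hated it", "Liked it"])
incomplete_dict = {"Loved it": 5, "Hated it": 1}

mapped = toy.map(incomplete_dict)

# Labels present in the data that have no key in the dictionary.
unmapped = sorted(set(toy.dropna()) - set(incomplete_dict))
print(unmapped)

# Number of rows that became NaN because of the missing key.
print(int(mapped.isna().sum()))
```

Running a check like this right after .map() catches key typos before they turn into silent missing values.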

Let’s now transform the department_name feature. This process will be slightly different, but start by printing the .value_counts() of the feature.

Use pandas’ get_dummies to one-hot encode the feature.
Attach the results back to the original data frame.
Print the column names to confirm the new columns.



In [112]:
reviews['department_name'].value_counts()

Tops        2196
Dresses     1322
Bottoms      848
Intimate     378
Jackets      224
Trend         28
Name: department_name, dtype: int64

In [113]:
one_hot = pd.get_dummies(reviews['department_name'])

In [114]:
reviews = reviews.join(one_hot)

In [115]:
reviews.columns

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating', 'Bottoms',
       'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
      dtype='object')
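
To see what get_dummies and join do on a small scale, here is a minimal sketch with a made-up data frame (the toy rows are an assumption). It includes one missing department to show that, with the default settings, such a row gets no indicator set.

```python
import pandas as pd

# Toy frame with one missing department (made-up values).
toy = pd.DataFrame({"department_name": ["Tops", "Dresses", "Tops", None]})

# One indicator column per distinct, non-missing category.
one_hot = pd.get_dummies(toy["department_name"])

# join aligns on the index, so each row keeps its own indicators.
joined = toy.join(one_hot)

print(joined.columns.tolist())
# The row with a missing department has no indicator set.
print(int(one_hot.iloc[3].sum()))
```

Depending on the pandas version, the indicator columns may come back as uint8 or bool. Either way they work as 0/1 flags.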

Let’s make one more feature transformation: the review_date feature.

This feature is stored as an object type, but we want it to be a date-time feature.

Transform review_date into a date-time feature.
Print the feature type to confirm the transformation.


In [116]:
reviews['review_date'].dtype

dtype('O')

In [117]:
reviews['review_date'] = pd.to_datetime(reviews['review_date'])

In [118]:
reviews['review_date'].dtype

dtype('<M8[ns]')
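
Once a column is a date-time type, the .dt accessor exposes its components. The sketch below uses three made-up date strings (an assumption, not the full column) to show the conversion and two derived values.

```python
import pandas as pd

# Toy date strings in the same YYYY-MM-DD format (made-up sample).
toy = pd.Series(["2019-07-08", "2019-05-17", "2019-12-06"], name="review_date")

dates = pd.to_datetime(toy)

# Components are available through the .dt accessor after conversion.
print(dates.dtype)
print(dates.dt.month.tolist())
print(dates.dt.day_name().tolist())
```

Derived columns such as month or weekday are numeric or categorical, so they can feed into later encoding and scaling steps.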

In [119]:
reviews.dtypes

clothing_id                 int64
age                         int64
review_title               object
review_text                object
recommended                 int64
division_name              object
department_name            object
review_date        datetime64[ns]
rating                      int64
Bottoms                     uint8
Dresses                     uint8
Intimate                    uint8
Jackets                     uint8
Tops                        uint8
Trend                       uint8
dtype: object

In [120]:
reviews.columns

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating', 'Bottoms',
       'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
      dtype='object')

The final step in our transformation project is scaling our data. The features sit on very different ranges (clothing_id values exceed 1,000, while the one-hot columns are only 0 or 1), so it is best to put everything on the same scale.

First, reduce the data frame to only the numerical features we created.

In [121]:
reviews = reviews[['clothing_id', 'age', 'recommended',
       'rating', 'Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']].copy()
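
Listing the columns by hand works well here. As an alternative sketch, select_dtypes can pick numeric columns automatically. The toy frame below is an assumption for illustration, and note that include="number" leaves out bool columns, which matters if get_dummies returns bool indicators in your pandas version.

```python
import pandas as pd

# Toy frame mixing numeric, text, and date-time columns (made-up values).
toy = pd.DataFrame({
    "age": [39, 28],
    "review_title": ["Cute", "So flattering!"],
    "recommended": [1, 0],
    "review_date": pd.to_datetime(["2019-07-08", "2019-05-17"]),
})

# Keep only integer and float columns; text and dates are excluded.
numeric = toy.select_dtypes(include="number").copy()

print(numeric.columns.tolist())
```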

Set the index to our clothing_id feature.




In [122]:
# Passing the Series (rather than the column name) keeps clothing_id as a column too
reviews = reviews.set_index(reviews['clothing_id'])

We are ready to scale our data! Perform a .fit_transform() on our data set, and print the results to see how the features have changed.




In [123]:
scaler = StandardScaler()

In [124]:
scaler.fit_transform(reviews)

array([[ 0.85669131, -0.34814459,  0.44742824, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [ 0.85669131, -1.24475223,  0.44742824, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-1.06545809, -0.51116416,  0.44742824, ..., -0.21656679,
        -0.88496718, -0.07504356],
       ...,
       [ 0.81300609, -0.59267395,  0.44742824, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [ 0.55574873, -1.24475223,  0.44742824, ..., -0.21656679,
        -0.88496718, -0.07504356],
       [-0.33251728,  1.68960003,  0.44742824, ..., -0.21656679,
         1.12998541, -0.07504356]])
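
StandardScaler subtracts each column’s mean and divides by its standard deviation, and fit_transform returns a plain NumPy array without column names. The sketch below, on a made-up two-column frame (an assumption for illustration), compares the scaler with a manual z-score and wraps the result back into a data frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric frame (made-up values).
toy = pd.DataFrame({"age": [20, 30, 40], "rating": [1, 3, 5]})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Manual z-score using the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Wrap the array back into a DataFrame to keep column names and index.
scaled_df = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled_df.round(3))
```

After scaling, each column has mean 0 and unit variance, which is why the one-hot columns in the output above no longer read as 0s and 1s.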