### Transforming Data into Features

You are a data scientist at a clothing company working with a dataset of customer reviews. The dataset originally comes from Kaggle and has a lot of potential for various machine learning purposes. Your task is to transform some of its features to make the data more useful for analysis. Along the way, you will practice the following:

* Transforming categorical data
* Scaling your data
* Working with date-time features

Let’s get started with some basic exploration.

First, import the libraries and the dataset.

In [2]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

#import data
reviews = pd.read_csv("reviews.csv")

Next, we want to look at the column names of our dataset along with their data types.

In [3]:
#print column names
print(reviews.columns)
 
#print .info
print(reviews.info())

Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   clothing_id      5000 non-null   int64 
 1   age              5000 non-null   int64 
 2   review_title     4174 non-null   object
 3   review_text      4804 non-null   object
 4   recommended      5000 non-null   bool  
 5   division_name    4996 non-null   object
 6   department_name  4996 non-null   object
 7   review_date      5000 non-null   object
 8   rating           5000 non-null   object
dtypes: bool(1), int64(2), object(6)
memory usage: 317.5+ KB
None


#### Data Transformations

Transform the recommended feature. 

In [4]:
#look at the counts of recommended
print(reviews.recommended.value_counts())

True     4166
False     834
Name: recommended, dtype: int64


Since this is a True/False feature, we want to transform it to 1 for True and 0 for False.

In [5]:
#create binary dictionary
binary_dict = {True:1, False:0}
 
#transform column
reviews.recommended = reviews.recommended.map(binary_dict)
 
#print your transformed column
print(reviews.recommended.value_counts())

1    4166
0     834
Name: recommended, dtype: int64
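As a side note, a boolean column can also be converted by casting its dtype. The sketch below uses a small made-up Series (not the real `reviews` data) to show that the dictionary mapping and `astype(int)` should give the same result for a column that is already boolean.

```python
import pandas as pd

# Hypothetical stand-in for the recommended column
recommended = pd.Series([True, False, True, True], name="recommended")

# Option 1: map with a dictionary, as in the cell above
mapped = recommended.map({True: 1, False: 0})

# Option 2: cast the boolean dtype directly to integers
cast = recommended.astype(int)

print(mapped.tolist())
print(cast.tolist())
```

The dictionary approach is more explicit about which value becomes which number, while the cast is shorter when the column is already of type `bool`.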


Let’s run through a similar process to transform the rating feature. This is ordinal data, so the transformation should preserve the order of the categories. Assign numerical values to the ratings, from 1 for the lowest to 5 for the highest.

In [6]:
#look at the counts of rating
print(reviews.rating.value_counts())
 
#create dictionary
rating_dict = {'Loved it': 5, 'Liked it': 4, 'Was okay': 3, 'Not great': 2, 'Hated it': 1}
 
#transform rating column
reviews.rating = reviews.rating.map(rating_dict)
 
#print your transformed column values
print(reviews.rating.value_counts())

Loved it     2798
Liked it     1141
Was okay      564
Not great     304
Hated it      193
Name: rating, dtype: int64
5    2798
4    1141
3     564
2     304
1     193
Name: rating, dtype: int64
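One thing to keep in mind with `.map()` is that any label missing from the dictionary becomes `NaN` rather than raising an error. The following is a minimal sketch on made-up ratings (the misspelled label `'loved it'` is a hypothetical example, not something observed in the dataset) showing one way to check for unmapped values.

```python
import pandas as pd

# Hypothetical ratings, including one label with inconsistent capitalization
ratings = pd.Series(["Loved it", "Was okay", "loved it", "Hated it"])

rating_dict = {'Loved it': 5, 'Liked it': 4, 'Was okay': 3, 'Not great': 2, 'Hated it': 1}

# Labels that are not keys in the dictionary are mapped to NaN
encoded = ratings.map(rating_dict)

# Collect the original labels that failed to map
unmapped = ratings[encoded.isna()].unique().tolist()
print(unmapped)
```

In the dataset above, the value counts before and after the mapping add up to the same total, which suggests every label was mapped.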


Let’s now transform the department_name feature. Because department names have no natural order, we will encode them using *one-hot encoding*.

In [7]:
#get the number of categories in a feature
print(reviews.department_name.value_counts())
 
#perform get_dummies
one_hot = pd.get_dummies(reviews.department_name)

#join the new columns back onto the original
reviews = reviews.join(one_hot)

#print column names
print(reviews.columns)

Tops        2196
Dresses     1322
Bottoms      848
Intimate     378
Jackets      224
Trend         28
Name: department_name, dtype: int64
Index(['clothing_id', 'age', 'review_title', 'review_text', 'recommended',
       'division_name', 'department_name', 'review_date', 'rating', 'Bottoms',
       'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
      dtype='object')
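Two details of `pd.get_dummies` are worth knowing here. Recent pandas versions return boolean columns by default, so passing `dtype=int` is one way to get 0/1 integers. Rows with a missing category get zeros in every dummy column. The sketch below illustrates both on a small made-up Series, not the real data.

```python
import pandas as pd

# Hypothetical department names with one missing value
departments = pd.Series(["Tops", "Dresses", None, "Tops"], name="department_name")

# dtype=int requests 0/1 integer columns instead of booleans
one_hot = pd.get_dummies(departments, dtype=int)

print(one_hot)

# Each row sums to 1, except the row with the missing department
print(one_hot.sum(axis=1).tolist())
```

This matters for our dataset because department_name has a few missing values according to the `.info()` output.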


Let’s make one more feature transformation! Convert the review_date feature to a date-time feature.

In [8]:
#transform review_date to date-time data
reviews.review_date = pd.to_datetime(reviews.review_date)

#print review_date data type 
print(reviews.review_date.dtypes)

datetime64[ns]
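Once a column has a date-time dtype, its components can be pulled out with the `.dt` accessor. The following is a minimal sketch using made-up dates; the new column names (`review_year`, `review_month`, `review_dayofweek`) are hypothetical and not part of the original dataset.

```python
import pandas as pd

# Hypothetical review dates, parsed the same way as review_date
dates = pd.to_datetime(pd.Series(["2019-02-08", "2019-03-24", "2019-06-20"]))

# Derive simple numeric features from the date-time values
features = pd.DataFrame({
    "review_year": dates.dt.year,
    "review_month": dates.dt.month,
    "review_dayofweek": dates.dt.dayofweek,  # Monday=0, Sunday=6
})

print(features)
```

Features like these could be useful if, for example, ratings vary by season or by day of the week.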


#### Scaling the Data

The final step in our transformation project is scaling our data. Our numerical features span very different ranges (for example, age versus the 0/1 encoded columns), so it is best to put everything on the same scale.
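To make the idea concrete, standardization subtracts each feature's mean and divides by its standard deviation. The sketch below does this by hand with NumPy on a few made-up ages, using the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical ages used only for illustration
ages = np.array([39.0, 28.0, 37.0, 36.0, 32.0])

# Standardize: subtract the mean, divide by the population standard deviation
# (np.std defaults to ddof=0)
scaled = (ages - ages.mean()) / ages.std()

print(scaled)
print(scaled.mean(), scaled.std())
```

After this transformation the feature has a mean of approximately 0 and a standard deviation of approximately 1.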

Let’s get our data frame to only have the numerical features we created.

In [9]:
#get only numerical columns from our existing dataframe
reviews = reviews[['clothing_id', 'age', 'recommended', 'rating', 'Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']].copy()
 
#set clothing_id as the index so it is not scaled as a feature
reviews = reviews.set_index('clothing_id')
print(reviews.head())
#instantiate standard scaler
scaler = StandardScaler()
 
#fit transform data
reviews_scaled = scaler.fit_transform(reviews)
print(reviews_scaled)

             age  recommended  rating  Bottoms  Dresses  Intimate  Jackets  \
clothing_id                                                                  
1095          39            1       4        0        1         0        0   
1095          28            1       5        0        1         0        0   
699           37            1       5        0        0         1        0   
1072          36            1       5        0        1         0        0   
1094          32            1       5        0        1         0        0   

             Tops  Trend  
clothing_id               
1095            0      0  
1095            0      0  
699             0      0  
1072            0      0  
1094            0      0  
[[-0.34814459  0.44742824 -0.1896478  ... -0.21656679 -0.88496718
  -0.07504356]
 [-1.24475223  0.44742824  0.71602461 ... -0.21656679 -0.88496718
  -0.07504356]
 [-0.51116416  0.44742824  0.71602461 ... -0.21656679 -0.88496718
  -0.07504356]
 ...
 [-0.59267395  0