# Transforming Data into Features

I am a data scientist at a clothing company, working with a dataset of customer reviews. The dataset is originally from [Kaggle](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) and lends itself to a range of machine learning tasks. I am tasked with transforming some of its features to make the data more useful for analysis. To do this, I will work through the following:
* Transforming categorical data
* Scaling the data
* Working with date-time features

In [61]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [67]:
# Import data
reviews = pd.read_csv("data/Womens Clothing E-Commerce Reviews.csv")

# Drop the "Unnamed: 0" column from the dataframe
reviews.drop("Unnamed: 0", axis=1, inplace=True)
reviews.head(2)

# Print the original column names
reviews.columns

# Rename the columns to consistent snake_case names
reviews.columns = ["clothing_id", "age", "title", "review_text", "rating",
                   "recommended", "positive_feedback", "division_name",
                   "department_name", "class_name"]

# Print a summary of dtypes and non-null counts
reviews.info()

# Look at the counts of recommended
print(reviews["recommended"].value_counts())

# Look at the counts of rating
reviews["rating"].value_counts()

# Look at the counts of department_name
reviews["department_name"].value_counts()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   clothing_id        23486 non-null  int64 
 1   age                23486 non-null  int64 
 2   title              19676 non-null  object
 3   review_text        22641 non-null  object
 4   rating             23486 non-null  int64 
 5   recommended        23486 non-null  int64 
 6   positive_feedback  23486 non-null  int64 
 7   division_name      23472 non-null  object
 8   department_name    23472 non-null  object
 9   class_name         23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.8+ MB
1    19314
0     4172
Name: recommended, dtype: int64


Tops        10468
Dresses      6319
Bottoms      3799
Intimate     1735
Jackets      1032
Trend         119
Name: department_name, dtype: int64
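
The department counts above add up to 23472, which matches the non-null count that `info()` reports for `department_name`, so the 14 missing rows are not tallied. As a minimal sketch on a made-up series (the values are illustrative, not taken from the reviews file), `value_counts(dropna=False)` makes such gaps visible:

```python
import numpy as np
import pandas as pd

# Toy stand-in for reviews["department_name"]; values are illustrative only
departments = pd.Series(["Tops", "Dresses", "Tops", np.nan, "Jackets"])

# Default behaviour: missing values are left out of the tally
counts = departments.value_counts()

# dropna=False adds an entry for NaN so the gaps show up in the counts
counts_with_missing = departments.value_counts(dropna=False)

print(counts)
print(counts_with_missing)
```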

In [68]:
# Perform One-hot encoding on the department_name column

one_hot = pd.get_dummies(reviews["department_name"])

# Join the result back to our original data frame
reviews = reviews.join(one_hot)
reviews.columns

Index(['clothing_id', 'age', 'title', 'review_text', 'rating', 'recommended',
       'positive_feedback', 'division_name', 'department_name', 'class_name',
       'Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
      dtype='object')
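
Two details of the encoding step are easy to miss, so here is a small sketch on a made-up frame (the `toy` name and its values are assumptions for illustration). Newer pandas versions return boolean indicator columns by default, so `dtype=int` is passed to get 0/1 values; and a row whose category is missing gets 0 in every indicator column, because `get_dummies` does not create a NaN column unless asked to.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reviews frame, including one missing department
toy = pd.DataFrame({"department_name": ["Tops", "Dresses", np.nan, "Tops"]})

# dtype=int forces 0/1 columns regardless of the pandas default
one_hot = pd.get_dummies(toy["department_name"], dtype=int)

# Join the indicator columns back on the shared index
encoded = toy.join(one_hot)

# The row with the missing department has 0 in both indicator columns
print(encoded)
```

If a missing department should be its own category, `pd.get_dummies(..., dummy_na=True)` adds an indicator column for it.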

In [69]:
reviews.head()

| | clothing_id | age | title | review_text | rating | recommended | positive_feedback | division_name | department_name | class_name | Bottoms | Dresses | Intimate | Jackets | Tops | Trend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses | 0 | 0 | 0 | 0 | 1 | 0 |


In [74]:
# Summary statistics for the one-hot encoded Jackets column


reviews["Jackets"].describe()

count    23486.000000
mean         0.043941
std          0.204968
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: Jackets, dtype: float64
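
Scaling is on the task list at the top, and `StandardScaler` is imported but not yet used in this section. The following is only a sketch of how it could be applied, assuming the numeric columns `age` and `positive_feedback` are the ones to scale. The `toy` frame is a hypothetical stand-in that reuses the five rows shown by `reviews.head()`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in built from the five rows displayed by reviews.head()
toy = pd.DataFrame({
    "age": [33, 34, 60, 50, 47],
    "positive_feedback": [0, 4, 0, 0, 6],
})

# fit_transform learns each column's mean and standard deviation,
# then returns (value - mean) / std as a NumPy array
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# Each scaled column now has mean 0 and population standard deviation 1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

`StandardScaler` divides by the population standard deviation, which is why the check uses `ddof=0` rather than the pandas default of `ddof=1`.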

# Filter Methods

In [75]:
df = pd.DataFrame(data={
    'edu_goal': ['bachelors', 'bachelors', 'bachelors', 'masters', 'masters', 'masters', 'masters', 'phd', 'phd', 'phd'],
    'hours_study': [1, 2, 3, 3, 3, 4, 3, 4, 5, 5],
    'hours_TV': [4, 3, 4, 3, 2, 3, 2, 2, 1, 1],
    'hours_sleep': [10, 10, 8, 8, 6, 6, 8, 8, 10, 10],
    'height_cm': [155, 151, 160, 160, 156, 150, 164, 151, 158, 152],
    'grade_level': [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
    'exam_score': [71, 72, 78, 79, 85, 86, 92, 93, 99, 100]
})
 

In [87]:
df

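| | edu_goal | hours_study | hours_TV | hours_sleep | height_cm | grade_level | exam_score |
|---|---|---|---|---|---|---|---|
| 0 | bachelors | 1 | 4 | 10 | 155 | 8 | 71 |
| 1 | bachelors | 2 | 3 | 10 | 151 | 8 | 72 |
| 2 | bachelors | 3 | 4 | 8 | 160 | 8 | 78 |
| 3 | masters | 3 | 3 | 8 | 160 | 8 | 79 |
| 4 | masters | 3 | 2 | 6 | 156 | 8 | 85 |
| 5 | masters | 4 | 3 | 6 | 150 | 8 | 86 |
| 6 | masters | 3 | 2 | 8 | 164 | 8 | 92 |
| 7 | phd | 4 | 2 | 8 | 151 | 8 | 93 |
| 8 | phd | 5 | 1 | 10 | 158 | 8 | 99 |
| 9 | phd | 5 | 1 | 10 | 152 | 8 | 100 |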


Our goal is to use the data to predict how well each student will perform on the exam. Thus, our target variable is exam_score and the remaining 6 variables are our features. We’ll prepare the data by separating the features matrix (X) and the target vector (y):

In [93]:
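# Separate the features (X) from the target (y) by column name
X = df.drop(columns=["exam_score"])
y = df["exam_score"]

# Equivalent positional split, since exam_score is the last column
# X = df.iloc[:, :-1]
# y = df.iloc[:, -1]

y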


0     71
1     72
2     78
3     79
4     85
5     86
6     92
7     93
8     99
9    100
Name: exam_score, dtype: int64
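
The split into features and target described above can be done either by column name or by position. A minimal, self-contained sketch that rebuilds the same `df` and checks that both approaches agree (the `X_pos` and `y_pos` names are introduced here only for the comparison):

```python
import pandas as pd

df = pd.DataFrame(data={
    'edu_goal': ['bachelors', 'bachelors', 'bachelors', 'masters', 'masters',
                 'masters', 'masters', 'phd', 'phd', 'phd'],
    'hours_study': [1, 2, 3, 3, 3, 4, 3, 4, 5, 5],
    'hours_TV': [4, 3, 4, 3, 2, 3, 2, 2, 1, 1],
    'hours_sleep': [10, 10, 8, 8, 6, 6, 8, 8, 10, 10],
    'height_cm': [155, 151, 160, 160, 156, 150, 164, 151, 158, 152],
    'grade_level': [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
    'exam_score': [71, 72, 78, 79, 85, 86, 92, 93, 99, 100]
})

# Split by name: drop the target column to get the features
X = df.drop(columns=["exam_score"])
y = df["exam_score"]

# Split by position: everything but the last column, then the last column
X_pos = df.iloc[:, :-1]
y_pos = df.iloc[:, -1]

print(X.shape, y.shape)                  # (10, 6) (10,)
print(X.equals(X_pos), y.equals(y_pos))  # True True
```

Splitting by name keeps working if the column order changes, whereas the positional version relies on `exam_score` staying in the last column.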