# Lesson 06 Assignment

In this assignment, we want to read the `retail-churn.csv` dataset that we examined in a previous assignment and begin to pre-process it. The goal of the assignment is to become familiar with some common pre-processing and feature engineering steps by implementing them.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("../../data/retail-churn.csv", sep = ",", skiprows = 1, names = col_names)

### Some basic EDA
Present the first few rows and use pandas' `describe` to get an overview of the data.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[0 points]</span>

In [2]:
churn

| | user_id | gender | address | store_id | trans_id | timestamp | item_id | quantity | dollar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 101981 | F | E | 2860 | 818463 | 11/1/2000 0:00 | 4.710000e+12 | 1 | 37 |
| 1 | 101981 | F | E | 2861 | 818464 | 11/1/2000 0:00 | 4.710000e+12 | 1 | 17 |
| 2 | 101981 | F | E | 2862 | 818465 | 11/1/2000 0:00 | 4.710000e+12 | 1 | 23 |
| 3 | 101981 | F | E | 2863 | 818466 | 11/1/2000 0:00 | 4.710000e+12 | 1 | 41 |
| 4 | 101981 | F | E | 2864 | 818467 | 11/1/2000 0:00 | 4.710000e+12 | 8 | 288 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 252199 | 2179605 | B | G | 251838 | 1630692 | 2/28/2001 0:00 | 2.250000e+12 | 2 | 138 |
| 252200 | 2179605 | B | G | 251839 | 1630821 | 2/28/2001 0:00 | 4.710000e+12 | 1 | 96 |
| 252201 | 2179605 | B | G | 251840 | 1630931 | 2/28/2001 0:00 | 4.710000e+12 | 1 | 89 |
| 252202 | 2179605 | B | G | 251841 | 1631033 | 2/28/2001 0:00 | 4.710000e+12 | 1 | 108 |


In [3]:
churn.describe()

| | user_id | store_id | trans_id | item_id | quantity | dollar |
|---|---|---|---|---|---|---|
| count | 252204.0 | 252204.0 | 252204.0 | 252204.0 | 252204.0 | 252204.0 |
| mean | 1395660.0 | 126101.5 | 1229771.0 | 4467833000000.0 | 1.385692 | 130.911389 |
| std | 609476.9 | 72805.167983 | 235099.2 | 1679512000000.0 | 3.705732 | 388.142169 |
| min | 1113.0 | 0.0 | 817747.0 | 20008820.0 | 1.0 | 1.0 |
| 25% | 993715.0 | 63050.75 | 1025926.0 | 4710000000000.0 | 1.0 | 42.0 |
| 50% | 1586046.0 | 126101.5 | 1233476.0 | 4710000000000.0 | 1.0 | 76.0 |
| 75% | 1862232.0 | 189152.25 | 1433222.0 | 4710000000000.0 | 1.0 | 132.0 |
| max | 2179605.0 | 252203.0 | 1635482.0 | 9790000000000.0 | 1200.0 | 70589.0 |


The new data frame will be called `churn_processed`. It stores the pre-processed columns as you work through each of these steps. Make sure your columns are properly named.

1. Cast the `timestamp` column in `churn` into a column of type `datetime` and put the column into the `churn_processed` dataframe. The new column should also be named `timestamp`. Extract two new columns from `timestamp`: `dow` is the day of the week and `month` is the month of the year. Then drop the `timestamp` column from `churn_processed`. Present the first few rows of `churn_processed`.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 points]</span>
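
As a rough illustration of the operations named in this step, the sketch below casts text timestamps to `datetime`, extracts the day of the week and the month, and drops the original column. It runs on a small made-up frame (`toy` and `toy_processed` are hypothetical names), and the format string is an assumption based on the timestamps displayed earlier.

```python
import pandas as pd

# Toy frame (made-up rows) with timestamps in the same text format as the dataset
toy = pd.DataFrame({"timestamp": ["11/1/2000 0:00", "12/25/2000 0:00", "2/28/2001 0:00"]})

# Start from an empty frame and add the parsed column to it
toy_processed = pd.DataFrame()
toy_processed["timestamp"] = pd.to_datetime(toy["timestamp"], format="%m/%d/%Y %H:%M")

# Day of week as an integer (Monday=0 ... Sunday=6) and month as 1..12
toy_processed["dow"] = toy_processed["timestamp"].dt.dayofweek
toy_processed["month"] = toy_processed["timestamp"].dt.month

# The raw timestamp is no longer needed once the features are extracted
toy_processed = toy_processed.drop(columns="timestamp")
print(toy_processed.head())
```

Note that `dt.dayofweek` returns integers; `dt.day_name()` would give weekday names instead, if that reads better in later steps.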

In [None]:
# Add code here
# Create the new (empty) data frame, called churn_processed

# Cast timestamp to datetime

# Create a dow column

# Create a month column

# Drop timestamp

# See what we have


2. Add `address` from `churn` to `churn_processed`. One-hot encode `address`, `dow`, and `month`. Then drop columns `address`, `dow`, and `month` from `churn_processed`. Finally, show some of the dataframe.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 points]</span>
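
One possible way to one-hot encode several columns at once is `pandas.get_dummies`, sketched here on a made-up frame (`toy` and `toy_onehot` are hypothetical names). The cell comments in this notebook suggest building the column names and adding the encoded values manually, so treat this as a shortcut for comparison rather than the intended approach.

```python
import pandas as pd

# Toy frame (made-up rows) with the three categorical features
toy = pd.DataFrame({
    "address": ["E", "G", "E"],
    "dow": [2, 0, 2],
    "month": [11, 12, 2],
})

# Each listed column is replaced by one 0/1 indicator column per distinct value;
# the original columns are dropped automatically
toy_onehot = pd.get_dummies(toy, columns=["address", "dow", "month"], dtype=int)
print(toy_onehot.columns.tolist())
print(toy_onehot)
```

The indicator columns are named `<column>_<value>`, for example `address_E` or `dow_2`.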

In [None]:
# Add code here

# add address column

# One-hot-encode address, dow and month

# Create Column Names

# Add one-hot encoded values to new columns

# Drop address, dow and month

# Show the dataframe


3. So far we have dropped `address`, `dow`, `month`, and `timestamp`. Why would we want to drop all these columns?
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[1 point]</span> 

**Why would we want to drop all these columns?**<br/>
Add Comment here


4. Rescale `dollar` using min-max normalization. Use `pandas` and `numpy` to do it and call the rescaled column `dollar_std_minmax`. Then see what the first few rows of the dataframe look like.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[1 point]</span> 
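
Min-max normalization maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below applies it to the first five `dollar` values from the rows displayed earlier; the variable names are hypothetical, and it assumes the column is not constant (otherwise the denominator is zero).

```python
import numpy as np
import pandas as pd

# First five dollar values from the rows displayed earlier
dollar = pd.Series([37, 17, 23, 41, 288], name="dollar")

# Compute the range once, then rescale every value into [0, 1]
lo, hi = np.min(dollar), np.max(dollar)
dollar_minmax = (dollar - lo) / (hi - lo)
print(dollar_minmax)
```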

In [None]:
# Add code here

# Min-max of dollar using numpy and pandas

# See what the dataframe looks like


You can read about **robust normalization** [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html). The word **robust** in statistics generally refers to methods that behave reasonably even when the data are unusual. For example, the median is a *robust* measure of the "average" of the data, while the mean is not. A normalization resembles an average in this respect: a **robust** normalization is one that is not strongly affected by outliers.
<br/><br/>
5. Write briefly about what makes robust normalization different from Z-normalization. Write briefly about what makes robust normalization more robust than Z-normalization. Rescale `quantity` using robust normalization. Call the rescaled column `qty_std_robust` and add it to `churn_processed`. Compare the minimum, maximum, mean, standard deviation, and median of the original `churn['quantity']` with those of the robust-normalized `churn_processed['qty_std_robust']`. Comment on what went wrong.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[3 points]</span>
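
For intuition, the sketch below contrasts the two rescalings on a made-up series with a single outlier. It implements robust scaling by hand as `(x - median) / IQR`, which matches the default behavior described in the linked `RobustScaler` documentation. The variable names are hypothetical, and the sketch assumes the interquartile range is nonzero.

```python
import pandas as pd

# Made-up values: four ordinary observations and one outlier
x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Robust scaling: center on the median, scale by the interquartile range
median = x.median()
iqr = x.quantile(0.75) - x.quantile(0.25)  # assumed to be nonzero here
x_robust = (x - median) / iqr

# Z-normalization: center on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std()

print(pd.DataFrame({"x": x, "robust": x_robust, "z": z}))
```

In this toy case the outlier inflates the mean and standard deviation, so the four ordinary values end up squeezed into a narrow band of Z-scores, while the median and IQR are computed from the bulk of the data.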

**Robust normalization vs. Z-normalization**<br/>
Add Comment here:<br/>

**Robust normalization is more robust than Z-normalization**<br/>
Add Comment here:<br/>
  

In [None]:
# Add Code here:


In [None]:
# Compare minimum, maximum, mean, standard deviation, and the median of churn['quantity'] with churn_processed['qty_std_robust']
# Add code here


**Failure of Robust Normalization**<br/>
Add Comment here:<br/>


6. Rescale `quantity` using Z-normalization, but normalize `quantity` **per user**, i.e. group by `user_id` so that the mean and standard deviation used for normalization are computed separately for each `user_id`. Call the rescaled feature `qty_std_Z_byuser`. Present a histogram of `qty_std_Z_byuser`. Briefly describe why and when you think this kind of normalization makes sense.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[3 points]</span>
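
Group-wise normalization can be sketched with `groupby(...).transform(...)`, which returns per-group statistics aligned to the original rows. The frame and column names below are made up for illustration.

```python
import pandas as pd

# Toy frame (made-up rows): two users with very different purchase sizes
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "quantity": [1, 2, 3, 10, 30],
})

# transform() broadcasts each group's statistic back onto that group's rows
grouped = toy.groupby("user_id")["quantity"]
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")  # sample standard deviation (ddof=1)

toy["qty_z_byuser"] = (toy["quantity"] - group_mean) / group_std
print(toy)
```

A user with a single row has an undefined sample standard deviation, and a user whose quantities are all equal has a standard deviation of zero, so both cases produce `NaN` here and may need separate handling. A histogram of the result can be drawn with `toy["qty_z_byuser"].hist()`.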

In [None]:
# Z-normalize quantity per user_id
# Add code here


In [None]:
# Present histogram of qty_std_Z_byuser
# Add code here


**What could be the purpose of this normalization?**
<br/>
Add comment here:


7. Convert `item_id` into a category column in `churn_processed`. Replace the `item_id` of all the items sold only once in the entire data with `"999999"`. How many item IDs are of category `"999999"`? Display 10 rows of `churn_processed` where `item_id` is category `"999999"`.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 points]</span>
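
The sketch below shows one way to lump rare levels of a categorical column into a catch-all label, using made-up item IDs. It assumes "sold only once" means the ID appears in exactly one row; if it should instead mean a total `quantity` of 1, the counting step would differ. The names `toy`, `counts`, `rare`, and `n_rare` are hypothetical.

```python
import pandas as pd

# Toy frame (made-up IDs): "A" appears three times, the others once each
toy = pd.DataFrame({"item_id": ["A", "A", "B", "C", "A", "D"]})
toy["item_id"] = toy["item_id"].astype("category")

# A new label must be registered as a category before it can be assigned
toy["item_id"] = toy["item_id"].cat.add_categories("999999")

# Find the IDs that occur in exactly one row and overwrite them
counts = toy["item_id"].value_counts()
rare = counts[counts == 1].index
toy.loc[toy["item_id"].isin(rare), "item_id"] = "999999"

# Count the rows that now carry the catch-all label
n_rare = (toy["item_id"] == "999999").sum()
print(n_rare)
print(toy[toy["item_id"] == "999999"].head(10))
```

The replaced IDs remain listed as (now unused) categories; `cat.remove_unused_categories()` drops them if that matters downstream.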

In [None]:
# Convert item_id into a category column in churn_processed
# Add code here


In [None]:
# Add code here

# Add Category

# Replace the item_id of all the items sold only once in the entire data with "999999"

# How many item ids are of category "999999"

# Display 10 rows of churn_processed where item_id is category "999999"


# End of assignment