# Lesson 06 Assignment

In this assignment, we want to read the `retail-churn.csv` dataset that we examined in a previous assignment and begin to pre-process it. The goal of the assignment is to become familiar with some common pre-processing and feature engineering steps by implementing them.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("../../data/retail-churn.csv", sep = ",", skiprows = 1, names = col_names)

### Some basic EDA
Present the first few rows and use pandas' `describe` to get an overview of the data.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[0 point]</span>

In [None]:
churn.head()

In [None]:
churn.describe(include='all')

The new data frame will be called `churn_processed`, which stores the pre-processed columns as you run through each of these steps. You will need to make sure your columns are properly named.

1. Cast the `timestamp` column in churn into a column of type `datetime` and put the column into the `churn_processed` dataframe.  The new column should also be named `timestamp`.  Extract two new columns from `timestamp`: `dow` is the day of the week and `month` is the month of the year. Then drop the `timestamp` column from `churn_processed`.  Present the first few rows of `churn_processed`.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 point]</span>

In [None]:
# Add code here
# Create the new (empty) data frame, called churn_processed
churn_processed = pd.DataFrame()
# Cast timestamp to datetime
churn_processed['timestamp'] = pd.to_datetime(churn.timestamp)
# Create a dow column
churn_processed['dow'] = churn_processed.timestamp.dt.day_of_week
# Create a month column
churn_processed['month'] = churn_processed.timestamp.dt.month
# Drop Timestamp
churn_processed.drop('timestamp',axis=1,inplace=True)
# See what we have
churn_processed

2. Add `address` from `churn` to `churn_processed`. One-hot encode `address`, `dow` and `month`. Then drop columns `address`, `dow`, and `month` from `churn_processed`.  Finally, show some of the dataframe.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 point]</span>  

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Add code here
# add address column
churn_processed['address'] = churn.address
# One-hot-encode address, dow and month
one_hot = OneHotEncoder(sparse_output=False)
# Fit the encoder so it learns the categories (and output column names)
one_hot.fit(churn_processed)
# Add one-hot encoded values as new columns
churn_processed[one_hot.get_feature_names_out()] = one_hot.transform(churn_processed)
# Drop address, dow and month
churn_processed.drop(['dow','month','address'], axis=1, inplace=True)
# Show the dataframe
churn_processed

3. So far we dropped `address`, `dow`, `month`, and `timestamp`.  Why would we want to drop all these columns?
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[1 point]</span> 

**Why would we want to drop all these columns?**<br/>

> We seem to be generating a dataframe that will search for any correlating or predictive features within the categorical data from the data source by encoding it into binary switches. Maintaining the source data columns is unnecessary and could cause bugs or problems as we continue.
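
As a small illustration of the answer above, here is a minimal sketch on a made-up toy frame (the values are hypothetical and not taken from `retail-churn.csv`). It uses `pd.get_dummies`, which is a different route from the `OneHotEncoder` used earlier, only because it drops the source columns in the same call. After encoding, each original column is fully described by its indicator columns, so keeping it would add nothing.

```python
import pandas as pd

# Hypothetical toy data; these values are not taken from retail-churn.csv
toy = pd.DataFrame({"address": ["A", "B", "A", "C"], "dow": [0, 3, 3, 6]})

# get_dummies replaces the listed columns with one indicator column per level
encoded = pd.get_dummies(toy, columns=["address", "dow"], dtype=int)

print(encoded.columns.tolist())
# Every row has exactly one active indicator per encoded source column
print(encoded.sum(axis=1).tolist())
```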


4. Rescale `dollar` using min-max normalization. Use `pandas` and `numpy` to do it and call the rescaled column `dollar_std_minmax`.  Then see what the first few rows of the dataframe look like.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[1 point]</span> 

In [None]:
# Add code here
# Min-max of dollar using numpy and pandas
churn_processed['dollar_std_minmax'] = (churn.dollar - churn.dollar.min()) / (churn.dollar.max() - churn.dollar.min())
# See what the dataframe looks like
churn_processed.head()

You can read about **robust normalization** [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html). In statistics, the word **robust** generally refers to methods that behave reasonably even if the data is unusual.  For example, you can say that the median is a *robust* measure of the "average" of the data, while the mean is not.  In this respect a normalization is similar to an average: for a normalization, **robust** might mean that the method is not strongly affected by outliers.
<br/><br/>
5. 
  - Write briefly about what makes robust normalization different from Z-normalization.  
  - Write briefly about what makes robust normalization more robust than Z-normalization.  
  - Rescale `quantity` using robust normalization. Call the rescaled column `qty_std_robust` and add it to `churn_processed`.  
  - Compare minimum, maximum, mean, standard deviation, and the median of the original `churn['quantity']` with the robust-normalized `churn_processed['qty_std_robust']`.  
  - Comment on what went wrong.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[3 point]</span>

**Robust normalization vs. Z-normalization**<br/>

> Robust normalization centers the data on the median and scales it by the interquartile range, so out-sized outliers have little influence on how the data is rescaled. I like how scikit-learn calls this "zoomed in" in its example plot, since the method focuses on the majority of the data. Z-normalization centers on the mean and scales by the standard deviation, both of which are pulled by out-sized outliers, so the typical values end up squeezed together and the same plotting problem remains despite the new scale.

**Robust normalization is more robust than Z-normalization**<br/>

> I'm not familiar with how the word robust is being applied here, but I'll take a stab at explaining it without that word.

> Robust normalization focuses on the middle of the dataset, using the median and the interquartile range, which means that values above the 75th percentile or below the 25th percentile don't factor into how the data is normalized _as heavily_ as they do with Z-score normalization, where every value moves the mean and the standard deviation.
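
The difference can be seen in a minimal sketch on made-up numbers (hypothetical values, not from the churn data), computing both normalizations by hand with `numpy` rather than through scikit-learn. Note that the outlier is still present after robust scaling; it just no longer determines the center or the scale.

```python
import numpy as np

# Hypothetical values with one extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

# Z-normalization: center on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std()

# Robust normalization: center on the median, scale by the interquartile range
q1, med, q3 = np.percentile(x, [25, 50, 75])
r = (x - med) / (q3 - q1)

# Spread of the five typical values under each scaling
z_spread = z[:5].max() - z[:5].min()
r_spread = r[:5].max() - r[:5].min()
print(z_spread, r_spread)
```

Under Z-normalization the five typical values are squeezed into a very narrow band because the outlier inflates the standard deviation, while under robust scaling they keep a usable spread.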


In [None]:
from sklearn.preprocessing import robust_scale

In [None]:
# Add Code here:
churn_processed['qty_std_robust'] = robust_scale(churn.quantity)

In [None]:
# Compare minimum, maximum, mean, standard deviation, and the median of churn['quantity'] with churn_processed['qty_std_robust']
# Add code here
compare = pd.DataFrame([churn['quantity'].describe(), churn_processed['qty_std_robust'].describe()])
compare

**Failure of Robust Normalization**<br/>

> Robust normalization failed on the `quantity` data because the interquartile range is zero: so many values are equal to 1 that the 25th and 75th percentiles coincide. See the `describe` results above.
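
A minimal sketch of this failure mode, on made-up quantities rather than the real column, is below. The fallback to a scale of 1 is an assumption meant to mirror how scikit-learn appears to treat a zero scale; with that fallback, the "normalized" column is just the original shifted by the median.

```python
import numpy as np

# Hypothetical quantities: mostly 1, with a few larger purchases
qty = np.array([1, 1, 1, 1, 1, 1, 1, 1, 2, 50], dtype=float)

q1, med, q3 = np.percentile(qty, [25, 50, 75])
iqr = q3 - q1
print(q1, med, q3, iqr)

# Dividing by a zero IQR is undefined, so fall back to a scale of 1
# (assumption: this mirrors scikit-learn's handling of a zero scale)
scale = iqr if iqr > 0 else 1.0
scaled = (qty - med) / scale
print(scaled)
```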


6. Rescale `quantity` using Z-normalization, but normalize `quantity` **per user**, i.e. group by `user_id` so that the mean and standard deviation computed to normalize are computed separately by each `user_id`. Call the rescaled feature `qty_std_Z_byuser`. Present a histogram of `qty_std_Z_byuser`.  Briefly describe why and when you think this kind of normalization makes sense.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[3 point]</span>

In [None]:
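# Z-normalize quantity per user_id
# Add code here
# Users whose std is zero or undefined (e.g. a single purchase) are only centered, which gives 0
churn_processed['qty_std_Z_byuser'] = churn.groupby('user_id', group_keys=False)['quantity'].apply(lambda x: (x - x.mean()) / x.std() if x.std() > 0 else x - x.mean())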
churn_processed['qty_std_Z_byuser']

In [None]:
churn_processed['qty_std_Z_byuser'].describe()

In [None]:
# Present histogram of qty_std_Z_byuser
# Add code here
churn_processed['qty_std_Z_byuser'].plot.hist()

In [None]:
churn.quantity.plot()

**What could be the purpose of this normalization?**

> I'm unsure whether this is useful without dropping the quite extreme outlier of 1200; almost no other quantity comes even close.

> That said, this is a useful approach for standardizing between users: each value says how unusual a purchase is relative to that user's own buying habits, which makes quantities comparable between users who typically buy very different amounts.
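
To make the idea concrete, here is a minimal sketch with two made-up users (hypothetical data, and a hypothetical column name `qty_z_byuser`). It uses `transform` instead of `apply`, which is one of several ways to do this. Although the two users buy on very different scales, their normalized values line up.

```python
import pandas as pd

# Hypothetical purchases: user "a" buys small quantities, user "b" buys in bulk
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "b"],
    "quantity": [1, 2, 3, 100, 200, 300],
})

grouped = toy.groupby("user_id")["quantity"]
# transform broadcasts each group's mean and std back to the original rows
toy["qty_z_byuser"] = (toy["quantity"] - grouped.transform("mean")) / grouped.transform("std")

print(toy)
```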


7. Convert `item_id` into a category column in `churn_processed`.  Replace the `item_id` of all the items sold only once in the entire data with `"999999"`.  How many item ids are of category `"999999"`?  Display 10 rows of `churn_processed` where `item_id` is category `"999999"`.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 point]</span>

In [None]:
# Convert item_id into a category column in churn_processed
# Add code here
churn_processed['item_id'] = churn['item_id'].astype('category')
churn_processed.select_dtypes('category')

In [None]:
churn_processed.item_id.value_counts()


In [None]:
# Add code here
# Add Category
churn_processed.item_id = churn_processed.item_id.cat.add_categories(999999)
#  Replace the item_id of all the items sold only once in the entire data with "999999"   
item_counts = churn_processed.item_id.value_counts()
sold_once = item_counts[item_counts == 1].index
churn_processed.loc[churn_processed.item_id.isin(sold_once), 'item_id'] = 999999
# How many item ids are of category "999999"
churn_processed.item_id.value_counts()[999999]

In [None]:
# Display 10 rows of churn_processed where item_id is category 999999
churn_processed[churn_processed['item_id'] == 999999].head(10)

# End of assignment