# Lesson 06 Assignment

In this assignment, we want to read the `retail-churn.csv` dataset that we examined in a previous assignment and begin to pre-process it. The goal of the assignment is to become familiar with some common pre-processing and feature engineering steps by implementing them.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt
%matplotlib inline

col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("../data/retail-churn.csv", sep = ",", skiprows = 1, names = col_names)

### Some basic EDA
Present the first few rows and use pandas' `describe` to get an overview of the data.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[0 point]</span>

In [2]:
# Present a random selection of 10 rows
churn.sample(10)

Unnamed: 0,user_id,gender,address,store_id,trans_id,timestamp,item_id,quantity,dollar
114321,2100074,A,F,23985,1192317,12/28/2000 0:00,4710000000000.0,1,54
30262,2151847,D,E,125656,914131,11/13/2000 0:00,4710000000000.0,1,16
25447,743587,J,F,120885,909214,11/12/2000 0:00,4710000000000.0,1,61
245823,1881011,D,E,243229,1614262,2/25/2001 0:00,4720000000000.0,1,479
16701,1550900,D,E,91035,875260,11/8/2000 0:00,4710000000000.0,2,500
201828,1909395,I,E,44436,1488088,2/5/2001 0:00,4710000000000.0,1,86
206513,1841664,G,F,224743,1470176,2/8/2001 0:00,8850000000000.0,1,49
32235,117593,D,F,133475,925024,11/15/2000 0:00,4710000000000.0,1,76
135560,1848687,K,H,27934,1256433,1/11/2001 0:00,4900000000000.0,1,128
222533,1988765,E,F,235978,1536271,2/15/2001 0:00,4710000000000.0,1,29


In [3]:
# Use .describe() to provide summary stats by col
print(churn.describe())

# Show the number of null or missing values by column
print(churn.isnull().sum())

            user_id       store_id      trans_id       item_id       quantity  \
count  2.522040e+05  252204.000000  2.522040e+05  2.522040e+05  252204.000000   
mean   1.395660e+06  126101.500000  1.229771e+06  4.467833e+12       1.385692   
std    6.094769e+05   72805.167983  2.350992e+05  1.679512e+12       3.705732   
min    1.113000e+03       0.000000  8.177470e+05  2.000882e+07       1.000000   
25%    9.937150e+05   63050.750000  1.025926e+06  4.710000e+12       1.000000   
50%    1.586046e+06  126101.500000  1.233476e+06  4.710000e+12       1.000000   
75%    1.862232e+06  189152.250000  1.433222e+06  4.710000e+12       1.000000   
max    2.179605e+06  252203.000000  1.635482e+06  9.790000e+12    1200.000000   

              dollar  
count  252204.000000  
mean      130.911389  
std       388.142169  
min         1.000000  
25%        42.000000  
50%        76.000000  
75%       132.000000  
max     70589.000000  
user_id      0
gender       0
address      0
store_id     0
tra

The new data frame will be called `churn_processed`, which stores the pre-processed columns as you run through each of these steps. You will need to make sure your columns are properly named.

1. Cast the `timestamp` column in churn into a column of type `datetime` and put the column into the `churn_processed` dataframe.  The new column should also be named `timestamp`.  Extract two new columns from `timestamp`: `dow` is the day of the week and `month` is the month of the year. Then drop the `timestamp` column from `churn_processed`.  Present the first few rows of `churn_processed`.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 point]</span>

***NOTES ON SUBMISSION***

I chose to create a copy of _churn_ and then transform the copy into _churn_processed_. This does create additional demands on memory, but it also ensures absolute _row-wise_ consistency as the variables are transformed. Additionally, the process of transformation can be more easily traced by the developer.

To preserve _churn_, the copy has to be an independent object, not a new reference to the same data. `DataFrame.copy()` guarantees this; wrapping the frame in `pd.DataFrame(...)` does not copy the underlying data by default.

In [4]:
# Add code here
# Create a new dataframe object that is an independent copy of churn, called churn_processed
churn_processed = churn.copy()
# Cast timestamp to datetime
churn_processed['timestamp'] = pd.to_datetime(churn_processed['timestamp'])
# Create a dow column
churn_processed['dow'] = churn_processed['timestamp'].dt.weekday.astype(object)
# Create a month column
churn_processed['month'] = churn_processed['timestamp'].dt.month.astype(object)
# display column data types and summary of the dataframe
print(churn_processed.dtypes)
churn_processed.head(5)

user_id               int64
gender               object
address              object
store_id              int64
trans_id              int64
timestamp    datetime64[ns]
item_id             float64
quantity              int64
dollar                int64
dow                  object
month                object
dtype: object


Unnamed: 0,user_id,gender,address,store_id,trans_id,timestamp,item_id,quantity,dollar,dow,month
0,101981,F,E,2860,818463,2000-11-01,4710000000000.0,1,37,2,11
1,101981,F,E,2861,818464,2000-11-01,4710000000000.0,1,17,2,11
2,101981,F,E,2862,818465,2000-11-01,4710000000000.0,1,23,2,11
3,101981,F,E,2863,818466,2000-11-01,4710000000000.0,1,41,2,11
4,101981,F,E,2864,818467,2000-11-01,4710000000000.0,8,288,2,11


In [5]:
# Drop the 'timestamp' column and reinspect
churn_processed = churn_processed.drop(columns='timestamp')
# See what we have
churn_processed.head()

Unnamed: 0,user_id,gender,address,store_id,trans_id,item_id,quantity,dollar,dow,month
0,101981,F,E,2860,818463,4710000000000.0,1,37,2,11
1,101981,F,E,2861,818464,4710000000000.0,1,17,2,11
2,101981,F,E,2862,818465,4710000000000.0,1,23,2,11
3,101981,F,E,2863,818466,4710000000000.0,1,41,2,11
4,101981,F,E,2864,818467,4710000000000.0,8,288,2,11
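The date handling in step 1 can be checked on a few values. The sketch below is illustrative only: it uses three timestamps copied from the rows shown above, and the explicit `format` string is an assumption about the file's layout (month/day/year, then hour:minute), not something the assignment specifies.

```python
import pandas as pd

# Sample values in the same "month/day/year hour:minute" layout as the data
raw = pd.Series(["11/1/2000 0:00", "12/28/2000 0:00", "2/25/2001 0:00"])

# An explicit format avoids any day/month ambiguity during parsing
ts = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

dow = ts.dt.weekday          # Monday=0 ... Sunday=6
dow_name = ts.dt.day_name()  # readable label for the same information
month = ts.dt.month          # January=1 ... December=12

print(pd.DataFrame({"timestamp": ts, "dow": dow, "dow_name": dow_name, "month": month}))
```

Note that `dt.weekday` counts from Monday as 0, so the `dow` value of 2 for 2000-11-01 corresponds to a Wednesday.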


2. Add `address` from `churn` to `churn_processed`. One-hot encode `address`, `dow` and `month`. Then drop columns `address`, `dow`, and `month` from `churn_processed`.  Finally, show some of the dataframe.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 point]</span>  

In [6]:
# Add code here
# import required packages
from sklearn.preprocessing import OneHotEncoder
# add address column:
# this step is unnecessary here, since churn_processed started as a copy of churn
# and already contains address.
# Select the three columns to encode (gender is left as-is in this step)
X = churn_processed[['address', 'dow', 'month']]
print(X.dtypes)
# One-hot-encode address, dow and month
# (`sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`)
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(X)
# Create column names
cols = ohe.get_feature_names_out(X.columns)
# Add one-hot encoded values as new columns, aligned on the original index
churn_encoded = churn_processed.join(pd.DataFrame(ohe.transform(X), columns=cols, index=X.index))
# Drop address, dow and month
churn_encoded = churn_encoded.drop(['address', 'dow', 'month'], axis = 1)
# Show the dataframe
churn_encoded.head()

address    object
dow        object
month      object
dtype: object


Unnamed: 0,user_id,gender,store_id,trans_id,item_id,quantity,dollar,address_A,address_B,address_C,...,dow_1,dow_2,dow_3,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12
0,101981,F,2860,818463,4710000000000.0,1,37,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,101981,F,2861,818464,4710000000000.0,1,17,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,101981,F,2862,818465,4710000000000.0,1,23,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,101981,F,2863,818466,4710000000000.0,1,41,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,101981,F,2864,818467,4710000000000.0,8,288,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


3. So far we dropped `address`, `dow`, `month`, and `timestamp`.  Why would we want to drop all these columns?
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[1 point]</span> 

**Why would we want to drop all these columns?**<br/>

***Answer:***

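The _one-hot encoded_ columns carry exactly the same information as `address`, `dow` and `month`, and the latter two were in turn derived from `timestamp`. Keeping the originals alongside the encoded columns adds perfectly redundant predictors, which can cause multicollinearity problems and muddy the interpretation of the model. In their raw form they are also awkward model inputs: `address` is a string, `timestamp` is a datetime, and the integer codes for `dow` and `month` would be read as ordered magnitudes.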
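The redundancy can be seen on a toy frame. This is a minimal sketch, not the approach used in this notebook: it relies on `pd.get_dummies`, which replaces the listed columns with their indicator columns in one call, and the data values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the transaction data
toy = pd.DataFrame({
    "address": ["E", "F", "E", "H"],
    "dow": [2, 2, 5, 6],
    "dollar": [37, 17, 23, 41],
})

# Encode the categorical columns; the originals are dropped automatically
encoded = pd.get_dummies(toy, columns=["address", "dow"], dtype=float)
print(encoded.columns.tolist())

# Each row has exactly one active indicator per encoded variable,
# so the original column can be fully recovered from the indicators
print(encoded.filter(like="address_").sum(axis=1).tolist())
```

Because the indicators determine the original value exactly, keeping both adds no information to the model.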


4. Rescale `dollar` using min-max normalization. Use `pandas` and `numpy` to do it and call the rescaled column `dollar_std_minmax`.  Then see what the first few rows of the dataframe looks like.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[1 point]</span> 

In [7]:
# Add code here
# define a function to apply the min-max norm to an array
def min_max_norm(arr):
    offset = np.min(arr)
    scale = np.max(arr) - offset
    # vectorized: works element-wise on a pandas Series or numpy array
    return (arr - offset) / scale
# Min-max of dollar using numpy and pandas
churn_encoded['dollar_std_minmax'] = min_max_norm(churn_encoded['dollar'])
# See what the dataframe looks like
churn_encoded

Unnamed: 0,user_id,gender,store_id,trans_id,item_id,quantity,dollar,address_A,address_B,address_C,...,dow_2,dow_3,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12,dollar_std_minmax
0,101981,F,2860,818463,4.710000e+12,1,37,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000510
1,101981,F,2861,818464,4.710000e+12,1,17,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000227
2,101981,F,2862,818465,4.710000e+12,1,23,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000312
3,101981,F,2863,818466,4.710000e+12,1,41,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000567
4,101981,F,2864,818467,4.710000e+12,8,288,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004066
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252199,2179605,B,251838,1630692,2.250000e+12,2,138,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001941
252200,2179605,B,251839,1630821,4.710000e+12,1,96,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001346
252201,2179605,B,251840,1630931,4.710000e+12,1,89,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001247
252202,2179605,B,251841,1631033,4.710000e+12,1,108,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001516
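The rescaled values look tiny because the maximum of `dollar` (70589) is far above typical purchases. As a quick sanity check, the arithmetic can be reproduced by hand; the minimum and maximum below are read from the `describe()` summary earlier in the notebook, and the helper name `min_max` is only for illustration.

```python
# Minimum and maximum of `dollar`, read from the describe() summary
dollar_min, dollar_max = 1, 70589

def min_max(x, lo=dollar_min, hi=dollar_max):
    """Map x linearly so that lo -> 0 and hi -> 1."""
    return (x - lo) / (hi - lo)

# Values of `dollar` from the first rows of the table above
for dollar in (37, 17, 288):
    print(dollar, round(min_max(dollar), 6))
```

The printed values agree with the `dollar_std_minmax` column for the corresponding rows.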


You can read about **robust normalization** [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html). The word **robust** in statistics generally refers to methods that behave reasonably even if the data is unusual.  For example, you can say that the median is a *robust* measure for the "average" of the data, while the mean is not.  In this respect a normalization is similar to an average. For a normalization **robust** might mean that the method is not affected by outliers. 
<br/><br/>
5. Write briefly about what makes robust normalization different from Z-normalization.  Write briefly about what makes robust normalization more robust than Z-normalization.  Rescale `quantity` using robust normalization. Call the rescaled column `qty_std_robust` and add it to `churn_processed`.  Compare minimum, maximum, mean, standard deviation, and the median of the original churn['quantity'] with the robust-normalized churn_processed['qty_std_robust'].  Comment on what went wrong.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[3 point]</span>

**Robust normalization vs. Z-normalization**<br/>
<strong>Answer</strong>:<br/>

Mathematically, the robust transformation differs from z-transformation in that it uses the _median_ rather than the _mean_ as the offset, and the _interquartile range_ or _IQR_ rather than the _standard deviation_ as the scale. That is, for a z-norm, we calculate 
$$
z^{*}=\frac{x-\overline{x}}{\sigma}\mbox{    , but}
$$
for a robust norm, we calculate
$$
r^{*}=\frac{x-M}{IQR}\mbox{    .}
$$

**Robust normalization is more robust than Z-normalization**<br/>
<strong>Answer</strong>:<br/>
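The median and the _IQR_ depend only on the middle of the distribution, so a handful of extreme values barely moves them, whereas the mean and especially the standard deviation are pulled toward outliers. The robust transformation therefore keeps the bulk of the data on a comparable scale even when outliers are present. For roughly normal data without outliers, z-normalization remains a good choice, because the mean and standard deviation are then the more efficient estimates of location and spread.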

This is probably a great reason to become very familiar with your data prior to deciding on transformations. A stitch in time...
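A small illustration of the difference, using made-up numbers: replacing one value with an extreme outlier changes the mean and standard deviation drastically, while the median and IQR stay put. The helper `location_and_spread` is a hypothetical name introduced just for this sketch.

```python
import numpy as np

def location_and_spread(values):
    """Return the statistics used by z-normalization and by robust normalization."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.quantile(values, [0.25, 0.75])
    return {
        "mean": values.mean(),
        "std": values.std(),
        "median": np.median(values),
        "iqr": q75 - q25,
    }

clean = location_and_spread([1, 2, 3, 4, 5])
dirty = location_and_spread([1, 2, 3, 4, 500])  # one extreme value

print("clean:", clean)
print("dirty:", dirty)
```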
  

In [8]:
# Add Code here:
# define a function to apply the robust norm to an array
def robust_norm(arr):
    arr = np.asarray(arr)
    offset = np.median(arr)
    # The intended scale is the interquartile range:
    #     np.quantile(arr, q = 0.75) - np.quantile(arr, q = 0.25)
    # For quantity both quartiles equal 1, so the IQR is 0 and the division is undefined.
    # The substitute below was meant to be a tiny positive constant, but in Python
    # `^` is bitwise XOR, so `10 ^ (-9999)` is the integer -9989 (and `10 ** -9999`
    # would underflow to 0.0 anyway). Every value is therefore divided by -9989.
    scale = 10 ^ (-9999)
    return (arr - offset) / scale

# apply the function to the quantity column
churn_encoded['qty_std_robust'] = robust_norm(churn_encoded['quantity'])
# See what the dataframe looks like
churn_encoded

Unnamed: 0,user_id,gender,store_id,trans_id,item_id,quantity,dollar,address_A,address_B,address_C,...,dow_3,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12,dollar_std_minmax,qty_std_robust
0,101981,F,2860,818463,4.710000e+12,1,37,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000510,-0.000000
1,101981,F,2861,818464,4.710000e+12,1,17,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000227,-0.000000
2,101981,F,2862,818465,4.710000e+12,1,23,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000312,-0.000000
3,101981,F,2863,818466,4.710000e+12,1,41,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000567,-0.000000
4,101981,F,2864,818467,4.710000e+12,8,288,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004066,-0.000701
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252199,2179605,B,251838,1630692,2.250000e+12,2,138,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001941,-0.000100
252200,2179605,B,251839,1630821,4.710000e+12,1,96,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001346,-0.000000
252201,2179605,B,251840,1630931,4.710000e+12,1,89,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001247,-0.000000
252202,2179605,B,251841,1631033,4.710000e+12,1,108,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001516,-0.000000


In [9]:
# Compare minimum, maximum, mean, standard deviation, and the median of churn['quantity'] with churn_processed['qty_std_robust']
print(churn['quantity'].describe(),'\n',churn_encoded['qty_std_robust'].describe())

count    252204.000000
mean          1.385692
std           3.705732
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max        1200.000000
Name: quantity, dtype: float64 
 count    252204.000000
mean         -0.000039
std           0.000371
min          -0.120032
25%          -0.000000
50%          -0.000000
75%           0.000000
max          -0.000000
Name: qty_std_robust, dtype: float64


**Failure of Robust Normalization**<br/>
***Answer***:<br/>

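The 25th and 75th percentiles of `quantity` are both 1, so $IQR = 0$ and $\frac{x-M}{IQR}$ is undefined for all $x\in X$: dividing by the true _IQR_ produces only `inf` and `nan` values, with a divide-by-zero warning from NumPy. The substitute used in the code does not rescue the calculation. In Python, `10^(-9999)` is a bitwise XOR that evaluates to $-9989$, not $10^{-9999}$, so every score is $(x-1)/(-9989)$. That is why the signs are flipped and the largest quantity, 1200, shows up as the *minimum* score, $-0.120032$. A genuinely infinitesimal divisor would not help either, because $10^{-9999}$ underflows to zero in double precision.

_sklearn_'s `RobustScaler` avoids the error in a different way: it replaces a zero scale with 1, so the column is only centered on the median. That practice may still create problems by not drawing the researcher's attention to the lack of spread in the middle half of the data.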
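One way to guard against a zero _IQR_ is sketched below. This is an illustration under stated assumptions, not the notebook's recorded method: the function name `robust_scale` and the toy data are made up, and the fallback to a scale of 1 is a choice that leaves such a column merely centered on its median.

```python
import pandas as pd

def robust_scale(values):
    """Center on the median and divide by the IQR.

    Falls back to a scale of 1 when the IQR is 0, so the result
    is then just the distance from the median.
    """
    values = pd.Series(values, dtype="float64")
    median = values.median()
    iqr = values.quantile(0.75) - values.quantile(0.25)
    scale = iqr if iqr > 0 else 1.0
    return (values - median) / scale

# Toy column shaped like `quantity`: almost all ones, a few large values
toy_quantity = [1] * 17 + [2, 8, 1200]
scaled_toy = robust_scale(toy_quantity)
print(scaled_toy.describe())

# A column with a non-zero IQR is scaled in the usual way
scaled_regular = robust_scale([1, 2, 3, 4, 5])
print(scaled_regular.tolist())
```

Checking the _IQR_ explicitly before scaling, and reporting when the fallback is used, keeps the lack of spread visible instead of hiding it.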

6. Rescale `quantity` using Z-normalization, but normalize `quantity` **per user**, i.e. group by `user_id` so that the mean and standard deviation computed to normalize are computed separately by each `user_id`. Call the rescaled feature `qty_std_Z_byuser`. Present a histogram of `qty_std_Z_byuser`.  Briefly describe why and when you think this kind of normalization makes sense.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[3 point]</span>

In [16]:
# Z-normalize quantity per user_id
# define a function to apply the z norm to an array
def z_norm(offset, scale, arr):
    return (np.asarray(arr) - offset) / scale
# NOTE: the offset and scale used here are the global mean and standard deviation of
# quantity, and they are applied to each user's mean quantity. Every row of a user
# therefore receives the same score; the score is not computed within each user's
# own transactions.
mod_mean = np.mean(churn_encoded['quantity'])
mod_std = np.std(churn_encoded['quantity'])
# create a df of aggregated user ids and their means
churn2 = pd.DataFrame(churn_encoded.groupby('user_id').quantity.mean())

# apply the function to the per-user mean quantity
churn2['qty_std_Z_byuser'] = z_norm(mod_mean, mod_std, churn2['quantity'])
# merge the dataframes with forward fill (one-to-many on user_id)
churn_encoded = pd.merge_ordered(churn_encoded, churn2['qty_std_Z_byuser'], on = 'user_id', fill_method = 'ffill')
# See what the dataframe looks like
churn_encoded

Unnamed: 0,user_id,gender,store_id,trans_id,item_id,quantity,dollar,address_A,address_B,address_C,...,dow_4,dow_5,dow_6,month_1,month_2,month_11,month_12,dollar_std_minmax,qty_std_robust,qty_std_Z_byuser
0,1113,K,118152,904890,4.710000e+12,2,29,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.000397,-0.0001,-0.029121
1,1113,K,118153,905431,4.900000e+12,3,391,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.005525,-0.0002,-0.029121
2,1113,K,118154,1000113,4.900000e+12,1,111,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.001558,-0.0000,-0.029121
3,1113,K,118155,1000416,7.620000e+12,1,268,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.003783,-0.0000,-0.029121
4,1113,K,118156,1000417,4.710000e+12,1,179,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.002522,-0.0000,-0.029121
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252199,2179605,B,251838,1630692,2.250000e+12,2,138,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001941,-0.0001,-0.010669
252200,2179605,B,251839,1630821,4.710000e+12,1,96,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001346,-0.0000,-0.010669
252201,2179605,B,251840,1630931,4.710000e+12,1,89,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001247,-0.0000,-0.010669
252202,2179605,B,251841,1631033,4.710000e+12,1,108,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.001516,-0.0000,-0.010669


In [11]:
# Present histogram of qty_std_Z_byuser
plt.hist(churn_encoded['qty_std_Z_byuser'], bins = 100)
plt.xlabel('Z-Scores')
plt.ylabel('Count')
plt.title('Distribution of Z-Scores')
plt.show()
# present an additional histogram of qty_std_robust, for comparison
plt.hist(churn_encoded['qty_std_robust'], bins = 100)
plt.xlabel('Robust Normalization Scores')
plt.ylabel('Count')
plt.title('Distribution of Robust Normalization Scores')
plt.show()

# display a summary of the newly augmented data
churn_encoded.describe()

KeyError: 'qty_std_Z_byuser'

**What could be the purpose of this normalization?**
<br/>
<strong>Answer</strong>:<br/>

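Using _z-normalization_ keeps outliers visible as large scores instead of compressing them, which helps when unusual purchases are themselves informative. Normalizing **per user** goes a step further: each quantity is expressed relative to that user's own mean and spread, so the feature measures how unusual a purchase is for that customer rather than for the whole population. This makes sense when users differ widely in their baseline behaviour, as in churn modelling, where a departure from a customer's own habits matters more than the absolute quantity. It does require enough transactions per user for the mean and standard deviation to be stable; for users with a single transaction, or with identical quantities throughout, the standard deviation is undefined or zero and needs a fallback.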
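A z-score computed within each user's own transactions can be sketched with `groupby(...).transform(...)`, which returns one value per original row. The toy data below is made up, and setting the score to 0 for users whose standard deviation is missing or zero is an assumption made for this illustration, not a requirement of the task.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: user 3 has a single purchase
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "quantity": [1, 2, 3, 10, 30, 5],
})

grouped = toy.groupby("user_id")["quantity"]
user_mean = grouped.transform("mean")  # each row gets its user's mean
user_std = grouped.transform("std")    # sample std (ddof=1); NaN for single-row users

# Rows whose user has no usable spread come out as NaN and are set to 0
z_by_user = (toy["quantity"] - user_mean) / user_std
toy["qty_std_Z_byuser"] = z_by_user.fillna(0.0)

print(toy)
```

With this construction every user's scores are centered on zero, so a value of 30 for a heavy buyer and a value of 3 for a light buyer can both register as "above that user's usual".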

7. Convert `item_id` into a category column in `churn_processed`.  Replace the `item_id` of all the items sold only once in the entire data with `"999999"`.  How many item ids are of category `"999999"`?  Display 10 rows of `churn_processed` where `item_id` is category `"999999"`.
<br/>&nbsp;&nbsp;<span style="color:red" float:right>[2 point]</span>

In [None]:
# Convert item_id into a category column in churn_processed
# Work with string labels first (the raw ids are floats), so that "999999" can be assigned
item_id = churn_encoded['item_id'].astype('int64').astype(str)
# Replace the item_id of all the items sold only once in the entire data with "999999"
# ("sold only once" is read here as: the id appears in exactly one transaction row)
sold_once = item_id.map(item_id.value_counts()) == 1
churn_encoded['item_id'] = item_id.mask(sold_once, "999999").astype('category')
# How many item ids are of category "999999"
print(f"There are {sold_once.sum()} item_id values that are '999999'.")
# Display 10 rows of churn_processed where item_id is category 999999
churn_encoded.loc[sold_once].head(10)

# End of assignment