# AA - Feature Engineering

There are several ways to select relevant variables. The standard analyses covered here include:

__Description:__
* <a href = '#sec1'>Preliminary</a>
* <a href = '#sec2'> Feature Engineering </a>
* <a href = '#sec21'> Exercise 1: Calculations </a>
* <a href = '#sec22'> Exercise 2: Transformations </a>
    * <a href = '#sec221'> 1.Binary encoding </a>
    * <a href = '#sec222'> 2.One-hot encoding </a>
    * <a href = '#sec223'> 3.Log transformation</a>
    * <a href = '#sec224'> 4.Standardization </a>

------------
<a id='sec1'></a>
# Preliminary

#### Import required packages and change directory 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn import feature_selection, model_selection

# Display up to 100 columns
pd.set_option('display.max_columns', 100)

# Set working directory
#os.chdir(default_path)

#### Load the working file

In [None]:
# Load the data
churn_df = pd.read_pickle("churn_for_engineering.p")

------------
<a id='sec2'></a>
# Feature Engineering
Feature engineering (i.e., variable creation) is the process of taking existing variables and
calculating new ones from them. Let's take a quick look at the existing variables:

In [None]:
churn_df.head(10)

<a id='sec21'></a>
## Exercise 1: Calculations
Based on the three hypotheses below, which features would you create to test each hypothesis? How would you visualize the relationship between the new feature and the target variable?

* __Hypothesis 1:__ Older people are more likely to churn.
* __Hypothesis 2:__ People with sudden change in transaction behaviour are more likely to churn.
* __Hypothesis 3:__ People with a shorter tenure are less loyal and more likely to churn.


---

### Hypothesis 1: Older people are more likely to churn.

**Calculation of time difference**

For the calculation of 'age' we need to calculate the time difference between the date_of_birth and the current date.

    pd.to_datetime("now"): Returns the current timestamp (date and time)

In [None]:
# Calculate the current datetime
curr_time = pd.to_datetime("now")
curr_time

In [None]:
#TASK: Calculate the time difference between the current datetime and the date_of_birth
churn_df['age'] = ?????
churn_df['age'].head()

We see that the output is a time difference (a Timedelta), not yet an age. To retrieve the actual age of a customer, we additionally need to:
* __dt.days:__ Extract the number of days only
* Convert days into years by dividing the number of days by __365.25__
* Format the numbers correctly
    * __round(0):__ Rounds the number to 0 decimal places
    * __astype(np.int64):__ Converts the number to integer
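
The conversion steps above can be tried out on a small toy example first. This is only a sketch: the two birth dates and the fixed reference date `ref_date` are made up for illustration (a fixed date is used instead of the current time so the result is reproducible).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the churn data set
toy = pd.DataFrame({"date_of_birth": pd.to_datetime(["1980-06-15", "1995-01-01"])})
ref_date = pd.Timestamp("2020-01-01")  # assumed fixed reference date

# Subtracting datetimes yields a Series of Timedelta values
delta = ref_date - toy["date_of_birth"]

# Days -> years -> rounded to the nearest year -> integer
toy["age"] = (delta.dt.days / 365.25).round(0).astype(np.int64)
print(toy)
```

Note that `round(0)` rounds to the nearest year, so the first toy customer (about 39.5 years old on the reference date) ends up with an age of 40 rather than 39 completed years.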

In [None]:
churn_df['age'] = (churn_df['age'].dt.days/365.25).round(0).astype(np.int64)
churn_df['age'].head()

In [None]:
# Deletes the column as we won't need it anymore
del churn_df['date_of_birth']

### Visual inspection of new feature with target

In [None]:
#TASK: Generate a box plot to compare age distributions for churners and non-churners
sns.boxplot(x=????, y= ?????, data=churn_df, palette='Set2')

In [None]:
#TASK: Generate a KD plot to compare age distributions for churners and non-churners
sns.kdeplot(churn_df[churn_df[????]==1][????])
sns.kdeplot(churn_df[churn_df[????]==0][????])
plt.title("age")
plt.legend(['churn_flag:1','churn_flag:0'], loc='upper right')
plt.show()

### Conclusion: 


In [None]:
#TASK: Write down your conclusions regarding the hypothesis that older people are more likely to churn.




---

### Hypothesis 2: People with sudden change in transaction behaviour are more likely to churn.

In [None]:
#TASK: Describe what you think could be used to describe a change in transaction behaviour from our data





In [None]:
# Maybe the growth in the last balance over the last 6 months?
churn_df['last_balance_growth_6M'] = 100*(churn_df['last_balance'] - churn_df['last_balance_minus_6_months'])/abs(churn_df['last_balance_minus_6_months'])

### Visual inspection of new feature with target

In [None]:
#TASK: Generate a box plot to compare the balance growth over the last 6 months for churners and non-churners
sns.boxplot(x=????, y= ??????, data=churn_df, palette='Set2')

In [None]:
#TASK: Generate a KD plot to compare the balance growth over the last 6 months for churners and non-churners
sns.kdeplot(churn_df[churn_df[?????]==1][?????])
sns.kdeplot(churn_df[churn_df[?????]==0][?????])
plt.title("last_balance_growth_6M")
plt.legend(['churn_flag:1','churn_flag:0'], loc='upper right')
plt.show()

#### Visual inspection doesn't allow any conclusions yet. There seem to be significant outliers which we should remove first. Some of the outliers are due to the fact that we calculate a growth rate using positive and negative values.
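
To see how such outliers can arise, here is a minimal sketch on made-up balances (the three rows are hypothetical). It applies the same growth formula as above and shows that a starting balance close to zero inflates the growth rate, while a negative starting balance is only handled sensibly because the denominator uses the absolute value.

```python
import pandas as pd

# Hypothetical balances: a normal case, a negative base, and a base close to zero
toy = pd.DataFrame({
    "last_balance": [110.0, -50.0, 500.0],
    "last_balance_minus_6_months": [100.0, -100.0, 1.0],
})

base = toy["last_balance_minus_6_months"]
toy["growth"] = 100 * (toy["last_balance"] - base) / base.abs()
print(toy)
```

The third row grows from 1 to 500, which gives a growth rate of 49900%. A starting balance of exactly zero would produce an infinite or undefined value.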

In [None]:
# Compare with selected percentiles
churn_df.last_balance_growth_6M.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99])

In [None]:
#TASK: Choose threshold to remove outliers
churn_df = churn_df.loc[((churn_df.last_balance_growth_6M > ?????) & (churn_df.last_balance_growth_6M < ?????))]
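
One possible way to pick thresholds is to cut at fixed percentiles. The sketch below does this on synthetic data; the 1st and 99th percentiles are an assumption for illustration, not a recommendation for the churn data, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic growth rates with one extreme value planted in them
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(0, 10, 1000))
values.iloc[0] = 49900.0

# Assumed thresholds: 1st and 99th percentile
lower, upper = values.quantile([0.01, 0.99])

# Keep only observations strictly inside the thresholds
trimmed = values[(values > lower) & (values < upper)]
print(len(values), len(trimmed))
```

Percentile cuts always remove a fixed share of observations, even if they are not implausible, so it is worth comparing the chosen thresholds against the `describe()` output above.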

In [None]:
sns.boxplot(x="churn_flag", y= 'last_balance_growth_6M', data=churn_df, palette='Set2')

In [None]:
sns.kdeplot(churn_df[churn_df['churn_flag']==1]['last_balance_growth_6M'])
sns.kdeplot(churn_df[churn_df['churn_flag']==0]['last_balance_growth_6M'])
plt.title("last_balance_growth_6M")
plt.legend(['churn_flag:1','churn_flag:0'], loc='upper right')
plt.show()

### Conclusion:  

In [None]:
#TASK: Write down your conclusions regarding the hypothesis that a sudden change in transaction behaviour indicates a higher likelihood to churn




---

### Hypothesis 3: People with a shorter tenure are less loyal and more likely to churn.

The tenure is calculated as follows:
* **Customer has churned:** Time difference between the contract_end and contract_start
* **Customer has not churned:** Time difference between the current date and contract_start
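
The two cases above can be illustrated on a toy example. This sketch uses made-up contract dates and a fixed reference date `ref_date` instead of the current time. It also takes a slightly different route than the exercise below: instead of two helper columns, it first picks the relevant end date per customer with `Series.where`.

```python
import numpy as np
import pandas as pd

# Hypothetical customers: the first has churned, the second is still active
toy = pd.DataFrame({
    "churn_flag": [1, 0],
    "contract_start": pd.to_datetime(["2015-01-01", "2018-01-01"]),
    "contract_end": pd.to_datetime(["2018-01-01", None]),
})
ref_date = pd.Timestamp("2020-01-01")  # assumed fixed reference date

# Use contract_end for churners, the reference date for everyone else
end = toy["contract_end"].where(toy["churn_flag"] == 1, ref_date)

# Tenure in years, rounded to the nearest year
toy["tenure"] = ((end - toy["contract_start"]).dt.days / 365.25).round(0).astype(np.int64)
print(toy[["churn_flag", "tenure"]])
```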

In [None]:
#TASK: Calculate tenure for customers who have churned and those who have not
curr_time = pd.to_datetime("now")
churn_df['tenure_churn'] = ((churn_df[?????] - churn_df[??????]).dt.days/365.25).round(0).fillna(0).astype(np.int64)
churn_df['tenure_nochurn'] = ((curr_time - churn_df[?????]).dt.days/365.25).round(0).astype(np.int64)

In [None]:
#TASK: Create one tenure variable depending on whether the customer has churned or not
churn_df['tenure'] = np.where(churn_df[????]==1, churn_df['tenure_churn'] , churn_df['tenure_nochurn'])

# Delete as we don't need it anymore
del churn_df['contract_start']
del churn_df['contract_end']
del churn_df['tenure_churn']
del churn_df['tenure_nochurn']

### Visual inspection of new feature with target

In [None]:
#TASK: Generate a box plot to compare the tenure for churners and non-churners
sns.boxplot(x=?????, y= ?????, data=churn_df, palette='Set2')

In [None]:
#TASK: Generate a KD plot to compare the tenure for churners and non-churners
sns.kdeplot(churn_df[churn_df[?????]==1][?????])
sns.kdeplot(churn_df[churn_df[?????]==0][?????])
plt.title("tenure")
plt.legend(['churn_flag:1','churn_flag:0'], loc='upper right')
plt.show()

### Conclusion: 

In [None]:
#TASK: Write down your conclusions regarding the hypothesis that people with a shorter tenure at the bank are more likely to churn



------------
<a id='sec22'></a>
## Exercise 2: Transformations

Some machine learning models work only on numerical variables. In order to use categorical variables, they need to be transformed into numerical values:
- Binary encoding
- One-hot encoding

Numerical variables can be transformed to improve model performance! Commonly applied transformations are:
- Log transformations
- Standardization

<a id='sec221'></a>
### 1. Binary encoding

Binary encoding applies to categorical features with exactly two levels (such as gender). It maps the category levels, currently stored as strings, to the values 0 and 1.
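
As a minimal sketch of the idea, the following maps a made-up two-level string variable to 0 and 1 with `Series.map`. The toy values are hypothetical; the level labels 'M' and 'W' follow the mapping used in the exercise below.

```python
import pandas as pd

# Hypothetical two-level categorical variable
toy = pd.Series(["M", "W", "W", "M"], name="gender")

# Each level is replaced by its binary code
male = toy.map({"M": 1, "W": 0})
print(male)
```

Levels that are missing from the mapping dictionary become `NaN`, so it is worth checking the unique values of the column before mapping.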

In [None]:
churn_df['gender'].head(5)

In [None]:
#TASK: Create a new feature called male by applying binary encoding to the gender variable
churn_df['male'] = churn_df[?????].map( {'M':1, 'W':0} ).astype('category')
del churn_df['gender']

In [None]:
# Display transformed column
churn_df['male'].head()

<a id='sec222'></a>
### 2. One-hot encoding
One-hot encoding converts each category value into a new column and assigns a 1 or 0 value to the column. Many libraries support one-hot encoding, but the simplest option is the pandas get_dummies() function.

    pd.get_dummies: Convert categorical variable into dummy/indicator variables.
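
A small sketch of what `pd.get_dummies` produces, using a made-up column with three levels (the level 'Law' is hypothetical and only serves as a third category):

```python
import pandas as pd

# Hypothetical categorical column with three levels
toy = pd.DataFrame({"profession": ["Journalism", "Unknown", "Law"]})

# One new indicator column per level, named <prefix>_<level>
dummies = pd.get_dummies(toy["profession"], prefix="prof")
print(dummies)
```

Each row has exactly one active indicator. Depending on the pandas version, the indicator columns are stored as booleans or as 0/1 integers.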


#### Apply one-hot encoding to categorical variable profession

We see that the majority of customers are engaged in an 'Unknown' profession and Journalism. The rest is equally distributed across the other professions.

In [None]:
# Number of customers per profession
churn_df['profession'].value_counts()

In [None]:
# Variable profession before one-hot encoding
churn_df['profession'].head(10)

In [None]:
# TASK: apply one-hot encoding to the 'profession' variable
# The prefix parameter defines the prefix for the newly created binary variables
x = pd.get_dummies(churn_df[?????], prefix='prof').head(5)
x

To concatenate these new features with the existing data, do the following:

In [None]:
churn_df = pd.concat([churn_df, pd.get_dummies(churn_df['profession'], prefix='prof').astype('category')], axis=1)

#### Apply one-hot encoding to categorical variable segment

In [None]:
# TASK: apply one-hot encoding to the 'segment' variable
churn_df = pd.concat([churn_df, pd.get_dummies(churn_df[?????], prefix='segment').astype('category')], axis=1)

In [None]:
del churn_df['profession']
del churn_df['segment']

#### Remove variable ZIP Code

The categorical variable **ZIP Code** has more than 16055 levels. It would not make sense to create a binary variable for each level. Instead, one could look up geographic or demographic variables based on these ZIP codes (e.g. city, income distribution, age range) and merge them into the data set to add more valuable features. However, for demonstration purposes we will just remove the variable.
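
For illustration only, such an enrichment could look like the sketch below. Both tables are made up: `zip_lookup` stands for an external lookup table that is not part of this notebook, and the ZIP codes and city labels are placeholders.

```python
import pandas as pd

# Hypothetical customer data and hypothetical external lookup table
customers = pd.DataFrame({"customer_id": [1, 2, 3], "ZIP": ["00001", "00002", "00001"]})
zip_lookup = pd.DataFrame({"ZIP": ["00001", "00002"], "city": ["City A", "City B"]})

# A left join keeps every customer, even if the ZIP code is missing from the lookup
enriched = customers.merge(zip_lookup, on="ZIP", how="left")
print(enriched)
```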

In [None]:
churn_df.ZIP.nunique()

In [None]:
del churn_df['ZIP']

<a id='sec223'></a>
### 3. Log transformation
Log transformations are commonly applied to skewed data to bring its distribution closer to normal. This makes predictions more consistent between high and low values.

    np.log(): Natural logarithm, element-wise
    np.log1p(): Natural logarithm of 1 + x, element-wise (defined for x = 0)
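
The effect can be sketched on a handful of made-up, strongly right-skewed values. The example uses `np.log1p`, which computes log(1 + x) and therefore also works for zero values, and compares the skewness before and after the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, including a zero
values = pd.Series([0.0, 10.0, 100.0, 1000.0, 100000.0])

# log(1 + x) keeps zero at zero and compresses large values
logged = np.log1p(values)

print("skew before:", values.skew())
print("skew after: ", logged.skew())

# np.expm1 reverses the transformation
restored = np.expm1(logged)
```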

#### Apply log-transformation to numeric variable cash_withdrawals_value

In [None]:
# Before log transformation
sns.distplot(churn_df['cash_withdrawals_value'], bins = 100)
plt.title("Before log-transformation - cash_withdrawals_value")

In [None]:
#TASK: Apply the log transformation log(1 + x)
logged_var = np.log1p(churn_df[?????])

In [None]:
# After log transformation
sns.distplot(logged_var, bins = 100)
plt.title("After log-transformation - cash_withdrawals_value")

<a id='sec224'></a>
### 4. Standardization
Another commonly applied transformation is standardization. After standardization, each variable has a mean of 0 and a standard deviation of 1. The goal is to make input variables measured in different units comparable.

    preprocessing.StandardScaler(): Standardize features by removing the mean and scaling to unit variance
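
The underlying calculation is z = (x - mean) / standard deviation. The sketch below reproduces it by hand with NumPy on made-up numbers, using the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical values of a single numeric variable
x = np.array([1.0, 2.0, 3.0, 4.0])

# Remove the mean and scale to unit variance (population std, ddof=0)
z = (x - x.mean()) / x.std()

print(z)
print("mean:", z.mean(), "std:", z.std())
```

Note that the pandas method `Series.std()` defaults to the sample standard deviation (`ddof=1`), so a manual calculation with pandas gives slightly different values unless `ddof=0` is passed.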

#### Apply standardization to numeric variable last_balance

In [None]:
sns.distplot(churn_df['last_balance'], bins = 100)
plt.title("Before standardization - last_balance")

In [None]:
# Use array.reshape(-1, 1) as our data contains a single feature (one column)
x_array = np.array(churn_df['last_balance']).reshape(-1, 1)

# Create the Scaler object
scaler = preprocessing.StandardScaler()

# Fit to data, then transform it 
stand_var = scaler.fit_transform(x_array)

In [None]:
# Visualisation after standardization
sns.distplot(stand_var, bins = 100)
plt.title("After standardization - last_balance")

### Write data file to disk

In [None]:
churn_df.to_pickle("churn_for_features.p")