# Workshop day 2 - data preparation

This script is used for the breakout exercises during the workshop.
Before you start with any exercise, run the code section 'Preparation' in order to load the required packages and to load and prepare the dataset.

### Import required packages
__Required packages:__ 
* numpy
* pandas  
* matplotlib

You can import packages using the 

    import

statement. In this module, we work primarily with **pandas**, which is the standard package in Python for data manipulation.

For visualization purposes, we will use **matplotlib** (the standard Python plotting library).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Show up to 50 columns when displaying a data frame
pd.set_option('display.max_columns', 50)

### Load data

In [None]:
churn_df = pd.read_csv("bankingchurn_data_prep.csv")
churn_df.head()

As you can see, there are various NA values. Also, you should always check that columns have the proper data type (more on that in the field work exercises).
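
As a minimal sketch of such a data type check, the example below uses a small toy data frame rather than the workshop dataset. The values and the choice of conversions are assumptions made for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values, not the workshop dataset)
toy = pd.DataFrame({
    "contract_end": ["2021-03-31", None, "2022-12-15"],
    "credit_rating": ["55", "80", None],
})

# Dates and numbers stored as text do not show up as datetime / numeric types
print(toy.dtypes)

# Convert to proper types; missing entries become NaT / NaN
toy["contract_end"] = pd.to_datetime(toy["contract_end"])
toy["credit_rating"] = pd.to_numeric(toy["credit_rating"])

print(toy.dtypes)
```

After the conversion, date and numeric operations (for example `.dt.year` or `.median()`) work on these columns as expected.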

<a id='sec41'></a>
# Exercise 1: Eliminate outliers

### Remove outliers from the variable last_balance
Start with a visual inspection using a histogram or a boxplot.

    plt.hist()
    plt.boxplot()

Let's add two reference lines to the histogram to make potential outliers more visible (2 standard deviations below and above the mean).

In [None]:
# Calculate mean and standard deviation of last_balance
mean = churn_df.last_balance.mean()
std = churn_df.last_balance.std()

# Create the histogram of last_balance with two reference lines
plt.hist(churn_df.last_balance, bins = 30)
plt.axvline(mean - 2 * std, color='r', linestyle='dashed', linewidth=2)
plt.axvline(mean + 2 * std, color='r', linestyle='dashed', linewidth=2)
plt.show()
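
The reference lines can also be read as a simple numeric rule. The sketch below counts how many values fall outside mean ± 2 standard deviations, using made-up toy balances rather than the workshop data. Keep in mind that extreme values inflate both the mean and the standard deviation, so this rule is only a rough guide.

```python
import pandas as pd

# Toy balances (hypothetical values): ten ordinary ones plus one extreme value
balances = pd.Series([100, 120, 90, 110, 105, 95, 115, 100, 108, 97, 5000])

mean = balances.mean()
std = balances.std()
lower, upper = mean - 2 * std, mean + 2 * std

# Values beyond the two reference lines
outside = balances[(balances < lower) | (balances > upper)]
print(f"{len(outside)} of {len(balances)} values lie outside [{lower:.0f}, {upper:.0f}]")
```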

In [None]:
# Create the boxplot of last_balance
plt.boxplot(churn_df.last_balance)
plt.show()

#### Compare with percentiles
    describe(percentiles=[...]): you can define the percentiles which should be displayed in the summary analysis

In [None]:
# Use describe on last_balance with 1st, 5th, 25th, 50th, 75th, 95th and 99th percentiles
churn_df.last_balance.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99])

#### Based on the analyses, select appropriate thresholds and exclude the outliers
Be prepared to explain why you chose a specific value.

In [None]:
#TASK: replace the '?????' with appropriate thresholds 
#TASK: use the 2nd line of code if you want to filter out high and low outliers
data2 = churn_df.loc[churn_df.last_balance < ?????]
#data2 = churn_df.loc[churn_df.last_balance.between(?????, ?????)]
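
To illustrate the filtering pattern without giving away the exercise, here is a sketch on simulated toy data. The 1st and 99th percentiles used as thresholds are an arbitrary assumption for the demo, not a recommendation for `last_balance`.

```python
import numpy as np
import pandas as pd

# Simulated toy balances (hypothetical), with one injected extreme value
rng = np.random.default_rng(0)
toy = pd.DataFrame({"last_balance": rng.normal(1000, 200, size=1000)})
toy.loc[0, "last_balance"] = 50_000

# Use two percentiles as lower and upper thresholds (illustrative choice)
low, high = toy["last_balance"].quantile([0.01, 0.99])

# between() keeps rows with low <= value <= high
cleaned = toy.loc[toy["last_balance"].between(low, high)]
print(len(toy), "->", len(cleaned))
```

Whatever thresholds you choose, check how many rows are removed, since an overly strict cut discards valid customers.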

In [None]:
#Re-create the histogram to review the new distribution after cleaning
mean = data2.last_balance.mean()
std = data2.last_balance.std()

# Create the histogram of last_balance with two reference lines
plt.hist(data2.last_balance, bins = 30)
plt.axvline(mean - 2 * std, color='r', linestyle='dashed', linewidth=2)
plt.axvline(mean + 2 * std, color='r', linestyle='dashed', linewidth=2)
plt.show()

<a id='sec43'></a>
# Exercise 2: Handling missing values
Most algorithms cannot deal with data gaps, such as missing values or "errors" (meaning the information from the column is lost).

We have to decide what to do with missing values. Some options are: 
 - Exclude variables with NAs (e.g. using a threshold)
 - Remove rows with NAs 
 - Consider NAs as a separate category
 - Impute missing values with an aggregated value (e.g. mean, median, min, max)
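
The four options above can be sketched as follows on a small toy data frame. The column names, values and the 50% threshold are assumptions made for illustration; which option fits depends on the variable.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the workshop dataset)
toy = pd.DataFrame({
    "credit_rating": [55.0, np.nan, 80.0, 62.0],
    "profession": ["teacher", None, "nurse", "teacher"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# 1) Exclude variables whose share of NAs exceeds a threshold (here 50%)
keep = toy.columns[toy.isna().mean() <= 0.5]
option1 = toy[keep]

# 2) Remove rows that still contain NAs
option2 = option1.dropna()

# 3) Consider NAs as a separate category
option3 = toy["profession"].fillna("Unknown")

# 4) Impute missing values with an aggregated value (here the median)
option4 = toy["credit_rating"].fillna(toy["credit_rating"].median())

print(option1.columns.tolist(), len(option2))
print(option3.tolist())
print(option4.tolist())
```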



#### Identify rows with missing values
     
     isna(): Check for missing values

In [None]:
# Count number of rows with missing values in each column
churn_df.isna().sum()

In [None]:
# Calculate the percentage of missing values for each column and assign it to a Series called 'missing_pct'
missing_pct = churn_df.isna().mean().mul(100).sort_values(ascending=False)

# Filter 'missing_pct' for only columns which have any missing values
missing_pct = missing_pct[missing_pct > 0]

# Plot 'missing_pct' (seaborn is not loaded in the Preparation section, so import it here)
import seaborn as sns
plt.figure(figsize=(5,5))
sns.barplot(x=missing_pct, y=missing_pct.index, ci=None).\
    set_title('Percent of missing values per variable (only variables with missing values)')
plt.show()

### Let's decide how to handle each of the 3 variables
#### Contract end
This variable contains the date on which a contract ended.

Pick one of the following actions:

* Keep variable as is
* Remove variable
* Impute with median
* Encode missing values as separate value

#### Credit rating
This variable contains the credit rating of the customer (i.e., the probability of fulfilling a credit obligation), if they have one. Values can range from 0 to 100.

In [None]:
# Check the histogram and descriptives of credit rating (NAs are dropped before plotting)
plt.hist(churn_df.credit_rating.dropna(), bins=30)
plt.show()
churn_df.credit_rating.describe()

Pick one of the following actions:

* Keep variable as is
* Remove variable
* Impute with median
* Encode missing values as a specific value

#### Profession
This variable contains the description of the profession of a customer.

In [None]:
# Check the values that can appear in the profession column
churn_df.profession.value_counts()

Pick one of the following actions:

* Keep variable as is
* Remove variable
* Impute with mode (most frequent value)
* Encode missing values as separate category

### Apply cleaning steps
Fill in the variable names for each missing value treatment step based on your decisions above.

In [None]:
# Remove variable from dataset
churn_df = churn_df.drop(columns='?????')

In [None]:
# Impute variable with the median value of the non-NA entries
median = churn_df['?????'].median()
churn_df['?????'] = churn_df['?????'].fillna(median)

In [None]:
# Impute variable with the mode value (the most frequent value) of the non-NA entries
# mode() returns a Series (there may be ties), so take its first entry as a scalar
mode = churn_df['?????'].mode()[0]
churn_df['?????'] = churn_df['?????'].fillna(mode)
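
One detail is worth knowing for mode imputation: `Series.mode()` returns a Series (there can be several most frequent values), and `fillna` aligns a Series argument on the index instead of broadcasting it. The toy example below, with made-up values, shows the difference between passing the Series and passing its first entry as a scalar.

```python
import pandas as pd

# Toy column (hypothetical values) with two missing entries
toy = pd.Series(["teacher", None, "nurse", "teacher", None])

mode_series = toy.mode()  # a Series, here with a single entry at index 0

# Passing the Series: fillna aligns on the index, so only a gap at index 0 could be filled
aligned = toy.fillna(mode_series)

# Passing a scalar: every gap is filled with the most frequent value
scalar = toy.fillna(mode_series.iloc[0])

print(aligned.isna().sum(), "missing values left when passing the Series")
print(scalar.isna().sum(), "missing values left when passing the scalar")
```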

In [None]:
# Encode missing values as a separate category ("Unknown")
churn_df['?????'] = churn_df['?????'].fillna('Unknown')

# Well done!