# Module 3: Exercise B

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In this exercise, we will work with a dataset from a direct marketing campaign (phone calls) run by a banking institution. The classification goal is to predict whether a client will subscribe to a term deposit. The variable __y__ represents the outcome of the phone call:

- `1` indicates that the client subscribed to a term deposit.
- `0` indicates that they did not.

## Data Preprocessing

In [2]:
bank = pd.read_csv('bank_data.csv')
bank.head()

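|   | age | job | education | default | balance | housing | loan | duration | campaign | pdays | previous | y |
|---|-----|-----|-----------|---------|---------|---------|------|----------|----------|-------|----------|---|
| 0 | 30.0 | unemployed | primary | no | 1787.0 | no | no | 79.0 | 1.0 | 999.0 | 0.0 | 0 |
| 1 | 33.0 | services | secondary | no | 4789.0 | yes | yes | 220.0 | 1.0 | 339.0 | 4.0 | 0 |
| 2 | 35.0 | management | tertiary | no | 1350.0 | yes | no | 185.0 | 1.0 | 330.0 | 1.0 | 0 |
| 3 | 30.0 | management | tertiary | no | NaN | yes | yes | 199.0 | 4.0 | 999.0 | 0.0 | 0 |
| 4 | 59.0 | blue-collar | secondary | no | 0.0 | yes | no | 226.0 | 1.0 | 999.0 | 0.0 | 0 |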


In [3]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        4521 non-null   float64
 1   job        4483 non-null   object 
 2   education  4521 non-null   object 
 3   default    4521 non-null   object 
 4   balance    4424 non-null   float64
 5   housing    4521 non-null   object 
 6   loan       4521 non-null   object 
 7   duration   4521 non-null   float64
 8   campaign   4521 non-null   float64
 9   pdays      4463 non-null   float64
 10  previous   4521 non-null   float64
 11  y          4521 non-null   int64  
dtypes: float64(6), int64(1), object(5)
memory usage: 424.0+ KB


### Missing Values

>__Task 1__
>
>Count the number of NAs for each column

In [None]:
...

Several columns contain missing values. Let's deal with them.

>__Task 2__
>
>Drop rows with missing values and double check the NAs for each column

In [None]:
...

### Outliers

>__Task 3__
>
>- Find the mean and standard deviation of the __balance__ column
>- Create a mask for values less than `mean-3*sd` and greater than `mean+3*sd`
>- Use the mask to filter out outlier rows and print the first 10 rows
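The three-standard-deviation rule described above can be illustrated on made-up numbers. This is only a sketch on a hypothetical `toy_balance` series, not the bank data; the names `lower`, `upper`, and `mask` are illustrative choices, and here the mask is used to select the outlier rows.

```python
import pandas as pd

# Hypothetical toy values (not the bank data): 19 modest balances and one extreme one.
toy_balance = pd.Series(list(range(0, 190, 10)) + [5000], dtype=float)

mean = toy_balance.mean()
sd = toy_balance.std()  # pandas uses the sample standard deviation (ddof=1) by default

# Thresholds three standard deviations away from the mean.
lower = mean - 3 * sd
upper = mean + 3 * sd

# Boolean mask: True where a value falls outside the thresholds.
mask = (toy_balance < lower) | (toy_balance > upper)

# Select the rows flagged by the mask.
print(toy_balance[mask])
```

With very small samples no value can lie more than three sample standard deviations from the mean, which is why the toy series has 20 entries.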

In [None]:
...

>__Task 4__
>
>- Replace outliers in the __balance__ column with the thresholds
>- Check descriptive statistics of the data

In [None]:
...

### Categorical Data

The data set contains categorical variables. Let's convert them.

>__Task 5__
>
>- Check the distinct values, and the count of each value, in the columns: __job__, __education__, __default__, __housing__, __loan__
>- Convert variables into numerical ones (Hint: create a dictionary and use `.replace` for binary categorical variables, and use `pd.get_dummies` for multi-categorical variables)
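The two techniques named in the hint can be sketched on a small, made-up frame. The frame `toy`, the dictionary `yes_no`, and the choice of `dtype="uint8"` are assumptions made for illustration, not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy frame (not the bank data).
toy = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "job": ["services", "management", "services"],
})

# Binary column: replace the two categories with 0/1 using a dictionary,
# then cast explicitly so the column ends up as an integer dtype.
yes_no = {"no": 0, "yes": 1}
toy["default"] = toy["default"].replace(yes_no).astype(int)

# Multi-category column: one indicator column per category.
toy_encoded = pd.get_dummies(toy, columns=["job"], dtype="uint8")
print(toy_encoded)
```

`pd.get_dummies` drops the original `job` column and names each new column `job_<category>`.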

In [None]:
...

You can double check the results by printing the meta information again:

In [11]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4332 entries, 0 to 4520
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  4332 non-null   float64
 1   default              4332 non-null   int64  
 2   balance              4332 non-null   float64
 3   housing              4332 non-null   int64  
 4   loan                 4332 non-null   int64  
 5   duration             4332 non-null   float64
 6   campaign             4332 non-null   float64
 7   pdays                4332 non-null   float64
 8   previous             4332 non-null   float64
 9   y                    4332 non-null   int64  
 10  job_admin.           4332 non-null   uint8  
 11  job_blue-collar      4332 non-null   uint8  
 12  job_entrepreneur     4332 non-null   uint8  
 13  job_housemaid        4332 non-null   uint8  
 14  job_management       4332 non-null   uint8  
 15  job_retired          4332 non-null   u

### Visualization

>__Task 6__
>
>Create a scatter plot of the __balance__ and __age__ variables, labelled with __y__

In [None]:
...

### Data Scaling

The k-means algorithm computes the distance between data points and cluster centers. When some features have a much larger scale than others, they dominate the distance calculations, which can lead to suboptimal clusters. Scaling the features to a common range helps alleviate this issue.
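The effect of scale on distances can be seen in a small numeric sketch. The rows below are hypothetical `[age, balance]` pairs, and `min_max_scale` is an illustrative helper that applies the usual min-max formula `(x - min) / (max - min)` to each column, which is the rescaling to the range [0, 1] that min-max scaling performs.

```python
import numpy as np

# Hypothetical toy rows, columns are [age, balance] (not the bank data).
toy = np.array([
    [25.0, 1000.0],
    [60.0, 1100.0],
    [26.0, 2000.0],
    [40.0, 5000.0],
])

def min_max_scale(a):
    """Rescale each column to [0, 1] using (x - min) / (max - min)."""
    col_min = a.min(axis=0)
    col_max = a.max(axis=0)
    return (a - col_min) / (col_max - col_min)

def distance(a, i, j):
    """Euclidean distance between rows i and j of a."""
    return float(np.linalg.norm(a[i] - a[j]))

toy_scaled = min_max_scale(toy)

# Unscaled: balance dominates, so row 0 looks closest to row 1
# even though their ages differ by 35 years.
print(distance(toy, 0, 1), distance(toy, 0, 2))

# Scaled: both features contribute comparably, and row 0 is now closer to row 2.
print(distance(toy_scaled, 0, 1), distance(toy_scaled, 0, 2))
```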

>__Task 7__
>
>- Define `X` by dropping the __y__ column
>- Define `y` as the __y__ column
>- Define `X_scaled` by scaling `X` using `MinMaxScaler`

In [None]:
...

## K-means Clustering

>__Task 8__
>
>- Apply k-means to `X` with 2 clusters and print coordinates of cluster centers
>- Print labels of each point
>- Create a scatter plot of __balance__ and __age__ variables labelled with point labels

In [None]:
...

>__Task 9__
>
>- Apply k-means to `X_scaled` with 2 clusters and print coordinates of cluster centers
>- Print labels of each point
>- Create a scatter plot of __balance__ and __age__ variables labelled with point labels

In [None]:
...

The clustering results show significant differences between unscaled and scaled variables. In the case of unscaled data, the __balance__ column, which has larger values compared to other columns, exerts a significant influence on the k-means algorithm.

## Hierarchical Clustering

Let's take a sample of 25 from `X` and use hierarchical clustering to cluster them:

In [20]:
np.random.seed(156)
X_sample = X.sample(25, random_state=146)

>__Task 10__
>
>- Apply hierarchical clustering to `X_sample` with each of the linkage criteria: single, complete, average, centroid, and ward
>- Plot a dendrogram of each criterion and set `orientation` to `left`
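As a rough illustration of what a linkage criterion produces, the sketch below runs single linkage on five made-up points. It assumes SciPy's `scipy.cluster.hierarchy` module is the intended tool (the exercise does not name the library), and `points`, `Z`, and `tree` are illustrative names.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy points (not the bank data).
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
    [5.0, 6.0],
    [10.0, 0.0],
])

# Each row of Z records one merge: the two cluster ids joined,
# the distance between them, and the size of the merged cluster.
Z = linkage(points, method="single")

# no_plot=True returns only the layout information; omit it in a
# notebook (with matplotlib available) to draw the tree.
tree = dendrogram(Z, orientation="left", no_plot=True)
print(tree["ivl"])  # leaf labels in the order they appear along the tree
```

Swapping `method="single"` for another criterion changes how the distance between clusters is measured, and therefore the order and height of the merges.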

In [None]:
# Single
...

In [None]:
# Complete
...

In [None]:
# Average
...

In [None]:
# Centroid
...

In [None]:
# Ward
...

### Agglomerative Clustering

Now, let's proceed with clustering the entire data set using agglomerative clustering from scikit-learn.

>__Task 11__
>
>- Apply agglomerative clustering with 2 clusters
>- Define `cluster_labels` as the result of each sample’s clustering assignment
>- Create a scatter plot of __duration__ and __pdays__ variables labelled with `cluster_labels`

In [None]:
...

>__Task 12__
>
>- Apply agglomerative clustering with 2 clusters and `complete` linkage
>- Define `cluster_labels` as the result of each sample’s clustering assignment
>- Create a scatter plot of __duration__ and __pdays__ variables labelled with `cluster_labels`

In [None]:
...