<a href="https://www.kaggle.com/code/asishjosekakkadan/clustering-loan-borrowers?scriptVersionId=271451989" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This is publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to lend to borrowers whose profile suggests a high probability of paying you back.

## Understanding the data

The data set has 9,578 loans with information on the loan structure, the borrower, and whether the loan was paid back in full. Because clustering is unsupervised, we will drop the target column not.fully.paid before fitting.

In [None]:
import pandas as pd
loan_data = pd.read_csv("/kaggle/input/loan-data/loan_data.csv")
loan_data.head()

In [None]:
loan_data.info()

## Preprocessing the data

In [None]:
# Percentage of missing values per column
percent_missing = round(100 * loan_data.isnull().sum() / len(loan_data), 2)
percent_missing

In [None]:
cleaned_data = loan_data.drop(['purpose', 'not.fully.paid'], axis=1)
cleaned_data.info()

#### Outliers analysis
One weakness of hierarchical clustering is its sensitivity to outliers. The boxplots below show the distribution of each variable.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 10)) 
sns.boxplot(data = cleaned_data)

In [None]:
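def remove_outliers(data):
    """Drop rows that fall outside a percentile-based fence in any column."""
    df = data.copy()

    for col in df.columns:
        # Note: these are the 5th and 95th percentiles, not the usual
        # quartiles (25th/75th), so the fence is much wider than a
        # standard IQR fence and only extreme values are removed.
        low = df[col].quantile(0.05)
        high = df[col].quantile(0.95)
        spread = high - low
        lower_bound = low - 1.5 * spread
        upper_bound = high + 1.5 * spread

        # Keep rows inside the fence; later columns see the filtered frame.
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

    return df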



In [None]:
without_outliers = remove_outliers(cleaned_data)

In [None]:
plt.figure(figsize=(14, 10)) 
sns.boxplot(data = without_outliers)

In [None]:
without_outliers.shape

After filtering, the data has 9,319 rows and 12 columns, which means that 259 observations were treated as outliers and dropped.
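
To make the filtering rule in `remove_outliers` more concrete, here is a minimal sketch on a made-up column. The toy values and the variable names are assumptions for illustration only; they are not taken from the loan data.

```python
import pandas as pd

# Hypothetical toy column: 99 ordinary values plus one extreme value.
values = pd.Series(list(range(1, 100)) + [10_000], name="toy")

# Same rule as in remove_outliers: 5th/95th percentiles with a 1.5x margin.
low = values.quantile(0.05)
high = values.quantile(0.95)
spread = high - low
lower_bound = low - 1.5 * spread
upper_bound = high + 1.5 * spread

# Keep only the values that fall inside the fence.
kept = values[(values >= lower_bound) & (values <= upper_bound)]
print(f"fence: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"rows before: {len(values)}, rows after: {len(kept)}")
```

Only the extreme value falls outside the fence here, which illustrates how conservative a 5th/95th percentile fence is compared with the usual quartile-based one.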

#### Rescale the data
Hierarchical clustering here relies on Euclidean distance, which is dominated by variables with large numeric ranges, so it is wise to rescale all the variables before computing the distances.

In [None]:
from sklearn.preprocessing import StandardScaler

data_scaler = StandardScaler()

scaled_data = data_scaler.fit_transform(without_outliers)
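
The effect of rescaling on Euclidean distance can be sketched with three made-up borrowers. The feature values below are hypothetical and only loosely inspired by an interest rate and an annual income; the helper `dist` is an illustrative name, not part of any library.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical borrowers: [interest rate, annual income in dollars]
toy = np.array([
    [0.08, 50_000.0],
    [0.20, 50_500.0],   # very different rate, similar income
    [0.09, 90_000.0],   # similar rate, very different income
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On the raw data, income dominates the distance almost entirely.
raw_01 = dist(toy[0], toy[1])
raw_02 = dist(toy[0], toy[2])

# After standardization, both features contribute on a comparable scale.
scaled = StandardScaler().fit_transform(toy)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

Before scaling, the large gap in interest rate between borrowers 0 and 1 is practically invisible next to the income difference; after scaling, the two distances are of the same order.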

## Applying the hierarchical clustering algorithm 

### <u> scipy.cluster.hierarchy.linkage + dendrogram </u>



In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram

complete_clustering = linkage(scaled_data, method="complete", metric="euclidean")
average_clustering = linkage(scaled_data, method="average", metric="euclidean")
single_clustering = linkage(scaled_data, method="single", metric="euclidean")

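The optimal number of clusters can be read from the dendrogram: find the tallest vertical line that is not crossed by any horizontal merge line, draw a horizontal cut through it, and count the vertical lines that the cut crosses. That count is the suggested number of clusters.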
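
As a rough numerical counterpart to this visual rule, one can look for the largest gap between consecutive merge distances in the linkage matrix. The sketch below uses made-up 2-D blobs rather than the loan data, and the largest-gap heuristic is an assumption added here for illustration, not the method used in the rest of this notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

# Hypothetical toy data: three well-separated 2-D blobs of 20 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

Z = linkage(toy, method="average", metric="euclidean")

# Column 2 of the linkage matrix holds the distance at which each merge happens.
merge_heights = Z[:, 2]
gaps = np.diff(merge_heights)

# Cutting inside the largest gap keeps every merge up to that gap,
# so the number of remaining clusters is n_samples - (merges performed).
n_samples = toy.shape[0]
largest_gap = int(np.argmax(gaps))
n_clusters = n_samples - (largest_gap + 1)
print("suggested number of clusters:", n_clusters)
```

The largest gap corresponds to the tallest uncrossed vertical stretch in the dendrogram, so both readings should agree on clean data; on real data the dendrogram is still worth inspecting by eye.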

In [None]:
dendrogram(complete_clustering)
plt.show()

For complete linkage, it is the blue line on the right, and it generates three clusters. 

In [None]:
plt.figure(figsize=(20, 8)) 
dendrogram(average_clustering)
plt.show()

For the average linkage, it is the first blue vertical line, and it generates two clusters.

In [None]:
dendrogram(single_clustering,
          truncate_mode='lastp',
          p=999,
          show_leaf_counts=True)
plt.show()

For the single linkage, it is the first vertical line, and it generates only one cluster. 


From the above observations, average linkage seems to provide the best clustering, as opposed to single and complete linkage, which suggest one cluster and three clusters respectively. Two clusters also matches our prior knowledge of the dataset, which contains two types of borrowers: those who fully repaid their loan and those who did not.
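
The visual comparison of linkage methods can be complemented by a numeric score. One possible option, not used elsewhere in this notebook, is the cophenetic correlation coefficient, which measures how faithfully a dendrogram preserves the original pairwise distances. The sketch below runs on made-up data, so the scores say nothing about the loan data itself.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Hypothetical toy data: two loosely separated groups in three dimensions.
toy = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(40, 3)),
    rng.normal(loc=6.0, scale=1.0, size=(40, 3)),
])

# Condensed matrix of the original pairwise Euclidean distances.
pairwise = pdist(toy, metric="euclidean")

scores = {}
for method in ("complete", "average", "single"):
    Z = linkage(toy, method=method, metric="euclidean")
    # cophenet returns the correlation and the cophenetic distances.
    corr, _ = cophenet(Z, pairwise)
    scores[method] = float(corr)

for method, corr in scores.items():
    print(f"{method:>8}: cophenetic correlation = {corr:.3f}")
```

A value closer to 1 means the tree distorts the original distances less. It is a hedged sanity check rather than a replacement for inspecting the dendrograms.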

In [None]:
from scipy.cluster.hierarchy import cut_tree

cluster_labels = cut_tree(average_clustering, n_clusters=2).reshape(-1,)
without_outliers['Cluster'] = cluster_labels
sns.boxplot(x='Cluster', y='fico', data=without_outliers)

### INFERENCE

From the above boxplot, we can observe that: 

Borrowers from cluster 0 have higher credit scores,  
whereas borrowers from cluster 1 have lower credit scores.

### <u>sklearn.cluster.AgglomerativeClustering</u>

In [None]:
without_outliers.columns

In [None]:
from sklearn.cluster import AgglomerativeClustering

clust = AgglomerativeClustering(n_clusters=2, linkage='average')
labels = clust.fit_predict(scaled_data)

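# Store the sklearn labels in a separate column so that the
# scipy-based 'Cluster' column is kept for comparison.
without_outliers['Cluster_sklearn'] = labels

print(without_outliers['Cluster_sklearn'].value_counts())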

In [None]:
print(without_outliers['Cluster'].value_counts())
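
Comparing value counts only shows that the cluster sizes match. To check whether two labelings assign the same points to the same groups, a label-permutation-invariant score such as the adjusted Rand index can be used. The sketch below is run on made-up blobs, not on the loan data, and the variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)

# Hypothetical toy data: two well-separated 2-D blobs of 30 points each.
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=8.0, scale=0.5, size=(30, 2)),
])

# scipy route: build the tree, then cut it into two clusters.
Z = linkage(toy, method="average", metric="euclidean")
scipy_labels = cut_tree(Z, n_clusters=2).ravel()

# sklearn route: average linkage with the default Euclidean distance.
sklearn_labels = AgglomerativeClustering(
    n_clusters=2, linkage="average"
).fit_predict(toy)

# The index is 1.0 when both partitions are identical, even if the
# cluster ids themselves are swapped between the two libraries.
ari = adjusted_rand_score(scipy_labels, sklearn_labels)
print("adjusted Rand index:", ari)
```

On the notebook's own data, the same call could be applied to the two label columns of `without_outliers`; a score well below 1 would indicate that the two implementations split the borrowers differently.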