# **Cluster Analysis with the Best Features**

## Objectives

*   Answer **business requirement 3**: 
    * The client is interested in identifying any hidden patterns in the data to derive conclusions that can benefit the business. 
*   Perform cluster analysis on the dataset to extract a typical default profile.
*   Evaluate the following hypotheses:
    * A default on previous loan(s) makes it unlikely to acquire a new loan.
    * The higher the loan_percent_income, the higher the probability of the loan being rejected.
    * However high a person's income is, it does not impact loan approval if the loan_to_income ratio is below a defined threshold.
    * If the debtor rents a home and has no record of default, loan approval is guaranteed.

## Inputs

* outputs/datasets/collection/row/LoanDefaultDataset.csv
* The best features extracted from **jupyter_notebooks/05-a - Modelling and Evaluaiton - Cluster Analysis.ipynb**.
  * best_features = `['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'previous_loan_defaults_on_file']`.
* The general Cluster pipeline from the **jupyter_notebooks/05-a - Modelling and Evaluaiton - Cluster Analysis.ipynb** with only the best features instead of the entire dataset.

## Outputs

* Cluster Pipeline
* Train Set
* Most important features to define a cluster plot
* Clusters Profile Description
* Cluster Silhouette

---

## **SetUp**

### Imports

In [None]:
import os
import warnings
import pandas as pd
import numpy as np
import plotly.express as px
from matplotlib import rcParams
import matplotlib.pyplot as plt
import joblib
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

### Change working directory

* Change the working directory from its current folder to its parent folder.

In [None]:
current_dir = os.getcwd()
current_dir

* Make the parent of the current directory the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

* Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir
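
Note that re-running the `os.chdir(os.path.dirname(current_dir))` cell moves one level further up each time. As a hedged alternative, the sketch below walks upward until it finds a folder that contains a marker directory. The helper name `find_project_root` and the `outputs` marker are assumptions made for illustration; they are not part of this notebook.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")


# Demo on a throwaway tree: <tmp>/outputs and <tmp>/jupyter_notebooks
with tempfile.TemporaryDirectory() as tmp:
    tmp_root = Path(tmp).resolve()
    (tmp_root / "outputs").mkdir()
    notebooks_dir = tmp_root / "jupyter_notebooks"
    notebooks_dir.mkdir()

    found_root = find_project_root(notebooks_dir)
    root_matches = found_root == tmp_root
    print(root_matches)
```

In a notebook, `os.chdir(find_project_root(os.getcwd()))` would then give the same result however many times the cell is run.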

## **Dataset Loading**

- Load the raw dataset.
- Drop loan_status since it is not considered a feature for this analysis.

In [None]:
df = (pd.read_csv(
    "outputs/datasets/collection/row/LoanDefaultDataset.csv").drop(['loan_status'], axis=1))
print(df.shape)
df.head(3)

## **Cluster Pipeline with Best Features**

- The best features are extracted from the previous notebook.
  - best_features = `['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'previous_loan_defaults_on_file']`.
- Store the best features in a list.
- These features are used to explain each cluster's profile.

In [None]:
best_features_pipeline_all_variables = ['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'previous_loan_defaults_on_file']
best_features_pipeline_all_variables

- Define a dataset with only the best features

In [None]:
df_reduced = df.filter(best_features_pipeline_all_variables)
df_reduced.head(3)

- Define the new Cluster pipeline for the best features analysis.

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['previous_loan_defaults_on_file'])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables = ['loan_amnt','loan_percent_income'])),

        ("scaler", StandardScaler()),

        ("model", KMeans(n_clusters=4, random_state=0)),
        
        ])
    return pipeline_base


PipelineCluster()
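
To illustrate why a Yeo-Johnson step precedes scaling, the sketch below applies the transformation to a right-skewed toy column and compares skewness before and after. It uses scikit-learn's `PowerTransformer` as a stand-in for `feature_engine`'s `YeoJohnsonTransformer`, and the column name `toy_amount` and the synthetic data are assumptions, not values from the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic right-skewed data (assumption: log-normal, not the real loan data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"toy_amount": rng.lognormal(mean=2, sigma=0.6, size=500)})

# Yeo-Johnson reduces skewness; StandardScaler then centers and rescales
yeo = PowerTransformer(method="yeo-johnson", standardize=False)
transformed = yeo.fit_transform(toy)
scaled = StandardScaler().fit_transform(transformed)

skew_before = float(toy["toy_amount"].skew())
skew_after = float(pd.Series(transformed[:, 0]).skew())
print(f"skew before: {skew_before:.2f}, skew after: {skew_after:.2f}")
```

KMeans relies on Euclidean distances, so features on a comparable scale and with less extreme tails tend to contribute more evenly to the clustering.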


## **Apply Elbow Method and Silhouette analysis**

- Fit the cluster pipeline (without the model step) to the reduced dataset. The transformed data is used to analyze the optimum number of clusters.

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_analysis = Pipeline(pipeline_cluster.steps[:-1])
df_analysis = pipeline_analysis.fit_transform(df_reduced)

print(df_analysis.shape,'\n', type(df_analysis))
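
The slicing `pipeline_cluster.steps[:-1]` keeps every step except the final estimator. A minimal sketch of the same idea on synthetic data (the two-step toy pipeline is an assumption for illustration) looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

full = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# Build a preprocessing-only pipeline from all steps but the last one
prep = Pipeline(full.steps[:-1])
X_prep = prep.fit_transform(X_toy)

print(list(prep.named_steps))  # only the preprocessing step names remain
print(X_prep.shape)
```

The sliced pipeline reuses the same transformer objects as the full one, so fitting it also fits those steps in the original pipeline.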

- Apply Elbow Analysis to evaluate the optimum range of cluster numbers to consider for the further analysis.

In [None]:
warnings.filterwarnings('ignore')
rcParams['font.family'] = 'DejaVu Sans'

visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,11))
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()
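
The elbow plot is based on how the within-cluster sum of squares (inertia) falls as the number of clusters grows. The sketch below computes the same quantity directly with scikit-learn on synthetic blobs. The four well-separated centers are an assumption chosen to make the elbow obvious; they do not represent the loan data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four clearly separated groups (illustrative assumption)
X_toy, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Inertia for each candidate number of clusters
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

The elbow is the point after which adding another cluster yields only a small further drop in inertia.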

- Apply Silhouette Analysis to evaluate the quality of clustering over the range of candidate cluster numbers suggested by the Elbow method.

In [None]:
n_cluster_start, n_cluster_stop = 2, 6

print("=== Average Silhouette Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(
    n_cluster_start, n_cluster_stop), metric='silhouette')
visualizer.fit(df_analysis)
visualizer.show()
plt.show()
print("\n")


for n_clusters in np.arange(start=n_cluster_start, stop=n_cluster_stop):

    print(f"=== Silhouette plot for {n_clusters} Clusters ===")
    visualizer = SilhouetteVisualizer(estimator=KMeans(n_clusters=n_clusters, random_state=0),
                                      colors='yellowbrick')
    visualizer.fit(df_analysis)
    visualizer.show()
    plt.show()
    print("\n")
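
The average silhouette score shown by the visualizer can also be computed numerically. This is a minimal sketch using `sklearn.metrics.silhouette_score` on synthetic blobs; the toy data and the candidate range 2 to 5 are assumptions that mirror `n_cluster_start` and `n_cluster_stop` above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four clearly separated groups (illustrative assumption)
X_toy, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Average silhouette score per candidate number of clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping clusters.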


> Result:

- Four clusters appear to perform well against the other options.

## **Clusters Distribution**

- Create a copy of the reduced dataset to analyze clusters distribution.

In [None]:
X = df_reduced.copy()
print(X.shape)
X.head(3)

- Fit Cluster pipeline (including the model) on the dataset.

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

- Add the cluster predictions to the newly created dataset.

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_
print(X.shape)
X.head(3)

- Evaluate the distribution of the four cluster labels in the reduced dataset.

In [None]:
print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar')
plt.show()

> Result:

- The four clusters are relatively balanced in size.

## **Cluster Pipeline Performance**

- Create a copy of the reduced dataset, including the cluster labels, to analyze the cluster pipeline performance.

In [None]:
df_clf = X.copy()
print(df_clf.shape)
df_clf.head(3)

- Split Train and Test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df_clf.drop(['Clusters'], axis=1),
    df_clf['Clusters'],
    test_size=0.2,
    random_state=0
)

print(X_train.shape, X_test.shape)


- Define a classifier pipeline that uses the RandomForestClassifier algorithm as the model, with the configuration extracted from Notebook 04. The classifier learns to predict the cluster labels, which helps explain the clusters.

In [None]:
def PipelineClf2ExplainClusters():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['previous_loan_defaults_on_file'])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables = ['loan_amnt','loan_percent_income'])),

        ("scaler", StandardScaler()),

        ("model", RandomForestClassifier(
            random_state=0,
            max_depth=None,
            max_leaf_nodes= None,
            min_samples_leaf= 1,
            min_samples_split= 2,
            n_estimators= 50)),
        ])
    return pipeline_base


PipelineClf2ExplainClusters()

- Fit the classifier pipeline.

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster.fit(X_train,y_train)

- Evaluate the performance for both the Train and the Test sets.

In [None]:
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

In [None]:
print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))
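
The idea in this section is to treat the cluster labels as a target and check how well a classifier can reproduce them from the features. A minimal sketch of that workflow on synthetic data (the blobs and the variable names are assumptions, not the loan dataset) is:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with four clearly separated groups (illustrative assumption)
X_toy, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Step 1: unsupervised labels from KMeans
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_toy)

# Step 2: supervised model that tries to reproduce those labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, cluster_labels, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

test_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"test accuracy: {test_accuracy:.3f}")
```

High agreement on the held-out split suggests the clusters are separable in feature space, so the classifier's feature importances are a reasonable way to describe what distinguishes them.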

## **Important Features Assessment**

In [None]:
# since we don't have feature selection step in this pipeline, best_features is Xtrain columns
best_features = X_train.columns.to_list()

# create a DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': best_features,
    'Importance': pipeline_clf_cluster['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()
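
As a sanity check on how `feature_importances_` behaves, the sketch below trains a random forest on synthetic data in which only one column determines the label. The column names `signal`, `noise_a`, and `noise_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"]
)
# The label depends on the 'signal' column only
toy_labels = (toy["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy, toy_labels)

importance = (
    pd.DataFrame({
        "Feature": toy.columns,
        "Importance": forest.feature_importances_,
    })
    .sort_values(by="Importance", ascending=False)
)
print(importance)
```

The importances are normalized to sum to 1, so each value can be read as a relative share rather than an absolute effect size.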


## **Cluster Analysis**

- Create a dataset that combines both the best features and the cluster predictions.


In [None]:
df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=best_features + ['Clusters'], axis=1)
df_cluster_profile.head(3)

- Analyze loan_status levels

In [None]:
df_loan_status = pd.read_csv("outputs/datasets/collection/row/LoanDefaultDataset.csv").filter(['loan_status'])
df_loan_status['loan_status'] = df_loan_status['loan_status'].astype('object')
df_loan_status.head(3)

- Load function that plots a table with description for all Clusters.

In [None]:
def DescriptionAllClusters(df, decimal_points=3):

    DescriptionAllClusters = pd.DataFrame(
        columns=df.drop(['Clusters'], axis=1).columns)
    # iterate on each cluster , calls Clusters_IndividualDescription()
    for cluster in df.sort_values(by='Clusters')['Clusters'].unique():

        EDA_ClusterSubset = df.query(
            f"Clusters == {cluster}").drop(['Clusters'], axis=1)
        ClusterDescription = Clusters_IndividualDescription(
            EDA_ClusterSubset, cluster, decimal_points)
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        DescriptionAllClusters = pd.concat(
            [DescriptionAllClusters, ClusterDescription])

    DescriptionAllClusters.set_index(['Cluster'], inplace=True)
    return DescriptionAllClusters


def Clusters_IndividualDescription(EDA_Cluster, cluster, decimal_points):

    ClustersDescription = pd.DataFrame(columns=EDA_Cluster.columns)
    # for a given cluster, iterate over all columns
    # if the variable is numerical, calculate the IQR: display as Q1 -- Q3.
    # That will show the range for the most common values for the numerical variable
    # if the variable is categorical, count the frequencies and displays the top 3 most frequent
    # That will show the most common levels for the category

    for col in EDA_Cluster.columns:

        try:  # eventually a given cluster will have only missing data for a given variable

            if EDA_Cluster[col].dtypes == 'object':

                top_frequencies = EDA_Cluster.dropna(
                    subset=[col])[[col]].value_counts(normalize=True).nlargest(n=3)
                Description = ''

                for x in range(len(top_frequencies)):
                    freq = top_frequencies.iloc[x]
                    category = top_frequencies.index[x][0]
                    CategoryPercentage = int(round(freq*100, 0))
                    statement = f"'{category}': {CategoryPercentage}% , "
                    Description = Description + statement

                ClustersDescription.at[0, col] = Description[:-2]

            elif EDA_Cluster[col].dtypes in ['float', 'int']:
                DescStats = EDA_Cluster.dropna(subset=[col])[[col]].describe()
                Q1 = round(DescStats.iloc[4, 0], decimal_points)
                Q3 = round(DescStats.iloc[6, 0], decimal_points)
                Description = f"{Q1} -- {Q3}"
                ClustersDescription.at[0, col] = Description

        except Exception as e:
            ClustersDescription.at[0, col] = 'Not available'
            print(
                f"** Error Exception: {e} - cluster {cluster}, variable {col}")

    ClustersDescription['Cluster'] = str(cluster)

    return ClustersDescription
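
For numerical variables, the functions above summarize each cluster by its interquartile range, shown as `Q1 -- Q3`. A compact sketch of that summary with `groupby` on a toy frame (the helper `iqr_label` and the column `toy_value` are hypothetical) is:

```python
import pandas as pd


def iqr_label(series, decimal_points=2):
    """Format the interquartile range of a numeric series as 'Q1 -- Q3'."""
    q1 = round(float(series.quantile(0.25)), decimal_points)
    q3 = round(float(series.quantile(0.75)), decimal_points)
    return f"{q1} -- {q3}"


toy = pd.DataFrame({
    "Clusters": [0, 0, 0, 0, 1, 1, 1, 1],
    "toy_value": [1, 2, 3, 4, 10, 20, 30, 40],
})

# One IQR label per cluster
profile = toy.groupby("Clusters")["toy_value"].apply(iqr_label)
print(profile)
```

The interquartile range covers the middle 50% of values in each cluster, which makes it less sensitive to outliers than a min-max range.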

- Cluster profile on most important features

In [None]:
pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(df= pd.concat([df_cluster_profile,df_loan_status], axis=1), decimal_points=4)
clusters_profile

- Load a custom function to plot cluster distribution per Variable (absolute and relative levels).

In [None]:
def cluster_distribution_per_variable(df, target):
    """
    The data should have 2 variables, the cluster predictions and
    the variable you want to analyze with, in this case we call "target".
    We use plotly express to create 2 plots:
    Cluster distribution across the target.
    Relative presence of the target level in each cluster.
    """
    df_bar_plot = df.value_counts(["Clusters", target]).reset_index()
    df_bar_plot.columns = ['Clusters', target, 'Count']
    df_bar_plot[target] = df_bar_plot[target].astype('object')

    print(f"Clusters distribution across {target} levels")
    fig = px.bar(df_bar_plot, x='Clusters', y='Count',
                 color=target, width=800, height=500)
    fig.update_layout(xaxis=dict(tickmode='array',
                      tickvals=df['Clusters'].unique()))
    fig.show(renderer='jupyterlab')

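    # group_keys=False keeps the original (Clusters, target) index, so
    # reset_index() yields exactly three columns on recent pandas versions
    df_relative = (df
                   .groupby(["Clusters", target])
                   .size()
                   .groupby(level=0, group_keys=False)
                   .apply(lambda x:  100*x / x.sum())
                   .reset_index()
                   .sort_values(by=['Clusters'])
                   )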
    df_relative.columns = ['Clusters', target, 'Relative Percentage (%)']

    print(f"Relative Percentage (%) of {target} in each cluster")
    fig = px.line(df_relative, x='Clusters', y='Relative Percentage (%)',
                  color=target, width=800, height=500)
    fig.update_layout(xaxis=dict(tickmode='array',
                      tickvals=df['Clusters'].unique()))
    fig.update_traces(mode='markers+lines')
    fig.show(renderer='jupyterlab')
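
The relative percentage computed inside the function is the share of each target level within a cluster. The same table can be sketched with `pd.crosstab`. The toy frame below is an assumption for illustration and does not reflect the real cluster sizes.

```python
import pandas as pd

toy = pd.DataFrame({
    "Clusters": [0, 0, 0, 1, 1],
    "loan_status": [1, 1, 0, 0, 0],
})

# normalize='index' makes each row (cluster) sum to 1; scale to percentages
relative = pd.crosstab(
    toy["Clusters"], toy["loan_status"], normalize="index"
) * 100
print(relative.round(2))
```

Each row sums to 100, so clusters of different sizes can be compared directly.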

> Result:

* Cluster 0: No previous loan default with a maximum loan-to-income ratio of 11% makes a default likely (69%).
* Cluster 1: A previous loan default leads to a 100% default prediction, irrespective of the other features.
* Cluster 2: A previous loan default leads to a 100% default prediction, irrespective of the other features.
* Cluster 3: No previous loan default with a loan-to-income ratio between 16% and 28% makes a default unlikely (58% chance of no default).

- Clusters distribution across loan_status levels & Relative Percentage of loan_status in each cluster

In [None]:
df_cluster_vs_loan_status=  df_loan_status.copy()
df_cluster_vs_loan_status['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_loan_status, target='loan_status')

## **Summary**

The following is the summary of this notebook.

* Fitting the cluster pipeline to the best features only, as opposed to fitting the same pipeline to all the features, reveals clear patterns.
* The interpretation of Clusters 1 and 2 shows that a default on previous loan(s) is a **100%** default predictor. In that case, no other feature is relevant for predicting a default event.
* Cluster 0 has a typical profile that **must** have no defaults on any previous loans and a loan-to-income ratio of less than 11%, with a risk of default of **69%**.
* Cluster 3 has a typical profile that **must** have no defaults on any previous loans and a loan-to-income ratio between 16% and 28%, with a risk of default of **42%**.
* Clusters 0 and 3 have a **counterintuitive** interpretation: the cluster with the higher loan-to-income ratio has the lower risk of default. Intuitively, the opposite should be true, i.e. a lower loan-to-income ratio should lead to a lower risk of default. The customer should provide additional variables, for example the length of loan maturity or pledged collateral, to assess the truth of this counterintuitive outcome. One possible explanation is that the loan maturity period is so short that a higher loan-to-income ratio does not put the loan at significant risk.

## **Push files to Repo**

Generate the following files:

* Cluster Pipeline
* Train Set
* Feature importance plot
* Clusters Description
* Cluster Silhouette

- Create an output folder to save the output files of this notebook.

In [None]:
version = 'v1'
file_path = f'outputs/ml_pipeline/cluster_analysis/{version}'

try:
    os.makedirs(name=file_path)
except Exception as e:
    print(e)

- Display the Cluster pipeline.

In [None]:
pipeline_cluster

- Save the cluster pipeline in the output folder.

In [None]:
joblib.dump(value=pipeline_cluster, filename=f"{file_path}/cluster_pipeline.pkl")

- Evaluate the cluster pipeline file size before git commit and push.

In [None]:
clf_pipeline_model_file_path = "outputs/ml_pipeline/cluster_analysis/v1/cluster_pipeline.pkl"
file_size_bytes = os.path.getsize(clf_pipeline_model_file_path)
print("File size in bytes:", file_size_bytes)

file_size_mb = file_size_bytes/(1024 * 1024)
print("File size in MB:", round(file_size_mb, 2))
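
Before committing a saved pipeline, it can be worth confirming that the file loads back and behaves identically. The sketch below does a round trip in a temporary folder with a toy pipeline. The file name `toy_cluster_pipeline.pkl` and the synthetic data are assumptions; the notebook's own pipeline and path are not touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=3, n_init=10, random_state=0)),
]).fit(X_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_cluster_pipeline.pkl")
    joblib.dump(value=toy_pipeline, filename=toy_path)

    # File size check, as in the cell above
    size_mb = os.path.getsize(toy_path) / (1024 * 1024)
    reloaded = joblib.load(toy_path)

# The reloaded pipeline should assign the same cluster labels
same_labels = np.array_equal(toy_pipeline.predict(X_toy), reloaded.predict(X_toy))
print(f"size: {size_mb:.4f} MB, identical predictions: {same_labels}")
```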

- Display the Train Set.

In [None]:
print(df_reduced.shape)
df_reduced.head(3)

- Save the train set.

In [None]:
df_reduced.to_csv(f"{file_path}/TrainSet.csv", index=False)

## Most important features plot

- Display the important features.

In [None]:
df_feature_importance.plot(kind='bar',x='Feature',y='Importance', figsize=(8,4))
plt.show()

- Save the important features in the output folder.

In [None]:
df_feature_importance.plot(kind='bar',x='Feature',y='Importance', figsize=(8,4))
plt.savefig(f"{file_path}/features_define_cluster.png", bbox_inches='tight', dpi=150)

- Display the cluster profile

In [None]:
clusters_profile

- Save the cluster profile in the output folder.

In [None]:
clusters_profile.to_csv(f"{file_path}/clusters_profile.csv")

- Display cluster silhouette plot.

In [None]:
visualizer = SilhouetteVisualizer(pipeline_cluster['model'], colors='yellowbrick')
visualizer.fit(df_analysis)
visualizer.show()
plt.show()

- Save cluster silhouette plot.

In [None]:
fig, axes = plt.subplots(figsize=(7,5))
# use a separate name for the visualizer so the Figure object is not overwritten
visualizer = SilhouetteVisualizer(pipeline_cluster['model'], colors='yellowbrick', ax=axes)
visualizer.fit(df_analysis)

plt.savefig(f"{file_path}/clusters_silhouette.png", bbox_inches='tight',dpi=150)