# **Cluster Analysis with All Features**

## Objectives

*   Answer **business requirement 3**: 
    * The client is interested in identifying any hidden patterns in the data to derive conclusions that can benefit the business. 
*   Perform cluster analysis on the dataset to extract a typical default profile.
*   Evaluate the following hypotheses:
    * A default on previous loan(s) makes it unlikely that the applicant acquires a new loan.
    * The higher the loan_percent_income, the higher the probability that the loan is rejected.
    * A person's income, however high, does not affect loan approval if the loan_to_income ratio is below a defined threshold.
    * If the debtor rents their home and has no record of default, loan approval is guaranteed.

## Inputs

* outputs/datasets/collection/row/LoanDefaultDataset.csv
* Instructions on which variables to use for data cleaning and feature engineering. Those instructions are found in FeatureEngineering Notebook.

## Outputs

* TBD

---

## **SetUp**

### Main Imports

In [1]:
# Standard Library
import os
import warnings
# General Utilities
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
import plotly.express as px
from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.cluster import SilhouetteVisualizer
# ML and Data processing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine import transformation as vt

### Change working directory

* Change the working directory from its current folder to its parent folder.

In [None]:
current_dir = os.getcwd()
current_dir

* Make the parent of the current directory the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

* Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir

## **Dataset Loading**

- Load the raw dataset.
- Drop loan_status since it is not considered a feature for this analysis.

In [None]:
df = (pd.read_csv(
    "outputs/datasets/collection/row/LoanDefaultDataset.csv").drop(['loan_status'], axis=1))
print(df.shape)
df.head(3)

## **ML Cluster Pipeline**

- Define the cluster pipeline.

In [6]:
def PipelineCluster():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=[
                                                         'person_gender',
                                                         'person_education',
                                                         'person_home_ownership',
                                                         'loan_intent',
                                                         'previous_loan_defaults_on_file',])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables=[
            'person_income',
            'loan_amnt',
            'loan_percent_income',
            'credit_score',])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.7, selection_method="variance")), # to be dropped = ['person_age', 'cb_person_cred_hist_length'].

        ("scaler", StandardScaler()),

        ("PCA", PCA(n_components=50, random_state=0)),

        ("model", KMeans(n_clusters=50, random_state=0)),


    ])
    return pipeline_base

- Apply the cluster pipeline without its last two steps, i.e. the PCA and the model.
- The output dataset is then analyzed with PCA to choose the number of components.

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_pca = Pipeline(pipeline_cluster.steps[:-2])
df_pca = pipeline_pca.fit_transform(df)

print(df_pca.shape,'\n', type(df_pca))
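The slicing above works because `Pipeline.steps` is a plain list of `(name, estimator)` tuples, so a shorter pipeline can be built from any prefix of it. A minimal sketch on synthetic data, assuming a hypothetical three-step pipeline (`toy`, `full` and `preprocessing_only` are illustrative names, not objects from this notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (hypothetical values).
rng = np.random.default_rng(0)
toy = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

full = Pipeline([
    ("scaler", StandardScaler()),
    ("PCA", PCA(n_components=2, random_state=0)),
    ("model", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# Keep every step except the last two (PCA and model).
preprocessing_only = Pipeline(full.steps[:-2])
scaled = preprocessing_only.fit_transform(toy)

print([name for name, _ in preprocessing_only.steps])
print(scaled.shape)
```

Only the scaler is fitted here; the PCA and KMeans steps of `full` are never touched, which is why placeholder values in those two steps do no harm at this stage.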

## **Principal Component Analysis (PCA)**

- Perform a PCA analysis on the df_pca dataset to set the optimal number of components for cluster analysis.
- This is done by applying the PCA separately to the scaled data, i.e. the output of the pipeline in the step above.
- Since the 'loan_status' is dropped while loading the dataset and 'person_age', 'cb_person_cred_hist_length' are dropped by the SmartCorrelatedSelection, the dataset is left with 11 variables (features).
- All 11 variables are used for the PCA analysis, i.e. `n_components = 11`.

In [None]:
sns.set_style("whitegrid")

n_components = 11


def pca_components_analysis(df_pca, n_components):
    pca = PCA(n_components=n_components).fit(df_pca)
    x_PCA = pca.transform(df_pca)  # array with transformed PCA

    ComponentsList = ["Component " + str(number)
                      for number in range(n_components)]
    dfExplVarRatio = pd.DataFrame(
        data=np.round(100 * pca.explained_variance_ratio_, 3),
        index=ComponentsList,
        columns=['Explained Variance Ratio (%)'])

    dfExplVarRatio['Accumulated Variance'] = dfExplVarRatio['Explained Variance Ratio (%)'].cumsum(
    )

    PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum(
    )

    print(
        f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
    plt.figure(figsize=(9, 6))
    sns.lineplot(data=dfExplVarRatio,  marker="o")
    plt.xticks(rotation=90)
    plt.yticks(np.arange(0, 110, 10))
    plt.show()


pca_components_analysis(df_pca=df_pca, n_components=n_components)
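As a cross-check of what the explained variance ratio and its accumulated sum represent, the same quantities can be derived from the singular values of standardized data. The sketch below uses NumPy on synthetic data; the matrix `X_toy` and the 70% threshold `target` are assumptions for illustration, not values taken from this analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the scaled feature matrix (200 rows, 5 features).
X_toy = rng.normal(size=(200, 5))
X_toy[:, 1] += 0.8 * X_toy[:, 0]  # add correlation so the components differ
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)  # standardize

# Squared singular values of centered data are proportional to the
# variance captured by each principal component (largest first).
singular_values = np.linalg.svd(X_toy, compute_uv=False)
explained_variance = singular_values ** 2 / (X_toy.shape[0] - 1)
ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose accumulated ratio reaches the target.
target = 0.70  # assumed threshold, for illustration only
n_keep = int(np.searchsorted(cumulative, target) + 1)

print(np.round(100 * ratio, 2))
print(np.round(100 * cumulative, 2))
print(n_keep)
```

Because the ratios are sorted in descending order, the first m of p components always explain at least m/p of the total variance, which is a quick sanity check on any reported accumulated figure.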

- Consider PCA with only 6 components.

In [None]:
pca_components_analysis(df_pca=df_pca, n_components=6)

> Result:

* The 6 components explain 70.91% of the data.

- Update the cluster pipeline with 6 components.

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=[
                                                         'person_gender', 'person_education', 'person_home_ownership',
                                                         'loan_intent', 'previous_loan_defaults_on_file'])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables = ['person_income','loan_amnt',
                                                                        'loan_percent_income','credit_score'])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.7, selection_method="variance")),

        ("scaler", StandardScaler()),

        ("PCA", PCA(n_components=6, random_state=0)),

        ("model", KMeans(n_clusters=50, random_state=0)),


    ])
    return pipeline_base


PipelineCluster()

## **Elbow Method and Silhouette Score**

- Apply Elbow Method and Silhouette Score to identify the optimum number of clusters.
- Apply the cluster pipeline up to the PCA step.

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_analysis = Pipeline(pipeline_cluster.steps[:-1])
df_analysis = pipeline_analysis.fit_transform(df)

print(df_analysis.shape,'\n', type(df_analysis))

- Limit the range of the Elbow method to 10 clusters.

In [None]:
rcParams['font.family'] = 'DejaVu Sans'
warnings.filterwarnings('ignore')

visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,11)) # 11 is not inclusive, it will plot until 10
visualizer.fit(df_analysis) 
visualizer.show() 
plt.show()

> Result:

- The plot suggests 4 clusters; the sharpest drop in the distortion score appears between 2 and 5 clusters.
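The elbow idea can be illustrated with scikit-learn's `KMeans.inertia_` (the sum of squared distances from each point to its closest centre), used here as a stand-in for the distortion score. This is a rough sketch on three synthetic blobs; `points`, `centers` and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 50 points each (hypothetical data).
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit KMeans for k = 1..6 and record the inertia of each fit.
inertia = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertia[k] = model.inertia_

# Improvement gained by adding one more cluster.
drops = {k: inertia[k - 1] - inertia[k] for k in range(2, 7)}

for k in sorted(drops):
    print(k, round(inertia[k], 1), round(drops[k], 1))
```

The improvement collapses once k exceeds the true number of blobs; that flattening is the "elbow". On real data the bend is usually less sharp, which is why a second criterion such as the silhouette score is useful.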

- Apply Silhouette analysis to span 2, 3 and 4 cluster groups.

In [None]:
rcParams['font.family'] = 'DejaVu Sans'  # use the default font to avoid missing-font warnings
warnings.filterwarnings('ignore')

# 5 is not inclusive, it will stop at 4
n_cluster_start, n_cluster_stop = 2, 5

print("=== Average Silhouette Score for different number of clusters ===")
visualizer = KElbowVisualizer(KMeans(random_state=0), k=(
    n_cluster_start, n_cluster_stop), metric='silhouette')
visualizer.fit(df_analysis)
visualizer.show()
plt.show()
print("\n")


for n_clusters in np.arange(start=n_cluster_start, stop=n_cluster_stop):

    print(f"=== Silhouette plot for {n_clusters} Clusters ===")
    visualizer = SilhouetteVisualizer(estimator=KMeans(n_clusters=n_clusters, random_state=0),
                                      colors='yellowbrick')
    visualizer.fit(df_analysis)
    visualizer.show()
    plt.show()
    print("\n")
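For reference, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below computes the average score directly with NumPy on synthetic data; `mean_silhouette` is a hypothetical helper written for illustration, and in practice `sklearn.metrics.silhouette_score` computes the same average with Euclidean distance by default.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Average silhouette coefficient using Euclidean distances."""
    # Full pairwise distance matrix, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    unique = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in unique if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


rng = np.random.default_rng(0)
# Two synthetic blobs of 30 points each (hypothetical data).
blob_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
blob_b = rng.normal(loc=6.0, scale=0.5, size=(30, 2))
X_toy = np.vstack([blob_a, blob_b])

labels_2 = np.array([0] * 30 + [1] * 30)  # matches the two blobs
labels_3 = labels_2.copy()
labels_3[:15] = 2                         # arbitrary split of the first blob

score_2 = mean_silhouette(X_toy, labels_2)
score_3 = mean_silhouette(X_toy, labels_3)
print(round(score_2, 3), round(score_3, 3))
```

Splitting a compact blob into two arbitrary groups lowers the average score, which is the pattern to look for when comparing candidate cluster counts.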


- The best Silhouette Coefficient value is obtained with 2 clusters.
- Therefore, 2 clusters are implemented for this analysis.

In [None]:
def PipelineCluster():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=[
                                                         'person_gender', 'person_education', 'person_home_ownership',
                                                         'loan_intent', 'previous_loan_defaults_on_file'])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables = ['person_income','loan_amnt',
                                                                        'loan_percent_income','credit_score'])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.7, selection_method="variance")),

        ("scaler", StandardScaler()),

        ("PCA", PCA(n_components=6, random_state=0)),

        ("model", KMeans(n_clusters=2, random_state=0)),


    ])
    return pipeline_base


PipelineCluster()

- Display data for training cluster pipeline.

In [None]:
X = df.copy()
print(X.shape)
X.head(3)

- Fit the dataset into the cluster pipeline.

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

## **Add cluster predictions to dataset**

- Add "`Clusters`" as a new column to the dataset.

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_
print(X.shape)
X.head(3)

- Display cluster's distribution.

In [None]:
warnings.filterwarnings('ignore')
print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar')
plt.show()

- The cluster distribution is almost balanced: Cluster 0 accounts for 51% of the records and Cluster 1 for 49%.

- Visualize PCA components on the clusters.

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df_analysis[:, 0], y=df_analysis[:, 1],
                hue=X['Clusters'], palette='Set1', alpha=0.6)
plt.scatter(x=pipeline_cluster['model'].cluster_centers_[:, 0], y=pipeline_cluster['model'].cluster_centers_[:, 1],
            marker="x", s=169, linewidths=3, color="black")
plt.xlabel("PCA Component 0")
plt.ylabel("PCA Component 1")
plt.title("PCA Components colored by Clusters")
plt.show()


- Save cluster predictions from this cluster pipeline that considers all variables.

In [None]:
cluster_predictions_with_all_variables = X['Clusters']
cluster_predictions_with_all_variables

## **Fit a classifier for cluster predictions**

- Copy `X` to a DataFrame `df_clf`.

In [None]:
df_clf = X.copy()
print(df_clf.shape)
df_clf.head(3)

## **Split Train and Test sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df_clf.drop(['Clusters'], axis=1),
    df_clf['Clusters'],
    test_size=0.2,
    random_state=0
)

print(X_train.shape, X_test.shape)

## **Classifier Pipeline**

- Create a classifier pipeline with **RandomForestClassifier**, configured as `RandomForestClassifier(random_state=0, max_depth=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, n_estimators=140)`.

In [None]:
def PipelineClf2ExplainClusters():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=[
                                                         'person_gender', 'person_education', 'person_home_ownership',
                                                         'loan_intent', 'previous_loan_defaults_on_file'])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables = ['person_income','loan_amnt',
                                                                        'loan_percent_income','credit_score'])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.7, selection_method="variance")),

        ("scaler", StandardScaler()),

        ("feat_selection", SelectFromModel(RandomForestClassifier(
            random_state=0,
            max_depth=None,
            max_leaf_nodes= None,
            min_samples_leaf= 1,
            min_samples_split= 2,
            n_estimators= 140))),

        ("model", RandomForestClassifier(
            random_state=0,
            max_depth=None,
            max_leaf_nodes= None,
            min_samples_leaf= 1,
            min_samples_split= 2,
            n_estimators= 140)),
        ])
    return pipeline_base


PipelineClf2ExplainClusters()


- Fit the classifier to the training data.

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster.fit(X_train, y_train)

## **Evaluate classifier performance**

- Train Set.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

- Test Set.

In [None]:
print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))

## **Important Features**

- Assess the most important features that define a cluster.

In [None]:
# after data cleaning and feature engineering, the feature space changes

# how many data cleaning and feature engineering steps does your pipeline have?
data_cleaning_feat_eng_steps = 3
columns_after_data_cleaning_feat_eng = (Pipeline(pipeline_clf_cluster.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support(
)].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support()],
    'Importance': pipeline_clf_cluster['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# reassign best features in importance order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features} \n")
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()


- best_features = `['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'previous_loan_defaults_on_file']`.
- Store the best_features.
- These features are used to explain each cluster's profile.

In [None]:
best_features_pipeline_all_variables = ['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'previous_loan_defaults_on_file']
best_features_pipeline_all_variables

## **Cluster Analysis**

- Create a DataFrame that contains best features and Clusters Predictions.

In [None]:
df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=['loan_percent_income', 'loan_int_rate', 'loan_amnt', 'previous_loan_defaults_on_file']  + ['Clusters'], axis=1)
print(df_cluster_profile.shape)
df_cluster_profile.head(3)

- Analyze `loan_status` levels.

In [None]:
df_loan_status = pd.read_csv("outputs/datasets/collection/row/LoanDefaultDataset.csv").filter(['loan_status'])
df_loan_status['loan_status'] = df_loan_status['loan_status'].astype('object')
df_loan_status.head(3)

- Load function that plots a table with description for all Clusters

In [31]:
def DescriptionAllClusters(df, decimal_points=3):

    DescriptionAllClusters = pd.DataFrame(
        columns=df.drop(['Clusters'], axis=1).columns)
    # iterate on each cluster , calls Clusters_IndividualDescription()
    for cluster in df.sort_values(by='Clusters')['Clusters'].unique():

        EDA_ClusterSubset = df.query(
            f"Clusters == {cluster}").drop(['Clusters'], axis=1)
        ClusterDescription = Clusters_IndividualDescription(
            EDA_ClusterSubset, cluster, decimal_points)
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        DescriptionAllClusters = pd.concat(
            [DescriptionAllClusters, ClusterDescription])

    DescriptionAllClusters.set_index(['Cluster'], inplace=True)
    return DescriptionAllClusters


def Clusters_IndividualDescription(EDA_Cluster, cluster, decimal_points):

    ClustersDescription = pd.DataFrame(columns=EDA_Cluster.columns)
    # for a given cluster, iterate over all columns
    # if the variable is numerical, calculate the IQR: display as Q1 -- Q3.
    # That will show the range for the most common values for the numerical variable
    # if the variable is categorical, count the frequencies and displays the top 3 most frequent
    # That will show the most common levels for the category

    for col in EDA_Cluster.columns:

        try:  # eventually a given cluster will have only missing data for a given variable

            if EDA_Cluster[col].dtypes == 'object':

                top_frequencies = EDA_Cluster.dropna(
                    subset=[col])[[col]].value_counts(normalize=True).nlargest(n=3)
                Description = ''

                for x in range(len(top_frequencies)):
                    freq = top_frequencies.iloc[x]
                    category = top_frequencies.index[x][0]
                    CategoryPercentage = int(round(freq*100, 0))
                    statement = f"'{category}': {CategoryPercentage}% , "
                    Description = Description + statement

                ClustersDescription.at[0, col] = Description[:-2]

            elif EDA_Cluster[col].dtypes in ['float', 'int']:
                DescStats = EDA_Cluster.dropna(subset=[col])[[col]].describe()
                Q1 = round(DescStats.iloc[4, 0], decimal_points)
                Q3 = round(DescStats.iloc[6, 0], decimal_points)
                Description = f"{Q1} -- {Q3}"
                ClustersDescription.at[0, col] = Description

        except Exception as e:
            ClustersDescription.at[0, col] = 'Not available'
            print(
                f"** Error Exception: {e} - cluster {cluster}, variable {col}")

    ClustersDescription['Cluster'] = str(cluster)

    return ClustersDescription
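To make the format of the profile table concrete, here is a small worked example of the two summaries the functions above produce: the interquartile range (Q1 -- Q3) for a numerical variable and the most frequent levels for a categorical one. The toy values and the helper names `iqr_label` and `top_levels` are assumptions for illustration only.

```python
import pandas as pd

# Toy data with two clusters (hypothetical values).
toy = pd.DataFrame({
    "Clusters": [0] * 5 + [1] * 5,
    "loan_amnt": [1000, 2000, 3000, 4000, 5000,
                  10000, 20000, 30000, 40000, 50000],
    "previous_loan_defaults_on_file": ["No", "No", "No", "Yes", "Yes",
                                       "Yes", "Yes", "Yes", "Yes", "No"],
})


def iqr_label(series, decimals=1):
    """Describe a numerical variable by its interquartile range."""
    q1, q3 = series.quantile([0.25, 0.75])
    return f"{round(q1, decimals)} -- {round(q3, decimals)}"


def top_levels(series, n=3):
    """Describe a categorical variable by its n most frequent levels."""
    freq = series.value_counts(normalize=True).head(n)
    return ", ".join(f"'{level}': {int(round(100 * share))}%"
                     for level, share in freq.items())


# One row per cluster, one description per variable.
rows = {}
for cluster, group in toy.groupby("Clusters"):
    rows[cluster] = {
        "loan_amnt": iqr_label(group["loan_amnt"]),
        "previous_loan_defaults_on_file": top_levels(
            group["previous_loan_defaults_on_file"]),
    }
profile = pd.DataFrame.from_dict(rows, orient="index")
print(profile)
```

Reading the table row by row gives a compact profile per cluster: the range in which the middle half of the numerical values fall, and the dominant categories with their shares.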

- Display a table to describe cluster profile based on the best features.

In [None]:
pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(df=pd.concat([df_cluster_profile,df_loan_status], axis=1))
clusters_profile

> **Result**:
* The result does not provide a clear profile description.
* In general terms, the following pattern is extracted:
  * The debtor is very likely to default (92% of the time) if the loan-to-income ratio is low, the interest rate is low (below 11%), the loan amount is below 8000 and the debtor has previously defaulted on a loan.
  * The debtor is still somewhat likely to default (62% of the time) if the loan-to-income ratio is high (above 15%), the interest rate is high (above 11%), the loan amount is above 8000 and the debtor has previously defaulted on a loan.

* **Summary**
  * It is difficult to extract a clear distinction from those patterns that can help the business.
  * Additional Analysis hence is required.

- Load a custom function to plot cluster distribution per Variable (absolute and relative levels)

In [33]:
def cluster_distribution_per_variable(df, target):
    """
    The data should have 2 variables, the cluster predictions and
    the variable you want to analyze with, in this case we call "target".
    We use plotly express to create 2 plots:
    Cluster distribution across the target.
    Relative presence of the target level in each cluster.
    """
    df_bar_plot = df.value_counts(["Clusters", target]).reset_index()
    df_bar_plot.columns = ['Clusters', target, 'Count']
    df_bar_plot[target] = df_bar_plot[target].astype('object')

    print(f"Clusters distribution across {target} levels")
    fig = px.bar(df_bar_plot, x='Clusters', y='Count',
                 color=target, width=800, height=500)
    fig.update_layout(xaxis=dict(tickmode='array',
                      tickvals=df['Clusters'].unique()))
    fig.show(renderer='jupyterlab')

    df_relative = (df
                   .groupby(["Clusters", target])
                   .size()
                   # group_keys=False keeps the (Clusters, target) index unchanged on pandas >= 2.0
                   .groupby(level=0, group_keys=False)
                   .apply(lambda x:  100*x / x.sum())
                   .reset_index()
                   .sort_values(by=['Clusters'])
                   )
    df_relative.columns = ['Clusters', target, 'Relative Percentage (%)']

    print(f"Relative Percentage (%) of {target} in each cluster")
    fig = px.line(df_relative, x='Clusters', y='Relative Percentage (%)',
                  color=target, width=800, height=500)
    fig.update_layout(xaxis=dict(tickmode='array',
                      tickvals=df['Clusters'].unique()))
    fig.update_traces(mode='markers+lines')
    fig.show(renderer='jupyterlab')

- Clusters distribution across loan_status levels & Relative Percentage of loan_status in each cluster

In [None]:
df_cluster_vs_loan_status = df_loan_status.copy()
df_cluster_vs_loan_status['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_loan_status, target='loan_status')
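The relative percentages drawn in the second plot can also be checked as a plain table. The sketch below is an alternative to the custom function, not the method used in this notebook: it uses `pd.crosstab` with `normalize="index"` on toy data, and the values in `toy` are hypothetical.

```python
import pandas as pd

# Toy cluster labels and target levels (hypothetical values).
toy = pd.DataFrame({
    "Clusters":    [0, 0, 0, 0, 1, 1, 1, 1],
    "loan_status": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Absolute counts of each loan_status level per cluster.
counts = pd.crosstab(toy["Clusters"], toy["loan_status"])

# Row-wise normalization: share of each loan_status level within a cluster.
relative = pd.crosstab(toy["Clusters"], toy["loan_status"],
                       normalize="index") * 100

print(counts)
print(relative.round(1))
```

Each row of `relative` sums to 100, so a cluster dominated by one `loan_status` level stands out immediately.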

## **Summary**

* It is difficult to extract a clear pattern that can help the business when using all the variables in the dataset.
* Additional analysis is therefore required: the cluster analysis ought to be repeated using only the best features
  extracted from the analysis in this notebook.