
# **Project Name**    -



##### **Project Type**    - Classification
##### **Contribution**    - Individual

# **Project Summary -**

**Introduction**:

The aim of this machine learning classification project is to predict the likelihood of default payments by customers in Taiwan. By accurately estimating the probability of default, the project intends to provide valuable insights for risk management. The focus is on developing a model that surpasses traditional binary classification by incorporating the estimated probability of default. The evaluation of customer credit card payment default will be carried out using the K-S chart, which enables a comprehensive assessment of default risk.

# **GitHub Link -**

https://github.com/dhar9571/Capstone-Project-Classification-ML---Data-Mobile-Price-Range-Dataset.git

# **Problem Statement**


This project addresses the critical problem of default payment by customers on credit card payments. Payment defaults can have a significant financial impact on both individuals and financial institutions. Therefore, accurate prediction of the probability of default can greatly aid risk management. While binary classification (credible or non-credible customers) is a common approach, this project goes beyond this and emphasizes the importance of estimating the probability of default to improve prediction accuracy.

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

#Libraries for EDA:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Libraries for Feature Engineering and Model Training:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif, f_classif
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score, precision_score, classification_report

### Dataset Loading

In [None]:
# Load Dataset

# Note: this path is machine-specific; point it to your local copy of the file.
df = pd.read_excel("C:\\Users\\dk957\\Downloads\\default of credit card clients.xls")

### Dataset First View

In [None]:
# Dataset First Look

# Setting the display option to show all the features:
pd.set_option('display.max_columns', None)

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print(f'This Dataset has {df.shape[0]} rows and {df.shape[1]} columns')

### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

print(f'This dataframe has {df.duplicated().sum()} duplicate observations')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isna().sum()

**Observation**:

This dataset does not have any missing values

In [None]:
# Visualizing the missing values

# Setting up figure size:
plt.figure(figsize=(15, 10))

# Plotting heatmap
sns.heatmap(df.isna(), yticklabels=False, cbar=True, cmap='viridis')

**Observation**:

As per the above heatmap, we can clearly see that no missing/null values exist in the dataset.

### What did you know about your dataset?

1. This Dataset has 30000 rows and 25 columns
2. The dataset has 15 numerical and 10 categorical features
3. The dataset has no duplicate values
4. The dataset does not have any missing/null values
5. This dataset does not have any feature with 'object' datatype as all the features are already converted to integer

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

for x in df.columns:
    print(x, end=" , ")

In [None]:
# Dataset Describe

df.describe()

### Variables Description

**ID**: This column represents the unique identifier for each individual in the dataset. It is likely a numerical value but can be treated as categorical if it does not carry any meaningful numerical information.

**LIMIT_BAL**: This column represents the credit limit of the individual's credit card. It is a numerical feature.

**SEX**: This column represents the gender of the individual. It is a categorical feature.

**EDUCATION**: This column represents the educational background of the individual. It is a categorical feature.

**MARRIAGE**: This column represents the marital status of the individual. It is a categorical feature.

**AGE**: This column represents the age of the individual. It is a numerical feature.

**PAY_0 to PAY_6**: These columns represent the repayment status of the individual for the respective months. They indicate whether the individual made timely payments or had delayed payments. They are categorical features.

**BILL_AMT1 to BILL_AMT6**: These columns represent the amount of bill statement for the respective months. They are numerical features.

**PAY_AMT1 to PAY_AMT6**: These columns represent the amount of previous payments made by the individual for the respective months. They are numerical features.

**default payment next month**: This column represents whether the individual defaulted on the credit card payment in the next month. It is the target variable and can be treated as a categorical feature.

### Check Unique Values for each variable.

In [None]:
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Removing the ID feature, as it only holds unique identifiers that have no meaningful effect on the output feature:

df.drop(columns=["ID"], inplace=True)

In [None]:
# Checking unique values of EDUCATION categorical feature:

df["EDUCATION"].unique()

In [None]:
# Checking unique values of MARRIAGE categorical feature:

df["MARRIAGE"].unique()

### What all manipulations have you done and insights you found?

1. Removed the ID column, as it only holds unique identifiers that carry no predictive information.
2. As the dataset does not have any duplicate or null values, not much data wrangling is required.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Chart - 1

In [None]:
# Chart - 1 LIMIT_BAL vs SEX wise Default_Payment_Next_Month

ax = sns.barplot(x=df["default payment next month"],y=df["LIMIT_BAL"],hue=df["SEX"])

for item  in ax.containers:
    ax.bar_label(item)
plt.xticks([0, 1],["No","Yes"])
plt.xlabel("Default Payment Next Month")
plt.title("LIMIT_BAL vs SEX wise Default_Payment_Next_Month")
plt.show()

##### 1. Why did you pick the specific chart?

Bar chart allows for easy comparison among different categories, making it possible to see patterns, trends, and relationships.

##### 2. What is/are the insight(s) found from the chart?

With the plot, it is clearly visible that:

1. Defaulters have a lower average credit limit than non-defaulters.
2. There is no large visible difference in the credit limits of males and females.

##### 3. Will the gained insights help creating a positive business impact?

Based on the information of credit limit, it may become easy to interpret the chances of payment defaults.

### Chart - 2

In [None]:
# Chart - 2 Education Type vs Default_Payment_Next_Month

# Setting up plot size:

plt.figure(figsize=(10,6))

aa = sns.barplot(hue = df["default payment next month"], x=df["EDUCATION"], y = df["LIMIT_BAL"])

for value in aa.containers:
    aa.bar_label(value)

plt.title("Education Type vs Default_Payment_Next_Month")
plt.ylabel("LIMIT_BAL")
plt.show()

##### 1. Why did you pick the specific chart?

Bar chart allows for easy comparison among different categories, making it possible to see patterns, trends, and relationships.

##### 2. What is/are the insight(s) found from the chart?

1. Education type 1 defaulters have the highest LIMIT_BAL.
2. Education type 0 has no defaulters.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing how LIMIT_BAL differs between defaulters and non-defaulters within each Education Type can support future default predictions.

### Chart - 3

In [None]:
# Chart - 3 Education type wise Male and Female Counts

sns.countplot(x=df["EDUCATION"], hue=df["SEX"])
plt.title("Education type wise Male and Female Counts")
plt.xlabel("Education Type")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Bar chart allows for easy comparison among different categories, making it possible to see patterns, trends, and relationships.

##### 2. What is/are the insight(s) found from the chart?

1. Education type 1, 2 and 3 have the highest count of Males and Females.
2. In all of the education types, the number of females is higher than the number of males.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

With the information on male and female counts as per education type, it would be beneficial to make amendments in credit policies accordingly.

### Chart - 4

In [None]:
# Chart - 4 Distribution of Age with Payment Defaults

# Creating separate dataframes for No Default and Default types:

default_0 = df[df['default payment next month'] == 0]
default_1 = df[df['default payment next month'] == 1]

# Plotting histogram:

plt.hist(default_0['AGE'], bins=10, alpha=0.5, label='No Default')
plt.hist(default_1['AGE'], bins=10, alpha=0.5, label='Default')

plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Distribution of Age with Payment Defaults')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Histograms provide a visual representation of the distribution of data. They show how data is spread across different intervals or bins, allowing us to quickly understand the overall shape, central tendency and spread of the data.

##### 2. What is/are the insight(s) found from the chart?

1. Most of the defaulters are in the age group 20 - 35 (approx).
2. The data distribution is a little skewed to the right.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

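Age-wise defaulter information is very important from a business point of view. The age group 20-35 should be considered carefully before credit allowance, as it accounts for the largest number of payment defaults. Because the histogram shows counts, this does not by itself mean that the group has the highest probability of default.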
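
A histogram of counts shows where most defaulters are, but a bin with many customers will naturally contain many defaulters. The sketch below, under the assumption that the same column names are used, computes the default rate per age band with `pd.cut`. The toy values and the band edges are illustrative assumptions, not figures from the dataset.

```python
import pandas as pd

# Toy data standing in for the AGE and target columns (illustrative values only)
toy = pd.DataFrame({
    "AGE": [22, 25, 28, 31, 34, 41, 45, 52, 58, 63],
    "default payment next month": [1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
})

# Assumed age bands; right=False makes each band include its left edge
bands = pd.cut(toy["AGE"], bins=[20, 35, 50, 80], right=False)

# The mean of a 0/1 target within a band is the default rate of that band
default_rate = toy.groupby(bands, observed=False)["default payment next month"].mean()
print(default_rate)
```

Comparing rates instead of counts separates "this group is large" from "this group defaults more often".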

### Chart - 5

In [None]:
# Chart - 5 Marriage vs. Default Payment Next Month

#Setting up figure size:

plt.figure(figsize = (10, 6))

ab = sns.countplot(hue=df["default payment next month"], x = df["MARRIAGE"])

for item in ab.containers:
    ab.bar_label(item)
plt.title("Marriage vs. Default Payment Next Month")
plt.xlabel("Marriage Status")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Bar chart allows for easy comparison among different categories, making it possible to see patterns, trends, and relationships.

##### 2. What is/are the insight(s) found from the chart?

1. Almost all the defaulters belong to MARRIAGE status 1 and 2. The remaining marriage types, 0 and 3, have almost no defaulters.
2. Marriage Status 1 and 2 have almost the same number of defaulters.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

With marriage type information, it would become easier to interpret the chances of default payments from business point of view.

### Chart - 6

In [None]:
# Chart - 6 Checking Linear Relationship between LIMIT_BAL and other numerical features

# Creating a for loop to create multiple scatterplot with LIMIT_BAL feature on X-Axis and other numerical features on Y-Axis.

for item in ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
             'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']:
    sns.scatterplot(x=df["LIMIT_BAL"], y=df[item])
    plt.title(f"Association between LIMIT_BAL and {item}")
    plt.show()

##### 1. Why did you pick the specific chart?

Scatterplots allow us to visually identify any patterns, trends, or relationships between the variables. This can help determine if the variables are positively or negatively related, or if there is no apparent relationship.

##### 2. What is/are the insight(s) found from the chart?

1. The numerical features are not strongly linearly correlated with the LIMIT_BAL feature.
2. The spread of the data points does not look normally distributed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the numerical features have linear relationship, it can help to make better predictions on the new/unseen dataset.

### Chart - 7 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

corr = df.corr()

#Setting up plot size:
plt.figure(figsize=(20,14))

sns.heatmap(corr, annot=True)
plt.title("Correlation Heatmap")

##### 1. Why did you pick the specific chart?

A correlation heatmap allows us to quickly identify patterns and relationships between variables. By using colors to represent correlation values, we can easily spot variables that are positively correlated (high values represented by a certain color) or negatively correlated (low values represented by a different color). This visual representation helps to identify which variables are strongly related and which are not.

##### 2. What is/are the insight(s) found from the chart?

1. BILL_AMT1 to BILL_AMT6 are highly correlated with each other.
2. PAY_0 to PAY_6 have high correlation with each other.
3. Remaining features do not have any strong correlation.
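
Reading strongly correlated pairs off a large annotated heatmap is error-prone. As a hedged sketch, the hypothetical helper `high_corr_pairs` below lists every feature pair whose absolute correlation reaches a threshold. The threshold of 0.8 and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, correlation) for pairs with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the upper triangle so that each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Toy data: "b" is a linear function of "a", while "c" is only weakly related
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]})
print(high_corr_pairs(toy))
```

On the notebook's data the same call would be `high_corr_pairs(df)`, assuming `df` holds only numeric columns at that point.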

### Chart - 8 - Pair Plot

In [None]:
# Pair Plot visualization code

# Creating an instance of the pairplot:
pairplot = sns.pairplot(df)

# Adding a figure-level title (plt.title would only label the last subplot):
pairplot.fig.suptitle("Pair Plot", y=1.02)

# Saving the pairplot at a high resolution to avoid tiny, unreadable subplots:
pairplot.savefig("pairplot.png", dpi=300)

##### 1. Why did you pick the specific chart?

Pair plots allow us to visually explore the relationships between variables in a dataset. By examining the scatter plots, we can quickly identify patterns, trends, and associations between pairs of variables. This can help us understand the data and identify potential relationships that may require further investigation.

##### 2. What is/are the insight(s) found from the chart?

1. The numerical features are not strongly linearly correlated with the LIMIT_BAL feature.
2. The spread of the data points does not look normally distributed.
3. BILL_AMT1 to BILL_AMT6 are highly correlated with each other.
4. PAY_0 to PAY_6 have high correlation with each other.
5. Remaining features do not have any strong correlation.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: There is no difference in mean LIMIT_BAL between males and females. (means are same)

Alternate Hypothesis: There is a significant difference in mean LIMIT_BAL between males and females. (means are not same)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# importing required library:

import scipy.stats as stats

males_data = df[df["SEX"] == 1]["LIMIT_BAL"]
females_data = df[df["SEX"] == 2]["LIMIT_BAL"]

# Applying ttest:

t_statistic, p_value = stats.ttest_ind(males_data, females_data)

# Setting up the significance level

alpha = 0.05

print(f't_statistic is {t_statistic}')
print(f'p_value is {p_value}')

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in LIMIT_BAL between males and females.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in LIMIT_BAL between males and females.")


##### Which statistical test have you done to obtain P-Value?

Performed an independent two-sample t-test to obtain the p-value.

##### Why did you choose the specific statistical test?

When we have a categorical feature with 2 classes and one numerical feature, we can perform a t-test to compare the means of the numerical feature between the two classes.
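
By default `stats.ttest_ind` assumes that both groups have equal variances. When group sizes and variances differ, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below uses synthetic groups, not the notebook's data, to show the call. The group means, spreads and sizes are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic groups with different means, spreads and sizes (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=10.0, size=200)
group_b = rng.normal(loc=110.0, scale=25.0, size=300)

# equal_var=False requests Welch's t-test, which does not assume equal variances
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t_statistic is {t_stat}")
print(f"p_value is {p_val}")
```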

##### Observation:

As per the above test results, we can confirm that the mean LIMIT_BAL for males and females is not the same, which is consistent with the plot below:

![image-2.png](attachment:image-2.png)

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: All education types have the same mean LIMIT_BAL. (means are same)

Alternate Hypothesis: At least one education type has a different mean LIMIT_BAL. (means are not same)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Extracting data for education types 1 to 4 (other codes present in the data, such as 0 and 5, are not included in this test)

education_type_1 = df[df["EDUCATION"] == 1]["LIMIT_BAL"]
education_type_2 = df[df["EDUCATION"] == 2]["LIMIT_BAL"]
education_type_3 = df[df["EDUCATION"] == 3]["LIMIT_BAL"]
education_type_4 = df[df["EDUCATION"] == 4]["LIMIT_BAL"]

# Performing one-way ANOVA

f_statistic, p_value = stats.f_oneway(education_type_1, education_type_2, education_type_3, education_type_4)

# Setting up the significance level

alpha = 0.05

print(f'f_statistic is {f_statistic}')
print(f'p_value is {p_value}')

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in LIMIT_BAL among education types.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in LIMIT_BAL among education types.")


##### Which statistical test have you done to obtain P-Value?

Performed ANOVA (Analysis of Variance) test to obtain the P-value.

##### Why did you choose the specific statistical test?

ANOVA is suitable for comparing means when we have more than two classes in a categorical variable.
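
When the hypothesis is about all education types, one option is to build the groups with `groupby`, so that every code present in the data is included without listing them by hand. A minimal sketch on toy values (illustrative only, not taken from the dataset):

```python
import pandas as pd
from scipy import stats

# Toy data with three education codes (illustrative values only)
toy = pd.DataFrame({
    "EDUCATION": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "LIMIT_BAL": [200, 220, 210, 150, 160, 155, 100, 90, 110],
})

# One array of LIMIT_BAL values per education code present in the data
groups = [g["LIMIT_BAL"].to_numpy() for _, g in toy.groupby("EDUCATION")]

# One-way ANOVA across all groups
f_stat, p_val = stats.f_oneway(*groups)
print(f"f_statistic is {f_stat}")
print(f"p_value is {p_val}")
```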

##### Observation:

As per the above test results, we can confirm that the mean LIMIT_BAL is not the same across education types, which we also observed through the plot below:

![image.png](attachment:image.png)

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: SEX and EDUCATION are independent, i.e. the proportion of males and females is the same in every education type.

Alternate Hypothesis: SEX and EDUCATION are associated, i.e. the proportion of males and females differs across education types.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Creating a contingency table
contingency_table = pd.crosstab(df["EDUCATION"], df["SEX"])

# Performing chi-square test of independence
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Setting up the significance level

alpha = 0.05

print(f'chi2_stat is {chi2_stat}')
print(f'p_value is {p_value}')

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the number of males and females across education types.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the number of males and females across education types.")

##### Which statistical test have you done to obtain P-Value?

Performed the chi-square test of independence to obtain the p-value.

##### Why did you choose the specific statistical test?

The chi-square test is suitable for analyzing categorical data and determining whether there is a significant association between two categorical variables.
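
The chi-square test compares observed counts with the counts expected if the two variables were independent. The sketch below contrasts two toy contingency tables (illustrative counts only): one where the column proportions are identical in every row, and one where they clearly differ.

```python
import pandas as pd
from scipy import stats

# Toy counts (illustrative only): rows are education codes, columns are SEX codes
independent = pd.DataFrame([[10, 20], [20, 40], [30, 60]], index=[1, 2, 3], columns=[1, 2])
dependent = pd.DataFrame([[50, 10], [30, 30], [10, 50]], index=[1, 2, 3], columns=[1, 2])

chi2_ind, p_ind, dof_ind, _ = stats.chi2_contingency(independent)
chi2_dep, p_dep, dof_dep, _ = stats.chi2_contingency(dependent)

print(f"Same proportions in every row: chi2={chi2_ind}, p_value={p_ind}")
print(f"Different proportions per row: chi2={chi2_dep}, p_value={p_dep}")
```

The degrees of freedom are (rows - 1) × (columns - 1), so both 3 × 2 tables have 2.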

#### Observation

As per the above test results, we can confirm that the proportion of males and females is not the same across education types, which we also observed through the plot below:

![image-2.png](attachment:image-2.png)

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Creating a For Loop to construct box plots for each feature:

for column in df.columns:
    sns.boxplot(x=df[column])
    plt.title(column)
    plt.show()

#### Observation:

All the numerical features have outliers, which need to be handled before feeding the data to the model.

In [None]:
# Creating a list of numerical features from the dataframe:

num_features = ['LIMIT_BAL', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
                'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

# Creating a for loop to flag outliers in the numerical features using the Interquartile Range (IQR);
# flagged values are replaced with NaN and imputed in the next step:

for column in num_features:
    q1, q3 = np.percentile(df[column],[25,75])
    iqr = q3-q1
    lower_range = q1-(1.5*iqr)
    upper_range = q3+(1.5*iqr)

    # Keeping values inside [lower_range, upper_range] and setting the rest to NaN:
    df[column] = df[column].where(df[column].between(lower_range, upper_range))

In [None]:
# Again checking the outliers using boxplots for numerical features:

for column in num_features:
    sns.boxplot(x=df[column])
    plt.title(column)
    plt.show()

#### Observation:

Most of the outliers have been removed.

##### What all outlier treatment techniques have you used and why did you use those techniques?

**Technique used**: Interquartile range (IQR) method

**Reason**: The Interquartile range (IQR) method is a robust way to detect outliers. It is useful when the data has a non-normal distribution and contains extreme values. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1); any data point below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier.
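
The IQR rule can be checked by hand on a small array. In the sketch below the values are made up for illustration; 100 is the only point outside the computed bounds.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 100], dtype=float)

# First and third quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Bounds of the IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged as outliers
is_outlier = (values < lower) | (values > upper)
print(lower, upper, values[is_outlier])
```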

### 2. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Checking missing values after removing the outliers:

df.isna().sum()

In [None]:
# Creating a for loop to replace all the null/missing values of the numerical features with the column mean:

for column in num_features:
    df[column] = df[column].fillna(df[column].mean())

In [None]:
# Again checking missing values:

df.isna().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Replaced all the null/missing values with column means for all the numerical features. Mean is used to fill null values when the features are numerical and there are no outliers present.

### 3. Categorical Encoding

#### What all categorical encoding techniques have you used & why did you use those techniques?

1. All the features have numerical values. Therefore, Label encoding is not required.
2. Created Dummy Variables for ordinal features: PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6
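
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a toy `PAY_0` column (values are illustrative). Each distinct code becomes its own indicator column named `<feature>_<code>`, which also means the ordering between the codes is no longer visible to the model.

```python
import pandas as pd

# Toy frame with one repayment-status column and one numerical column
toy = pd.DataFrame({"PAY_0": [-1, 0, 2, 0], "LIMIT_BAL": [20000, 50000, 30000, 10000]})

# One indicator column per distinct PAY_0 code; LIMIT_BAL is left untouched
encoded = pd.get_dummies(toy, columns=["PAY_0"])
print(encoded.columns.tolist())
```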

In [None]:
df.head()

In [None]:
df = pd.get_dummies(df,columns=["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"])

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

No Textual Data is present.

### 4. Feature Selection

In [None]:
df.head()

In [None]:
# Creating separate dataframes for dependent and independent features:

X = df.drop(["default payment next month"],axis=1)
y = df["default payment next month"]

In [None]:
# Select your features wisely to avoid overfitting

# Creating an instance of SelectKBest Class.
# k="all" keeps every feature, since only the per-feature scores are used below:

selectkbest = SelectKBest(score_func= mutual_info_classif, k="all")

# Fitting the instance on dataframe:

best_features = selectkbest.fit(X,y)

In [None]:
# Creating a dataframe for the feature scores:

scores = pd.DataFrame(best_features.scores_, columns = ["Scores"])

# Creating another dataframe for column names:

features = pd.DataFrame(X.columns, columns = ["Feature"])

# Concatenating both the dataframes to get each feature along with its score:

feature_score = pd.concat([features, scores], axis=1)

# Sorting the features as per the scores in descending order:

feature_score.sort_values(by="Scores", axis = 0, ascending=False, inplace=True)

In [None]:
feature_score

In [None]:
# Checking scores of each feature:

feature_score.head(60)

In [None]:
df.columns

In [None]:
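# Dropping unimportant features from the dataset.
# errors="ignore" skips any listed column that is not present in the dataframe
# (for example, EDUCATION was not one-hot encoded above, so EDUCATION_2 may not exist):

df = df.drop(columns=["EDUCATION_2","EDUCATION_3","PAY_6_6","PAY_3_-2","PAY_2_-2","EDUCATION_5","PAY_AMT5","PAY_2_4","PAY_6_4","PAY_AMT6"], errors="ignore")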

##### What all feature selection methods have you used?

Used the **SelectKBest** feature selection technique from the **sklearn.feature_selection** module, with **mutual_info_classif** as the scoring function.
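
To illustrate how the scores behave, the sketch below runs `SelectKBest` on synthetic data with one feature derived from the target and one pure-noise feature. The data and the wrapper `mi_score` (used here only to fix `random_state` for repeatable scores) are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: the first feature follows the target, the second is noise
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)
informative = y_toy + rng.normal(scale=0.3, size=500)
noise = rng.normal(size=500)
X_toy = np.column_stack([informative, noise])

def mi_score(X, y):
    # Fixing random_state makes the mutual information estimate repeatable
    return mutual_info_classif(X, y, random_state=0)

# Keeping only the single best-scoring feature
selector = SelectKBest(score_func=mi_score, k=1).fit(X_toy, y_toy)
print(selector.scores_, selector.get_support())
```

Mutual information is estimated with some randomness, so fixing `random_state` helps when feature rankings need to be compared between runs.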

##### Which all features you found important and why?

As per the SelectKBest feature scores, the features below were found to be important, as they carry the highest and moderate scores.

['PAY_0', 'PAY_2', 'LIMIT_BAL', 'PAY_3', 'PAY_4', 'PAY_AMT1',
       'PAY_5', 'PAY_6', 'PAY_AMT3', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT6',
       'PAY_AMT5', 'SEX']

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Data transformation is not required.

### 6. Data Scaling

In [None]:
# Scaling your data

# Creating an instance of the StandardScaler class:

standard = StandardScaler()

# Numerical features to scale; columns already dropped during feature selection are skipped:

num_cols = [col for col in ["LIMIT_BAL","PAY_AMT1","PAY_AMT3","PAY_AMT2","PAY_AMT4","PAY_AMT6","PAY_AMT5"]
            if col in df.columns]

# Creating separate dataframes for the numerical features to scale and the remaining columns:

num_X = df[num_cols]
cat_X = df.drop(columns=num_cols)

# Fitting and transforming the numerical features, keeping the original index so the concat below aligns:

num_X = pd.DataFrame(standard.fit_transform(num_X), columns=num_cols, index=df.index)

# Concatenating the scaled numerical features with the remaining columns:

df_scaled = pd.concat([num_X,cat_X],axis=1)

##### Which method have you used to scale your data and why?

Used StandardScaler from sklearn.preprocessing module to standardize the data for Logistic Regression model.
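
One common refinement, shown here as a sketch on synthetic data and not as the notebook's own pipeline, is to fit the scaler on the training split only and reuse its statistics for the test split. This keeps information from the test set out of the scaling step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

# Learning the mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_tr)

# Applying the same statistics to both splits
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```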

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality Reduction is not required, as the dataset has only 25 features including the dependent feature. Therefore, Dimensionality Reduction will not provide much benefit in predictions.

### 8. Data Splitting

In [None]:
# Creating separate dataframes for dependent and independent features (using the scaled dataframe created above):

X = df_scaled.drop(columns=["default payment next month"])
y = df_scaled["default payment next month"]

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

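# train test split for standardized data
# (X is used directly; X_pca is only created in a later cell and dimensionality reduction was judged unnecessary above):

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 1)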

##### What data splitting ratio have you used and why?

Used 4:1 ratio of train and test data as more training data will provide better predictions on unseen/test data.
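
With an imbalanced target, a plain random split can leave the test set with a slightly different class ratio than the training set. The `stratify` argument of `train_test_split` preserves the ratio in both parts. A minimal sketch on toy labels with an assumed 80/20 class balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 observations of class 0 and 20 of class 1 (illustrative only)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```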

### 9. Handling Imbalanced Dataset

In [None]:
# Checking Imbalanced Dataset:

df["default payment next month"].value_counts().reset_index()

In [None]:
# Plotting the class counts; using the index and values of value_counts() works across pandas versions:
class_counts = df["default payment next month"].value_counts()
ad = sns.barplot(x=class_counts.index, y=class_counts.values)

plt.title("Distribution of Each Class")
plt.ylabel("Count")
plt.xlabel("Class")

for item in ad.containers:
    ad.bar_label(item)

##### Do you think the dataset is imbalanced? Explain Why.

The dataset is highly imbalanced as 0 class has almost 78% distribution and 1 class has only 22% distribution.

In [None]:
# Handling the Imbalanced Dataset:

# importing required libraries:

from imblearn.over_sampling import SMOTE

# Creating an instance of this class:

imb = SMOTE(random_state = 1)

# Fitting the data:

X_train_new, y_train_new = imb.fit_resample(X_train,y_train)

In [None]:
X_train_new.shape, y_train_new.shape

In [None]:
y_train_new

In [None]:
y_train_new.value_counts().reset_index()

In [None]:
# Again checking the class distribution for balanced dataset:

balanced_counts = y_train_new.value_counts()
ac = sns.barplot(x=balanced_counts.index, y=balanced_counts.values)

# Creating a For loop to show bar values:

for item in ac.containers:
    ac.bar_label(item)

plt.title("Class distribution")
plt.ylabel("Counts")
plt.xlabel("Class")

**Observation**:

The training set is balanced, as both classes now have an equal number of observations.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

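Used SMOTE (Synthetic Minority Over-sampling Technique) to balance the training data. It oversamples the minority class by generating synthetic observations instead of duplicating existing ones. Under-sampling was not chosen because discarding majority-class observations may cause loss of information.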
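
The core idea behind SMOTE, which the code above applies through `imblearn`, is interpolation: a synthetic point is placed at a random position on the line segment between a minority observation and one of its nearest minority neighbours. The sketch below is a simplified illustration of that single step only. The helper `interpolate_sample` and the two points are hypothetical, and the neighbour search is left out.

```python
import numpy as np

def interpolate_sample(x, neighbor, rng):
    """Return a point at a random position on the segment between x and neighbor."""
    gap = rng.uniform(0.0, 1.0)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(1)

# Two minority-class observations (illustrative values only)
minority_point = np.array([1.0, 2.0])
minority_neighbor = np.array([3.0, 6.0])

synthetic = interpolate_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```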

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assuming your data is stored in X as a numpy array or pandas DataFrame

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create an instance of PCA
pca = PCA(n_components=3)  # Specify the number of components you want to keep

# Apply PCA to the scaled data
X_pca = pca.fit_transform(X_scaled)

# Access the principal components (eigenvectors)
principal_components = pca.components_

# Access the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Print the explained variance ratio for each component
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Explained Variance Ratio of Component {i+1}: {ratio}")

# Access the singular values (related to, but not the same as, the eigenvalues in pca.explained_variance_)
singular_values = pca.singular_values_

# Print the singular values
print("Singular Values:")
print(singular_values)

# Reconstruct the data in the scaled feature space from the principal components
X_projected = pca.inverse_transform(X_pca)

# X_pca holds the data projected onto the principal components;
# X_projected is its reconstruction in the scaled feature space


In [None]:
X_pca
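
As a small check of the PCA attributes used above, the sketch below runs on synthetic stand-in data (the array `X_demo` is an assumption, not the project dataset). It shows that the squared singular values divided by `n_samples - 1` match `explained_variance_`, and that `inverse_transform` returns data with the original number of features rather than the projected one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 6 features (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=3)
X_demo_pca = pca_demo.fit_transform(X_demo_scaled)

# Eigenvalues of the covariance matrix recovered from the singular values.
n_samples = X_demo_scaled.shape[0]
eigenvalues_from_svd = pca_demo.singular_values_ ** 2 / (n_samples - 1)
print(np.allclose(eigenvalues_from_svd, pca_demo.explained_variance_))

# inverse_transform maps the 3 components back to the 6 scaled features.
X_demo_reconstructed = pca_demo.inverse_transform(X_demo_pca)
print(X_demo_pca.shape, X_demo_reconstructed.shape)
```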

## ***7. ML Model Implementation***

### ML Model - 1 Logistic Regression

In [None]:
# ML Model - 1 Implementation

# Creating an instance of the class:

logistic = LogisticRegression()

# Fit the Algorithm

logistic.fit(X_train_new,y_train_new)

# Predict on the model

logistic_y_pred = logistic.predict(X_test)

print(logistic_y_pred)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

print(classification_report(y_test, logistic_y_pred))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Defining hyperparameters:

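# "ElasticNet" is not a valid penalty name; scikit-learn expects "elasticnet".
# liblinear supports the l1 and l2 penalties; elasticnet needs the saga solver and an l1_ratio.
# The l1_ratio value of 0.5 is an assumed example, not a tuned choice.
C_values = [1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100]
max_iter_values = [100,200,300,400,500]

parameters = [
    {"penalty":["l1","l2"], "solver":["liblinear"], "C": C_values, "max_iter": max_iter_values},
    {"penalty":["elasticnet"], "solver":["saga"], "l1_ratio":[0.5], "C": C_values, "max_iter": max_iter_values},
]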

# Creating an instance of gridsearchcv:

logistic_grid = GridSearchCV(estimator=logistic, param_grid=parameters, scoring="accuracy", cv=5)

# Fit the Algorithm

logistic_grid.fit(X_train_new, y_train_new)

# Reviewing optimal values for hyperparameters:

logistic_grid.best_params_

In [None]:
# Predict on the model

logisticgrid_y_pred = logistic_grid.predict(X_test)

# Evaluating scores:

print(classification_report(y_test, logisticgrid_y_pred))

##### Which hyperparameter optimization technique have you used and why?

Used GridSearchCV to find the optimal hyperparameter values. GridSearchCV performs an exhaustive search over a predefined set of hyperparameter values: it evaluates every combination of the provided values with cross-validation, so the search is systematic and organized.
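
To make the cost of an exhaustive search concrete, the sketch below counts the candidates for a grid of the same size as the Logistic Regression grid above (three penalties, 19 values of `C`, five values of `max_iter`) with 5-fold cross-validation. It only does the arithmetic; no model is fitted, and the variable names are illustrative.

```python
from itertools import product

# Grid with the same sizes as the Logistic Regression search above.
param_grid = {
    "penalty": ["l1", "l2", "elasticnet"],
    "C": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "max_iter": [100, 200, 300, 400, 500],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV fits the best candidate one more time on the full training data after these cross-validation fits.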

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

There is no improvement after hyperparameter tuning; sometimes the default hyperparameter values already fit the data best.

### ML Model - 2 Decision Tree

In [None]:
# ML Model - 2 Implementation

# Creating an instance of the class:

decision = DecisionTreeClassifier()

# Fitting the model to the data

decision.fit(X_train_new, y_train_new)

# Predicting on unseen data:

decision_y_pred = decision.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

print(classification_report(y_test, decision_y_pred))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Defining hyperparameters:

parameters_decision = {"criterion":["gini","entropy","log_loss"], "splitter": ["best","random"], "max_depth":[3,5,7,9,10,12,15,20,30,32,35,38,40]}

# Creating an instance of gridsearchcv:

decision_grid = GridSearchCV(estimator=decision, param_grid=parameters_decision, scoring="accuracy", cv=5)

# Fitting the Algorithm

decision_grid.fit(X_train_new, y_train_new)

# Reviewing optimal values for hyperparameters:

decision_grid.best_params_

In [None]:
# Predict on the model

decisiongrid_y_pred = decision_grid.predict(X_test)

# Evaluating scores:

print(classification_report(y_test, decisiongrid_y_pred))

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3 RandomForest

In [None]:
# ML Model - 3 Implementation

# Creating an instance of the class:

forest = RandomForestClassifier(n_estimators=20, oob_score = True, n_jobs = 1, random_state = 42, max_features = None,
                                min_samples_leaf = 10)

# Fitting the model to the data

forest.fit(X_train_new, y_train_new)

# Predicting on unseen data:

forest_y_pred = forest.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

print(classification_report(y_test, forest_y_pred))

In [None]:
# Additional model for comparison: XGBoost Implementation

# Creating an instance of the class:

boost = XGBClassifier()

# Fitting the model to the data

boost.fit(X_train_new, y_train_new)

# Predicting on unseen data:

boost_y_pred = boost.predict(X_test)

In [None]:
# Visualizing evaluation Metric Score chart

print(classification_report(y_test, boost_y_pred))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Defining hyperparameters:

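# Every entry in the grid must be a "name": [values] pair; oob_score=True inside a dict is a syntax error.
parameters_forest = {"n_estimators":[20], "oob_score":[True], "criterion":["gini","entropy","log_loss"], "max_depth":[10,20,30,35,38,40]}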

# Creating an instance of gridsearchcv:

forest_grid = GridSearchCV(estimator=forest, param_grid=parameters_forest, scoring="accuracy", cv=5)

# Fitting the Algorithm

forest_grid.fit(X_train_new, y_train_new)

# Reviewing optimal values for hyperparameters:

forest_grid.best_params_

In [None]:
print(classification_report(y_train, forest_grid.predict(X_train)))

In [None]:
print(classification_report(y_test, forest_grid.predict(X_test)))

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
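
The two steps above (save, then reload and predict) can be sketched as a round trip with the standard `pickle` module. This is a minimal sketch under stated assumptions: the model is a stand-in `LogisticRegression` fitted on synthetic data, and the file name `model.pkl` in a temporary directory is hypothetical. In the notebook, the chosen final model and real unseen rows would take their place.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

# Save the fitted model to a pickle file (hypothetical path).
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that predictions are unchanged.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

same_predictions = np.array_equal(model.predict(X_demo), loaded.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.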

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***