### Credit Card Approval Determination

Credit card approval determination is a critical process for financial institutions, where they evaluate potential customers' creditworthiness to decide whether to approve or reject credit card applications. This process involves analyzing various attributes of applicants to assess their ability to repay the credit and manage their financial responsibilities. The goal is to minimize the risk of defaults while ensuring that deserving applicants have access to credit.

#### Key Attributes in Credit Card Approval

Here are the key attributes commonly used in the credit card approval process and their significance:

- **ID**: A unique identifier for each applicant. It ensures that each record is distinct and traceable.
  
- **Gender**: The applicant's gender. This demographic attribute is useful for descriptive statistics and fairness audits, but using it as an input to approval decisions risks discrimination and is restricted by fair-lending rules in many jurisdictions.

- **Own_car**: Indicates whether the applicant owns a car. Car ownership can be a proxy for financial stability and asset ownership.

- **Own_property**: Indicates whether the applicant owns property. Property ownership is often associated with financial stability and creditworthiness.

- **Work_phone**: Whether the applicant has a work phone. This can be an indicator of stable employment.

- **Phone**: Whether the applicant has a personal phone. Having a phone can be a basic requirement for communication and contactability.

- **Email**: Whether the applicant has an email address. This is important for communication and can also indicate digital literacy.

- **Unemployed**: Indicates whether the applicant is currently unemployed. Employment status is a crucial factor in assessing income stability.

- **Num_children**: The number of children the applicant has. This can affect the applicant's financial responsibilities and disposable income.

- **Num_family**: The total number of family members. Larger family size might indicate higher financial obligations.

- **Account_length**: The length of time the applicant has had an account with the bank. Longer account history can be a sign of stability and reliability.

- **Total_income**: The total income of the applicant. Higher income generally increases the likelihood of credit approval as it indicates the ability to repay.

- **Age**: The age of the applicant. Age can correlate with financial experience and stability.

- **Years_employed**: The number of years the applicant has been employed. Longer employment history usually suggests job stability and steady income.

- **Income_type**: The type of income (e.g., salary, business income). Different income types can have varying levels of reliability and stability.

- **Education_type**: The applicant's level of education. Higher education levels can correlate with better job prospects and financial literacy.

- **Family_status**: The applicant's marital status (e.g., single, married). Family status can influence financial responsibilities and stability.

- **Housing_type**: The type of housing (e.g., rented, owned). Owning a home can be a sign of financial stability.

- **Occupation_type**: The applicant's occupation. Certain occupations might be seen as more stable or higher earning, influencing creditworthiness.

- **Target**: The target variable indicating whether the credit card application was approved or not. This is the outcome we aim to predict based on the other attributes.

These attributes collectively provide a comprehensive profile of the applicant, enabling financial institutions to make informed decisions about credit card approvals. By analyzing these factors, banks can assess the risk associated with each applicant and ensure that credit is extended to individuals who are likely to manage their credit responsibly.

### Introduction

In this notebook, we aim to address the presence of outliers in our dataset, which can potentially skew the results of our machine learning models and lead to inaccurate predictions. Here's a detailed overview of the steps we will take to handle outliers and improve the quality of our data:

- **Understanding Outliers**:
  - Outliers are extreme values that deviate significantly from the majority of the data.
  - They can arise from various factors such as measurement errors, data entry errors, or genuine variability in the data.
  - Outliers can disproportionately impact statistical analyses and machine learning models, often leading to biased estimates and reduced model performance.

- **Importance of Handling Outliers**:
  - To ensure robust model performance and reliable predictions, it is crucial to identify and handle outliers appropriately.
  - Removing outliers can reduce noise in the data and improve the accuracy and generalizability of our models, provided the removed values are errors or rare extremes rather than informative cases.

- **Steps in This Notebook**:

  1. **Load the Dataset**:
     - Import necessary libraries.
     - Load the dataset into a pandas DataFrame for analysis and manipulation.

  2. **Inspect the Data**:
     - Get a preliminary understanding of the dataset by examining its structure, summary statistics, and the distribution of the target variable.
     - Identify the numerical columns in the dataset that will be analyzed for outliers.

  3. **Identify Outliers**:
     - For each numerical column, calculate the 99th percentile and use it as the upper threshold for identifying outliers; only the upper tail is checked.
     - Count the number of data points that exceed this threshold in each numerical column.
     - Print the count of outliers for each numerical column to understand the extent of outliers in the dataset.

  4. **Remove Outliers**:
     - Filter the DataFrame to exclude data points that exceed the 99th percentile threshold for each numerical column.
     - This step ensures that the dataset is cleaned by removing extreme values that could distort the analysis and model training.

  5. **Evaluate the Impact of Removing Outliers**:
     - Compare the shape of the original and cleaned datasets to understand the reduction in data size.
     - Perform summary statistics and visualizations on the cleaned dataset to ensure data integrity and distribution.

  6. **Scale the Data**:
     - Apply feature scaling using StandardScaler to standardize the numerical features so that each has a mean of 0 and a standard deviation of 1 on the training set.
     - Scaling matters for distance-based steps such as SMOTE's nearest-neighbor search and for many gradient-based algorithms; the Random Forest itself is largely insensitive to it.

  7. **Handle Class Imbalance with SMOTE**:
     - Apply the Synthetic Minority Over-sampling Technique (SMOTE) to the training set to balance the classes in the target variable; the test set is left untouched.
     - SMOTE generates synthetic samples for the minority class by interpolating between existing minority samples, which reduces the model's bias towards the majority class.

  8. **Train the Initial Model**:
     - Train a Random Forest classifier on the balanced and scaled dataset to identify feature importance.
     - Extract and visualize the feature importances to understand which features contribute the most to the model's predictions.

  9. **Select Top 10 Features**:
     - Based on the feature importances, select the top 10 most important features.
     - Reduce the dataset to include only these top 10 features for further analysis and model training.

  10. **Hyperparameter Tuning**:
      - Perform hyperparameter tuning using GridSearchCV to find the best combination of hyperparameters for the Random Forest classifier.
      - This step involves testing various combinations of parameters to optimize the model's performance.

  11. **Train the Final Model**:
      - Train the Random Forest classifier using the best parameters obtained from the hyperparameter tuning on the top 10 features.
      - Evaluate the final model using accuracy, classification report, and confusion matrix to assess its performance.

By performing these steps, we aim to improve the quality of our data, leading to better model training and more accurate predictions. The goal is a model trained on data that reflects the underlying patterns and relationships rather than a handful of extreme values.
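
As a quick illustration of the percentile rule used in step 3, here is a minimal sketch on a toy column. The values 1 to 100 and the name `toy_income` are made up for illustration and are not taken from the dataset; the sketch relies on pandas' default linear interpolation in `quantile`.

```python
import pandas as pd

# Toy column: the integers 1..100 (illustrative values only)
values = pd.Series(range(1, 101), name="toy_income")

# 99th percentile with pandas' default linear interpolation
upper_limit = values.quantile(0.99)

# Values strictly above the threshold are counted as outliers
n_outliers = int((values > upper_limit).sum())

print(f"threshold = {upper_limit:.2f}, outliers = {n_outliers}")
```

On this toy column the threshold is 99.01 and exactly one value (100) is flagged. Because the threshold is a percentile of the column itself, up to about 1% of the values in any continuous column will be flagged, whether or not they are errors.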

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

# Load Data

In [None]:
data = pd.read_csv('/kaggle/input/credit-card-eligibility-data-determining-factors/dataset.csv')
data.head()

# Basic Checks 

In [None]:
# Check for missing values in each column
def missing_values(data):
    missing_vals = data.isnull().sum()
    print("Missing values per column:\n", missing_vals)
missing_values(data)

In [None]:
# Check for duplicated rows
def duplicated_vals(data):
    duplicated_rows = data.duplicated().sum()
    print("Number of duplicated rows:", duplicated_rows)
duplicated_vals(data)

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
categorical_cols = data.select_dtypes(include=['object', 'category']).columns.tolist()
print("categorical columns are : ",categorical_cols)
for i in data[categorical_cols]:
    print("\n")
    print(f"Distinct Value Counts of {i} :\n",data[i].value_counts())
    print("\n")

In [None]:
import pandas as pd
import numpy as np

# Function to count outliers based on the 99th percentile
def count_outliers(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    outliers = {}
    for col in numeric_cols:
        upper_limit = df[col].quantile(0.99)
        outliers[col] = (df[col] > upper_limit).sum()
    return outliers

# Count outliers
outliers_count = count_outliers(data.drop(columns = 'ID'))

# Print the count of outliers for each numerical column
print("Outliers count based on the 99th percentile:")
for col, count in outliers_count.items():
    print(f"{col}: {count}")


In [None]:
# Function to remove outliers based on the 99th percentile
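def remove_outliers(df):
    # Never filter on the label itself
    numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('Target', errors='ignore')
    # Compute every threshold on the unfiltered data so the result matches
    # count_outliers and does not depend on column order
    upper_limits = df[numeric_cols].quantile(0.99)
    keep = (df[numeric_cols] <= upper_limits).all(axis=1)
    return df[keep]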

# Remove outliers
df_cleaned = remove_outliers(data.drop(columns = 'ID'))

# Check the shape of the cleaned dataset
print("Original dataset shape:", data.drop(columns = 'ID').shape)
print("Cleaned dataset shape:", df_cleaned.shape)
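
When trimming several columns by percentile, it matters whether the thresholds are computed once on the full data or recomputed after each column's filter. The following is a minimal sketch on toy data; the helper names `trim_sequential` and `trim_upfront` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame: both columns hold 1..100 (illustrative values only)
toy = pd.DataFrame({"a": range(1, 101), "b": range(1, 101)})

def trim_sequential(df, q=0.99):
    # Recompute the threshold on whatever rows survived the previous filter
    for col in df.columns:
        df = df[df[col] <= df[col].quantile(q)]
    return df

def trim_upfront(df, q=0.99):
    # Compute all thresholds on the unfiltered data, then filter once
    limits = df.quantile(q)
    return df[(df <= limits).all(axis=1)]

print("sequential:", len(trim_sequential(toy)))
print("up-front:  ", len(trim_upfront(toy)))
```

Here the sequential variant keeps 98 rows and the up-front variant keeps 99: once the first filter drops a row, the second column's 99th percentile shifts down and removes an extra row. Computing thresholds up front keeps the removal consistent with the per-column outlier counts.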

In [None]:
df = df_cleaned

# Exploratory Data Analysis

In [None]:
# the distribution of individuals based on gender
colors = ['#1f77b4', '#ff7f0e']
plt.figure(figsize=(8, 6))
# Count males and females and compute their proportions
# (assumes Gender is encoded as 1 = male, 0 = female)
gender_counts = data['Gender'].value_counts()
total_individuals = len(data)
male_count = gender_counts[1]
female_count = gender_counts[0]
male_proportion = male_count / total_individuals
female_proportion = female_count / total_individuals
sns.barplot(x=['Male', 'Female'], y=[male_count, female_count], palette=colors)
for index, value in enumerate([male_count, female_count]):
    plt.text(index, value + 5, str(value), ha='center', va='bottom', fontsize=12)
plt.xlabel('Gender', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Distribution of Individuals based on Gender', fontsize=14)
sns.despine()
plt.show()

In [None]:
df = data

# Set the aesthetic style of the plots
sns.set_style("whitegrid")
plt.figure(figsize=(12, 6))

# Plot the distribution of the target variable
plt.subplot(2, 3, 1)
sns.countplot(data=df, x='Target', palette='viridis')
plt.title('Distribution of Target Variable')

# Plot the distribution of numeric variables
plt.subplot(2, 3, 2)
sns.histplot(df['Age'], kde=True, bins=30, color='blue')
plt.title('Distribution of Age')

plt.subplot(2, 3, 3)
sns.histplot(df['Total_income'], kde=True, bins=30, color='green')
plt.title('Distribution of Total Income')

# Pairplot of selected features
selected_features = ['Age', 'Total_income', 'Years_employed', 'Target']
sns.pairplot(df[selected_features], hue='Target', palette='coolwarm')
plt.suptitle('Pairplot of Selected Features', y=1.02)

# Boxplot of Total Income by Target
plt.figure(figsize=(10, 5))
sns.boxplot(x='Target', y='Total_income', data=df, palette='coolwarm')
plt.title('Total Income by Target')

# Boxplot of Age by Target
plt.figure(figsize=(10, 5))
sns.boxplot(x='Target', y='Age', data=df, palette='coolwarm')
plt.title('Age by Target')

# Countplot of categorical variables
plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
sns.countplot(data=df, x='Gender', hue='Target', palette='viridis')
plt.title('Gender by Target')

plt.subplot(2, 3, 2)
sns.countplot(data=df, x='Own_car', hue='Target', palette='viridis')
plt.title('Own Car by Target')

plt.subplot(2, 3, 3)
sns.countplot(data=df, x='Own_property', hue='Target', palette='viridis')
plt.title('Own Property by Target')

plt.subplot(2, 3, 4)
sns.countplot(data=df, x='Work_phone', hue='Target', palette='viridis')
plt.title('Work Phone by Target')

plt.subplot(2, 3, 5)
sns.countplot(data=df, x='Phone', hue='Target', palette='viridis')
plt.title('Phone by Target')

plt.subplot(2, 3, 6)
sns.countplot(data=df, x='Email', hue='Target', palette='viridis')
plt.title('Email by Target')

plt.tight_layout()
plt.show()


In [None]:
# Calculate the correlation matrix
# ID is an identifier, so its correlations carry no meaning; drop it first
corr = data.drop(columns='ID').select_dtypes(include=['int64', 'float64']).corr()
corr
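
The full matrix can be hard to scan; the column of most interest is usually the one for `Target`. A minimal sketch of ranking features by their correlation with the target follows. It runs on synthetic data rather than the real dataset, and the way the synthetic `Target` is generated is an assumption made purely so that one correlation stands out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns (not the real dataset)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Total_income": rng.normal(50_000, 10_000, n),
    "Age": rng.integers(21, 70, n),
})
# Make the synthetic target depend on income so one correlation stands out
toy["Target"] = (toy["Total_income"] + rng.normal(0, 5_000, n) > 55_000).astype(int)

# Correlation of each feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr()["Target"]
    .drop("Target")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

The same three-step chain (`corr()`, select the `Target` column, sort by absolute value) can be applied to the `corr` matrix computed above.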

# Model Building

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

# Step 2: Use the outlier-filtered data (ID was already dropped) and keep numeric columns
df = df_cleaned.select_dtypes(include=['int64', 'float64'])

# Step 3: Split the dataset into features (X) and target (y)
X = df.drop('Target', axis=1)
y = df['Target']

In [None]:
# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Step 5: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
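
To make the scaling step concrete, here is a minimal NumPy sketch of the arithmetic that standardization performs: subtract the training mean and divide by the training standard deviation (population form, `ddof=0`). The toy arrays are illustrative values only.

```python
import numpy as np

# Toy train/test features (illustrative values only)
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0], [10.0, 1000.0]])

# Statistics come from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std  # test rows reuse the training statistics

print(X_train_std.mean(axis=0))  # approximately 0 per column
print(X_train_std.std(axis=0))   # approximately 1 per column
print(X_test_std)                # generally not mean 0 / std 1
```

Fitting on the training rows and only transforming the test rows, as the cell above does, keeps information about the test set out of the preprocessing.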

In [None]:
# Step 6: Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
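
The core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The following is a simplified illustration of that interpolation step, not the imbalanced-learn implementation; the helper name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2D (illustrative values only)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point by interpolating towards a near neighbour."""
    base = points[rng.integers(len(points))]
    # Distances from the base point to every minority point
    dists = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbours, skipping the base point itself
    # (assumes the points are distinct, so only the base has distance 0)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    # Random position on the segment between base and neighbour
    lam = rng.random()
    return base + lam * (neighbour - base)

synthetic = np.array([smote_like_sample(minority, k=3, rng=rng) for _ in range(5)])
print(synthetic)
```

Because each synthetic point is a convex combination of two real minority points, it always lies between them. This is also why the features are scaled before SMOTE is applied: the neighbour search depends on distances.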


In [None]:
# Step 7: Train the Random Forest classifier to find feature importance
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_smote, y_train_smote)

In [None]:
# Step 8: Extract feature importance
feature_importances = rf_classifier.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance')
plt.show()

# Show the 10 most important features
top_features = importance_df.head(10)
print(top_features)

In [None]:
# Select top 10 important features
top_10_features = importance_df.head(10)['Feature'].tolist()

# Step 9: Reduce the dataset to top 10 features
X_train_top10 = X_train[top_10_features]
X_test_top10 = X_test[top_10_features]

In [None]:
# Step 10: Scale the top 10 features
X_train_top10_scaled = scaler.fit_transform(X_train_top10)
X_test_top10_scaled = scaler.transform(X_test_top10)

In [None]:
# Step 11: Apply SMOTE to the top 10 features
X_train_top10_smote, y_train_top10_smote = smote.fit_resample(X_train_top10_scaled, y_train)

In [None]:
# Step 12: Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)

grid_search.fit(X_train_top10_smote, y_train_top10_smote)
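
Before launching the search it helps to know how much work it implies. The sketch below counts the candidate combinations in the grid above and multiplies by the number of cross-validation folds; the variable names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With this grid there are 3 × 4 × 3 × 3 = 108 candidates, so 5-fold cross-validation trains 540 forests, plus one final refit on the full training set with the best parameters.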

In [None]:
# Step 13: Train the Random Forest classifier with best parameters
best_params = grid_search.best_params_
rf_classifier_tuned = RandomForestClassifier(**best_params, random_state=42)
rf_classifier_tuned.fit(X_train_top10_smote, y_train_top10_smote)

In [None]:
# Step 14: Make predictions using the tuned model
y_pred_tuned = rf_classifier_tuned.predict(X_test_top10_scaled)

# Evaluation

In [None]:
# Step 15: Evaluate the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f'Accuracy with Tuned Model: {accuracy_tuned:.2f}')
print('Classification Report with Tuned Model:')
print(classification_report(y_test, y_pred_tuned))
print('Confusion Matrix with Tuned Model:')
print(confusion_matrix(y_test, y_pred_tuned))
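
To read the printed confusion matrix, recall that scikit-learn puts true classes in rows and predicted classes in columns, so for labels `[0, 1]` the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives the main metrics from a hypothetical matrix; the counts are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[80, 10],
               [5, 25]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

On an imbalanced target, accuracy alone can look good while recall on the minority class is poor, which is why the classification report and confusion matrix are printed alongside it.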