<a href="https://colab.research.google.com/gist/sanukjoseph/9bb0d86c5899408f524d055101c46b8a/credit_card_approval_predication.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Section 1: Questions to Answer**

1.  **Why is your proposal important in today's world? How is predicting a good client valuable to a bank?**

 In today's world, with an ever-growing volume of financial transactions and data, a robust credit card approval prediction model is crucial for banks.
 It allows banks to assess and predict the creditworthiness of potential clients, reducing the risk of defaults and optimizing the allocation of credit.
 This is essential in a dynamic financial landscape to ensure responsible lending and mitigate financial risks.

2.   **How is it going to impact the banking sector?**

 Implementing a credit card approval prediction model can significantly impact the banking sector by streamlining the credit evaluation process.
 It enables quicker and more accurate decision-making, leading to improved customer satisfaction, reduced operational costs, and minimized financial risks.
 The model's effectiveness in assessing creditworthiness can contribute to overall financial stability in the banking sector.

3.  **If any, what is the gap in the knowledge, or how can your proposed method help any bank in India in the future?**

 The proposed method addresses the gap in traditional credit assessment processes by leveraging machine learning techniques.
 By incorporating diverse data sources and employing advanced algorithms, the model can provide a more holistic view of a client's creditworthiness.
 In the future, this approach can be instrumental for banks in India by enhancing the accuracy of credit decisions, adapting to evolving financial landscapes, and facilitating more inclusive lending practices.




# **Section 2: Initial Hypothesis (or hypotheses)**

1.  Here you have to make some assumptions based on the questions you want to address, following either the DA track or the ML track.

 We assume that certain patterns and relationships exist in the data and can be uncovered through data analysis and machine learning.
 If the ML track is followed, perform part 'i' with multiple machine learning models and carry out all steps required to test the hypotheses and validate your model.
 Why is your model better than any other?
 Prove it with relevant cost functions and, if possible, with graphs.
 The models used are Random Forest, Gradient Boosting, Support Vector Machine, and Logistic Regression.
 The hypothesis is that these models, with appropriate tuning, can capture complex relationships in the data.
 The justification for model selection is based on the comparison of accuracy scores, execution times, and consideration of each model's strengths and weaknesses.
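
 Accuracy alone can hide differences between models when rejected applications are rare, so the comparison may also need precision, recall, and F1 for the rejected class. The following is a minimal, dependency-free sketch of those metrics. The function name `binary_metrics` and the toy labels are illustrative assumptions, and it takes `label == 1` to mean a rejected (bad credit) application, as the queries later in this notebook do.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical toy labels (not taken from the dataset): 1 = rejected, 0 = approved
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_approve = [0] * 10                      # trivial baseline
some_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]    # catches one of the two rejections

baseline = binary_metrics(y_true, always_approve)
model = binary_metrics(y_true, some_model)
print(baseline)
print(model)
```

 In this toy case both predictors reach the same accuracy, but only the second one identifies any rejected application, which is why recall and F1 are worth reporting next to accuracy.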

2.  From step 1, you may see some relationships that you want to explore and will develop a belief about the data.

 Initial belief: Features such as annual income, education level, and employment status may have a significant impact on predicting credit card approval.
 Further exploration and analysis will validate or refine this belief.
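
 One simple way to probe this belief is to compare the rejection rate across the values of a single feature. The sketch below uses only the standard library; the helper `rejection_rate_by` and the rows are hypothetical, and in the notebook itself a pandas `groupby` on `merged_data` would play the same role.

```python
from collections import defaultdict

def rejection_rate_by(records, key, label_key="label"):
    """Share of rows with label == 1 (rejected) for each value of `key`."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        rejected[row[key]] += int(row[label_key] == 1)
    return {value: rejected[value] / totals[value] for value in totals}

# Hypothetical rows reusing the notebook's column names; the values are made up.
records = [
    {"EDUCATION": "Higher education", "label": 0},
    {"EDUCATION": "Higher education", "label": 0},
    {"EDUCATION": "Higher education", "label": 1},
    {"EDUCATION": "Higher education", "label": 0},
    {"EDUCATION": "Secondary / secondary special", "label": 1},
    {"EDUCATION": "Secondary / secondary special", "label": 0},
]
rates = rejection_rate_by(records, "EDUCATION")
print(rates)
```

 A large gap between groups would support the belief; similar rates would suggest the feature carries little signal on its own.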

# **Section 3: Data analysis approach**

1.  What method would you take to prove or disprove your hypothesis?

 Use a combination of exploratory data analysis (EDA) and machine learning techniques.
 EDA will involve visualizing and understanding the distribution of key features, identifying outliers, and exploring correlations.
 Machine learning models will be trained and evaluated to confirm or refute the initial hypotheses.
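
 The outlier and correlation checks mentioned above can be sketched with the standard library alone. The 1.5 x IQR rule, the helper names `iqr_outliers` and `pearson`, and the sample values are assumptions chosen for illustration, not steps taken from the notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical values standing in for Annual_income and Family_Members
incomes = [90_000, 120_000, 135_000, 150_000, 180_000, 210_000, 1_500_000]
family_members = [1, 2, 2, 3, 3, 4, 2]

print(iqr_outliers(incomes))
print(round(pearson(incomes, family_members), 3))
```

 A single extreme income can dominate a correlation coefficient, so it is usually worth checking for outliers before reading much into the correlations.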

2.  Which feature engineering technique is right for your project?

 Techniques such as encoding categorical variables, imputing missing values, and normalizing numerical features will be used.
 Scaling and feature selection will be applied according to the requirements of the selected machine learning models.
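
 The three techniques can be illustrated without any library. This is a sketch under stated assumptions: the helpers `fit_numeric`, `transform_numeric`, and `one_hot` are hypothetical, the values are made up, and the notebook itself relies on `get_dummies`, `SimpleImputer`, and `StandardScaler` instead. The point it shows is that the imputation mean and the scaling parameters are learned from the training data only and then reused on the test data.

```python
import statistics

def fit_numeric(train_values):
    """Learn the imputation mean and the scaling parameters from training data only."""
    observed = [v for v in train_values if v is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed) or 1.0  # fall back to 1.0 for a constant column
    return mean, std

def transform_numeric(values, mean, std):
    """Replace missing values with the training mean, then standardize."""
    return [((mean if v is None else v) - mean) / std for v in values]

def one_hot(values, categories):
    """Encode each value as a 0/1 vector over the known categories."""
    return [[int(v == c) for c in categories] for v in values]

# Hypothetical toy columns; None marks a missing value.
train_income = [100_000, None, 200_000, 300_000]
test_income = [None, 400_000]
mean, std = fit_numeric(train_income)
train_scaled = transform_numeric(train_income, mean, std)
test_scaled = transform_numeric(test_income, mean, std)
print(train_scaled)
print(test_scaled)

categories = sorted({"Working", "Pensioner", "State servant"})
encoded = one_hot(["Pensioner", "Working"], categories)
print(categories, encoded)
```

 In this sketch a category that was not seen when `categories` was built encodes to an all-zero row, which is one common convention; other choices are possible.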

3.  Please justify your data analysis approach.

 The chosen approach is reasonable because it combines the strengths of exploratory data analysis and machine learning.
 EDA helps discover patterns and relationships, while machine learning models provide quantitative predictions.
 This dual approach enables comprehensive data insights and accurate credit card approval predictions.

4.  Identify important patterns in your data using EDA methods to demonstrate your results.

 Initial exploration will focus on visualizing the distribution of income, understanding the impact of education on credit approval, and examining correlations between various characteristics.
 EDA will provide insight into the structure of the data and guide further analysis and model development.

In [None]:
# Section 1: Import Libraries
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from pandas import read_csv, merge, get_dummies, DataFrame
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from tabulate import tabulate
from sklearn.svm import SVC
from time import time

In [None]:
# Section 2: Load and Merge Data
credit_card_data = read_csv('Credit_card.csv')
credit_label_data = read_csv('Credit_card_label.csv')
merged_data = merge(credit_card_data, credit_label_data, on='Ind_ID')

In [None]:
# Section 3: Data Cleaning and Preprocessing
# Drop every row that contains a missing value in any column
merged_data.dropna(inplace=True)

In [None]:
# Section 4: Feature Engineering
features = [
    'GENDER', 'Car_Owner', 'Propert_Owner', 'CHILDREN', 'Annual_income', 'Type_Income',
    'EDUCATION', 'Marital_status', 'Housing_type', 'Birthday_count', 'Employed_days',
    'Mobile_phone', 'Work_Phone', 'Phone', 'Type_Occupation', 'Family_Members',
]
X = get_dummies(merged_data[features])
y = merged_data['label']

In [None]:
# Section 5: Split Data and Standardize Features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Section 6: Machine Learning Models
models = [
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('SVM', SVC(random_state=42)),
    ('Logistic Regression', LogisticRegression(random_state=42))
]

In [None]:
# Section 7: Test Different Models and Print Results
model_results = []
for model_name, model in models:
    start_time = time()

    # Impute missing values (effectively a no-op here: rows with missing values were already dropped in Section 3)
    imputer = SimpleImputer(strategy='mean')
    X_train_imputed = imputer.fit_transform(X_train_scaled)
    X_test_imputed = imputer.transform(X_test_scaled)

    # Fit the model
    model.fit(X_train_imputed, y_train)

    # Predict and evaluate
    y_pred = model.predict(X_test_imputed)
    accuracy = accuracy_score(y_test, y_pred)

    model_results.append({'Model': model_name, 'Accuracy': accuracy, 'Execution Time': time() - start_time})

# Print Model Results
print("\nModel Results:")
model_results_df = DataFrame(model_results)
print(tabulate(model_results_df, headers='keys', tablefmt='fancy_grid'))


Model Results:
╒════╤═════════════════════╤════════════╤══════════════════╕
│    │ Model               │   Accuracy │   Execution Time │
╞════╪═════════════════════╪════════════╪══════════════════╡
│  0 │ Random Forest       │   0.921951 │        0.229182  │
├────┼─────────────────────┼────────────┼──────────────────┤
│  1 │ Gradient Boosting   │   0.887805 │        0.236977  │
├────┼─────────────────────┼────────────┼──────────────────┤
│  2 │ SVM                 │   0.873171 │        0.0365639 │
├────┼─────────────────────┼────────────┼──────────────────┤
│  3 │ Logistic Regression │   0.868293 │        0.0143178 │
╘════╧═════════════════════╧════════════╧══════════════════╛


In [None]:
# Section 8: Justification of the Most Appropriate Model
print("\nJustification of the most appropriate model:")
print("Random Forest is chosen because it achieved the highest test accuracy (0.922); Gradient Boosting (0.888) is the runner-up. Both tree ensembles can model non-linear feature interactions and expose feature importances.")


Justification of the most appropriate model:
Random Forest is chosen because it achieved the highest test accuracy (0.922); Gradient Boosting (0.888) is the runner-up. Both tree ensembles can model non-linear feature interactions and expose feature importances.


In [None]:
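# Section 9: Steps to Improve Model Accuracy
# Try a larger, depth-limited forest. On this split it scores 0.917, slightly below the
# untuned forest (0.922), so these settings alone do not improve accuracy.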
rf_tuned = RandomForestClassifier(n_estimators=150, max_depth=15, random_state=42)
rf_tuned.fit(X_train_scaled, y_train)
rf_tuned_pred = rf_tuned.predict(X_test_scaled)
rf_tuned_accuracy = accuracy_score(y_test, rf_tuned_pred)
print(f"\nTuned Random Forest Model Accuracy: {rf_tuned_accuracy}")


Tuned Random Forest Model Accuracy: 0.9170731707317074
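
A difference as small as 0.917 versus 0.922 on a single 80/20 split may be within split-to-split noise, so hyperparameter choices are usually compared with k-fold cross-validation. The sketch below only generates the fold indices. The helper `k_fold_indices` is hypothetical, and in practice scikit-learn's `cross_val_score` or `GridSearchCV` would handle this step.

```python
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Each candidate setting would be fitted on every `train_idx` and scored on the matching `test_idx`, and the mean and spread of those scores compared across settings.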


In [None]:
# Section 10: SQL Queries using Pandas
print("\nSQL Queries using Pandas:")


SQL Queries using Pandas:


In [None]:
# 1. Average Income by Type
avg_income_by_type = merged_data.groupby('Type_Income')['Annual_income'].mean()
print(f"1. Average Income by Type:\n{tabulate(DataFrame(avg_income_by_type).reset_index(), headers='keys', tablefmt='fancy_grid') if not avg_income_by_type.empty else '0 data found'}")

1. Average Income by Type:
╒════╤══════════════════════╤═════════════════╕
│    │ Type_Income          │   Annual_income │
╞════╪══════════════════════╪═════════════════╡
│  0 │ Commercial associate │          233367 │
├────┼──────────────────────┼─────────────────┤
│  1 │ Pensioner            │          285300 │
├────┼──────────────────────┼─────────────────┤
│  2 │ State servant        │          216928 │
├────┼──────────────────────┼─────────────────┤
│  3 │ Working              │          182587 │
╘════╧══════════════════════╧═════════════════╛
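
For comparison, the same aggregation can be written as an actual SQL query. This sketch uses Python's built-in `sqlite3` with an in-memory table; the table name `credit` and the inserted rows are hypothetical stand-ins for the merged data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit (Ind_ID INTEGER, Type_Income TEXT, Annual_income REAL)"
)
# Hypothetical rows; the real values come from Credit_card.csv.
conn.executemany(
    "INSERT INTO credit VALUES (?, ?, ?)",
    [
        (1, "Working", 100000.0),
        (2, "Working", 200000.0),
        (3, "Pensioner", 90000.0),
    ],
)
# SQL counterpart of groupby('Type_Income')['Annual_income'].mean()
query = """
    SELECT Type_Income, AVG(Annual_income) AS avg_income
    FROM credit
    GROUP BY Type_Income
    ORDER BY Type_Income
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```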


In [None]:
# 2. Female Owners of Cars and Property
female_owners = merged_data[(merged_data['GENDER'] == 'F') & (merged_data['Car_Owner'] == 'Y') & (merged_data['Propert_Owner'] == 'Y')]
print(f"2. Female Owners of Cars and Property:\n{tabulate(female_owners, headers='keys', tablefmt='fancy_grid') if not female_owners.empty else '0 data found'}")

2. Female Owners of Cars and Property:
(output truncated; columns: Ind_ID, GENDER, Car_Owner, Propert_Owner, CHILDREN, Annual_income, Type_Income, EDUCATION, Marital_status, Housing_type, Birthday_count, Employed_days, Mobile_phone, Work_Phone, Phone, EMAIL_ID, Type_Occupation, Family_Members, label)

In [None]:
# 3. Male Customers Staying with Their Families
male_with_family = merged_data[(merged_data['GENDER'].isin(['M'])) & (merged_data['Marital_status'].isin(['Married', 'Civil marriage', 'Widow'])) & (merged_data['Family_Members'] > 1)]
print(f"3. Male Customers Staying with Their Families:\n{tabulate(male_with_family, headers='keys', tablefmt='fancy_grid') if not male_with_family.empty else '0 data found'}")

3. Male Customers Staying with Their Families:
(output truncated; columns: Ind_ID, GENDER, Car_Owner, Propert_Owner, CHILDREN, Annual_income, Type_Income, EDUCATION, Marital_status, Housing_type, Birthday_count, Employed_days, Mobile_phone, Work_Phone, Phone, EMAIL_ID, Type_Occupation, Family_Members, label)

In [None]:
# 4. Top Five People with the Highest Income
top_five_income = merged_data.nlargest(5, 'Annual_income')
print(f"4. Top Five People with the Highest Income:\n{tabulate(top_five_income, headers='keys', tablefmt='fancy_grid') if not top_five_income.empty else '0 data found'}")

4. Top Five People with the Highest Income:
(output truncated; columns: Ind_ID, GENDER, Car_Owner, Propert_Owner, CHILDREN, Annual_income, Type_Income, EDUCATION, Marital_status, Housing_type, Birthday_count, Employed_days, Mobile_phone, Work_Phone, Phone, EMAIL_ID, Type_Occupation, Family_Members, label)

In [None]:
# 5. Number of Married People with Bad Credit
married_bad_credit = merged_data[(merged_data['Marital_status'].isin(['Married'])) & (merged_data['label'] == 1)].shape[0]
print(f"5. Number of Married People with Bad Credit: {married_bad_credit}")

5. Number of Married People with Bad Credit: 71


In [None]:
# 6. Highest Education Level and Total Count
# Note: this reports the most frequent education level and how many customers have it.
education_counts = merged_data['EDUCATION'].value_counts()
most_common_education = education_counts.idxmax()
most_common_education_count = education_counts.max()
print(f"6. Highest Education Level and Total Count: {most_common_education}, ({most_common_education_count})")

6. Highest Education Level and Total Count: Secondary / secondary special, (677)


In [None]:
# 7. Bad Credit Count Between Married Males and Females
married_bad_credit_gender = merged_data[(merged_data['Marital_status'].isin(['Married'])) & (merged_data['label'] == 1)].groupby('GENDER').size()
print(f"7. Bad Credit Count Between Married Males and Females:\n{tabulate(DataFrame(married_bad_credit_gender).reset_index(), headers='keys', tablefmt='fancy_grid') if not married_bad_credit_gender.empty else '0 data found'}")

7. Bad Credit Count Between Married Males and Females:
╒════╤══════════╤═════╕
│    │ GENDER   │   0 │
╞════╪══════════╪═════╡
│  0 │ F        │  30 │
├────┼──────────┼─────┤
│  1 │ M        │  41 │
╘════╧══════════╧═════╛
