<div style='background-color:skyblue;padding:14px'>
<p style='text-align:center'><font size='8px' color='#FFFFFF' face='Product Sans'><b>Project Name:- Credit Card Approval Prediction Dataset</b></font></p></div>



---

## Phase 1: Data Exploration & Initial Assessment
###  Project Context: Credit Card Eligibility Assessment

Credit card issuing companies need to ensure that applicants are financially reliable before approving a credit card. A poor approval decision may result in **defaults**, which lead to financial losses. Therefore, analyzing customer **application information** (demographics, income, family details) along with their **credit repayment history** is essential to decide whether a client is **eligible** or **not eligible** for a credit card.

###  Objective

* To determine **credit card eligibility** of applicants based on their financial and credit history data.
* Eligibility is defined by the customer’s repayment behavior:

  * **Eligible (DEFAULT = 0)** → The client has no serious overdue history.
  * **Not Eligible (DEFAULT = 1)** → The client has a history of serious delinquency (overdue 2+ months).

###  Dataset Details

The project uses two datasets:

1. **Application Record** → demographic and socio-economic details such as gender, age, income, employment, occupation, family size, housing, car & real estate ownership.
2. **Credit Record** → monthly repayment status (`STATUS`), credit history length (`MONTHS_BALANCE`).

After merging:

* Each customer has a **single profile row**.
* A derived column **`DEFAULT`** acts as the target:

  * `0` → customer is **eligible**.
  * `1` → customer is **not eligible**.

###  Business Problem

The task is to identify **which applicants can be safely approved** for a credit card, and **which ones should be rejected** due to past repayment problems.

###  Expected Outcome

* A dataset that clearly marks each customer as **eligible or not eligible**.
* Insights into applicant attributes associated with **ineligibility**, for example low income or unstable employment.
* A decision-support tool for credit card approval teams.

---


### 1. Load the Dataset

##### Importing required modules

In [2]:
import numpy as np
import pandas as pd
import random
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go 
import plotly.express as px
import plotly.io as pio

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

import warnings
warnings.filterwarnings('ignore')


In [4]:
filepath=r"C:\Users\akhil\INNOMATICS\Batch 406\EDA  in python\Pandas\Datasets\merged_application_credit_with_status (1).csv"
df_cc=pd.read_csv(filepath)

#### Note:- `Here we import the necessary modules and load the dataset.`

In [5]:
df_cc.describe()

| | ID | CNT_CHILDREN | AMT_INCOME_TOTAL | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | WORST_STATUS_NUM | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 |
| mean | 5078227.0 | 0.430315 | 186685.7 | -15975.173382 | 59262.935568 | 1.0 | 0.225526 | 0.294813 | 0.089722 | 2.198453 | 0.154017 | 0.016897 |
| std | 41875.24 | 0.742367 | 101789.2 | 4200.549944 | 137651.334859 | 0.0 | 0.417934 | 0.455965 | 0.285787 | 0.911686 | 0.523378 | 0.128886 |
| min | 5008804.0 | 0.0 | 27000.0 | -25152.0 | -15713.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5042028.0 | 0.0 | 121500.0 | -19438.0 | -3153.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 50% | 5074614.0 | 0.0 | 157500.0 | -15563.0 | -1552.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 75% | 5115396.0 | 1.0 | 225000.0 | -12462.0 | -408.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| max | 5150487.0 | 19.0 | 1575000.0 | -7489.0 | 365243.0 | 1.0 | 1.0 | 1.0 | 1.0 | 20.0 | 5.0 | 1.0 |


In [6]:
df_cc.columns

Index(['ID', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'WORST_STATUS_NUM',
       'LAST_STATUS', 'DEFAULT'],
      dtype='object')

---

### 2. View Sample Data


In [7]:
df_cc.shape 

# Returns the number of rows and columns as a tuple: (rows, columns)

(36457, 21)

In [8]:
df_cc.head()

| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | ... | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | WORST_STATUS_NUM | LAST_STATUS | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | ... | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 | 1 | C | 0 |
| 1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | ... | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 | 1 | C | 0 |
| 2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | ... | -1134 | 1 | 0 | 0 | 0 | Security staff | 2.0 | 0 | C | 0 |
| 3 | 5008808 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | ... | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | 0 | 0 | 0 |
| 4 | 5008809 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | ... | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 | 0 | X | 0 |



---
##  **Column Descriptions**


###  **Identifiers**

* **`ID`** → Unique identifier for each customer.

---

###  **Demographics**

* **`CODE_GENDER`** → Gender of applicant (`M` = Male, `F` = Female).
* **`DAYS_BIRTH`** → Age of customer in days, stored as a negative count of days (negate and divide by 365.25 to get age in years).

---

###  **Family & Dependents**

* **`NAME_FAMILY_STATUS`** → Marital status (`Single / not married`, `Married`, `Civil marriage`, etc.).
* **`CNT_CHILDREN`** → Number of children the customer has.
* **`CNT_FAM_MEMBERS`** → Total family size (including applicant).

---

###  **Financial Information**

* **`AMT_INCOME_TOTAL`** → Annual income of the applicant.
* **`NAME_INCOME_TYPE`** → Income source (`Working`, `State servant`, `Pensioner`, `Commercial associate`, etc.).
* **`NAME_EDUCATION_TYPE`** → Education level (`Higher education`, `Secondary / secondary special`, `Incomplete higher`, etc.).
* **`NAME_HOUSING_TYPE`** → Housing arrangement (`House / apartment`, `Rented apartment`, `Municipal apartment`, etc.).

---

###  **Employment & Occupation**

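* **`DAYS_EMPLOYED`** → Number of days employed (negative value; larger magnitude = longer employment). Positive values are a placeholder rather than a real duration: the maximum in this dataset is 365243, about 1,000 years, which most likely marks applicants who are not currently employed.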
* **`OCCUPATION_TYPE`** → Type of occupation (`Managers`, `Laborers`, `Sales staff`, etc.).
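
The day-based columns are easier to read once converted to years. The following is a minimal sketch on a tiny illustrative frame whose values are taken from the summary statistics shown earlier. The derived column names (`AGE_YEARS`, `YEARS_EMPLOYED`, `IS_EMPLOYED`) are hypothetical, and treating a positive `DAYS_EMPLOYED` as "not employed" is an assumption, not something the dataset file itself states.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; values come from the min/median/max rows of describe().
demo = pd.DataFrame({
    "DAYS_BIRTH": [-7489, -15563, -25152],
    "DAYS_EMPLOYED": [-408, -1552, 365243],
})

# Age in years: negate the backward day count and divide by 365.25.
demo["AGE_YEARS"] = (-demo["DAYS_BIRTH"] / 365.25).round(1)

# Assumption: a positive DAYS_EMPLOYED is a placeholder for "not employed",
# so it becomes NaN instead of a tenure of roughly 1,000 years.
employed_days = demo["DAYS_EMPLOYED"].where(demo["DAYS_EMPLOYED"] <= 0)
demo["YEARS_EMPLOYED"] = (-employed_days / 365.25).round(1)
demo["IS_EMPLOYED"] = demo["DAYS_EMPLOYED"] <= 0

print(demo)
```

Keeping `IS_EMPLOYED` as a separate flag preserves the information carried by the placeholder after the tenure itself has been set to NaN.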

---

###  **Assets Ownership**

* **`FLAG_OWN_CAR`** → Whether the applicant owns a car (`Y`/`N`).
* **`FLAG_OWN_REALTY`** → Whether the applicant owns real estate (`Y`/`N`).

---

###  **Contact Information**

* **`FLAG_MOBIL`** → Has mobile phone (1 = Yes, 0 = No).
* **`FLAG_WORK_PHONE`** → Has work phone (1 = Yes, 0 = No).
* **`FLAG_PHONE`** → Has personal phone (1 = Yes, 0 = No).
* **`FLAG_EMAIL`** → Has email (1 = Yes, 0 = No).

---

###  **Credit History**

* **`WORST_STATUS_NUM`** → Worst delinquency level observed:

  * 0 = no overdue
  * 1 = 1 month overdue
  * 2 = 2 months overdue
  * 3 = 3 months overdue
  * 4 = 4 months overdue
  * 5 = 120+ days overdue

* **`LAST_STATUS`** → Most recent repayment status (from the latest month).

---

###  **Target Variable**

* **`DEFAULT`** → Credit card eligibility:

  * `0` → **Eligible** (no serious default history).
  * `1` → **Not Eligible** (defaulted in past with ≥2 months overdue).


###  **STATUS Codes Explained**

* **`0`** → **Paid on time** (no DPD, i.e., “days past due”).
* **`1`** → **1 month overdue**.
* **`2`** → **2 months overdue**.
* **`3`** → **3 months overdue**.
* **`4`** → **4 months overdue**.
* **`5`** → **Overdue 120+ days** (serious delinquency).
* **`C`** → **Paid off** (the balance was paid off that month).
* **`X`** → **No loan for the month** (customer had no active credit that month).

---

###  How we use it in this project:

* If a customer ever has **`2, 3, 4, or 5`** → they are considered **defaulters (DEFAULT = 1)**.
* If they only have `0, 1, C, X` → they are considered **non-defaulters (DEFAULT = 0)**.
* `LAST_STATUS` just tells you the **most recent repayment condition**.
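
The notebook loads an already merged file, so the code that produced `WORST_STATUS_NUM`, `LAST_STATUS` and `DEFAULT` is not shown. The sketch below illustrates how the rules above could be applied to a monthly credit record. The miniature `credit` frame is made up, and two points are assumptions: `C` and `X` are mapped to severity 0, and the month with the largest `MONTHS_BALANCE` is taken as the most recent one.

```python
import pandas as pd

# Hypothetical miniature credit record: one row per customer per month.
credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3],
    "MONTHS_BALANCE": [-2, -1, 0, -1, 0, -1, 0],
    "STATUS": ["0", "2", "C", "X", "1", "C", "X"],
})

# Assumption: C and X carry no overdue, so they map to severity 0.
severity = {"X": 0, "C": 0, "0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5}
credit["STATUS_NUM"] = credit["STATUS"].map(severity)

# Sort so that the last row per customer is the most recent month.
credit = credit.sort_values(["ID", "MONTHS_BALANCE"])

# Collapse the monthly rows into a single profile row per customer.
profile = credit.groupby("ID").agg(
    WORST_STATUS_NUM=("STATUS_NUM", "max"),
    LAST_STATUS=("STATUS", "last"),
).reset_index()

# DEFAULT = 1 if the customer was ever 2 or more months overdue.
profile["DEFAULT"] = (profile["WORST_STATUS_NUM"] >= 2).astype(int)

print(profile)
```

Customer 1 is flagged as a defaulter even though the latest status is `C`, which shows why `DEFAULT` is driven by the worst status rather than by `LAST_STATUS`.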

---





#### Note:- `The dataset has 21 columns; the short descriptions above explain what each one represents.`

---

### 3. Check Structure

In [9]:
df_cc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36457 entries, 0 to 36456
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   36457 non-null  int64  
 1   CODE_GENDER          36457 non-null  object 
 2   FLAG_OWN_CAR         36457 non-null  object 
 3   FLAG_OWN_REALTY      36457 non-null  object 
 4   CNT_CHILDREN         36457 non-null  int64  
 5   AMT_INCOME_TOTAL     36457 non-null  float64
 6   NAME_INCOME_TYPE     36457 non-null  object 
 7   NAME_EDUCATION_TYPE  36457 non-null  object 
 8   NAME_FAMILY_STATUS   36457 non-null  object 
 9   NAME_HOUSING_TYPE    36457 non-null  object 
 10  DAYS_BIRTH           36457 non-null  int64  
 11  DAYS_EMPLOYED        36457 non-null  int64  
 12  FLAG_MOBIL           36457 non-null  int64  
 13  FLAG_WORK_PHONE      36457 non-null  int64  
 14  FLAG_PHONE           36457 non-null  int64  
 15  FLAG_EMAIL           36457 non-null 

### Note:- `df_cc.info()` reports:

* Number of rows and columns
* Column names and non-null counts
* Data types
* Amount of memory used by the dataset

---

### 4. Get Summary Statistics


In [10]:
df_cc.describe()
# Set after describe() has run, so the two-decimal format applies only to output from later cells.
pd.set_option("display.float_format", "{:.2f}".format)


| | ID | CNT_CHILDREN | AMT_INCOME_TOTAL | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | WORST_STATUS_NUM | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 | 36457.0 |
| mean | 5078227.0 | 0.430315 | 186685.7 | -15975.173382 | 59262.935568 | 1.0 | 0.225526 | 0.294813 | 0.089722 | 2.198453 | 0.154017 | 0.016897 |
| std | 41875.24 | 0.742367 | 101789.2 | 4200.549944 | 137651.334859 | 0.0 | 0.417934 | 0.455965 | 0.285787 | 0.911686 | 0.523378 | 0.128886 |
| min | 5008804.0 | 0.0 | 27000.0 | -25152.0 | -15713.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5042028.0 | 0.0 | 121500.0 | -19438.0 | -3153.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 50% | 5074614.0 | 0.0 | 157500.0 | -15563.0 | -1552.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 75% | 5115396.0 | 1.0 | 225000.0 | -12462.0 | -408.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| max | 5150487.0 | 19.0 | 1575000.0 | -7489.0 | 365243.0 | 1.0 | 1.0 | 1.0 | 1.0 | 20.0 | 5.0 | 1.0 |


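### observation:

- `describe() reports count, mean, standard deviation, min, max and quartiles for the numerical features.`
- `It helps detect outliers and placeholder values: CNT_CHILDREN has a maximum of 19 and DAYS_EMPLOYED a maximum of 365243, both far above their 75th percentiles.`
- `DEFAULT has a mean of about 0.017, so only about 1.7% of customers are labelled not eligible and the target is heavily imbalanced.`

---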

### 5. Find Missing Values

In [11]:
df_cc.shape

(36457, 21)

In [12]:
df_cc.isnull().sum()

ID                         0
CODE_GENDER                0
FLAG_OWN_CAR               0
FLAG_OWN_REALTY            0
CNT_CHILDREN               0
AMT_INCOME_TOTAL           0
NAME_INCOME_TYPE           0
NAME_EDUCATION_TYPE        0
NAME_FAMILY_STATUS         0
NAME_HOUSING_TYPE          0
DAYS_BIRTH                 0
DAYS_EMPLOYED              0
FLAG_MOBIL                 0
FLAG_WORK_PHONE            0
FLAG_PHONE                 0
FLAG_EMAIL                 0
OCCUPATION_TYPE        11323
CNT_FAM_MEMBERS            0
WORST_STATUS_NUM           0
LAST_STATUS                0
DEFAULT                    0
dtype: int64

In [13]:
df_cc.shape
(11323/36457)*100

(36457, 21)

31.05850728255205



So, we have:

* **Dataset size:** `36457 rows`
* **Missing in `OCCUPATION_TYPE`:** `11323`
* **Percentage missing:**

$$
\frac{11323}{36457} \times 100 \approx 31.06\%
$$
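
The same percentage can be computed for every column at once instead of by hand. This is a minimal sketch on a made-up four-row frame; the names `demo`, `missing`, `missing_count` and `missing_pct` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Arithmetic check of the figure above, rounded to two decimals.
pct_occupation = round(11323 / 36457 * 100, 2)
print(pct_occupation)

# Made-up frame standing in for df_cc.
demo = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "OCCUPATION_TYPE": ["Sales staff", np.nan, "Security staff", np.nan],
})

# isnull().mean() gives the fraction of missing values per column.
missing = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "missing_pct": (demo.isnull().mean() * 100).round(2),
})

# Keep only the columns that actually have missing values.
missing = missing[missing["missing_count"] > 0].sort_values("missing_pct", ascending=False)
print(missing)
```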

---

#### What this means:

* **About 31% missing is a large share** for a single column.
* If you fill everything with the **mode (most frequent occupation)**, you add 11,323 rows to that one occupation, which distorts its share and any analysis or model built on it.
* If you **drop the column**, you lose a potentially useful feature.
* If you **create a new category "Unknown"**, you keep the data and also preserve the information that occupation wasn’t reported.

---
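
The "Unknown" category option can be sketched as follows. The frame is made up, the label `"Unknown"` and the flag column `OCCUPATION_MISSING` are choices made here for illustration, and the actual cleaning step belongs to Phase 2.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df_cc.
demo = pd.DataFrame({
    "OCCUPATION_TYPE": ["Sales staff", np.nan, "Security staff", np.nan, "Sales staff"],
})

# Record which rows were missing before filling, so that information is not lost.
demo["OCCUPATION_MISSING"] = demo["OCCUPATION_TYPE"].isnull().astype(int)

# Replace NaN with an explicit category instead of the mode.
demo["OCCUPATION_TYPE"] = demo["OCCUPATION_TYPE"].fillna("Unknown")

print(demo["OCCUPATION_TYPE"].value_counts())
```

Unlike a mode fill, this leaves the counts of the reported occupations unchanged.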


#### observation:-  `Find the missing/NaN values in the DataFrame first, so that the data can be cleaned before any further operations are performed.`

#### observation:- `The only column with missing/NaN values is the categorical column OCCUPATION_TYPE. Data cleaning issues in this and other columns are handled later, as required.`

### 6. Find Duplicate Values

In [15]:
df_cc.shape

(36457, 21)

In [16]:
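# For each column, count the values that repeat an earlier value (rows minus distinct values).
# Only ID is expected to be unique; categorical and flag columns repeat by design.
for col in df_cc.columns:
    print(f"{col}:{df_cc[col].duplicated().sum()}")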

ID:0
CODE_GENDER:36455
FLAG_OWN_CAR:36455
FLAG_OWN_REALTY:36455
CNT_CHILDREN:36448
AMT_INCOME_TOTAL:36192
NAME_INCOME_TYPE:36452
NAME_EDUCATION_TYPE:36452
NAME_FAMILY_STATUS:36452
NAME_HOUSING_TYPE:36451
DAYS_BIRTH:29274
DAYS_EMPLOYED:32817
FLAG_MOBIL:36456
FLAG_WORK_PHONE:36455
FLAG_PHONE:36455
FLAG_EMAIL:36455
OCCUPATION_TYPE:36438
CNT_FAM_MEMBERS:36447
WORST_STATUS_NUM:36451
LAST_STATUS:36449
DEFAULT:36455


In [17]:
df_cc.duplicated().sum()

0

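#### observation:- `ID` has no repeated values and `df_cc.duplicated().sum()` is 0, so there are no duplicate rows and no need to call `df_cc.drop_duplicates(inplace=True)`. The high counts for the other columns only reflect that they have few distinct values.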
 ----
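
A full-row check can never flag two rows that differ only in `ID`. As an optional extra check, not part of the original analysis, the sketch below compares rows on every column except `ID`. The frame uses four columns of the first three rows shown in the sample output; `feature_cols` and `same_profile` are illustrative names.

```python
import pandas as pd

# Four columns of the first three sample rows shown earlier.
demo = pd.DataFrame({
    "ID": [5008804, 5008805, 5008806],
    "CODE_GENDER": ["M", "M", "M"],
    "AMT_INCOME_TOTAL": [427500.0, 427500.0, 112500.0],
    "DAYS_EMPLOYED": [-4542, -4542, -1134],
})

# Full-row duplicates: always 0 here because ID is unique.
full_row_dupes = int(demo.duplicated().sum())

# Rows that share every value except ID; keep=False marks all members of a group.
feature_cols = demo.columns.drop("ID").tolist()
same_profile = demo.duplicated(subset=feature_cols, keep=False)

print(full_row_dupes, int(same_profile.sum()))
```

Whether such rows are genuine separate customers or repeated applications cannot be decided from these columns alone, so this check only surfaces candidates for review.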

### 7. Detecting Outliers

- The IQR (Interquartile Range) method was used to detect outliers in all numerical columns as part of the non-visual analysis.


**Steps:**

- Calculate Q1 (25th percentile) and Q3 (75th percentile).
- Compute IQR = Q3 - Q1.
- Define the lower and upper bounds:
  - Lower bound = Q1 - 1.5 × IQR
  - Upper bound = Q3 + 1.5 × IQR
- Any value outside this range is considered an outlier.

In [18]:
# Split the columns by dtype: numeric (int64/float64) and categorical (object)
numerical_columns = df_cc.select_dtypes(include=['int64', 'float64']).columns.to_list()
categorical_columns = df_cc.select_dtypes(include=['object']).columns.to_list()

In [19]:
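# Note: numerical_columns also contains ID, the 0/1 flags and DEFAULT,
# where IQR bounds are not meaningful.
outliers = {}
for col in numerical_columns:
    if df_cc[col].isnull().all():
        continue  # Skip columns that contain only missing values
    Q1 = df_cc[col].quantile(0.25)
    Q3 = df_cc[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Select the values that fall outside [lower, upper]
    is_outlier = (df_cc[col] < lower) | (df_cc[col] > upper)
    outlier_values = df_cc.loc[is_outlier, col]
    if not outlier_values.empty:
        outliers[col] = outlier_values.tolist()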

In [20]:
# if outliers:
#     for col, vals in outliers.items():
#         print(f"\nOutliers in column '{col}':\n{vals}")
# else:
#     print("No outliers found in the numerical columns.")
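
Printing every outlier value produces very long output, so a per-column count is usually easier to read. The helper below is a sketch of one way to summarise the IQR rule. The function name `iqr_outlier_summary` and the decision to skip columns with two or fewer distinct values (such as the 0/1 flags) are assumptions made here, and the demo frame is made up.

```python
import pandas as pd

def iqr_outlier_summary(df, columns):
    """Return the IQR bounds and outlier counts for each listed column."""
    rows = []
    for col in columns:
        s = df[col].dropna()
        # Skip empty, constant and binary columns: the IQR rule says nothing useful there.
        if s.nunique() <= 2:
            continue
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        n_out = int(((s < lower) | (s > upper)).sum())
        rows.append({
            "column": col,
            "lower": lower,
            "upper": upper,
            "n_outliers": n_out,
            "pct_outliers": round(100 * n_out / len(s), 2),
        })
    return pd.DataFrame(rows, columns=["column", "lower", "upper", "n_outliers", "pct_outliers"])

# Made-up data: one count column with an extreme value and one binary flag.
demo = pd.DataFrame({
    "CNT_CHILDREN": [0, 0, 0, 1, 1, 2, 0, 19],
    "FLAG_EMAIL": [0, 0, 0, 0, 0, 0, 0, 1],
})

summary = iqr_outlier_summary(demo, ["CNT_CHILDREN", "FLAG_EMAIL"])
print(summary)
```

On the real data this could be called as `iqr_outlier_summary(df_cc, numerical_columns)`, with `ID` removed from the list first since it is an identifier rather than a measurement.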

------
- `So far we have covered the project context, the structure of the dataset (size, shape, columns and what they mean), summary statistics, missing values and outliers. All of this is part of Phase 1. Phase 2 will cover data cleaning (handling missing values, segregating numerical and categorical columns) and visual and non-visual analysis of the dataset, to build a good understanding of it.`
-----