## **Predictive Analysis of Credit Card Defaults**

### Objective: 
- The dataset records credit card customers' default payments.
- The primary goal is to predict which credit card clients are likely to default using various data mining methods.

### Background: 
Traditional risk management models classify clients as either credible or not credible based on their likelihood of default. This project aims to refine this classification by identifying specific individuals who are likely to default, enhancing the precision of credit risk assessments.

Target variable:
- default payment next month: Default payment (1=yes, 0=no)

The dataset contains the following features:

1. ID: ID of each client
2. LIMIT_BAL: Amount of given credit in dollars (includes individual and family/supplementary credit)
3. SEX: Gender (1=male, 2=female)
4. EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
5. MARRIAGE: Marital status (1=married, 2=single, 3=others)
6. AGE: Age in years
7. PAY_0: Repayment status in September (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
8. PAY_2: Repayment status in August (scale same as above)
9. PAY_3: Repayment status in July (scale same as above)
10. PAY_4: Repayment status in June (scale same as above)
11. PAY_5: Repayment status in May (scale same as above)
12. PAY_6: Repayment status in April (scale same as above)
13. BILL_AMT1: Amount of bill statement in September (dollars)  
14. BILL_AMT2: Amount of bill statement in August (dollars)  
15. BILL_AMT3: Amount of bill statement in July (dollars)  
16. BILL_AMT4: Amount of bill statement in June (dollars)  
17. BILL_AMT5: Amount of bill statement in May (dollars)  
18. BILL_AMT6: Amount of bill statement in April (dollars)   
19. PAY_AMT1: Amount of previous payment in September (dollars)  
20. PAY_AMT2: Amount of previous payment in August (dollars)  
21. PAY_AMT3: Amount of previous payment in July (dollars)   
22. PAY_AMT4: Amount of previous payment in June (dollars)  
23. PAY_AMT5: Amount of previous payment in May (dollars)   
24. PAY_AMT6: Amount of previous payment in April (dollars)  



# 1. Reading the dataset 



In [None]:
import pandas as pd

# Read the Excel file
df = pd.read_excel("assignment_data/credit_data.xlsx")

# Create a pandas dataframe containing the first 10,000 rows from the credit card dataset
credit_df = df.head(10000)

# Delete the 'ID' column
credit_df = credit_df.drop(columns=["ID"])

# Print dataframe info
credit_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   LIMIT_BAL                   10000 non-null  int64  
 1   SEX                         9617 non-null   float64
 2   EDUCATION                   9617 non-null   float64
 3   MARRIAGE                    9617 non-null   float64
 4   AGE                         9617 non-null   float64
 5   PAY_0                       9617 non-null   float64
 6   PAY_2                       9617 non-null   float64
 7   PAY_3                       9617 non-null   float64
 8   PAY_4                       9617 non-null   float64
 9   PAY_5                       9617 non-null   float64
 10  PAY_6                       9642 non-null   float64
 11  BILL_AMT1                   9642 non-null   float64
 12  BILL_AMT2                   9642 non-null   float64
 13  BILL_AMT3                   9642


- Numeric variables are measured on a continuous or discrete scale and support arithmetic operations (for example age, bill amount).
- Ordinal variables represent categories with a meaningful order, but the intervals between them are not necessarily uniform (for example payment status).
- Nominal variables represent categories without a natural order (for example gender, marital status).



**Table: Classification of Features**

| Variable Kind | Number of Features | Feature Names |
|---------------|--------------------|---------------|
| **Numeric**   | 14 | LIMIT_BAL, AGE, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6 |
| **Ordinal**   | 7  | EDUCATION, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6 |
| **Nominal**   | 3  | SEX, MARRIAGE, default payment next month |



**Description:**

After removing the `ID` column, the dataset contains 24 columns: 23 predictors and the target variable.

- There are 14 numeric features: `LIMIT_BAL` (credit limit), `AGE`, and the six bill amounts and six payment amounts from April to September. These are continuous variables that reflect the financial behavior of clients.

- 7 features are classified as ordinal, including the six `PAY_` variables (`PAY_0`, `PAY_2` to `PAY_6`), which indicate repayment status on a monthly scale, with higher values corresponding to longer payment delays. 
    + `EDUCATION` is also treated as ordinal, as the values `1` (graduate school), `2` (university) and `3` (high school) follow a logical order. However, categories `4` (others) and `5` – `6` (unknown) create ambiguity, as they do not clearly fit into a ranked structure. To keep the variable usable as an ordinal feature, categories `5` and `6` can be merged into `4` (others) during preprocessing.

- The remaining 3 features are nominal: `SEX`, `MARRIAGE` and `default payment next month`. These represent categorical data with no inherent order and should be encoded accordingly.

This classification guides the choice of appropriate preprocessing strategies and modeling techniques, helping ensure that the data is interpreted correctly based on its underlying structure.
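
The classification in the table can be written down as plain Python lists, which makes the counts easy to verify and gives later preprocessing steps a single place to look up each group. This is a minimal sketch; the variable names (`numeric_features`, `ordinal_features`, `nominal_features`) are assumptions for illustration and do not appear in the notebook code.

```python
# Feature groups as listed in the classification table (names are illustrative).
numeric_features = (
    ["LIMIT_BAL", "AGE"]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(1, 7)]
)
ordinal_features = ["EDUCATION", "PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]
nominal_features = ["SEX", "MARRIAGE", "default payment next month"]

feature_groups = {
    "Numeric": numeric_features,
    "Ordinal": ordinal_features,
    "Nominal": nominal_features,
}

# Print the size of each group and confirm no column is listed twice.
for kind, names in feature_groups.items():
    print(f"{kind}: {len(names)} columns")

all_columns = numeric_features + ordinal_features + nominal_features
print("Total distinct columns:", len(set(all_columns)))
```

The three groups add up to the 24 columns reported by `credit_df.info()`.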


In [None]:


# Number of missing values for each variable
missing_values_credit = credit_df.isnull().sum().sort_values()
missing_values_credit

LIMIT_BAL                       0
default payment next month      0
PAY_AMT4                       10
PAY_AMT3                       10
PAY_AMT5                      290
PAY_AMT6                      290
BILL_AMT2                     358
BILL_AMT4                     358
BILL_AMT3                     358
PAY_6                         358
BILL_AMT1                     358
BILL_AMT5                     368
BILL_AMT6                     368
PAY_AMT1                      368
PAY_AMT2                      368
PAY_4                         383
SEX                           383
PAY_3                         383
PAY_2                         383
PAY_0                         383
AGE                           383
MARRIAGE                      383
PAY_5                         383
EDUCATION                     383
dtype: int64



The output shows that several variables in the dataset contain missing values, which need to be addressed before training a model:

- Repayment status features (`PAY_0`, `PAY_2`, `PAY_3`, `PAY_4`, `PAY_5`, `PAY_6`) have between 358 and 383 missing entries. Because these variables describe how timely clients are with their payments and are ranked by severity, they should be imputed carefully, ideally in a way that maintains their order.

- Bill statement amount features (`BILL_AMT1` to `BILL_AMT6`) have a moderate number of missing entries, ranging from 358 to 368. These are continuous numeric variables and are typically imputed using the mean.

- Payment amount features (`PAY_AMT1` to `PAY_AMT6`) show varying levels of missingness, from as few as 10 entries to as many as 368. These numeric variables should be imputed consistently to avoid introducing bias.

- Demographic features such as `SEX`, `AGE`, `MARRIAGE` and `EDUCATION` each have 383 missing entries. For the nominal or ordinal ones (`SEX`, `MARRIAGE`, `EDUCATION`), mode imputation is the most appropriate strategy, while `AGE` is numeric and is better imputed with the mean.

- It is also worth noting that both the target variable (`default payment next month`) and a key numeric predictor, `LIMIT_BAL`, have no missing values. This ensures the target is fully usable for supervised learning and that one of the most informative features is complete.

In summary, the missing values are non-trivial and span across multiple important predictors. Addressing them properly will help ensure the model is both reliable and accurate.
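
To put these counts in proportion, they can be expressed as a share of the 10,000 rows. The sketch below uses a handful of counts copied from the output above; the dictionary and variable names are assumptions for illustration.

```python
# Missing-value counts copied from the isnull().sum() output (a subset).
n_rows = 10_000
missing_counts = {
    "PAY_AMT3": 10,
    "PAY_AMT5": 290,
    "BILL_AMT1": 358,
    "BILL_AMT5": 368,
    "EDUCATION": 383,
}

# Convert each count into a fraction of all rows.
missing_share = {col: count / n_rows for col, count in missing_counts.items()}

for col, share in missing_share.items():
    print(f"{col:<10} {share:.2%}")

print(f"Largest share: {max(missing_share.values()):.2%}")
```

Even the most affected columns are missing fewer than 4% of their values, so simple imputation changes only a small part of each column.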



# 2. Cleaning data and dealing with categorical features 


In [None]:
# Create a new dataframe for the clean dataset to preserve the original data
cleaned_credit_df = credit_df.copy()

# Define numeric columns
numeric_col = [
    'LIMIT_BAL', 'AGE', 
    'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
    'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

# Impute numeric columns with mean
# (assign the result back; chained `fillna(..., inplace=True)` is deprecated in recent pandas)
for col in numeric_col:
    cleaned_credit_df[col] = cleaned_credit_df[col].fillna(cleaned_credit_df[col].mean())


# Define nominal/ordinal columns 
cat_col = [
    'SEX', 'EDUCATION', 'MARRIAGE',
    'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

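# Impute nominal/ordinal columns with mode
# (assign the result back; chained `fillna(..., inplace=True)` is deprecated in recent pandas)
for col in cat_col:
    cleaned_credit_df[col] = cleaned_credit_df[col].fillna(cleaned_credit_df[col].mode()[0])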


Data imputation is the process of replacing missing values with estimated ones based on observed data. This step ensures the dataset is complete and suitable for model training, preventing errors or potential biases that could arise from null values.

In this assignment, I applied two primary strategies based on feature type:

- Mean imputation was used for all numeric variables (for example `AGE`, `BILL_AMT` and `PAY_AMT`), as these features are continuous. Replacing missing values with the average preserves each column's mean, although it slightly reduces the variance; with fewer than 4% of values missing in any column, the effect on the overall distribution is small.

- Mode imputation was used for nominal and ordinal variables (`SEX`, `EDUCATION`, `MARRIAGE` and the repayment status variables `PAY_0`, `PAY_2` to `PAY_6`), since these features represent discrete groups. Filling in the most frequent category helps preserve the variable’s interpretability and structure.

These decisions were guided by the earlier feature classification (see Q2), ensuring that the imputation approach aligned with the nature of each variable. This structured strategy supports both the integrity of the dataset and the performance of the predictive models built on it.
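
The two strategies can be illustrated on a tiny example. This is a sketch on made-up values, not the credit data; the `toy` DataFrame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric column and one categorical column, each with a gap.
toy = pd.DataFrame({
    "AGE": [30.0, np.nan, 50.0, 40.0],
    "EDUCATION": [2.0, 2.0, np.nan, 1.0],
})

# Mean imputation for the numeric column (mean of 30, 50, 40 is 40).
toy["AGE"] = toy["AGE"].fillna(toy["AGE"].mean())

# Mode imputation for the categorical column (most frequent code is 2).
toy["EDUCATION"] = toy["EDUCATION"].fillna(toy["EDUCATION"].mode()[0])

print(toy)
print("Remaining missing values:", int(toy.isnull().sum().sum()))
```

Assigning the result of `fillna()` back to the column, rather than calling it with `inplace=True` on a selected column, keeps the behaviour the same across pandas versions.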


In [None]:
# Print value counts of the 'SEX' column
print("Value counts of the SEX column:\n", cleaned_credit_df['SEX'].value_counts())

# Convert 'SEX' column to integer
cleaned_credit_df['SEX'] = cleaned_credit_df['SEX'].astype(int)

# Apply `get_dummies()` to the 'SEX' column
sex_dummies = pd.get_dummies(cleaned_credit_df['SEX'], prefix='SEX', drop_first=True, dtype=int)

# Rename SEX_2 to SEX_FEMALE and add it to the dataframe
cleaned_credit_df['SEX_FEMALE'] = sex_dummies['SEX_2']

# Drop the original 'SEX' column
cleaned_credit_df.drop(columns=['SEX'], inplace=True)



Value counts of the SEX column:
 SEX
2.0    6162
1.0    3838
Name: count, dtype: int64




The `value_counts()` function was first used to examine the distribution of the `'SEX'` variable, showing:
- `2.0` = Female with 6,162 clients  
- `1.0` = Male with 3,838 clients  

This gives a clear understanding of the gender breakdown in the dataset.

Next, the `'SEX'` column was converted to integers using `.astype(int)`, so that `pd.get_dummies()` names the dummy column `SEX_2` rather than `SEX_2.0`.

I then applied `pd.get_dummies()` with `drop_first=True`, which avoids multicollinearity by generating only one dummy variable. Since the original values are `1 = male` and `2 = female`, this created a column named `'SEX_2'`, corresponding to clients who are female. To match the assignment instructions and clearly reflect the category it represents, I renamed `SEX_2` to `SEX_FEMALE`.

Meaning of the new variable `SEX_FEMALE`:
- `SEX_FEMALE = 1` means the client is female
- `SEX_FEMALE = 0` means the client is male


This transformation ensures the gender feature is in a numeric format suitable for machine learning models, while retaining its interpretability. I also used `dtype=int` to make sure the dummy values are stored as `0` and `1`, rather than Boolean values.

Finally, I dropped the original `'SEX'` column to prevent redundancy, as its information is now fully captured in `'SEX_FEMALE'`.
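
The effect of the integer conversion and of `drop_first=True` can be seen on a few made-up values. This is a sketch; the `sex` Series is an assumption for illustration.

```python
import pandas as pd

# Toy column stored as floats, as it is after imputation.
sex = pd.Series([1.0, 2.0, 2.0, 1.0], name="SEX")

# Without the integer conversion, the float code ends up in the column name.
float_dummies = pd.get_dummies(sex, prefix="SEX", drop_first=True, dtype=int)
print(list(float_dummies.columns))  # ['SEX_2.0']

# With the conversion, the column name matches the one used in the notebook.
int_dummies = pd.get_dummies(sex.astype(int), prefix="SEX", drop_first=True, dtype=int)
print(list(int_dummies.columns))  # ['SEX_2']

# drop_first=True keeps a single 0/1 column: 1 for code 2 (female), 0 for code 1 (male).
print(int_dummies["SEX_2"].tolist())  # [0, 1, 1, 0]
```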


In [None]:
# Print value counts of the 'MARRIAGE' column
print("Value counts of the 'MARRIAGE' column:\n", cleaned_credit_df['MARRIAGE'].value_counts())



Value counts of the 'MARRIAGE' column:
 MARRIAGE
2.0    5518
1.0    4380
3.0      82
0.0      20
Name: count, dtype: int64




According to the dataset definition, the `MARRIAGE` variable should only contain three valid categories:

- `1` = Married
- `2` = Single
- `3` = Others

However, the `value_counts()` output shows that 20 entries are coded as `0`, which is not defined in the original variable description. This discrepancy between the dataset and its documentation points to a data quality issue. These undefined values may result from input errors or inconsistent coding and must be addressed during preprocessing to prevent bias or misleading results during model training.

Most clients are either single (5,518) or married (4,380), while only 82 are classified as “others.” Since the `0` values are undefined and fewer than category `3`, a practical solution is to reassign them to category `3` (Others) to preserve all data without introducing an invalid class.
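
A check of this kind can be automated by comparing the column against the documented codes. The sketch below uses a made-up Series; `valid_codes` and the other names are assumptions for illustration.

```python
import pandas as pd

# Codes documented for MARRIAGE: 1 = married, 2 = single, 3 = others.
valid_codes = {1, 2, 3}

# Toy column containing two undocumented 0 entries.
marriage = pd.Series([1.0, 2.0, 0.0, 3.0, 2.0, 0.0], name="MARRIAGE")

# Flag every entry whose code is not in the documented set.
undefined_mask = ~marriage.isin(valid_codes)
print("Undefined entries:", int(undefined_mask.sum()))
print(marriage[undefined_mask].value_counts())

# Reassign the undocumented code to 3 (others), then re-check.
recoded = marriage.replace({0: 3})
print("All codes valid after recoding:", bool(recoded.isin(valid_codes).all()))
```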

In [None]:
# Apply `get_dummies()` to 'MARRIAGE' feature and add dummy variables 'MARRIAGE_MARRIED', 'MARRIAGE_SINGLE', 'MARRIAGE_OTHER' to `df`  

# Reassign invalid values (0) to 3 = 'Other'
cleaned_credit_df['MARRIAGE'] = cleaned_credit_df['MARRIAGE'].replace({0: 3})

# Convert MARRIAGE column to integers 
cleaned_credit_df['MARRIAGE'] = cleaned_credit_df['MARRIAGE'].astype(int)

# Apply `get_dummies()` to 'MARRIAGE' feature
marriage_dummies = pd.get_dummies(cleaned_credit_df['MARRIAGE'], prefix='MARRIAGE', dtype=int)

# Rename dummy columns to match assignment instructions
marriage_dummies.rename(
    columns={
        'MARRIAGE_1': 'MARRIAGE_MARRIED',
        'MARRIAGE_2': 'MARRIAGE_SINGLE',
        'MARRIAGE_3': 'MARRIAGE_OTHER'},
    inplace=True)

# Concatenate dummy variables to the main dataframe
cleaned_credit_df = pd.concat([cleaned_credit_df, marriage_dummies], axis=1)

# Drop the original 'MARRIAGE' column
cleaned_credit_df.drop(columns=['MARRIAGE'], inplace=True)




The `value_counts()` function for the `MARRIAGE` column showed an unexpected value `0`, which does not appear in the dataset definition. According to the dataset:

- `1` is Married
- `2` is Single
- `3` is Others

The presence of `0` is likely a data entry error or inconsistency. Since the instructions state not to delete observations or treat anomalies as missing, I reassigned all instances of `0` to category `3` (Others). This approach preserves data integrity and ensures potentially useful patterns are not lost. It is also semantically consistent since `0` clearly does not represent a meaningful marital status on its own, so grouping it with “Others” avoids introducing noise or misinterpretation into the model.

To prepare the variable for modeling, I first converted the `MARRIAGE` column to integer type using `.astype(int)`. This step ensures that the category values are clean integers, which prevents column names like `MARRIAGE_1.0` from appearing during dummy encoding.

Next, I applied `pd.get_dummies()` to the `MARRIAGE` column to convert it into binary variables. The `dtype=int` ensures that the resulting values are stored as integers (`0` and `1`), which is preferred by most machine learning algorithms over boolean types.

By default, the resulting dummy columns were named numerically (`MARRIAGE_1`, `MARRIAGE_2`, `MARRIAGE_3`), reflecting the original numeric codes. To ensure clarity and match the assignment instructions along with the dataset definitions (where `1` = married, `2` = single and `3` = others), I renamed the columns using `.rename()` as follows:

- `MARRIAGE_1` to `MARRIAGE_MARRIED`  
- `MARRIAGE_2` to `MARRIAGE_SINGLE`  
- `MARRIAGE_3` to `MARRIAGE_OTHER`  

Each of these variables follows the binary format (`0` or `1`) and represents:

- `MARRIAGE_MARRIED` = 1 if the client is married, 0 otherwise  
- `MARRIAGE_SINGLE` = 1 if the client is single, 0 otherwise  
- `MARRIAGE_OTHER` = 1 if the client belongs to the "other" category (including any recoded `0`), 0 otherwise  

Finally, I concatenated the renamed dummy columns back into the original DataFrame and dropped the original `MARRIAGE` column to avoid redundancy. This preprocessing step ensures that the feature is clean, encoded and ready for use in machine learning models.
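
One property worth checking after this kind of encoding is that every row has exactly one of the three dummy columns set to `1`. The sketch below shows the check on made-up codes; the `marriage` Series is an assumption for illustration.

```python
import pandas as pd

# Toy MARRIAGE codes after recoding and integer conversion.
marriage = pd.Series([1, 2, 3, 2, 1], name="MARRIAGE")

# Full one-hot encoding (no drop_first), followed by the descriptive renaming.
dummies = pd.get_dummies(marriage, prefix="MARRIAGE", dtype=int).rename(
    columns={
        "MARRIAGE_1": "MARRIAGE_MARRIED",
        "MARRIAGE_2": "MARRIAGE_SINGLE",
        "MARRIAGE_3": "MARRIAGE_OTHER",
    }
)
print(dummies)

# Each client belongs to exactly one category, so each row sums to 1.
row_sums = dummies.sum(axis=1)
print("Every row sums to 1:", bool((row_sums == 1).all()))
```

Because the three columns always sum to one, any one of them is determined by the other two. Keeping all three, as done here, favours readability over a minimal encoding.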





In [None]:
# Convert the values {0, 5, 6} to the value 4 in the column 'EDUCATION'

# Convert values 0, 5, and 6 in 'EDUCATION' to 4
cleaned_credit_df['EDUCATION'] = cleaned_credit_df['EDUCATION'].replace({0: 4, 5: 4, 6: 4})


# 3. Preparing X and y arrays 


- Create a numpy array `y` from the first 7,000 observations of `default payment next month` column from `df`
- Create a numpy array `X`  from the first 7,000 observations of all the remaining variables in `df` 


In [None]:


import numpy as np

# Create a numpy array `y` from the first 7,000 observations of `default payment next month` column from `df`
y = cleaned_credit_df['default payment next month'].iloc[:7000].to_numpy()

# Create a numpy array `X`  from the first 7,000 observations of all the remaining variables in `df`
X = cleaned_credit_df.drop(columns=['default payment next month']).iloc[:7000].to_numpy()


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into 70% train and 30% test datasets with stratification and random_state=31
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=31, stratify=y)

# Standardise the data to mean zero and variance one using an appropriate `sklearn` library
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
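
The scaler above is fitted on the training split only and then reused on the test split. The following is a NumPy sketch of the same computation on synthetic data, using the column mean and population standard deviation that `StandardScaler` computes by default. The toy arrays and the `_toy` names are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-ins for the training and test splits (70 and 30 rows, 3 features).
rng = np.random.default_rng(31)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(70, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(30, 3))

# Statistics are estimated from the training split only.
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both splits.
X_train_toy_std = (X_train_toy - train_mean) / train_std
X_test_toy_std = (X_test_toy - train_mean) / train_std

print("Train means:", X_train_toy_std.mean(axis=0).round(6))
print("Train stds: ", X_train_toy_std.std(axis=0).round(6))
print("Test means: ", X_test_toy_std.mean(axis=0).round(3))
```

The training columns end up with mean zero and unit variance by construction, while the test columns are only approximately standardised. That is expected: reusing the training statistics keeps information from the test set out of the preprocessing step.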


# 4. Training Models and Interpretation 


### 4.1. Logistic Regression

In [None]:


from sklearn.linear_model import LogisticRegression

# Train a linear classifier (Logistic Regression) using standardised data
lr = LogisticRegression(random_state=31)
lr.fit(X_train_std, y_train)

# Compute training and test accuracies
train_acc = lr.score(X_train_std, y_train)
test_acc = lr.score(X_test_std, y_test)

# Print results
print(f"Training Accuracy (Logistic Regression): {train_acc:.3f}")
print(f"Test Accuracy (Logistic Regression): {test_acc:.3f}")



Training Accuracy (Logistic Regression): 0.821
Test Accuracy (Logistic Regression): 0.805


### 4.2. K-Nearest Neighbor

In [None]:

from sklearn.neighbors import KNeighborsClassifier

# Train nonlinear classifier (K-Nearest Neighbors) on the same dataset
knn_clf = KNeighborsClassifier(n_neighbors=5)  
knn_clf.fit(X_train_std, y_train)

# Compute training and test accuracies
train_acc_knn = knn_clf.score(X_train_std, y_train)
test_acc_knn = knn_clf.score(X_test_std, y_test)

# Print results
print(f"Training Accuracy (KNN): {train_acc_knn:.3f}")
print(f"Test Accuracy (KNN): {test_acc_knn:.3f}")

Training Accuracy (KNN): 0.843
Test Accuracy (KNN): 0.780



**1) Results obtained from the two classifiers - Comparison**

- The logistic regression model achieved a training accuracy of 0.821 and a test accuracy of 0.805. In contrast, the K-Nearest Neighbors (KNN) model reached a higher training accuracy of 0.843, but a lower test accuracy of 0.780.

- While both models performed reasonably well, logistic regression generalised better to unseen data. The small gap between its training and test scores suggests the model is stable and not overfitting.

- On the other hand, KNN shows signs of overfitting. Its higher training accuracy, along with a noticeable drop in test performance, suggests that the model may be fitting too closely to the training data. This behaviour is typical of memory-based, nonlinear models such as KNN, which often capture local patterns that do not generalise well to unseen data.
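
The size of the generalisation gap can be computed directly from the accuracies reported above. This is a small arithmetic sketch; the `reported` dictionary simply restates the printed results.

```python
# Training and test accuracies as printed by the two cells above.
reported = {
    "Logistic Regression": (0.821, 0.805),
    "KNN (k=5)": (0.843, 0.780),
}

# Gap = training accuracy minus test accuracy, rounded to three decimals.
gaps = {name: round(train - test, 3) for name, (train, test) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.3f}")
```

The gap is 0.016 for logistic regression and 0.063 for KNN, roughly four times larger.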

**2) Recommended model**

- Based on the observed performance, I would recommend logistic regression. Despite its slightly lower training accuracy, it achieved better performance on the test set, indicating more reliable predictions on new data, which is a key goal in credit risk modeling.

- Logistic regression is also highly interpretable, which is valuable in risk assessments and in financial contexts where decisions must be explainable and transparent to stakeholders. Its coefficients can provide insights into which variables most strongly influence default risk, which helps support decision-making.

- KNN may still be useful in specific scenarios where the goal is to capture nonlinear patterns, but in this case, its performance suggests it is less suited for the given dataset without further tuning or feature engineering.

- Finally, this project goes beyond traditional binary classification, which simply labels clients as credible or not. Instead, it focuses on identifying exactly who is most likely to default, allowing for more targeted risk management. Among the models tested, logistic regression offered the best balance between performance, interpretability and robustness, making it highly suitable for real-world financial applications, where both accuracy and explainability are essential.
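
As an illustration of the interpretability point, the sketch below fits a logistic regression on synthetic standardised data and ranks the features by the absolute size of their coefficients. The data, the feature names (`feature_a`, `feature_b`, `feature_c`) and the generating weights are all assumptions for illustration, so the printed numbers say nothing about the credit dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: three hypothetical features, 500 observations.
rng = np.random.default_rng(31)
feature_names = ["feature_a", "feature_b", "feature_c"]
X_toy = rng.normal(size=(500, 3))

# The synthetic target depends strongly on feature_a, weakly on feature_b,
# and not at all on feature_c.
logits = 2.0 * X_toy[:, 0] - 0.5 * X_toy[:, 1]
y_toy = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Standardise first so that coefficient magnitudes are comparable.
X_toy_std = StandardScaler().fit_transform(X_toy)
model = LogisticRegression(random_state=31).fit(X_toy_std, y_toy)

# Rank features by absolute coefficient size.
ranked = sorted(zip(feature_names, model.coef_[0]), key=lambda p: abs(p[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: {coef:+.3f}")
```

On the credit data, the same ranking could be applied to `lr.coef_[0]` together with the column names of `cleaned_credit_df` (excluding the target), since `X` was built from those columns in order.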
