<a href="https://colab.research.google.com/github/arezoo17/explainable-credit-risk/blob/main/Load_and_Split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split


In [None]:
df = pd.read_csv("/Users/arezoo/Documents/1.MDDB/MasterProject/Code/German_Credit.csv")

In [None]:
# Basic shape and structure
print("Dataset shape:", df.shape)

Dataset shape: (1000, 21)


In [None]:
df.head()

Unnamed: 0,checking account,Duration,Credit_his,Purpose,Credit amount,Savings account,Present_emp,Installment rate,sex,other_debtor,...,Property,Age,Other_install,Housing,Num_credits,Job,Num_people,Telephone,Foreign worker,Class
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,...,A121,67,A143,A152,2,A173,1,A192,A201,1
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,...,A121,22,A143,A152,1,A173,1,A191,A201,2
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,...,A121,49,A143,A152,1,A172,2,A191,A201,1
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,...,A122,45,A143,A153,1,A173,2,A191,A201,1
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,...,A124,53,A143,A153,2,A173,2,A191,A201,2


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   checking account  1000 non-null   object
 1   Duration          1000 non-null   int64 
 2   Credit_his        1000 non-null   object
 3   Purpose           1000 non-null   object
 4   Credit amount     1000 non-null   int64 
 5   Savings account   1000 non-null   object
 6   Present_emp       1000 non-null   object
 7   Installment rate  1000 non-null   int64 
 8   sex               1000 non-null   object
 9   other_debtor      1000 non-null   object
 10  Present_resid     1000 non-null   int64 
 11  Property          1000 non-null   object
 12  Age               1000 non-null   int64 
 13  Other_install     1000 non-null   object
 14  Housing           1000 non-null   object
 15  Num_credits       1000 non-null   int64 
 16  Job               1000 non-null   object
 17  Num_people     

In [None]:
df.isnull().sum()

checking account    0
Duration            0
Credit_his          0
Purpose             0
Credit amount       0
Savings account     0
Present_emp         0
Installment rate    0
sex                 0
other_debtor        0
Present_resid       0
Property            0
Age                 0
Other_install       0
Housing             0
Num_credits         0
Job                 0
Num_people          0
Telephone           0
Foreign worker      0
Class               0
dtype: int64

In [None]:
df.nunique()

checking account      4
Duration             33
Credit_his            5
Purpose              10
Credit amount       921
Savings account       5
Present_emp           5
Installment rate      4
sex                   4
other_debtor          3
Present_resid         4
Property              4
Age                  53
Other_install         3
Housing               3
Num_credits           4
Job                   4
Num_people            2
Telephone             2
Foreign worker        2
Class                 2
dtype: int64

In [None]:
cat_cols = [col for col in df.columns if df[col].dtypes == 'O']

for col in cat_cols:
    print(df[col].value_counts(), "\n\n")

A14    394
A11    274
A12    269
A13     63
Name: checking account, dtype: int64 


A32    530
A34    293
A33     88
A31     49
A30     40
Name: Credit_his, dtype: int64 


A43     280
A40     234
A42     181
A41     103
A49      97
A46      50
A45      22
A44      12
A410     12
A48       9
Name: Purpose, dtype: int64 


A61    603
A65    183
A62    103
A63     63
A64     48
Name: Savings account, dtype: int64 


A73    339
A75    253
A74    174
A72    172
A71     62
Name: Present_emp, dtype: int64 


A93    548
A92    310
A94     92
A91     50
Name: sex, dtype: int64 


A101    907
A103     52
A102     41
Name: other_debtor, dtype: int64 


A123    332
A121    282
A122    232
A124    154
Name: Property, dtype: int64 


A143    814
A141    139
A142     47
Name: Other_install, dtype: int64 


A152    713
A151    179
A153    108
Name: Housing, dtype: int64 


A173    630
A172    200
A174    148
A171     22
Name: Job, dtype: int64 


A191    596
A192    404
Name: Telephone, dtype: int64 

## 📂 Step 2: Rename Columns and Transform Target Variable

In this step, we will:

1. Rename all column names to more descriptive, human-readable ones (based on the UCI data dictionary).
2. Transform the target variable (`Class`) so that:
   - `1` → `0` (good credit, **non-default**)
   - `2` → `1` (bad credit, **default**)

This transformation follows the common machine learning convention in which `1` marks the **positive class**: the event we want to detect, here default risk.


| Step                   | Explanation                                                                                                                                                                      |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Renaming columns**   | We replace the original column titles, which mix spaces, capitalization, and abbreviations, with consistent snake_case names. This makes coding easier and helps us explain results clearly in the thesis. |
| **Mapping the target** | In ML, it's conventional to mark the event we care about (default) as `1`, and everything else as `0`. So we convert: <br> `Class = 1 (Good)` → `0` <br> `Class = 2 (Bad)` → `1` |
| **Why?**               | Most ML models and performance metrics are built around detecting the "positive class", which in this case is a credit default. |
| **Validation step**    | We print out the updated column list and class distribution to confirm the changes are correct.                                                                                  |


In [None]:
#Define new column names for clarity, based on UCI dictionary
new_column_names = {
    "checking account": "checking_account",
    "Duration": "duration_months",
    "Credit_his": "credit_history",
    "Purpose": "purpose",
    "Credit amount": "credit_amount",
    "Savings account": "savings_account",
    "Present_emp": "employment_duration",
    "Installment rate": "installment_rate",
    "sex": "personal_status_sex",
    "other_debtor": "other_debtors",
    "Present_resid": "residence_since",
    "Property": "property",
    "Age": "age",
    "Other_install": "other_installment_plans",
    "Housing": "housing",
    "Num_credits": "existing_credits",
    "Job": "job",
    "Num_people": "num_dependents",
    "Telephone": "telephone",
    "Foreign worker": "foreign_worker",
    "Class": "target"
}

#Apply the renaming to the DataFrame
df.rename(columns=new_column_names, inplace=True)

# Recode the target variable: 1 → 0 (good credit), 2 → 1 (bad credit)
df["target"] = df["target"].map({1: 0, 2: 1})

#Double check results
print("\n✅ Updated column names:")
print(df.columns.tolist())

print("\n🎯 Value counts after target transformation (0 = good, 1 = default):")
print(df["target"].value_counts(normalize=True) * 100)


✅ Updated column names:
['checking_account', 'duration_months', 'credit_history', 'purpose', 'credit_amount', 'savings_account', 'employment_duration', 'installment_rate', 'personal_status_sex', 'other_debtors', 'residence_since', 'property', 'age', 'other_installment_plans', 'housing', 'existing_credits', 'job', 'num_dependents', 'telephone', 'foreign_worker', 'target']

🎯 Value counts after target transformation (0 = good, 1 = default):
0    70.0
1    30.0
Name: target, dtype: float64
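
The validation step above can be made stricter. The following is a minimal sketch, not part of the original notebook: it assumes the raw `Class` column contains only the codes `1` and `2`, and the helper name `recode_target` and the toy series are hypothetical. Because `Series.map` silently turns any unmapped value into `NaN`, the sketch raises an error instead of letting such rows pass unnoticed.

```python
import pandas as pd


def recode_target(classes: pd.Series) -> pd.Series:
    """Map 1 (good) -> 0 and 2 (bad) -> 1; fail loudly on any other code.

    Hypothetical helper for illustration only.
    """
    mapping = {1: 0, 2: 1}
    recoded = classes.map(mapping)  # unmapped values become NaN
    if recoded.isna().any():
        unexpected = classes[recoded.isna()].unique().tolist()
        raise ValueError(f"Unexpected class codes: {unexpected}")
    return recoded.astype("int64")


# Toy stand-in for the raw `Class` column (assumed values, not real data)
toy = pd.Series([1, 2, 1, 1, 2], name="Class")
recoded = recode_target(toy)
print(recoded.tolist())
```

In the notebook, the same check could be applied to `df["target"]` in place of the plain `.map(...)` call, if a hard failure on unexpected codes is preferred.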


### 📌 Train-Test Split (70-30) with Stratified Sampling

To ensure a fair model evaluation, we split the dataset into a training set (70%) and a test set (30%) before applying any preprocessing, so that no information from the test rows leaks into fitted transformations.  
We use **stratified sampling** to preserve the original distribution of the target variable (`target`) in both subsets. This matters because the classes are imbalanced (70% good credits versus 30% bad), and a purely random split could shift that ratio.

The test set will remain untouched until the final model evaluation, ensuring unbiased performance metrics.


In [None]:
# Split the data while preserving class distribution
df_train, df_test = train_test_split(
    df,
    test_size=0.3,
    stratify=df['target'],
    random_state=42
)

# Reset indexes for clean usage; from here on, `df` refers to the training set only
df = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# Check the class distribution in both sets
print("Train target distribution:")
print(df['target'].value_counts(normalize=True) * 100)

print("\nTest target distribution:")
print(df_test['target'].value_counts(normalize=True) * 100)


Train target distribution:
0    70.0
1    30.0
Name: target, dtype: float64

Test target distribution:
0    70.0
1    30.0
Name: target, dtype: float64
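
As an illustration of what stratification guarantees, here is a small sketch on synthetic data. The `toy` frame and its `feature` column are made up for this example and only mimic the 70/30 class ratio reported above. It checks two properties of the split: the default rate is preserved in both subsets, and no row lands in both.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 700 "good" (0) and 300 "default" (1) rows (assumed data)
toy = pd.DataFrame({"feature": range(1000), "target": [0] * 700 + [1] * 300})

train, test = train_test_split(
    toy,
    test_size=0.3,
    stratify=toy["target"],  # keep the class ratio in both subsets
    random_state=42,
)

# Share of defaults in each subset
train_rate = train["target"].mean()
test_rate = test["target"].mean()
print(f"default rate: train={train_rate:.3f}, test={test_rate:.3f}")

# Rows that appear in both subsets (should be none)
overlap = set(train.index) & set(test.index)
print("overlapping rows:", len(overlap))
```

The overlap check relies on the original index, so it has to run before any `reset_index(drop=True)` call.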


In [None]:
df.to_csv("/Users/arezoo/Documents/1.MDDB/MasterProject/Code/df_train.csv", index=False)
df_test.to_csv("/Users/arezoo/Documents/1.MDDB/MasterProject/Code/df_test.csv", index=False)