## Step 1: Load the Dataset and Explore Its Structure

Before doing any cleaning or analysis, we must understand the raw data.

In this step, we:
- Load the dataset into a DataFrame using Pandas.
- Check the first few rows to see what the data looks like.
- Review column names and data types.
- Count how many rows and columns we have.

This helps us plan the cleaning process and spot obvious issues early, such as missing values, incorrect types, or strange column names.


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Make sure the images/ output directory exists
os.makedirs("images", exist_ok=True)

In [3]:
# Load the dataset
df = pd.read_csv("../data/inputs/raw/uci_credit_card.csv")
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


In [5]:
# show the shape of the dataset
df.shape

(30000, 25)

In [6]:
# check the data types of the columns
df.dtypes

ID                              int64
LIMIT_BAL                     float64
SEX                             int64
EDUCATION                       int64
MARRIAGE                        int64
AGE                             int64
PAY_0                           int64
PAY_2                           int64
PAY_3                           int64
PAY_4                           int64
PAY_5                           int64
PAY_6                           int64
BILL_AMT1                     float64
BILL_AMT2                     float64
BILL_AMT3                     float64
BILL_AMT4                     float64
BILL_AMT5                     float64
BILL_AMT6                     float64
PAY_AMT1                      float64
PAY_AMT2                      float64
PAY_AMT3                      float64
PAY_AMT4                      float64
PAY_AMT5                      float64
PAY_AMT6                      float64
default.payment.next.month      int64
dtype: object
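
The shape and dtype checks above can be combined into a single per-column overview. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not rows from the real file); on the actual data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy frame standing in for the credit card data (values are illustrative only)
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0],
    "SEX": [2, 2, 1],
})

# One row per column: dtype, non-null count, and number of distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "n_unique": toy.nunique(),
})
print(summary)
```

A low `n_unique` on an integer column is a hint that the column holds category codes rather than a true quantity.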

## Step 2: Handle Missing or Invalid Values

Now that we understand the structure of the dataset, the next step is to check for missing or invalid data.

In this step, we:
- Look for missing values in the dataset.
- Count how many missing values each column has.
- Decide whether to fill, drop, or investigate further.
- Check for invalid values (e.g., negative balances or out-of-range values).

Cleaning the data ensures we don’t get incorrect results during analysis or modeling. This is an essential part of preparing high-quality data.


In [13]:
# Check for missing values
df.isnull().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64
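
When a dataset does contain gaps, raw counts are easier to act on alongside percentages. The sketch below uses a made-up frame with one deliberately missing value (`toy` is hypothetical); it shows one possible way to build such a summary, not a required step for this dataset, which has no missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing credit limit (values are illustrative only)
toy = pd.DataFrame({
    "LIMIT_BAL": [20000.0, np.nan, 90000.0, 50000.0],
    "AGE": [24, 26, 34, 37],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Keep only the columns that actually have gaps
missing = missing[missing["n_missing"] > 0]
print(missing)
```

The percentage helps with the fill-or-drop decision: a column missing a fraction of a percent can usually be filled, while one that is mostly empty may not be worth keeping.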

In [15]:
# List numeric columns
numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns

# Check for negative values in numeric columns
df[numeric_cols].lt(0).sum()

ID                                0
LIMIT_BAL                         0
SEX                               0
EDUCATION                         0
MARRIAGE                          0
AGE                               0
PAY_0                          8445
PAY_2                          9832
PAY_3                         10023
PAY_4                         10035
PAY_5                         10085
PAY_6                         10635
BILL_AMT1                       590
BILL_AMT2                       669
BILL_AMT3                       655
BILL_AMT4                       675
BILL_AMT5                       655
BILL_AMT6                       688
PAY_AMT1                          0
PAY_AMT2                          0
PAY_AMT3                          0
PAY_AMT4                          0
PAY_AMT5                          0
PAY_AMT6                          0
default.payment.next.month        0
dtype: int64
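
A plain negative-value check also flags the repayment status columns, where -1 and -2 are legitimate codes. For coded columns, comparing against a set of allowed values is more targeted. The sketch below assumes the allowed sets shown in `allowed`; these are illustrative and should be adjusted to the data dictionary in use. The `toy` frame is made up.

```python
import pandas as pd

# Toy frame with one undocumented marriage code (0); values are illustrative only
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "MARRIAGE": [1, 2, 0, 3],
    "PAY_0": [-1, 0, 2, -2],
})

# Assumed valid code sets, for illustration only
allowed = {
    "SEX": {1, 2},
    "MARRIAGE": {1, 2, 3},
    "PAY_0": set(range(-2, 10)),  # -2 through 9
}

# Count the rows in each column whose value is outside its allowed set
out_of_range = {
    col: int((~toy[col].isin(codes)).sum())
    for col, codes in allowed.items()
}
print(out_of_range)
```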

## Step 3: Rename Columns and Map Categorical Values

In this step, we improve the readability of the dataset:

1. Rename columns to make them clear and consistent (e.g., lowercase, underscores).
2. Replace unclear column names like "PAY_0" with more descriptive names like "pay_delay_sep".
3. Convert numeric codes in categorical columns into readable labels:
   - `sex` → "male", "female"
   - `education` → "graduate school", "university", etc.
   - `marriage` → "married", "single", etc.
4. Document what the repayment status codes mean (e.g., -1 means "paid on time", 1 means "1-month delay") while keeping them numeric.

This step makes the dataset easier to analyze and understand, especially for presentations and dashboards.


In [16]:
# Normalize column names: lowercase, trimmed, with spaces and dots replaced by underscores
df.columns = (
    df.columns.str.lower()
    .str.strip()
    .str.replace(" ", "_", regex=False)
    .str.replace(".", "_", regex=False)
)

In [17]:
# Give key columns descriptive names.
# rename() returns a new DataFrame, so assign the result back to df.
df = df.rename(columns={
    "pay_0": "pay_delay_sep",
    "pay_2": "pay_delay_aug",
    "pay_3": "pay_delay_jul",
    "pay_4": "pay_delay_jun",
    "pay_5": "pay_delay_may",
    "pay_6": "pay_delay_apr",
    "bill_amt1": "bill_amount_sep",
    "bill_amt2": "bill_amount_aug",
    "bill_amt3": "bill_amount_jul",
    "bill_amt4": "bill_amount_jun",
    "bill_amt5": "bill_amount_may",
    "bill_amt6": "bill_amount_apr",
    "pay_amt1": "payment_amount_sep",
    "pay_amt2": "payment_amount_aug",
    "pay_amt3": "payment_amount_jul",
    "pay_amt4": "payment_amount_jun",
    "pay_amt5": "payment_amount_may",
    "pay_amt6": "payment_amount_apr",
    "default_payment_next_month": "default_next_month"
})
df

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_delay_sep,pay_delay_aug,pay_delay_jul,pay_delay_jun,...,bill_amount_jun,bill_amount_may,bill_amount_apr,payment_amount_sep,payment_amount_aug,payment_amount_jul,payment_amount_jun,payment_amount_may,payment_amount_apr,default_next_month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000.0,1,3,1,39,0,0,0,0,...,88004.0,31237.0,15980.0,8500.0,20000.0,5003.0,3047.0,5000.0,1000.0,0
29996,29997,150000.0,1,3,2,43,-1,-1,-1,-1,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
29997,29998,30000.0,1,2,2,37,4,3,2,-1,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
29998,29999,80000.0,1,3,1,41,1,-1,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1
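
Because the monthly columns follow a regular pattern, the rename mapping can also be generated instead of typed out by hand. This is a minimal sketch of that idea; `months` and `rename_map` are names introduced here for illustration, and the only irregularity it has to handle is that the repayment status columns are numbered 0, 2, 3, ..., 6.

```python
# Months in the order the dataset numbers them: index 1 is September, 6 is April
months = ["sep", "aug", "jul", "jun", "may", "apr"]

rename_map = {}
for i, month in enumerate(months, start=1):
    # The repayment status columns skip 1: September is pay_0, August is pay_2
    status_suffix = 0 if i == 1 else i
    rename_map[f"pay_{status_suffix}"] = f"pay_delay_{month}"
    rename_map[f"bill_amt{i}"] = f"bill_amount_{month}"
    rename_map[f"pay_amt{i}"] = f"payment_amount_{month}"

print(rename_map["pay_0"], rename_map["bill_amt6"])
```

The resulting dictionary can be passed to `rename(columns=...)` in the same way as a hand-written one.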


### What Do Payment Delay Codes Mean?

The dataset includes six repayment status columns, `pay_0` and `pay_2` through `pay_6` (there is no `pay_1`), renamed for clarity to:

- `pay_delay_sep`
- `pay_delay_aug`
- `pay_delay_jul`
- `pay_delay_jun`
- `pay_delay_may`
- `pay_delay_apr`

These columns represent the **repayment status** of each customer in different months.

Each value in these columns means the following:

| Value | Meaning                            |
|-------|------------------------------------|
| -2    | No consumption                     |
| -1    | Paid on time                       |
|  0    | Duly paid                          |
|  1    | Payment delay for 1 month          |
|  2    | Payment delay for 2 months         |
|  3    | Payment delay for 3 months         |
|  4    | Payment delay for 4 months         |
|  5    | Payment delay for 5 months         |
|  6    | Payment delay for 6 months         |
|  7    | Payment delay for 7 months         |
|  8    | Payment delay for 8 months         |
|  9    | Payment delay for 9 months or more |

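Keeping these codes numeric makes it easier to analyze and model delay behavior later on. Note that the dataset's original documentation defines only -1 and 1 through 9; the meanings listed for -2 and 0 are widely used interpretations rather than official definitions.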
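
As one illustration of why numeric codes are convenient, simple comparisons are enough to derive delay features. The sketch below works on a made-up column; `is_delayed_sep` and `months_delayed_sep` are hypothetical names, not columns in the dataset.

```python
import pandas as pd

# Toy repayment status column (values are illustrative only)
toy = pd.DataFrame({"pay_delay_sep": [-2, -1, 0, 1, 3]})

# Codes of 1 or more mean the payment was late by that many months
toy["is_delayed_sep"] = toy["pay_delay_sep"] >= 1

# Months of delay, with the non-delay codes (-2, -1, 0) collapsed to 0
toy["months_delayed_sep"] = toy["pay_delay_sep"].clip(lower=0)
print(toy)
```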


In [19]:
# Map categorical variables
df["sex"] = df["sex"].map({1: "male", 2: "female"})

df["education"] = df["education"].map({
    1: "graduate school",
    2: "university",
    3: "high school",
    4: "others",
    5: "unknown",
    6: "unknown"
})

df["marriage"] = df["marriage"].map({
    1: "married",
    2: "single",
    3: "others",
})
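
One thing to watch with `.map()` is that any code missing from the dictionary becomes `NaN` rather than raising an error. The sketch below shows a possible check for such unmapped codes on a made-up column; the fallback label `"unknown"` is an assumption for illustration, not a choice made in this notebook.

```python
import pandas as pd

# Toy column containing a code (0) that the dictionary does not cover
toy = pd.DataFrame({"marriage": [1, 2, 3, 0]})
labels = {1: "married", 2: "single", 3: "others"}

mapped = toy["marriage"].map(labels)

# Codes with no dictionary entry end up as NaN after .map()
unmapped_codes = sorted(toy.loc[mapped.isna(), "marriage"].unique().tolist())
print(unmapped_codes)

# One option: give unmapped codes an explicit fallback label
toy["marriage_label"] = mapped.fillna("unknown")
```

Running this kind of check before overwriting the original column keeps the raw codes available for inspection.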


In [20]:
df.head()

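Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_delay_sep,pay_delay_aug,pay_delay_jul,pay_delay_jun,...,bill_amount_jun,bill_amount_may,bill_amount_apr,payment_amount_sep,payment_amount_aug,payment_amount_jul,payment_amount_jun,payment_amount_may,payment_amount_apr,default_next_month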
0,1,20000.0,female,university,married,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,female,university,single,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,female,university,single,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,female,university,married,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,male,university,married,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
