Per the UCI Machine Learning Repository, the following is the description of each of the variables:
* X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
* X2: Gender (1 = male; 2 = female).
* X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
* X4: Marital status (1 = married; 2 = single; 3 = others).
* X5: Age (year).
* X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - X6 = the repayment status in September, 2005
    - X7 = the repayment status in August, 2005; . . .
    - X11 = the repayment status in April, 2005. 
    - The measurement scale for the repayment status is: 
        - -1 = pay duly
        - 1 = payment delay for one month
        - 2 = payment delay for two months; . . .
        - 8 = payment delay for eight months
        - 9 = payment delay for nine months and above.
* X12-X17: Amount of bill statement (NT dollar)
    - X12 = amount of bill statement in September, 2005
    - X13 = amount of bill statement in August, 2005; . . .
    - X17 = amount of bill statement in April, 2005. 
* X18-X23: Amount of previous payment (NT dollar).
    - X18 = amount paid in September, 2005
    - X19 = amount paid in August, 2005 . . .
    - X23 = amount paid in April, 2005.

Based on the descriptions, payment history is available for six months (April to September 2005). Repayment status covers those same six months; its scale records how far behind a payment was, up to nine or more months of delay.

There are some interesting variables (education level, marital status, etc.) that can be explored to determine their effects on payment history.
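To keep the monthly columns straight, the sketch below builds a lookup from column name to the month it describes. It assumes the column names that appear once the file is loaded (`PAY_0`, `PAY_2` through `PAY_6`, `BILL_AMT1` through `BILL_AMT6`, `PAY_AMT1` through `PAY_AMT6`) follow the same September-to-April order as X6-X23 in the description; `column_month` is a name chosen here for illustration.

```python
# Months covered by the data, most recent first (per the variable descriptions).
months = ["September", "August", "July", "June", "May", "April"]

# Assumed column names: the status columns skip 1 (PAY_0, PAY_2, ..., PAY_6),
# while the two amount families run from 1 through 6.
status_cols = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
paid_cols = [f"PAY_AMT{i}" for i in range(1, 7)]

# Pair each column with its month, one family at a time.
column_month = {}
for cols in (status_cols, bill_cols, paid_cols):
    column_month.update(zip(cols, months))

print(column_month["PAY_0"])      # September
print(column_month["BILL_AMT6"])  # April
```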

In [99]:
# Load the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [100]:
# Load the dataset

credit = pd.read_excel('/home/bruce/Desktop/coding-projects/Other Datasets/default of credit card clients.xls', header=1)

The dataset loads without issues. Next is an initial inspection to check datatypes, missing values, etc.

In [101]:
# Check the head of the dataset
credit.head()

,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [102]:
# Verify there is no missing data
credit.isna().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64

In [103]:
# Verify datatypes
credit.dtypes

ID                            int64
LIMIT_BAL                     int64
SEX                           int64
EDUCATION                     int64
MARRIAGE                      int64
AGE                           int64
PAY_0                         int64
PAY_2                         int64
PAY_3                         int64
PAY_4                         int64
PAY_5                         int64
PAY_6                         int64
BILL_AMT1                     int64
BILL_AMT2                     int64
BILL_AMT3                     int64
BILL_AMT4                     int64
BILL_AMT5                     int64
BILL_AMT6                     int64
PAY_AMT1                      int64
PAY_AMT2                      int64
PAY_AMT3                      int64
PAY_AMT4                      int64
PAY_AMT5                      int64
PAY_AMT6                      int64
default payment next month    int64
dtype: object

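The following variables are stored as integers but hold category codes, so they need to be converted:
* X2, X3, X4 (SEX, EDUCATION, MARRIAGE) = categorical
* X6-X11 (PAY_0, PAY_2 through PAY_6) = categorical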
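The X-numbers from the description can be translated to column names programmatically. This is a minimal sketch, assuming X1-X23 line up with the feature columns in file order (that is, excluding `ID` and the target column); `x_names` and `categorical_cols` are illustrative names.

```python
# Feature columns in file order, excluding ID and the target column.
feature_cols = [
    "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE",
    "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6",
    "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
    "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
]

# Assumption: X1 is the first feature column, X2 the second, and so on.
x_names = {f"X{i}": col for i, col in enumerate(feature_cols, start=1)}

# X2, X3, X4 and X6 through X11 are the variables flagged as categorical.
categorical_cols = [x_names[f"X{i}"] for i in (2, 3, 4, *range(6, 12))]
print(categorical_cols)
```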

Initial inspection completed. Time to preprocess the data.

In [104]:
# Convert column names to lowercase to avoid errors
credit.columns = credit.columns.str.lower()
credit.columns

Index(['id', 'limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0',
       'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2',
       'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1',
       'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6',
       'default payment next month'],
      dtype='object')

In [105]:
## Convert sex, education, marriage, pay_0, pay_2, pay_3, pay_4, pay_5, pay_6 to categorical and verify change

# Define mapping for categorical variables
education_map = {1: 'Graduate', 2: 'University', 3: 'High School', 4: 'Others'}
gender_map = {1: 'Male', 2: 'Female'}
marriage_map = {1: 'Married', 2: 'Single', 3: 'Others'}
pay_map = {
    -1: 'Pay Duly',
    1: 'Delay 1 Month',
    2: 'Delay 2 Months',
    3: 'Delay 3 Months',
    4: 'Delay 4 Months',
    5: 'Delay 5 Months',
    6: 'Delay 6 Months',
    7: 'Delay 7 Months',
    8: 'Delay 8 Months',
    9: 'Delay 9+ Months'
}

# Define category lists
education_cats = ['Graduate', 'University', 'High School', 'Others']
gender_cats = ['Male', 'Female']
marriage_cats = ['Married', 'Single', 'Others']
pay_cats = [
    'Pay Duly',
    'Delay 1 Month',
    'Delay 2 Months',
    'Delay 3 Months',
    'Delay 4 Months',
    'Delay 5 Months',
    'Delay 6 Months',
    'Delay 7 Months',
    'Delay 8 Months',
    'Delay 9+ Months'
]

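# Apply mapping and categorical conversion in one step.
# Note: astype('category') infers the categories from the data and sorts them
# alphabetically, so gender_cats and marriage_cats are not used for sex and marriage.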
credit['sex'] = credit['sex'].map(gender_map).astype('category')

credit['education'] = pd.Categorical(
    credit['education'].map(education_map),
    categories=education_cats,
    ordered=True
)

credit['marriage'] = credit['marriage'].map(marriage_map).astype('category')

# For pay columns: map AND convert with explicit categories/order
for pay_col in ['pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6']:
    credit[pay_col] = pd.Categorical(
        credit[pay_col].map(pay_map),
        categories=pay_cats,
        ordered=True
    )

# Verify change to categorical
credit.dtypes

id                               int64
limit_bal                        int64
sex                           category
education                     category
marriage                      category
age                              int64
pay_0                         category
pay_2                         category
pay_3                         category
pay_4                         category
pay_5                         category
pay_6                         category
bill_amt1                        int64
bill_amt2                        int64
bill_amt3                        int64
bill_amt4                        int64
bill_amt5                        int64
bill_amt6                        int64
pay_amt1                         int64
pay_amt2                         int64
pay_amt3                         int64
pay_amt4                         int64
pay_amt5                         int64
pay_amt6                         int64
default payment next month       int64
dtype: object

In [106]:
# Verify categories
print("Education categories:", credit['education'].cat.categories.tolist())
print("Marriage categories:", credit['marriage'].cat.categories.tolist())
print("Gender categories:", credit['sex'].cat.categories.tolist())
print("Pay_0 categories:", credit['pay_0'].cat.categories.tolist())

Education categories: ['Graduate', 'University', 'High School', 'Others']
Marriage categories: ['Married', 'Others', 'Single']
Gender categories: ['Female', 'Male']
Pay_0 categories: ['Pay Duly', 'Delay 1 Month', 'Delay 2 Months', 'Delay 3 Months', 'Delay 4 Months', 'Delay 5 Months', 'Delay 6 Months', 'Delay 7 Months', 'Delay 8 Months', 'Delay 9+ Months']
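The marriage and gender categories print in alphabetical order because `astype('category')` infers and sorts the categories, whereas `pd.Categorical` with an explicit `categories` list keeps the order given. A small sketch on toy codes (not rows from the dataset) shows the difference:

```python
import pandas as pd

gender_map = {1: "Male", 2: "Female"}
codes = pd.Series([1, 2, 2, 1])  # toy codes for illustration

# astype('category') infers the categories from the data and sorts them.
inferred = codes.map(gender_map).astype("category")
print(inferred.cat.categories.tolist())  # ['Female', 'Male']

# pd.Categorical with an explicit list keeps the order as written.
explicit = pd.Categorical(codes.map(gender_map), categories=["Male", "Female"])
print(explicit.categories.tolist())  # ['Male', 'Female']
```

For unordered variables such as sex and marriage the difference is mostly cosmetic, but it can affect the order of groups in tables and plots.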


In [107]:
# Verify no data was "lost" during conversion
credit.isna().sum()

id                                0
limit_bal                         0
sex                               0
education                       345
marriage                         54
age                               0
pay_0                         17496
pay_2                         19512
pay_3                         19849
pay_4                         20803
pay_5                         21493
pay_6                         21181
bill_amt1                         0
bill_amt2                         0
bill_amt3                         0
bill_amt4                         0
bill_amt5                         0
bill_amt6                         0
pay_amt1                          0
pay_amt2                          0
pay_amt3                          0
pay_amt4                          0
pay_amt5                          0
pay_amt6                          0
default payment next month        0
dtype: int64

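## NEED TO ASSESS THE NA'S IDENTIFIED ABOVE

The conversion introduced these NaNs: `Series.map` returns NaN for any value that is not a key in the mapping dictionary. The head of the dataset already shows 0 in the PAY columns, which `pay_map` does not cover, and the counts for education (345) and marriage (54) indicate that those columns also contain codes outside the documented ones.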
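One way to assess the NaNs is to count, on the raw integer columns (before the mapping cell runs), which codes have no entry in the mapping. The helper `unmapped_codes` below is hypothetical, and the toy values are for illustration only, not taken from the real file.

```python
import pandas as pd


def unmapped_codes(raw: pd.Series, mapping: dict) -> pd.Series:
    """Count the raw values that have no key in `mapping`."""
    missing = raw[~raw.isin(list(mapping.keys()))]
    return missing.value_counts().sort_index()


# Toy data: a shortened mapping and a handful of made-up codes.
toy = pd.DataFrame({"pay_0": [2, -1, 0, 0, -2, 1]})
toy_pay_map = {-1: "Pay Duly", 1: "Delay 1 Month", 2: "Delay 2 Months"}

report = unmapped_codes(toy["pay_0"], toy_pay_map)
print(report)

# The number of unmapped values equals the number of NaNs that map() creates.
n_nan = toy["pay_0"].map(toy_pay_map).isna().sum()
print(n_nan == report.sum())
```

Running the same check per column on the freshly loaded data would show which undocumented codes account for each NaN count, which in turn informs whether to add them to the mappings or treat them as missing.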

### Code Not Used:

In [72]:
# Define the categories (the six pay columns share one list of labels)
pay_labels = ['Pay Duly', 'Delay One Month', 'Delay Two Months', 'Delay Three Months', 'Delay Four Months', 'Delay Five Months', 'Delay Six Months', 'Delay Seven Months', 'Delay Eight Months', 'Delay Nine Months']
cat_dict = {
    'education': ['Graduate', 'University', 'High School', 'Others'],
    'sex': ['Male', 'Female'],
    'marriage': ['Married', 'Single', 'Others'],
    **{col: pay_labels for col in ['pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6']}
}

# Convert to categorical using a loop over the dictionary itself,
# so every listed column is converted.
# Note: pd.Categorical sets any value not in `categories` to NaN, so this
# only works if the columns already hold the labels rather than integer codes.
for col, cats in cat_dict.items():
    credit[col] = pd.Categorical(
        credit[col],
        categories=cats
    )



