# BT2103 Project

## Introduction
The data set contains payment information of 30,000 credit card holders obtained from a bank in Taiwan. Each data sample is described by 23 feature attributes (columns B to X). The target feature (column Y) to be predicted is binary valued 0 (= not default) or 1 (= default).

We aim to predict whether a credit card holder will default on their payment in the next month (1 = default, 0 = no default).

There are 25 variables:

ID: ID of each client

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

SEX: Gender (1=male, 2=female)

EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

MARRIAGE: Marital status (1=married, 2=single, 3=others)

AGE: Age in years

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default.payment.next.month: Default payment (1=yes, 0=no)




## Importing Relevant Libraries

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

## Exploratory Data Analysis

In [None]:
# The column names are on the second row of the file, so use row index 1 as the header
df = pd.read_csv('card.csv', header=1)

In [None]:
# Get a glimpse of the data
df.head()

In [None]:
# Inspect the column types and non-null counts
df.info()

In [None]:
## Summary statistics of the data
## Here we can check whether any variable has values outside its expected range.
## For example, EDUCATION is expected to take values from 1 to 4, but other values occur.
df.describe()

From the output above, the dataset has no missing (null) values, but there are several anomalies that we have to deal with later on.
- EDUCATION contains the values 0, 5 and 6: 0 is not documented, and 5 and 6 are both labelled "unknown".
- MARRIAGE contains the value 0, which is not one of the documented labels.
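
The check behind the anomalies listed above could be sketched as follows. This is a minimal illustration, not part of the original notebook: the `valid_codes` dictionary and the `toy` frame (with made-up values) are assumptions standing in for the real `df`.

```python
import pandas as pd

# Codes with a meaningful documented label, per the variable list in the introduction.
# (Hypothetical helper dictionary, not part of the original notebook.)
valid_codes = {
    "EDUCATION": [1, 2, 3, 4],
    "MARRIAGE": [1, 2, 3],
}

# Toy stand-in for the real dataframe, with made-up values.
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 0, 5, 6, 3],
    "MARRIAGE": [1, 2, 0, 3, 1, 2],
})

# For each column, count the rows whose code is outside the documented set.
unexpected = {
    col: toy.loc[~toy[col].isin(codes), col].value_counts().sort_index().to_dict()
    for col, codes in valid_codes.items()
}
print(unexpected)
```

On the real data, replacing `toy` with `df` should list how often each unexpected code occurs, which helps decide how to recode it.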

In [None]:
## df.info() already reports the non-null count of every column; this is an explicit
## confirmation, since any variable that contains null values would need to be handled
## before we use it.
df.isnull().sum()
## Every column sums to 0, so there are no missing values.

In [None]:
# Create another column with a clearer header for defaulters
df['Defaulter'] = df['default payment next month']

In [None]:
## Drop the ID column: it is only a row identifier and carries no information about `default payment next month`

df = df.drop("ID", axis = 1)
df.head()

In [None]:
## Correlation of the variables
correlation_matrix = df.corr()
fig = plt.figure(figsize=(12,9))
sns.heatmap(correlation_matrix,vmax=0.8,square = True, cmap = "coolwarm")
plt.show()

## From the heatmap, the variables that show slightly higher correlation are
## LIMIT_BAL and PAY_AMT1 to PAY_AMT6.
## With that in mind, let's dive deeper into those variables.
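
A heatmap can be hard to read precisely, so one option is to list each variable's correlation with the target directly. The sketch below is an assumption-laden illustration: `toy` is randomly generated data, not the credit card dataset, and only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (NOT the real dataset); column names mirror the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=n),
    "PAY_AMT1": rng.integers(0, 20_000, size=n),
    "AGE": rng.integers(21, 80, size=n),
    "Defaulter": rng.integers(0, 2, size=n),
})

# Take the target's column of the correlation matrix, drop the target itself,
# and order the remaining variables by the absolute size of their correlation.
target_corr = (
    toy.corr()["Defaulter"]
    .drop("Defaulter")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Applied to `df`, the same expression would rank every variable by the strength of its linear relationship with `Defaulter`, including the sign of that relationship.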

### Analysis on Age of Credit Card Holders

In [None]:
df['AGE'].describe() ## the min and max show that all credit card holders are aged between 21 and 79

In [None]:
## Distribution plot of AGE
sns.displot(x='AGE', data = df, kde=True, aspect=2)
plt.xticks(rotation=0)
plt.ylabel('Count')
plt.title("Age distribution")

### Analysis on Marital Status of Credit Card Holders

In [None]:
df['MARRIAGE'].describe()

In [None]:
df.MARRIAGE.value_counts().plot(kind = 'bar') ## have 0 values

In [None]:
# The value 0 does not represent any documented category of marital status.
# Hence, we map 0 to 3 to categorise it under "others".
df['MARRIAGE'] = df['MARRIAGE'].replace({0: 3})
df['MARRIAGE'].value_counts()


In [None]:
# Pie chart of marital status, and bar chart of marital status split by whether the individual defaults
fig, axes = plt.subplots(ncols=2, figsize=(13, 8))
df['MARRIAGE'].value_counts().plot(kind="pie", ax=axes[0])
sns.countplot(x='MARRIAGE', hue='Defaulter', data=df, ax=axes[1])

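From the pie chart and bar graph above, we can see that the largest share of defaulters are single, followed by married and then others. The bar graph shows counts rather than default rates, so this ranking partly reflects how many card holders fall into each group.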
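
Because a count plot shows absolute numbers, the default rate within each group can tell a different story. The sketch below shows one way to compute both side by side. The `toy` frame contains made-up values purely for illustration, and the output column names (`defaulters`, `total`, `default_rate`) are hypothetical.

```python
import pandas as pd

# Made-up toy data (NOT the real dataset): 1 = married, 2 = single, 3 = others.
toy = pd.DataFrame({
    "MARRIAGE":  [1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
    "Defaulter": [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Number of defaulters, group size, and share of defaulters per marital status.
summary = toy.groupby("MARRIAGE")["Defaulter"].agg(
    defaulters="sum",
    total="count",
    default_rate="mean",  # mean of a 0/1 column is the proportion of 1s
)
print(summary)
```

Running the same `groupby` on `df` would give the default rate per marital status in the real data.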

### Analysis on Education of Credit Card Holders

In [None]:
df.EDUCATION.value_counts().plot(kind = "barh")

In [None]:
# From the data description, we know that in df.EDUCATION, 5 and 6 both represent "unknown",
# and 0 is not documented at all.
# Group 0, 5 and 6 under category 4 ("others") so that unknown values are not mixed
# with a real education level.
df['EDUCATION'] = df['EDUCATION'].replace({0: 4, 5: 4, 6: 4})
df.EDUCATION.value_counts()

In [None]:
# Pie chart of education level, and bar chart of education level split by whether the individual defaults
fig, axes = plt.subplots(ncols=2, figsize=(13, 8))
df['EDUCATION'].value_counts().plot(kind="pie", ax=axes[0])
sns.countplot(x='EDUCATION', hue='Defaulter', data=df, ax=axes[1])

From the pie chart and bar graph above, we can see that 

### Analysis on Gender of Credit Card Holders

In [None]:
df.SEX.value_counts().plot(kind = "barh")

In [None]:
# Pie chart of gender, and bar chart of gender split by whether the individual defaults
fig, axes = plt.subplots(ncols=2, figsize=(13, 8))
df['SEX'].value_counts().plot(kind="pie", ax=axes[0])
sns.countplot(x='SEX', hue='Defaulter', data=df, ax=axes[1])

From the pie chart and bar graph above, we can see that, in absolute numbers, more of the defaulters are female than male.
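
A larger count of defaulters in one group does not by itself mean a higher default rate, since the groups differ in size. A row-normalised cross-tabulation is one way to compare rates. The `toy` frame below uses made-up values chosen only to illustrate the distinction; it says nothing about the real data.

```python
import pandas as pd

# Made-up toy data (NOT the real dataset): 1 = male, 2 = female.
toy = pd.DataFrame({
    "SEX":       [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "Defaulter": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Absolute counts per gender and default status.
counts = pd.crosstab(toy["SEX"], toy["Defaulter"])

# normalize="index" divides each row by its total, giving rates within each gender.
rates = pd.crosstab(toy["SEX"], toy["Defaulter"], normalize="index")
print(counts)
print(rates)
```

In this toy example group 2 has more defaulters (3 versus 2) but a lower default rate (0.375 versus 0.5), which is why both views are worth checking on `df`.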

### Analysis on Amount of Given Credit of Credit Card Holders


In [None]:
df['LIMIT_BAL'].describe()

In [None]:
## Distribution plot of LIMIT_BAL
sns.displot(x='LIMIT_BAL', data = df, kde=True, aspect=2)
plt.xticks(rotation=0)
plt.ylabel('Count')
plt.title("Credit Distribution")

In [None]:
## Box plot 
plt.figure(figsize=(10,10))
sns.boxplot(x="Defaulter", y="LIMIT_BAL", data=df)

From our box plot above, we can see that defaulters generally have a lower median credit limit than non-defaulters. This is plausible, since customers who are judged more likely to default tend to be given lower credit limits.
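
The centre line of a box plot is the median, not the mean, and the two can differ noticeably for a skewed variable such as a credit limit. A minimal sketch of comparing both per group is shown below, using a made-up `toy` frame rather than the real data.

```python
import pandas as pd

# Made-up toy data (NOT the real dataset); one large limit skews the defaulter group.
toy = pd.DataFrame({
    "LIMIT_BAL": [20_000, 50_000, 80_000, 500_000, 100_000, 200_000, 300_000, 400_000],
    "Defaulter": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Median and mean credit limit for non-defaulters (0) and defaulters (1).
stats = toy.groupby("Defaulter")["LIMIT_BAL"].agg(["median", "mean"])
print(stats)
```

Here the defaulter group has a median of 65,000 but a mean of 162,500, so the choice of statistic matters when describing the plot.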

### Analysis on PAY_0 to PAY_6

In [None]:
# Let's see the value counts in column 'PAY_0'
df['PAY_0'].value_counts()

In [None]:
# Let's visualize the target column 'default payment next month'
plt.figure(figsize=(6,6))
sns.countplot(x='default payment next month', data=df)
plt.xticks([0,1], labels=["Not Defaulted", "Defaulted"])
plt.title("Target Distribution")
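
To put numbers on the target distribution shown in the plot, the class counts and shares can be computed directly. The sketch below uses a short made-up series in place of `df['default payment next month']`.

```python
import pandas as pd

# Made-up toy target (NOT the real dataset): 0 = not defaulted, 1 = defaulted.
target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="default payment next month")

# Absolute class counts and the share of each class.
counts = target.value_counts().sort_index()
shares = target.value_counts(normalize=True).sort_index()
print(counts)
print(shares)
```

If one class dominates in the real data, that imbalance is worth keeping in mind when choosing evaluation metrics later on.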