## Credit card payment status


In this notebook, we analyze credit card usage data for a set of customers of a bank in Taiwan. We then use this data to build models that predict whether a customer will default on next month's payment.

### Variables

* ID: Identifier of each client
* LIMIT_BAL: Credit line amount
* SEX: Gender (1=male, 2=female) 
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=other, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=other) 
* AGE: Age in years 
* DEFAULT - Default payment next month (Yes=1, No=0), stored in the column `default.payment.next.month`
* PAST_PAY - History of repayment status
    - PAST_PAY1 = the repayment status in September 2005
    - PAST_PAY2 = the repayment status in August 2005
    - PAST_PAY6 = the repayment status in April 2005
    - The measurement scale for the repayment status is:
        - `-1` = paid duly
        - `1` = payment delay for one month
        - `2` = payment delay for two months
        - ...
        - `8` = payment delay for eight months
        - `9` = payment delay for nine months and above
* BILL_AMT - Amount of bill statement (NT dollars)
    - BILL_AMT1 = amount of bill statement in September 2005
    - BILL_AMT2 = amount of bill statement in August 2005
    - BILL_AMT6 = amount of bill statement in April 2005
* PAY_AMT - Amount of previous payment (NT dollars)
    - PAY_AMT1 = amount paid in September 2005
    - PAY_AMT2 = amount paid in August 2005
    - PAY_AMT6 = amount paid in April 2005
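
The categorical variables above are stored as integer codes. As a minimal sketch of how those codes could be turned into readable labels, the snippet below maps `SEX` and `MARRIAGE` on a small made-up frame. The dictionaries `SEX_LABELS` and `MARRIAGE_LABELS` and the `toy` rows are hypothetical names and data introduced only for illustration.

```python
import pandas as pd

# Hypothetical label maps, copied from the variable list above.
SEX_LABELS = {1: "male", 2: "female"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "other"}

# Made-up rows, not taken from the real dataset.
toy = pd.DataFrame({"SEX": [1, 2, 2], "MARRIAGE": [1, 2, 3]})

# Map each integer code to its label; codes missing from a map become NaN.
decoded = toy.assign(
    SEX=toy["SEX"].map(SEX_LABELS),
    MARRIAGE=toy["MARRIAGE"].map(MARRIAGE_LABELS),
)
print(decoded)
```

Decoding is useful for plots and tables; the models further down still expect the numeric codes.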

### Libraries 

In [None]:
import csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns
import matplotlib.gridspec as gridspec
from sklearn import svm as sv
from warnings import filterwarnings
from scipy.stats import zscore
from scipy import stats
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
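# GradientBoostingClassifier and BaggingClassifier are used in the modeling section
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier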
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [None]:
filterwarnings("ignore", category=DeprecationWarning) 
filterwarnings("ignore", category=FutureWarning) 
filterwarnings("ignore", category=UserWarning)

### Importing Data 

In [None]:
data = pd.read_csv('UCI_Credit_Card.csv')
data


### Feature Engineering 

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# Check whether any client ID appears more than once
multis = data.ID.value_counts() 
multis = multis[multis.values>1] 
multis

In [None]:
# Index of fully duplicated rows (the first occurrence is not flagged)
duplicates_index = data[data.duplicated(keep="first")].index
duplicates_index

In [None]:
data['EDUCATION'].unique()

In [None]:
# Categories 4 (other), 5 (unknown) and 6 (unknown), plus the undocumented value 0,
# are grouped into the single class 4.
data['EDUCATION'] = data['EDUCATION'].replace({0: 4, 5: 4, 6: 4})
data['EDUCATION'].unique()

In [None]:
# Drop the first column (ID); it is only an identifier
data = data.drop(data.columns[0], axis = 1)
data.head()

In [None]:
data.columns

In [None]:
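# Outliers: keep only rows where every column lies within 3 standard deviations of its mean
z_scores = stats.zscore(data)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
data = data[filtered_entries]
# Number of rows remaining per column after filtering
data.count()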
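
To make the effect of the z-score filter concrete, here is a small sketch on made-up data. The `toy` frame and its values are assumptions for illustration only. The z-score is computed by hand with `ddof=0`, which is the default that `scipy.stats.zscore` uses.

```python
import numpy as np
import pandas as pd

# Made-up data: 20 ordinary credit limits plus one extreme value.
toy = pd.DataFrame({
    "LIMIT_BAL": list(np.linspace(10_000, 200_000, 20)) + [5_000_000],
    "AGE": [25, 30, 35, 40, 45] * 4 + [33],
})

# Column-wise z-score: (value - mean) / population standard deviation.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep a row only if all of its columns are within 3 standard deviations.
keep = (z.abs() < 3).all(axis=1)
filtered = toy[keep]
print(len(toy), "rows before,", len(filtered), "rows after")
```

Note that a row is dropped as soon as a single column is extreme, so applying the filter to many columns at once can remove a sizeable share of the data.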

### Data visualization

In [None]:
fig, axarr = plt.subplots(3, 2, figsize=(20, 12))
sns.countplot(x='SEX', hue = 'SEX', data = data, ax=axarr[0][0])
sns.countplot(x='EDUCATION', hue = 'EDUCATION', data = data, ax=axarr[0][1])
sns.countplot(x='MARRIAGE', hue = 'MARRIAGE', data = data, ax=axarr[1][0])
sns.countplot(x='AGE', data = data, ax=axarr[1][1])
sns.countplot(x='LIMIT_BAL', data = data, ax=axarr[2][0])
sns.countplot(x='PAY_AMT2', data = data, ax=axarr[2][1])

There are 27,000 credit card clients.

The average value for the amount of credit card limit is 167,484 NT dollars. The standard deviation is 129,658 NT dollars, ranging from 10,000 to 1M NT dollars.

Education level is mostly graduate school and university.

Most of the clients are either married or single (the "other" status is less frequent).

Average age is 35.5 years, with a standard deviation of 9.2.

Since the value 0 for default payment means "no default" and the value 1 means "default", the mean of 0.221 indicates that 22.1% of the credit card contracts default in the following month (we will verify this in later sections of this analysis).
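
The reasoning above, that the mean of a 0/1 column equals the share of ones, can be checked with a short sketch. The series `toy_default` is made up for illustration and is not drawn from the dataset.

```python
import pandas as pd

# Made-up binary target: 2 defaults out of 10 contracts.
toy_default = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], name="default")

# The mean of a 0/1 column is the fraction of ones.
rate_from_mean = toy_default.mean()

# The same fraction, obtained from normalized class counts.
rate_from_counts = toy_default.value_counts(normalize=True).loc[1]

print(rate_from_mean, rate_from_counts)
```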

In [None]:
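# Distribution plots for every column except SEX and MARRIAGE
con_col = data.drop(['SEX', 'MARRIAGE'], axis = 1)
for col in con_col.columns:
    plt.figure(figsize=(20,5))
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(data[col], kde=True, stat='density', color='b')
    plt.show()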

In [None]:
corr_pearson = data.corr(method = 'pearson')
fig = plt.figure(figsize = (14,8))
sns.heatmap(corr_pearson, annot=True, cmap='RdYlGn',
            vmin=-1, vmax=1)
plt.title('Pearson Correlation')
plt.show()

In [None]:
corr_spearman = data.corr(method = 'spearman')
fig = plt.figure(figsize = (14,8))
sns.heatmap(corr_spearman, annot=True, cmap='RdYlGn',
            vmin=-1, vmax=1)
plt.title('Spearman Correlation')
plt.show()
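
Both Pearson and Spearman correlations are computed above. As a rough illustration of why the two can differ, the sketch below uses a made-up monotone but nonlinear relationship. The `toy` frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
x = np.arange(1, 11)
toy = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson measures linear association on the raw values.
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman measures association between the ranks, so any strictly
# increasing relationship yields a coefficient of 1.
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

For skewed variables such as bill and payment amounts, the rank-based coefficient is less sensitive to extreme values.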

In [None]:
fig, ax = plt.subplots(figsize=(9,5))
data.corr()["PAY_AMT2"].sort_values(ascending=False).plot(kind="bar", ax=ax)

### Modeling 


In [None]:
x_data = data.drop("default.payment.next.month",axis=1)
y = data["default.payment.next.month"]
# Min-max scale every feature to [0, 1] using its column-wise minimum and maximum
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())
xTrain,xTest,yTrain,yTest = train_test_split(x,y,test_size=0.25,random_state=42)
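
The scaling above uses the minimum and maximum of all rows before the split, so the test rows influence the scaling. One possible alternative, sketched below on made-up data, is to split first and fit scikit-learn's `MinMaxScaler` on the training rows only. The arrays `X_toy` and `y_toy` are hypothetical and this is not the approach used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up features and binary labels.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# Learn each column's min and max from the training rows only.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)

# Test rows reuse the training min and max, so values may fall outside [0, 1].
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```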

In [None]:
lg = LogisticRegression().fit(xTrain,yTrain)
dtc = DecisionTreeClassifier().fit(xTrain,yTrain)
rdc = RandomForestClassifier(n_estimators=100).fit(xTrain,yTrain)
svm = sv.SVC(kernel='linear').fit(xTrain,yTrain)
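# XG is scikit-learn's GradientBoostingClassifier, not the XGBoost library
XG = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(xTrain,yTrain)
# The bagging model must be fitted before it can predict; the default base estimator is a decision tree
bg = BaggingClassifier(n_estimators = 10, random_state = 0).fit(xTrain,yTrain)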

In [None]:
models = [lg,dtc,rdc,svm, XG, bg]

In [None]:
for model in models: 
    name = model.__class__.__name__
    y_pred=model.predict(xTest)
    print(name + ": ")
    print("-" * 18)
    print("Accuracy:",metrics.accuracy_score(yTest, y_pred))
    print("Precision:",metrics.precision_score(yTest, y_pred))
    print("Recall:",metrics.recall_score(yTest, y_pred))
    print("F1 score:",metrics.f1_score(yTest, y_pred))
    matrix = confusion_matrix(yTest,y_pred, labels=[1,0])
    print('Confusion matrix : \n',matrix)
    print("-" * 35)
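
To show how the printed metrics relate to the confusion matrix, the sketch below recomputes precision, recall and F1 by hand on made-up labels and compares them with scikit-learn's functions. The arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels: 1 = default, 0 = no default.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_hat = np.array([1, 0, 0, 1, 0, 0, 0, 0])

# With labels=[1, 0] the positive class comes first, so the layout is
# [[TP, FN],
#  [FP, TN]]  (rows = actual, columns = predicted).
(tp, fn), (fp, tn) = confusion_matrix(y_true, y_hat, labels=[1, 0])

precision = tp / (tp + fp)  # share of predicted defaults that were real
recall = tp / (tp + fn)     # share of real defaults that were caught
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn.
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

Because only about a fifth of the contracts default, accuracy alone can look good for a model that rarely predicts a default, which is why recall and F1 are worth reading alongside it.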

In [None]:
for pred in models:
    Y_pred = pred.predict(xTest)
    fig = plt.figure(figsize=(10,7))

    plt.plot(np.arange(0,len(yTest)),sorted(yTest), c='b', label='Actual')
    plt.plot(np.arange(0,len(yTest)),sorted(Y_pred), c='r', label='Predicted')

    plt.title(
             f'Method = {pred}')
    plt.legend(loc='best')

    plt.show()