In [11]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn import preprocessing 
from sklearn.svm import SVC 
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import plot_confusion_matrix  # NOTE: removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay instead
from sklearn.decomposition import PCA

We are using data from the UCI Machine Learning Repository, specifically the [default of credit card clients dataset](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients), to predict whether someone will default on their credit card payments based on metrics such as sex, age, marital status and many others.

In [13]:
df = pd.read_csv('/Users/theoeudes/default_of_credit_card_clients.tsv', 
                 header=1, sep='\t') ## NOTE: The second line contains column names, so we skip the first line

df.rename({'default payment next month' : 'DEFAULT'}, axis='columns', inplace=True)
df.head(10)


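| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
| 5 | 6 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | ... | 19394 | 19619 | 20024 | 2500 | 1815 | 657 | 1000 | 1000 | 800 | 0 |
| 6 | 7 | 500000 | 1 | 1 | 2 | 29 | 0 | 0 | 0 | 0 | ... | 542653 | 483003 | 473944 | 55000 | 40000 | 38000 | 20239 | 13750 | 13770 | 0 |
| 7 | 8 | 100000 | 2 | 2 | 2 | 23 | 0 | -1 | -1 | 0 | ... | 221 | -159 | 567 | 380 | 601 | 0 | 581 | 1687 | 1542 | 0 |
| 8 | 9 | 140000 | 2 | 3 | 1 | 28 | 0 | 0 | 2 | 0 | ... | 12211 | 11793 | 3719 | 3329 | 0 | 432 | 1000 | 1000 | 1000 | 0 |
| 9 | 10 | 20000 | 1 | 3 | 2 | 35 | -2 | -2 | -2 | -2 | ... | 0 | 13007 | 13912 | 0 | 0 | 0 | 13007 | 1122 | 0 | 0 |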
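To make the effect of `header=1` concrete, here is a minimal sketch on a tiny in-memory file. The three-column layout and the `X1`, `X2`, `Y` labels on the first line are made up for illustration; only the idea (a first line we want to skip, followed by the real column names) mirrors the file we loaded.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: line 0 holds labels we do not want,
# line 1 holds the real column names, the remaining lines hold data.
raw = (
    "X1\tX2\tY\n"
    "LIMIT_BAL\tSEX\tdefault payment next month\n"
    "20000\t2\t1\n"
    "120000\t2\t1\n"
)

# header=1 uses the second line as column names and discards the line above it
toy = pd.read_csv(io.StringIO(raw), header=1, sep="\t")

# Same renaming idea as above, written without inplace=True
toy = toy.rename(columns={"default payment next month": "DEFAULT"})

print(toy.columns.tolist())
print(len(toy))
```

Only the two data lines end up as rows; the first line never reaches the dataframe.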


The dataset contains the following variables:

- **ID** : The ID number assigned to each customer
- **LIMIT_BAL** : Amount of the given credit
- **SEX** : Gender (1 = male; 2 = female)
- **EDUCATION** : Level of education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
- **MARRIAGE** : Marital status (1 = married; 2 = single; 3 = others)
- **AGE** : Age (years)
- **PAY_** : Repayment status for each of the past six months (April to September, 2005). In the dataset documentation: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- **BILL_AMT** : Amount of each of the last six bill statements: X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
- **PAY_AMT** : Amount of each of the last six payments: X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in April, 2005.
- **DEFAULT** : Whether the customer defaulted on the next month's payment (1 = yes; 0 = no)



## Data preprocessing

There are a lot of variables in our dataset, some more useful than others. We can drop **ID** because it doesn't give us any information for our task.

In [14]:
df.drop('ID', axis=1, inplace=True)

First and foremost, we have to check whether we have missing values and, if so, deal with them.

In [15]:
df.dtypes

LIMIT_BAL    int64
SEX          int64
EDUCATION    int64
MARRIAGE     int64
AGE          int64
PAY_0        int64
PAY_2        int64
PAY_3        int64
PAY_4        int64
PAY_5        int64
PAY_6        int64
BILL_AMT1    int64
BILL_AMT2    int64
BILL_AMT3    int64
BILL_AMT4    int64
BILL_AMT5    int64
BILL_AMT6    int64
PAY_AMT1     int64
PAY_AMT2     int64
PAY_AMT3     int64
PAY_AMT4     int64
PAY_AMT5     int64
PAY_AMT6     int64
DEFAULT      int64
dtype: object

All of our variables are stored as `int64`, which means no column contains a mix of letters and numbers. In other words, there are no **NA** values, or other character-based placeholders for missing data, in **df**.
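The dtype check can be complemented by an explicit count of missing values. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the dataset) to show what `isna()` and `select_dtypes()` would flag. Note that neither check catches numeric placeholder codes such as `0`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000],
    "EDUCATION": [2, 0, 1],        # the 0 is a numeric code, so it is NOT flagged below
    "AGE": [24.0, np.nan, 34.0],   # a real NaN forces the column to float64
})

# Number of NaN values in each column
na_counts = toy.isna().sum()
print(na_counts)

# Columns that are not numeric would hint at text placeholders such as "?" or "NA"
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

A column that is entirely `int64`, as in our data, cannot hold `NaN`, which is why the dtype listing alone already rules out the usual missing-value markers.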

Some of our variables are categorical, so we should also check them for values that could be standing in for missing data.

In [17]:
df['SEX'].unique(), df['EDUCATION'].unique() , df['MARRIAGE'].unique()

(array([2, 1]), array([2, 1, 3, 5, 4, 6, 0]), array([1, 2, 3, 0]))
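`unique()` tells us which codes exist, but not which of them are undocumented. A minimal sketch of that comparison follows, on a made-up frame. The sets of documented codes come from the variable list above, while the toy values and the `documented` and `unexpected` names are only illustrative.

```python
import pandas as pd

# Hypothetical toy data; the real columns live in df
toy = pd.DataFrame({
    "EDUCATION": [2, 1, 3, 0, 2, 5],
    "MARRIAGE": [1, 2, 0, 2, 1, 3],
})

# Codes described in the variable list above
documented = {
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

# For each column, collect the codes that appear in the data but not in the documentation
unexpected = {}
for col, valid in documented.items():
    counts = toy[col].value_counts().sort_index()
    unexpected[col] = [int(code) for code in counts.index if code not in valid]

print(unexpected)
```

Applied to the real data, the same comparison would flag `0` in both columns, and also `5` and `6` in **EDUCATION**, which the variable list does not describe either.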

# Missing Data Part 2: Dealing With Missing Data

Since scikit-learn's support vector machines do not support datasets with missing values, we need to figure out what to do with the 0s in **EDUCATION** and **MARRIAGE**, which are not among the documented categories and may represent missing data. We can either delete these customers from the training dataset, or impute values for the missing data. First let's see how many rows contain missing values.

In [19]:
len(df.loc[(df['EDUCATION'] == 0) | (df['MARRIAGE'] == 0)]), len(df)

(68, 30000)

68 rows out of 30,000 could contain missing values. Since that is only about 0.2% of the data, we still have far more samples than we need for a Support Vector Machine, so rather than imputing the missing values we can simply remove the rows that contain a 0 in **EDUCATION** or **MARRIAGE**.
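As a quick sanity check on the proportion, the arithmetic can be done directly from the two counts printed above (the variable names here are only illustrative):

```python
# Counts taken from the output of the previous cell
n_suspect = 68     # rows with a 0 in EDUCATION or MARRIAGE
n_total = 30000    # all rows in df

share = n_suspect / n_total
print(f"{share:.2%}")        # 0.23%

# Number of rows we expect to keep after removing the suspect ones
n_remaining = n_total - n_suspect
print(n_remaining)           # 29932
```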

In [20]:
df_cleaned = df.loc[(df['EDUCATION'] != 0) & (df['MARRIAGE'] != 0)]

In [22]:
len(df_cleaned)

29932

In [23]:
df_cleaned['EDUCATION'].unique() , df_cleaned['MARRIAGE'].unique()

(array([2, 1, 3, 5, 4, 6]), array([1, 2, 3]))

# Downsample the data

**Support Vector Machines** are great with small datasets, but not awesome with large ones, and this dataset, while not huge, is big enough to take a long time to optimize with **Cross Validation**. So we'll downsample both categories, customers who did and did not default, to 1,000 each.

To make sure we get **1,000** of each category out of the **29,932** remaining samples, we start by splitting the data into two **dataframes**, one for people that did not default and one for people that did.

In [24]:
df_no_default = df_cleaned[df_cleaned['DEFAULT'] == 0]
df_default = df_cleaned[df_cleaned['DEFAULT'] == 1]

In [25]:
df_no_default_downsampled = resample(df_no_default,
                                  replace=False,
                                  n_samples=1000,
                                  random_state=42)
len(df_no_default_downsampled)

df_default_downsampled = resample(df_default,
                                  replace=False,
                                  n_samples=1000,
                                  random_state=42)
len(df_default_downsampled)

1000

In [26]:
df_downsample = pd.concat([df_no_default_downsampled, df_default_downsampled])
len(df_downsample)

2000
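It is worth confirming that the two classes really are balanced after downsampling. The sketch below does this on a made-up frame. Note that it swaps in pandas' own `groupby(...).sample(...)` as a compact alternative to the split, `resample` and `concat` steps used above; it is not the method this notebook relies on, and the toy sizes are arbitrary.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 22 non-defaulters, 8 defaulters
toy = pd.DataFrame({
    "AGE": list(range(20, 50)),
    "DEFAULT": [0] * 22 + [1] * 8,
})

n_per_class = 5

# Draw the same number of rows from each class, without replacement
balanced = toy.groupby("DEFAULT").sample(n=n_per_class, random_state=42)

# Each class should now appear exactly n_per_class times
class_counts = balanced["DEFAULT"].value_counts()
print(class_counts.to_dict())

# Sampling without replacement means no original row is picked twice
print(balanced.index.is_unique)
```

On the real data, `df_downsample['DEFAULT'].value_counts()` would be the equivalent check.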

# Format Data Part 1: Split the Data into Dependent and Independent Variables

Now that we have taken care of the missing data, we are ready to start formatting the data for making a **Support Vector Machine**.

The first step is to split the data into two parts:
1. The columns of data that we will use to make classifications
2. The column of data that we want to predict.

We will use the conventional notation of `X` (capital **X**) to represent the columns of data that we will use to make classifications and `y` (lower case **y**) to represent the thing we want to predict. In this case, we want to predict **default** (whether or not someone defaulted on a payment).

**NOTE:** We deal with missing data before splitting into **X** and **y** because removing rows first guarantees that each row in **X** still corresponds to the appropriate value in **y**.

**ALSO NOTE:** In the code below we are using `copy()` to copy the data *by value*. Plain assignment in Python only creates a second reference to the same object, and some pandas selections may share data with the original. Using `copy()` ensures that the original data `df_downsample` is not modified when we modify `X` or `y`. In other words, if we make a mistake when we are formatting the columns for the **Support Vector Machine**, we can just re-copy `df_downsample`, rather than reload the original data and remove the missing values etc.

In [None]:
X = df_downsample.drop('DEFAULT', axis=1).copy() # alternatively: X = df_downsample.iloc[:,:-1].copy()
X.head()

y = df_downsample['DEFAULT'].copy()
y.head()
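The two notes above (row alignment and copying by value) can be illustrated on a tiny frame. This is only a sketch: `source`, `X_toy` and `y_toy` are hypothetical names and values, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for df_downsample, with a non-default index
source = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, 90000],
        "AGE": [24, 26, 34],
        "DEFAULT": [1, 1, 0],
    },
    index=[10, 11, 12],
)

X_toy = source.drop("DEFAULT", axis=1).copy()
y_toy = source["DEFAULT"].copy()

# Because both pieces come from the same rows, their indexes line up
same_rows = X_toy.index.equals(y_toy.index)
print(same_rows)  # True

# Dropping the label by name matches the positional alternative
# whenever DEFAULT is the last column
same_columns = X_toy.equals(source.iloc[:, :-1])
print(same_columns)  # True

# Editing the copies leaves the original frame untouched
X_toy.loc[10, "AGE"] = 99
y_toy.loc[10] = 0
print(source.loc[10, "AGE"], source.loc[10, "DEFAULT"])  # 24 1
```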