Task:
A mock data set. Please use the 23,000-record data set to predict whether a customer will pay back the loan or not.
Find the attached file.

Key:
X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
X4: Marital status (1 = married; 2 = single; 3 = others). 
X5: Age (year). 
X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
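
As a reading aid, the categorical codes in the key can be written as lookup tables. The following is a minimal sketch: the dictionary names and the tiny example frame are made up for illustration and are not part of the assignment data.

```python
import pandas as pd

# Code tables copied from the key above (the dictionary names are hypothetical).
GENDER = {1: "male", 2: "female"}
EDUCATION = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARITAL_STATUS = {1: "married", 2: "single", 3: "others"}

# Tiny made-up frame that reuses the column names of the data set.
sample = pd.DataFrame({"X2": [2, 1], "X3": [2, 4], "X4": [1, 3]})

# Map each numeric code to its label; codes missing from a table become NaN.
decoded = pd.DataFrame({
    "gender": sample["X2"].map(GENDER),
    "education": sample["X3"].map(EDUCATION),
    "marital_status": sample["X4"].map(MARITAL_STATUS),
})
print(decoded)
```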

Deliverable: A detailed explanation of the steps in a Jupyter notebook or a Word document, with plots and results. Code is to be submitted as well.

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Data Acquisition

In [3]:
df = pd.read_csv('Loan-data.csv')[:23000]
df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
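
The first row shown above holds descriptive labels (ID, LIMIT_BAL, ...) rather than data, which suggests the file has two header lines. One possible way to handle this at load time is to skip the second line, sketched here under that assumption. The in-memory `csv_text` and the name `demo` are made up for illustration, and the real file may be laid out differently.

```python
import io

import pandas as pd

# Made-up miniature of the assumed file layout:
# line 1 holds X1..Y, line 2 holds the descriptive labels, then the data rows.
csv_text = (
    ",X1,X2,Y\n"
    "ID,LIMIT_BAL,SEX,default payment next month\n"
    "1,20000,2,1\n"
    "2,120000,2,1\n"
    "3,90000,2,0\n"
)

# skiprows=[1] skips the label line (file lines are counted from 0);
# index_col=0 uses the ID column as the row index.
demo = pd.read_csv(io.StringIO(csv_text), skiprows=[1], index_col=0)
print(demo.dtypes)
```

With the label line skipped, pandas can infer numeric types for the remaining columns instead of treating every value as text.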


In [4]:
df.shape

(23000, 25)

In [7]:
df.tail()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
22995,22995,180000,2,2,2,44,0,0,0,0,...,0,0,5386,5000,1900,0,0,5386,1103,0
22996,22996,310000,2,1,2,35,0,0,0,0,...,181491,176661,173532,9088,7044,9072,11210,6116,5362,0
22997,22997,500000,2,1,1,39,-2,-2,-2,-2,...,3518,2208,2550,2584,4524,3518,2208,2550,5953,0
22998,22998,330000,2,2,2,41,0,0,0,0,...,86458,91512,100433,4179,5000,6458,6512,10433,5442,0
22999,22999,150000,2,4,3,49,1,-2,-1,-1,...,10410,2361,2868,4,7920,10426,0,2868,4384,0


In [8]:
# Row 0 holds the descriptive column labels (ID, LIMIT_BAL, ...), not a customer record
df1 = df.drop(0,axis=0)

In [9]:
# 'Unnamed: 0' is the customer ID, which is not a predictive feature
df2 = df1.drop('Unnamed: 0',axis=1)
df2.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
5,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [10]:
df2.shape

(22999, 24)
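
Because the label row was loaded as data, the columns may still hold strings after that row is dropped (the predictions further down appear as '0' strings, which is consistent with this). A hedged sketch of converting such columns to numbers, using a made-up stand-in frame called `demo` rather than the real `df2`:

```python
import pandas as pd

# Made-up stand-in for df2: every value is a string, as happens when a
# text row was part of each column at load time.
demo = pd.DataFrame({"X1": ["20000", "120000"], "X5": ["24", "26"], "Y": ["1", "0"]})

# pd.to_numeric converts one column; DataFrame.apply runs it on every column.
numeric = demo.apply(pd.to_numeric)
print(numeric.dtypes)
```

On the real data the equivalent call would be `df2.apply(pd.to_numeric)`, assuming every remaining cell parses as a number.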

# Training a model

## Train-test split

In [11]:
train, test = train_test_split(df2, test_size=0.2)

In [12]:
train.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
22010,210000,2,1,2,28,0,-1,-1,-1,-1,...,5499,2960,670,1512,8475,5499,2960,670,25594,0
12977,50000,1,2,1,35,1,-1,-1,-1,-2,...,0,0,0,1118,673,0,0,0,0,0
16766,30000,1,2,2,37,0,0,0,0,0,...,26453,26987,28760,1397,1435,1425,959,2201,1200,0
4640,230000,2,3,2,39,1,2,0,0,0,...,12134,0,0,0,3200,0,0,0,0,0
756,230000,1,2,2,58,0,0,0,0,0,...,116400,45643,51083,5600,5700,5000,1855,6400,0,0


In [13]:
X_train = train.drop('Y',axis=1)

In [14]:
y_train = train['Y']

In [15]:
X_test = test.drop('Y',axis=1)
y_test = test['Y']
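
The split above is random, so the rows in each part change on every run. A minimal sketch of a reproducible, class-balanced variant, assuming a fixed `random_state` and stratification on the target are acceptable here. The frame `demo` and the `X_tr`/`X_te` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with an imbalanced binary target (80 zeros, 20 ones).
demo = pd.DataFrame({"X1": np.arange(100), "Y": [0] * 80 + [1] * 20})

X = demo.drop("Y", axis=1)
y = demo["Y"]

# random_state fixes the shuffle; stratify=y keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(y_tr.mean(), y_te.mean())
```

Splitting features and target in one call also avoids the separate `drop('Y')` steps for the train and test frames.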

## Model Selection and training 

In [17]:
model = LogisticRegression()

In [18]:
model.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
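
The features here span very different ranges (credit amounts in the tens of thousands next to single-digit status codes). One common option, not used in this notebook, is to standardize the features before the logistic regression. The sketch below uses synthetic data and made-up names (`X_demo`, `y_demo`, `pipe`), and does not claim any particular effect on the real data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one large-valued column and one small-valued column.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(10_000, 500_000, size=200),  # credit-limit-like scale
    rng.integers(-2, 9, size=200),           # repayment-status-like scale
])
# Synthetic target driven only by the small-valued column.
y_demo = (X_demo[:, 1] >= 2).astype(int)

# Standardize each feature, then fit the logistic regression on the result.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)
print(pipe.score(X_demo, y_demo))
```

Wrapping both steps in a pipeline means the scaler is fitted on the training data only and then reused unchanged at prediction time.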

In [19]:
## Prediction

In [20]:
y_predict = model.predict(X_test)

In [21]:
y_predict.shape

(4600,)

In [22]:
y_test.shape

(4600,)

In [23]:
type(y_predict)

numpy.ndarray

In [24]:
type(y_test)

pandas.core.series.Series

In [25]:
y_predict

array(['0', '0', '0', ..., '0', '0', '0'], dtype=object)

In [26]:
# as_matrix() is deprecated in pandas; to_numpy() returns the same NumPy array
y_t = y_test.to_numpy()


In [27]:
y_t

array(['0', '0', '0', ..., '0', '0', '0'], dtype=object)

In [28]:
len(y_t)

4600

In [29]:
y_t[5]

'0'

In [30]:
y_predict[5]

'0'

In [31]:
len(y_predict)

4600

In [33]:
## Model evaluation

In [34]:
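# Count the test rows where the prediction matches the true label
correct = 0
for actual, predicted in zip(y_t, y_predict):
    if actual == predicted:
        correct = correct + 1
# Accuracy as a percentage of the test set
acc = correct/len(y_t) * 100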

In [35]:
acc

77.5
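
Accuracy alone can be hard to interpret when one class dominates, so it may help to compare it with the share of the most frequent class and to look at a confusion matrix. The sketch below uses made-up labels rather than the notebook's `y_t` and `y_predict`, so its numbers say nothing about the 77.5 above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up string labels, like the '0'/'1' values in y_t and y_predict.
y_true_demo = np.array(["0", "0", "0", "1", "1", "0", "0", "1"])
y_pred_demo = np.array(["0", "0", "0", "0", "1", "0", "0", "0"])

# Same quantity as the manual loop: percentage of matching labels.
acc_demo = accuracy_score(y_true_demo, y_pred_demo) * 100

# Accuracy of always predicting the most frequent class.
labels, counts = np.unique(y_true_demo, return_counts=True)
baseline = counts.max() / counts.sum() * 100

# Rows are true classes, columns are predicted classes, in the order of `labels=`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["0", "1"])
print(acc_demo, baseline)
print(cm)
```

If the model's accuracy is close to the baseline, the confusion matrix shows whether it is mostly predicting the majority class.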