# Logistic Regression
Linear Model

## Statlog (German Credit Data) Data Set

The dataset is available at the following link:
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

However, the original data mixes categorical, numerical, and binary attributes, whereas here we need a dataset with numerical data only. The MIT Data Mining course provides a version of the Statlog dataset in which the categorical attributes have been converted to binary. The Excel sheet can be found at the following link:
https://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/assignments/
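
The categorical-to-binary conversion mentioned above can be illustrated with a minimal sketch. The toy frame, the column name `purpose`, and its values are hypothetical stand-ins, not taken from the actual Excel sheet; `pd.get_dummies` is one common way to do this and is not necessarily how the MIT version was produced.

```python
import pandas as pd

# Hypothetical toy frame: "purpose" stands in for a categorical attribute
# of the original Statlog data; the values are made up for illustration.
raw = pd.DataFrame({
    "purpose": ["new_car", "used_car", "furniture", "new_car"],
    "duration": [6, 48, 12, 24],
})

# Expand the categorical column into one 0/1 indicator column per category.
encoded = pd.get_dummies(raw, columns=["purpose"], dtype=int)
print(encoded)
```

Each category becomes its own indicator column, which mirrors how the purpose of credit appears as separate features (NEW_CAR, USED_CAR, FURNITURE, ...) in the list below.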

Number of instances : 1000
Number of Features : 30

Feature information :
1. CHK_ACCOUNT : Checking account status ({0:<0DM}, {1:0<=X<200DM}, {2:>=200DM}, {3:No checking account})
2. DURATION : Duration of credit in months
3. HISTORY : Credit history
4. NEW_CAR : Purpose of credit
5. USED_CAR : Purpose of credit
6. FURNITURE : Purpose of credit
7. TV : Purpose of credit
8. EDUCATION : Purpose of credit
9. RETRAIN : Purpose of credit
10. AMT : Credit amount
11. SAV_ACCT : Average balance in savings account ({0:<100DM}, {1:100<=X<500DM}, {2:500<=X<1000DM}, {3:>=1000DM}, {4:Unknown/no savings acc})
12. EMPLOYMENT : Present employment since ({0:unemployed}, {1:<1year}, {2:1<=X<4years}, {3:4<=X<7years}, {4:>= 7years})
13. INSTALL_RATE : Installment rate as % of disposable income
14. MALE_DIV : Applicant is male and divorced
15. MALE_SINGLE : Applicant is male and single
16. MALE_MAR_WID : Applicant is male and married or a widower
17. CO-APPLICANT : Applicant has a co-applicant
18. GUARANTOR : Applicant has a guarantor
19. PRESENT_RESIDENT : Present resident since - years ({0:<=1year}, {1:<=2years}, {2:<=3years}, {3:>4years})
20. REAL_ESTATE : Applicant owns real estate
21. PROP_UNKN_NONE : Applicant owns no property
22. AGE : Age in years
23. OTHER_INSTALL : Applicant has other installment plan credit
24. RENT : Applicant rents
25. OWN_RES : Applicant owns residence
26. NUM_CREDITS : Number of existing credits at this bank
27. JOB : Nature of job ({0:Unemployed}, {1:unskilled-resident}, {2:skilled employee}, {3:management / self-employment})
28. NUM_DEPENDENTS : Number of people for whom liable to provide maintenance
29. TELEPHONE : Applicant has phone in his or her name
30. FOREIGN : Foreign worker

In [1]:
#Library

import numpy as np
import pandas as pd
from sklearn.preprocessing import binarize
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.utils.extmath import safe_sparse_dot

## Loading and Pre-processing the data

In [33]:
url = "data-credit-german.csv"
names = ['CHK_ACCT', 'DURATION', 'CRED_HIST', 'NEW_CAR', 'USER_CAR', 'FURN', 'TV', 'EDU', 'RETRAIN', 'AMT', 'SAV_ACCT', 'EMPLOYM', 'INSTALL_RATE', 'MALE_DIV', 'MALE_SING', 'MALE_MAR_WID', 'CO-APP','GUARANTOR', 'PRES_RESIDENT', 'REAL_ESTATE', 'PROP_UNKN_NONE','AGE','OTHER_INSTALL', 'RENT', 'OWN_RES', 'NUM_CRED','JOB','NUM_DEPENDEN', 'TEL', 'FOREIGN', 'RESPONSE']

cred_df = pd.read_csv(url, names=names)
cred_df.shape

(1000, 31)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(cred_df.iloc[:, :30], cred_df.iloc[:, 30], test_size=0.30)
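
The split above is called without `random_state`, so the train and test sets (and every number printed below) change from run to run. A minimal sketch of a reproducible, stratified split follows; the synthetic arrays and the 70% class ratio are assumptions used only to make the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 1000 rows, 30 features, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 30))
y_demo = (rng.random(1000) < 0.7).astype(int)  # assumed class ratio, for illustration

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```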

## Train the data

In [4]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [5]:
y_pred = clf.predict(X_test)

In [6]:
accuracy_score(y_pred, y_test)

0.76333333333333331
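
A single 70/30 split yields one accuracy estimate that depends on which rows land in the test set. Since `KFold` and `cross_val_score` are already imported, a k-fold estimate could look like the sketch below. It runs on synthetic data from `make_classification` (an assumption, so that the block is self-contained), and the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary problem standing in for the credit data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Five shuffled folds; each fold serves once as the held-out set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring="accuracy")

# Mean and spread of the per-fold accuracies.
print(scores.mean(), scores.std())
```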

In [7]:
y_pred_proba = clf.predict_proba(X_test)
y_pred_proba.shape

(300, 2)

In [8]:
y_pred_log_proba = clf.predict_log_proba(X_test)
y_pred_log_proba.shape

(300, 2)
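
The two arrays above are closely related: for a binary `LogisticRegression` with default settings, the class-1 probability is the sigmoid of the decision function $w \cdot x + b$, and `predict_log_proba` is the logarithm of `predict_proba`. The following sketch checks this on synthetic data (the data and the name `model` are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)          # columns: P(class 0), P(class 1)
log_proba = model.predict_log_proba(X_demo)
z = model.decision_function(X_demo)          # w . x + b for each row

# Class-1 probability as the sigmoid of the decision function (binary case).
sigmoid = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(proba[:, 1], sigmoid))
print(np.allclose(log_proba, np.log(proba)))
```

This is why the per-feature products $w_ia_i$ used in the next section can be read as contributions to the log-odds of class 1.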

## Calculate the Evidence

(1) For an object $x$ with attribute values $a_i$ and model coefficients $w_i$, let $\mathcal{P}_x$ and $\mathcal{N}_x$ be the sets of attribute values that contribute positive evidence and negative evidence, respectively. <br> <br>
$ \mathcal{P}_x = \{a_i \mid w_ia_i>0\} $ <br>
$ \mathcal{N}_x = \{a_i \mid w_ia_i<0\} $ <br><br>
(2) The total evidence for each object can then be calculated as follows: <br><br>
$ E^{+1}(x) = \sum_{a_i\in\mathcal{P}_x}w_ia_i$<br>
$ E^{-1}(x) = - \sum_{a_i\in\mathcal{N}_x}w_ia_i$
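
As a small worked example of these definitions, the sketch below computes both totals for a single object. The weights `w` and attribute values `a` are hypothetical numbers chosen for easy arithmetic, not values from the fitted model.

```python
import numpy as np

# Hypothetical weights and attribute values for one object (not from the model).
w = np.array([0.5, -1.0, 2.0, -0.25])
a = np.array([2.0, 1.0, 0.0, 4.0])

contrib = w * a                       # w_i * a_i -> [1.0, -1.0, 0.0, -1.0]
e_pos = contrib[contrib > 0].sum()    # E^{+1}(x): sum over the set P_x
e_neg = -contrib[contrib < 0].sum()   # E^{-1}(x): sum over N_x, as a positive magnitude
print(e_pos, e_neg)  # 1.0 2.0
```

The third attribute has a zero product and therefore belongs to neither set. Note also that $E^{+1}(x) - E^{-1}(x) = \sum_i w_ia_i$, so adding the intercept recovers the model's decision function.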

In [9]:
# array, shape (1, n_features) or (n_classes, n_features)
# Coefficient of the features in the decision function.

clf.coef_.shape

(1, 30)

In [11]:
clf.intercept_

array([ 0.21206387])

In [13]:
'''Copy the data X and coefficient w_i'''
X = np.copy(X_test)
coef_ = np.copy(clf.coef_)
n_samples, n_features = X.shape

'''Initialize X_ev : x_i * w_i'''
X_ev = np.zeros((n_samples, n_features))

In [37]:
'''Get evidence'''
'''Calculate the w_ia_i'''
for idx in range(n_samples):
    X_ev[idx, :] = X[idx,:] * coef_

'''Generate the sets of P and N'''
X_pos_ev = X_ev * (X_ev > 0)
X_neg_ev = X_ev * (X_ev < 0)

'''Sum each of the sets P and N'''
pos_ev = np.sum(X_pos_ev, axis=1)
neg_ev = -np.sum(X_neg_ev, axis=1)
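
The same computation can be written without the explicit loop by relying on NumPy broadcasting, and it can be sanity-checked against the model: positive evidence minus negative evidence plus the intercept should equal `decision_function`. The sketch below uses synthetic data and the hypothetical names `model`, `ev`, `pos`, and `neg` so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
model = LogisticRegression().fit(X_demo, y_demo)

# Broadcasting: (n_samples, n_features) * (1, n_features) gives w_i * a_i per cell.
ev = X_demo * model.coef_

# Keep only positive (resp. negative) contributions, then sum per object.
pos = np.where(ev > 0, ev, 0.0).sum(axis=1)
neg = -np.where(ev < 0, ev, 0.0).sum(axis=1)

# Consistency check: E+ - E- + intercept should match the decision function.
recon = pos - neg + model.intercept_[0]
print(np.allclose(recon, model.decision_function(X_demo)))
```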


### The most positive object with respect to the probabilities.

In [15]:
most_pos_obj_idx = np.argmax(y_pred_proba[:,1])

In [30]:
print('Index of the object : ', most_pos_obj_idx)
print(X_test.iloc[most_pos_obj_idx, :].T)
print('Class : ', y_test.iloc[most_pos_obj_idx])
print('Predict Class : ', y_pred[most_pos_obj_idx])
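# Note: this cell prints np.log of the evidence totals; the later cells print
# the raw sums under the same labels.
print('a) Total  positive log-evidence : ', np.log(pos_ev[most_pos_obj_idx]))
print('b) Total negative log-evidence : ', np.log(neg_ev[most_pos_obj_idx]))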
print('c) Probability distribution', y_pred_proba[most_pos_obj_idx])

print(pos_ev[most_pos_obj_idx])
feature_pos = X_ev[most_pos_obj_idx,:]
pos_list = np.argsort(feature_pos)[::-1]
feature_neg = X_ev[most_pos_obj_idx,:]
neg_list = np.argsort(feature_neg)

print('d) Top 3 features values that contribute most to the positive evidence')
for i in range(0,3):
    print('\t',names[pos_list[i]], '\t Value : ', X_test.iloc[most_pos_obj_idx, pos_list[i]], '\t Evidence Value : ', np.sort(feature_pos)[::-1][i])

print('e) Top 3 features values that contribute most to the negative evidence')
for j in range(0,3):
    print('\t',names[neg_list[j]], '\t Value : ', X_test.iloc[most_pos_obj_idx, neg_list[j]], '\t Evidence Value : ', np.sort(feature_neg)[j])

print(np.sort(feature_pos)[::-1])
print(np.sort(feature_neg))

Index of the object :  23
CHK_ACCT             3
DURATION             6
CRED_HIST            4
NEW_CAR              0
USER_CAR             0
FURN                 0
TV                   1
EDU                  0
RETRAIN              0
AMT               1898
SAV_ACCT             4
EMPLOYM              2
INSTALL_RATE         1
MALE_DIV             0
MALE_SING            1
MALE_MAR_WID         0
CO-APP               0
GUARANTOR            0
PRES_RESIDENT        2
REAL_ESTATE          1
PROP_UNKN_NONE       0
AGE                 34
OTHER_INSTALL        0
RENT                 0
OWN_RES              1
NUM_CRED             2
JOB                  1
NUM_DEPENDEN         2
TEL                  0
FOREIGN              0
Name: 159, dtype: int64
Class :  1
Predict Class :  1
a) Total  positive log-evidence :  1.88657684149
b) Total negative log-evidence :  0.349474091287
c) Probability distribution [ 0.00453904  0.99546096]
6.59674827154
d) Top 3 features values that contribute most to the positive ev
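
The ranking logic in the cell above is repeated for every object inspected in the following sections. One way to factor it out is a small helper such as the hypothetical `top_contributions` below, shown here on made-up contributions rather than on `X_ev`.

```python
import numpy as np

def top_contributions(contrib, names, k=3):
    """Return the k largest and the k smallest contributions as (name, value) pairs."""
    order = np.argsort(contrib)  # ascending: most negative first
    most_negative = [(names[i], float(contrib[i])) for i in order[:k]]
    most_positive = [(names[i], float(contrib[i])) for i in order[::-1][:k]]
    return most_positive, most_negative

# Made-up per-feature evidence for a single object.
demo_names = ["f0", "f1", "f2", "f3", "f4"]
demo_contrib = np.array([0.4, -1.2, 2.5, 0.0, -0.3])

top_pos, top_neg = top_contributions(demo_contrib, demo_names, k=2)
print(top_pos)  # [('f2', 2.5), ('f0', 0.4)]
print(top_neg)  # [('f1', -1.2), ('f4', -0.3)]
```

Like the notebook code, this ranks purely by value, so if an object has fewer than `k` positive contributions the "positive" list will also contain zero or negative entries.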

### The most negative object with respect to the probabilities.

In [17]:
most_neg_obj_idx = np.argmin(y_pred_proba[:,1])

In [18]:
print('Index of the object : ', most_neg_obj_idx)
print(X_test.iloc[most_neg_obj_idx, :])
print('Class : ', y_test.iloc[most_neg_obj_idx])
print('Predict Class : ', y_pred[most_neg_obj_idx])
print('a) Total  positive log-evidence : ', pos_ev[most_neg_obj_idx])
print('b) Total negative log-evidence : ', neg_ev[most_neg_obj_idx])
print('c) Probability distribution', y_pred_proba[most_neg_obj_idx])

feature_pos = X_ev[most_neg_obj_idx,:]
pos_list = np.argsort(feature_pos)[::-1]
feature_neg = X_ev[most_neg_obj_idx,:]
neg_list = np.argsort(feature_neg)

print('d) Top 3 features values that contribute most to the positive evidence')
for i in range(0,3):
    print('\t',pos_list[i], '\t Evidence Value : ', np.sort(feature_pos)[::-1][i])
 
print('e) Top 3 features values that contribute most to the negative evidence')
for j in range(0,3):
    print('\t',neg_list[j], '\t Evidence Value : ', np.sort(feature_neg)[j])

print(np.sort(feature_pos)[::-1])     
print(np.sort(feature_neg))

Index of the object :  294
CHK_ACCT             1
DURATION            60
CRED_HIST            2
NEW_CAR              0
USER_CAR             0
FURN                 0
TV                   0
EDU                  1
RETRAIN              0
AMT               6288
SAV_ACCT             0
EMPLOYM              2
INSTALL_RATE         4
MALE_DIV             0
MALE_SING            1
MALE_MAR_WID         0
CO-APP               0
GUARANTOR            0
PRES_RESIDENT        4
REAL_ESTATE          0
PROP_UNKN_NONE       1
AGE                 42
OTHER_INSTALL        0
RENT                 0
OWN_RES              0
NUM_CRED             1
JOB                  2
NUM_DEPENDEN         1
TEL                  0
FOREIGN              0
Name: 938, dtype: int64
Class :  0
Predict Class :  0
a) Total  positive log-evidence :  2.96594236357
b) Total negative log-evidence :  5.75603249665
c) Probability distribution [ 0.92943393  0.07056607]
d) Top 3 features values that contribute most to the positive evidence
	 2 	 E

### The object that has the largest positive evidence.

In [19]:
most_pos_ev_idx = np.argmax(pos_ev)

In [20]:
print('Index of the object : ', most_pos_ev_idx)
print(X_test.iloc[most_pos_ev_idx, :])
print('Class : ', y_test.iloc[most_pos_ev_idx])
print('Predict Class : ', y_pred[most_pos_ev_idx])
print('a) Total  positive log-evidence : ', pos_ev[most_pos_ev_idx])
print('b) Total negative log-evidence : ', neg_ev[most_pos_ev_idx])
print('c) Probability distribution', y_pred_proba[most_pos_ev_idx])

feature_pos = X_ev[most_pos_ev_idx,:]
pos_list = np.argsort(feature_pos)[::-1]
feature_neg = X_ev[most_pos_ev_idx,:]
neg_list = np.argsort(feature_neg)

print('d) Top 3 features values that contribute most to the positive evidence')
for i in range(0,3):
    print('\t',pos_list[i], '\t Evidence Value : ', np.sort(feature_pos)[::-1][i])
 
print('e) Top 3 features values that contribute most to the negative evidence')
for j in range(0,3):
    print('\t',neg_list[j], '\t Evidence Value : ', np.sort(feature_neg)[j])

print(np.sort(feature_pos)[::-1])     
print(np.sort(feature_neg))

Index of the object :  267
CHK_ACCT             3
DURATION            24
CRED_HIST            4
NEW_CAR              0
USER_CAR             1
FURN                 0
TV                   0
EDU                  0
RETRAIN              0
AMT               2197
SAV_ACCT             4
EMPLOYM              3
INSTALL_RATE         4
MALE_DIV             0
MALE_SING            1
MALE_MAR_WID         0
CO-APP               0
GUARANTOR            0
PRES_RESIDENT        4
REAL_ESTATE          0
PROP_UNKN_NONE       0
AGE                 43
OTHER_INSTALL        0
RENT                 0
OWN_RES              1
NUM_CRED             2
JOB                  2
NUM_DEPENDEN         2
TEL                  1
FOREIGN              0
Name: 406, dtype: int64
Class :  1
Predict Class :  1
a) Total  positive log-evidence :  7.7530083369
b) Total negative log-evidence :  3.37181904253
c) Probability distribution [ 0.0100185  0.9899815]
d) Top 3 features values that contribute most to the positive evidence
	 2 	 Evid

### The object that has the largest negative evidence.

In [21]:
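# neg_ev stores the negative evidence as a positive magnitude, so np.argmin
# selects the object with the smallest total negative evidence;
# np.argmax(neg_ev) would select the largest.
most_neg_ev_idx = np.argmin(neg_ev)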

In [22]:
print('Index of the object : ', most_neg_ev_idx)
print(X_test.iloc[most_neg_ev_idx, :])
print('Class : ', y_test.iloc[most_neg_ev_idx])
print('Predict Class : ', y_pred[most_neg_ev_idx])
print('a) Total  positive log-evidence : ', pos_ev[most_neg_ev_idx])
print('b) Total negative log-evidence : ', neg_ev[most_neg_ev_idx])
print('c) Probability distribution', y_pred_proba[most_neg_ev_idx])

feature_pos = X_ev[most_neg_ev_idx,:]
pos_list = np.argsort(feature_pos)[::-1]
feature_neg = X_ev[most_neg_ev_idx,:]
neg_list = np.argsort(feature_neg)

print('d) Top 3 features values that contribute most to the positive evidence')
for i in range(0,3):
    print('\t',pos_list[i], '\t Evidence Value : ', np.sort(feature_pos)[::-1][i])
 
print('e) Top 3 features values that contribute most to the negative evidence')
for j in range(0,3):
    print('\t',neg_list[j], '\t Evidence Value : ', np.sort(feature_neg)[j])

print()     


Index of the object :  161
CHK_ACCT             3
DURATION             6
CRED_HIST            4
NEW_CAR              0
USER_CAR             0
FURN                 0
TV                   1
EDU                  0
RETRAIN              0
AMT               1237
SAV_ACCT             1
EMPLOYM              2
INSTALL_RATE         1
MALE_DIV             0
MALE_SING            0
MALE_MAR_WID         0
CO-APP               0
GUARANTOR            0
PRES_RESIDENT        1
REAL_ESTATE          0
PROP_UNKN_NONE       0
AGE                 27
OTHER_INSTALL        0
RENT                 0
OWN_RES              1
NUM_CRED             2
JOB                  2
NUM_DEPENDEN         1
TEL                  0
FOREIGN              0
Name: 492, dtype: int64
Class :  1
Predict Class :  1
a) Total  positive log-evidence :  4.81366310417
b) Total negative log-evidence :  1.19604463902
c) Probability distribution [ 0.02125493  0.97874507]
d) Top 3 features values that contribute most to the positive evidence
	 2 	 E

### The most uncertain object with respect to the probabilities.

In [23]:
uncertain_idx = np.argmin(np.square(y_pred_proba[:,1]-0.5))
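
The line above picks the object whose class-1 probability is closest to 0.5. For a binary problem, the squared distance, the absolute distance, and the maximum of the binary entropy all select the same object; the sketch below illustrates this on hypothetical probabilities (the array `p1` is made up).

```python
import numpy as np

# Hypothetical class-1 probabilities for five objects.
p1 = np.array([0.95, 0.48, 0.10, 0.62, 0.51])

# Closest to 0.5 by squared distance (as in the notebook) and by absolute distance.
by_square = np.argmin(np.square(p1 - 0.5))
by_abs = np.argmin(np.abs(p1 - 0.5))

# Binary entropy is largest when the probability is closest to 0.5.
entropy = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))
by_entropy = np.argmax(entropy)

print(by_square, by_abs, by_entropy)  # 4 4 4
```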

In [24]:
print('Index of the object : ', uncertain_idx)
print(X_test.iloc[uncertain_idx, :])
print('Class : ', y_test.iloc[uncertain_idx])
print('Predict Class : ', y_pred[uncertain_idx])

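# Flag a misclassification. The message is printed for any mismatch, so it also
# appears when the true class is 1 and the prediction is 0 (a false negative
# if class 1 is taken as the positive class).
if y_test.iloc[uncertain_idx] != y_pred[uncertain_idx]:
    print('\t \t \t \t \t False Positive')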

print('a) Total  positive log-evidence : ', pos_ev[uncertain_idx])
print('b) Total negative log-evidence : ', neg_ev[uncertain_idx])
print('c) Probability distribution', y_pred_proba[uncertain_idx])

feature_pos = X_ev[uncertain_idx,:]
pos_list = np.argsort(feature_pos)[::-1]
feature_neg = X_ev[uncertain_idx,:]
neg_list = np.argsort(feature_neg)

print('d) Top 3 features values that contribute most to the positive evidence')
for i in range(0,3):
    print('\t',pos_list[i], '\t Evidence Value : ', np.sort(feature_pos)[::-1][i])
    
print('e) Top 3 features values that contribute most to the negative evidence')
for j in range(0,3):
    print('\t',neg_list[j], '\t Evidence Value : ', np.sort(feature_neg)[j])

Index of the object :  277
CHK_ACCT             1
DURATION            24
CRED_HIST            4
NEW_CAR              1
USER_CAR             0
FURN                 0
TV                   0
EDU                  0
RETRAIN              0
AMT               3878
SAV_ACCT             1
EMPLOYM              1
INSTALL_RATE         4
MALE_DIV             1
MALE_SING            0
MALE_MAR_WID         0
CO-APP               0
GUARANTOR            0
PRES_RESIDENT        2
REAL_ESTATE          0
PROP_UNKN_NONE       0
AGE                 37
OTHER_INSTALL        0
RENT                 0
OWN_RES              1
NUM_CRED             1
JOB                  2
NUM_DEPENDEN         1
TEL                  1
FOREIGN              0
Name: 284, dtype: int64
Class :  1
Predict Class :  0
	 	 	 	 	 False Positive
a) Total  positive log-evidence :  3.96369483383
b) Total negative log-evidence :  4.17913592814
c) Probability distribution [ 0.5008443  0.4991557]
d) Top 3 features values that contribute most to the po
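
In the output above the object has true class 1 but predicted class 0. If class 1 is taken as the positive class (an assumption), that is a false negative rather than a false positive. A minimal sketch of how the two error types can be told apart with `confusion_matrix` follows; the label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, with class 1 treated as positive.
y_true_demo = np.array([1, 1, 0, 0, 1, 0, 1, 1])
y_pred_demo = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# With labels=[0, 1], rows are true classes and columns are predictions:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 2 1 2 3
```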