# **Fitting a Logistic Regression Model and Directly Using the Coefficients**
In this project, we train a logistic regression model on the two most important features identified during univariate feature exploration, then implement logistic regression manually using the coefficients from the fitted model. This shows how you could use logistic regression in a computing environment where scikit-learn is not available but the mathematical functions needed to compute the sigmoid are. On successful completion of the challenge, the ROC AUC computed from scikit-learn's predictions and the ROC AUC computed from your manual predictions should be identical: approximately 0.63.


----------------------------------------------------------------------------------
**Run the following two cells before you begin.**

In [1]:
%autosave 10

Autosaving every 10 seconds


In [2]:
import pandas as pd
import numpy as np

______________________________________________________________________
**First, import your data set and define the sigmoid function.**
<details>
    <summary>Hint:</summary>
    The definition of the sigmoid is $f(X) = \frac{1}{1 + e^{-X}}$.
</details>

In [3]:
# Import the data set
df=pd.read_csv('cleaned_data.csv')
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month,EDUCATION_CAT,graduate school,high school,others,university
0,798fc410-45c1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,university,0,0,0,1
1,8a8c8f3b-8eb4,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,university,0,0,0,1
2,85698822-43f5,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0,university,0,0,0,1
3,0737c11b-be42,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0,university,0,0,0,1
4,3b7f77cc-dbc0,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0,university,0,0,0,1


In [4]:
# Define the sigmoid function
def sigmoid(x):
    return 1/(1+np.exp(-x))
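
For very negative inputs, `np.exp(-x)` in the definition above can overflow and trigger a NumPy runtime warning, even though the result still rounds to 0. A minimal sketch of a numerically safer variant follows. The name `stable_sigmoid` is hypothetical and not part of the original notebook.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that only exponentiates non-positive numbers (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in (0, 1], so it cannot overflow
    # For x >= 0 use 1 / (1 + e^-x); for x < 0 use the equivalent e^x / (1 + e^x)
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```

For the moderate values of `Z` that appear later in this notebook, both versions should return the same probabilities. The variant only matters for extreme inputs.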

**Now, create a train/test split (80/20) with `PAY_1` and `LIMIT_BAL` as features and `default payment next month` as the response variable. Use a random state of 24.**

In [8]:
# X: features, Y: target variable
X=df.loc[:,['PAY_1','LIMIT_BAL']]
Y=df.loc[:,['default payment next month']]

In [9]:
#Top rows of X
X.head()

Unnamed: 0,PAY_1,LIMIT_BAL
0,2,20000
1,-1,120000
2,0,90000
3,0,50000
4,-1,50000


In [10]:
#Top rows of Y
Y.head()

Unnamed: 0,default payment next month
0,1
1,1
2,0
3,0
4,0


In [11]:
# Create a train/test split
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=24)
X_train.head()

Unnamed: 0,PAY_1,LIMIT_BAL
6409,0,120000
4663,2,190000
13763,-1,420000
7542,2,30000
14518,0,120000
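
In the spirit of working without scikit-learn, an 80/20 split can also be sketched with NumPy alone. This is an illustration under assumptions: `simple_train_test_split` is a hypothetical helper, and because it uses a different random number generator it will not reproduce the exact rows that `train_test_split(..., random_state=24)` selects.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=24):
    """Shuffle row indices once, then slice off the test portion (hypothetical helper)."""
    X = np.asarray(X)
    y = np.asarray(y)
    n = X.shape[0]
    n_test = int(np.ceil(test_size * n))  # round the test count up
    idx = np.random.default_rng(seed).permutation(n)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```

With 26664 rows, rounding the test count up gives 5333 test rows and 21331 training rows, the same test-set size the notebook reports further down.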


______________________________________________________________________
**Next, import LogisticRegression, with the default options, but set the solver to `'liblinear'`.**

In [12]:
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression(solver='liblinear')

______________________________________________________________________
**Now, train on the training data and obtain predicted classes, as well as class probabilities, using the testing data.**

In [13]:
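# Fit the logistic regression model on training data.
# Y_train is a one-column DataFrame; flatten it to 1-D so that
# scikit-learn does not emit a DataConversionWarning.
clf.fit(X_train,Y_train.values.ravel())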


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
# Make predictions using `.predict()`
Y_pred=clf.predict(X_test)
Y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [15]:
# Find class probabilities using `.predict_proba()`
Y_pred_prob=clf.predict_proba(X_test)

In [16]:
Y_pred_prob

array([[0.74826924, 0.25173076],
       [0.584297  , 0.415703  ],
       [0.79604453, 0.20395547],
       ...,
       [0.584297  , 0.415703  ],
       [0.82721498, 0.17278502],
       [0.66393435, 0.33606565]])

______________________________________________________________________
**Then, pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.**

In [17]:
# Shape of the dataset
X.shape

(26664, 2)

In [18]:
Y.shape

(26664, 1)

In [19]:
# Add column of 1s to features using append
one=np.ones((X_test.shape[0],1),dtype=int)
X_test=np.append(X_test,one,axis=1)
print(X_test.shape)
X_test

(5333, 3)


array([[     2, 160000,      1],
       [     1,  50000,      1],
       [    -1, 200000,      1],
       ...,
       [    -1,  50000,      1],
       [     1, 230000,      1],
       [     2, 100000,      1]])

In [20]:
# Get coefficients and intercepts from trained model
coeffs=clf.coef_
intercept=clf.intercept_
coeffs,intercept

(array([[ 8.27451187e-11, -6.80876727e-06]]), array([-6.57647457e-11]))

In [21]:
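# Flatten the coefficients and append the intercept so that it
# lines up with the column of 1s added to X_test
coeffs=np.append(coeffs,intercept)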
coeffs

array([ 8.27451187e-11, -6.80876727e-06, -6.57647457e-11])

In [22]:
# Linear combination of features and coefficients (the log-odds) for each test sample
Z=np.dot(coeffs,X_test.T)
print(len(Z))
Z

5333


array([-1.08940276, -0.34043836, -1.36175345, ..., -0.34043836,
       -1.56601647, -0.68087673])
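
The column of 1s is a bookkeeping device: multiplying it by the intercept is the same as adding the intercept after the dot product. The sketch below checks this on the first three test rows, using the coefficient, intercept, and feature values copied from the outputs above. The variable names are illustrative.

```python
import numpy as np

# Values copied from the notebook output above
coef = np.array([8.27451187e-11, -6.80876727e-06])  # PAY_1, LIMIT_BAL
intercept = -6.57647457e-11
X_rows = np.array([[2, 160000], [1, 50000], [-1, 200000]])  # first three test rows

# Option 1: keep the intercept separate
z_separate = X_rows @ coef + intercept

# Option 2: append a column of 1s and fold the intercept into the coefficients
X_aug = np.column_stack([X_rows, np.ones(len(X_rows))])
z_augmented = X_aug @ np.append(coef, intercept)

print(z_separate)
print(np.allclose(z_separate, z_augmented))
```

Both options should reproduce the first three entries of `Z` shown above.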

In [23]:
# Manually calculate predicted probabilities
pred_proba=sigmoid(Z)
pred_proba

array([0.25173076, 0.415703  , 0.20395547, ..., 0.415703  , 0.17278502,
       0.33606565])
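
To illustrate the motivation stated at the start, here is a sketch of the same prediction using only Python's standard `math` module, with no NumPy or scikit-learn. The coefficients are copied from the fitted model's output above. The function name `predict_default_probability` is hypothetical.

```python
import math

# Coefficients and intercept copied from the fitted model's output above
COEF_PAY_1 = 8.27451187e-11
COEF_LIMIT_BAL = -6.80876727e-06
INTERCEPT = -6.57647457e-11

def predict_default_probability(pay_1, limit_bal):
    """Probability of the positive class for one sample (hypothetical helper)."""
    z = INTERCEPT + COEF_PAY_1 * pay_1 + COEF_LIMIT_BAL * limit_bal
    # math.exp can raise OverflowError for extremely negative z;
    # that is assumed not to occur for inputs on the scale seen in this data set
    return 1.0 / (1.0 + math.exp(-z))

# First test row shown above: PAY_1 = 2, LIMIT_BAL = 160000
p = predict_default_probability(2, 160000)
print(p)
```

The printed value should be close to the first manually calculated probability above.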

In [24]:
# Build a two-column list matching scikit-learn's layout:
# [P(class 0), P(class 1)] for each test sample
pred_prob=[]
for i in range(len(pred_proba)):
    pred_prob.append([1-pred_proba[i],pred_proba[i]])
pred_prob

[[0.7482692412207628, 0.2517307587792373],
 [0.584297002694049, 0.4157029973059509],
 [0.7960445320597713, 0.20395546794022865],
 ...

In [26]:
pred_prob=np.array(pred_prob)
pred_prob

array([[0.74826924, 0.25173076],
       [0.584297  , 0.415703  ],
       [0.79604453, 0.20395547],
       ...,
       [0.584297  , 0.415703  ],
       [0.82721498, 0.17278502],
       [0.66393435, 0.33606565]])
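
The loop and the later conversion to an array can be collapsed into one vectorized step. A minimal sketch, using the first three positive-class probabilities from the output above (the variable names are illustrative):

```python
import numpy as np

# Positive-class probabilities taken from the manual output above
p_positive = np.array([0.25173076, 0.415703, 0.20395547])

# Column 0: probability of class 0; column 1: probability of class 1
two_col = np.column_stack([1 - p_positive, p_positive])
print(two_col)
```

In the notebook itself, the equivalent call would be `np.column_stack([1 - pred_proba, pred_proba])`.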

______________________________________________________________________
**Next, using a threshold of `0.5`, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.**

In [27]:
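# Manually calculate predicted classes: column 0 holds P(class 0),
# so predict class 0 when it exceeds the 0.5 threshold, otherwise class 1
classes=[]
for i in range(len(pred_prob)):
    if pred_prob[i][0]>0.5:
        classes.append(0.0)
    else:
        classes.append(1.0)
classes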

[0.0,
 0.0,
 0.0,
 ...

In [28]:
# Compare to scikit-learn's predicted classes
Y_pred

array([0, 0, 0, ..., 0, 0, 0])
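
Comparing two long outputs by eye is error-prone, so a programmatic check is safer. The sketch below thresholds a vector of positive-class probabilities and compares the result element by element with a reference array. The arrays are small stand-ins: the first three probabilities come from the output above, the last two and the reference labels are made up for illustration.

```python
import numpy as np

# First three values from the notebook; last two are illustrative only
p_positive = np.array([0.25173076, 0.415703, 0.20395547, 0.62, 0.9])
reference_classes = np.array([0, 0, 0, 1, 1])  # stand-in for scikit-learn's Y_pred

# Class 1 when P(class 1) >= 0.5, matching the rule used in the loop above
manual_classes = (p_positive >= 0.5).astype(int)

classes_match = np.array_equal(manual_classes, reference_classes)
print(manual_classes, classes_match)
```

In the notebook, the same check would be `np.array_equal(np.array(classes), Y_pred)`.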

______________________________________________________________________
**Finally, calculate ROC AUC using both scikit-learn's predicted probabilities, and your manually predicted probabilities, and compare.**

In [29]:
from sklearn.metrics import roc_curve,roc_auc_score

In [30]:
# Use scikit-learn's predicted probabilities to calculate ROC AUC
ruc_score_skl=roc_auc_score(Y_test,Y_pred_prob[:,1])
ruc_score_skl

0.627207450280691

In [31]:
pred_prob[0],Y_pred_prob[0]

(array([0.74826924, 0.25173076]), array([0.74826924, 0.25173076]))

In [32]:
# Use manually calculated predicted probabilities to calculate ROC AUC
ruc_score_man=roc_auc_score(Y_test,pred_prob[:,1])
ruc_score_man

0.627207450280691
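
The ROC AUC itself can also be computed without scikit-learn. The sketch below uses the rank-based (Mann-Whitney) formulation, in which tied scores share their average rank. This is a swapped-in technique for illustration, not the method scikit-learn uses internally, and `roc_auc_manual` is a hypothetical name.

```python
import numpy as np

def roc_auc_manual(y_true, scores):
    """ROC AUC from average ranks of the scores (hypothetical helper)."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    # 1-based ranks of the scores; tied scores share the mean rank of their group
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)              # highest rank within each tie group
    avg_rank = last_rank - (counts - 1) / 2.0  # mean rank within each tie group
    ranks = avg_rank[inverse]
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    # Sum of positive ranks, minus its smallest possible value, scaled to [0, 1]
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

auc_demo = roc_auc_manual([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)
```

In the notebook, `roc_auc_manual(Y_test, pred_prob[:, 1])` would be the corresponding call, and it would be expected to agree with the scikit-learn value up to floating-point error.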