**Run the following two cells before you begin.**

In [46]:
%autosave 10

Autosaving every 10 seconds


In [47]:
import pandas as pd
import numpy as np

______________________________________________________________________
**First, import your data set and define the sigmoid function.**
<details>
    <summary>Hint:</summary>
    The definition of the sigmoid is $f(x) = \frac{1}{1 + e^{-x}}$.
</details>

In [48]:
# Import the data set
data = pd.read_csv('cleaned_data.csv')

In [None]:
data.isnull().sum()

In [50]:
# Define the sigmoid function
def sigmoid_fun(x):
    """Return the sigmoid 1 / (1 + exp(-x)); works elementwise on arrays."""
    return 1 / (1 + np.exp(-x))
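A quick sanity check can confirm that the function behaves like a sigmoid. The sketch below repeats the definition so that it runs on its own; the input values in `x_check` are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid_fun(x):
    return 1 / (1 + np.exp(-x))

# Arbitrary, increasing test inputs (illustrative values only)
x_check = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
s = sigmoid_fun(x_check)

at_zero = sigmoid_fun(0.0)                              # the sigmoid crosses 0.5 at x = 0
symmetric = np.allclose(sigmoid_fun(-x_check), 1 - s)   # f(-x) = 1 - f(x)
increasing = bool(np.all(np.diff(s) > 0))               # larger x gives a larger output

print(at_zero, symmetric, increasing)
```

All outputs lie strictly between 0 and 1, which is what allows them to be read as probabilities later on.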

**Now, create a train/test split (80/20) with `PAY_1` and `LIMIT_BAL` as features and `default payment next month` as the response variable. Use a random state of 24.**

In [178]:
# Create a train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data[['PAY_1', 'LIMIT_BAL']].values,
    data['default payment next month'].values,
    test_size=0.2, random_state=24)
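To make the idea of a seeded 80/20 split concrete, here is a minimal NumPy-only sketch on hypothetical toy data. It is not scikit-learn's implementation and will not reproduce the same shuffle as `train_test_split`; the arrays `X_toy` and `y_toy` and the rounding-up of the test size are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# A fixed seed makes the shuffle reproducible, like random_state does
rng = np.random.default_rng(24)
shuffled = rng.permutation(len(X_toy))

# Hold out 20% of the samples (rounded up) for testing
n_test = int(np.ceil(0.2 * len(X_toy)))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_train_toy, X_test_toy = X_toy[train_idx], X_toy[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]

print(X_train_toy.shape, X_test_toy.shape)
```

Every sample ends up in exactly one of the two sets, and rerunning with the same seed gives the same partition.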

______________________________________________________________________
**Next, import LogisticRegression, with the default options, but set the solver to `'liblinear'`.**

In [179]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')

______________________________________________________________________
**Now, train on the training data and obtain predicted classes, as well as class probabilities, using the testing data.**

In [180]:
# Fit the logistic regression model on training data
lr.fit(X_train,y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [181]:
# Make predictions using `.predict()`
pred = lr.predict(X_test)
pred

array([0, 0, 0, ..., 0, 0, 0])

In [139]:
# y_test is already a NumPy array, so no `.values` is needed
y_test

array([0, 0, 0, ..., 0, 0, 1])

In [182]:
# Find class probabilities using `.predict_proba()`
pred_prob = lr.predict_proba(X_test) 
pred_prob

array([[0.74826924, 0.25173076],
       [0.584297  , 0.415703  ],
       [0.79604453, 0.20395547],
       ...,
       [0.584297  , 0.415703  ],
       [0.82721498, 0.17278502],
       [0.66393435, 0.33606565]])
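Each row of the `predict_proba` output holds the probability of class 0 followed by the probability of class 1, so the two columns should sum to 1. The sketch below checks this on the first three rows printed above, and also shows that taking the more probable class per row gives the class prediction for these rows. The array `pred_prob_head` is a hand-copied excerpt, not the full output.

```python
import numpy as np

# First three rows of the predict_proba output, copied from above
pred_prob_head = np.array([[0.74826924, 0.25173076],
                           [0.584297,   0.415703],
                           [0.79604453, 0.20395547]])

row_sums = pred_prob_head.sum(axis=1)            # each row should sum to 1
pred_from_proba = pred_prob_head.argmax(axis=1)  # index of the more probable class

print(row_sums)
print(pred_from_proba)
```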

______________________________________________________________________
**Then, pull out the coefficients and intercept from the trained model and manually calculate predicted probabilities. You'll need to add a column of 1s to your features, to multiply by the intercept.**

In [201]:
# Add column of 1s to features
X_test_df = pd.DataFrame(X_test,columns=['PAY_1','LIMIT_BAL'],  index=None)
X_test_df['1s']=1

In [202]:
X_test_df

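| | PAY_1 | LIMIT_BAL | 1s |
|---|---|---|---|
| 0 | 2 | 160000 | 1 |
| 1 | 1 | 50000 | 1 |
| 2 | -1 | 200000 | 1 |
| 3 | 3 | 200000 | 1 |
| 4 | 1 | 50000 | 1 |
| ... | ... | ... | ... |
| 5328 | 0 | 140000 | 1 |
| 5329 | -1 | 50000 | 1 |
| 5330 | -1 | 50000 | 1 |
| 5331 | 1 | 230000 | 1 |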


In [194]:
# Get coefficients and intercepts from trained model
coef = lr.coef_
coef

array([[ 8.27451187e-11, -6.80876727e-06]])

In [195]:
intr = lr.intercept_
intr

array([-6.57647457e-11])

In [211]:
# Despite the column name, this is the linear combination (log-odds), not yet a probability
X_test_df['prob'] = (intr[0]*X_test_df['1s'] + coef[0][0]*X_test_df['PAY_1'] + coef[0][1]*X_test_df['LIMIT_BAL'])
X_test_df['prob']

0      -1.089403
1      -0.340438
2      -1.361753
3      -1.361753
4      -0.340438
          ...   
5328   -0.953227
5329   -0.340438
5330   -0.340438
5331   -1.566016
5332   -0.680877
Name: prob, Length: 5333, dtype: float64

In [212]:
# Manually calculate predicted probabilities
X_test_df['after_sigmoid'] = X_test_df['prob'].apply(sigmoid_fun)
X_test_df['after_sigmoid']

0       0.251731
1       0.415703
2       0.203955
3       0.203955
4       0.415703
          ...   
5328    0.278236
5329    0.415703
5330    0.415703
5331    0.172785
5332    0.336066
Name: after_sigmoid, Length: 5333, dtype: float64
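The same calculation can be written as a single matrix product, which is one way to see why the column of 1s is needed: it lets the intercept be treated as just another coefficient. The sketch below uses the coefficient and intercept values printed above and the first three test rows; `X_head`, `X_aug`, and `theta` are names introduced here for illustration.

```python
import numpy as np

# Values copied from the lr.coef_ and lr.intercept_ outputs above
coef = np.array([8.27451187e-11, -6.80876727e-06])
intercept = -6.57647457e-11

# First three test rows shown above; columns are PAY_1, LIMIT_BAL
X_head = np.array([[2, 160000],
                   [1, 50000],
                   [-1, 200000]], dtype=float)

# Append a column of 1s so the intercept becomes one more coefficient
X_aug = np.hstack([X_head, np.ones((X_head.shape[0], 1))])
theta = np.append(coef, intercept)

log_odds = X_aug @ theta                    # linear combination per row
manual_prob = 1 / (1 + np.exp(-log_odds))   # sigmoid turns log-odds into probabilities

print(manual_prob)
```

The results agree, to the printed precision, with the second column of the `predict_proba` output for the same rows.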

______________________________________________________________________
**Next, using a threshold of `0.5`, manually calculate predicted classes. Compare this to the class predictions output by scikit-learn.**

In [215]:
# Manually calculate predicted classes
X_test_df['bool']=X_test_df['after_sigmoid']>0.5
arr_pred = X_test_df['bool'].astype(int).values  # cast True/False to 1/0 to match scikit-learn's integer labels

In [217]:
# Compare to scikit-learn's predicted classes
np.array_equal(pred, arr_pred)

True
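Because the sigmoid equals 0.5 exactly when its input is 0, thresholding the probability at 0.5 is equivalent to checking whether the linear combination is positive. The sketch below illustrates this on the first three values of the `prob` column printed earlier; `log_odds_head` is a hand-copied excerpt.

```python
import numpy as np

# First three values of the linear combination, copied from the 'prob' column above
log_odds_head = np.array([-1.089403, -0.340438, -1.361753])
prob_head = 1 / (1 + np.exp(-log_odds_head))

from_prob = (prob_head > 0.5).astype(int)        # threshold on the probability
from_log_odds = (log_odds_head > 0).astype(int)  # equivalent threshold on the log-odds

print(from_prob, from_log_odds)
```

All three log-odds values are negative, so these rows are assigned class 0, consistent with the first entries of `pred`.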

______________________________________________________________________
**Finally, calculate ROC AUC using both scikit-learn's predicted probabilities and your manually calculated probabilities, and compare the results.**

In [219]:
# Use scikit-learn's predicted probabilities to calculate ROC AUC
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, pred_prob[:,1])

0.627207450280691

In [220]:
# Use manually calculated predicted probabilities to calculate ROC AUC
roc_auc_score(y_test, X_test_df['after_sigmoid'])

0.627207450280691
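The two scores match because the manual probabilities reproduce scikit-learn's. More generally, ROC AUC depends only on how the scores rank positives against negatives: it equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The function `pairwise_auc` below is a hypothetical helper written for illustration on toy data, not scikit-learn's implementation.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy labels and scores (illustrative values only)
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

auc_toy = pairwise_auc(y_toy, scores_toy)
auc_scaled = pairwise_auc(y_toy, 10 * np.array(scores_toy))  # same ordering, same AUC

print(auc_toy, auc_scaled)
```

The pairwise comparison builds an array of size (number of positives) × (number of negatives), so it is meant for small examples rather than large data sets.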