### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [14]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('./fraud_dataset.csv')
df.head()

,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


`1.` As you can see, two columns need to be converted to dummy variables: `day` and `fraud`.  Replace each of these columns with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.
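As a minimal sketch of this encoding, the example below uses a small made-up frame (the name `toy` and its values are hypothetical, chosen only to mirror the `day` and `fraud` columns) and a vectorized comparison rather than a row-wise function. It assumes `fraud` holds booleans and `day` holds the strings shown in the preview above.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    'day': ['weekend', 'weekday', 'weekday'],
    'fraud': [False, True, False],
})

# The comparison yields booleans; astype(int) maps True -> 1 and False -> 0.
toy['day'] = (toy['day'] == 'weekday').astype(int)
toy['fraud'] = toy['fraud'].astype(int)

print(toy)
```

Either this approach or a row-wise `apply` produces the same 0/1 columns; the vectorized form simply avoids a Python-level loop.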

In [15]:
df[df.fraud==True]

,transaction_id,duration,day,fraud
15,32057,4.909117,weekday,True
80,33212,3.931617,weekday,True
134,24194,3.273424,weekend,True
179,29647,2.433195,weekend,True
193,33493,3.679496,weekday,True
...,...,...,...,...
8489,74766,7.426604,weekday,True
8524,19583,6.960134,weekday,True
8547,20596,2.935236,weekday,True
8568,21571,3.627393,weekday,True


In [18]:
# Encode fraud as 1 for True and 0 for False
df['fraud'] = df.fraud.apply(lambda x: 1 if x else 0)
# Encode day as 1 for 'weekday' and 0 otherwise
df['day'] = df.day.apply(lambda x: 1 if x == 'weekday' else 0)
df.sample(10)

,transaction_id,duration,day,fraud
312,90129,28.763659,1,0
1329,62790,25.228503,1,0
999,62590,31.974642,0,0
5089,68936,29.095207,1,0
1700,32465,30.514439,0,0
728,80822,29.208313,0,0
5887,48214,39.234274,1,0
2362,22029,36.193698,0,0
1456,89636,38.471116,0,0
336,73563,23.493622,0,0


In [22]:
(df.fraud == 1).sum() / df.shape[0]

0.012168770612987604

In [23]:
(df.day == 1).sum() / df.shape[0]

0.3452746502900034

In [24]:
df[['duration', 'fraud']].groupby('fraud').mean()

fraud,duration
0,30.013583
1,4.624247


`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraudulent using both `day` and `duration`.  Don't forget an intercept!  Use the second quiz below to ensure you fit the model correctly.
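For reference, the model described here can be written in the standard logistic regression form below. The symbols $p$ and $\beta_0, \beta_1, \beta_2$ are notation introduced for this sketch and do not appear in the notebook.

$$
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{duration} + \beta_2 \cdot \text{day}
$$

Here $p$ is the probability that `fraud` equals 1 and $\beta_0$ is the intercept. `sm.Logit` does not add an intercept on its own, which is why a constant column of 1s has to be included among the predictors.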

In [20]:
# Logit does not add an intercept automatically, so add a constant column
df['intercept'] = 1

y = df.fraud
X = df[['intercept', 'duration', 'day']]

In [21]:
lm = sm.Logit(y, X)
res = lm.fit()
res.summary()

Optimization terminated successfully.
         Current function value: 0.002411
         Iterations 16


Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Wed, 07 Oct 2020",Pseudo R-squ.:,0.9633
Time:,18:30:15,Log-Likelihood:,-21.2
converged:,True,LL-Null:,-578.1
Covariance Type:,nonrobust,LLR p-value:,1.39e-242

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894
day,2.5465,0.904,2.816,0.005,0.774,4.319
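
One common way to read these coefficients is to exponentiate them, which turns each one into a multiplicative change in the odds of fraud. The sketch below copies the coefficient values from the summary table into a plain dictionary so that it runs on its own; the names `coefs`, `odds_ratios`, and `predicted_probability` are hypothetical helpers, not part of the notebook. With the fitted results object available, `np.exp(res.params)` should give the same odds ratios.

```python
import numpy as np

# Coefficients copied from the summary table above.
coefs = {'intercept': 9.8709, 'duration': -1.4637, 'day': 2.5465}

# Exponentiating a logit coefficient gives the multiplicative change in the
# odds for a one-unit increase in that predictor, holding the other fixed.
odds_ratios = {name: np.exp(b) for name, b in coefs.items() if name != 'intercept'}
print(odds_ratios)

# For a negative coefficient the reciprocal is often easier to read:
# the change in odds for a one-unit *decrease* in duration.
print(1 / odds_ratios['duration'])


def predicted_probability(duration, day):
    """Apply the logistic (sigmoid) function to the linear predictor."""
    z = coefs['intercept'] + coefs['duration'] * duration + coefs['day'] * day
    return 1 / (1 + np.exp(-z))


# Example: a short weekday transaction versus a long weekend one.
print(predicted_probability(duration=4.6, day=1))
print(predicted_probability(duration=30.0, day=0))
```

Under this reading, each one-unit decrease in `duration` multiplies the odds of fraud by roughly 4.32, holding `day` constant, and the odds of fraud on a weekday are roughly 12.76 times those on a weekend, holding `duration` constant.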
