## Logistic regression

<img src="logistic.JPG">

# Example 1

Fitting Logistic Regression
In this first notebook, you will be fitting a logistic regression model to a dataset where we would like to predict if a transaction is fraud or not.

To get started let's read in the libraries and take a quick look at the dataset.

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('fraud_dataset.csv')
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


In [4]:
 df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8793 entries, 0 to 8792
Data columns (total 4 columns):
transaction_id    8793 non-null int64
duration          8793 non-null float64
day               8793 non-null object
fraud             8793 non-null bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 214.8+ KB


`1.` As you can see, there are two columns that need to be changed to dummy variables.  Replace each of the current columns to the dummy version.  Use the 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [14]:
df['weekday'] = pd.get_dummies(df['day'])['weekday']
df[['not_fraud','fraud']] = pd.get_dummies(df['fraud'])
df = df.drop('not_fraud', axis=1)

In [21]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,weekday,intercept
0,28891,21.3026,weekend,0,0,1
1,61629,22.932765,weekend,0,0,1
2,53707,32.694992,weekday,0,1,1
3,47812,32.784252,weekend,0,0,1
4,43455,17.756828,weekend,0,0,1


`2.` Now that you have dummy variables, fit a logistic regression model to predict if a transaction is fraud using both day and duration.  Don't forget an intercept!  Use the second quiz below to assure you fit the model correctly.

In [23]:
df['intercept']=1
lm = sm.OLS(df['fraud'],df[['intercept','weekday','duration']])
result = lm.fit()
result.summary()

0,1,2,3
Dep. Variable:,fraud,R-squared:,0.145
Model:,OLS,Adj. R-squared:,0.145
Method:,Least Squares,F-statistic:,747.9
Date:,"Fri, 27 Mar 2020",Prob (F-statistic):,1.13e-300
Time:,22:34:24,Log-Likelihood:,7651.6
No. Observations:,8793,AIC:,-15300.0
Df Residuals:,8790,BIC:,-15280.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,0.1674,0.005,36.944,0.000,0.159,0.176
weekday,0.0184,0.002,8.071,0.000,0.014,0.023
duration,-0.0054,0.000,-37.539,0.000,-0.006,-0.005

0,1,2,3
Omnibus:,10942.218,Durbin-Watson:,1.941
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1208518.066
Skew:,6.989,Prob(JB):,0.0
Kurtosis:,58.706,Cond. No.,129.0


# Example 2

### Interpreting Results of Logistic Regression

In this notebook (and quizzes), you will be getting some practice with interpreting the coefficients in logistic regression.  Using what you saw in the previous video should be helpful in assisting with this notebook.

The dataset contains four variables: `admit`, `gre`, `gpa`, and `prestige`:

* `admit` is a binary variable. It indicates whether or not a candidate was admitted into UCLA (admit = 1) our not (admit = 0).
* `gre` is the GRE score. GRE stands for Graduate Record Examination.
* `gpa` stands for Grade Point Average.
* `prestige` is the prestige of an applicant alta mater (the school attended before applying), with 1 being the highest (highest prestige) and 4 as the lowest (not prestigious).

To start, let's read in the necessary libraries and data.

In [24]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("./admissions.csv")
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


There are a few different ways you might choose to work with the `prestige` column in this dataset.  For this dataset, we will want to allow for the change from prestige 1 to prestige 2 to allow a different acceptance rate than changing from prestige 3 to prestige 4.

1. With the above idea in place, create the dummy variables needed to change prestige to a categorical variable, rather than quantitative, then answer quiz 1 below.

In [30]:
df[['prestige1','prestige2','prestige3','prestige4']] = pd.get_dummies(df['prestige'])
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prestige1,prestige2,prestige3,pretige4,prestige4
0,0,380,3.61,3,0,0,1,0,0
1,1,660,3.67,3,0,0,1,0,0
2,1,800,4.0,1,1,0,0,0,0
3,1,640,3.19,4,0,0,0,1,1
4,0,520,2.93,4,0,0,0,1,1


`2.` Now, fit a logistic regression model to predict if an individual is admitted using `gre`, `gpa`, and `prestige` with a baseline of the prestige value of `1`.  Use the results to answer quiz 2 and 3 below.  Don't forget an intercept.

In [34]:
df['intercept'] =1 
lm = sm.Logit(df['admit'],df[['gre','gpa','prestige2','prestige3','prestige4']])
result = lm.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,392.0
Method:,MLE,Df Model:,4.0
Date:,"Fri, 27 Mar 2020",Pseudo R-squ.:,0.05722
Time:,22:45:24,Log-Likelihood:,-233.88
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.039e-05

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
gre,0.0014,0.001,1.308,0.191,-0.001,0.003
gpa,-0.1323,0.195,-0.680,0.497,-0.514,0.249
prestige2,-0.9562,0.302,-3.171,0.002,-1.547,-0.365
prestige3,-1.5375,0.332,-4.627,0.000,-2.189,-0.886
prestige4,-1.8699,0.401,-4.658,0.000,-2.657,-1.083
