# Logistic Regression

This notebook fits a logistic regression model to a dataset to predict whether a transaction is fraudulent.

In [29]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('../data/fraud_dataset.csv')
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


`1.` Create dummy variables for the `day` and `fraud` columns.

In [30]:
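# Indicator column: 1 for weekday transactions, 0 for weekend
df['weekday'] = pd.get_dummies(df['day'], dtype=int)['weekday']
# Encode fraud as 0/1; get_dummies orders its columns as (False, True)
df[['not_fraud', 'fraud']] = pd.get_dummies(df['fraud'], dtype=int)
df = df.drop('not_fraud', axis=1)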
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,weekday
0,28891,21.3026,weekend,0,0
1,61629,22.932765,weekend,0,0
2,53707,32.694992,weekday,0,1
3,47812,32.784252,weekend,0,0
4,43455,17.756828,weekend,0,0


`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraud using both `day` and `duration`. Don't forget an intercept! Use the second quiz below to check that you fit the model correctly.

In [32]:
# Add intercept and fit logistic regression model
df['intercept'] = 1

logit_mod = sm.Logit(df['fraud'], df[['intercept', 'weekday', 'duration']])
results = logit_mod.fit()
results.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))
  return 1 - self.llf/self.llnull
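
The three source lines echoed above appear to be the statsmodels statements that triggered runtime warnings; the warning text itself did not survive in this output. One plausible reading, offered as an assumption rather than a diagnosis, is that the naive sigmoid saturated to exactly 0 for some rows, so the log-likelihood became `-inf` and the reported function value `inf`. The sketch below reproduces that mechanism on made-up values of the linear predictor and contrasts it with a numerically stable form based on `np.logaddexp`.

```python
import numpy as np
from scipy.special import expit

# Made-up linear predictor values (X @ params), chosen to be extreme
z = np.array([-800.0, 0.0, 800.0])

# Naive sigmoid: exp(800) overflows to inf, so the first probability is exactly 0
with np.errstate(over="ignore", divide="ignore"):
    naive_p = 1 / (1 + np.exp(-z))
    naive_loglike = np.sum(np.log(naive_p))  # log(0) = -inf

# expit avoids the overflow warning but still underflows to 0 here,
# so the log is taken directly: log(sigmoid(z)) = -logaddexp(0, -z)
stable_p = expit(z)
stable_loglike = np.sum(-np.logaddexp(0.0, -z))

print(naive_loglike)   # -inf
print(stable_loglike)  # finite
```

Rescaling a wide-ranging predictor such as `duration` before fitting is another common way to keep the linear predictor in a safe range; whether it helps here is untested.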


AttributeError: module 'scipy.stats' has no attribute 'chisqprob'
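
This `AttributeError` comes from building the summary table, not from the fit itself: `chisqprob` was removed from newer SciPy releases, while older statsmodels releases still call it. Upgrading statsmodels is the cleaner fix. Where that is not possible, a commonly used workaround is to restore the function before calling `results.summary()`. A minimal sketch, assuming an old statsmodels installed alongside a newer SciPy:

```python
from scipy import stats

# Restore the removed helper only if it is missing; chi2.sf is the
# survival function (1 - CDF), which is what chisqprob used to return
if not hasattr(stats, "chisqprob"):
    stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

# Sanity check: 3.841 is roughly the 95th percentile of chi-square with 1 df
p_value = stats.chisqprob(3.841, 1)
print(p_value)
```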

## Interpreting Results of Logistic Regression

The dataset contains four variables: `admit`, `gre`, `gpa`, and `prestige`:

* `admit` is a binary variable. It indicates whether a candidate was admitted into UCLA (admit = 1) or not (admit = 0).
* `gre` is the GRE score. GRE stands for Graduate Record Examination.
* `gpa` stands for Grade Point Average.
* `prestige` is the prestige of an applicant's alma mater (the school attended before applying), with 1 being the highest prestige and 4 the lowest.

In [39]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("../data/admissions.csv")
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [40]:
# create dummy prestige variable
prestige_dummies = pd.get_dummies(df['prestige'], prefix='prestige')
df = df.join(prestige_dummies)
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380,3.61,3,0,0,1,0
1,1,660,3.67,3,0,0,1,0
2,1,800,4.0,1,1,0,0,0
3,1,640,3.19,4,0,0,0,1
4,0,520,2.93,4,0,0,0,1


`2.` Now, fit a logistic regression model to predict whether an individual is admitted using `gre`, `gpa`, and `prestige`, with the prestige value of `1` as the baseline. Use the results to answer quizzes 2 and 3 below. Don't forget an intercept.
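
Using prestige `1` as the baseline means leaving its dummy column out of the model, so each remaining prestige coefficient is read relative to prestige 1 and the dummies are not perfectly collinear with the intercept. The cell below does this by listing the columns by hand. As a hedged alternative, `pd.get_dummies` can drop the first level itself via `drop_first=True`. The toy frame here is made up for illustration:

```python
import pandas as pd

# Made-up prestige values, not the admissions data
toy = pd.DataFrame({"prestige": [3, 3, 1, 4, 2]})

# drop_first=True omits the lowest level (prestige_1), making it the baseline
dummies = pd.get_dummies(toy["prestige"], prefix="prestige",
                         drop_first=True, dtype=int)
print(list(dummies.columns))
```

A row with prestige 1 then has zeros in all three remaining columns.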

In [44]:
df['intercept'] = 1

# fit logistic model using prestige value of 1 as base
log_mod = sm.Logit(df['admit'], df[['intercept', 'gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']])
results = log_mod.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


AttributeError: module 'scipy.stats' has no attribute 'chisqprob'

In [22]:
np.exp(results.params)

intercept    0.020716
gre          1.002221
gpa          2.180027
prest_2      0.506548
prest_3      0.262192
prest_4      0.211525
dtype: float64

In [23]:
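# Reciprocals of the odds ratios, which are easier to read when a ratio is below 1
1 / np.exp(results.params)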

intercept    48.272116
gre           0.997784
gpa           0.458710
prest_2       1.974147
prest_3       3.813995
prest_4       4.727566
dtype: float64
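
Exponentiated coefficients are odds ratios: the multiplicative change in the odds of admission for a one-unit increase in a predictor, holding the others fixed. Ratios below 1 are often easier to read as reciprocals, which is what the second cell computes. The sketch below works through three readings using numbers copied from the output above; the 10-point GRE step is an illustrative assumption.

```python
# Odds ratios copied from the np.exp(results.params) output above
or_gre, or_gpa, or_prest_2 = 1.002221, 2.180027, 0.506548

# One extra GPA point multiplies the odds of admission by about 2.18
gpa_effect = or_gpa

# Odds ratios compound multiplicatively: effect of a 10-point GRE increase
gre_10_points = or_gre ** 10

# Reciprocal: odds for prestige 1 (the baseline) relative to prestige 2
baseline_vs_prest_2 = 1 / or_prest_2

print(round(gre_10_points, 4), round(baseline_vs_prest_2, 4))
```

Read this way, an applicant from a prestige 1 school has roughly twice the odds of admission of an otherwise similar applicant from a prestige 2 school.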

In [24]:
df.groupby('prestige')['admit'].mean()

prestige
1    0.540984
2    0.358108
3    0.231405
4    0.179104
Name: admit, dtype: float64
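
These group means are raw admission rates, so they give unadjusted odds that can be set beside the model's adjusted odds ratios. A rate p converts to odds as p / (1 - p). A short sketch with the rates for prestige 1 and 2 copied from the output above:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Admission rates copied from the groupby output above
rate_prestige_1, rate_prestige_2 = 0.540984, 0.358108

# Unadjusted odds ratio of prestige 2 relative to prestige 1
unadjusted_or = odds(rate_prestige_2) / odds(rate_prestige_1)
print(round(unadjusted_or, 3))
```

The result is roughly 0.47, close to but not identical to the fitted odds ratio of 0.506548, since the model also adjusts for `gre` and `gpa`.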

## Model Diagnostics in Python

In [45]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(42)

df = pd.read_csv('../data/admissions.csv')
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


`1.` Change prestige to dummy variable columns that are added to `df`. Then divide data into training and test data.

In [46]:
# One dummy column per prestige level; prest_1 is dropped below as the baseline
df[['prest_1', 'prest_2', 'prest_3', 'prest_4']] = pd.get_dummies(df['prestige'])
X = df.drop(['admit', 'prestige', 'prest_1'], axis=1)
y = df['admit']
# Hold out 20% of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

`2.` Use sklearn's `LogisticRegression` to fit a logistic model using `gre`, `gpa`, and three of the `prestige` dummy variables.

In [47]:
log_mod = LogisticRegression()
log_mod.fit(X_train, y_train)
preds = log_mod.predict(X_test)
confusion_matrix(y_test, preds) 

array([[56,  0],
       [22,  2]])

`3.` Compute additional metrics to evaluate the model.

In [4]:
precision_score(y_test, preds) 

1.0

In [5]:
recall_score(y_test, preds)

0.083333333333333329

In [6]:
accuracy_score(y_test, preds)

0.72499999999999998
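
All three scores follow from the confusion matrix shown earlier, whose rows are actual labels and whose columns are predicted labels in sklearn's convention. The sketch below recomputes them by hand from those four counts:

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, cols = predicted
cm = np.array([[56, 0],
               [22, 2]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of predicted admits, the share that is correct
recall = tp / (tp + fn)           # of actual admits, the share that is found
accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct

print(precision, recall, accuracy)
```

The model predicts "admit" for only 2 of the 80 test rows, which is why precision is perfect while recall is very low.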

In [48]:
### Unless you install the ggplot library in the workspace, you will 
### get an error when running this code!

from ggplot import *
from sklearn.metrics import roc_curve, auc
%matplotlib inline

preds = log_mod.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, preds)

df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
ggplot(df, aes(x='fpr', y='tpr')) +\
    geom_line() +\
    geom_abline(linetype='dashed')

ModuleNotFoundError: No module named 'ggplot'
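
Because the `ggplot` package is not installed here, the ROC curve was never drawn. As a hedged alternative, the same plot can be sketched with matplotlib, assuming it is available in the workspace. The helper name `plot_roc` and the toy labels and scores are made up for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc


def plot_roc(y_true, y_score):
    """Draw an ROC curve with a dashed chance line; return (figure, AUC)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.2f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    return fig, roc_auc


# Toy labels and predicted probabilities, not the admissions data
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
fig, roc_auc = plot_roc(y_toy, scores_toy)
print(roc_auc)
```

With the objects from this notebook, the call would be `plot_roc(y_test, log_mod.predict_proba(X_test)[:, 1])`.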