## Fitting logistic regression
In linear regression your predicted response variable was not constrained at all. With the right inputs it was possible to predict a house with a negative price or an individual with a height of 80 feet.

Using logistic regression (as opposed to linear regression) bounds our response to a probability between zero and one.

We need to recode labels such as fraud (1) or not (0), success (1) or failure (0), and any other two-class response so that the outcome takes only the values one and zero.

Logistic regression is then set up so that our linear model predicts the log odds instead of predicting the response itself:

$\log\left(\frac{p}{1-p}\right)=b_0+b_1x_1+b_2x_2+...$

> The odds are the ratio of the probability of an event occurring ($p$, the probability of a one value) to the probability of the event not occurring ($1-p$);
> Modeling the log of the odds keeps our predicted probabilities between 0 and 1.
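
To make the difference between a probability and its odds concrete, here is a minimal sketch using only the standard library. The helper names `odds` and `log_odds` are made up for this illustration and are not part of the notebook.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (the left-hand side of the equation above)."""
    return math.log(odds(p))

# Probabilities live in (0, 1); odds live in (0, inf); log odds can be any real number.
for p in (0.1, 0.5, 0.9):
    print(p, round(odds(p), 4), round(log_odds(p), 4))
```

Because the log odds are unbounded, a linear model can predict them without ever producing an impossible probability.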

Rearranging the log-odds equation solves for the probability directly:

$p=\frac{e^{b_0+b_1x_1+b_2x_2+...}}{1+e^{b_0+b_1x_1+b_2x_2+...}}$

The **sigmoid function** takes the value of the linear predictor ($b_0+b_1x_1+b_2x_2+...$) and maps it to a probability between zero and one.
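
As a small sketch of that mapping (the function names `sigmoid` and `logit` are chosen here for illustration), note that $\frac{e^{z}}{1+e^{z}}$ is the same as $\frac{1}{1+e^{-z}}$, which is the form used below:

```python
import math

def sigmoid(z):
    """Map a linear-predictor value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the log odds of probability p."""
    return math.log(p / (1 - p))

# Large negative z gives a probability near 0, large positive z near 1.
for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    print(z, round(p, 4), round(logit(p), 4))
```

Applying `logit` to the output of `sigmoid` recovers the original `z`, which is just the log-odds equation read in the other direction.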

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df=pd.read_csv('fraud_dataset.csv')
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


In [2]:
df.describe()

Unnamed: 0,transaction_id,duration
count,8793.0,8793.0
mean,55243.38451,29.704626
std,21792.120147,7.464452
min,17301.0,0.215113
25%,36454.0,25.211787
50%,55420.0,29.92316
75%,74131.0,34.532567
max,92828.0,60.412763


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8793 entries, 0 to 8792
Data columns (total 4 columns):
transaction_id    8793 non-null int64
duration          8793 non-null float64
day               8793 non-null object
fraud             8793 non-null bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 214.8+ KB


In [4]:
# Create dummy variables
df[['no_fraud', 'fraud']] = pd.get_dummies(df['fraud'])
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,no_fraud
0,28891,21.3026,weekend,0,1
1,61629,22.932765,weekend,0,1
2,53707,32.694992,weekday,0,1
3,47812,32.784252,weekend,0,1
4,43455,17.756828,weekend,0,1


In [5]:
df = df.drop('no_fraud', axis=1)

In [6]:
df['intercept'] = 1
logit_mod = sm.Logit(df['fraud'], df[['intercept']])
results = logit_mod.fit()
results.summary()

  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


Optimization terminated successfully.
         Current function value: inf
         Iterations 8




0,1,2,3
Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8792.0
Method:,MLE,Df Model:,0.0
Date:,"Thu, 17 Jan 2019",Pseudo R-squ.:,inf
Time:,17:52:22,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
,,LLR p-value:,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-4.3966,0.097,-45.202,0.000,-4.587,-4.206


`1.` As you can see, there are two columns that need to be changed to dummy variables.  Replace each of these columns with its dummy version.  Use 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [8]:
df = pd.read_csv('./fraud_dataset.csv')
df['day'] = pd.get_dummies(df['day'])['weekday']
df['fraud'] = pd.get_dummies(df['fraud'])[True]
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,0,0
1,61629,22.932765,0,0
2,53707,32.694992,1,0
3,47812,32.784252,0,0
4,43455,17.756828,0,0


In [9]:
df['intercept'] = 1
lr = sm.Logit(df['fraud'], df[['intercept', 'day', 'duration']])
results = lr.fit()
results.summary()

  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


Optimization terminated successfully.
         Current function value: inf
         Iterations 16




0,1,2,3
Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 17 Jan 2019",Pseudo R-squ.:,inf
Time:,18:04:24,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
,,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
day,2.5465,0.904,2.816,0.005,0.774,4.319
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894


> On weekdays, the odds of fraud are 12.76 times those on weekends, holding duration constant.

> For each one-minute decrease in the time spent on the transaction, the odds of fraud are multiplied by 4.32, holding the day of the week constant.

In [10]:
df['fraud'].mean()

0.012168770612987604
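
This mean ties back to the intercept-only model fitted earlier: with no explanatory variables, the fitted intercept is the log odds of the overall fraud rate. A quick check, with both numbers copied by hand from the outputs above (the variable names are made up for this sketch):

```python
import math

intercept = -4.3966                 # coef from the intercept-only summary above
fraud_rate = 0.012168770612987604   # df['fraud'].mean() from the cell above

# Push the intercept through the sigmoid to get the implied probability.
implied_rate = 1.0 / (1.0 + math.exp(-intercept))
print(round(implied_rate, 5), round(fraud_rate, 5))
```

The two values agree up to the rounding of the printed coefficient.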

In [11]:
df['duration'].mean()

29.704626129619825

In [12]:
df['day'].mean()

0.3452746502900034

In [13]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,intercept
0,28891,21.3026,0,0,1
1,61629,22.932765,0,0,1
2,53707,32.694992,1,0,1
3,47812,32.784252,0,0,1
4,43455,17.756828,0,0,1


In [14]:
df.query('fraud==0')['duration'].mean()

30.013583132522584

In [15]:
df.query('fraud==1')['duration'].mean()

4.624247370615658

## Interpreting results

### Part 1

* As in the linear regression case, we can use the p-values to judge whether a particular variable is significant in predicting whether a transaction is fraudulent;
    * The p-values suggest that both duration and weekday are **statistically significant** in predicting whether a transaction is fraudulent.
* The **coef** values are not interpreted as in linear regression: we need to exponentiate each coefficient first. For a quantitative variable, we then say that for each one-unit increase in the explanatory variable ($x_1$), we expect the odds of being in the one category to change by a multiplicative factor of $e^{b_1}$, holding all other variables constant;
* We encode dummy variables as in multiple linear regression;
    * Interpreting a dummy variable coefficient again only changes by exponentiating the coefficient;
    * For this category ($x_1$) as compared to the baseline, we expect the odds of being in the one category to change by a multiplicative factor of $e^{b_1}$, holding all other variables constant.
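
The reason exponentiating works can be read off the log-odds equation. As a short derivation with the same symbols, write the odds as $\frac{p}{1-p}=e^{b_0+b_1x_1+b_2x_2+...}$ and compare the odds at $x_1+1$ with the odds at $x_1$, keeping every other variable fixed:

$$
\frac{\text{odds}(x_1+1)}{\text{odds}(x_1)}
=\frac{e^{b_0+b_1(x_1+1)+b_2x_2+...}}{e^{b_0+b_1x_1+b_2x_2+...}}
=e^{b_1}
$$

So a one-unit increase in $x_1$ multiplies the odds by $e^{b_1}$ regardless of the starting value of $x_1$. The probability itself does not change by a constant factor, which is why the interpretation is stated in terms of odds.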

### Interpreting results in Python

* In most cases we don't care about the intercept; we only care about the coefficients attached to the explanatory variables;
* We first need to exponentiate each coefficient;
* Each exponentiated coefficient is the multiplicative change in the odds;
    * On weekdays, the odds of fraud are 12.76 times those on weekends, holding all else constant;
    * For each one-unit increase in the duration on the page, the odds of fraud are multiplied by 0.23, holding all else constant;
    > When a multiplicative change is less than one (as with duration), it's often useful to compute the reciprocal (this changes the direction from a unit increase to a unit decrease).
    * For each one-unit decrease in the duration on the page, the odds of fraud are multiplied by 4.32, holding all else constant.
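
To see that these factors apply to the odds rather than to the probability, the sketch below plugs the coefficients from the summary table into the sigmoid by hand. The duration of 7.0 is an arbitrary value picked for illustration, and the function names are made up here.

```python
import math

# Coefficients copied from the summary table of the fraud model.
b_intercept, b_day, b_duration = 9.8709, 2.5465, -1.4637

def predicted_probability(day, duration):
    """Probability of fraud implied by the fitted coefficients."""
    z = b_intercept + b_day * day + b_duration * duration
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1 - p)

# Same (hypothetical) duration, weekend versus weekday.
p_weekend = predicted_probability(day=0, duration=7.0)
p_weekday = predicted_probability(day=1, duration=7.0)

odds_ratio = odds(p_weekday) / odds(p_weekend)   # equals exp(b_day)
prob_ratio = p_weekday / p_weekend               # depends on the duration chosen
print(round(odds_ratio, 2), round(prob_ratio, 2))
```

The odds ratio matches `np.exp(2.5465)` for any duration, while the ratio of probabilities changes with the duration that is held fixed.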


In [16]:
results.summary()

0,1,2,3
Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 17 Jan 2019",Pseudo R-squ.:,inf
Time:,19:34:31,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
,,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
day,2.5465,0.904,2.816,0.005,0.774,4.319
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894


In [17]:
np.exp(2.5465), np.exp(-1.4637)

(12.762357271496972, 0.2313785882117941)

In [19]:
1/np.exp(-1.4637) # reciprocal

4.321921089278333

The dataset contains four variables: `admit`, `gre`, `gpa`, and `prestige`:

* `admit` is a binary variable. It indicates whether a candidate was admitted into UCLA (admit = 1) or not (admit = 0).
* `gre` is the GRE score. GRE stands for Graduate Record Examination.
* `gpa` stands for Grade Point Average.
* `prestige` is the prestige of an applicant's alma mater (the school attended before applying), with 1 being the highest (most prestigious) and 4 the lowest (not prestigious).

In [43]:
df = pd.read_csv("./admissions.csv")
df_cp = df.copy()
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


There are a few different ways you might choose to work with the `prestige` column in this dataset.  For this dataset, we want the change from prestige 1 to prestige 2 to be able to affect the acceptance rate differently than the change from prestige 3 to prestige 4.

1. With the above idea in place, create the dummy variables needed to change prestige to a categorical variable, rather than quantitative, then answer quiz 1 below.

In [44]:
df[['prest_1', 'prest_2', 'prest_3', 'prest_4']] = pd.get_dummies(df['prestige'])
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prest_1,prest_2,prest_3,prest_4
0,0,380,3.61,3,0,0,1,0
1,1,660,3.67,3,0,0,1,0
2,1,800,4.0,1,1,0,0,0
3,1,640,3.19,4,0,0,0,1
4,0,520,2.93,4,0,0,0,1


`2.` Now, fit a logistic regression model to predict if an individual is admitted using `gre`, `gpa`, and `prestige` with a baseline of the prestige value of `1`.  Use the results to answer quiz 2 and 3 below.  Don't forget an intercept.

In [72]:
df['intercept'] = 1
lr = sm.Logit(df['admit'], df[['intercept', 'gre', 'gpa', 'prest_2', 'prest_3', 'prest_4']])
results = lr.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Thu, 17 Jan 2019",Pseudo R-squ.:,0.08166
Time:,20:28:15,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-3.8769,1.142,-3.393,0.001,-6.116,-1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
prest_2,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
prest_3,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
prest_4,-1.5534,0.417,-3.721,0.000,-2.372,-0.735


For each one-point increase in GPA, the odds of being admitted are multiplied by **2.180027**, holding the other variables constant.

In [73]:
np.exp(results.params)

intercept    0.020716
gre          1.002221
gpa          2.180027
prest_2      0.506548
prest_3      0.262192
prest_4      0.211525
dtype: float64

In [61]:
1/_ # reciprocal of each exponentiated coefficient (`_` is the previous output)

intercept    48.272116
gre           0.997784
gpa           0.458710
prest_2       1.974147
prest_3       3.813995
prest_4       4.727566
dtype: float64

In [62]:
df_cp['prestige'].value_counts()

2    148
3    121
4     67
1     61
Name: prestige, dtype: int64

If a person attended the most prestigious institution, their odds of being admitted are **4.727566** times what they would be had they attended the least prestigious one, holding the other variables constant.

In [77]:
lr = sm.Logit(df['admit'], df[['intercept', 'gre', 'gpa', 'prest_1', 'prest_2', 'prest_3']])
results = lr.fit()
np.exp(results.params)

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


intercept    0.004382
gre          1.002221
gpa          2.180027
prest_1      4.727566
prest_2      2.394738
prest_3      1.239531
dtype: float64

If a person attended the most prestigious institution, their odds of being admitted are **3.813995** times what they would be had they attended the second least prestigious one, holding the other variables constant.

In [78]:
lr = sm.Logit(df['admit'], df[['intercept', 'gre', 'gpa', 'prest_1', 'prest_2', 'prest_4']])
results = lr.fit()
np.exp(results.params)

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


intercept    0.005432
gre          1.002221
gpa          2.180027
prest_1      3.813995
prest_2      1.931971
prest_4      0.806757
dtype: float64

If a person attended the most prestigious institution, their odds of being admitted are **1.974147** times what they would be had they attended the second most prestigious one, holding the other variables constant.

In [79]:
lr = sm.Logit(df['admit'], df[['intercept', 'gre', 'gpa', 'prest_1', 'prest_4', 'prest_3']])
results = lr.fit()
np.exp(results.params)

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


intercept    0.010494
gre          1.002221
gpa          2.180027
prest_1      1.974147
prest_4      0.417582
prest_3      0.517606
dtype: float64
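
Refitting with a different baseline is not strictly necessary to get these numbers: the odds ratio between any two prestige levels is the ratio of their exponentiated coefficients from a single fit. A minimal sketch, using the values printed for the prestige-1 baseline (the `rebase` helper is hypothetical, written only for this check):

```python
# Exponentiated coefficients with prestige 1 as the baseline (copied from above).
# The baseline level itself has an odds ratio of 1 by definition.
odds_vs_prest_1 = {1: 1.0, 2: 0.506548, 3: 0.262192, 4: 0.211525}

def rebase(odds_ratios, new_baseline):
    """Re-express every odds ratio relative to a different baseline level."""
    base = odds_ratios[new_baseline]
    return {level: value / base for level, value in odds_ratios.items()}

vs_prest_4 = rebase(odds_vs_prest_1, 4)
vs_prest_2 = rebase(odds_vs_prest_1, 2)
print({k: round(v, 4) for k, v in vs_prest_4.items()})
print({k: round(v, 4) for k, v in vs_prest_2.items()})
```

Up to rounding of the copied values, the results agree with the refitted models above, and the coefficients for `gre` and `gpa` are unaffected by the choice of baseline.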

## Model diagnostics
To determine how well your logistic regression model is doing at predicting the correct labels, the most common measure to use is **accuracy**.
* Accuracy is the number of correctly labeled rows divided by the total number of rows in the data set.
* In some cases, accuracy won't work well, particularly when you have large class imbalances in your data set.
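
The class-imbalance problem is easy to reproduce. The sketch below uses made-up labels with 107 fraud rows out of 8793 (roughly the 1.2% fraud rate seen in the fraud data) and a "model" that never predicts fraud; the function name `accuracy` is chosen for this illustration.

```python
def accuracy(actual, predicted):
    """Share of rows whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up labels: 107 fraud rows, the rest non-fraud.
actual = [1] * 107 + [0] * (8793 - 107)
# A useless "model" that never flags fraud.
always_zero = [0] * len(actual)

acc = accuracy(actual, always_zero)
caught = sum(a == 1 and p == 1 for a, p in zip(actual, always_zero))
print(round(acc, 4), caught)
```

The accuracy comes out close to 99% even though not a single fraud case is caught, which is why other measures are needed when classes are imbalanced.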

> [Udacity free ML course](https://www.udacity.com/course/intro-to-machine-learning--ud120)

> [Nanodegree on ML](https://www.udacity.com/course/machine-learning-engineer-nanodegree--nd009t)

## Confusion matrix
Actual class vs. Predicted class
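
A confusion matrix simply counts how often each actual class is paired with each predicted class. As a minimal sketch with a handful of made-up labels (not the data behind the numbers below), the counts can be built with the standard library:

```python
from collections import Counter

# Hypothetical actual and predicted labels for six rows.
actual    = ["bush", "bush", "powell", "blair", "powell", "bush"]
predicted = ["bush", "powell", "powell", "blair", "bush", "bush"]

# Count each (actual, predicted) pair.
confusion = Counter(zip(actual, predicted))
labels = sorted(set(actual) | set(predicted))

# One row per actual class, one column per predicted class.
for a in labels:
    print(a, [confusion[(a, p)] for p in labels])
```

The diagonal entries are the correct predictions; everything off the diagonal is a false positive for the predicted class and a false negative for the actual class.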

In [81]:
# Recall for Colin Powell
# true positives / (true positives + false negatives)
55/63

0.873015873015873

In [83]:
# Precision for Colin Powell
# true positives / (true positives + false positives)
55/67

0.8208955223880597

In [84]:
# Recall for George Bush
123/127

0.968503937007874

In [89]:
# Precision for George Bush
123/156

0.7884615384615384

In [88]:
# Tony Blair
TP = 26
FP = 8
FN = 8

In [90]:
# Donald Rumsfeld
TP = 25
FP = 2
FN = 11
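
The last two cells only store the counts. A small sketch that turns them into precision and recall with the same formulas used for the first two classes (the function names are made up for this illustration):

```python
def precision(tp, fp):
    """true positives / (true positives + false positives)"""
    return tp / (tp + fp)

def recall(tp, fn):
    """true positives / (true positives + false negatives)"""
    return tp / (tp + fn)

# (TP, FP, FN) counts taken from the two cells above.
counts = {"Tony Blair": (26, 8, 8), "Donald Rumsfeld": (25, 2, 11)}

for name, (tp, fp, fn) in counts.items():
    print(name, round(precision(tp, fp), 4), round(recall(tp, fn), 4))
```

With equal false positives and false negatives the two measures coincide, as for Tony Blair; for Donald Rumsfeld precision is high while recall is noticeably lower.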

## Model diagnostics in Python