# Logistic Regression 

In [1]:
from data.create_data import read_frmgham  # project-local loader for the dataset
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

%matplotlib inline

In [4]:
data = read_frmgham()
data.dropna(subset=['hyperten', 'age'], inplace=True)

### Research Questions
  1. Is there an effect of age on the onset of hypertension?

#### Hypertension
Per the dataset, a patient is hypertensive (has high blood pressure) if at least one of the following holds:
  * Systolic blood pressure (`sysbp`) ≥ 140 mmHg
  * Diastolic blood pressure (`diabp`) ≥ 90 mmHg

## Model: Logistic Regression
Logistic regression is a form of regression used when the dependent variable (*response*) is **dichotomous/binary**. Essentially, it is a predictive model (supervised learning) that addresses a *classification* problem. 

The logistic model predicts the probability that an observation belongs to a particular class; thresholding that probability *maps the observation to a particular class label*.

### Consideration: *Why not apply linear regression?*
Although the response is coded numerically, it is semantically binary: the values **`0`** and **`1`** are labels, each with its own meaning.

##### Problem of Linear Regression Model
A *linear regression* model produces predictions (`y`) that are difficult to interpret because they are not restricted to the two valid values: a prediction can fall between them (e.g. y = 0.5) or outside them entirely (below 0 or above 1). The only meaningful values of a dichotomous response variable are **`0`** and **`1`**.  
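As a small illustration of this problem, the sketch below fits an ordinary least-squares line to a 0/1 response. The arrays are synthetic and made up for this example; they are not the Framingham data.

```python
import numpy as np

# Synthetic example: the response switches from 0 to 1 halfway through x.
x = np.arange(10)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Ordinary least-squares fit of a straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Smallest and largest fitted values: one falls below 0, the other above 1.
print(y_hat.min(), y_hat.max())
```

Fitted values outside the 0 to 1 range cannot be read as probabilities or as class labels.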

##### Benefits of Logistic Regression
*Logistic regression* avoids this issue by modeling the **odds** of a class/event, on a log scale, rather than the probability directly. More specifically, its linear predictor is the *log-odds of a particular class/event*.
  * Odds are a different representation of the same information as probabilities.
  * The odds of an event/class are the ratio of the probability of the event to the probability of its complement.
  * Probabilities are confined to [0, 1], but odds range over [0, ∞) and log-odds over (−∞, ∞), so a linear function of the predictors can never produce an invalid value.
  
`odds = o = p / (1 - p)`
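The relationship between probabilities, odds, and log-odds can be sketched with a few helper functions. The function names are illustrative and are not part of this notebook.

```python
import math

def prob_to_odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def odds_to_prob(o):
    """Inverse of prob_to_odds: recover the probability from the odds."""
    return o / (1.0 + o)

def log_odds(p):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(prob_to_odds(p))

for p in (0.1, 0.5, 0.9):
    print(p, prob_to_odds(p), log_odds(p))
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0; probabilities close to 1 give arbitrarily large odds.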

#### Equation
**`log o = β`<sub>0</sub> `+ β`<sub>1</sub>` X`<sub>1</sub>**

Model Coefficients:
  * **β<sub>0</sub>**: intercept
     * The log(odds) value when `x`<sub>1</sub> (predictor) = 0
  * **β<sub>1</sub>**: difference in log(odds) of the outcome (`y`) per one-unit difference in `x`<sub>1</sub>.
     * For every one-unit increase in `x`<sub>1</sub>, `log o` changes by β<sub>1</sub>.
     * Equivalently, for every one-unit increase in `x`<sub>1</sub>, the odds are multiplied by `e`<sup>β<sub>1</sub></sup>.

##### Odds Ratio
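  * e<sup>β<sub>0</sub></sup>: odds of disease onset when `x`<sub>1</sub> = 0
  * e<sup>β<sub>0</sub>+β<sub>1</sub></sup>: odds of disease onset when `x`<sub>1</sub> = 1
  * e<sup>β<sub>1</sub></sup>: **odds ratio**, the ratio of the two, i.e. the factor by which the odds change per one-unit increase in `x`<sub>1</sub>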
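
As a short derivation that uses only the model equation above: exponentiating both sides gives the odds, and the ratio of the odds at two predictor values one unit apart reduces to a constant that does not depend on the starting value.

```latex
\begin{aligned}
\log o(x_1) &= \beta_0 + \beta_1 x_1 \\
o(x_1) &= e^{\beta_0 + \beta_1 x_1} \\
\frac{o(x_1 + 1)}{o(x_1)} &= \frac{e^{\beta_0 + \beta_1 (x_1 + 1)}}{e^{\beta_0 + \beta_1 x_1}} = e^{\beta_1}
\end{aligned}
```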

#### Predictor Variables (`X`)
  * **`age`**: age of patients

#### Response (`Y`)
  * **`hyperten`**: onset of hypertension
    * **`0`**: No hypertension
    * **`1`**: patient has hypertension

### Hypothesis Tests

1) Hypothesis
  * **H<sub>0</sub>**: There is no effect of age on the onset of hypertension (β<sub>1</sub>=0).
  * **H<sub>A</sub>**: There is an effect of age on the onset of hypertension (β<sub>1</sub>≠0).
  
2) Statistical Test
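Fit the model, then compute the test statistic and p-value for the slope.
  * coefficient under test: β<sub>1</sub> = β<sub>age</sub>
  * test statistic: Wald `z` = β<sub>age</sub> / (standard error of β<sub>age</sub>)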

In [7]:
formula = 'hyperten ~ age'
log_model = smf.logit(formula=formula, data=data)
log_results = log_model.fit()

print("intercept = %.3f" % log_results.params['Intercept'])
print("Beta Coefficient = %.3f" % log_results.params['age'])

Optimization terminated successfully.
         Current function value: 0.549257
         Iterations 6
intercept = -1.634
Beta Coefficient = 0.050


In [9]:
print("Odds ratio = %.3f" % np.exp(log_results.params['age']))

Odds ratio = 1.052


#### Model Coefficients
  * β<sub>0</sub> (intercept) = -1.634
  * β<sub>1</sub> (slope) = 0.05
    * e<sup>β<sub>1</sub></sup> (odds ratio) = 1.052  

The estimated slope (β<sub>1</sub>) is 0.05. Thus, for every one-unit (one-year) increase in `age`, `log o` (the log-odds of the *response*) increases by 0.05.

Equivalently, each additional year of `age` multiplies the odds of hypertension by 1.052 (an odds ratio of 1.052:1), i.e. an increase of about 5.2% in the odds per year.

From the logistic regression model, the equation to predict the log-odds of `hyperten` using the predictor variable `age` is:  

`log o = 0.05 x - 1.634`
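
To turn the fitted equation into a predicted probability, the log-odds can be passed through the inverse logit (logistic) function. The sketch below uses the rounded coefficients from the equation above, so its results are approximate; `predict_prob` is a hypothetical helper name, not part of statsmodels.

```python
import math

# Rounded coefficients from the fitted equation above; results are approximate.
INTERCEPT = -1.634
SLOPE_AGE = 0.05

def predict_prob(age):
    """Predicted probability of hypertension at a given age (inverse logit)."""
    log_o = INTERCEPT + SLOPE_AGE * age
    return 1.0 / (1.0 + math.exp(-log_o))

for age in (40, 50, 60, 70):
    print(age, round(predict_prob(age), 3))
```

The predicted probability rises with age but always stays strictly between 0 and 1, unlike the output of a straight-line fit.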

#### p-value

In [10]:
log_results.pvalues['age']

2.8841963691930888e-98
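
The p-value above comes from a Wald test: the coefficient is divided by its standard error to give a `z` statistic, which is compared against a standard normal distribution. The following is a minimal sketch of that calculation. The standard error of 0.00239 is an assumed, approximate value chosen for illustration, and `wald_test` is a hypothetical helper, not a statsmodels function.

```python
import math

def wald_test(coef, std_err):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = coef / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) but stays accurate in the
    # far tail, where computing 1 - Phi(|z|) directly would round to 0.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

z, p = wald_test(coef=0.0502, std_err=0.00239)  # std_err is an assumed value
print(z, p)
```

With a `z` statistic this large the p-value is far below any conventional significance level, although it is still a positive number rather than exactly 0.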

#### 95% Confidence Interval

In [11]:
print("95% CI = ", list(log_results.conf_int().loc['age']))

95% CI =  [0.045550968281770389, 0.054909750952097734]
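
Because this interval is on the log-odds scale, exponentiating its endpoints gives the corresponding interval for the odds ratio. A short sketch, using the two endpoints printed above rounded to five decimals:

```python
import math

# Endpoints of the 95% CI for the age coefficient (log-odds scale), rounded.
ci_log_odds = (0.04555, 0.05491)

# exp() is monotonic, so the endpoints map directly to the odds-ratio scale.
ci_odds_ratio = tuple(math.exp(b) for b in ci_log_odds)
print(ci_odds_ratio)
```

A log-odds interval that excludes 0 corresponds to an odds-ratio interval that excludes 1.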


#### Summary Statistics

In [12]:
log_results.summary()

| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable | hyperten | No. Observations | 11627 |
| Model | Logit | Df Residuals | 11625 |
| Method | MLE | Df Model | 1 |
| Date | Thu, 16 Mar 2017 | Pseudo R-squ. | 0.03573 |
| Time | 20:13:40 | Log-Likelihood | -6386.2 |
| converged | True | LL-Null | -6622.8 |
| | | LLR p-value | 6.249e-105 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -1.6342 | 0.128 | -12.812 | 0.000 | -1.884, -1.384 |
| age | 0.0502 | 0.002 | 21.039 | 0.000 | 0.046, 0.055 |


3) Results
  * **β<sub>age</sub>** = 0.05
  * **p-value** ≈ 2.88 × 10<sup>-98</sup> (displayed as 0.000 in the summary)

The **p-value** is extremely small (p-val < 0.05), providing strong evidence against the null hypothesis. Additionally, the 95% **confidence interval** [0.046, 0.055] does not contain a slope (β<sub>age</sub>) of 0.

The null hypothesis (H<sub>0</sub>) is rejected in favor of the alternative hypothesis (H<sub>A</sub>): there is a statistically significant association between `age` and the onset of hypertension (`hyperten`).

##### Conclusion
There is a significant effect of `age` on the onset of hypertension. For each one-unit (one-year) increase in `age`, `log o` increases by 0.05 (**β<sub>age</sub>**).

Equivalently, each additional year of `age` multiplies the odds of hypertension by 1.052.

The positive **β<sub>age</sub>** indicates that older patients are more likely to have hypertension.

## Model's Accuracy
The model is evaluated by its accuracy: the fraction of predictions that are correct, compared against the accuracy of a simple baseline guessing strategy.

#### Baseline Prediction Strategy
The baseline strategy is to guess positive/true, i.e. the binary value **`1`**, every time. The accuracy of this strategy is simply the *fraction of hypertensive patients*.

The exogenous (*predictor*) and endogenous (*response*) variables must be extracted first. They are available as the `exog` and `endog` attributes of the model object returned by `smf.logit`.

In [19]:
endog = pd.DataFrame(log_model.endog,
                     columns=[log_model.endog_names])  # endog_names is a single string
exog = pd.DataFrame(log_model.exog,
                    columns=log_model.exog_names)      # exog_names is already a list
print(exog.head())
print()
print(endog.head())

   Intercept   age
0        1.0  39.0
1        1.0  52.0
2        1.0  46.0
3        1.0  52.0
4        1.0  58.0

   hyperten
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0


In [22]:
actual = endog['hyperten']
baseline = actual.mean()
baseline

0.74326997505805448

##### Conclusion
The fraction of hypertensive patients is 0.743. Thus, the baseline *guess* of always predicting hypertension is correct about 74.3% of the time.
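
The reason the mean of a 0/1 column equals the accuracy of always guessing **`1`** can be checked on a small example. The labels below are synthetic and chosen only for illustration.

```python
import numpy as np

# Synthetic 0/1 labels for illustration only (not the Framingham data).
actual = np.array([1, 1, 0, 1, 0, 1, 1, 1])

always_one = np.ones_like(actual)             # predict 1 for every observation
baseline_acc = np.mean(always_one == actual)  # fraction of correct guesses

print(baseline_acc, actual.mean())
```

Both quantities count the same thing, the share of observations labeled **`1`**, so they are always equal.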

#### Model Accuracy Strategy
Use the logistic regression model to compute the fraction of predictions that are correct.

In [38]:
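predict = log_results.predict() >= 0.5  # True where the predicted probability is at least 0.5
true_pos = predict * actual             # 1 where hypertension is predicted and present
true_neg = (1-predict) * (1-actual)     # 1 where no hypertension is predicted and none is present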

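`log_results.predict()` returns a NumPy array of predicted probabilities; comparing them against 0.5 thresholds each one to a predicted class, **`1`** (`True`) or **`0`** (`False`).

Multiplying by `actual` yields **`1`** if we predict *hypertension* and get it right, **`0`** otherwise. Likewise, `(1-predict) * (1-actual)` yields **`1`** only when we correctly predict no hypertension.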

**Accuracy** is the fraction of correct guesses:  
  * *True Positive*: a correct prediction of hypertension (**`1`**)
  * *True Negative*: a correct prediction of no hypertension (**`0`**)  
`acc = (sum(true_pos) + sum(true_neg)) / len(actual)`

In [39]:
acc = (np.sum(true_pos) + np.sum(true_neg)) / len(actual)
acc

0.7433559817665778
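
The same accuracy calculation can also be written with boolean masks, which makes the count of each outcome explicit. This is a sketch on synthetic arrays with hypothetical variable names, not a rerun of the model above.

```python
import numpy as np

# Synthetic probabilities and labels for illustration only.
prob = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.6])
actual = np.array([1, 1, 0, 0, 1, 1])

predicted = (prob >= 0.5).astype(int)  # threshold the probabilities at 0.5

true_pos = np.sum((predicted == 1) & (actual == 1))
true_neg = np.sum((predicted == 0) & (actual == 0))
false_pos = np.sum((predicted == 1) & (actual == 0))
false_neg = np.sum((predicted == 0) & (actual == 1))

# Accuracy counts only the two kinds of correct prediction.
acc = (true_pos + true_neg) / len(actual)
print(true_pos, true_neg, false_pos, false_neg, acc)
```

Listing the false positives and false negatives alongside the correct predictions shows where a model's errors come from, which a single accuracy figure hides.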

##### Conclusion
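The accuracy of the logistic regression is 74.3%, essentially identical to the baseline (0.74336 vs. 0.74327, a difference of about one observation out of 11,627). With `age` as the only predictor and a 0.5 threshold, the fitted probability exceeds 0.5 for any age above roughly 32.5 years (1.6342 / 0.0502), which likely means the model predicts hypertension for nearly every observation. Despite the statistically significant slope, it therefore offers no practical accuracy gain over always guessing **`1`**.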