# Logistic Regression

***

Logistic regression is used to answer yes-or-no questions, or questions about the odds of something happening. To do this, it uses the "logistic" function.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/2000px-Logistic-curve.svg.png" width="30%" height="30%" />

This function is derived from the following:

### Odds
We want to know the odds of something happening. These can be expressed as
$$Odds = \frac{p}{1 - p}$$
where $p$ is the probability of the event.

If the odds are well above 1, say 3 (which corresponds to $p = 0.75$), the event is likely. If they are well below 1, say 1/3 (which corresponds to $p = 0.25$), the event is unlikely. Odds of exactly 1 mean the event is as likely as not.
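As a quick illustration, the conversion between a probability and its odds can be sketched in a few lines of Python. The function names `odds_from_probability` and `probability_from_odds` are made up for this example; they are not part of any library. A probability of 0.75 corresponds to odds of 3, and a probability of 0.25 to odds of 1/3.

```python
def odds_from_probability(p):
    """Return the odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Invert the conversion: recover p from odds = p / (1 - p)."""
    return odds / (1.0 + odds)


# print the odds for a few probabilities
for p in (0.25, 0.5, 0.75):
    print(p, odds_from_probability(p))
```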

### Logit Function
We can use a transformation of this quantity called the "logit function", which is just the log of the odds above.
$$logit(p) = \log(\frac{p}{1 - p})$$

We want the logit to be a linear function of our predictor, so we set it equal to our linear equation $\theta_0 x_0 + \theta_1X$ (where $x_0 = 1$ is the intercept term). Thus, we can say:

$$logit(p) = \log(\frac{p}{1 - p}) = \theta_0 x_0 + \theta_1X$$
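One reason the logit pairs well with a linear equation is that it maps probabilities in (0, 1) onto the whole real line, which is also the range of a linear equation. A minimal sketch using only the standard library (the name `logit` here is our own helper, not an import):

```python
import math


def logit(p):
    """Log of the odds for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# probabilities below 0.5 give negative values, above 0.5 positive values
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, logit(p))
```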

### Logistic Function

One more step! We just need to solve for $p(X)$, which gives us:

$$p(X) = \frac{1}{1 + e^{-(\theta_0 x_0 + \theta_1X)}}$$
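For reference, the algebra behind this step can be written out as follows, using $z$ as shorthand for the linear equation $\theta_0 x_0 + \theta_1X$ (the shorthand is ours). Start from the logit, exponentiate both sides, and collect the terms in $p$:

$$\log\left(\frac{p}{1 - p}\right) = z$$
$$\frac{p}{1 - p} = e^{z}$$
$$p = e^{z}(1 - p)$$
$$p(1 + e^{z}) = e^{z}$$
$$p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$

The last equality divides the numerator and denominator by $e^{z}$, which is where the negative sign in the exponent comes from.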

### Kaggle Competition Titanic Dataset

We'll be using the dataset that is provided by the Titanic competition hosted at Kaggle.
The goal of this competition is to use the dataset to build a model that predicts who survived and who didn't.

Let's take a look at the dataset now.

In [39]:
%pylab inline
import pandas as pd

train = pd.read_csv("train.csv")
train.head()

Populating the interactive namespace from numpy and matplotlib


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |


We'll be skipping any kind of preprocessing or feature engineering as this document is meant to just focus on logistic regression.

Since we want to predict the Survived column, we need to choose a couple of predictors; let's use Pclass and Sex. We can start with the linear part of the model:
$$z = \theta_0 * x_0 + \theta_1 * Pclass + \theta_2 * Sex$$

And since we're calculating the probability that someone survived, the second equation just passes $z$ through the logistic function:
$$Prob(Survived) = p(z) = \frac{1}{1 + e^{-z}}$$

In [40]:
import statsmodels.api as sm
from sklearn import preprocessing

# add a column of 1s for the intercept
train['intercept'] = 1.0

# encode the Sex column as integers (LabelEncoder assigns codes in sorted label order)
le_sex = preprocessing.LabelEncoder()
train['Sex'] = le_sex.fit_transform(train['Sex'])

# identify the independent variables
columns = ['intercept', 'Pclass', 'Sex']
logistic_function = sm.Logit(train['Survived'], train[columns])
model_parameters = logistic_function.fit()

Optimization terminated successfully.
         Current function value: 0.464195
         Iterations 6


Let's check out how our model adjusted the parameters!

In [41]:
theta = model_parameters.params
theta

intercept    3.294642
Pclass      -0.960553
Sex         -2.643398
dtype: float64
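One way to read these numbers: each coefficient is the change in the log of the odds of survival for a one-unit increase in that predictor, so exponentiating a coefficient gives the factor by which the odds are multiplied. A small sketch, with the values copied from the output above into a plain dictionary (the dictionary and variable names are just for illustration):

```python
import math

# coefficients copied from the fitted model output above
coefficients = {'intercept': 3.294642, 'Pclass': -0.960553, 'Sex': -2.643398}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that predictor, holding the other fixed
odds_multipliers = {name: math.exp(value)
                    for name, value in coefficients.items()
                    if name != 'intercept'}

for name, factor in sorted(odds_multipliers.items()):
    print(name, round(factor, 3))
```

Both factors are below 1, so moving to a higher-numbered class, or changing the Sex code from 0 to 1, lowers the predicted odds of survival.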

We can update our model now to show the fitted linear part:
$$z = 3.294642 * x_0 - 0.960553 * Pclass - 2.643398 * Sex$$

Given the above, we can say the probability of someone surviving is:
$$p(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\theta_0 * x_0 + \theta_1 * Pclass + \theta_2 * Sex)}}$$

Let's write that in code.

In [42]:
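def probability_of_z(pclass, sex, theta):
    # linear predictor: intercept plus weighted predictors (theta is indexed by column name)
    z = theta['intercept'] + pclass * theta['Pclass'] + sex * theta['Sex']
    # logistic function; exp comes from the %pylab namespace
    return 1 / (1 + exp(-z))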

Now that we've got everything set up, let's make a prediction. First we need to see how scikit-learn encoded our categorical variable Sex.

In [45]:
print(le_sex.inverse_transform([0, 1]))

['female' 'male']


Cool, so if we wanted to see how a woman in first class fared, we could make a prediction like so:

In [46]:
probability_of_z(1, 0, theta)

0.91166115474379972

Seems like she would have fared pretty well. What about a guy in third class...

In [48]:
probability_of_z(3, 1, theta)

0.097052204431359895

Wow, that's abysmal. But it makes sense: the people in the lower parts of the ship probably didn't have a very good chance of escaping when the iceberg hit, and the Titanic is known for saving women and children first. Unfortunately, it looks like if you were a poor male on the Titanic, you were very unlikely to survive, at least according to our logistic regression model :).
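To compare every combination at once, the same calculation can be looped over all three classes and both sexes. This sketch is standalone: it hard-codes the coefficients printed earlier rather than reading them from the fitted model, and the helper name `survival_probability` is our own.

```python
import math

# coefficients copied from the fitted model output (rounded as printed)
THETA = {'intercept': 3.294642, 'Pclass': -0.960553, 'Sex': -2.643398}

# LabelEncoder mapping shown earlier: 0 = female, 1 = male
SEX_LABELS = {0: 'female', 1: 'male'}


def survival_probability(pclass, sex, theta=THETA):
    """Logistic function applied to the linear predictor z."""
    z = theta['intercept'] + theta['Pclass'] * pclass + theta['Sex'] * sex
    return 1.0 / (1.0 + math.exp(-z))


# one line per (class, sex) combination
for pclass in (1, 2, 3):
    for sex in (0, 1):
        prob = survival_probability(pclass, sex)
        print('class {} {}: {:.3f}'.format(pclass, SEX_LABELS[sex], prob))
```

Because both coefficients are negative, the predicted probability drops with each step down in class and is lower for men than for women within every class.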

Now, there are other things we could have done, like more pre-processing of the data or engineering new features. We could have also used other variables in our logistic regression function that would have helped our model be more accurate. But now we know the basics of logistic regression!