# Logistic Regression

***

## One Class

Logistic regression is used for answering yes-or-no questions, or questions involving the odds of something happening. To do this, it uses the "logistic" function.

The idea is that you don't want your hypothesis, $h_\theta (x)$, to be $> 1$ or $< 0$, and this is what logistic regression does for us: it limits the range of our hypothesis to be between 0 and 1, which is great as 0 can be "no" and 1 can be "yes" in answering any kind of binary question.
To express the above, we can say:

$$0 \le h_\theta (x) \le 1$$

Remember that in the linear regression tutorial, we defined our hypothesis, $h_\theta (x)$, to be $\theta^T x$. Now our new hypothesis for logistic regression is defined as:

$$h_\theta (x) = g(\theta^T x)$$
<p style="text-align:center"> where </p>
$$g(z) = \frac{1}{1 + e^{-z}}$$

This $g(z)$ function is also called the "sigmoid" function or the "logistic" function and is graphed below.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/2000px-Logistic-curve.svg.png" width=30%, height=30% />
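As a quick check of the curve's shape, here is a minimal sketch of $g(z)$ in plain Python. The two-branch form is an implementation choice (an assumption, not something from the tutorial) that avoids overflow in `math.exp` for large negative $z$; mathematically both branches equal $\frac{1}{1 + e^{-z}}$.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z; avoids computing a huge exp(-z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The output always stays strictly between 0 and 1 and crosses 0.5 at z = 0.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```

Note that $g(z) + g(-z) = 1$, which is why the curve is symmetric around the point $(0, 0.5)$.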

Thus, by substitution, we can say that our new hypothesis is:

$$h_\theta (x) = \frac{1}{1 + e^{-\theta^T x}}$$

To break the above equation down a little, $h_\theta (x)$ will be our estimated probability that $y = 1$ on input $x$. This can be expressed as the probability that $y = 1$ given $x$ using parameters $\theta$ or in other words:

$$h_\theta (x) = P(y = 1\ |\ x;\ \theta)$$

We know that if the probability of something is $0.3$, then the probability of "not-something" is $1 - 0.3 = 0.7$. Remember that in probabilities, 0 means you're absolutely certain the event won't happen and 1 means you're absolutely certain that it will. That being said, if we wanted to find the probability that $y = 0$, we could express it as such:

$$P(y = 0\ |\ x;\ \theta) = 1 - P(y = 1\ |\ x;\ \theta)$$

So we will be predicting $y = 1$ if $h_\theta (x) \ge 0.5$, which happens exactly when $\theta^T x \ge 0$. This makes sense looking at the graph of the sigmoid function: $g(0) = 0.5$, and if our line ($\theta^T x$) outputs something greater than 0, the sigmoid function rises above 0.5 and asymptotically approaches 1.
You can see that we're still using our line equation, $\theta^T x$, in a way. If you take another look at the graph and only look at the portion between -2 and 2, you can see the resemblance to a linear function. So the sigmoid function involves a linear function but is expressed in a way that we can use it for classification instead of just for predictions involving continuous numbers. The set of points where $\theta^T x = 0$ (where the predicted probability is exactly 0.5) is defined as the decision boundary for our model, separating the two different groups of data points.
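The prediction rule can be sketched in a few lines. The parameter values below are hypothetical, chosen only so that the decision boundary is the line $x_1 + x_2 = 3$; the `predict` helper and its `threshold` argument are illustrative names, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(theta, x, threshold=0.5):
    """Return 1 if h_theta(x) >= threshold, else 0 (x includes x_0 = 1)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical parameters: theta^T x = -3 + x_1 + x_2
theta = [-3.0, 1.0, 1.0]

print(predict(theta, [1.0, 1.0, 1.0]))  # z = -1, below the boundary: 0
print(predict(theta, [1.0, 2.0, 2.0]))  # z = 1, above the boundary: 1
print(predict(theta, [1.0, 1.5, 1.5]))  # z = 0, exactly on the boundary: 1
```

Thresholding the probability at 0.5 and thresholding $\theta^T x$ at 0 give the same answer, which is why the boundary itself is linear in $x$.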

In the previous notebook about linear regression, we came up with a cost function for our hypothesis, and that cost function was convex:

$$J(\theta) = \frac{1}{N}\sum\limits_{i=1}^N(h_\theta (x^{(i)}) - y^{(i)})^2$$

<p style="text-align:center"> where $\theta$ is a vector of coefficients and $i$ indexes the training examples </p>

Our sigmoid function is non-linear, so plugging our new hypothesis into the Mean Squared Error cost function gives a cost that is non-convex in $\theta$. That means we can't use the same cost function as we did in linear regression, since we need a cost function that's convex so our gradient descent algorithm can reliably converge to the global minimum.

Let's start out with rewriting the cost function like so:

$$J(\theta) = \frac{1}{N}\sum\limits_{i=1}^N\ Cost(h_\theta (x^{(i)}), y^{(i)})$$

We can define $Cost(h_\theta (x^{(i)}), y^{(i)})$ as:

$$Cost(h_\theta (x), y) =
\begin{cases}
-\log(h_\theta (x)) & \text{if } y = 1\\
-\log(1 - h_\theta (x)) & \text{if } y = 0
\end{cases}
$$

With the first case of the piecewise function, the model will be severely punished (high cost) if we predict a probability close to 0 but $y = 1$. Conversely, with the second case, if we predict a probability close to 1 but $y = 0$, there will be a high cost (the model will be punished).

We can actually simplify the piecewise function above to be:

$$Cost(h_\theta (x), y) = -y \log(h_\theta (x)) - (1 - y) \log(1 - h_\theta(x))$$

Substituting this into $J(\theta)$, we can now rewrite the cost function as:

$$J(\theta) = -\frac{1}{N}\sum\limits_{i=1}^N \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$
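To get a feel for how this cost punishes confident wrong answers, here is a small sketch in plain Python. The function names `cost` and `J` mirror the notation above, and the probabilities passed in are made-up values for illustration, not model output.

```python
import math

def cost(h, y):
    """Cost for one example, with prediction h in (0, 1) and label y in {0, 1}."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def J(hs, ys):
    """Average cost over all examples."""
    return sum(cost(h, y) for h, y in zip(hs, ys)) / len(ys)

# The true label is 1 in both cases below.
print(cost(0.99, 1))  # confident and right: about 0.01
print(cost(0.01, 1))  # confident and wrong: about 4.6

# Made-up predictions for four examples.
print(J([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))
```

Only one of the two terms is ever non-zero for a given example, which is how the single formula reproduces the piecewise definition.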

Now that we have a cost function for our sigmoid hypothesis, we need to find the partial derivative of that cost function so that we can use gradient descent to find the $\theta$ that minimizes $J(\theta)$.

$$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{N}\sum\limits_{i=1}^N(h_\theta (x^{(i)}) - y^{(i)})x_j^{(i)}$$

Now we have everything we need to use the gradient descent algorithm (which is below if you've forgotten)!

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
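Putting the update rule together with the gradient of the log cost, $\frac{1}{N}\sum_i (h_\theta (x^{(i)}) - y^{(i)})x_j^{(i)}$, a minimal batch gradient descent might look like the sketch below. It uses plain Python lists to stay self-contained. The toy data, the learning rate `alpha`, and the iteration count are all assumptions chosen for illustration; the classes overlap on purpose so that the optimal $\theta$ is finite.

```python
import math

def sigmoid(z):
    """Logistic function, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient_descent(X, y, alpha=0.5, iterations=5000):
    """Batch gradient descent for logistic regression.

    X is a list of rows, each starting with x_0 = 1; y holds 0/1 labels.
    """
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(iterations):
        # Current predictions h_theta(x) for every training example.
        h = [sigmoid(sum(t * xj for t, xj in zip(theta, row))) for row in X]
        # Partial derivative of J with respect to each theta_j.
        grad = [sum((h[i] - y[i]) * X[i][j] for i in range(n)) / n
                for j in range(d)]
        # Simultaneous update of all parameters.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: an intercept column plus one feature.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(X, y)
print(theta)
```

All parameters are updated from the same set of predictions in each pass, so the gradient is computed in full before any $\theta_j$ changes.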

### Kaggle Competition Titanic Dataset

We'll be using the dataset that is provided by the Titanic competition hosted at Kaggle.
The goal of this competition is to use the dataset to build a model that predicts who survived and who didn't.

Let's take a look at the dataset now.

In [5]:
%pylab inline
import pandas as pd

train = pd.read_csv("train.csv")
train.head()

Populating the interactive namespace from numpy and matplotlib


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |


We'll be skipping any kind of preprocessing or feature engineering as this document is meant to just focus on logistic regression.

Since we want to predict the Survived column, we need to choose a couple of predictors; let's use Pclass and Sex. We can start with what would be the linear part of the sigmoid function (where $x_0 = 1$ is the intercept term):
$$\theta^Tx = \theta_0 * x_0 + \theta_1 * Pclass + \theta_2 * Sex$$

Now that we've defined the linear part of the model, we can just plug it in to our hypothesis:
$$h_\theta (x) = \frac{1}{1 + e^{-(\theta_0 * x_0 + \theta_1 * Pclass + \theta_2 * Sex)}}$$

Since we've already coded the gradient descent algorithm in the linear regression tutorial, we'll just skip to using third-party libraries. If you'd like to code the above algorithms yourself, refer to the linear regression notebook for implementation details.

In [8]:
import statsmodels.api as sm
# add a column of 1s for the intercept
train['intercept'] = 1.0

# encode the Sex column as integers (one integer per category)
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
train['Sex'] = le_sex.fit_transform(train['Sex'])

# identify the independent variables
columns = ['intercept', 'Pclass', 'Sex']
logistic_function = sm.Logit(train['Survived'], train[columns])
model_parameters = logistic_function.fit()

Optimization terminated successfully.
         Current function value: 0.464195
         Iterations 6


Let's check out how our model adjusted the parameters!

In [9]:
theta = model_parameters.params
theta

intercept    3.294642
Pclass      -0.960553
Sex         -2.643398
dtype: float64

Now that we've got our optimized thetas, we can update the linear part of our model to show:

$$\theta^Tx = 3.294642 * x_0 - 0.960553 * Pclass - 2.643398 * Sex$$

Given the above, we can update our hypothesis and say the probability of someone surviving is:
$$h_\theta(x) = \frac{1}{1 + e^{-(3.294642 * x_0 - 0.960553 * Pclass - 2.643398 * Sex)}}$$

Let's write that in code.

In [10]:
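import numpy as np

def hypothesis(pclass, sex, theta):
    # theta is the Series of fitted parameters, indexed by column name
    z = theta['intercept'] + pclass * theta['Pclass'] + sex * theta['Sex']
    return 1 / (1 + np.exp(-z))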

Now that we've got everything set up, let's make a prediction. First we need to see how scikit-learn encoded our categorical variable Sex.

In [11]:
print(le_sex.inverse_transform([0, 1]))

['female' 'male']


Cool, so if we wanted to see how a woman in first class fared, we could make a prediction like so:

In [12]:
hypothesis(1, 0, theta)

0.91166115474379972
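
To see where this number comes from, the same calculation can be done by hand with the fitted parameters, using $x_0 = 1$, $Pclass = 1$ and $Sex = 0$ (values rounded):

$$\theta^T x = 3.294642 * 1 - 0.960553 * 1 - 2.643398 * 0 = 2.334089$$

$$h_\theta (x) = \frac{1}{1 + e^{-2.334089}} \approx \frac{1}{1 + 0.0969} \approx 0.9117$$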

Seems like she would have fared pretty well. What about a guy in third class?

In [14]:
hypothesis(3, 1, theta)

0.097052204431359895

Wow, that's abysmal. But it makes sense: the people in the lower parts of the ship probably didn't have a very good chance of escaping when the iceberg hit, and the Titanic is known for saving women and children first. Unfortunately, it looks like if you were a poor male on the Titanic, you very likely didn't survive, at least according to our logistic regression model :).

Now there are other things we could have done like more pre-processing of the data such as engineering new features. We could have also used other variables in our logistic regression function that would have helped our model be more accurate. But now we know the basics of logistic regression - for a single class.

## Multiple Classes

Now the good news is that since we know how to pull off logistic regression for one class (binary classification), we can easily do it for multiple classes. The strategy is to just do binary classification for each class while setting the other classes as negative classes. So for each class:

$$h_\theta^{(i)}(x) = P(y = i\ |\ x;\ \theta)\ (i = 1, 2, 3)$$

We want to train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.

On a new input $x$, to make a prediction, pick the class $i$ that gives the highest probability.
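
The prediction step of this one-vs-rest strategy can be sketched as follows. The three parameter vectors are hypothetical stand-ins for classifiers that have already been trained (one per class), and `predict_one_vs_rest` is an illustrative helper name, not a library function.

```python
import math

def sigmoid(z):
    """Logistic function, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_one_vs_rest(thetas, x):
    """Score x with every per-class classifier and pick the most confident one.

    thetas maps each class label to that class's parameter vector;
    x includes the intercept term x_0 = 1.
    """
    probs = {
        label: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for label, theta in thetas.items()
    }
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical, already-trained parameters for classes 1, 2 and 3.
thetas = {
    1: [2.0, -1.0, 0.0],
    2: [-1.0, 1.0, -1.0],
    3: [-2.0, 0.0, 1.0],
}

label, probs = predict_one_vs_rest(thetas, [1.0, 3.0, 4.0])
print(label, probs)
```

Because each classifier is trained independently, the per-class probabilities generally do not sum to 1; only their ranking is used to pick the class.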