# Logistic Regression

Logistic regression is a supervised learning method used for binary classification. It is closely related to linear regression: both model the target with a linear combination of the independent variables, but linear regression predicts a *continuous* dependent variable, whereas logistic regression predicts a *categorical* one.

### Model

A logistic regression model is run using the following steps:

1. Compute a linear combination of the features, as in linear regression.
2. Convert the predicted values to probabilities using the sigmoid function, which always returns a value between 0 and 1. This is an important step since the raw output of a regression line is unbounded and highly susceptible to outliers, and is therefore unsuitable for binary classification.

    \begin{equation*}      
    S(x) = \frac{1}{1 + e^{-x}}
    \label{eq:1} \tag{1}
    \end{equation*} 


3. Set a threshold value for our sigmoid function (the standard is 0.5). Any value higher than our threshold is classified as belonging to one group. Anything lower is classified as belonging to the other group. For example, if $S(X_i) = 0.7$, $X_i$ belongs to the first group.  
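
The three steps above can be sketched in a few lines of NumPy. The weights `w`, intercept `b`, and sample matrix `X` are made-up values chosen for illustration; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the interval (0, 1), as in equation (1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and samples (illustrative values only)
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[2.0, 1.0],
              [-1.0, 3.0],
              [0.0, 0.0]])

z = X @ w + b                        # step 1: linear combination of the features
probs = sigmoid(z)                   # step 2: convert to probabilities
labels = (probs >= 0.5).astype(int)  # step 3: apply the 0.5 threshold

print(probs)
print(labels)
```

Raising or lowering the threshold in the last step changes which samples land in each group without changing the probabilities themselves.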

### Training the Model

Below is an example of how to train and evaluate a logistic regression model using startup company data. In this model, we'll predict the State of each startup given its other features. Specifically, we want to predict whether the startup is based in New York or not. 

In [29]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv('../50_Startups.csv')
df = df.dropna()

df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


Our target, `State`, is already categorical but non-binary. To make it binary, we'll use $1$ to represent New York and $0$ for all other states. 

In [30]:
# Extract dependent var and predictor vars
x = df.drop('State', axis = 1)
y = df['State'] = (df['State'] == 'New York').astype(int)

# Get training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 1 | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 0 | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 1 | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 0 | 166187.94 |


Next, we're going to create and train the model.

In [39]:
model = LogisticRegression(solver='liblinear', C=10.0, random_state=0)
model.fit(x_train, y_train)

LogisticRegression(random_state=0)

Unlike linear regression, which is fit by least squares, logistic regression is trained by maximum likelihood estimation: the solver searches for the coefficients that minimize the log loss (cross-entropy) on the training data. By default, `LogisticRegression()` uses `lbfgs` as the optimization algorithm, which supports L2 regularization. `liblinear` used to be the default; it is a good choice for smaller data sets and supports both L1 and L2 penalties. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for the full list of parameters.

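The parameter `C` is a hyperparameter that sets the inverse of the regularization strength. A high value of `C` tells the model to give more weight to fitting the training data and less weight to the complexity penalty, which is appropriate when we trust the training data. A low value of `C` regularizes more strongly.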
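
To see the effect of `C` in practice, the sketch below fits the same model with three values of `C` and compares the size of the learned coefficients. It uses synthetic data from `make_classification` rather than the startup data set, and the specific `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the startup data set)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(solver='liblinear', C=C, random_state=0)
    clf.fit(X, y)
    # Smaller C means a stronger penalty, which shrinks the coefficients
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:>6}: coefficient norm = {norms[C]:.3f}")
```

The coefficient norm should grow as `C` increases, since the penalty on large coefficients becomes weaker.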

### Evaluating the Model

Now, let's check the performance of our model.

In [45]:
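p_pred = model.predict_proba(x_test)        # class probabilities, one row per sample
y_pred = model.predict(x_test)              # predicted class labels (0 or 1)
r2 = model.score(x_test, y_test)            # mean accuracy (not R^2, despite the name)
conf_m = confusion_matrix(y_test, y_pred)   # rows: actual class, columns: predicted class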

`p_pred` is a matrix of probabilities for our targets (in this case, $0$ and $1$), with one row per test sample. For example, the probability that the first sample is $0$ is about $0.77$ and the probability that it's $1$ is about $1-0.77=0.23$.

In [46]:
p_pred

array([[0.7683486 , 0.2316514 ],
       [0.78861458, 0.21138542],
       [0.73986803, 0.26013197],
       [0.65224295, 0.34775705],
       [0.69214806, 0.30785194],
       [0.7919051 , 0.2080949 ],
       [0.5545323 , 0.4454677 ],
       [0.71152914, 0.28847086],
       [0.67516454, 0.32483546],
       [0.63250928, 0.36749072]])

`y_pred` contains only the predicted class labels, $0$ or $1$.

In [47]:
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
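
For a binary problem, `predict` amounts to thresholding the second column of `predict_proba` at 0.5. The sketch below reproduces that with the probabilities printed above, copied by hand and rounded to four decimals. The 0.3 threshold is an arbitrary value chosen only to show how the labels change.

```python
import numpy as np

# P(class = 1) for each test sample, rounded from the p_pred output above
p_one = np.array([0.2317, 0.2114, 0.2601, 0.3478, 0.3079,
                  0.2081, 0.4455, 0.2885, 0.3248, 0.3675])

default_labels = (p_one > 0.5).astype(int)  # same rule predict() applies here
lenient_labels = (p_one > 0.3).astype(int)  # hypothetical lower threshold

print(default_labels)
print(lenient_labels)
```

No probability in the second column exceeds 0.5, which is why every prediction is $0$.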

A confusion matrix is used to evaluate the performance of a classification model. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so it is read as follows:

* **True negatives** are in the upper-left quadrant
* **False negatives** are in the lower-left quadrant
* **False positives** are in the upper-right quadrant
* **True positives** are in the lower-right quadrant

In [48]:
conf_m

array([[5, 0],
       [5, 0]])
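
The four quadrants can be unpacked by name and turned into summary metrics. In the sketch below, the labels are reconstructed to match the counts in the matrix above; the order of the samples is an assumption, and only the counts matter.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels reconstructed to match the matrix above: 5 actual negatives,
# 5 actual positives, and a model that predicts 0 every time.
y_true = np.array([0] * 5 + [1] * 5)
y_hat = np.zeros(10, dtype=int)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of actual positives the model found

print(tn, fp, fn, tp)
print(accuracy, recall)
```

A recall of zero means none of the actual positives were identified, even though half of the predictions are correct.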

In [49]:
r2

0.5

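With an accuracy of $0.5$ and every test sample predicted as $0$, the model never identifies a New York startup. I guess these features alone aren't enough to predict the state for each startup.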