# Logistic Regression

## 0) Warmup

1. What are we trying to predict in this week's project?
- Which values does y take on?
- What information do we use to make the prediction?

## 1) Predicting Probabilities

Instead of directly predicting the binary outcome (0, 1), we are actually predicting a probability of "success" (belonging to class 1).

$f(X) = \hat{p}(X)$

where X are input features such as *age*, *Pclass*, *gender*, ...

How do we do that?

- The model turns a weighted sum of the inputs into a probability with the sigmoid function: $\hat{p}(x) = \frac{1}{1 + e^{-(wx + b)}}$
- A threshold value of 0.5 converts the probability into a class: predict 1 if $\hat{p}(x) > 0.5$, otherwise 0
- The parameters are responsible for the predictions
    - w are the weights of the input features --> they determine the steepness and direction of the curve
    - b is a parameter that shifts the curve to the left (b > 0) or right (b < 0), assuming a positive w. It determines the predicted probability for x=0
- How do we find the parameters? --> The loss is minimized --> Most machine learning algorithms have some kind of loss (objective function) that is minimized.
- For logistic regression, minimizing the loss (the log loss) is equivalent to maximizing the likelihood of observing the data points that we have observed
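The link between loss and likelihood can be checked numerically. The following is a minimal sketch in plain Python, assuming a single input feature; the parameter values and the toy data are made up for illustration and are not fitted to the Titanic data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Predicted probability of class 1 for a single feature value x."""
    return sigmoid(w * x + b)

# Illustrative parameters and toy data (assumptions, not fitted values)
w, b = 1.5, -0.5
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 1, 1]

# For x = 0 the prediction depends on b alone
p_at_zero = predict_proba(0.0, w, b)

# Likelihood: the probability the model assigns to the labels we observed
likelihood = 1.0
for x, y in zip(xs, ys):
    p = predict_proba(x, w, b)
    likelihood *= p if y == 1 else (1.0 - p)

# Log loss, summed over all data points
log_loss = 0.0
for x, y in zip(xs, ys):
    p = predict_proba(x, w, b)
    log_loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(p_at_zero)
print(log_loss, -math.log(likelihood))
```

The two numbers on the last line agree: the summed log loss is the negative log of the likelihood, so parameters that make one smaller make the other larger.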

## 2) Let's do it

In [2]:
# !pip install -U scikit-learn

In [1]:
# Import the necessary packages
import pandas as pd

# Import the logistic regression
from sklearn.linear_model import LogisticRegression

In [4]:
# Import the dataset
df = pd.read_csv('train.csv', index_col=0)
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:
type(df[['Pclass']]), type(df['Pclass'])

(pandas.core.frame.DataFrame, pandas.core.series.Series)

In [8]:
# Define X
# For simplicity, I use only the passenger class as the input variable
X = df[['Pclass']]  # scikit-learn expects a 2D input, e.g. a pd.DataFrame

In [11]:
# Define y
y = df['Survived']  # scikit-learn expects a 1D target, e.g. a pd.Series

In [9]:
# Split the data into a training set and a test set
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

In [16]:
X_test.shape, y_test.shape

((223, 1), (223,))
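The test set has 223 rows because `train_test_split` holds out 25% of the data by default and rounds the test size up. A rough sketch of the idea in plain Python follows; `simple_train_test_split` is a hypothetical helper, it does not reproduce scikit-learn's exact shuffling, and the 891 rows are the size of the Kaggle Titanic training file.

```python
import math
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=10):
    """Shuffle the rows and cut off a test set (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed --> reproducible split
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 891 passengers (assumption: the full Kaggle train.csv)
passenger_ids = range(1, 892)
train_ids, test_ids = simple_train_test_split(passenger_ids)

print(len(train_ids), len(test_ids))
```

The important property is that the two sets do not overlap, so the test rows stay unseen during training.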

In [26]:
# Create a model
m = LogisticRegression(random_state=10) 

In [27]:
# Train a model
m.fit(X_train, y_train)  # <-- this runs the whole iterative process of finding the parameters

LogisticRegression(random_state=10)
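To get a feeling for what "iterative process" means, here is a minimal sketch of gradient descent on the log loss for one feature. This is an illustration only: scikit-learn uses a more sophisticated solver and adds regularization by default, so its numbers will differ. The function `fit_logistic`, the learning rate, the step count and the toy data are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, n_steps=5000):
    """Find w and b by repeatedly stepping against the gradient of the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x / n
            grad_b += error / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up toy data: a higher class number goes with fewer survivors
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
ys = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1]

w, b = fit_logistic(xs, ys)
print(w, b)
```

On this toy data the weight comes out negative, meaning the predicted survival probability falls as the class number rises.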

In [28]:
# Where are the parameters/coefficients (b and w)?
w = m.coef_[0]
w

array([-0.80400439])

In [29]:
b = m.intercept_
b

array([1.40871638])
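With w and b in hand, the predicted probabilities can be recomputed by hand. The sketch below plugs the values from the two outputs above into the sigmoid function for each passenger class; it assumes the model has exactly this one feature.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fitted values copied from the cell outputs above
w = -0.80400439
b = 1.40871638

probabilities = {}
for pclass in (1, 2, 3):
    probabilities[pclass] = sigmoid(w * pclass + b)
    predicted = 1 if probabilities[pclass] > 0.5 else 0
    print(pclass, round(probabilities[pclass], 3), predicted)
```

The probabilities come out at roughly 0.65, 0.45 and 0.27, so only first-class passengers end up above the 0.5 threshold and are predicted to survive.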

In [30]:
# Use the model to make predictions on the training (seen) data
ypred_train = m.predict(X_train)

In [31]:
# You can make predictions for unseen data
ypred_test = m.predict(X_test)

In [32]:
ypred_test

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1])

#### In the long run, we want to use more than one predictor (x-variable)

In [35]:
df_multi = df.dropna(subset=['Age'])

In [36]:
df_multi.shape

(714, 11)

In [37]:
X_multi = df_multi[['Pclass', 'Age']]
y_multi = df_multi['Survived']

In [38]:
# For simplicity, I skip the train-test split here

In [39]:
# Refit the same model object, now with two features (this replaces the earlier fit)
m.fit(X_multi, y_multi)

LogisticRegression(random_state=10)

In [40]:
m.coef_

array([[-1.22653571, -0.04149665]])
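One common way to read these coefficients is as odds ratios: exponentiating a coefficient gives the factor by which the odds of survival change when that feature increases by one unit and the other feature stays fixed. A small sketch using the two values from the output above:

```python
import math

# Coefficients copied from the output above, in the order ['Pclass', 'Age']
coef_pclass = -1.22653571
coef_age = -0.04149665

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratio_pclass = math.exp(coef_pclass)
odds_ratio_age = math.exp(coef_age)

print(round(odds_ratio_pclass, 3), round(odds_ratio_age, 3))
```

Both ratios are below 1, so according to this model the odds of survival drop with each step down in class and, much more mildly, with each additional year of age.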

In [43]:
# Accuracy --> What fraction of the data points did I classify correctly?
# Scale: 0 to 1. Note: this is measured on the data the model was trained on
m.score(X_multi, y_multi)

0.696078431372549
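Accuracy is simple enough to compute by hand. The sketch below uses a hypothetical `accuracy` helper and made-up labels to show the calculation that `m.score` performs for a classifier.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for illustration: 6 of the 8 predictions are correct
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)

# The score above is consistent with 497 correct predictions out of 714 rows
print(497 / 714)
```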

In [None]:
# Experiment with different combinations of features
# to see which ones improve your accuracy