# Logistic Regression

__Logistic Regression__ is a supervised machine learning algorithm that uses regression to predict the continuous probability, ranging from `0` to `1`, of a data sample belonging to a specific category, or class. Then, based on that probability, the sample is classified as belonging to the more probable class, ultimately making Logistic Regression a classification algorithm.
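
As a rough illustration of that two-step idea (a probability first, then a class), here is a minimal sketch. The `sigmoid` and `classify` helpers and the example scores are made up for illustration and are not part of the project code.

```python
import math

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 when the predicted probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0

# hypothetical log-odds scores for three samples
scores = [-2.0, 0.5, 3.0]
probabilities = [sigmoid(z) for z in scores]
labels = [classify(z) for z in scores]

print(probabilities)
print(labels)
```

The probabilities are continuous, while the labels are the discrete classes that a classifier ultimately reports.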


## Predict Titanic Survival

The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage, sinking to the bottom of the Atlantic Ocean after hitting an iceberg and taking with it 1,502 of the 2,224 passengers and crew on board.

In this project you will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data we'll be using for training our model is provided by Kaggle. Feel free to make the model better on your own and submit it to the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic)!

## Step 1. Load the Data

The file `passengers.csv` contains the data of `891` passengers onboard the Titanic when it sank that fateful day. Let's begin by loading the data into a pandas DataFrame.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load the passengers data
passengers = pd.read_csv('passengers.csv')
print(passengers.head(10))
print(passengers.columns)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male  54

## Step 2. Clean the Data

Given the saying, "women and children first," `Sex` and `Age` seem like good features to predict survival. Let's map the text values in the `Sex` column to a numerical value. Update `Sex` such that all values `female` are replaced with `1` and all values `male` are replaced with `0`.

In [2]:
# update Sex column to numerical
passengers['Sex'] = passengers['Sex'].map({'female': 1, 'male': 0})
print(passengers.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name  Sex   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris    0  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0      1      0   
2                             Heikkinen, Miss. Laina    1  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)    1  35.0      1      0   
4                           Allen, Mr. William Henry    0  35.0      0      0   

             Ticket     Fare Cabin Embarked  
0         A/5 21171   7.2500   NaN        S  
1          PC 17599  71.2833   C85        C  
2  STON/O2. 3101282   7.9250   NaN        S  
3            113803  53.1000  C123        S  
4            373450   8.0500   NaN        S  


In [3]:
print(passengers['Age'])

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64


We have multiple missing values, or `nan`s. Let's fix this.
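
To see how many values are missing before filling them, one option is to count them per column. The sketch below uses a small made-up DataFrame named `toy` rather than the real `passengers` data.

```python
import numpy as np
import pandas as pd

# small made-up frame that mimics the Age and Sex columns
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, np.nan],
    'Sex': [0, 1, 1, 1, 0],
})

# isna() marks missing entries as True; sum() counts them per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same call on `passengers` would show which columns need attention.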

In [4]:
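# fill the nan values in the Age column with the rounded mean age
# (assigning the result back avoids relying on chained inplace modification)
passengers['Age'] = passengers['Age'].fillna(round(passengers['Age'].mean()))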
print(passengers['Age'])

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    30.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64


Given the strict class system onboard the Titanic, let's utilize the `Pclass` column, or the passenger class, as another feature. Create a new column named `FirstClass` that stores `1` for all passengers in first class and `0` for all other passengers.

In [5]:
# create a FirstClass column
passengers['FirstClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 1 else 0)
print(passengers['FirstClass'])

0      0
1      1
2      0
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: FirstClass, Length: 891, dtype: int64


Create a new column named `SecondClass` that stores `1` for all passengers in second class and `0` for all other passengers. 

In [6]:
# create a SecondClass column
passengers['SecondClass'] = passengers['Pclass'].apply(lambda x: 1 if x == 2 else 0)
print(passengers['SecondClass'])

0      0
1      0
2      0
3      0
4      0
      ..
886    1
887    0
888    0
889    0
890    0
Name: SecondClass, Length: 891, dtype: int64
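
The two indicator columns above can also be built in a single step with `pd.get_dummies`. This is an alternative sketch on a made-up `toy` frame, not the approach used in this project. The column names `Pclass_1`, `Pclass_2`, and `Pclass_3` come from the `prefix` argument chosen here.

```python
import pandas as pd

# made-up passenger classes
toy = pd.DataFrame({'Pclass': [3, 1, 3, 1, 2]})

# one indicator column per class value: Pclass_1, Pclass_2, Pclass_3
dummies = pd.get_dummies(toy['Pclass'], prefix='Pclass').astype(int)

# drop third class so it acts as the baseline, mirroring FirstClass and SecondClass
dummies = dummies.drop(columns='Pclass_3')
print(dummies)
```

Leaving one class out avoids a redundant column, since a passenger who is in neither first nor second class must be in third.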


## Step 3. Select and Split the Data

Now that we have cleaned our data, let's select the columns we want to build our model on. 


In [7]:
# select the desired features
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passengers['Survived']

In [8]:
# split the data into training and test sets
train_features, test_features, train_labels, test_labels = train_test_split(features, survival)
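
Because the split above is random, the scores shown later will vary from run to run. If reproducibility matters, a sketch like the following may help. The data, the `random_state` value, and the use of `stratify` are assumptions for illustration, not settings used in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 20 samples, 2 features, balanced binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 matches the default split; random_state fixes the shuffle;
# stratify keeps the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

With a fixed `random_state`, rerunning the cell produces the same split each time.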

## Step 4. Normalize the Data

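`sklearn`'s Logistic Regression implementation works best when feature data is normalized.

Normalization rescales the features so they are on a comparable scale. Here, `StandardScaler` standardizes each feature to have mean `0` and standard deviation `1`. This matters because `sklearn`'s Logistic Regression applies a technique called Regularization by default, which penalizes large coefficients and is therefore sensitive to the scale of each feature.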

In [9]:
# scale the feature data so it has mean = 0 and standard deviation = 1
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)
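
To make the scaling step less opaque, the sketch below reproduces `StandardScaler` by hand on a few made-up rows. The arrays `train` and `test` are illustrative only. Note that the mean and standard deviation come from the training rows and are then reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up rows with two features (for example Sex and Age)
train = np.array([[0.0, 22.0], [1.0, 38.0], [1.0, 26.0], [0.0, 54.0]])
test = np.array([[1.0, 30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std, then scales
test_scaled = scaler.transform(test)        # reuses the training mean and std

# the same computation by hand
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

Fitting the scaler only on the training data keeps information about the test set from leaking into the model.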

## Step 5. Create and Evaluate the Model

`sklearn` is a Python library that helps build, train, and evaluate machine learning models.

To take advantage of its abilities, we can begin by creating a LogisticRegression object.

After that, we need to fit our model on the data. When we fit the model with `sklearn`, an iterative solver (`lbfgs` by default, as the printed parameters below show) repeatedly updates the coefficients of our model in order to minimize the log-loss plus a regularization penalty. We train the model using the `.fit()` method, which takes two parameters: a matrix of features and a vector of class labels.

In [10]:
# create and train a LogisticRegression model
model = LogisticRegression()
model.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
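
To give a feel for what "repeatedly updating the coefficients to minimize the log-loss" means, here is a simplified stand-in written with plain gradient descent in NumPy. It is not what `sklearn` runs internally (the parameters above show `solver='lbfgs'` and an `l2` penalty). The data, learning rate, and iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    """Average negative log-likelihood of the true labels."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# made-up data: the label depends positively on feature 0, negatively on feature 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)

w = np.zeros(2)      # coefficients
b = 0.0              # intercept
learning_rate = 0.1
losses = []

for _ in range(200):
    p = sigmoid(X @ w + b)
    losses.append(log_loss(y, p))
    # gradient of the (unregularized) log-loss
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(losses[0], losses[-1])
print(w, b)
```

The loss starts at `log(2)` because all-zero coefficients predict a probability of `0.5` for every sample, and it falls as the coefficients move toward values that fit the data.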

In [11]:
# score the model on the train data
train_score = model.score(train_features, train_labels)
print(train_score)

0.8008982035928144


Scoring the model on the training data will run the data through the model and make final classifications on survival for each passenger in the training set. The score returned is the fraction of correct classifications, or the accuracy (about 80% here).

In [12]:
# score the model on the test data
test_score = model.score(test_features, test_labels)
print(test_score)

0.7623318385650224


Similarly, scoring the model on the testing data will run the data through the model and make final classifications on survival for each passenger in the test set.
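
The accuracy returned by `.score()` can be reproduced by comparing predictions with the true labels. The sketch below does this on made-up data, so the numbers it prints are unrelated to the Titanic scores above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# accuracy by hand: the share of predictions that match the labels
predictions = model.predict(X)
manual_accuracy = np.mean(predictions == y)

print(manual_accuracy, model.score(X, y))
```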

Now that the model is trained, we can access a few useful attributes of the LogisticRegression object:
* `model.coef_` is a vector of the coefficients of each feature
* `model.intercept_` is the intercept

Since our data is standardized, all features are on a comparable scale. Given this understanding, we can compare the feature coefficients' magnitudes and signs to determine which features have the greatest impact on class prediction, and whether that impact is positive or negative.

Features with larger, __positive coefficients__ will increase the probability of a data sample belonging to the positive class.

Features with larger, __negative coefficients__ will decrease the probability of a data sample belonging to the positive class.

Features with __small__, positive or negative coefficients have minimal impact on the probability of a data sample belonging to the positive class.

In [13]:
# analyze the coefficients
print(list(zip(['Sex', 'Age', 'FirstClass', 'SecondClass'], model.coef_[0])))

[('Sex', 1.279831884560119), ('Age', -0.3494349894255272), ('FirstClass', 0.9057109077549288), ('SecondClass', 0.3817538408471818)]


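We see that `Sex` has the largest coefficient, making it the most important feature in predicting survival on the sinking of the Titanic, followed by `FirstClass`. The negative coefficient on `Age` indicates that older passengers were less likely to survive.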
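
One way to read coefficients like those printed above is to rank them by absolute magnitude and to exponentiate them into odds ratios. The sketch below copies the four printed values. The variable names `ranked` and `odds_ratios` are illustrative.

```python
import numpy as np

# coefficients copied from the printed output above
names = ['Sex', 'Age', 'FirstClass', 'SecondClass']
coefficients = np.array([1.279831884560119, -0.3494349894255272,
                         0.9057109077549288, 0.3817538408471818])

# rank features from largest to smallest absolute coefficient
order = np.argsort(-np.abs(coefficients))
ranked = [names[i] for i in order]

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in the scaled feature (one standard deviation)
odds_ratios = np.exp(coefficients)

print(ranked)
print(dict(zip(names, odds_ratios)))
```

An odds ratio above `1` raises the odds of the positive class, and one below `1` lowers them.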

## Step 6. Predict with the Model

Let's use our model to make predictions on the survival of a few fateful passengers. 

The arrays store 4 feature values, in the following order:
* `Sex`, represented by a `0` for male and `1` for female
* `Age`, represented as an integer in years
* `FirstClass`, with a `1` indicating the passenger is in first class
* `SecondClass`, with a `1` indicating the passenger is in second class

With our trained model we are able to predict whether new data points belong to the positive class using the `.predict()` method.

It takes a matrix of features as a parameter and returns a vector of labels `1` or `0` for each sample. In making its predictions, `sklearn` uses a classification threshold of `0.5`.

In [14]:
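# sample passenger features: [Sex, Age, FirstClass, SecondClass]
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
# note: SecondClass is a 0/1 indicator, so 2.0 lies outside the values seen in training
You = np.array([1.0,31.0,0.0,2.0])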

# combine passenger arrays
sample_passengers = np.array([Jack, Rose, You])

# scale the sample passenger features
sample_passengers = scaler.transform(sample_passengers)
print(sample_passengers)

[[-0.71506099 -0.776322   -0.57504547 -0.49625463]
 [ 1.39848211 -1.00941645  1.73899292 -0.49625463]
 [ 1.39848211  0.07835765 -0.57504547  4.52644374]]


In [15]:
# make survival predictions
survival_prediction = model.predict(sample_passengers)
print(survival_prediction)

[0 1 1]


If we are more interested in the predicted probability of the data samples belonging to the positive class than the actual class, we can use the `.predict_proba()` method. It takes a matrix of features as a parameter and returns an array with one row per sample and one column per class, where each entry is a probability ranging from `0` to `1`.

In [16]:
# print probabilities that led to the predictions
survival_p = model.predict_proba(sample_passengers)
print(survival_p)

[[0.88528942 0.11471058]
 [0.05526427 0.94473573]
 [0.09277133 0.90722867]]


The first column is the probability of a passenger perishing on the Titanic, and the second column is the probability of a passenger surviving the sinking (which was calculated by the model to make the final classification decision).
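
The link between these probabilities and the earlier predictions can be sketched by applying a threshold to the second column. The array below is copied from the printed output. The stricter threshold of `0.92` is an arbitrary value chosen only to show the effect of moving the cutoff.

```python
import numpy as np

# probabilities copied from the printed output above
survival_p = np.array([[0.88528942, 0.11471058],
                       [0.05526427, 0.94473573],
                       [0.09277133, 0.90722867]])

# second column = probability of the positive class (survived)
p_survived = survival_p[:, 1]

# the default cutoff of 0.5 reproduces the predictions printed earlier
default_labels = (p_survived > 0.5).astype(int)

# a stricter, hypothetical cutoff changes the borderline case
strict_labels = (p_survived > 0.92).astype(int)

print(default_labels)
print(strict_labels)
```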

# Review

* __Logistic Regression__ is used to perform binary classification, predicting whether a data sample belongs to the positive (present) class, labeled `1`, or the negative (absent) class, labeled `0`.
* __The Sigmoid Function__ maps the log-odds (the intercept plus the sum of each feature value multiplied by its coefficient) to the range `(0, 1)`, providing the probability of a sample belonging to the positive class.
* A __loss function__ measures how well a machine learning model makes predictions. The loss function of Logistic Regression is __log-loss__.
* A __Classification Threshold__ is used to determine the probabilistic cutoff for where a data sample is classified as belonging to a positive or negative class. The standard cutoff for Logistic Regression is `0.5`, but the threshold can be higher or lower depending on the nature of the data and the situation.
* __Scikit-learn__ has a Logistic Regression implementation that allows you to fit a model to your data, find the feature coefficients, and make predictions on new data samples.
* The coefficients determined by a Logistic Regression model can be used to interpret the relative importance of each feature in predicting the class of a data sample.
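
To tie several of these points together, the sketch below fits a model on made-up data and rebuilds its positive-class probabilities from `coef_` and `intercept_` using the sigmoid function. The data and variable names are illustrative. The comparison assumes a binary problem, which is the setting used throughout this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map log-odds to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# made-up binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# log-odds: intercept plus the sum of feature values times their coefficients
log_odds = X @ model.coef_[0] + model.intercept_[0]

# the sigmoid turns log-odds into the probability of the positive class
manual_proba = sigmoid(log_odds)
sklearn_proba = model.predict_proba(X)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```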
