# Lede Algorithms, Lecture 4. 
#### July 19, 2019. Friday. 

## Introduction to Logistic Regression

We're going to use logistic regression on a very old, albeit familiar, dataset to understand the concepts behind how we approach logistic regression problems. Logistic regression, as you may remember, is a classification algorithm. For the purpose of this class, we're restricting the scope to dichotomous (binary) dependent variables, i.e. the category your algorithm predicts will be one of two choices. The dataset: Titanic. The output: Survived? Yes or no. 
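
As a rough illustration of what "one of two choices" means mechanically, here is a minimal sketch of how a logistic regression model turns a weighted sum of features into a probability and then into a yes/no label. The function names, the weights, and the 0.5 threshold are all assumptions made up for this example; nothing here is fitted to the Titanic data.

```python
import math


def sigmoid(z):
    """Squash any real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for a single row of numeric features."""
    # Linear score: bias plus the weighted sum of the features.
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    # The binary decision is just a cut-off on the probability.
    return probability, int(probability >= threshold)


# Hypothetical weights, invented purely for illustration (not learned from data).
weights = [-2.0, -0.5]  # e.g. "is male", "passenger class"
bias = 2.0

probability, label = predict_label([1, 3], weights, bias)
print(f"probability={probability:.3f}, label={label}")
```

Training a real model amounts to finding the weights and bias that make these probabilities match the known labels as closely as possible.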

You might wonder why we are bothering to use this dataset again. Is it particularly interesting? And what are the odds (no pun intended) of us ever using a dataset like this for journalistic reasons? Let's go through the exercise and then discuss the ethics of doing something like this on more current/live data. 

In [1]:
# Let's import all the packages/modules that we need. 
# If you get a `ModuleNotFoundError`, run a `pip install` for the module 
# that failed to import.
# You'll need to have the following Python modules installed:
# - matplotlib
# - numpy
# - pandas
# - seaborn
# - scikit-learn (the package that provides the `sklearn` module)

import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import style

### Loading the data in. 
You should notice that we now have two different files: `train.csv` and `test.csv`. Wait, what, why? (This data has been downloaded from Kaggle: https://www.kaggle.com/c/titanic/data)

`train.csv` is our training dataset, i.e. the dataset our algorithm will learn from. It has a bunch of features and the output label (i.e. survived: yes/no) for a subset of the passengers. The remaining passengers are in the `test.csv` file, i.e. they make up our test dataset. 

In [2]:
# Yes, we are doing this here, but remember, relative paths are evil. 

train_df = pd.read_csv('sources/titanic_train.csv')
test_df = pd.read_csv('sources/titanic_test.csv')

# In theory, both dataframes should have the same number of columns, but 
# the training dataset should have more rows. Let's do a quick sanity check
# to see the kind of data we are dealing with. 

print(train_df.shape, test_df.shape)

(891, 12) (418, 12)


## Explore and clean the data
It's important to get a feel for the data, because unless you're intimate with it, the chances of making a careless mistake are high. In this case, most of the column names are self-explanatory, but there are a couple that aren't. 
- `SibSp` is the number of siblings and spouses the passenger had aboard
- `Parch` is the number of parents and children the passenger had aboard; for a child, 0 means they were travelling only with a nanny. 

Once we understand what each of the columns means, let's dive into the data. Typically, folks are inclined to do a `df.head()`, but the problem is that it often masks inconsistencies, so randomly sampling the data can be a better approach. It's not 100% foolproof, but...

In [3]:
## Let's see a quick sample of our data to see what we've got. 
train_df.sample(15)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
140,141,0,3,"Boulos, Mrs. Joseph (Sultana)",female,,0,2,2678,15.2458,,C
127,128,1,3,"Madsen, Mr. Fridtjof Arne",male,24.0,0,0,C 17369,7.1417,,S
808,809,0,2,"Meyer, Mr. August",male,39.0,0,0,248723,13.0,,S
157,158,0,3,"Corn, Mr. Harry",male,30.0,0,0,SOTON/OQ 392090,8.05,,S
309,310,1,1,"Francatelli, Miss. Laura Mabel",female,30.0,0,0,PC 17485,56.9292,E36,C
389,390,1,2,"Lehmann, Miss. Bertha",female,17.0,0,0,SC 1748,12.0,,C
681,682,1,1,"Hassab, Mr. Hammad",male,27.0,0,0,PC 17572,76.7292,D49,C
491,492,0,3,"Windelov, Mr. Einar",male,21.0,0,0,SOTON/OQ 3101317,7.25,,S
357,358,0,2,"Funk, Miss. Annie Clemmer",female,38.0,0,0,237671,13.0,,S
321,322,0,3,"Danoff, Mr. Yoto",male,27.0,0,0,349219,7.8958,,S


Now that we've had a quick glance, let's find out: 
- if any data is missing
- if any data needs to be normalised 
- if any data needs to be removed altogether
- ...and, let's make some decisions: do we really need all these columns?

In [4]:
# First, we attempt to identify missing data points. 
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

#### OK, so a few things: 
- Only three columns (henceforth called "features") are missing data. What are they, and what % of data is missing?
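
One way to answer the percentage question is sketched below. It uses a small made-up dataframe (`toy_df` is an assumption, standing in for `train_df`) so that it runs on its own; the same two lines should work unchanged on the real training data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives True/False per cell; the column mean of booleans is the
# fraction missing, so multiplying by 100 gives a percentage.
missing_pct = (toy_df.isnull().mean() * 100).round(1)

# Keep only the columns that actually have gaps, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Applied to the counts printed above (891 rows in total), that works out to roughly 77.1% missing for `Cabin` (687), 19.9% for `Age` (177), and 0.2% for `Embarked` (2).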


Note: anytime you think, "we can fudge this," **PAUSE**. 
What are the consequences of fudging the data? What would happen if you got it wrong? How much would it undermine everything else you've done? 


Also, a passenger's `PassengerId`, name, or family relationships should have no bearing on whether they survived the disaster. So, maybe we can drop those columns. 

In [5]:
# let's drop the columns we don't need so that we can focus on what we do need. 
# remember: whatever action we take on the training data, we need to do exactly the same for the test data. 

train_df = train_df.drop(['PassengerId','Name','Parch','SibSp'], axis=1)
test_df = test_df.drop(['PassengerId','Name','Parch','SibSp'], axis=1)

### How do we solve the two missing "Embarked" values?

In [6]:
# Let's first look at the data, to see if we can see anything obvious. 
# For example, can the family data help us extrapolate?

print(f"The NA indices for Embarked are at: {train_df.index[train_df['Embarked'].isna()]}")
train_df.iloc[[61, 829]]

The NA indices for Embarked are at: Int64Index([61, 829], dtype='int64')


Unnamed: 0,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked
61,1,1,female,38.0,113572,80.0,B28,
829,1,1,female,62.0,113572,80.0,B28,


In [7]:
# Now, let's look at the missing data points, and see what we can do. 
# Starting with Embarked...
### What's the breakdown of values in Embarked?

train_df.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [8]:
# When we break down "Embarked" by the counts, we see 'S' is the most common, so we blindly adopt that. 
# We can blindly do this for two reasons:
# (i) the number of missing data points is tiiiiiiny 
# (ii) does the port where someone embarked _realistically_ impact their survival? 

train_df.Embarked = train_df.Embarked.fillna('S')

### How do we solve the missing data in Cabin?

If you look through one of the references, you'll see that they use what they have and ignore what they don't. But, that doesn't seem entirely right, because it might skew our results one way or another. So, let's drop it. 


In [9]:
## Drop Cabin from our training and test data sets

train_df = train_df.drop(['Cabin'], axis=1)
test_df = test_df.drop(['Cabin'], axis=1)

### How do we solve the missing data in Age?

Age is obviously a far more crucial datapoint. The approaches adopted by the references vary. 
- Using the median age based on the data where age != NA
- Computing a random age based on the mean age and standard deviation

For the sake of simplicity, let's go down the route of option 1. Again, remember, we need to mimic this across the training & test datasets. 

In [10]:
# When we fill the median age on the test dataframe, we stick to the median
# age we computed from the training data. This is because we shouldn't be
# doing _anything_ based on the test dataset. We might end up skewing the 
# results or overfitting, and we need to be careful to avoid that. 

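# Compute the median once, from the training data only, and reuse it for both frames.
median_age = train_df["Age"].median()  # NaNs are skipped by default
train_df["Age"] = train_df["Age"].fillna(median_age)
test_df["Age"] = test_df["Age"].fillna(median_age)

# Any missing fares are filled with 0.
train_df["Fare"] = train_df["Fare"].fillna(0)
test_df["Fare"] = test_df["Fare"].fillna(0)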
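
For comparison, the second option listed above (a random age based on the mean and standard deviation) could look something like the sketch below. This is only one possible reading of that approach: the toy dataframes, the helper name `fill_random_age`, the fixed seed, and the choice of a uniform draw within one standard deviation of the mean are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Toy stand-ins for train_df / test_df; the values are made up.
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0, 62.0]})

# As with the median, the statistics come from the training data only.
mean_age = toy_train["Age"].mean()
std_age = toy_train["Age"].std()
low, high = mean_age - std_age, mean_age + std_age


def fill_random_age(df, low, high, rng):
    """Return a copy of df with missing ages drawn uniformly from [low, high)."""
    filled = df.copy()
    missing = filled["Age"].isna()
    filled.loc[missing, "Age"] = rng.uniform(low, high, size=missing.sum())
    return filled


toy_train_filled = fill_random_age(toy_train, low, high, rng)
toy_test_filled = fill_random_age(toy_test, low, high, rng)
print(toy_train_filled)
```

The trade-off: random draws keep some spread in the age column instead of piling every missing passenger onto a single value, but the results change from run to run unless the seed is fixed.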

## Data Normalisation
For logistic regression to work, the data needs to be "normalised". By that, we mean:
- we can't have string values, i.e. `Sex`, which is currently male/female, should be numeric. 
- individual ages might lead to _overfitting_, so we create age brackets. 
- ...and we might want to see what we can do with the tickets, because they might be unique ticket IDs, in which case they wouldn't help.

In [11]:
# let's start with tickets, and see how many unique values we have

train_df.Ticket.value_counts()

1601            7
CA. 2343        7
347082          7
3101295         6
347088          6
CA 2144         6
S.O.C. 14879    5
382652          5
17421           4
349909          4
113760          4
4133            4
LINE            4
113781          4
347077          4
PC 17757        4
19950           4
W./C. 6608      4
2666            4
110413          3
C.A. 31921      3
C.A. 34651      3
345773          3
248727          3
PC 17760        3
35273           3
230080          3
PC 17572        3
347742          3
29106           3
               ..
2926            1
342826          1
349205          1
PC 17482        1
365222          1
14311           1
244310          1
236171          1
A/5. 2151       1
239855          1
17764           1
364846          1
234604          1
350025          1
334912          1
315086          1
237668          1
2697            1
17463           1
112058          1
2693            1
315037          1
349910          1
392092          1
SC 1748   

In [12]:
# Right, you can immediately tell this isn't going to fly: 681 unique items here 
# will lead to overfitting. So, let's drop this. 

train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)

In [13]:
# Next up, genders. Let's check what our gender counts are, and then decide 
# on unique values for each gender. 

train_df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [14]:
# OK, so that's pretty straightforward: add a new column to the dataframe: 'Male'. 
# If male, this column should have a value of 1. Else 0. 

train_df["Male"] = train_df.Sex.apply(lambda x: 1 if x=='male' else 0)
test_df["Male"] = test_df.Sex.apply(lambda x: 1 if x=='male' else 0)

In [15]:
# And, now, let's do age brackets. How should we do this? 'Adult'/'Child' to
# keep it simple? Or, proper brackets? 
# As the folks at TowardsDataScience did, let's do age brackets. (And, yes, 
# this cell is blindly copying & pasting their code.)

data = [train_df, test_df]
for dataset in data:
    dataset['Age_Cat'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age_Cat'] <= 11, 'Age_Cat'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age_Cat'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age_Cat'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age_Cat'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age_Cat'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age_Cat'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age_Cat'] = 6
    dataset.loc[ dataset['Age'] > 66, 'Age_Cat'] = 7
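
The same bracketing can probably be expressed more compactly with `pd.cut`. The sketch below is an alternative, not the method used in this notebook; it runs on a made-up `toy_df` and reuses the bracket edges from the cell above, with each bin including its upper edge.

```python
import numpy as np
import pandas as pd

# Toy ages, made up for illustration.
toy_df = pd.DataFrame({"Age": [0.42, 11.0, 11.5, 18.0, 28.0, 40.0, 66.0, 80.0]})

# Bin edges mirror the brackets used above; each bin is (lower, upper].
age_edges = [-np.inf, 11, 18, 22, 27, 33, 40, 66, np.inf]

# labels=False returns the integer position of the bin (0 to 7) rather than
# an interval label, which matches the Age_Cat encoding.
toy_df["Age_Cat"] = pd.cut(toy_df["Age"], bins=age_edges, labels=False)
print(toy_df)
```

Keeping the edges in a single list makes it harder for the brackets to drift apart between the training and test dataframes.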

In [16]:
# We have to do the same for fares. 
for dataset in data:
    dataset['Fare_Cat'] = dataset['Fare'].astype(int)
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare_Cat'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare_Cat'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare_Cat']   = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare_Cat']   = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare_Cat']   = 4
    dataset.loc[ dataset['Fare'] > 250, 'Fare_Cat'] = 5

In [17]:
# And, because Embarked is a single letter, let's just map it to int, too. 
# This is cheating somewhat. Instead of going through the values, and assigning
# [1, 2, 3], what we are doing is simply mapping the letter to its ordinal value.

train_df["Embarked_Cat"] = train_df.Embarked.apply(lambda x: ord(x))
test_df["Embarked_Cat"] = test_df.Embarked.apply(lambda x: ord(x))
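
One caveat with the `ord()` shortcut is that it gives the ports numeric values (67, 81, 83) whose order and spacing mean nothing, yet the model will treat them as a scale. A common alternative is one-hot encoding, sketched below on made-up data. The helper name `one_hot_embarked` and the toy dataframes are assumptions; the list of ports is taken from the training data so that both frames end up with identical columns.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df; the values are made up.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["C", "S"]})  # note: no "Q" in here

# The set of categories comes from the training data only.
ports = sorted(toy_train["Embarked"].unique())


def one_hot_embarked(df, ports):
    """Return one 0/1 column per port, in the same order for every frame."""
    as_category = pd.Series(
        pd.Categorical(df["Embarked"], categories=ports), index=df.index
    )
    # With a categorical input, get_dummies emits a column for every
    # category, even ones that do not occur in this particular frame.
    return pd.get_dummies(as_category, prefix="Embarked").astype(int)


train_encoded = one_hot_embarked(toy_train, ports)
test_encoded = one_hot_embarked(toy_test, ports)
print(test_encoded)
```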

In [18]:
#...and now we can drop fare, sex, age, and embarked from the original data frames
# as we have the normalised columns in there. 

train_df = train_df.drop(['Fare','Sex','Age','Embarked'], axis=1)
test_df = test_df.drop(['Fare','Sex','Age','Embarked'], axis=1)

## Machine Learning
OK, we finally have data that we can run logistic regression on. So, here's what we have to do: 
1. Split our data frames into `X_train`, `Y_train`, `X_test`, and `Y_test`. In English, that means: training data, labels for the training data, test data, and labels for the test data. "X" is shorthand for the input data (the features), and "Y" is shorthand for the labels. In our case, X will be all the columns in our dataframe except `Survived`, and Y will be the `Survived` column. 
2. Once we do that, we simply run the data as is through the sklearn LogisticRegression classifier. 

In [21]:
# Creating our arrays, which are parameters to the LogisticRegression classifier. 

X_train = train_df.drop('Survived', axis=1)
Y_train = train_df['Survived']
X_test = test_df.drop('Survived', axis=1)
Y_test = test_df['Survived']

In [25]:
classifier = LogisticRegression()
classifier.fit(X_train , Y_train)
predictions = classifier.predict(X_test)
print(f"Accuracy: {metrics.accuracy_score(Y_test, predictions)}")
print(f"Precision: {metrics.precision_score(Y_test, predictions)}")
print(f"Recall: {metrics.recall_score(Y_test, predictions)}")
print(f"Confusion Matrix: \n {metrics.confusion_matrix(Y_test, predictions)}")

Accuracy: 0.8588516746411483
Precision: 0.8175675675675675
Recall: 0.7908496732026143
Confusion Matrix: 
 [[238  27]
 [ 32 121]]
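
To see where the three scores come from, they can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the output above and relies on scikit-learn's layout for binary labels: rows are the true classes, columns are the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 238, 27   # true label 0: predicted "died" / predicted "survived"
fn, tp = 32, 121   # true label 1: predicted "died" / predicted "survived"

# Accuracy: share of all passengers that were classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: of those predicted to survive, how many actually did.
precision = tp / (tp + fp)

# Recall: of those who actually survived, how many the model found.
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
```

These reproduce the accuracy, precision, and recall values printed by `sklearn.metrics` above.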




## References

- https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8
- https://www.kaggle.com/mnassrib/titanic-logistic-regression-with-python/data#1.-Import-Data-&-Python-Packages
- https://www.kaggle.com/kernels/svzip/447794
- https://github.com/jstray/lede-algorithms/tree/master/week-3