<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/LogReg_Titanic_NB4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Logistic Regression on the Titanic Dataset

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo

**Import libraries**

In [None]:
# pandas
import pandas as pd
from pandas import Series,DataFrame
from sklearn.model_selection import train_test_split 

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
from sklearn import metrics

# machine learning
from sklearn.linear_model import LogisticRegression

**Load the Titanic dataset**<br>
This is the passenger dataset from the Titanic. <br>
It lists the following information: <br>
Survived (0 = no, 1 = yes), Pclass = boarding class (1, 2, 3)<br>
Passenger Name, Sex, Age<br>
SibSp = number of siblings/spouses the passenger has on board<br>
Parch = number of parents/children the passenger has on board<br>
Ticket number, Fare paid, Cabin number<br>
Embarked = where the passenger boarded the ship<br>

In [None]:
titanic_df = pd.read_csv("train.csv")
test_df    = pd.read_csv("test.csv")

# preview the data
titanic_df.head()

**Look at the data**

In [None]:
test_df

Where are values missing in the data? Check both the train and test data<br>
What should be done about the missing data?<br>

In [None]:
titanic_df.info()
print("----------------------------\n\n")
test_df.info()
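
A more compact view than `info()` is a per-column count of missing values. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the Titanic data) so it runs on its own; the same two calls should work on `titanic_df` and `test_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df; the values are made up.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values in each column.
missing_counts = toy_df.isnull().sum()

# Fraction of rows that are missing in each column.
missing_share = toy_df.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column that is mostly missing is a candidate for dropping, while a column with only a few gaps is a candidate for filling.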

In [None]:
titanic_df.columns

**Assignment**<br>
Given the information about the dataset that you have so far, what features are important? Which can we drop? 

# **Learning about the data**

**Plot the data**<br>
Show age, sex, and who survived<br>
0 = not survived<br>
1 = survived

In [None]:
import seaborn as sns
sns.set(style="white")

sns.relplot(x="Age", y="Survived",  size="Sex",
            sizes=(40, 400), alpha=.5, palette="bright",
            height=6, data=titanic_df)

**Drop columns that may not have an impact on survival rates**<br>
The passenger name and passenger ID have no impact on the survival rate. <br>
The ticket number also has no impact on survival rates. 

In [None]:
# drop unnecessary columns, these columns won't be useful in analysis and prediction
# PassengerId is kept in test_df for now and dropped later when X_test is built
titanic_df = titanic_df.drop(['PassengerId','Name','Ticket'], axis=1)
test_df    = test_df.drop(['Name','Ticket'], axis=1)

**The embark column**<br>
There are three ports where passengers boarded: S, C, Q<br>
S = Southampton, England<br>
C = Cherbourg, France<br>
Q = Queenstown, Ireland<br>
<br>
Some of the data is missing from the Embarked column. What should we do about the missing data? 

One idea is to assume the missing values are from Southampton, since this is the largest embarkation group. 

In [None]:
# only in titanic_df, fill the two missing values with the most occurred value, which is "S".
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
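
Rather than hard-coding "S", the most frequent value can be looked up with `mode()`. This is a minimal sketch on a made-up series (`toy_embarked` is hypothetical); it only illustrates the idea of filling with the most common category.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Embarked column, with two missing entries.
toy_embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores missing values and returns the most frequent value(s).
most_common = toy_embarked.mode()[0]

# Fill the gaps with that value and keep the result as a new series.
filled = toy_embarked.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

Filling with the mode makes the largest group even larger, which is worth keeping in mind when comparing survival rates by port.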

When we plot survival against port of embarkation, we see that those from Southampton were least likely to survive. <br>
Since we used Southampton for the missing data, how could this skew the data? 

In [None]:
# plot
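# factorplot was renamed to catplot in newer seaborn versions; kind='point' keeps the point plot
sns.catplot(x='Embarked', y='Survived', data=titanic_df, kind='point', height=4, aspect=3)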

Plotting all three embark locations by the number of passengers that boarded, the number that survived, and the survival rate, we see Southampton had the lowest survival rate. <br>
Why?

In [None]:
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))

# sns.factorplot('Embarked',data=titanic_df,kind='count',order=['S','C','Q'],ax=axis1)
# sns.factorplot('Survived',hue="Embarked",data=titanic_df,kind='count',order=[1,0],ax=axis2)
sns.countplot(x='Embarked', data=titanic_df, ax=axis1)
sns.countplot(x='Survived', hue="Embarked", data=titanic_df, order=[1,0], ax=axis2)

# group by embarked, and get the mean for survived passengers for each value in Embarked
embark_perc = titanic_df[["Embarked", "Survived"]].groupby(['Embarked'],as_index=False).mean()
sns.barplot(x='Embarked', y='Survived', data=embark_perc,order=['S','C','Q'],ax=axis3)

What should we do about the embark column? 
1. Keep the Embarked column in the predictions: create dummy variables, drop the "S" dummy, and keep "C" and "Q", since they show higher survival rates.

2. Don't create dummy variables for the Embarked column; just drop it, because logically the port of embarkation doesn't seem useful for prediction.

In this first instance, the column is dropped. <br>
As an experiment, uncomment the code below and don't drop the column. <br>
How does the change affect the model?

In [None]:

#embark_dummies_titanic  = pd.get_dummies(titanic_df['Embarked'])
#embark_dummies_titanic.drop(['S'], axis=1, inplace=True)

#embark_dummies_test  = pd.get_dummies(test_df['Embarked'])
#embark_dummies_test.drop(['S'], axis=1, inplace=True)

#titanic_df = titanic_df.join(embark_dummies_titanic)
#test_df    = test_df.join(embark_dummies_test)

titanic_df.drop(['Embarked'], axis=1,inplace=True)
test_df.drop(['Embarked'], axis=1,inplace=True)
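
If the Embarked column is kept, one thing to watch is that the train and test frames end up with the same dummy columns. The sketch below, on made-up frames (`toy_train` and `toy_test` are hypothetical), aligns the test dummies to the training columns before dropping the "S" baseline. It is one possible approach, not the notebook's own code.

```python
import pandas as pd

# Hypothetical frames: the test data happens to contain no "Q" passengers.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["C", "S", "S"]})

# One 0/1 column per port, in alphabetical order: C, Q, S.
train_dummies = pd.get_dummies(toy_train["Embarked"], dtype=int)
test_dummies = pd.get_dummies(toy_test["Embarked"], dtype=int)

# Give the test dummies the same columns as the training dummies;
# a port missing from the test data becomes a column of zeros.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

# Drop "S" so it acts as the baseline category.
train_dummies = train_dummies.drop(columns=["S"])
test_dummies = test_dummies.drop(columns=["S"])

print(train_dummies)
print(test_dummies)
```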


**The Fare column**<br>
There is a missing value in the fare column. <br>
What should be done about it?

In this case the missing fare is filled with the median of the fares. <br>
The fares are then truncated to integers, and the histogram groups them into bins. 

In [None]:
# Fare

# only for test_df, since it has a missing "Fare" value
# (assign the result back instead of calling fillna with inplace=True on a column selection)
test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())

# convert from float to int
titanic_df['Fare'] = titanic_df['Fare'].astype(int)
test_df['Fare']    = test_df['Fare'].astype(int)

# get fare for survived & didn't survive passengers 
fare_not_survived = titanic_df["Fare"][titanic_df["Survived"] == 0]
fare_survived     = titanic_df["Fare"][titanic_df["Survived"] == 1]

# get average and std for fare of survived/not survived passengers
avgerage_fare = DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare      = DataFrame([fare_not_survived.std(), fare_survived.std()])

# plot
fig = titanic_df['Fare'].plot(kind='hist', figsize=(15,3),bins=100, xlim=(0,100))
fig.set_xlabel("Binned Fares")

In [None]:
avgerage_fare.index.names = std_fare.index.names = ["Survived"]
fig2 = avgerage_fare.plot(yerr=std_fare,kind='bar',legend=False)
fig2.set_ylabel("Mean fare (error bars: std)")
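
If explicit fare bands are wanted as a model feature, one option is `pd.cut`. The sketch below is only an illustration: the fares, the bin edges, and the labels are all assumptions chosen for the example, not values taken from the dataset analysis.

```python
import pandas as pd

# Hypothetical fares; the bin edges and labels below are arbitrary choices.
toy_fares = pd.Series([7.25, 71.28, 8.05, 26.0, 512.33, 13.0], name="Fare")

bin_edges = [0, 10, 30, 100, 600]
bin_labels = ["low", "medium", "high", "very_high"]

# Each fare is assigned to the interval it falls in (right edge included).
fare_band = pd.cut(toy_fares, bins=bin_edges, labels=bin_labels, include_lowest=True)

print(fare_band.value_counts().sort_index())
```

The resulting categorical column could then be one-hot encoded in the same way as the other categorical features.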

**Age Column**<br>
The age column is missing a lot of values. <br>
What should be done about the missing data?


In [None]:
# get average, std, and number of NaN values in titanic_df
average_age_titanic   = titanic_df["Age"].mean()
std_age_titanic       = titanic_df["Age"].std()
count_nan_age_titanic = titanic_df["Age"].isnull().sum()
print("Ave training age: ", average_age_titanic)
print("Std training age: ",std_age_titanic)
print("Num of missing age training values: ", count_nan_age_titanic)

In [None]:
# get average, std, and number of NaN values in test_df
average_age_test   = test_df["Age"].mean()
std_age_test       = test_df["Age"].std()
count_nan_age_test = test_df["Age"].isnull().sum()
print("Ave test age: ", average_age_test)
print("Std test age: ",std_age_test)
print("Num of missing age test values: ", count_nan_age_test)

In [None]:
# generate random integers between (mean - std) & (mean + std)
# the bounds are cast to int explicitly because randint works with integer limits
rand_1 = np.random.randint(int(average_age_titanic - std_age_titanic), int(average_age_titanic + std_age_titanic), size = count_nan_age_titanic)
rand_2 = np.random.randint(int(average_age_test - std_age_test), int(average_age_test + std_age_test), size = count_nan_age_test)
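
Because the filled ages are random, each run of the notebook produces slightly different data. A seeded generator makes the imputation repeatable. The sketch below uses a made-up age series (`toy_age` is hypothetical) and an arbitrary seed.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
toy_age = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0], name="Age")

# A seeded generator returns the same draws on every run (the seed is arbitrary).
rng = np.random.default_rng(seed=0)

# Integer bounds one standard deviation below and above the mean.
low = int(toy_age.mean() - toy_age.std())
high = int(toy_age.mean() + toy_age.std())
n_missing = int(toy_age.isnull().sum())

# Draw one integer age in [low, high) per missing value.
random_ages = rng.integers(low, high, size=n_missing)

# Write the draws into the missing positions of a copy.
filled_age = toy_age.copy()
filled_age.loc[filled_age.isnull()] = random_ages

print(filled_age)
```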


Plot the ages using bins <br>
The missing values are dropped only for the first plot. In the data itself they are filled with the random ages generated above. 

In [None]:
# Age 
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')

# plot original Age values
# NOTE: drop all null values, and convert to int
titanic_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)
test_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)

# fill NaN values in Age column with random values generated
# (.loc avoids chained assignment, which may not update the DataFrame)
titanic_df.loc[titanic_df["Age"].isnull(), "Age"] = rand_1
test_df.loc[test_df["Age"].isnull(), "Age"] = rand_2

# convert from float to int
titanic_df['Age'] = titanic_df['Age'].astype(int)
test_df['Age']    = test_df['Age'].astype(int)
        
# plot new Age Values
titanic_df['Age'].hist(bins=70, ax=axis2)
# test_df['Age'].hist(bins=70, ax=axis4)

In [None]:
# .... continue with plot Age column

# peaks for survived/not survived passengers by their age
facet = sns.FacetGrid(titanic_df, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',fill=True)
facet.set(xlim=(0, titanic_df['Age'].max()))
facet.add_legend()

In [None]:
# average survived passengers by age
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_age = titanic_df[["Age", "Survived"]].groupby(['Age'],as_index=False).mean()
sns.barplot(x='Age', y='Survived', data=average_age)

**The cabin column**<br>
The cabin column is missing a lot of data. <br>
Drop the cabin column. 

In [None]:
# Cabin
# It has a lot of NaN values, so it won't cause a remarkable impact on prediction
titanic_df.drop("Cabin",axis=1,inplace=True)
test_df.drop("Cabin",axis=1,inplace=True)

**The SibSp and Parch columns**<br>
The family columns have data about who was on board with their spouses, children, parents, siblings. <br>
We can simplify the data by just recording whether or not a passenger was with a family member. <br>
Create a new column called Family.<br>
1 = with a sibling, spouse, parent, or child<br>
0 = alone<br>
Drop the SibSp and Parch columns

In [None]:
# Family

# Instead of having two columns Parch & SibSp, 
# we can have one column that records whether the passenger had any family member aboard,
# to see whether having a family member (parent, sibling, etc.) affects the chances of survival.
# Family = 1 if Parch + SibSp > 0, else 0
titanic_df['Family'] = ((titanic_df["Parch"] + titanic_df["SibSp"]) > 0).astype(int)

test_df['Family'] = ((test_df["Parch"] + test_df["SibSp"]) > 0).astype(int)

# drop Parch & SibSp
titanic_df = titanic_df.drop(['SibSp','Parch'], axis=1)
test_df    = test_df.drop(['SibSp','Parch'], axis=1)

**Plot the new family column**

In [None]:
# plot
fig, (axis1,axis2) = plt.subplots(1,2,sharex=True,figsize=(10,5))

# sns.factorplot('Family',data=titanic_df,kind='count',ax=axis1)
sns.countplot(x='Family', data=titanic_df, order=[1,0], ax=axis1)

# average of survived for those who had/didn't have any family member
family_perc = titanic_df[["Family", "Survived"]].groupby(['Family'],as_index=False).mean()
sns.barplot(x='Family', y='Survived', data=family_perc, order=[1,0], ax=axis2)

axis1.set_xticklabels(["With Family","Alone"], rotation=0)

**The Sex column**<br>
We can guess that the sex of the passenger is important because of the tradition of women and children first in the lifeboats. 

Let's consider all passengers under the age of 16 to be children, and everyone aged 16 or over to be adults. <br>
Let's create a new column called Person. <br>
Person will have three values: <br>
male<br>
female<br>
child<br>
Drop the Sex column

In [None]:
# Sex

# As we see, children (age < 16) aboard seem to have a high chance of survival.
# So, we can classify passengers as male, female, or child
def get_person(passenger):
    age,sex = passenger
    return 'child' if age < 16 else sex
    
titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)
test_df['Person']    = test_df[['Age','Sex']].apply(get_person,axis=1)

# No need to use Sex column since we created Person column
titanic_df.drop(['Sex'],axis=1,inplace=True)
test_df.drop(['Sex'],axis=1,inplace=True)
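
The same Person labels can also be built without a row-by-row `apply`. The sketch below uses `Series.where` on a made-up frame (`toy_people` is hypothetical): the Sex value is kept where the passenger is 16 or older and replaced by "child" otherwise.

```python
import pandas as pd

# Hypothetical passengers covering both sides of the age cut-off.
toy_people = pd.DataFrame({
    "Age": [4, 30, 15, 16, 52],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Keep Sex where Age >= 16; everywhere else use the label "child".
toy_people["Person"] = toy_people["Sex"].where(toy_people["Age"] >= 16, "child")

print(toy_people)
```

Note that a passenger aged exactly 16 is treated as an adult, matching the `age < 16` test in `get_person`.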

Let's create new columns: <br>
Child (1 = child, 0 = adult)<br>
Female (1 = adult female, 0 = otherwise)<br> 
The Male dummy is dropped, so an adult male is a row with 0 in both columns. <br>
Drop the Person column

In [None]:
# create dummy variables for Person column, & 
# drop Male as it has the lowest average of survived passengers
person_dummies_titanic  = pd.get_dummies(titanic_df['Person'])
person_dummies_titanic.columns = ['Child','Female','Male']
person_dummies_titanic.drop(['Male'], axis=1, inplace=True)

person_dummies_test  = pd.get_dummies(test_df['Person'])
person_dummies_test.columns = ['Child','Female','Male']
person_dummies_test.drop(['Male'], axis=1, inplace=True)

titanic_df = titanic_df.join(person_dummies_titanic)
test_df    = test_df.join(person_dummies_test)

fig, (axis1,axis2) = plt.subplots(1,2,figsize=(10,5))

# sns.factorplot('Person',data=titanic_df,kind='count',ax=axis1)
sns.countplot(x='Person', data=titanic_df, ax=axis1)

# average of survived for each Person(male, female, or child)
person_perc = titanic_df[["Person", "Survived"]].groupby(['Person'],as_index=False).mean()
sns.barplot(x='Person', y='Survived', data=person_perc, ax=axis2, order=['male','female','child'])

titanic_df.drop(['Person'],axis=1,inplace=True)
test_df.drop(['Person'],axis=1,inplace=True)

What columns do we have now?

In [None]:
titanic_df.columns

**The Pclass column**<br>
Use one-hot encoding to contain the information about the passenger classes. <br>
We can use three columns or two columns for this information<br>
We will use two columns:<br>
Class_1<br>
Class_2<br>
Third class is the baseline: a passenger with 0 in both columns was in class 3. <br>
Drop the Pclass column

In [None]:
# Pclass

# sns.factorplot('Pclass',data=titanic_df,kind='count',order=[1,2,3])
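# factorplot was renamed to catplot in newer seaborn versions; kind='point' keeps the point plot
sns.catplot(x='Pclass', y='Survived', order=[1,2,3], data=titanic_df, kind='point', height=5)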

# create dummy variables for Pclass column, & drop 3rd class as it has the lowest average of survived passengers
pclass_dummies_titanic  = pd.get_dummies(titanic_df['Pclass'])
pclass_dummies_titanic.columns = ['Class_1','Class_2','Class_3']
pclass_dummies_titanic.drop(['Class_3'], axis=1, inplace=True)

pclass_dummies_test  = pd.get_dummies(test_df['Pclass'])
pclass_dummies_test.columns = ['Class_1','Class_2','Class_3']
pclass_dummies_test.drop(['Class_3'], axis=1, inplace=True)

titanic_df.drop(['Pclass'],axis=1,inplace=True)
test_df.drop(['Pclass'],axis=1,inplace=True)

titanic_df = titanic_df.join(pclass_dummies_titanic)
test_df    = test_df.join(pclass_dummies_test)
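
The encode, rename, drop, and join steps can also be written as a single `get_dummies` call on the whole frame. This is a sketch on a made-up frame (`toy_classes` is hypothetical); the `prefix` argument produces the same `Class_1`, `Class_2`, `Class_3` names used above.

```python
import pandas as pd

# Hypothetical frame with one passenger in class 1, one in class 2, two in class 3.
toy_classes = pd.DataFrame({"Survived": [1, 0, 1, 0], "Pclass": [1, 3, 2, 3]})

# Encode Pclass in place: the original column is replaced by Class_1..Class_3.
encoded = pd.get_dummies(toy_classes, columns=["Pclass"], prefix="Class", dtype=int)

# Drop the third-class dummy so class 3 becomes the baseline.
encoded = encoded.drop(columns=["Class_3"])

print(encoded)
```

Rows for third-class passengers have 0 in both remaining columns, which is how the model tells them apart from the other two classes.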

In [None]:
titanic_df.columns

In [None]:
print(titanic_df.value_counts('Class_1'))
print(titanic_df.value_counts('Class_2'))

In [None]:
# define training and testing sets
X_train = titanic_df.drop("Survived",axis=1)
Y_train = titanic_df["Survived"]
X_test  = test_df.drop("PassengerId",axis=1).copy()


In [None]:
print(X_train)
print(Y_train)

**Create, train, and predict with the model**

In [None]:
X_train2,X_test2,y_train2,y_test2=train_test_split(X_train,Y_train,test_size=0.20,random_state=0)
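
When one class is less common than the other, as with survivors here, a plain random split can leave the two parts with different class ratios. Passing `stratify` to `train_test_split` keeps the ratio roughly equal in both parts. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 12 of class 0 and 8 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# stratify=y_toy asks for a similar class balance in both splits.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

print("train class counts:", np.bincount(y_tr))
print("validation class counts:", np.bincount(y_val))
```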

In [None]:
# Logistic Regression

logreg = LogisticRegression()

logreg.fit(X_train2, y_train2)

# predict on the held-out 20% split
Y_pred = logreg.predict(X_test2)

# accuracy on the training split (not the held-out split)
logreg.score(X_train2, y_train2)
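
The score above is computed on the same rows the model was trained on, so it tends to be optimistic. Comparing it with the score on the held-out rows gives a fairer picture. The sketch below uses synthetic data from `make_classification` rather than the Titanic frames, and `max_iter=1000` is an assumption added to reduce the chance of convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the Titanic dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.20, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_a, y_a)

# Accuracy on the rows used for fitting versus rows the model has not seen.
train_acc = model.score(X_a, y_a)
test_acc = model.score(X_b, y_b)

print(f"train accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
```

In the notebook, the equivalent held-out call would be `logreg.score(X_test2, y_test2)`.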

**The predicted values**

In [None]:
Y_pred

In [None]:
df_check = pd.DataFrame()
df_check["test values"] = y_test2
df_check["predictions"] = Y_pred
df_check.head(20)

**Look at the confusion matrix**

In [None]:
cnf_matrix = metrics.confusion_matrix(y_test2, Y_pred)
cnf_matrix
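
The four cells of the confusion matrix can be turned into accuracy, precision, and recall by hand. The labels below are made up so that the counts are easy to check; for a binary problem scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions (1 = survived, 0 = not survived).
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = metrics.confusion_matrix(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted survivors who survived
recall = tp / (tp + fn)           # share of actual survivors that were found

print(cm)
print(accuracy, precision, recall)  # 0.8 0.75 0.75
```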

**Look at the model coefficients to see how each feature relates to the likelihood of surviving**

In [None]:
# get the logistic regression coefficient estimate for each feature
# (X_train2.columns lists the features in the order the model was fitted on)
coeff_df = DataFrame(X_train2.columns)
coeff_df.columns = ['Features']
coeff_df["Coefficient Estimate"] = pd.Series(logreg.coef_[0])

# preview
coeff_df
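
A logistic regression coefficient is the change in the log-odds of survival for a one-unit increase in the feature, so exponentiating it gives an odds ratio, which is often easier to read. The coefficient values below are made up for illustration and are not results from this model.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, not output from the fitted model.
toy_coeffs = pd.DataFrame({
    "Features": ["Female", "Class_1", "Age"],
    "Coefficient Estimate": [2.0, 1.0, -0.03],
})

# exp(coefficient) is the multiplicative change in the odds per unit increase.
toy_coeffs["Odds Ratio"] = np.exp(toy_coeffs["Coefficient Estimate"])

print(toy_coeffs)
```

An odds ratio above 1 means the feature is associated with higher odds of survival, and a value below 1 with lower odds. The sizes are only comparable across features that share a similar scale.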