# Titanic Survival Machine Learning Project

### This project was made to learn about machine learning. I already had some experience with logistic regression, but through this project I wanted to explore binary classification and practice data analysis in Python.

### The goal of this project is to predict whether a passenger on the Titanic survived, based on several features. The project is assessed on accuracy, which in binary classification is the number of true positives plus true negatives divided by the total number of observations.
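As a minimal sketch of the accuracy definition above, the function below computes it from the four confusion-matrix counts. The function name `accuracy` and the example counts are illustrative assumptions, not values from this project.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one observation is required")
    return (tp + tn) / total


# Hypothetical counts, chosen only to illustrate the arithmetic:
# 80 correct predictions out of 100 observations.
print(accuracy(tp=50, tn=30, fp=10, fn=10))
```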

Let's import the libraries we're going to use.

In [96]:
import pandas
import numpy
import scipy
import sklearn as scikit
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

Let's import our datasets.

In [97]:
test = pandas.read_csv('test.csv')
train = pandas.read_csv('train.csv')

Let's look at some information about our data.

In [98]:
train.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


The first rows of the data already show several 'NaN' entries. These need to be cleaned out before analysis, so let's look at where our 'NaN' values are.

In [99]:
null_columns=train.columns[train.isnull().any()]
train[null_columns].isnull().sum()

Age         177
Cabin       687
Embarked      2
dtype: int64
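To judge how severe each gap is, the counts above can be expressed as a fraction of all rows. The sketch below uses a tiny made-up frame named `toy` as a stand-in for `train`; `isnull().mean()` gives the share of missing values per column.

```python
import numpy
import pandas

# Toy stand-in for `train`; the values are invented for illustration.
toy = pandas.DataFrame({
    "Age": [22.0, numpy.nan, 26.0, 35.0],
    "Cabin": [numpy.nan, "C85", numpy.nan, numpy.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the fraction of rows where the column is missing.
missing_share = toy.isnull().mean()
print(missing_share)
```

Applied to the real training set, the same call would report each column's missing share directly instead of a raw count.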

A substantial amount of data is missing in the 'Cabin' column: about 77% of the rows (687 of 891) have no 'Cabin' value. Let's drop the 'Cabin' column, since it may be related to 'Pclass' anyway. After that, let's remove the rows where 'Age' or 'Embarked' is missing and look at our data again.

In [100]:
train = train.drop(columns=['Cabin'])

In [101]:
train = train.dropna()

In [104]:
train.columns[train.isnull().any()]

Index([], dtype='object')
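Called without arguments, `dropna()` removes every row that has a missing value in any column, which is why 'Cabin' had to be dropped first. A slightly more explicit variant, sketched here on invented data, names the columns whose gaps should cause a row to be removed. The frame `toy` and the name `cleaned` are illustrative only.

```python
import numpy
import pandas

# Toy stand-in for `train`; the values are invented for illustration.
toy = pandas.DataFrame({
    "Age": [22.0, numpy.nan, 26.0],
    "Cabin": [numpy.nan, "C85", numpy.nan],
    "Embarked": ["S", "C", numpy.nan],
})

cleaned = (
    toy.drop(columns=["Cabin"])                # remove the sparse column
       .dropna(subset=["Age", "Embarked"])     # drop rows missing either value
)
print(cleaned)
```

Naming the columns in `subset` keeps the row filter from silently widening if another column with gaps is added later.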

We can see there are no more 'NaN'/missing values in our data, so let's look at some summary statistics.

In [105]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 |
| mean | 448.589888 | 0.404494 | 2.240169 | 29.642093 | 0.514045 | 0.432584 | 34.567251 |
| std | 258.683191 | 0.491139 | 0.836854 | 14.492933 | 0.930692 | 0.854181 | 52.938648 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 222.75 | 0.0 | 1.0 | 20.0 | 0.0 | 0.0 | 8.05 |
| 50% | 445.0 | 0.0 | 2.0 | 28.0 | 0.0 | 0.0 | 15.64585 |
| 75% | 677.25 | 1.0 | 3.0 | 38.0 | 1.0 | 1.0 | 33.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 5.0 | 6.0 | 512.3292 |
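Because 'Survived' is coded as 0/1, its mean in the summary above (about 0.404) is the survival rate among the 712 remaining passengers. That gives a reference point for the accuracy metric: a model that predicts "did not survive" for everyone would already be right about 59.6% of the time. A quick check of that arithmetic, with illustrative variable names:

```python
# Mean of 'Survived' taken from the train.describe() output.
survival_rate = 0.404494

# Always predicting the majority class is correct exactly as often
# as that class occurs.
baseline_accuracy = max(survival_rate, 1 - survival_rate)
print(round(baseline_accuracy, 3))
```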


I'm curious to look within each category and see how survival breaks down. Let's take a look at those tables.

In [115]:
survived_class = pandas.crosstab(index=train["Pclass"],
                                 columns=train["Survived"])

survived_class.columns = ["Didn't Survive", "Survived"]
survived_class.index = ["1", "2", "3"]

survived_class

| | Didn't Survive | Survived |
|---|---|---|
| 1 | 64 | 120 |
| 2 | 90 | 83 |
| 3 | 270 | 85 |


In [116]:
survived_sex = pandas.crosstab(index=train["Sex"],
                               columns=train["Survived"])

survived_sex.columns = ["Didn't Survive", "Survived"]
survived_sex.index = ["Female", "Male"]

survived_sex

| | Didn't Survive | Survived |
|---|---|---|
| Female | 64 | 195 |
| Male | 360 | 93 |


In [147]:
survived_embarked = pandas.crosstab(index=train["Embarked"],
                                    columns=train["Survived"])

survived_embarked.columns = ["Didn't Survive", "Survived"]
survived_embarked.index = ["C", "Q", "S"]

survived_embarked

| | Didn't Survive | Survived |
|---|---|---|
| C | 51 | 79 |
| Q | 20 | 8 |
| S | 353 | 201 |
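The tables above show raw counts. To compare groups of different sizes, it helps to convert each row to the share of passengers who survived. The sketch below rebuilds the three tables from the counts shown above and divides the 'Survived' column by each row total. The helper name `survival_share` and the `by_*` variable names are illustrative.

```python
import pandas


def survival_share(counts: pandas.DataFrame) -> pandas.Series:
    """Fraction of each row's passengers that fall in the 'Survived' column."""
    return counts["Survived"] / counts.sum(axis=1)


# Counts copied from the three tables above.
by_class = pandas.DataFrame(
    {"Didn't Survive": [64, 90, 270], "Survived": [120, 83, 85]},
    index=["1", "2", "3"],
)
by_sex = pandas.DataFrame(
    {"Didn't Survive": [64, 360], "Survived": [195, 93]},
    index=["Female", "Male"],
)
by_embarked = pandas.DataFrame(
    {"Didn't Survive": [51, 20, 353], "Survived": [79, 8, 201]},
    index=["C", "Q", "S"],
)

for name, table in [("Pclass", by_class), ("Sex", by_sex), ("Embarked", by_embarked)]:
    print(name)
    print(survival_share(table).round(3))
```

From these counts, roughly 65% of first-class passengers survived versus about 24% of third-class passengers, and about 75% of women versus about 21% of men. On the raw data, passing `normalize="index"` to `pandas.crosstab` should give the same row-wise proportions without the manual division.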
