## Titanic: Machine Learning from Disaster

**Objective of this competition: use machine learning to build a model that predicts which passengers survived the Titanic shipwreck.**

Predicting which passengers survived means predicting a discrete outcome: yes or no, 1 or 0. This makes it a binary classification problem.

In [2]:
# Importing the relevant modules
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning algorithms for classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

**Load the dataset**

In [3]:
# training data
titanicTrainingData = pd.read_csv('train.csv')
titanicTrainingData.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [4]:
# test data
titanicTestData = pd.read_csv('test.csv')
titanicTestData.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |


In [5]:
# keep train and test together in a list so the same steps can be applied to both
combinedDataset = [titanicTrainingData, titanicTestData]
combinedDataset

[     PassengerId  Survived  Pclass  \
 0              1         0       3   
 1              2         1       1   
 2              3         1       3   
 3              4         1       1   
 4              5         0       3   
 ..           ...       ...     ...   
 886          887         0       2   
 887          888         1       1   
 888          889         0       3   
 889          890         1       1   
 890          891         0       3   
 
                                                   Name     Sex   Age  SibSp  \
 0                              Braund, Mr. Owen Harris    male  22.0      1   
 1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
 2                               Heikkinen, Miss. Laina  female  26.0      0   
 3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
 4                             Allen, Mr. William Henry    male  35.0      0   
 ..                                               
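Because `combinedDataset` is a plain Python list of two DataFrames rather than one merged table, the usual way to use it is to loop over it and apply the same transformation to each frame. The sketch below shows this on two toy frames built from rows shown above; `train_sample`, `test_sample` and the `IsFemale` column are hypothetical names used only for illustration. It also shows `pd.concat` as an alternative, in case a single stacked table is preferred.

```python
import pandas as pd

# Toy stand-ins for titanicTrainingData and titanicTestData (hypothetical names).
train_sample = pd.DataFrame(
    {"PassengerId": [1, 2], "Survived": [0, 1], "Sex": ["male", "female"]}
)
test_sample = pd.DataFrame({"PassengerId": [892, 893], "Sex": ["male", "female"]})

# Same structure as combinedDataset: a list holding both frames.
combined = [train_sample, test_sample]

# Apply one transformation to both frames; the list holds references,
# so the original frames are modified in place.
for frame in combined:
    frame["IsFemale"] = (frame["Sex"] == "female").astype(int)

# Alternative: stack both frames into one table, labelled by origin.
# Columns missing from the test frame (Survived) are filled with NaN.
stacked = pd.concat(combined, keys=["train", "test"])
print(stacked)
```

Keeping the frames in a list avoids leaking the missing `Survived` column into the test rows, at the cost of writing a loop for each step.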

**Features available in the dataset**

In [6]:
print(titanicTrainingData.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


**Which features are categorical and which are numerical?**

In [7]:
titanicTrainingData.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
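The dtypes above describe how each column is stored, which is only a first hint about whether it is categorical or numerical. A minimal sketch of splitting columns by storage type with `select_dtypes` follows; the `sample` frame is a toy subset built from the rows shown earlier, not the full dataset.

```python
import pandas as pd

# Toy subset of the training data (first three rows, selected columns).
sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 3],
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Th...",
            "Heikkinen, Miss. Laina",
        ],
        "Sex": ["male", "female", "female"],
        "Age": [22.0, 38.0, 26.0],
        "Fare": [7.25, 71.2833, 7.925],
    }
)

# Columns stored as numbers (int64 / float64).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (text columns here).
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("stored as numbers:", numeric_cols)
print("stored as text:", text_cols)
```

Note that `Survived` and `Pclass` land in the numeric group even though they are categorical in meaning, so the storage type cannot settle the question on its own.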

**What are categorical variables?**

*A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.*

**Which features are categorical?**

*Pclass (ordinal), Survived, Sex and Embarked*

In [15]:
titanicTrainingData['Pclass'].head(4)

0    3
1    1
2    3
3    1
Name: Pclass, dtype: int64

In [11]:
titanicTrainingData['Survived'].head(4)

0    0
1    1
2    1
3    1
Name: Survived, dtype: int64

In [13]:
titanicTrainingData['Sex'].head(4)

0      male
1    female
2    female
3    female
Name: Sex, dtype: object

In [14]:
titanicTrainingData['Embarked'].head(4)

0    S
1    C
2    S
3    S
Name: Embarked, dtype: object
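Looking at the first four values only hints at the categories. One possible way to check that a column really takes a small, fixed set of values is to count its distinct values, as sketched below on the five rows previewed earlier. The `sample` frame and the ordered `Categorical` encoding of `Pclass` are illustrative assumptions, not steps from the original notebook.

```python
import pandas as pd

# The four categorical features, taken from the first five training rows.
sample = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1, 3],
        "Survived": [0, 1, 1, 1, 0],
        "Sex": ["male", "female", "female", "female", "male"],
        "Embarked": ["S", "C", "S", "S", "S"],
    }
)

# Number of distinct values per column; small counts suggest categories.
distinct_counts = sample.nunique()
print(distinct_counts)

# Frequency of each category for one feature.
sex_counts = sample["Sex"].value_counts()
print(sex_counts)

# Pclass is ordinal: the three ticket classes have a natural order.
pclass_ordered = pd.Categorical(sample["Pclass"], categories=[1, 2, 3], ordered=True)
print(pclass_ordered)
```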

**What are numerical variables?**

*A numerical variable is a variable whose values are measurements or counts, so the numbers themselves carry numerical meaning.*

**Which features are numerical?**

*Continuous: Age, Fare. Discrete: SibSp, Parch*

In [19]:
titanicTrainingData['Age'].head(4)

0    22.0
1    38.0
2    26.0
3    35.0
Name: Age, dtype: float64

In [20]:
titanicTrainingData['Fare'].head(4)

0     7.2500
1    71.2833
2     7.9250
3    53.1000
Name: Fare, dtype: float64

In [21]:
titanicTrainingData['SibSp'].head(4)

0    1
1    1
2    0
3    1
Name: SibSp, dtype: int64

In [22]:
titanicTrainingData['Parch'].head(4)

0    0
1    0
2    0
3    0
Name: Parch, dtype: int64
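The continuous versus discrete split can be approximated from the dtypes shown earlier: `Age` and `Fare` are stored as floats, while `SibSp` and `Parch` are integer counts. The sketch below applies that heuristic to the five rows previewed above and prints summary statistics. The `sample` frame is a toy subset, and using the dtype as a proxy for discreteness is an assumption that would need checking on other datasets.

```python
import pandas as pd

# Numerical features from the first five training rows.
sample = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
        "SibSp": [1, 1, 0, 1, 0],
        "Parch": [0, 0, 0, 0, 0],
    }
)

# Heuristic: integer columns are treated as discrete counts,
# float columns as continuous measurements.
discrete = [c for c in sample.columns if pd.api.types.is_integer_dtype(sample[c])]
continuous = [c for c in sample.columns if pd.api.types.is_float_dtype(sample[c])]
print("discrete:", discrete)
print("continuous:", continuous)

# Count, mean, spread and quartiles for every numerical column.
summary = sample.describe()
print(summary)
```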

In [23]:
# preview the training dataset
titanicTrainingData.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [24]:
titanicTrainingData.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | | 1 | 2 | W./C. 6607 | 23.45 | | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | | Q |


**Which features are mixed data types?**

*Ticket contains a mix of purely numeric and alphanumeric values*

*Cabin is alphanumeric*
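The mix in `Ticket` can be made visible by flagging which values consist of digits only. The following is a minimal sketch using the five ticket values from the preview above; the `tickets` Series and the `numeric` / `alphanumeric` labels are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Ticket values from the first five training rows.
tickets = pd.Series(
    ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    name="Ticket",
)

# True where the ticket is made of digits only, False where it has a prefix.
is_numeric = tickets.str.isdigit()

# Summarise how many tickets fall into each group.
kinds = is_numeric.map({True: "numeric", False: "alphanumeric"}).value_counts()
print(kinds)
```

On the full dataset, missing values would need handling first, since `str.isdigit()` returns NaN for them.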


**Which features may contain errors or typos?**

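*The Name column may contain errors or typos, because names are recorded in several ways: a surname followed by a comma, a title, given names, and sometimes an alternative name in parentheses or a nickname in quotation marks.*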
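One way to inspect how varied the `Name` format is would be to pull out the title and flag names that carry parentheses or quotation marks. The sketch below does this for five names taken from the previews above. The regular expression assumes the title sits between the first comma and the following period, which holds for these rows but is not guaranteed for every record.

```python
import pandas as pd

# Names taken from the head and tail previews of the training data.
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        'Johnston, Miss. Catherine Helen "Carrie"',
        "Montvila, Rev. Juozas",
    ],
    name="Name",
)

# Title: the text between the comma and the first period after it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

# Flags for the two irregular patterns seen in the data.
has_parentheses = names.str.contains("(", regex=False)
has_quotes = names.str.contains('"', regex=False)

print(pd.DataFrame({"title": titles, "parens": has_parentheses, "quotes": has_quotes}))
```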