# Titanic Survivals
## Python Demo: Data Analysis with Machine Learning

<br><br>
This demo illustrates how to solve a typical machine learning task.
Using data about the <a href = "https://en.wikipedia.org/wiki/RMS_Titanic">Titanic</a> passengers, both those who survived and those who did not survive the sinking in 1912, one of the most tragic passenger ship disasters in history, we will build a model that can be used to predict the survival of new, imaginary passengers.
Please note that these are not just numbers, but numbers associated with real people's destinies!


![image.png](attachment:image.png)

We will implement the following procedure:
1. Prepare the environment by importing the major libraries, which contain modules and functions we will need.
2. Get the available data
3. Explore the data to get to know what it contains.
4. Clean the data, so it can be analysed more easily
5. Choose a method for analysis
6. Build an analytical model of the data applying the method above
7. Validate the model with test data 
8. Use the model for prediction of new data

### Import the necessary Python libraries

In [None]:
# import pandas for structuring the data
import pandas as pd

# import numpy for numerical analysis
import numpy as np

# import matplotlib for diagrams inline with the text
import matplotlib.pyplot as plt
%matplotlib inline

# for generating random numbers
import random

# the most important library for machine learning algorithms
from sklearn import datasets, svm, tree, preprocessing, metrics

### Get the data 
In this demo we will use a CSV file with the original data. In other cases the data may come as an Excel file or as plain text.

In [None]:
# read the file into a Pandas data frame
df = pd.read_csv('/users/tdi/Data/TitanicData.csv', index_col=None, na_values=['NA'])

In [None]:
# see the size
df.shape

In [None]:
# see which are the attribute labels
list(df)

In [None]:
# get an overview
df.info()

In [None]:
# see the first five records
df.head()

### Explore the data

In [None]:
# see the types of the attributes
df.dtypes

In [None]:
# get some insights of the value scope
df.describe()

In [None]:
# Non-numeric data is not included in the statistic above, but can be plotted
df['embarked'].value_counts().plot(kind='bar')

In [None]:
# Numeric data can also be plotted 
df['survived'].value_counts().plot(kind='bar')

A strikingly low share of survivors: about 38% <br>
The Titanic was carrying only 20 lifeboats for the 1317 passengers and 885 crew members aboard!
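
Because `survived` is coded as 0 and 1, the share of survivors is simply the mean of that column. A minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for the Titanic file):

```python
import pandas as pd

# hypothetical stand-in for the 'survived' column (1 = survived, 0 = did not)
toy = pd.DataFrame({'survived': [1, 0, 0, 1, 0, 0, 0, 1]})

# the mean of a 0/1 column equals the share of ones
survival_rate = toy['survived'].mean()
print(f"Survival rate: {survival_rate:.1%}")
```

On the real data the same expression would be `df['survived'].mean()`.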

#### Social status

In [None]:
# Did the social class matter?
social = df.groupby('pclass').count()
social

In [None]:
# plot
social['survived'].plot.bar()
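
Note that `count()` reports how many records each class has, not how many of those passengers survived. To compare the classes, one option is the mean of `survived` per group. A minimal sketch on made-up values (the `toy` frame is hypothetical):

```python
import pandas as pd

# hypothetical sample: passenger class and survival flag
toy = pd.DataFrame({
    'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# share of survivors within each class
rate_by_class = toy.groupby('pclass')['survived'].mean()
print(rate_by_class)
```

On the real data this would be `df.groupby('pclass')['survived'].mean()`, which can be plotted with `.plot.bar()` in the same way as above.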

#### Gender

In [None]:
# Did the gender matter?
gender = df.groupby(['survived', 'sex'])['name'].count()
gender

In [None]:
# plot
gender.plot.pie()

In [None]:
# gender by class
gender_by_class = df.groupby(['pclass','sex']).count()
gender_by_class

In [None]:
# plot
gender_by_class['survived'].plot.area()

#### Age

In [None]:
# We need to split it into groups (bins)
ranges = [0,10,20,30,40,50,60,70,80] 
age=df.groupby([(pd.cut(df.age, ranges))])['name'].count()
age
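
With its default settings, `pd.cut` builds intervals that are open on the left and closed on the right, so an age of exactly 10 lands in the first bin and an age of 0 would get no bin at all. A small sketch with made-up ages (the `ages` values are hypothetical):

```python
import pandas as pd

ranges = [0, 10, 20, 30, 40, 50, 60, 70, 80]

# hypothetical ages, including two values on a bin edge
ages = pd.Series([5, 10, 25, 80])

# each age is replaced by the interval it falls into
binned = pd.cut(ages, ranges)
print(binned)

# ages 5 and 10 share the same bin because the right edge is included
same_bin = binned.iloc[0] == binned.iloc[1]
print(same_bin)
```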

In [None]:
# count per age range and gender
age_by_gender=df.groupby([(pd.cut(df.age, ranges)), 'sex']).count()
age_by_gender

In [None]:
# plot
age_by_gender['survived'].plot.bar()

## Prepare the data for analysis

In [None]:
# count the missing values in each attribute
df.isnull().sum()

In [None]:
# fill the missing home destination with 'NA'
df["home.dest"] = df["home.dest"].fillna("NA")
df.head()

In [None]:
# remove the mostly empty columns, which are not very informative
df = df.drop(['cabin','boat', 'body'], axis=1)
df.head()

In [None]:
# replace the missing age with the average age
mean_age = df.age.mean()
df['age'] = df['age'].fillna(mean_age)
df.head()

In [None]:
# see the current number of non-missing values in each column
df.count()

In [None]:
# see the current state of null values
df.isnull().sum()

In [None]:
# replace the missing fare values with the average
df['fare'] = df['fare'].fillna(df.fare.mean())
df.count()

In [None]:
# find the most used 'embarked' value
emb_mode = df.embarked.mode()
emb_mode

In [None]:
# replace the missing embarked values with the mode
# mode() returns a Series, so take its first element as the fill value
df['embarked'] = df['embarked'].fillna(emb_mode.iloc[0])
df.count()
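
`mode()` returns a Series rather than a single value, because several values can tie for the most frequent one. A single fill value therefore has to be picked before it is passed to `fillna`. A minimal sketch on a made-up column (the `ports` values are hypothetical):

```python
import pandas as pd

# hypothetical 'embarked'-like column with two missing entries
ports = pd.Series(['S', 'C', None, 'S', 'Q', None])

# take the first (here: only) most frequent value
most_common = ports.mode().iloc[0]

# fill every missing entry with that single value
filled = ports.fillna(most_common)
print(filled.tolist())
```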

#### Transform data into numeric

As a preprocessing step, we will convert the strings into integer keys, which the algorithms need as input. 
- “Female” and “Male” are categorical values and will be converted to 0 and 1 respectively
- The “name”, “ticket”, and “home.dest” columns consist of non-categorical string values, which are difficult to use in our algorithm, so we will drop them from the data set
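
`LabelEncoder` assigns the integer codes in sorted order of the labels, which is why "female" becomes 0 and "male" becomes 1. A minimal sketch that makes the mapping visible (the input list is made up for illustration):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# hypothetical sample of the 'sex' column
codes = le.fit_transform(['male', 'female', 'female', 'male'])

# classes_ holds the labels in the order of their integer codes
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes)
```

The same inspection works for the `embarked` column after it has been encoded.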

In [None]:
# define a function for transformation
def preprocessor(df):   
    processed_df = df.copy()
    le = preprocessing.LabelEncoder()
    
    # sex {female, male} to {0, 1} (labels are encoded in sorted order)
    processed_df['sex'] = le.fit_transform(df['sex'])
    
    # embarked {C, Q, S} to {0, 1, 2}
    processed_df['embarked'] = le.fit_transform(df['embarked'])
    processed_df = processed_df.drop(['name','ticket','home.dest'], axis=1)
    return processed_df




In [None]:
# call the transformation function
dfp = preprocessor(df)

In [None]:
dfp.head()

In [None]:
# see the current state of null values
dfp.isnull().sum()

In [None]:
# show the records that still have a missing 'parch' value
result = dfp[dfp['parch'].isnull()]
result

In [None]:
# drop the incomplete record found above (index 1309 in this file)
dff = dfp.drop(index=1309)









### Train a model
1. Split the data into input and output
2. Split the data into train and test sets

In [None]:
# Split the data into input and output
X = dff.drop(['survived'], axis=1).values
y = dff['survived'].values

In [None]:
X


In [None]:
y

In [None]:
# Split the data into train and test sets
# 80% of the dataset will be used for training and 20% will be used for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
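
Without further arguments `train_test_split` shuffles differently on every run, so the scores below will vary slightly between runs. A hedged sketch of two optional arguments, `random_state` for a repeatable split and `stratify` to keep the class proportions equal in both parts (the toy arrays and the seed 42 are assumptions for illustration, not part of the original demo):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 10 samples, 2 features, balanced 0/1 labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

# fixed seed for repeatability, stratified to preserve the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```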

In [None]:
# Select a method
from sklearn import tree
dt = tree.DecisionTreeClassifier(max_depth=3)

In [None]:
# Train a model
dt.fit (X_train, y_train)

In [None]:
# Validate the model
dt.score (X_test, y_test)

The resulting value is the model accuracy: the fraction of passengers in the test set whose survival the model predicted correctly. Not bad for a start!
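
For a classifier, `score` returns the same number as counting the matching predictions by hand. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn import metrics

# hypothetical true labels and predictions
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# accuracy by hand: share of positions where prediction equals truth
manual = (y_true == y_pred).mean()

# the same value from scikit-learn
library = metrics.accuracy_score(y_true, y_pred)
print(manual, library)
```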

In [None]:
# Try another method
import sklearn.ensemble as ske
rf = ske.RandomForestClassifier(n_estimators=50)
rf.fit (X_train, y_train)
rf.score (X_test, y_test)

In [None]:
# Try another method
import sklearn.ensemble as ske
gb = ske.GradientBoostingClassifier(n_estimators=50)
gb.fit (X_train, y_train)
gb.score (X_test, y_test)

In [None]:
# Try combining methods
import sklearn.ensemble as ske

In [None]:
eclf = ske.VotingClassifier([('dt', dt), ('rf', rf), ('gb', gb)])
eclf.fit (X_train, y_train)
eclf.score (X_test, y_test)
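
By default `VotingClassifier` uses hard voting: every model predicts a class label and the label chosen by the majority wins. The idea can be sketched by hand for binary labels (the `votes` array is made up and does not come from the models above):

```python
import numpy as np

# hypothetical 0/1 predictions of three models for four passengers
votes = np.array([
    [0, 1, 1, 0],   # model A
    [0, 1, 0, 0],   # model B
    [1, 1, 0, 0],   # model C
])

# a passenger is predicted to survive if more than half of the models say so
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(majority)
```

This shortcut only works for two classes coded as 0 and 1. The classifier itself handles any number of classes.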

### Test the model for prediction
Once the model is trained, we can use it to predict the survival of the passengers in the test data set and compare the predictions with the known outcome of each passenger.

#### Evaluate the performance with the test data

In [None]:
# Test the classifier with the test input data
prediction = dt.predict(X_test)

In [None]:
prediction

In [None]:
# Evaluate classifier performance
from sklearn.metrics import classification_report
class_names = ['Non-survival', 'Survival']
print("\n" + "#"*40)

print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, dt.predict(X_train), target_names=class_names))
print("#"*40 + "\n")

print("\nClassifier performance on test dataset\n")
print(classification_report(y_test, prediction, target_names=class_names))
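
The precision and recall values in the report are derived from the counts of correct and wrong predictions per class. A minimal sketch that computes them from a confusion matrix for the "Survival" class (the labels are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# hypothetical true labels and predictions (1 = survival)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# precision: how many predicted survivors really survived
precision = tp / (tp + fp)
# recall: how many real survivors the model found
recall = tp / (tp + fn)
print(cm)
print(precision, recall)
```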

#### Evaluate the performance with new data

In [None]:
list(X_test)

In [None]:
# Enter a new data set for a person
# the values must follow the column order of the training input,
# assumed here to be: pclass, sex, age, sibsp, parch, fare, embarked
my_set1 = ([[1,0,29.00,0,0,211.3375,0]])
# my_set2 = ([[3, 1, 29.0, 0, 0, 7.875, 0]])

In [None]:
prediction = eclf.predict(my_set1)
prediction

## Reference
https://www.kaggle.com/c/titanic/data <br>
https://blog.socialcops.com/technology/data-science/machine-learning-python/<br>
https://www.youtube.com/watch?v=siEPqQsPLKA<br>