*Antoine de Marassé*
## Titanic dataset discovery

RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of April 15, 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history's deadliest peacetime commercial marine disasters. 
    
<img align="left" style="padding-right:10px; width: 400px" src="https://media-exp1.licdn.com/dms/image/C4D22AQE50TOapAkZ8w/feedshare-shrink_2048_1536/0?e=1585785600&v=beta&t=6iz2C_g_QqjbZNz9OJn8u_dH9AxwHfacDIpCmPAYJWg">

After leaving Southampton on 10 April 1912, Titanic called at Cherbourg in France and Queenstown (now Cobh) in Ireland, before heading west to New York. On 14 April, four days into the crossing and about 375 miles (600 km) south of Newfoundland, she hit an iceberg at 11:40 p.m. ship's time. The collision caused the hull plates to buckle inwards along her starboard (right) side and opened five of her sixteen watertight compartments to the sea; she could only survive the flooding of four. Meanwhile, passengers and some crew members were evacuated in lifeboats, many of which were launched only partially loaded. A disproportionate number of men were left aboard because of a "women and children first" protocol for loading lifeboats. At 2:20 a.m., she broke apart and foundered with well over one thousand people still aboard. Just under two hours after Titanic sank, the Cunard liner RMS Carpathia arrived and brought aboard an estimated 705 survivors.

Install the pandas package: ```conda install pandas```

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the dataset file into "titanic" object
titanic = pd.read_csv("Titanic-1309-rows-biostatvanderbilt.csv")

# Let's have a look at the dataset attributes
titanic.head()

In [None]:
# Show the last rows (the index runs from 0 to 1308)
titanic.tail()

In [None]:
# Let's check the dataset info: variable types and non-null counts
# (pclass, survived, name, sex, sibsp, parch, ticket are complete)
# (embarked is missing one value, age is missing for 263 passengers)
titanic.info()
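
`info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. As a minimal sketch, the missing counts can also be read directly with `isna().sum()`. The toy frame below uses made-up values (only the column names mirror the dataset), so the counts it prints are not those of the real file.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "age": [29.0, None, 2.0, None],
    "embarked": ["S", "C", None, "S"],
    "fare": [211.3, 151.6, 151.6, 7.9],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing = toy.isna().sum()
print(missing)
```

The same call on the real DataFrame would be `titanic.isna().sum()`.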

In [None]:
# Let's get a few summary statistics (notably the mean of survived: about 38.2% of passengers survived)
titanic.describe()

The dataset has 1309 entries, each describing the survival status of an individual passenger on the Titanic. Ten of its variables are described here (the file also contains `name`, `boat`, `body`, and `home.dest`, which appear in the code below):
- `survived`: 0 = No, 1 = Yes. **(The `survived` mean in the table above shows that about 38.2% of passengers survived.)**
- `pclass`: Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
- Demographics: `sex`, `age`
- `sibsp`, `parch`: Number of siblings or spouses aboard, number of parents or children aboard
- `ticket`: Passenger ticket number
- `fare`: Passenger fare
- `cabin`: Cabin number
- `embarked`: Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
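
The survival rate can be read from the `survived` mean because the column is coded 0/1: the mean of such a column is the fraction of ones. A minimal sketch with hypothetical outcomes for five passengers (not taken from the dataset):

```python
import pandas as pd

# Hypothetical outcomes for five passengers (1 = survived, 0 = did not).
survived = pd.Series([1, 0, 0, 1, 0], name="survived")

# The mean of a 0/1 column is the number of ones divided by the row count.
rate = survived.mean()
print(f"{rate:.1%}")
```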

## Checking and preparing the data

Drop a few columns and check the data values.

In [None]:
titanic.drop(['name','body','boat','cabin','ticket'],axis=1,inplace=True)
titanic.head()

In [None]:
# add an id column (to replace name)
titanic.insert(0, 'id', range(1, 1 + len(titanic)))

In [None]:
# re-order the columns
titanic = titanic[['id','sex','age','sibsp','parch','fare','pclass','embarked','home.dest','survived']]
# print the dataframe
titanic.head()

## Visual Analysis with Seaborn

Install Seaborn, a statistical data visualization library based on matplotlib: ```conda install seaborn```

In [None]:
# Use Seaborn to visualize the survival of male and female passengers per class in grouped barplots
import seaborn as sns
sns.set(style="whitegrid")

# Draw a nested barplot to show survival for class and sex
#g = sns.catplot(x="pclass", y="survived", hue="sex", data=titanic, height=5, kind="bar", palette="muted")
graph = sns.catplot(x="sex", y="survived", hue="pclass", kind="bar", palette="muted", data=titanic)
graph.set_ylabels("survival probability")

In [None]:
# Copy the DataFrame for some cleaning and analysis
# (.copy() keeps the drop below from also modifying `titanic`)
exploratory = titanic.copy()
exploratory.drop(['id','home.dest','embarked','sibsp'],axis=1,inplace=True)
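
In pandas, assigning a DataFrame to a new name does not duplicate it: both names refer to the same object, so an in-place operation through one name is visible through the other. An independent frame requires `.copy()`. The sketch below illustrates the difference on a small made-up frame; the names `original`, `alias`, and `independent` are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"id": [1, 2], "age": [29.0, 2.0]})

alias = original               # same object under a second name
independent = original.copy()  # separate frame with its own data

# An in-place drop through the alias also changes `original`.
alias.drop(columns=["id"], inplace=True)

print(alias is original)          # whether both names point at one object
print(list(original.columns))     # columns left in the original
print(list(independent.columns))  # columns left in the copy
```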

In [None]:
# Convert the sex variable into an integer flag (1 = male, 0 = female)
exploratory['sex_is_male'] = (exploratory['sex'] == 'male').astype(int)
exploratory.drop(['sex'],axis=1,inplace=True)

# re-order the columns
exploratory = exploratory[['sex_is_male','age','parch','fare','pclass','survived']]

exploratory.head()

In [None]:
# Draw scatterplots (with regression fits) for pairwise relationships and univariate distributions on the diagonal;
# hue colors the plot elements by the levels of the categorical variable `survived`
sns.set(style="darkgrid")  
sns.pairplot(exploratory, dropna=True, hue="survived", corner=True, kind="reg")

## Split the training and test data

Install scikit-learn (Machine Learning in Python): ```conda install -c intel scikit-learn```
- Simple and efficient tools for predictive data analysis
- Built on NumPy, SciPy, and matplotlib

Link: https://scikit-learn.org/stable/
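
As a minimal sketch of what a train/test split looks like, the example below calls `train_test_split` on a small made-up frame. The feature choice, the 80/20 ratio, `random_state=0`, and the use of `stratify` are assumptions for illustration, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    "fare": [7.3, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1, 16.7, 26.6],
    "survived": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = toy[["age", "fare"]]  # features
y = toy["survived"]       # target

# Hold out 20% of the rows; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible between runs.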

In [None]:
from sklearn.model_selection import train_test_split