# Learning from disaster

# 1. Defining the question

## 1.1. Specifying the Data Analytic question

The RMS Titanic was a luxury steamship that was considered to be unsinkable. It set sail on its maiden voyage from Southampton, England to New York City on April 10, 1912. The ship was carrying more than 2,200 passengers and crew. On the night of April 14, the Titanic struck an iceberg and began to sink. Despite the crew's efforts to keep the ship afloat, it eventually went down in the early hours of April 15. Many of the lifeboats on board were not filled to capacity, and there were not enough for all of the passengers and crew. As a result, more than 1,500 people died in the disaster. The Titanic's sinking was one of the worst peacetime maritime disasters in history and continues to be remembered as a tragic event.

> In our analysis, we want to determine which passengers had a higher chance of surviving, and under what circumstances.

## 1.2. Define the metric for success.

- Build a model with an adjusted R-squared of at least 0.80.
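
As a reminder of what this metric measures, adjusted R-squared penalises plain R-squared for the number of predictors in the model. The sketch below is a minimal helper using the standard formula; the function name and the example values (an R-squared of 0.82 with 5 predictors) are illustrative assumptions, not results from this notebook.

```python
def adjusted_r_squared(r_squared: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("n_samples must be greater than n_predictors + 1")
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical values: R^2 = 0.82 from a model with 5 predictors fitted on 891 rows.
score = adjusted_r_squared(0.82, n_samples=891, n_predictors=5)

# Compare against the 0.80 success threshold stated above.
meets_target = score >= 0.80
```

With many rows and few predictors the adjustment is small, so the adjusted value stays close to the raw R-squared.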

## 1.3. Understanding the context.

## 1.4. Reading the experimental design.

## 1.5. Data relevance 

# 2. Loading the data

In [7]:
# Import the relevant libraries
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
import statsmodels.api as sm
from scipy import stats
from scipy.stats import levene, shapiro
from sklearn import metrics, preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from statsmodels.regression import linear_model

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
%matplotlib inline

In [4]:
#loading the data set
training = pd.read_csv("titanic/train.csv")
training.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# 3. Data Understanding.

In [8]:
#getting information about the data set.
training.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


> The data set was sourced from Kaggle. It has 891 rows and 12 columns:

- PassengerId - Unique identifier for each passenger.
- Survived - Whether the passenger survived (0 = no, 1 = yes).
- Pclass - Ticket class (1 = first, 2 = second, 3 = third).
- Name - Name of the passenger.
- Sex - Sex of the passenger.
- Age - Age of the passenger in years.
- SibSp - Number of siblings or spouses aboard.
- Parch - Number of parents or children aboard.
- Ticket - Ticket number.
- Fare - Passenger fare.
- Cabin - Cabin number.
- Embarked - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).


In [9]:
#checking the number of records our dataset has.
training.shape

(891, 12)

In [11]:
#checking the bottom of the data set.
training.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [14]:
#checking the descriptive statistics.
training.describe(include="all")

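| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Minahan, Miss. Daisy E | male | NaN | NaN | NaN | CA. 2343 | NaN | B96 B98 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.0 | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 223.5 | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 446.0 | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 668.5 | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |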


> The average age of a passenger was about 29.7 years (among the 714 passengers with a recorded age), and the average fare was 32.20.
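
The summary statistics also show a median fare (14.45) well below the mean fare (32.20), which suggests the fares are right-skewed. The sketch below illustrates the same mean-versus-median comparison on the ten fares visible in the `head()` and `tail()` outputs above; it is only a small illustration, not a statistic for the full data set.

```python
import pandas as pd

# The ten fares shown in the head() and tail() outputs (not the full column).
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 13.0, 30.0, 23.45, 30.0, 7.75])

# Compute both measures of central tendency in one call.
summary = fares.agg(["mean", "median"])

# A mean above the median hints at a long right tail of expensive tickets.
mean_exceeds_median = bool(summary["mean"] > summary["median"])
```

On the full data set the same check could be run as `training["Fare"].agg(["mean", "median"])`.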

# 4. External Data Source Validation.

# 5. Tidying the dataset.

In [16]:
#checking whether there are any missing values
training.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

> The columns with missing values are Age, Cabin and Embarked. Since Age is relevant to our analysis, we replace its missing values with the median age.

> The Cabin column has 687 null values, about 77% of the rows. Dropping those rows would discard most of the data set and compromise the integrity of our analysis, so they are kept.
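
The share of missing values per column can be computed directly rather than estimated by eye. The sketch below uses a tiny made-up frame (the `sample` values are assumptions for illustration) to show the idea.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Titanic training data.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields booleans; their column-wise mean is the fraction missing.
missing_pct = sample.isnull().mean().mul(100).round(1)
```

Applied to the real data as `training.isnull().mean().mul(100)`, the counts above (177, 687 and 2 out of 891) correspond to roughly 19.9% for Age, 77.1% for Cabin and 0.2% for Embarked.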

In [22]:
#checking if there are duplicates
training.duplicated().sum()

0

> There are no duplicates in the data set.

In [28]:
#getting the age median
training["Age"].median()

28.0

In [29]:
#replacing null values in the Age column with the median age
training["Age"] = training["Age"].fillna(training["Age"].median())

In [27]:
#confirming whether the null values have been replaced
training["Age"].isnull().sum()

0
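
Embarked still has two missing values at this point. One common option, which is an assumption here and not a step taken in the original notebook, is to fill them with the most frequent port. The sketch below shows the idea on a small made-up column.

```python
import numpy as np
import pandas as pd

# Made-up column standing in for training["Embarked"].
sample = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores NaN by default and returns the most frequent value(s).
most_common_port = sample["Embarked"].mode()[0]

# Assign the result back instead of relying on inplace=True.
sample["Embarked"] = sample["Embarked"].fillna(most_common_port)
remaining_nulls = int(sample["Embarked"].isnull().sum())
```

With only two rows affected out of 891, dropping those rows would be an equally defensible choice.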

# 6. Exploratory Analysis.

## Univariate analysis

# 7. Implementing the solution.

# 8. Challenging the solution.

# 9. Follow up questions.