# Titanic Data

### Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew (32% survival rate). This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others. What sorts of people were more likely to survive?

### Data Dictionary

Survived: 0 = No, 1 = Yes

Pclass (Passenger Class): 1 = 1st, 2 = 2nd, 3 = 3rd

Sex: gender

Age: Age in years 

SibSp: # of siblings / spouses traveling with an individual aboard the Titanic

Parch: # of parents / children traveling with an individual aboard the Titanic 

Ticket: ticket number

Fare: Passenger fare 

Cabin: Cabin number 

Embarked (Port of Embarkation): C = Cherbourg, Q = Queenstown, S = Southampton


**Variable Notes**

Pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

SibSp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children traveled only with a nanny, therefore Parch=0 for them.
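The Age note above suggests two simple checks: ages below 1 are infants recorded as a fraction of a year, and ages of 1 or more that end in .5 are estimates. A minimal sketch of those checks, using a small made-up series (`demo_age` is hypothetical and not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy ages, not taken from titanic.csv
demo_age = pd.Series([0.42, 0.75, 22.0, 28.5, 40.5, None, 61.0], name="Age")

# Infants: age recorded as a fraction of a year
is_infant = demo_age < 1

# Estimated ages: at least 1 year old and ending in .5
is_estimated = (demo_age >= 1) & ((demo_age % 1) == 0.5)

print(pd.DataFrame({"Age": demo_age, "infant": is_infant, "estimated": is_estimated}))
```

Missing ages compare as False in both checks, so they are flagged as neither infant nor estimated.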

In [None]:
import numpy as np  # used later for random Age imputation
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("titanic.csv")
df.head()

# Inspect the features
Note which are numerical and which are categorical.

Check for missing values. Which features can be dropped?
Which features may we want to complete/impute?
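One way to answer these questions is to count missing values per column and to split the columns by dtype. The sketch below runs on a small made-up frame (`demo` is hypothetical; the same calls should work on `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few Titanic columns
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and share of missing values per column
missing = demo.isnull().sum()
missing_share = demo.isnull().mean()
summary = pd.DataFrame({"n_missing": missing, "share": missing_share})
print(summary.sort_values("share", ascending=False))

# Split columns into numerical and categorical by dtype
numerical = demo.select_dtypes(include="number").columns.tolist()
categorical = demo.select_dtypes(exclude="number").columns.tolist()
print(numerical, categorical)
```

A column that is mostly missing is a candidate for dropping, while a column with only a few gaps is usually easier to impute.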

### Distribution of numerical features

In [None]:
df.describe()

In [None]:
df.info()

### Distribution of categorical features

In [None]:
df.describe(include=["object"])

# Exploratory Data Analysis

## Univariate (single variable)

### Bar Chart

In [None]:
fig, ax = plt.subplots()

survived = df.loc[df["Survived"]==1, "Survived"].count()
not_survived = df.loc[df["Survived"]==0, "Survived"].count()

plt.bar([0,1], [survived, not_survived], align='center', alpha=0.5)
plt.xticks([0,1], ['Yes', 'No'])

plt.title('Survived')

plt.show()

### Histogram

In [None]:
import warnings
warnings.filterwarnings('ignore')

fig, ax = plt.subplots()
# drop missing ages explicitly rather than relying on hist to skip NaNs
ax.hist(df['Age'].dropna(), color='red', alpha=.3, edgecolor='black', bins=10)
ax.set(xlabel="Age")

plt.show()

## Bivariate (joint distributions)

### Histogram

In [None]:
fig, ax = plt.subplots(1,2, figsize=(12,5))
ax[0].hist(df.loc[(df.Age.notnull()) & (df["Survived"] == 0),'Age'],  color='red', alpha=.3, edgecolor='black', bins=20)
ax[0].set(xlabel="Age", title="Not Survived", ylim=(0,60))
ax[1].hist(df.loc[(df.Age.notnull()) & (df["Survived"] == 1),'Age'],  color='green', alpha=.3, edgecolor='black', bins=20)
ax[1].set(xlabel="Age", title="Survived", ylim=(0,60))
plt.show()

### Density Plot

In [None]:
import warnings
warnings.filterwarnings('ignore')

# sns.distplot is deprecated; kdeplot with fill=True draws the same shaded density
sns.kdeplot(df.loc[(df.Age.notnull()) & (df["Survived"] == 0), 'Age'],
            fill=True, linewidth=3, label="Not Survived")
sns.kdeplot(df.loc[(df.Age.notnull()) & (df["Survived"] == 1), 'Age'],
            fill=True, linewidth=3, label="Survived")
plt.legend()
plt.show()

### Box Plot

In [None]:
fig, ax = plt.subplots()
not_survived = df.loc[(df.Age.notnull()) & (df["Survived"] == 0),'Age']
survived = df.loc[(df.Age.notnull()) & (df["Survived"] == 1),'Age']
ax.boxplot([not_survived, survived])
ax.set(xticklabels=['Not Survived', 'Survived'], title="Age")
plt.show()

### Scatterplot

In [None]:
fig, ax = plt.subplots()
ax.scatter(df['Age'], df['Fare'], alpha=0.3,
            s=200, c=df['Survived'], cmap='viridis')
ax.set(xlabel="Age", ylabel="Fare", ylim=(0,350))
plt.show()

### Pairplot

In [None]:
import warnings
warnings.filterwarnings('ignore')

pair_plot = sns.pairplot(df, hue='Survived')

# Analyze by grouping (pivoting) features

### Explore relationships between categorical features

In [None]:
df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)


In [None]:
df[['Sex', 'Survived']].groupby(['Sex']).mean().sort_values(by='Survived', ascending=False)

In [None]:
df[['Parch', 'Survived']].groupby(['Parch']).mean().sort_values(by='Survived', ascending=False)

In [None]:
df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=True).mean().sort_values(by='Survived', ascending=False)
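The single-feature groupings above can be combined into a two-way table. As a rough sketch, `pivot_table` gives the survival rate for each Sex and Pclass combination; the frame `demo` below is made up for illustration and is not the Titanic data:

```python
import pandas as pd

# Hypothetical toy rows, not taken from titanic.csv
demo = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# Mean of the 0/1 Survived column = survival rate per (Sex, Pclass) cell
rates = demo.pivot_table(index="Sex", columns="Pclass",
                         values="Survived", aggfunc="mean")
print(rates)
```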


## Multivariate

### Visualize relationships between multiple categorical features

In [None]:
import warnings
warnings.filterwarnings('ignore')

grid = sns.FacetGrid(df, col='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order = [1,2,3], hue_order=["male", "female"], palette='deep')
grid.add_legend()

#### Data Analysis: 
#### It appears that women, children, upper-class passengers, and those traveling with one or two other people had the best chances of surviving the Titanic tragedy.

# Data cleaning and transformation

### Impute missing values (Embarked)

In [None]:
df.Embarked.value_counts()

In [None]:
most_common_port = df.Embarked.value_counts().idxmax()
most_common_port

In [None]:
df['Embarked'] = df['Embarked'].fillna(most_common_port)
    
df[['Embarked', 'Survived']].groupby(['Embarked']).mean().sort_values(by='Survived', ascending=False)

### One-hot encoding "Embarked"

In [None]:
df.head()

In [None]:
# use pandas to one-hot encode "Embarked"

# DEFAULTS:
    # prefix_sep='_' 
    # columns=None   ... will encode all columns with categorical variables
    # drop_first=False
# returns a DataFrame

df = pd.get_dummies(df, columns=["Embarked"])
df.head()

### Convert "Sex" to a binary value

In [None]:
# Converting a categorical feature to a binary one
df["Sex"] = df["Sex"].map({'male':0, 'female':1})

### Impute missing values (Age)

Note: There is a relationship among Age, Sex, and Pclass.

Perhaps use the mean Age within each Pclass and Sex combination to impute missing Ages. Alternatively, use a random value within 1 standard deviation of the group's mean Age, or the group's median Age.
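The group-mean option can be sketched with `groupby(...).transform("mean")`, which returns each row's group mean aligned to the original index. The frame `demo` and the column `Age_imputed` are hypothetical names used only for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; Sex is already coded 0 = male, 1 = female
demo = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean Age of each (Sex, Pclass) group, repeated for every row of the group
group_mean = demo.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# Fill only the missing ages; observed ages are left untouched
demo["Age_imputed"] = demo["Age"].fillna(group_mean)
print(demo)
```

Swapping `"mean"` for `"median"` gives the median variant.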

### Imputing with a randomly selected age within 1 standard deviation of its group's mean

In [None]:
# Males (coded as Sex=0) in First Class (coded as Pclass=1)
Age01_mean = df.loc[(df['Sex']==0) & (df['Pclass']==1), 'Age'].mean()
Age01_std = df.loc[(df['Sex']==0) & (df['Pclass']==1), 'Age'].std()
Age01_mean, Age01_std

In [None]:
# Males in Second Class
Age02_mean = df.loc[(df['Sex']==0) & (df['Pclass']==2), 'Age'].mean()
Age02_std = df.loc[(df['Sex']==0) & (df['Pclass']==2), 'Age'].std()
Age02_mean, Age02_std

In [None]:
# Males in Third Class
Age03_mean = df.loc[(df['Sex']==0) & (df['Pclass']==3), 'Age'].mean()
Age03_std = df.loc[(df['Sex']==0) & (df['Pclass']==3), 'Age'].std()
Age03_mean, Age03_std

In [None]:
# use the mean and std of Males in First Class
# to randomly generate an Age within 1 standard deviation of the mean
Age01_impute = round(np.random.uniform(Age01_mean - Age01_std, Age01_mean + Age01_std))
Age01_impute

Do the same as above for females (coded as Sex=1).

In [None]:
# replace the null values with the imputed age
# (Sex was mapped to 0/1 above, so males are Sex==0, not 'male')
#df.loc[(df["Age"].isnull()) & (df["Sex"]==0) & (df["Pclass"]==1), 'Age'] = Age01_impute
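The per-group cells above can be folded into one loop over every Sex and Pclass combination. The sketch below is one possible approach, not the only one: it draws a separate random age for each missing row, rather than a single value per group, and runs on a made-up frame (`demo` and the seeded generator `rng` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the draws are reproducible

# Hypothetical toy frame; Sex is coded 0 = male, 1 = female
demo = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Age":    [30.0, 40.0, 50.0, np.nan, 18.0, 22.0, np.nan, np.nan],
})

# Mean and standard deviation of Age for each (Sex, Pclass) group
stats = demo.groupby(["Sex", "Pclass"])["Age"].agg(["mean", "std"])

for (sex, pclass), row in stats.iterrows():
    mask = demo["Age"].isnull() & (demo["Sex"] == sex) & (demo["Pclass"] == pclass)
    n_missing = int(mask.sum())
    if n_missing == 0 or pd.isna(row["std"]):
        continue  # nothing to fill, or too few observed ages for a std
    # One uniform draw per missing row, within 1 std of the group mean
    draws = rng.uniform(row["mean"] - row["std"], row["mean"] + row["std"],
                        size=n_missing)
    demo.loc[mask, "Age"] = np.round(draws)

print(demo)
```

Drawing per row avoids stacking every imputed passenger in a group on the same age, which would add an artificial spike to the Age distribution.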

# Feature Engineering
Perhaps create an "AgeBand" feature by grouping Age within bands (discretization).

In [None]:
df["Age"].head()

In [None]:
# Create "AgeBand" feature

#df['AgeBand'] = pd.cut(df['Age'], 5)
#df['AgeBand'] = pd.cut(df['Age'], 4)
#df['AgeBand'] = pd.cut(df['Age'], [0,20,40,60,80])
#df['AgeBand'] = pd.cut(df['Age'], [0,20,40,60,80], labels=["child","adult","middle age","elder"])
df['AgeBand'] = pd.cut(df['Age'], [0,20,40,60,80], labels=[1,2,3,4])
df["AgeBand"].head(20)

In [None]:
df[['AgeBand', 'Survived', "Sex"]].groupby(['AgeBand', "Sex"]).mean().sort_values(by='AgeBand', ascending=True)

Perhaps create a **"FamilySize"** feature, combining "SibSp" and "Parch"

In [None]:
# Create "FamilySize" feature  and perhaps drop "SibSp" and "Parch"
 
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

df[['FamilySize', 'Survived']].groupby(['FamilySize']).mean().sort_values(by='Survived', ascending=False)

Perhaps create an **"IsAlone"** feature using "FamilySize"

In [None]:
# Create IsAlone feature
df['IsAlone'] = 0
df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

df[['IsAlone', 'Survived']].groupby(['IsAlone']).mean()

In [None]:
df.head()

# Model and predict

**Note:** Scikit-learn will give you an error if you have any NaNs in your data. You must impute or drop them.
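Before fitting any model, it may help to confirm that the chosen feature columns contain no NaNs. The sketch below shows the two options from the note, dropping and imputing, on a made-up frame; `demo`, the `features` list, and the choice of median imputation are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
demo = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": [1, 0, 1, 0],
    "Age": [38.0, np.nan, 27.0, 35.0],
    "Fare": [71.28, 7.25, np.nan, 8.05],
    "Survived": [1, 0, 1, 0],
})
features = ["Pclass", "Sex", "Age", "Fare"]  # hypothetical feature choice

# Report which feature columns still contain NaNs
nan_counts = demo[features].isnull().sum()
print(nan_counts[nan_counts > 0])

# Option 1: drop every row that has a NaN in any feature
dropped = demo.dropna(subset=features)

# Option 2: impute each feature with its column median
imputed = demo.copy()
imputed[features] = imputed[features].fillna(imputed[features].median())

# Feature matrix and target with no NaNs left
X, y = imputed[features], imputed["Survived"]
print(X.isnull().any().any())
```

Dropping is simpler but discards rows, so imputing is usually preferable when many rows have only one missing feature.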

In [None]:
# ...
