# Titanic EDA & Prep 
**Your Name**

This notebook uses the Titanic dataset from the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic/overview).

Fields include:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class
- **Sex** (str) - Sex of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Ticket price paid
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Import Libraries & Set Default Plot Attributes

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Customize seaborn plot styles
# Seaborn docs: https://seaborn.pydata.org/tutorial/aesthetics.html

# Adjust to retina quality
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

# Adjust dpi and font size
sns.set(rc={'figure.dpi': 100, 'savefig.dpi': 300})
sns.set_context('notebook', font_scale=0.8)

# Display tick marks
sns.set_style('ticks')

# Remove borders
plt.rc('axes.spines', top=False, right=False, left=False, bottom=False)

In [None]:
# Color palettes for plots
# Named colors: https://matplotlib.org/stable/gallery/color/named_colors.html
# Seaborn color palette docs: https://seaborn.pydata.org/tutorial/color_palettes.html
# Seaborn palette chart: https://www.codecademy.com/article/seaborn-design-ii

# cp1 Color Palette - a binary blue/orange palette
blue = 'deepskyblue' # Use 'skyblue' for a lighter blue
orange = 'orange'
cp1 = [blue, orange]

# cp2 Color Palette - 5 colors for use with categorical data
turquoise = 'mediumaquamarine'
salmon = 'darksalmon'
tan = 'tan'
gray = 'darkgray'
cp2 = [blue, turquoise, salmon, tan, gray]

# cp3 Color Palette - blue-to-orange diverging palette for correlation heatmaps
cp3 = sns.diverging_palette(242, 39, s=100, l=65, n=11)

# Set the default palette
sns.set_palette(cp1)

In [None]:
# View cp1 color palette
sns.palplot(cp1)

In [None]:
# View cp2 color palette
sns.palplot(cp2)

In [None]:
# View cp3 color palette
sns.palplot(cp3)

## Read and Review Data

In [None]:
df = pd.read_csv('data/titanic.csv')
df.head(10)

In [None]:
# View dataframe fundamentals
df.info()

# Drop Irrelevant Columns
These appear irrelevant to predicting survival:
- PassengerId
- Name
- Ticket

In [None]:
df.drop(['PassengerId','Name','Ticket'], axis=1, inplace=True)

# Preview the updated dataframe
df.head()

# Explore Numeric Features

- **Survived** is binary (1 = yes, 0 = no). It is the target variable, so we keep it alongside the numeric features for exploration.
- **Pclass** is ordinal: 1st, 2nd, 3rd classes
- **Age** is continuous, recorded mostly in whole years with some fractional values
- **SibSp** is ordinal, because it takes only a small range of integer counts (0, 1, 2, etc. siblings or spouses)
- **Parch** is ordinal, because it takes only a small range of integer counts (0, 1, 2, etc. parents or children)
- **Fare** is continuous with float values

In [None]:
# Store numeric features to a variable for easy re-use
cont = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

# Test our new variable as a filter to preview only those columns
df[cont].head()

In [None]:
# View summary statistics for these numeric features
df[cont].describe()

**Observations**
- About 38% of passengers survived.
- The majority of passengers were in 2nd or 3rd class.
- The average age was 29. The youngest was under a year. The oldest was 80.
- Most passengers had no sibling or spouse aboard, and no parent-child relationship.
- Median fare was 14, while the highest was 512.

### Did survivors' stats for these features differ markedly from non-survivors'?
Let's compare the mean values for these features for these groups.

In [None]:
# Compare mean values for these features, grouped by Survived
df[cont].groupby('Survived').mean()

**Observations:**
- Pclass: survivors tended to be upper class (1st or 2nd)
- Age: survivors were slightly younger on average
- SibSp: survivors averaged fewer siblings/spouses aboard
- Parch: survivors averaged slightly more parents/children aboard
- Fare: survivors paid, on average, more than twice the fare of non-survivors

### Investigate impact of null values for Age

In [None]:
# How many null values for Age?
df['Age'].isnull().sum()

Did those with null for age have a different survival rate?

In [None]:
# Group by whether Age is missing (True) or recorded (False), then compare means
df[cont].groupby(df['Age'].isnull()).mean()

On average, those with null values for age: 
- had a 10.7% lower chance of surviving
- were more often in 2nd or 3rd class
- had markedly fewer parents or children aboard
- paid markedly lower fares
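A gap in survival rates can be stated either in percentage points (absolute) or as a relative percentage, and the two can differ a lot. The sketch below shows both on a small made-up frame; `toy`, `age_status`, and the values in it are illustrative assumptions, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (made-up values, not the Titanic data)
toy = pd.DataFrame({
    "Survived": [1, 0, 0, 0, 1, 0, 0, 1],
    "Age": [22.0, np.nan, 35.0, np.nan, 4.0, np.nan, 54.0, np.nan],
})

# Label each row by whether Age is missing or recorded
age_status = toy["Age"].isnull().map({True: "missing", False: "recorded"})

# Survival rate within each group
rate = toy.groupby(age_status)["Survived"].mean()

# Absolute gap in percentage points (missing minus recorded)
gap_points = (rate["missing"] - rate["recorded"]) * 100

# Relative gap, as a percentage of the recorded group's rate
gap_relative = gap_points / (rate["recorded"] * 100) * 100

print(rate)
print(f"{gap_points:.1f} percentage points, {gap_relative:.1f}% relative")
```

Stating which of the two is meant avoids ambiguity when reporting a "lower chance of surviving".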

## Age Distributions
Compare age distributions for survivors and non-survivors

In [None]:
# Histogram: Age Distribution Comparisons by Survival
plt.title("Age Distributions Comparison", fontsize=14, fontweight='bold')
ax = sns.histplot(data=df, x='Age', hue='Survived', binwidth=5, alpha=0.7);
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

**Interpretation:**
The age distributions for survivors (1) and non-survivors (0) are very similar, _except_ that very young passengers (ages 0-5) and young teens were more likely to survive than not.

In [None]:
# Horizontal Boxplot: Comparing Age Distributions by Survival
plt.title("Age Distributions Comparison", fontsize=14, fontweight='bold')
ax = sns.boxplot(data=df, x='Age', y='Survived', orient='h');
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

**Interpretation:** The box plots show more clearly that survivors were, on the whole, slightly younger than non-survivors.

## Fare Distributions
Is there a pattern to survival rates by ticket price?

In [None]:
# Histogram: Fare Distribution Comparisons by Survival
plt.title("Fare Distributions Comparison", fontsize=14, fontweight='bold')
ax = sns.histplot(data=df, x='Fare', hue='Survived', binwidth=25, alpha=0.7);
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

Those with tickets priced around $40 or more were more likely to survive.

In [None]:
# Boxplot: Fare Distributions Comparison by Survival
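A boxplot summarizes each group by its quartiles, so a numeric counterpart can be a useful cross-check. The following is a minimal sketch on made-up data; `toy` and `fare_quartiles` are hypothetical names, and the fares are invented so the quartiles are easy to verify by hand.

```python
import pandas as pd

# Toy stand-in for df (made-up fares, not the Titanic data)
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Fare": [10.0, 20.0, 30.0, 20.0, 60.0, 100.0],
})

# Quartiles of Fare within each Survived group:
# rows are Survived values, columns are the 25th, 50th, and 75th percentiles
fare_quartiles = (
    toy.groupby("Survived")["Fare"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(fare_quartiles)
```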


### Continuous Variables Summary
- **Age** appears somewhat relevant to predicting survival. **177** null values need attention.
- **Fare** appears very relevant to predicting survival. Those with a ticket priced at $40 or greater were more likely to survive than not.
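One way to check a threshold claim like the $40 cutoff numerically is to band the fares and compare survival rates per band. This is a sketch under assumptions: `toy`, `fare_band`, and the band labels are hypothetical, and the values are made up rather than taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df (made-up values, not the Titanic data)
toy = pd.DataFrame({
    "Fare": [7.25, 8.05, 13.0, 26.0, 39.9, 40.0, 71.28, 263.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Left-closed bins, so a fare of exactly 40 falls in the upper band
fare_band = pd.cut(
    toy["Fare"],
    bins=[0, 40, float("inf")],
    right=False,
    labels=["under 40", "40 or more"],
)

# Survival rate within each fare band
rate_by_band = toy.groupby(fare_band, observed=True)["Survived"].mean()

print(rate_by_band)
```

Moving the bin edge and re-running shows how sensitive the conclusion is to the chosen cutoff.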

# Explore Ordinal Features
Pclass, SibSp, and Parch are more accurately considered ordinal variables, so let's explore those here.

- **Pclass:** passenger class ranges from 1st to 3rd
- **SibSp:** sibling and spouse counts range from 0 to 8
- **Parch:** parent and child counts range from 0 to 6

In [None]:
# Create variable to hold ordinal features, plus Survived
# (named ordinal rather than ord to avoid shadowing Python's built-in ord())
ordinal = ['Survived','Pclass','SibSp','Parch']

# View summary stats
df[ordinal].describe()

Compare the mean values for survivors and non-survivors

In [None]:
df[ordinal].groupby('Survived').mean()

**Interpretation:**
- Survivors tended to be first or second class.
- Survivors averaged slightly fewer siblings and/or spouses.
- Survivors averaged slightly more parent-child relationships.

### Explore Pclass

In [None]:
# Countplot comparing survivors vs. non-survivors by Pclass
plt.title("Pclass Survival Comparisons", fontsize=14, fontweight='bold')
ax = sns.countplot(data=df, x='Pclass', hue='Survived');
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

In [None]:
# Calculate survival rate by Pclass
df['Survived'].groupby(df['Pclass']).mean()

In [None]:
# Barplot survival rate by Pclass
plt.title("Pclass Survival Rate", fontsize=14, fontweight='bold')
# errorbar=None requires seaborn >= 0.12; on older versions use ci=None instead
ax = sns.barplot(data=df, x='Pclass', y='Survived', errorbar=None, color=blue);
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

**Interpretation**: Pclass is *highly* relevant to predicting survival, with lower Pclass numbers corresponding with higher survival probability.

### Explore SibSp

In [None]:
# SibSp countplot for survival comparisons

In [None]:
# Calculate survival rate by SibSp

In [None]:
# Barplot survival rate by SibSp

**Interpretation:** 
- Add your interpretation here.
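When reading survival rates by SibSp, it helps to see the group sizes next to the rates, since a rate computed from a handful of passengers is unreliable. A minimal sketch on made-up data follows; `toy` and `sibsp_summary` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for df (made-up values, not the Titanic data)
toy = pd.DataFrame({
    "SibSp": [0, 0, 0, 1, 1, 2, 0, 1],
    "Survived": [0, 1, 0, 1, 1, 0, 0, 0],
})

# For each SibSp value: number of passengers and share who survived
# (the mean of a 0/1 column is the survival rate)
sibsp_summary = toy.groupby("SibSp")["Survived"].agg(
    passengers="count",
    survival_rate="mean",
)

print(sibsp_summary)
```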

### Explore Parch

In [None]:
# Countplot comparing survived vs non-survived for Parch

In [None]:
# Calculate survival rate by Parch

In [None]:
# Barplot survival rate by Parch

**Interpretation:**
- Add your interpretation.
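An alternative to `groupby(...).mean()` for this kind of comparison is a row-normalized cross-tabulation, which shows the survived and non-survived shares side by side. The sketch below uses made-up data; `toy` and `parch_table` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for df (made-up values, not the Titanic data)
toy = pd.DataFrame({
    "Parch": [0, 0, 0, 0, 1, 1, 2, 2],
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

# normalize="index" makes each row (each Parch value) sum to 1,
# so column 1 is the survival rate and column 0 is its complement
parch_table = pd.crosstab(toy["Parch"], toy["Survived"], normalize="index")

print(parch_table)
```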

# Clean Numeric Data

### Fill Nulls for Age with Average Age

In [None]:
# Follow Jedamski's Cleaning Continuous Variables video.
# Replace these comments with your own and code.
# Add new cells as needed.
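As a rough illustration of mean imputation (not necessarily the exact steps in the referenced video), the sketch below fills missing ages with the column mean on a made-up frame. `toy` and `mean_age` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (made-up ages, not the Titanic data)
toy = pd.DataFrame({"Age": [22.0, np.nan, 30.0, np.nan, 38.0]})

# mean() skips NaN by default, so this is the mean of the recorded ages
mean_age = toy["Age"].mean()

# Assign the result back instead of using inplace=True on a column selection
toy["Age"] = toy["Age"].fillna(mean_age)

print(toy)
```

Filling with the mean leaves the column average unchanged but shrinks its variance, which is worth keeping in mind when interpreting later plots.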

### Create `Family_count` from `SibSp` and `Parch`
Combine SibSp and Parch into a single variable, `Family_count`, defined as SibSp + Parch. This reduces [multicollinearity](https://www.investopedia.com/terms/m/multicollinearity.asp) between the two related features and gives the model one simpler family-size signal.

In [None]:
# Create Family_count
# Add new cells as needed.
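The combination described above is an element-wise sum of the two columns. A minimal sketch on made-up data (`toy` is a hypothetical stand-in for `df`):

```python
import pandas as pd

# Toy stand-in for df (made-up values, not the Titanic data)
toy = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 1],
    "Survived": [1, 0, 0, 1],
})

# Family_count = number of siblings/spouses + number of parents/children
toy["Family_count"] = toy["SibSp"] + toy["Parch"]

print(toy)
```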

### Explore the new `Family_count` variable

In [None]:
# Countplot comparing survived vs. non for Family_count

In [None]:
# Barplot comparing average survival rate by Family_count

**Interpretation:**
- Add your interpretation here.

### Drop `SibSp` & `Parch`
- These are now redundant with `Family_count`.
- We need to remove them to avoid a [multicollinearity](https://www.investopedia.com/terms/m/multicollinearity.asp) problem.

In [None]:
# Add cells as needed.
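A possible way to drop the two columns, sketched on a made-up frame (`toy` is a hypothetical stand-in for `df`). It uses the `columns=` keyword and reassignment rather than `inplace=True`, which is a style choice and not a requirement.

```python
import pandas as pd

# Toy stand-in for df after Family_count has been created
toy = pd.DataFrame({
    "SibSp": [1, 0],
    "Parch": [0, 2],
    "Family_count": [1, 2],
    "Survived": [1, 0],
})

# Drop the now-redundant source columns and keep the combined one
toy = toy.drop(columns=["SibSp", "Parch"])

print(toy.columns.tolist())
```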

# Explore Categorical Features

In [None]:
# Create a variable to hold our categorical features, plus Survived as the target variable
cat = ['Survived','Sex','Cabin','Embarked']
df[cat].head()

In [None]:
# View informational summary of these categorical features
df[cat].info()

In [None]:
# Add remaining cells as needed to explore and clean the categorical features.
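One possible direction for the categorical features, offered as a sketch rather than the prescribed approach: turn Cabin into a present/missing indicator, encode Sex numerically, and fill missing Embarked values with the most common port. The column names `Cabin_ind` and `Sex_num`, the 0/1 coding, and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df[cat] (made-up values, not the Titanic data)
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Sex": ["female", "male", "female", "male"],
    "Cabin": ["C85", np.nan, "E46", np.nan],
    "Embarked": ["C", "S", np.nan, "S"],
})

# 1 if a cabin was recorded, 0 if it is missing
toy["Cabin_ind"] = toy["Cabin"].notnull().astype(int)

# Numeric coding for Sex (the 0/1 assignment is arbitrary)
toy["Sex_num"] = toy["Sex"].map({"male": 0, "female": 1})

# Fill missing ports with the most frequent value
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

# Survival rate by whether a cabin was recorded
print(toy.groupby("Cabin_ind")["Survived"].mean())
```

Whether an indicator is a better choice than dropping Cabin depends on how strongly missingness tracks survival in the real data, which this toy frame does not show.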