# Titanic First Pass Analysis

This analysis is a demo to practice basic data analysis skills with Pandas, NumPy, and matplotlib as part of Udacity's Intro to Data Analysis course. It explores various data wrangling, data analysis, and data visualization techniques on a real-world data set. The project also provides practice in using Jupyter Notebooks to present findings. As this is a first pass analysis, no machine learning techniques or statistical tests are applied.

The data set is the popular "Titanic Data Set" as found on [Kaggle](https://www.kaggle.com/c/titanic/data). It contains passenger information and survival data for 891 passengers, out of the roughly 2,224 passengers and crew aboard. The set includes features such as age, sex, and ticket class.

The goal of the analysis is to find initial trends and correlations between these features and passenger survival.

## 0. Data information

Taken from the [Kaggle](https://www.kaggle.com/c/titanic/data) description.
#### **Data Dictionary**

|**Variable**|**Definition**|**Key**|
|:-|:-|:-|
|*survival*|Survival|0 = No, 1 = Yes|
|*pclass*|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|*sex*|Sex| |
|*Age*|Age in years| |
|*sibsp*|# of siblings / spouses aboard the Titanic| |
|*parch*|# of parents / children aboard the Titanic| |
|*ticket*|Ticket number| |
|*fare*|Passenger fare| |
|*cabin*|Cabin number| |
|*embarked*|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

#### **Variable Notes**

**pclass**: A proxy for socio-economic status (SES)
* 1st = Upper
* 2nd = Middle
* 3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp**: The dataset defines family relations in this way...
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way...
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them.

## 1. Process the data

### Load data

In [127]:
import pandas as pd
import numpy as np

data = pd.read_csv('titanic_data.csv')

data = data.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)

### Clean Data

The first step in cleaning the data set is to look at the data. Getting a feel for which features are available and whether any values look strange helps identify what needs to be cleaned. The two methods `data.head()` and `data.describe()` support this. The first returns the first 5 rows of the DataFrame, while the second gives basic statistics for each numeric column.

In [128]:
data.head()

| |PassengerId|Survived|Pclass|Sex|Age|SibSp|Parch|Fare|
|:-|:-|:-|:-|:-|:-|:-|:-|:-|
|0|1|0|3|male|22.0|1|0|7.25|
|1|2|1|1|female|38.0|1|0|71.2833|
|2|3|1|3|female|26.0|0|0|7.925|
|3|4|1|1|female|35.0|1|0|53.1|
|4|5|0|3|male|35.0|0|0|8.05|


In [129]:
data.describe()

| |PassengerId|Survived|Pclass|Age|SibSp|Parch|Fare|
|:-|:-|:-|:-|:-|:-|:-|:-|
|count|891.0|891.0|891.0|714.0|891.0|891.0|891.0|
|mean|446.0|0.383838|2.308642|29.699118|0.523008|0.381594|32.204208|
|std|257.353842|0.486592|0.836071|14.526497|1.102743|0.806057|49.693429|
|min|1.0|0.0|1.0|0.42|0.0|0.0|0.0|
|25%|223.5|0.0|2.0|20.125|0.0|0.0|7.9104|
|50%|446.0|0.0|3.0|28.0|0.0|0.0|14.4542|
|75%|668.5|1.0|3.0|38.0|1.0|0.0|31.0|
|max|891.0|1.0|3.0|80.0|8.0|6.0|512.3292|


When looking at the output from `data.head()`, the 'Sex' feature's data type is string. It will be easier to manipulate this later if it is changed to a binary value. 

In [130]:
data['Sex_binary'] = data['Sex'].map({'male': 1, 'female': 0})
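
One detail worth checking with this kind of mapping: `Series.map` with a dictionary returns `NaN` for any value that is not a key. The sketch below uses hypothetical toy data (the `'unknown'` label is invented for illustration, not taken from the Titanic file) to show how unmapped labels could be counted.

```python
import pandas as pd

# Hypothetical toy data; 'unknown' stands in for any unexpected label.
toy = pd.DataFrame({'Sex': ['male', 'female', 'unknown']})

# Values missing from the dictionary become NaN in the result.
toy['Sex_binary'] = toy['Sex'].map({'male': 1, 'female': 0})

# Count the labels that the mapping did not cover.
unmapped = int(toy['Sex_binary'].isnull().sum())
print(unmapped)
```

A count of zero on the real column would indicate that every entry was either 'male' or 'female'.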

When looking at the output from `data.describe()`, the number of entries in 'Age' is less than the number of entries in the other features, which suggests there are missing values. Missing values will be set to a default value of 0. 

'PassengerId' can be set as the index of the DataFrame, as it appears there is a unique entry for each passenger.
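
The uniqueness assumption can be checked rather than eyeballed. This is a minimal sketch on hypothetical toy data: `Series.is_unique` reports whether a column has duplicates, and `set_index(..., verify_integrity=True)` raises an error if the new index would contain them.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [0, 1, 1]})

# True when every PassengerId appears exactly once.
ids_unique = toy['PassengerId'].is_unique

# verify_integrity=True makes set_index raise ValueError on duplicate ids.
toy = toy.set_index('PassengerId', verify_integrity=True)
print(ids_unique)
```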

To get a feel for the data, the mean of 'Survived' (0.38) shows that about 38% of the passengers in this sample survived, so roughly 62% did not. This analysis will look for features that the passengers in each group have in common.
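
The mean of a 0/1 column equals the fraction of ones, which is why the `Survived` mean from `describe()` can be read as a survival rate. A small sketch on a hypothetical toy column (not the real data) illustrates this, with `value_counts(normalize=True)` giving the share of each outcome.

```python
import pandas as pd

# Hypothetical toy column: 3 survivors out of 8 passengers.
survived = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name='Survived')

# Mean of a binary column is the fraction of ones.
survival_fraction = survived.mean()

# Share of each outcome (0 = did not survive, 1 = survived).
shares = survived.value_counts(normalize=True)

print(survival_fraction)
print(shares)
```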


In [131]:
data['Age'] = data['Age'].fillna(0)

data = data.set_index('PassengerId')

# Check to see if any null values still exist
data.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Sex_binary    0
dtype: int64
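
One consequence of the 0 default is worth keeping in mind: after filling, the zeros count as real ages in any statistic computed on 'Age'. The sketch below uses hypothetical toy ages (not the Titanic values) to compare the mean with the missing entry skipped against the mean after filling with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with one missing entry.
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

# pandas skips NaN by default: (20 + 30 + 40) / 3
mean_skipping_nan = ages.mean()

# After fillna(0) the zero is treated as a real age: (20 + 30 + 0 + 40) / 4
mean_zero_filled = ages.fillna(0).mean()

print(mean_skipping_nan, mean_zero_filled)
```

Any later age-based summary on the filled column may therefore need to exclude the zero entries, for example with a filter such as `data[data['Age'] > 0]`.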

In [132]:
data.head()

|PassengerId|Survived|Pclass|Sex|Age|SibSp|Parch|Fare|Sex_binary|
|:-|:-|:-|:-|:-|:-|:-|:-|:-|
|1|0|3|male|22.0|1|0|7.25|1|
|2|1|1|female|38.0|1|0|71.2833|0|
|3|1|3|female|26.0|0|0|7.925|0|
|4|1|1|female|35.0|1|0|53.1|0|
|5|0|3|male|35.0|0|0|8.05|1|


## 2. Explore the data

In [149]:
%matplotlib inline

import matplotlib.pyplot as plt

def survival_rate(pclass, sex):
    """
    pclass: class value 1, 2, or 3
    sex: 'male' or 'female'
    return: percentage of passengers in the (pclass, sex) group who survived
    """
    # Number of passengers in the group
    group_total = data.groupby(['Pclass', 'Sex']).size()[pclass, sex]
    # Number of passengers in the group with Survived == 1
    group_survived = data.groupby(['Pclass', 'Sex', 'Survived']).size()[pclass, sex, 1]
    return group_survived / group_total * 100

# Print the survival rate for every class and sex combination
for pclass in np.sort(data.Pclass.unique()):
    for sex in np.sort(data.Sex.unique()):
        print('Class {} survival rate for {}s: {:.1f}%'.format(pclass, sex, survival_rate(pclass, sex)))
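
The per-group rates can also be computed in a single pass, since the mean of the 0/1 `Survived` column within each group is that group's survival fraction. The following is a sketch of that alternative on a hypothetical toy frame (the rows are invented for illustration), not the notebook's original approach.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the Titanic data.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0, 0, 1],
})

# Mean of Survived per (Pclass, Sex) group, expressed as a percentage.
rates = toy.groupby(['Pclass', 'Sex'])['Survived'].mean() * 100

for (pclass, sex), rate in rates.items():
    print('Class {} survival rate for {}s: {:.1f}%'.format(pclass, sex, rate))
```

A side benefit of this form is that a group with no survivors yields 0 rather than a missing key when looking up the survivor count.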