# Titanic First Pass Analysis

This analysis is a demo for practicing basic data analysis skills with Pandas, NumPy, and matplotlib as part of Udacity's Intro to Data Analysis course. It explores data wrangling, data analysis, and data visualization techniques on a real-world data set, and it gives practice using Jupyter Notebooks to present findings. As this is a first-pass analysis, no machine learning techniques or statistical analysis will be performed.

The data set is the popular "Titanic Data Set" as found on [Kaggle](https://www.kaggle.com/c/titanic/data). The data contains features such as age, sex, and ticket class. The goal of the analysis is to find initial trends and correlations between these features and passenger survival.



## 1. Process the data

### Load data

In [15]:
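import pandas as pd

# Load the Titanic data set from a local CSV file.
data = pd.read_csv('titanic_data.csv')

# Drop the free-text columns, which are not used in this first pass.
data = data.drop(['Name', 'Ticket', 'Cabin'], axis=1)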

### Clean Data

The first step in cleaning the data set is to look at the data. Getting a feel for what is available, and whether there are any strange values, helps identify what needs to be cleaned. The two methods `data.head()` and `data.describe()` will do this. The first returns the first 5 rows of the DataFrame, while the second gives basic statistics for each numeric column.

In [17]:
data.head()

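|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |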


In [18]:
data.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
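By default, `describe()` summarizes only numeric columns, which is why 'Sex' and 'Embarked' do not appear in the table above. The sketch below shows one way to include the text columns by passing `include='all'`. It runs on a small sample copied by hand from the `data.head()` output rather than on the full file, and the names `sample` and `summary` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Five rows copied from the data.head() output (a subset of the columns).
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

# include='all' adds the text columns, reported as unique/top/freq
# instead of mean/std/quantiles.
summary = sample.describe(include='all')
print(summary)
```

For text columns the numeric rows (mean, std, and so on) are left as NaN, and for numeric columns the `unique`, `top`, and `freq` rows are NaN.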


The output from `data.head()` shows that the 'Sex' feature is stored as a string. It will be easier to manipulate later if it is encoded as a binary value (1 for male, 0 for female).

In [20]:
data['Sex_binary'] = data['Sex'].map({'male': 1, 'female': 0})
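One caveat with `map` and a dictionary is that any value missing from the dictionary becomes NaN without raising an error. The sketch below shows a possible sanity check on a hypothetical Series that contains a deliberately misspelled label. The names `sex`, `sex_binary`, and `unmapped` are illustrative.

```python
import pandas as pd

# Hypothetical values; 'Male' (capitalized) is not a key in the mapping.
sex = pd.Series(['male', 'female', 'female', 'Male'])

sex_binary = sex.map({'male': 1, 'female': 0})

# Values that the dictionary did not cover show up as NaN after map().
unmapped = sex[sex_binary.isna()]
print(unmapped.tolist())
```

On the real data the same check would be `data['Sex_binary'].isna().sum()`, which should be 0 if every entry is exactly 'male' or 'female'.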

The output from `data.describe()` shows that 'Age' has only 714 entries while the other features have 891, which means 177 ages are missing. Missing values will be set to a default value of 0. Since the smallest recorded age is 0.42, a value of 0 unambiguously marks a missing entry.
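The missing-value count can also be read off directly rather than inferred from `describe()`. The sketch below uses a hypothetical four-row frame (the values are made up for illustration) to count missing entries per column and to show how filling with 0 changes the mean of 'Age', which is worth keeping in mind when interpreting later statistics.

```python
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    'Age': [22.0, 38.0, None, 35.0],
    'Survived': [0, 1, 0, 1],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# mean() skips NaN by default, so the two results differ.
mean_ignoring_missing = toy['Age'].mean()
mean_after_fill = toy['Age'].fillna(0).mean()

print(missing_per_column)
print(mean_ignoring_missing, mean_after_fill)
```

Because the filled zeros count as real observations, any later statistic on 'Age' may need to exclude them, for example with `data[data['Age'] > 0]`.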

'PassengerId' can be set as the index of the DataFrame, as it appears there is a unique entry for each passenger.

To get a feel for the data: the mean of 'Survived' is 0.38, so about 38% of the passengers in the data set survived and about 62% did not. This analysis will look for commonalities among those who did not survive.
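Because 'Survived' is coded as 0 or 1, its mean equals the fraction of passengers who survived. A minimal sketch of that calculation, using the five 'Survived' values from the `data.head()` output rather than the full data set (the variable names are illustrative):

```python
import pandas as pd

# 'Survived' values of the first five passengers.
survived = pd.Series([0, 1, 1, 1, 0], name='Survived')

# The mean of a 0/1 column is the share of ones.
survival_rate = survived.mean()

# The same information as explicit shares for each outcome.
shares = survived.value_counts(normalize=True)

print(survival_rate)
print(shares)
```

Applied to the full column, `data['Survived'].mean()` reproduces the 0.383838 shown by `data.describe()`.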


In [22]:
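# Replace missing ages with the default value 0.
data['Age'] = data['Age'].fillna(0)

# set_index returns a new DataFrame, so assign the result back.
data = data.set_index('PassengerId')

data.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Sex_binary |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | 1 |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 0 |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | 0 |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | 0 |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | 1 |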
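Two details about using 'PassengerId' as the index are easy to miss: `set_index` returns a new DataFrame and leaves the original untouched unless the result is assigned back, and the uniqueness of the column can be checked rather than assumed. The sketch below illustrates both on a small sample; `sample` and `indexed` are illustrative names.

```python
import pandas as pd

# PassengerId and Survived for the first five passengers.
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 1, 0],
})

# Confirm that every passenger ID occurs exactly once.
ids_unique = sample['PassengerId'].is_unique

# set_index does not modify `sample`; the result must be captured.
# verify_integrity=True raises an error if the new index has duplicates.
indexed = sample.set_index('PassengerId', verify_integrity=True)

print(ids_unique)
print(indexed.loc[2, 'Survived'])
```

After this, `sample` still has 'PassengerId' as an ordinary column, while `indexed` can be looked up by passenger ID with `.loc`.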
