# TITANIC: Wrangling the Passenger Manifest

## Advanced Exploratory Analysis with Pandas

This tutorial builds on the Titanic Wrangling notebook covered in class and goes deeper than the basic EDA and wrangling we performed there. The objective is to explore the data further and to consider several ways to impute missing values. We work through these steps because scikit-learn expects numeric input with no missing values.

Refer to the references provided in the wrangling notebook, and try some of the exercises below.

(References:
http://www.analyticsvidhya.com/blog/2014/08/baby-steps-python-performing-exploratory-analysis-python/    
http://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-steps-python/)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
pd.set_option('display.max_columns', 500)

## Acquire the Data

Use pandas to open the training csv from the data folder.
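
As a minimal sketch of the loading step, `pd.read_csv` turns a CSV into a DataFrame. The path `data/train.csv` mentioned in the comment is an assumption about your folder layout, and the three rows below are made up for illustration so the example runs without the real file.

```python
import io
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real manifest.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked\n"
    "1,0,3,male,22,7.25,S\n"
    "2,1,1,female,38,71.28,C\n"
    "3,1,3,female,,7.92,S\n"
)

# With the real file this would look something like
# pd.read_csv("data/train.csv") (path assumed; adjust to your layout).
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.head())  # empty fields, such as the third Age, are read as NaN
```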

In [None]:
# open the csv


## Exploratory Data Analysis

#### 1. Describe the Data

In [None]:
# view the head of the file


In [None]:
# Use pandas to run summary statistics on the data.


**What does describe tell you?**

Use the information from describe to provide a summary of the data.
* Are there missing values?
* What is the age distribution?
* What is the mean age?
* What percentage of passengers survived?
* How many passengers traveled in class 3?
* Does the data have outliers?

##### Notes on the data from describe:
(put your observations here based on the above questions)

In [None]:
# what are the dtypes of the columns


In [None]:
# what does the ticket data look like? check how many unique values there are.


In [None]:
# look at the Fares data since the describe suggests outliers


In [None]:
# Cabin has null values. check the values


#### 2. Visualize the Data

In [None]:
# Visualize the age distribution


In [None]:
# Visualize the fare distribution


In [None]:
# look at fares using a boxplot since the describe data suggested outlier(s)


In [None]:
# examine the fares by Pclass with boxplot


In [None]:
# look at the distribution of passengers by class and their survival probability

# create groupby dataframes that:
# plot the distribution of passengers by class
# show the probability of survival by class


#plot the data


In [None]:
# look at the distribution of passengers by sex and their survival probability

# create groupby dataframes that:
# plot the distribution of passengers by sex
# show the probability of survival by sex


#plot the data


In [None]:
# look at the distribution of passengers by Port of Embarkation and their survival probability

# create groupby dataframes that:
# plot the distribution of passengers by Port of Embarkation
# show the probability of survival by Port of Embarkation

#plot the data


## Data Wrangling

We need to review the data and decide how to handle missing data. In some cases we may want to impute our missing values. In other cases we may want to drop the data. Explore the data and explain your decisions on imputing or dropping data.
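
A rough sketch of how the share of missing values per column can inform the drop-or-impute decision follows. The rows are made up, and the 0.5 cutoff is an arbitrary assumption for illustration, not a recommended rule.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

null_counts = df.isnull().sum()   # number of missing values per column
null_share = df.isnull().mean()   # fraction of missing values per column

# Hypothetical rule: drop columns that are more than half empty.
mostly_empty = null_share[null_share > 0.5].index.tolist()
trimmed = df.drop(columns=mostly_empty)

print(null_counts)
print(mostly_empty)
```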

In [None]:
# check for null values


In [None]:
# make a decision on what to do with the cabin data. explain your decision.


In [None]:
# make a decision on what to do with the ticket data. explain your decision.


**Age is likely to be an important factor in modeling. Is there another way to impute the data aside from using the mean of the column?**

In [None]:
#  What is mean age by sex?


In [None]:
# What is mean age by Pclass?
# hint: use pandas 'groupby'


In [None]:
# What about mean age by sex and pclass?


In [None]:
# and mean age by sex and pclass if they survived?


As you can see, mean age can vary widely if you consider a passenger's sex, class, and survival. 

Can you use this information to impute NaN values based on these factors instead of just using the mean for the entire column?
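
One way this could be done is sketched below: `groupby(...).transform("mean")` returns a group mean aligned to every row, which `fillna` can then use only where `Age` is missing. The rows are made up, and the new column name `AgeImputed` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Mean age of each (Sex, Pclass) group, repeated on every row of that group.
group_mean_age = df.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# Keep known ages and fill only the gaps with the group mean.
df["AgeImputed"] = df["Age"].fillna(group_mean_age)

print(df)
```

This sketch groups by sex and class only. Adding `Survived` to the grouping is possible in the same way, but that column would not be available for passengers whose survival you later want to predict, so it is left out here.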

**As noted during the visualization, there is outlier data in fares.**

Should we do anything about this data? It is possible the fare was actually above $500. It is also possible there was an error in transcribing the data.

Explore the fare information and make a decision on what, if anything, to do with the outlier data. Explain your decision. 
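
As a starting point for that exploration, the sketch below flags fares above the usual boxplot upper fence (third quartile plus 1.5 times the interquartile range). The fares are made up, and the 1.5 multiplier is a convention rather than a rule; flagging a value does not by itself mean it should be removed.

```python
import pandas as pd

# Made-up fares for illustration only.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")

q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # same threshold a default boxplot whisker uses

outliers = fares[fares > upper_fence]

print(upper_fence)
print(outliers)
```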

**What about passengers with a fare of $0?**

**What should we do about missing embarkation data?**
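
One simple option, sketched under the assumption that only a few values are missing, is to fill the gaps with the most common port. The values below are made up for illustration; dropping the affected rows is an equally valid choice to consider.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

# mode() ignores NaN by default and returns a Series; take its first entry.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```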

**Sex and Embarkation**
As noted in the objectives, we will need to change text values to numeric values for scikit-learn. There are different ways to do this. In order to practice, we will use 'map' and 'get_dummies' here. In the machine learning class we will discuss when you may or may not want to use these as well as the options scikit-learn gives us for converting this data.

Convert Sex to binary numeric values using the pandas 'map' method.

Use pandas 'get_dummies' to change the Embarkation data to dummy variables. 
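
Both conversions can be sketched as follows. The rows are made up, the 0/1 coding for `Sex` is an arbitrary assumption, and the name `munged` for the resulting DataFrame is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Series.map replaces each value by looking it up in the dictionary.
# Which sex gets 0 and which gets 1 is an arbitrary choice here.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# get_dummies creates one indicator column per distinct port.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")

# Join the indicator columns, then drop the original text column.
munged = df.join(dummies).drop(columns="Embarked")

print(munged)
```

Values that are absent from the dictionary passed to `map` become NaN, so it is worth checking for nulls after mapping.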

In [None]:
# map sex into numeric binaries


In [None]:
# create dummy variables for Embarked, join them to the munged df, then drop the Embarked column


Optional: save the new dataframe to your titanic database or to a csv file.
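
Both options can be sketched as below. The table name `passengers` is hypothetical, and an in-memory SQLite database stands in for your titanic database so the example runs anywhere; the rows are made up.

```python
import sqlite3
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({"Survived": [0, 1], "Sex": [0, 1], "Age": [22.0, 38.0]})

# ":memory:" keeps the database in RAM; pass a file path to persist it instead.
conn = sqlite3.connect(":memory:")
df.to_sql("passengers", conn, if_exists="replace", index=False)

# Read the table back to confirm the write worked.
round_trip = pd.read_sql("SELECT * FROM passengers", conn)
conn.close()

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = df.to_csv(index=False)

print(round_trip)
print(csv_text)
```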

In [None]:
import pandas.io.sql as pd_sql
import sqlite3 as sql