<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>
Hands-on Practice with Pandas and Matplotlib</p>



In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from math import pi

%matplotlib inline

<p style="font-family: Arial; font-size:1.3em;color:purple; font-style:bold"><br>
1: Get the data and inspect it</p>
 
**Load the data** into a Pandas DataFrame. **Inspect all the columns**, and <b>read the data dictionary <a href="https://www.kaggle.com/c/titanic/data">on Kaggle</a></b> to understand their meaning. **Print the shape of the DataFrame and its head**.

To avoid confusion, note that this isn't exactly the same dataset as the one on Kaggle. Kaggle just has a nicely formatted data dictionary. The dataset we're using (<a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">source</a>) includes data on all available passengers as well as their labels. Kaggle's version does not, for reasons you can guess.

In [3]:
titanic = pd.read_csv("./titanic_data.csv")
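
As a minimal sketch of the inspection step, the snippet below builds a tiny made-up frame and prints its shape, head and column types. The name `demo` and its values are illustrative assumptions, not taken from `titanic_data.csv`; the same calls should work on `titanic`.

```python
import pandas as pd

# Tiny stand-in frame (made-up values); the real data comes from pd.read_csv.
demo = pd.DataFrame({
    "pclass": [1, 3, 2],
    "survived": [1, 0, 1],
    "age": [29.0, None, 40.0],
})

print(demo.shape)   # (number of rows, number of columns)
print(demo.head())  # first rows, at most five
print(demo.dtypes)  # one dtype per column
```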

<p style="font-family: Arial; font-size:1.3em;color:purple; font-style:bold"><br>
2: Clean the data</p>

Next, we want to remove rows with null values, as usual. First, however, it's worth checking whether there are any columns we can drop. There are two reasons for doing this:<br>
1. If a column has too many NaN values, we should drop it. Otherwise, we'd needlessly discard rows whose only NaNs are in that column.
2. Not all columns provide valuable information anyway.

**Find the name of the column with the most missing values. Then, drop it from the DataFrame.**<br>
**Also, drop the ticket column** because it's not useful to us.
Once these columns are removed, **drop any rows that have NaN in the remaining columns**.
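
The cleaning steps described above might look roughly like the following sketch. It runs on a small made-up frame (`demo` and its values are assumptions for illustration), so the counts it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with missing values in several columns.
demo = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "ticket": ["A1", "B2", "C3", "D4"],
    "cabin": [np.nan, np.nan, "C85", np.nan],
    "fare": [7.25, 8.05, np.nan, 26.55],
})

# Count NaNs per column and find the column with the most of them.
nan_counts = demo.isnull().sum()
print(nan_counts)
worst_column = nan_counts.idxmax()

# Drop that column and the ticket column, then drop rows that still contain NaN.
cleaned = demo.drop(columns=[worst_column, "ticket"]).dropna()

# Keep an untouched copy to revert to later.
original = cleaned.copy()
print(cleaned.shape)
```

Dropping the sparse column first matters here: calling `dropna()` on the full toy frame would discard every row except the one that has a cabin value.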

In [3]:
# TODO: print column names and associated NaN counts

pclass 0
survived 0
name 0
sex 0
age 263
sibsp 0
parch 0
ticket 0
fare 1
cabin 1014
embarked 2


In [4]:
# TODO: Drop cabin and ticket columns
# TODO: Drop remaining rows with NaNs
# TODO (Recommended): Make a copy of your DataFrame at this point called "original" so you have something to revert to

<p style="font-family: Arial; font-size:1.3em;color:purple; font-style:bold"><br>
3: Do Some Visualizations</p>

First, let's see how age and travel class are related.<br>
<b>Build a <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html">Matplotlib histogram</a></b> for ages by first, second and third class tickets. For the histogram,
<ul>
    <li>x-axis is age</li>
    <li>y-axis is occurrences of passengers at that age in the given travel group</li>
    <li>The number of buckets you choose for the histogram is up to you</li>
    <li>Use histtype='step', and plot all 3 age distributions on the same histogram</li>
</ul>

*Note: Remember to title your plots and label all your axes!*
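
One way to overlay several step histograms is sketched below. The data, bin edges and colors are illustrative assumptions; the point is the shared `bins` list and `histtype="step"`. In a notebook with `%matplotlib inline`, the figure should render when the cell finishes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages per travel class, standing in for the cleaned Titanic data.
demo = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3],
    "age": [38.0, 54.0, 27.0, 14.0, 22.0, 4.0, 31.0],
})

# Shared bin edges so the three outlines are directly comparable.
bins = list(range(0, 90, 10))

fig, ax = plt.subplots()
for pclass, color in [(1, "tab:blue"), (2, "tab:green"), (3, "tab:red")]:
    ages = demo.loc[demo["pclass"] == pclass, "age"]
    ax.hist(ages, bins=bins, histtype="step", color=color, label=f"Class {pclass}")

ax.set_title("Age distribution by travel class")
ax.set_xlabel("Age")
ax.set_ylabel("Number of passengers")
ax.legend()
```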

In [5]:
# TODO: Get ages by class
# TODO: Create an array of bins for ages. Use range.
# TODO: Plot all three age lists with different colors

Passengers embarked from several different locations. <b>Build a <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html">Matplotlib bar chart</a></b> where:<br>
<ul>
    <li>x-axis is location</li>
    <li>y-axis is occurrences of passengers embarking at that location</li>
</ul>
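
A bar chart of category counts can be sketched as follows, assuming a made-up `embarked` series (the codes and counts are illustrative, not the real distribution). `value_counts()` gives one count per distinct value, which maps directly onto one bar per location.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up embarkation codes standing in for the real column.
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"], name="embarked")

# One count per distinct location, sorted from most to least frequent.
counts = embarked.value_counts()

fig, ax = plt.subplots()
ax.bar(list(counts.index), list(counts.values))
ax.set_title("Passengers by embarkation location")
ax.set_xlabel("Location")
ax.set_ylabel("Number of passengers")
```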

In [6]:
# TODO: Get embarked value counts
# TODO: Create a bar chart with one bar for each location

Time for a more morbid visualization. <b>Build one <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html">Matplotlib bar chart</a></b> that represents survival/casualty rates for males and females. This will require you to plot 4 bars in total, where two bars overlay the other two. For each gender, the larger bar is the total count of passengers of that gender, and the smaller bar is the count that survived.
<ul>
    <li>x-axis is gender</li>
    <li>y-axis is # of passengers</li>
    <li>2 bars for each gender<ul>
        <li>One bar represents total # passengers</li>
        <li>One bar represents # passengers that survived</li></ul>
    </li>
</ul>
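
The overlay can be achieved by drawing two bars at the same x position, totals first and survivors second, so the smaller bar sits on top of the larger one. The sketch below uses made-up data, and the colors and labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up passengers standing in for the cleaned Titanic data.
demo = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "survived": [0, 1, 1, 1, 0, 0],
})

totals = demo["sex"].value_counts()
survivors = demo.loc[demo["survived"] == 1, "sex"].value_counts()

genders = ["male", "female"]
total_heights = [int(totals.get(g, 0)) for g in genders]
survivor_heights = [int(survivors.get(g, 0)) for g in genders]

fig, ax = plt.subplots()
# Draw totals first; the survivor bars are drawn afterwards at the same x, on top.
ax.bar(genders, total_heights, color="lightgray", label="Total")
ax.bar(genders, survivor_heights, color="tab:blue", label="Survived")
ax.set_title("Survival by gender")
ax.set_xlabel("Gender")
ax.set_ylabel("Number of passengers")
ax.legend()
```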

In [7]:
# TODO: Get counts of total males/females and total survived males/females 
# TODO: Plot two bars, one for male total and one for male survived at same x location. 
# TODO: Do same for female at other x location.

<p style="font-family: Arial; font-size:1.3em;color:purple; font-style:bold"><br>
4: Predict survival</p>

*Note:* Please make a copy of the original dataset before doing the following.

Separate the data into its features and labels. Since we're trying to predict survival, the labels will be the survived column. The features we'll use will exclude the name, sex and embarked columns (think about why - there are different reasons for dropping name vs. dropping sex and embarked). Drop these three columns from your dataset. The result should be a (1309, 7) DataFrame called features and a (1309,) Series called labels.
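
A minimal sketch of this step, on a made-up four-row frame (`demo` and its values are assumptions), is shown here. It shuffles the whole frame before splitting off the label column, so features and labels stay aligned row for row.

```python
import pandas as pd
from sklearn.utils import shuffle

# Made-up stand-in rows; the real frame has more rows and columns.
demo = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "survived": [1, 0, 1, 0],
    "name": ["A", "B", "C", "D"],
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, 22.0, 40.0, 35.0],
    "embarked": ["S", "S", "C", "Q"],
})

# Drop the non-numeric columns, then shuffle the rows reproducibly.
numeric = demo.drop(columns=["name", "sex", "embarked"])
shuffled = shuffle(numeric, random_state=123)

# Split into features (everything but the label) and labels (the survived column).
features = shuffled.drop(columns=["survived"])
labels = shuffled["survived"]

print(features.shape, labels.shape)
```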



In [8]:
from sklearn.utils import shuffle
random_state = 123

# TODO: Drop name, embarked, sex
# TODO: Shuffle your dataset using shuffle(df, random_state=random_state)

In [9]:
# TODO: Create data df 
# TODO: Create label df
# TODO: Print their shapes and heads.

Now, **split your labels and data each into a train and test portion**. Note that it's important that train features and train labels align perfectly (same for test features and labels). Let's use 80% of the data for training.
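
One simple way to keep features and labels aligned is to cut both at the same row position, as in this sketch. The frames are made up, and it assumes the rows were already shuffled; `split` is just an illustrative variable name.

```python
import pandas as pd

# Made-up, already-shuffled features and labels with matching indexes.
features = pd.DataFrame({"age": list(range(20, 30)), "fare": list(range(10))})
labels = pd.Series([0, 1] * 5, name="survived")

# Row position where the training portion (80%) ends.
split = int(0.8 * len(features))

# Use the same cut point for both, so row i of X matches row i of y.
X_train, X_test = features.iloc[:split], features.iloc[split:]
y_train, y_test = labels.iloc[:split], labels.iloc[split:]

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

scikit-learn's `train_test_split` from `sklearn.model_selection` is an alternative that shuffles and splits features and labels in a single call.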

In [10]:
# TODO: Split data into X_train X_test
# TODO: Split labels into y_train y_test
# TODO: Print their shapes and the head of both X's.

Now, fit a classifier to your training data. Let's use <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">Logistic Regression</a>.
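
The fit, predict and score cycle is sketched below on synthetic data rather than the Titanic features. The random inputs and the rule that generates `y` are assumptions made purely so the example runs on its own; the default `LogisticRegression()` settings are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-feature data with a simple linear decision rule.
rng = np.random.RandomState(123)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80/20 split by position; the rows are already in random order.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Fit on the training portion only.
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Score on the held-out test portion.
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```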


In [11]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# TODO: Instantiate and then fit a logistic regressor 

In [12]:
# TODO: Make predictions and get accuracy score

<p style="font-family: Arial; font-size:1.3em;color:purple; font-style:bold"><br>
5: More features</p>

To allow our model to use the embarked and sex columns, let's transform them to numerical values.<br>
We'll <a href="https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding">one hot encode</a> these string features. There's a <a href="https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.get_dummies.html">convenient pandas function</a> to help us with this.
Remember to still drop the name column. We're not using it because it holds no predictive value.
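
As a rough illustration of one-hot encoding with `pd.get_dummies`, the sketch below encodes two string columns of a made-up frame and concatenates the result column-wise. `demo` and its values are assumptions; each distinct value becomes its own indicator column, prefixed with the source column's name.

```python
import pandas as pd

# Made-up rows with two categorical string columns.
demo = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# One indicator column per distinct value of sex and embarked.
dummies = pd.get_dummies(demo[["sex", "embarked"]])

# Replace the original string columns with their encoded versions.
encoded = pd.concat([demo.drop(columns=["sex", "embarked"]), dummies], axis=1)
print(encoded.columns.tolist())
```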

In [16]:
# TODO: Get copy of original dataset before step 4.
# TODO: Repeat step 4 up until the train-test split (shuffle, get data df, get label df). 
#       Note: Drop name. Don't drop sex and embarked.

In [17]:
# TODO: get new columns for sex and embarked using pd.get_dummies. Concatenate them to the dataset column wise.
# TODO: drop old columns for sex and embarked

# Split into train-test set same way as at the end of step 4. Print resulting shapes.

In [18]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# TODO: Instantiate and then fit a logistic regressor 
# TODO: Make predictions and get accuracy score