## EDA & Cleaning: Cleaning the categorical features

Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

This dataset contains information about 891 people who were on board the ship when it sank on April 15th, 1912. As noted in the description on Kaggle's website, some people aboard the ship were more likely to survive the wreck than others. There were not enough lifeboats for everybody, so women, children, and the upper class were prioritized. Using the information about these 891 passengers, the challenge is to build a model to predict which people would survive based on the following fields:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class
- **Sex** (str) - Sex of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Passenger fare
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

**This section focuses on cleaning up the `Name`, `Sex`, `Cabin`, and `Embarked` features.**

### Read in data

_Welcome back to the last lesson of the EDA & Data Cleaning section. In this lesson we will take what we learned from our EDA on the categorical variables and apply it to clean up our dataset. Much of this repeats what we did last lesson, but with the exploratory steps removed so you can see a clear, direct way to clean this data._

_Start by importing the packages we need, reading in our data, and dropping a couple of variables that aren't useful._

In [1]:
import numpy as np
import pandas as pd

titanic = pd.read_csv('../titanic.csv')
titanic.drop(['Name', 'Ticket'], axis=1, inplace=True)
titanic.head()

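| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |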
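
Before cleaning, it can help to confirm which of the remaining columns are still non-numeric. The sketch below rebuilds the five rows displayed above as a small stand-in frame (the names `sample` and `non_numeric` are assumptions for illustration, not part of the lesson's notebook) and lists the columns that are not numeric.

```python
import numpy as np
import pandas as pd

# Stand-in for the first five rows shown above (not the full dataset).
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Parch': [0, 0, 0, 0, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

# Columns that are not numeric still need to be encoded or dropped.
non_numeric = sample.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

On this sample, the listed columns are the same three that the rest of this section handles.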


### Create indicator for `Cabin`

_In the last section we learned that missing values for `Cabin` weren't random: a missing value meant the passenger did not have a cabin, and those passengers were much less likely to survive. So let's go ahead and create an indicator variable based on whether `Cabin` is null._

_Again we use NumPy's `where()` function, which works like an if statement. Tell it what condition you're checking, here whether or not `Cabin` is null. If it is, pass in a 0; if not, pass in a 1._

In [2]:
titanic['Cabin_ind'] = np.where(titanic['Cabin'].isnull(), 0, 1)
titanic.head()

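| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Cabin_ind |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S | 0 |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S | 0 |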
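
As a small illustration of the indicator logic, the sketch below applies the same `np.where()` pattern to a toy column and compares it with a pandas-only spelling. The frame `toy` and the column `Cabin_ind_alt` are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for titanic['Cabin'].
toy = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan]})

# Same pattern as the lesson: 0 where Cabin is missing, 1 otherwise.
toy['Cabin_ind'] = np.where(toy['Cabin'].isnull(), 0, 1)

# An equivalent pandas-only spelling: True/False cast to 1/0.
toy['Cabin_ind_alt'] = toy['Cabin'].notnull().astype(int)

print(toy)
```

Both columns should agree row for row, so which spelling to use is largely a matter of taste.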


### Convert `Sex` to numeric

_Next we need to convert the `Sex` feature from `male` vs `female` to numeric. A model doesn't know what `male` vs `female` really means; it just knows there are two values for `Sex`. What those values are doesn't really matter, so converting them to numbers just makes the feature a little easier for most models to handle._

_Here is how we're going to handle it: we create a dictionary that does the gender-to-numeric mapping and then apply it to the `Sex` column using the `.map()` method in pandas. Just to make it easy, we will make `male`=0 and `female`=1. To map it, we call `titanic['Sex']`, then the `.map()` method, and tell it to use the `gender_num` dictionary to control the mapping. Run that and print out the first few rows. You can see that instead of `male` and `female` the column now holds 0 and 1._

In [3]:
gender_num = {'male': 0, 'female': 1}

titanic['Sex'] = titanic['Sex'].map(gender_num)

In [4]:
titanic.head()

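| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Cabin_ind |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | NaN | S | 0 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | NaN | S | 0 |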
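
One thing worth knowing about `.map()` with a dictionary is that any value not found among the keys becomes `NaN`. The sketch below shows this on a toy series. The names `sex`, `mapped`, and `unmapped` and the stray `'Female'` value are assumptions for illustration; nothing in the lesson suggests the Titanic data contains such values.

```python
import pandas as pd

gender_num = {'male': 0, 'female': 1}

# Hypothetical toy values; 'Female' (capitalized) is deliberately not a key.
sex = pd.Series(['male', 'female', 'Female'])

mapped = sex.map(gender_num)

# Values missing from the dictionary come back as NaN.
unmapped = sex[mapped.isnull()].unique().tolist()
print(unmapped)
```

A quick check like this after mapping can catch unexpected spellings before they silently turn into missing values.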


### Drop `Cabin` and `Embarked`

_Lastly, now that we have `Cabin_ind`, `Cabin` is repetitive and unnecessary. In addition, in our last lesson we discovered that survival rates differed across the levels of the `Embarked` feature. However, this was not a causal relationship: other features, like the % of women boarding at each port or the % of people that have cabins, were the real causal factors, and since those are already accounted for in this model, we can drop `Embarked`. This also mitigates any potential impact of multicollinearity._

_We've done this before: we call the dataframe's `.drop()` method, tell it to drop `Cabin` and `Embarked`, tell it we want to drop columns and not rows (that's `axis=1`), and then tell it to do it in place with `inplace=True` (alter the dataframe as it stands now)._

In [5]:
titanic.drop(['Cabin', 'Embarked'], axis=1, inplace=True)
titanic.head()

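| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin_ind |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | 0 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 0 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | 0 |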
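
For comparison, here is a minimal sketch of the same drop written without `inplace=True`. The toy frame and the name `cleaned` are hypothetical; `columns=` is the pandas keyword equivalent of passing the labels with `axis=1`.

```python
import pandas as pd

# Hypothetical toy frame containing the two columns being dropped.
toy = pd.DataFrame({
    'Sex': [0, 1],
    'Cabin': [None, 'C85'],
    'Embarked': ['S', 'C'],
    'Cabin_ind': [0, 1],
})

# drop() returns a new frame; columns= is equivalent to axis=1.
cleaned = toy.drop(columns=['Cabin', 'Embarked'])

print(cleaned.columns.tolist())
print(toy.columns.tolist())  # the original frame is unchanged
```

Assigning the result to a new name keeps the original frame available, which can be handy when re-running notebook cells out of order.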
