# Titanic Dataset

In this notebook you will see pandas in action. Every machine learning model expects data in a particular format. In addition, the machine learning algorithms in sklearn work only with numbers (not strings, datetimes, or NaNs). Hence, we have to process our dataset so that it contains only numbers, has no missing values, and is in the proper format.

To do all of this, we will use pandas. When I work on a Machine Learning problem, I spend most of my time working with pandas, so you should get really good at it.

There's no way around it: every Data Scientist should master pandas for data preprocessing.

In [18]:
import pandas as pd

In [19]:
df = pd.read_csv('../data/train.csv')

In [20]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [21]:
df.isnull().sum() # let's look at the number of missing values in each column

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
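
Raw counts are easier to judge next to the size of the dataset. The sketch below is a small variation on the cell above: it uses a made-up toy frame (the values are illustrative, not the real Titanic rows) and takes the mean of the boolean mask to get the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column where most values are missing (like Cabin here) is usually handled differently from one with only a few gaps, which is the approach taken in the following cells.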

In [22]:
df.drop(df[df['Embarked'].isnull()].index, axis=0, inplace=True) # we will drop the records where the Embarked value is missing.

In [23]:
df.shape

(889, 12)
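
The drop above works by looking up the index labels of the rows with a missing Embarked value. As a hedged alternative, `dropna` with the `subset` argument should give the same result in one call. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Embarked": ["S", np.nan, "C", "Q"],
    "Fare": [7.25, 80.0, 71.2833, 8.05],
})

# Approach used above: find the index labels of the rows to remove, then drop them.
by_index = toy.drop(toy[toy["Embarked"].isnull()].index, axis=0)

# Alternative: dropna checks only the columns listed in subset.
by_dropna = toy.dropna(subset=["Embarked"])

print(by_dropna.shape)
```

Both versions keep the original index labels, so the remaining rows are not renumbered.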

In [24]:
df['has_cabin'] = 0 # taking care of the Cabin column, as it has the highest number of missing values
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,has_cabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [25]:
df.loc[~(df.Cabin.isnull()),'has_cabin'] = 1 # all records that have a Cabin value should have has_cabin = 1
df.drop('Cabin', axis=1, inplace=True) # dropping the Cabin column.

In [26]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,has_cabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0
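
The two steps used for `has_cabin` (create the column with 0, then set 1 where Cabin is present) can also be written as one expression. This is only a sketch on a made-up toy column, comparing both versions.

```python
import numpy as np
import pandas as pd

# Toy column (illustrative values only).
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123"]})

# Two-step version, as in the cells above.
two_step = toy.copy()
two_step["has_cabin"] = 0
two_step.loc[~two_step["Cabin"].isnull(), "has_cabin"] = 1

# One-step version: notnull() gives True/False, astype(int) turns that into 1/0.
one_step = toy.copy()
one_step["has_cabin"] = one_step["Cabin"].notnull().astype(int)

print(one_step)
```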


In [27]:
df.isnull().sum().sum() # Do we still have missing values?

177

In [29]:
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill missing Age values with the column mean; assign back instead of calling inplace on a column selection

In [30]:
df.isnull().sum().sum() # Do we still have missing values?

0
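
A short sketch of what the mean fill does, on a made-up Series: `mean()` skips missing values by default, and `fillna` returns a new Series that is assigned back.

```python
import numpy as np
import pandas as pd

# Toy ages (illustrative values only).
ages = pd.Series([22.0, 38.0, np.nan, 26.0, np.nan], name="Age")

# mean() skips NaN by default, so it averages only the three known ages.
mean_age = ages.mean()

# fillna returns a new Series with every NaN replaced by the given value.
filled = ages.fillna(mean_age)

print(filled)
```

Filling with the mean leaves the column mean unchanged, but it does shrink the spread of the column, since every filled row sits exactly at the average.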

Great, no missing values at all. Moving ahead.

Machine Learning models cannot work with strings. Let's handle them.

In [31]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
has_cabin        int64
dtype: object

Name, Sex, Ticket and Embarked are the columns where we have string values. Let's address them one by one.

In [33]:
df.Embarked = df.Embarked.astype('category').cat.codes # encoding Embarked Column

You can read more about categorical encoding [here](http://benalexkeen.com/mapping-categorical-data-in-pandas/) and [here](https://pbpython.com/categorical-encoding.html).
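
To see which number stands for which port, it helps to look at the categories before taking the codes. The sketch below uses a made-up Series; for plain strings, pandas sorts the categories, so the codes follow alphabetical order.

```python
import pandas as pd

# Toy port codes (illustrative values only).
embarked = pd.Series(["S", "C", "S", "Q"])
as_category = embarked.astype("category")

# Position in .cat.categories is the integer code assigned to that label.
mapping = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes

# A missing value gets the code -1 rather than raising an error.
codes_with_gap = pd.Series(["S", None]).astype("category").cat.codes

print(mapping)
```

Because a missing value is silently encoded as -1, it is safer to deal with missing entries before encoding a column, as was done for Embarked earlier.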

In [34]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,has_cabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,2,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,0,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,2,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,2,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,2,0


In [35]:
df.Sex = df.Sex.astype('category').cat.codes # encoding Sex column

In [36]:
df.Ticket.head() # we can take the digits of the ticket.

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

In [37]:
t1 = 'A/5 21171' # looking at single element

In [38]:
t1.split()[-1]

'21171'

In [39]:
df.Ticket = df.Ticket.str.split().str[-1] # using str accessor from pandas 

To read more about the `str` accessor in pandas, see [the user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html).
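
The chained call above does two things per row, which is easier to see when the steps are separated. A minimal sketch, using the four ticket values printed earlier:

```python
import pandas as pd

tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"])

# .str.split() splits each string on whitespace and yields a list per row.
parts = tickets.str.split()

# .str[-1] then picks the last element of each list.
last_part = parts.str[-1]

print(last_part)
```

The result still holds strings, which is why the dtype check in the next cell still reports `object` for Ticket.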

In [40]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex               int8
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked          int8
has_cabin        int64
dtype: object

In [41]:
df.Ticket.astype('int') # now let's convert the Ticket column to int

ValueError: invalid literal for int() with base 10: 'LINE'

Oops, an error! Calm down. It says that "LINE" cannot be converted to int, i.e. we have the value "LINE" in the Ticket column. Let's find those records using boolean indexing and drop them.

In [42]:
df.drop(df[df.Ticket == 'LINE'].index, axis=0, inplace=True) # dropping all such records where ticket==LINE.

In [43]:
df.Ticket = df.Ticket.astype('int') # now it works
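
Here the offending value was known from the error message. When it is not, one option (an alternative to the approach above, not what this notebook does) is `pd.to_numeric` with `errors="coerce"`, which marks every unparseable value as NaN so it can be found with boolean indexing. The Series below is a made-up sample.

```python
import pandas as pd

# Toy ticket values after taking the last part (illustrative sample).
tickets = pd.Series(["21171", "17599", "LINE", "113803"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
as_number = pd.to_numeric(tickets, errors="coerce")

# Boolean indexing: keep only the rows where parsing failed.
bad_values = tickets[as_number.isnull()]

# Drop those rows, then convert what remains to integers.
clean = as_number.dropna().astype("int64")

print(bad_values.tolist())
```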

In [44]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex               int8
Age            float64
SibSp            int64
Parch            int64
Ticket           int32
Fare           float64
Embarked          int8
has_cabin        int64
dtype: object

Only the Name column is left. For now we will just encode it using the "cat" accessor.

In [45]:
df.Name = df.Name.astype('category').cat.codes

In [46]:
### Sanity Check 1 - NO Missing values ###
df.isnull().sum().sum()

0

In [48]:
### Sanity Check 2 - NO string values ###
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name             int16
Sex               int8
Age            float64
SibSp            int64
Parch            int64
Ticket           int32
Fare           float64
Embarked          int8
has_cabin        int64
dtype: object
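
The two sanity checks can be bundled into one helper so they are easy to rerun after any change. This is a sketch: `check_model_ready` is a hypothetical name introduced here, and the two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_ready(frame):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    for column in frame.columns:
        # Check 1: no missing values.
        if frame[column].isnull().any():
            problems.append(f"{column}: missing values")
        # Check 2: only numeric dtypes (no strings or other objects).
        if not is_numeric_dtype(frame[column]):
            problems.append(f"{column}: non-numeric dtype {frame[column].dtype}")
    return problems


# Toy frames (illustrative values only).
good = pd.DataFrame({"Age": [22.0, 38.0], "Sex": [1, 0]})
bad = pd.DataFrame({"Age": [22.0, np.nan], "Sex": ["male", "female"]})

print(check_model_ready(good))
print(check_model_ready(bad))
```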

Our data is ready to be fed into a Machine Learning model.

The goal that we were trying to achieve with this notebook was *to show you how important pandas is*. To do almost anything with your data, you will have to use pandas. It's used extensively in Data Science and Machine Learning projects, and it's an extremely powerful library. We hope the purpose was well served, and we will keep adding more notebooks to this section.