<a href="https://colab.research.google.com/github/frank-895/machine_learning_journey/blob/main/titanic_dataset/using_framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
%%capture
!pip install fastbook

In [9]:
import fastai
import numpy as np
import pandas as pd

# Using a Framework on the Titanic Dataset

## Introduction

In the previous notebook we built a deep learning (DL) model by hand to show how neural networks (NNs) work under the hood. Outside of learning exercises, however, building NNs from scratch is time-consuming and usually yields worse results. Pre-made architectures and pretrained models tend to perform better because experts have refined them through extensive research.

In this notebook, we will use fastai to reproduce the results of the last notebook, where we created a model that predicts whether a passenger survived the Titanic disaster. We will also use a more advanced technique called ensembling to improve the model further.

## Prepare Data

In [10]:
df = pd.read_csv('train.csv')

In the last notebook, because we were building the NN from scratch, we had to engineer features carefully, as every variable needed a lot of work before the model could use it. Since we are now using fastai, this preparation is done for us! Instead, we will borrow some features from the notebook [Titanic - Advanced Feature Engineering Tutorial](https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial/), both to better understand Pandas and to improve our model.

Each feature is shown in code first and explained *afterwards*.

In [11]:
df['LogFare'] = np.log1p(df.Fare)

`log1p` takes the logarithm of `Fare + 1`. Adding 1 prevents `log(0)`, which is undefined, and this function does it for us!

We use 'LogFare' because the `Fare` column has a few very large values and many small ones, which gives a skewed distribution. Machine learning (ML) models generally work better with distributions that are closer to normal.
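To make the effect concrete, here is a minimal sketch on a handful of made-up fares (these values are assumptions for illustration, not rows from `train.csv`). It checks that `log1p` matches `log(x + 1)` and compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical fares (not taken from train.csv): mostly small, one very large, one zero.
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 71.28, 512.33])

# log1p(x) computes log(x + 1), so a fare of 0 maps to 0 instead of -inf.
log_fares = np.log1p(fares)

print(log_fares.round(3).tolist())

# Skewness is a rough measure of asymmetry; values nearer 0 are more symmetric.
print("skew before:", round(fares.skew(), 2))
print("skew after: ", round(log_fares.skew(), 2))
```

The single very large fare dominates the raw column, while after the transform the values sit much closer together.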

In [13]:
df['Deck'] = df.Cabin.str[0].map(dict(A='ABC', B='ABC', C='ABC', D='DE', E='DE', F='FG', G='FG'))

`df.Cabin.str[0]` extracts the first letter of the 'Cabin' column, which represents the deck where the passenger's cabin is located. We use `map` to group certain decks together.

Grouping decks reduces the number of unique categories, which simplifies the model's input data and may improve its predictive power.
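The sketch below runs the same mapping on a few made-up cabin values (assumed for illustration) to show what happens at each step. Note that a missing cabin, or a deck letter that is not in the dictionary, ends up as `NaN`.

```python
import pandas as pd

# Hypothetical cabin values for illustration; None stands in for a missing cabin.
cabins = pd.Series(['C85', None, 'E46', 'G6', 'T', 'B57 B59'])

deck_groups = dict(A='ABC', B='ABC', C='ABC', D='DE', E='DE', F='FG', G='FG')

first_letter = cabins.str[0]           # first character of each cabin string
deck = first_letter.map(deck_groups)   # letters missing from the dict become NaN

print(pd.DataFrame({'Cabin': cabins, 'Letter': first_letter, 'Deck': deck}))
```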

In [14]:
df['Family'] = df.SibSp + df.Parch

The 'Family' column is the sum of `SibSp` (siblings and spouses aboard) and `Parch` (parents and children aboard).

The idea is that the number of family members on board could be a predictor of a passenger's likelihood of survival.

In [15]:
df['Alone'] = df.Family == 0

Similarly, this feature captures the idea that travelling alone could be a predictor of survival. The comparison produces a boolean column, which fastai can handle automatically.
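As a quick illustration of the two columns above, the following sketch uses a small made-up DataFrame (the passengers are hypothetical) to show the element-wise sum and the resulting boolean column.

```python
import pandas as pd

# Hypothetical passengers for illustration (not rows from train.csv).
people = pd.DataFrame({'SibSp': [1, 0, 3, 0], 'Parch': [0, 0, 2, 1]})

people['Family'] = people.SibSp + people.Parch   # element-wise sum per row
people['Alone'] = people.Family == 0             # True only when Family is 0

print(people)
print(people.Alone.dtype)
```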

In [17]:
df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')

'TicketFreq' represents how many passengers shared a ticket number. `groupby` allows us to treat all rows with the same ticket number as a group so we can perform operations on these groups. `transform` is applied after grouping and is used here to count the occurrences of each ticket number.

`transform` is very useful for performing calculations within groups while retaining the original shape of the DataFrame.

The idea behind this feature is that people travelling on the same ticket may have had a similar experience during the journey, which could affect their likelihood of survival.
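To see why `transform` is used rather than a plain aggregation, the sketch below compares the two on a few made-up ticket numbers (assumed for illustration). The aggregation returns one row per group, while `transform` returns one value per original row.

```python
import pandas as pd

# Hypothetical ticket numbers for illustration.
tickets = pd.DataFrame({'Ticket': ['113803', '373450', '113803', 'PC 17599', '113803']})

# Aggregating returns one row per distinct ticket...
per_group = tickets.groupby('Ticket')['Ticket'].count()

# ...whereas transform broadcasts each group's count back to every original row,
# so the result can be assigned directly as a new column.
tickets['TicketFreq'] = tickets.groupby('Ticket')['Ticket'].transform('count')

print(len(per_group), len(tickets))
print(tickets)
```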

In [19]:
df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
df['Title'] = df.Title.map(dict(Mr='Mr', Miss='Miss', Mrs='Mrs', Master='Master'))
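Reading the two lines above: names follow the pattern "Surname, Title. Given names", so splitting on `', '` and then on `'.'` isolates the title. The `map` then keeps only the four listed titles, and anything else becomes `NaN`. The sketch below walks through those steps on three names; the first two appear in the `train.csv` output in this notebook, and the third is a hypothetical name added to show an unmapped title.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Smith, Dr. John',          # hypothetical name with a title outside the mapping
])

# Step 1: keep everything after the comma, e.g. 'Mr. Owen Harris'.
after_comma = names.str.split(', ', expand=True)[1]

# Step 2: keep everything before the first full stop, e.g. 'Mr'.
title = after_comma.str.split('.', expand=True)[0]

# Step 3: keep only the four listed titles; any other title maps to NaN.
common = title.map(dict(Mr='Mr', Miss='Miss', Mrs='Mrs', Master='Master'))

print(pd.DataFrame({'Name': names, 'Title': title, 'Mapped': common}))
```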

In [20]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | LogFare | Deck | Family | Alone | TicketFreq | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 2.110213 | NaN | 1 | False | 1 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4.280593 | ABC | 1 | False | 1 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 2.188856 | NaN | 0 | True | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 3.990834 | ABC | 1 | False | 2 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 2.202765 | NaN | 0 | True | 1 | Mr |


Notice that we have not had to clean up any of the missing (NaN) values ourselves.