
In [7]:
from fastai.imports import *

# Random Forests on the Spaceship Titanic

Inspired by my recent work on the classic Titanic dataset, I've decided to make a submission to a Kaggle competition called the [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/data?select=test.csv). The problem is set in 2912, and a group of passengers on an interstellar passenger liner has been launched into an alternate dimension. The goal is to use the passenger records to predict who was transported!

Since I am currently learning about **random forests**, I will limit myself to this machine learning technique for my submission.

## Data Extraction and Cleaning

### Extraction

We can read in our dataset using Pandas, which will be useful for cleaning the data and feature engineering too.

In [26]:
df = pd.read_csv("train.csv")
df

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


### Data Cleaning


The first thing to check in our dataframe is for missing values. We can check this easily with Pandas.

In [27]:
df.isnull().sum()

Unnamed: 0,0
PassengerId,0
HomePlanet,201
CryoSleep,217
Cabin,199
Destination,182
Age,179
VIP,203
RoomService,181
FoodCourt,183
ShoppingMall,208


It would be wasteful to remove the rows with missing values. Instead, I will fill each `NaN` value with the mode (the most frequent value) of its column, so that every row can still be used.

In [28]:
# Fill the missing values in each column with that column's most frequent value
for col in df.columns:
  df[col] = df[col].fillna(df[col].mode()[0])

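To make the behaviour of mode filling concrete, here is a minimal sketch on a toy dataframe. The `toy` and `filled` names and the values are made up for illustration and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (illustrative values only)
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", None, "Earth"],
    "Age": [24.0, np.nan, 24.0, 58.0],
})

filled = toy.copy()
for col in filled.columns:
    # mode() ignores NaN and returns the most frequent value(s), sorted;
    # [0] takes the first one, which also settles any ties
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

The same rule is applied to text and numeric columns alike. For a heavily skewed numeric column, the median could be a reasonable alternative, but that is a design choice rather than something this notebook does.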


Let's make sure this worked!

In [29]:
df.isnull().sum()

Unnamed: 0,0
PassengerId,0
HomePlanet,0
CryoSleep,0
Cabin,0
Destination,0
Age,0
VIP,0
RoomService,0
FoodCourt,0
ShoppingMall,0


Perfect!

### Feature Engineering

Let's have a look at our dataframe and decide what to do with each column and what **feature engineering** we need to perform before feeding the data into the random forest.

In [30]:
df.describe(include='all')

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
count,8693,8693,8693,8693,8693,8693.0,8693,8693.0,8693.0,8693.0,8693.0,8693.0,8693,8693
unique,8693,3,2,6560,3,,2,,,,,,8473,2
top,0001_01,Earth,False,G/734/S,TRAPPIST-1e,,False,,,,,,Alraium Disivering,True
freq,1,4803,5656,207,6097,,8494,,,,,,202,4378
mean,,,,,,28.728517,,220.009318,448.434027,169.5723,304.588865,298.26182,,
std,,,,,,14.355438,,660.51905,1595.790627,598.007164,1125.562559,1134.126417,,
min,,,,,,0.0,,0.0,0.0,0.0,0.0,0.0,,
25%,,,,,,20.0,,0.0,0.0,0.0,0.0,0.0,,
50%,,,,,,27.0,,0.0,0.0,0.0,0.0,0.0,,
75%,,,,,,37.0,,41.0,61.0,22.0,53.0,40.0,,


Here's my action plan for each feature:

- `PassengerId` is unique for each passenger; however, it does contain useful information as it is in the form `gggg_pp` where `gggg` represents the group they are travelling with. We will extract this `gggg` value as an integer to use as a continuous variable in the model.
- `HomePlanet`, `CryoSleep`, `Destination`, `VIP` each have only 2 or 3 unique values, so these will be converted to categorical variables without further manipulation.
- `Cabin` is more complex and requires further inspection. Its values follow the format `D/N/M`, so it may be possible to split this column into 3 separate categorical variables.
- `Age`, `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` all represent continuous variables and can be fed directly into the model without further manipulation.
- `Name` is an interesting column. For now, we will drop it, as it would require significant processing to extract useful information, especially for a random forest. It is possible that the last name could provide valuable insights for the model; however, I theorise that this information is already captured by the `gggg` part of `PassengerId`.

Finally, `Transported` is our target variable. It is either `True` or `False`.
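The `Cabin` split mentioned in the plan could be sketched as follows. This is only one possible approach: the column names `CabinD`, `CabinN` and `CabinM` are hypothetical, and treating the middle part as an integer is an assumption made here rather than something the plan specifies.

```python
import pandas as pd

# A few Cabin values taken from the first rows of the dataset
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# expand=True returns one column per '/'-separated piece
parts = cabins.str.split("/", expand=True)
parts.columns = ["CabinD", "CabinN", "CabinM"]  # hypothetical names

# Assumption: the middle part is numeric, the outer parts are categories
parts["CabinN"] = parts["CabinN"].astype(int)
parts["CabinD"] = pd.Categorical(parts["CabinD"])
parts["CabinM"] = pd.Categorical(parts["CabinM"])

print(parts)
```

Missing `Cabin` values would need to be filled (or otherwise handled) before the integer conversion, since `astype(int)` fails on `NaN`.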

#### PassengerId

We need to extract the first 4 digits of each value in the `PassengerId` column and convert the value to an integer to be used as a **continuous variable**.

In [32]:
df['PassengerId'] = [int(i[:4]) for i in df['PassengerId']]
df['PassengerId']

Unnamed: 0,PassengerId
0,1
1,2
2,3
3,3
4,4
...,...
8688,9276
8689,9278
8690,9279
8691,9280


If we inspect the number of unique values in the column, we can see that many passengers share a group number: there are fewer unique groups than the 8693 passengers.

In [33]:
df['PassengerId'].nunique()

6217
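One way to look at this overlap more closely is to count how many passengers share each group number. The sketch below uses the first five IDs from the dataset; `group` and `group_size` are hypothetical names, and the notebook itself does not build this feature.

```python
import pandas as pd

# First five PassengerId values from the dataset
ids = pd.Series(["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"])

# Vectorised equivalent of taking the first 4 characters as an integer
group = ids.str[:4].astype(int)

# Map each passenger to the number of passengers in their group
group_size = group.map(group.value_counts())

print(group.nunique())
print(group_size.tolist())
```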

When creating random forests, we don't need to create dummy variables for non-numeric columns; it is enough to convert them to **categorical variables**, which Pandas stores internally as integer codes. Below, you can see that I've selected the categorical variables by looking at the number of unique values for each feature.
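As a small illustration of how Pandas stores a categorical, the sketch below converts a few `Destination` values and inspects the integer codes. The `dest` and `cat` names are illustrative only.

```python
import pandas as pd

# Destination values that appear in the dataset
dest = pd.Series(["TRAPPIST-1e", "55 Cancri e", "PSO J318.5-22", "TRAPPIST-1e"])

cat = pd.Categorical(dest)

# Categories are the sorted unique values; codes index into them
print(list(cat.categories))  # ['55 Cancri e', 'PSO J318.5-22', 'TRAPPIST-1e']
print(cat.codes.tolist())    # [2, 0, 1, 2]
```

The codes carry no meaningful order here, which is usually acceptable for tree-based models because they can split on a code several times to isolate a category.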

In [15]:
[i for i in df['Cabin']]

['B/0/P',
 'F/0/S',
 'A/0/S',
 'A/0/S',
 'F/1/S',
 'F/0/P',
 'F/2/S',
 'G/0/S',
 'F/3/S',
 'B/1/P',
 'B/1/P',
 'B/1/P',
 'F/1/P',
 'G/1/S',
 'F/2/P',
 nan,
 'F/3/P',
 'F/4/P',
 'F/5/P',
 'G/0/P',
 'F/6/P',
 'E/0/S',
 'E/0/S',
 'E/0/S',
 'E/0/S',
 'E/0/S',
 'E/0/S',
 'D/0/P',
 'C/2/S',
 'F/6/S',
 'C/0/P',
 'F/8/P',
 'G/4/S',
 'F/9/P',
 'F/9/P',
 'F/9/P',
 'D/1/S',
 'D/1/P',
 'F/8/S',
 'F/10/S',
 'G/1/P',
 'G/2/P',
 'B/3/P',
 'G/3/P',
 'G/3/P',
 'G/3/P',
 'F/10/P',
 'F/10/P',
 'E/1/S',
 'E/2/S',
 'G/6/S',
 'F/11/S',
 'A/1/S',
 'A/1/S',
 'A/1/S',
 'G/7/S',
 'F/12/S',
 'F/13/S',
 'F/14/S',
 'E/3/S',
 'G/6/P',
 'G/10/S',
 'G/10/S',
 'F/15/S',
 'E/4/S',
 'F/16/S',
 'F/13/P',
 'F/14/P',
 'F/17/S',
 'D/3/P',
 'C/3/S',
 'F/18/S',
 'F/15/P',
 'C/4/S',
 'G/13/S',
 'F/16/P',
 'F/16/P',
 'F/16/P',
 'G/14/S',
 'C/5/S',
 'F/17/P',
 'E/5/S',
 'G/15/S',
 'G/16/S',
 'F/20/S',
 'G/9/P',
 'G/9/P',
 'G/9/P',
 'A/2/S',
 'G/11/P',
 'G/11/P',
 'F/19/P',
 'G/12/P',
 nan,
 'F/23/S',
 'F/24/S',
 'G/18/S',
 'G/18

In [None]:
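def process_data(df):
  # Store the low-cardinality text columns as Pandas categoricals
  df['HomePlanet'] = pd.Categorical(df.HomePlanet)
  df['Destination'] = pd.Categorical(df.Destination)

  # columns= is required here; drop() removes row labels by default
  df.drop(columns='PassengerId', inplace=True)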
