# Spaceship - Data Analysis

In [1]:
from IPython.display import Image
from IPython.display import HTML

In [None]:
Image(url= "https://www.redpointglobal.com/wp-content/uploads/2020/10/nebula-and-galaxies-in-space-scaled.jpg", width=700)

## Project Description

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

[Text from kaggle.com/competitions/spaceship-titanic]

## Project Goal

The goal of this data analysis is to predict which passengers were transported by the anomaly. For this purpose, we have a training dataset that includes the columns PassengerId, HomePlanet, CryoSleep, Cabin, Destination, Age, VIP, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, Name, and Transported.
We also have a testing dataset that includes the same columns apart from Transported. This column is not included because it is the one we are trying to predict.
The project also explores further research questions, such as "How many Home Planets are there?" or "What are the age groups?"

In [2]:
# !pip install kaggle
import kaggle

In [None]:
! kaggle competitions download -c spaceship-titanic

## Data Evaluation

Data evaluation is an important step here because we get an overview of the data given and are able to understand the (un-)structuredness of the data.

In [3]:
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

Read in the training data with pandas

In [4]:
train_data = pd.read_csv("spaceship-titanic/train.csv")
train_data.head(10)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
5,0005_01,Earth,False,F/0/P,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,Sandie Hinetthews,True
6,0006_01,Earth,False,F/2/S,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,Billex Jacostaffey,True
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,,Candra Jacostaffey,True
8,0007_01,Earth,False,F/3/S,TRAPPIST-1e,35.0,False,0.0,785.0,17.0,216.0,0.0,Andona Beston,True
9,0008_01,Europa,True,B/1/P,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,Erraiam Flatic,True


Then we use shape to see the extent of the dataset

In [None]:
train_data.shape

Read in the test data with pandas

In [5]:
test_data = pd.read_csv("spaceship-titanic/test.csv")
test_data.head(10)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez
5,0027_01,Earth,False,F/7/P,TRAPPIST-1e,31.0,False,0.0,1615.0,263.0,113.0,60.0,Karlen Ricks
6,0029_01,Europa,True,B/2/P,55 Cancri e,21.0,False,0.0,,0.0,0.0,0.0,Aldah Ainserfle
7,0032_01,Europa,True,D/0/S,TRAPPIST-1e,20.0,False,0.0,0.0,0.0,0.0,0.0,Acrabi Pringry
8,0032_02,Europa,True,D/0/S,55 Cancri e,23.0,False,0.0,0.0,0.0,0.0,0.0,Dhena Pringry
9,0033_01,Earth,False,F/7/S,55 Cancri e,24.0,False,0.0,639.0,0.0,0.0,0.0,Eliana Delazarson


Then we use shape to see the extent of the dataset

In [None]:
test_data.shape

## EDA (Exploratory data analysis)

Data Dictionary defined on kaggle.com, which explains all columns that can be found in the dataset:



| Column      | Description                                                                               |
|-------------|-------------------------------------------------------------------------------------------|
| PassengerId | A unique Id for each passenger. Each Id takes the form gggg_pp (group_numberWithinGroup)  |
| HomePlanet  | The planet the passenger departed from, typically their planet of permanent residence     |
| CryoSleep   | Indicates whether passenger is in cryosleep (if so, they are confined to their cabins)    |
| Cabin       | The cabin number where the passenger is staying (deck/num/side), P (Port), S (Starboard)  |
| Destination | The planet the passenger will be debarking to                                             |
| Age         | The age of the passenger                                                                  |
| VIP         | Whether the passenger has paid for special VIP service during the voyage                  |
| RoomService | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities  |
| FoodCourt   | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities  |
| ShoppingMall| Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities  |
| Spa         | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities  |
| VRDeck      | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities  |
| Name        | The first and last names of the passenger.                                                |
| Transported | Whether the passenger was transported to another dimension; the target column to predict  |


Now that we have an overview of the dataset and its columns, we can check the integrity of the given data.

### How many null values does the dataset have?
This is important to know so we can replace the NaNs with other values later on.

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()
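
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a toy frame (the values are made up for illustration; in the notebook the same expressions would be applied to `train_data` and `test_data`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the values are illustrative only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [39.0, np.nan, np.nan, 16.0],
    "VIP": [False, False, True, False],
})

# Number and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})
print(missing)
```

Columns with a high percentage of missing values may deserve a closer look before they are filled in.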

### Duplicates?
Should there be duplicate data, it would be better to drop them, so as not to distort the analysis results.

In [None]:
print(f"{train_data.duplicated().sum()} duplicated rows ({np.round(100*train_data.duplicated().sum()/len(train_data),1)}%) in train_data")
print(f"{test_data.duplicated().sum()} duplicated rows ({np.round(100*test_data.duplicated().sum()/len(test_data),1)}%) in test_data")

### Data types?

In [None]:
train_data.dtypes

In [None]:
train_data.describe()

## Introductory Questions
#### How many Home Planets are there?

In [None]:
train_data['HomePlanet'].value_counts().plot(kind='pie', autopct='%.0f%%')

The results show that there are three Home Planets, i.e. planets where the passengers come from or at least depart from. Most passengers, namely 54%, come from Earth, while 25% come from a planet called Europa and 21% come from Mars.

#### How many destinations are there?

In [None]:
train_data['Destination'].value_counts().plot(kind='pie', autopct='%.0f%%')

Interestingly enough, there are as many destinations as there are Home Planets. The most popular destination is TRAPPIST-1e with 69% of passengers aiming to go there. The second most popular destination is 55 Cancri e with 21% and PSO J318.5-22 is the least favorite destination with 9%.

#### How many people are put into CryoSleep?

In [None]:
train_data['CryoSleep'].value_counts().plot(kind='pie', autopct='%.0f%%')

According to the results, 36% of the passengers are in CryoSleep during the voyage. The remaining 64% are not in CryoSleep during the voyage. CryoSleep affects the prediction because a passenger in cryosleep is unable to leave their cabin. It would therefore be interesting to see to what extent cryosleep affects the survival rate of the passengers.

#### How many VIPs are onboard?

In [None]:
train_data['VIP'].value_counts().plot(kind='pie', autopct='%.0f%%')

The chart shows that most passengers, namely 98%, have a non-VIP ticket. Merely 2% have a VIP ticket. This is important for the analysis as there could be vast differences in accommodations and safety measures depending on the type of ticket. It would therefore be interesting to explore a correlation between the ticket type and the chance of survival. It can also be said that far more people are in cryosleep than hold a VIP ticket, which shows that having a VIP ticket is definitely not a requirement for cryosleep.

#### How many people were actually transported?

In [None]:
train_data['Transported'].value_counts().plot(kind='pie', autopct='%.0f%%')

According to the output, about 50% of the passengers in the training data were transported. The two classes are therefore roughly balanced.

#### What can be said about the passengers' age?

In [None]:
print('Oldest Passenger was of:',train_data['Age'].max(),'Years')
print('Youngest Passenger was of:',train_data['Age'].min(),'Years')
print('Average Age on the ship:',train_data['Age'].mean(),'Years')

It seems that the passenger age is highly variable starting at 0 years (a newborn/baby) up to 79 years. The average passenger, however, is almost 29 years old.

#### How many people survived depending on age and cryosleep?

In [None]:
fig = px.violin(train_data, y="Age", x="Transported", color="CryoSleep", box=True, points="all",
          hover_data=train_data.columns)
fig.show()

#### How many people survived depending on age and VIP status?

In [None]:
fig = px.violin(train_data, y="Age", x="Transported", color="VIP", box=True, points="all",
          hover_data=train_data.columns)
fig.show()

## Preprocessing and Cleaning

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

The results show that there are quite a lot of NaNs in the dataset that need to be cleaned up.

Combining the dataset makes it easier to change data because we do not have to set up everything twice.

In [6]:
data_cleaner = [train_data, test_data]

Filling in the unknown values. For HomePlanet, Name, Cabin and Destination, I replace the NaNs with 'Unknown'; for CryoSleep and VIP, I use the placeholder '-1'. For the numeric columns, I replace the NaNs with the median value. I am using the median as a substitute because it is robust to outliers and therefore distorts the distribution less than the mean would.

In [7]:
for dataset in data_cleaner:
    # Categorical columns: mark missing values explicitly
    for col in ['HomePlanet', 'Name', 'Cabin', 'Destination']:
        dataset[col] = dataset[col].fillna('Unknown')
    # Boolean columns: use the string '-1' as placeholder for "unknown"
    for col in ['CryoSleep', 'VIP']:
        dataset[col] = dataset[col].fillna('-1')
    # Numeric columns: fill with the median of the respective dataset
    for col in ['Age', 'VRDeck', 'RoomService', 'FoodCourt', 'Spa', 'ShoppingMall']:
        dataset[col] = dataset[col].fillna(dataset[col].median())
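
To illustrate why the median is a reasonable choice for the spending columns, here is a small sketch with made-up numbers (not taken from the dataset). Judging by the rows shown earlier, these columns contain many zeros and a few very large bills, which pulls the mean upwards while the median stays close to the typical value:

```python
import numpy as np
import pandas as pd

# Made-up spending values: mostly small, one very large bill, one missing.
spend = pd.Series([0.0, 0.0, 9.0, 25.0, 6715.0, np.nan], name="Spa")

# Fill the missing entry once with the mean and once with the median.
mean_fill = spend.fillna(spend.mean())
median_fill = spend.fillna(spend.median())

print("value used with mean:  ", spend.mean())
print("value used with median:", spend.median())
```

In this toy example the mean-based fill would assign the missing passenger a bill far above what most passengers spent, whereas the median-based fill stays near the bulk of the values.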

In [None]:
train_data.sample(10)

## Feature Engineering
Based on the existing dataset, we can add more distinctive features to make the model more precise. Therefore I decided to divide the age groups and create a column containing the name of the age group for each passenger.

In [8]:
for dataset in data_cleaner: 
    dataset['Age_group']=np.nan
    dataset.loc[dataset['Age']<=12,'Age_group']='Age_0-12'
    dataset.loc[(dataset['Age']>12) & (dataset['Age']<18),'Age_group']='Age_13-17'
    dataset.loc[(dataset['Age']>=18) & (dataset['Age']<=25),'Age_group']='Age_18-25'
    dataset.loc[(dataset['Age']>25) & (dataset['Age']<=30),'Age_group']='Age_26-30'
    dataset.loc[(dataset['Age']>30) & (dataset['Age']<=50),'Age_group']='Age_31-50'
    dataset.loc[dataset['Age']>50,'Age_group']='Age_51+'
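
The same binning can also be written with `pd.cut`. The sketch below is an alternative, not the code used in this notebook, and it assumes that ages are whole numbers (as in the rows shown above), so that the bin edges reproduce the conditions from the loop:

```python
import numpy as np
import pandas as pd

# Ages at every bin boundary, to check the edges explicitly.
ages = pd.Series([0, 12, 13, 17, 18, 25, 26, 30, 31, 50, 51, 79], dtype=float)

# Right edges are inclusive: (-inf, 12], (12, 17], (17, 25], ...
bins = [-np.inf, 12, 17, 25, 30, 50, np.inf]
labels = ["Age_0-12", "Age_13-17", "Age_18-25", "Age_26-30", "Age_31-50", "Age_51+"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "Age_group": age_group}))
```

`pd.cut` returns an ordered categorical, which keeps the groups in a sensible order for plots without passing an explicit `order` argument.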

In the following code, I use seaborn and matplotlib to visualize the distribution of the different age groups. Here, I incorporated the Survival / Transported rate into the diagram.

In [None]:
plt.figure(figsize=(10,4))
g=sns.countplot(data=train_data, x='Age_group', hue='Transported', order=['Age_0-12','Age_13-17','Age_18-25','Age_26-30','Age_31-50','Age_51+'])
plt.title('Age group distribution')

The largest age group is 31-50, followed by 18-25. The biggest divergence between surviving passengers and deceased passengers occurs for the age group 0-12. Interestingly enough, the survival rate is higher than the death rate only for the age groups 0-12 and 13-17. While the rates are almost the same for the age groups 26-30 and 51+, the chance of survival is still slightly smaller. For the groups 18-25 and 31-50 the divergence is larger.

In [None]:
train_data.head()

Here, I create three new columns that are based on the Cabin column. I am splitting the column, which follows the pattern (deck/num/side), into separate columns through indexing.

In [9]:
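def deck(cabin):
    # First part of 'deck/num/side'; None if the value is not a string
    try:
        return cabin.split('/')[0]
    except (AttributeError, IndexError):
        return None

def num(cabin):
    # Second part as an integer; None if it is missing or not numeric
    try:
        return int(cabin.split('/')[1])
    except (AttributeError, IndexError, ValueError):
        return None

def side(cabin):
    # Third part ('P' or 'S'); None if it is missing
    try:
        return cabin.split('/')[2]
    except (AttributeError, IndexError):
        return None

for dataset in data_cleaner:
    dataset['Cabin_deck'] = dataset['Cabin'].apply(deck)
    dataset['Cabin_num'] = dataset['Cabin'].apply(num)
    dataset['Cabin_side'] = dataset['Cabin'].apply(side)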
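
As an alternative to the helper functions, pandas can split the whole column at once. This is only a sketch on a toy series (the cabin values are examples, with 'Unknown' standing in for a filled-in missing value):

```python
import pandas as pd

cabins = pd.Series(["B/0/P", "F/1/S", "Unknown", "G/1499/S"], name="Cabin")

# Split on '/' into separate columns; values without '/' leave the
# second and third column empty.
parts = cabins.str.split("/", expand=True)
parts.columns = ["Cabin_deck", "Cabin_num", "Cabin_side"]

# Convert the cabin number to a numeric type (missing values become NaN).
parts["Cabin_num"] = pd.to_numeric(parts["Cabin_num"])
print(parts)
```

As with the helper functions, a value of 'Unknown' ends up as deck 'Unknown' with a missing number and side.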

In [None]:
train_data.sample(10)

In [None]:
test_data.sample(10)

Because the three new columns also include NaNs, it is necessary to replace them with 'Unknown' (or -1 for the numeric Cabin_num column), just like I did with the data in a previous step.

In [10]:
for dataset in data_cleaner:
    # Assign the result back instead of using inplace=True on a column
    dataset['Cabin_num'] = dataset['Cabin_num'].fillna(-1)
    dataset['Cabin_side'] = dataset['Cabin_side'].fillna('Unknown')
    dataset['Cabin_deck'] = dataset['Cabin_deck'].fillna('Unknown')

Here, I am replacing the boolean values True and False with 1 and 0.

In [11]:
train_data['Transported'] = train_data['Transported'].replace({True:1,False:0})
train_data['CryoSleep'] = train_data['CryoSleep'].replace({True:1,False:0})
train_data['VIP'] = train_data['VIP'].replace({True:1,False:0})
train_data

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Age_group,Cabin_deck,Cabin_num,Cabin_side
0,0001_01,Europa,0,B/0/P,TRAPPIST-1e,39.0,0,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,0,Age_31-50,B,0.0,P
1,0002_01,Earth,0,F/0/S,TRAPPIST-1e,24.0,0,109.0,9.0,25.0,549.0,44.0,Juanna Vines,1,Age_18-25,F,0.0,S
2,0003_01,Europa,0,A/0/S,TRAPPIST-1e,58.0,1,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,0,Age_51+,A,0.0,S
3,0003_02,Europa,0,A/0/S,TRAPPIST-1e,33.0,0,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,0,Age_31-50,A,0.0,S
4,0004_01,Earth,0,F/1/S,TRAPPIST-1e,16.0,0,303.0,70.0,151.0,565.0,2.0,Willy Santantines,1,Age_13-17,F,1.0,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,0,A/98/P,55 Cancri e,41.0,1,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,0,Age_31-50,A,98.0,P
8689,9278_01,Earth,1,G/1499/S,PSO J318.5-22,18.0,0,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,0,Age_18-25,G,1499.0,S
8690,9279_01,Earth,0,G/1500/S,TRAPPIST-1e,26.0,0,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,1,Age_26-30,G,1500.0,S
8691,9280_01,Europa,0,E/608/S,55 Cancri e,32.0,0,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,0,Age_31-50,E,608.0,S


### Detecting possible correlations through one hot encoding and creating a heatmap

In [12]:
one_hot_encoded_training_predictors = pd.get_dummies(train_data)
one_hot_encoded_testing_predictors = pd.get_dummies(test_data)
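
To make clear what `pd.get_dummies` does, here is a minimal sketch on a toy frame (made-up rows): every distinct value of a text column becomes its own indicator column, while numeric columns are passed through unchanged.

```python
import pandas as pd

toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth"],
    "Cabin_num": [0, 1, 0, 2],
})

# Text columns are expanded into one indicator column per distinct value.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded)
```

Because each distinct value adds a column, text columns with many unique values (such as PassengerId, Name or Cabin) produce a very wide frame when the whole dataset is encoded.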

(The following step may take a long time to run.)

In [None]:
sns.heatmap(one_hot_encoded_training_predictors.corr(),annot=True,cmap='RdYlGn',linewidths=0.2)
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()
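
If only the relationship with the target is of interest, a lighter option is to look at a single column of the correlation matrix. The following is a sketch on a toy frame with made-up values, not a result from the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "CryoSleep": [1, 0, 1, 0, 1, 0],
    "Age": [20, 35, 41, 28, 16, 52],
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth", "Europa", "Mars"],
    "Transported": [1, 0, 1, 0, 1, 0],
})

# Encode text columns as 0/1 integers, then keep only the correlations
# of each feature with the target column.
encoded = pd.get_dummies(toy, dtype=int)
target_corr = encoded.corr()["Transported"].drop("Transported").sort_values()
print(target_corr)
```

The sorted series can be read directly or plotted as a bar chart, which is easier to inspect than a full heatmap when there are many columns.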

### Deciding what features to include in the model
Based on the previous queries and results, I decided to include CryoSleep, Destination, Age_group, Cabin_deck, Cabin_num, Cabin_side, and HomePlanet in my final model.

In [13]:
features = ['CryoSleep','Destination','Age_group','Cabin_deck','Cabin_num','Cabin_side','HomePlanet']
train_ds = train_data[features]
test_ds = test_data[features]
y = train_data['Transported']
print(train_ds.columns)
train_ds.sample(10)

Index(['CryoSleep', 'Destination', 'Age_group', 'Cabin_deck', 'Cabin_num',
       'Cabin_side', 'HomePlanet'],
      dtype='object')


Unnamed: 0,CryoSleep,Destination,Age_group,Cabin_deck,Cabin_num,Cabin_side,HomePlanet
4820,0,TRAPPIST-1e,Age_0-12,F,1050.0,P,Mars
2442,1,TRAPPIST-1e,Age_18-25,E,159.0,P,Unknown
6858,0,TRAPPIST-1e,Age_26-30,D,225.0,P,Europa
4879,0,TRAPPIST-1e,Age_0-12,G,845.0,S,Earth
6291,-1,TRAPPIST-1e,Age_18-25,G,1087.0,S,Earth
6208,0,TRAPPIST-1e,Age_26-30,G,1067.0,S,Earth
2470,0,TRAPPIST-1e,Age_18-25,G,421.0,S,Earth
8065,0,TRAPPIST-1e,Age_31-50,F,1659.0,S,Mars
8651,-1,TRAPPIST-1e,Age_0-12,G,1498.0,P,Earth
4525,0,TRAPPIST-1e,Age_26-30,E,299.0,P,Mars


Here I am splitting my training set into training and validation sets in an 80/20 ratio.

In [14]:
train_X, test_X, train_y, test_y = train_test_split(one_hot_encoded_training_predictors, y, test_size=0.20)
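
Note that `one_hot_encoded_training_predictors` was built from the whole of `train_data`, so it still contains the `Transported` column as well as indicator columns for PassengerId, Name and Cabin. One possible variant, sketched here on a toy frame with made-up rows, encodes only the selected `features` and keeps the target separate. The `random_state` and `stratify` arguments are optional additions that are not part of the original cell:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_data (made-up rows).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth", "Mars",
                   "Earth", "Europa", "Earth", "Mars", "Earth"],
    "Cabin_num": [0, 1, 0, 2, 3, 4, 1, 5, 6, 7],
    "Name": [f"Passenger {i}" for i in range(10)],
    "Transported": [0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})

features = ["HomePlanet", "Cabin_num"]   # selected predictors only
X = pd.get_dummies(toy[features])        # Name and Transported are left out
y = toy["Transported"]

# Reproducible 80/20 split that keeps the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(X_train.shape, X_val.shape)
```

Scores obtained this way would differ from the ones reported in this notebook, since the model would see a different set of columns.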

In [15]:
print(y)

0       0
1       1
2       0
3       0
4       1
       ..
8688    0
8689    0
8690    1
8691    0
8692    1
Name: Transported, Length: 8693, dtype: int64


## K-Nearest Neighbours
The k-nearest neighbours algorithm uses k = 5 as default value. I decided to check the results for different values and then use the one with the best result/accuracy.

In [16]:
k_values = [3, 5, 7, 8, 9, 11]

best_accuracy = 0
best_k = None

for k in k_values:
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_clf, train_X, train_y, cv=5)
    accuracy = scores.mean()

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_k = k
        
print(best_k)

11
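
The manual loop above corresponds to a small grid search. As a sketch, the same idea can be expressed with scikit-learn's `GridSearchCV`; the data here is synthetic (generated with `make_classification`), not the competition data, so the selected k says nothing about this project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Try the same candidate values for k with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 8, 9, 11]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print("best k:", best_k, "mean CV accuracy:", round(search.best_score_, 3))
```

By default, `GridSearchCV` refits the best model on all the data passed to `fit`, so `search.predict` can be used directly afterwards.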


As we have now found out the best k value, we can actually train the model in the next step.

In [21]:
knn_clf = KNeighborsClassifier(n_neighbors=best_k)
knn_clf.fit(train_X, train_y)
pred_knn = knn_clf.predict(test_X)

Of course, it is also important to know how accurate the model is. We can get the accuracy score with a function from sklearn. As a second measure I am using the F1 score.

In [22]:
acc_knn = accuracy_score(test_y, pred_knn)
f1 = f1_score(test_y, pred_knn)

print("Best k:", best_k)
print("Accuracy Score:", acc_knn)
print("F1 Score:", f1)

Best k: 11
Accuracy Score: 0.7659574468085106
F1 Score: 0.7722439843312814
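
For reference, both metrics can be reproduced by hand from the counts of correct and incorrect predictions. The labels below are made up for illustration and are unrelated to the model's actual predictions:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true labels and predictions (1 = transported, 0 = not transported).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

# Accuracy: share of correct predictions.
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# F1: harmonic mean of precision and recall.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_pred))
print(f1_manual, f1_score(y_true, y_pred))
```

Accuracy treats both classes alike, while the F1 score focuses on the positive class and penalizes both false positives and false negatives.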


After cleaning and pre-processing the data, the model was trained and reaches an accuracy of about 77% (with an F1 score of about 0.77) on the validation set, which is a reasonable result.

### Sources

* Machine Learning for Absolute Beginners, Oliver Theobald: https://vk.com/wall-54530371_337420?lang=en
* Lucija Krusic's notebook: https://github.com/lucijakrusic/programming2SS23/blob/main/data_science/Titanic_challenge.ipynb
* Spaceship Dataset: https://www.kaggle.com/competitions/spaceship-titanic/data
* Spaceship Notebooks: 
    * https://www.kaggle.com/code/computervisi/best-models-spaceship-titanic (useful for the feature engineering part but very sophisticated and detailed)
    * https://www.kaggle.com/code/rajeshkuchhadia/spaceship-titanic (good overview)
