# Task 2: Rescue Across The Rift

Our first task is to clean the given dataset, which was partially corrupted by a space anomaly that transported passengers from the legendary Star Trek passenger liner into unknown regions of space. The output of this step is a cleaned dataset.

For the second part of the task, we perform exploratory data analysis on the cleaned dataset to look for correlations that might explain why these passengers were transported by the anomaly, and to learn more about the incident and the possible status of the passengers.

In [1]:
#Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

In [3]:
#Get and check data
data = pd.read_csv('../Data/train.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [6]:
#Try to get an idea about the data
data.head(10)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
5,0005_01,Earth,False,F/0/P,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,Sandie Hinetthews,True
6,0006_01,Earth,False,F/2/S,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,Billex Jacostaffey,True
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,,Candra Jacostaffey,True
8,0007_01,Earth,False,F/3/S,TRAPPIST-1e,35.0,False,0.0,785.0,17.0,216.0,0.0,Andona Beston,True
9,0008_01,Europa,True,B/1/P,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,Erraiam Flatic,True


## Step 1: Data Cleaning

We will take a column-wise approach to data cleaning, working through the columns one at a time until the whole dataset is clean.

### 1) Exploring PassengerId and Name

There appears to be a link between PassengerId and 'family': 0003_01 and 0003_02 belong to people with the same last name (0006_01 and 0006_02 in the 10 rows above are another example).

This suggests that PassengerId is divided into 2 parts:
- The first part is a family number on the ship (same last name)
- The second part identifies the individual within the family (possibly in alphabetical order of names)
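The two-part structure described above can be made explicit by splitting the ID on the underscore. This is a minimal sketch on a toy sample copied from the first rows shown earlier, not the full dataset; the column names `GroupId`, `MemberNo` and `GroupSize` are hypothetical and do not exist in the original data.

```python
import pandas as pd

# Toy sample copied from the head() output above (not the full dataset)
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0006_01", "0006_02"],
    "Name": ["Maham Ofracculy", "Altark Susent", "Solam Susent",
             "Billex Jacostaffey", "Candra Jacostaffey"],
})

# Split "gggg_pp" on the underscore into two new columns.
# GroupId and MemberNo are hypothetical names chosen for this sketch.
sample[["GroupId", "MemberNo"]] = sample["PassengerId"].str.split("_", expand=True)
sample["MemberNo"] = sample["MemberNo"].astype(int)

# Number of passengers sharing each GroupId, broadcast back to every row
sample["GroupSize"] = sample.groupby("GroupId")["PassengerId"].transform("count")

print(sample)
```

Keeping `GroupId` as a string preserves the leading zeros, so it still matches the prefix of the original ID.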

In [11]:
#Let's first check for NaN values in the PassengerId and Name columns

print(f"Number of NaN values in passengerId is {data['PassengerId'].isna().sum()}")
print(f"Number of NaN values in Name is {data['Name'].isna().sum()}")


Number of NaN values in passengerId is 0
Number of NaN values in Name is 200


In [19]:
#Are transported passengers more likely to have a NaN name?

(data[data['Name'].isna()].loc[:, ['Transported']] == True).sum()
#101 of the 200 passengers with a missing name were transported (about half), so no obvious link

Transported    101
dtype: int64
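A raw count is hard to judge without a baseline, so one option is to compare the transported rate for rows with and without a name. The sketch below uses a small hypothetical frame (`toy`) in place of `data`; its values are made up for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only
toy = pd.DataFrame({
    "Name": ["A B", None, "C D", None, "E F", "G H"],
    "Transported": [True, True, False, False, True, True],
})

# Label each row by whether its Name is missing
name_status = toy["Name"].isna().map({True: "name missing", False: "name present"})

# The mean of a boolean column is the fraction of True values
rate_by_status = toy.groupby(name_status)["Transported"].mean()
overall_rate = toy["Transported"].mean()

print(rate_by_status)
print(f"Overall transported rate: {overall_rate:.3f}")
```

If the two rates are close to each other and to the overall rate, a missing name is unlikely to be informative about being transported.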

In [None]:
#Just because PassengerId doesn't have NaN values doesn't mean it's clean
#Let's check for duplicate values next, as those should not be allowed

data[data['PassengerId'].duplicated()]

#OK good, no duplicate values here
#This suggests the whole PassengerId column is already clean

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported


In [23]:
#Now let's see what's up with the NaN values of names

data[data['Name'].isna()]

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
27,0022_01,Mars,False,D/0/P,TRAPPIST-1e,21.0,False,980.0,2.0,69.0,0.0,0.0,,False
58,0064_01,Mars,True,F/14/S,TRAPPIST-1e,15.0,False,0.0,0.0,0.0,0.0,0.0,,True
65,0069_01,Earth,False,F/16/S,TRAPPIST-1e,42.0,False,887.0,0.0,9.0,6.0,0.0,,True
77,0082_03,Mars,False,F/16/P,TRAPPIST-1e,8.0,False,0.0,0.0,0.0,0.0,0.0,,True
101,0108_02,Earth,False,G/19/S,TRAPPIST-1e,31.0,False,562.0,0.0,326.0,0.0,0.0,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8629,9205_02,Europa,True,B/300/P,TRAPPIST-1e,15.0,False,0.0,0.0,0.0,0.0,0.0,,True
8631,9208_01,Earth,True,G/1485/S,TRAPPIST-1e,35.0,False,0.0,0.0,0.0,0.0,0.0,,True
8636,9218_01,Europa,True,B/353/S,55 Cancri e,43.0,False,0.0,0.0,0.0,0.0,0.0,,True
8652,9230_01,Europa,False,C/342/S,TRAPPIST-1e,36.0,True,0.0,5600.0,715.0,2868.0,971.0,,True


Hmm, no obvious pattern here; these may just be randomly missing values, which we have no way to fill in. In any case, the Name column is probably not useful on its own, since the name itself doesn't matter.

The PassengerId already carries the 'family' and 'name' information, so the actual name adds nothing.
We will probably drop the Name column entirely at the end of cleaning, since it looks irrelevant to the analysis.

What is interesting is the idea of splitting PassengerId into familyNo and nameNo.

First, let's check our theory that people with the same 'familyId' share the same last name, since that would confirm how the Id works.


In [29]:
#First let's check how many such values there are (how many people are part of 'families')

data['PassengerId'].str[:4].duplicated(keep = False).sum()
#Quite a lot: 3888 of the 8693 passengers share an ID prefix with someone else

np.int64(3888)

In [37]:
#What we want to check:
#For every set of PassengerIds sharing the same first 4 characters, the 2nd part of Name (last name) is the same

#We first build a boolean mask that selects all rows that are part of a 'family'
#Then we group those rows by the first 4 characters (familyCode)
#We apply a filter that checks whether all people in a family have the same last name
#We do this by splitting the Name string into 2 and taking the last name
#Then we count the 'unique' last names, which should be exactly 1 if they all share a last name
#So we filter for groups where the number of unique last names is NOT 1

family_code = data['PassengerId'].str[:4]
in_family = family_code.duplicated(keep = False)
data[in_family].groupby(family_code[in_family]).filter(lambda x: x['Name'].str.split().str[1].nunique() != 1)

#Welp, there goes our theory (966 rows spotted that don't follow it)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
21,0020_01,Earth,True,E/0/S,TRAPPIST-1e,1.0,False,0.0,0.0,0.0,0.0,0.0,Almary Brantuarez,False
22,0020_02,Earth,True,E/0/S,55 Cancri e,49.0,False,0.0,0.0,0.0,0.0,0.0,Glendy Brantuarez,False
23,0020_03,Earth,True,E/0/S,55 Cancri e,29.0,False,0.0,0.0,,0.0,0.0,Mollen Mcfaddennon,False
24,0020_04,Earth,False,E/0/S,TRAPPIST-1e,10.0,False,0.0,0.0,0.0,0.0,0.0,Breney Jacostanley,True
25,0020_05,Earth,True,E/0/S,PSO J318.5-22,1.0,False,,0.0,0.0,0.0,0.0,Mael Brantuarez,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8577,9157_06,Earth,False,G/1476/S,TRAPPIST-1e,12.0,False,0.0,0.0,0.0,0.0,0.0,Pamont Navages,False
8578,9157_07,Earth,True,G/1476/S,TRAPPIST-1e,3.0,False,0.0,0.0,0.0,0.0,0.0,Racey Navages,True
8639,9220_01,Earth,False,G/1496/P,TRAPPIST-1e,25.0,False,2.0,45.0,45.0,0.0,815.0,Branca Meyerthy,False
8640,9220_02,Earth,True,G/1496/P,TRAPPIST-1e,20.0,False,0.0,0.0,0.0,0.0,0.0,Frey Meyerthy,True
