<img src="https://c4.wallpaperflare.com/wallpaper/773/253/247/multiple-display-space-planet-atmosphere-wallpaper-thumb.jpg" style='width:100%'>

# Hands On 3.2 - Let's Challenge

The main objective of this laboratory is to put into practice what you have learned about classification techniques. You will work on a dataset for which you have to build a classification model. The idea is to carry out all the steps you think are useful to clean and preprocess the data, find a model with its best configuration, and validate it on a test set to get the **best score**.

## Spaceship Titanic

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!


<img src='https://storage.googleapis.com/kaggle-media/competitions/Spaceship%20Titanic/joel-filipe-QwoNAhbmLLo-unsplash.jpg' width=400>

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

  _**Spaceship Titanic**, Addison Howard, Ashley Chow, Ryan Holbrook_

In [1]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd

dataset = pd.read_csv('data/train.csv')
dataset.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


## Going into the task

We can imagine many steps needed to perform this task, for instance:
1. **Analysis of missing/null information:** pandas provides the `.isna()` and `.isnull()` methods on Series and DataFrames
2. **Preprocessing:** fill blank values and, if needed, encode categorical values (more about this later)
3. **Feature selection:** select the features that are important (not all of them are necessarily relevant)
4. **Model analysis:** find the best model to complete the task, and search for its optimal configuration
5. **Send the results**
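
The steps above can be chained into a single scikit-learn pipeline. The sketch below is only one possible arrangement, not the expected solution. The toy DataFrame is a hypothetical stand-in for `data/train.csv`, and the median/most-frequent imputation and the `LogisticRegression` classifier are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the training data (hypothetical values, same column names).
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", np.nan, "Earth", "Mars", "Earth"],
    "Age": [39.0, 24.0, 58.0, np.nan, 16.0, 44.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, np.nan, 0.0],
    "Transported": [False, True, False, False, True, True],
})
X, y = toy.drop(columns="Transported"), toy["Transported"]

# Numerical columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Spa"]),
    ("cat", categorical, ["HomePlanet"]),
])

# Preprocessing and classifier are fitted together on the training data.
model = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression())])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the preprocessing inside the pipeline means the same fitted transformations are reused automatically when `model.predict` is called on new data.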

### 1. Missing information

Starting from the training set, we can try to inspect the data. To fill in the gaps, the method [``.fillna()``](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html?highlight=fillna#pandas.Series.fillna) could help.

In [2]:
display(f'Dataset shape: {dataset.shape}', dataset.describe(), dataset.describe(include='O'))

'Dataset shape: (8693, 14)'

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,VIP,Name
count,8693,8492,8476,8494,8511,8490,8493
unique,8693,3,2,6560,3,2,8473
top,0001_01,Earth,False,G/734/S,TRAPPIST-1e,False,Gollux Reedall
freq,1,4602,5439,8,5915,8291,2
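
The `count` rows above already hint at the gaps, since most columns have fewer than 8693 entries. A more direct view is to count the missing values per column with `.isna()`. The snippet below is a minimal sketch on a toy DataFrame (hypothetical values); on the real data, `toy` would simply be replaced by `dataset`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training DataFrame (hypothetical values).
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", None, "Mars"],
    "Age": [39.0, np.nan, 58.0, np.nan],
    "VIP": [False, False, True, None],
})

# .isna() gives a boolean mask; summing counts the missing entries per column,
# and the mean gives the fraction of missing entries.
missing = toy.isna().sum()
ratio = toy.isna().mean()

# Put both in one table, with the most incomplete columns first.
summary = pd.DataFrame({"missing": missing, "ratio": ratio})
summary = summary.sort_values("missing", ascending=False)
print(summary)
```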


In [3]:
dataset[dataset.Age.isna()].fillna()

ValueError: Must specify a fill 'value' or 'method'.
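
The `ValueError` above is raised because `.fillna()` was called without saying what to put in the gaps. One possible fix, sketched below on toy data (hypothetical values), is to pass a fill value explicitly and assign the result back to the column. The median for numerical columns and the most frequent value for categorical ones are common defaults, assumed here for illustration rather than prescribed by the task.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training DataFrame (hypothetical values).
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 24.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Numerical column: fill the gaps with the median of the observed values.
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# Categorical column: fill the gaps with the most frequent value (the mode).
planet_mode = toy["HomePlanet"].mode()[0]
toy["HomePlanet"] = toy["HomePlanet"].fillna(planet_mode)

print(toy)
```

Whatever fill values you compute on the training set should be stored and reused on the test set, rather than recomputed there.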

### 2. Preprocessing
Some models (e.g. neural networks) may have problems with unnormalized or categorical data. To fix this (if required) you can:

- [Normalize](https://scikit-learn.org/stable/modules/preprocessing.html) the input numerical features
- [Encode](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) the categorical ones

> **Remember that anything you do to the training set must also be applied to the test set before getting the result**

In [4]:
from sklearn.preprocessing import StandardScaler
import numpy as np
# from sklearn.

scaler = StandardScaler().fit(dataset[['Age', 'RoomService']])

# you can put them into the dataframe with the .loc method
scaled_columns = scaler.transform(dataset[['Age', 'RoomService']])
print(f"Initial Age distribution: mean={dataset.Age.mean()}, std={dataset.Age.std()}")
print(f"Scaled Age distribution: mean={np.nanmean(scaled_columns[:, 0])}, std={np.nanstd(scaled_columns[:, 0])}")

Initial Age distribution: mean=28.82793046746535, std=14.489021423908726
Scaled Age distribution: mean=6.217457577416897e-17, std=1.0
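
The cell above only computes the scaled values. The sketch below shows one way to write them back into the DataFrame with `.loc`, and to reuse the scaler fitted on the training set for the test set. The two toy DataFrames are hypothetical stand-ins for the real training and test data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["Age", "RoomService"]

# Toy stand-ins for the training and test sets (hypothetical values).
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "RoomService": [0.0, 100.0, 200.0]})
test = pd.DataFrame({"Age": [30.0, 50.0], "RoomService": [100.0, 0.0]})

# Mean and standard deviation are estimated on the training set only.
scaler = StandardScaler().fit(train[num_cols])

# The same fitted scaler is applied to both sets; the test set is never re-fitted.
train.loc[:, num_cols] = scaler.transform(train[num_cols])
test.loc[:, num_cols] = scaler.transform(test[num_cols])

print(train)
print(test)
```

Note that the scaled test columns do not have zero mean or unit variance in general, because they are expressed relative to the training statistics.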


In [5]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

encoder = OrdinalEncoder().fit(dataset[['HomePlanet', 'Cabin']])

# you can put them into the dataframe with the .loc method
encoded_columns = encoder.transform(dataset[['HomePlanet', 'Cabin']])

print(dataset.HomePlanet[:7])

print('\nAre changed to:')
print(encoded_columns[:7, 0])



0    Europa
1     Earth
2    Europa
3    Europa
4     Earth
5     Earth
6     Earth
Name: HomePlanet, dtype: object

Are changed to:
[1. 0. 1. 1. 0. 0. 0.]
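
`OrdinalEncoder` maps categories to integers, which implies an order (Earth < Europa < ...) that has no real meaning for planets. `OneHotEncoder`, imported above but not used, avoids this by creating one binary column per category. The following is a minimal sketch on toy data. The category `"Pluto"` is a hypothetical unseen value, added only to show what `handle_unknown="ignore"` does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the training and test sets (hypothetical values).
train = pd.DataFrame({"HomePlanet": ["Europa", "Earth", "Mars", "Earth"]})
test = pd.DataFrame({"HomePlanet": ["Mars", "Pluto"]})  # "Pluto" never seen in training

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["HomePlanet"]])

# transform() returns a sparse matrix by default; .toarray() makes it dense.
train_encoded = encoder.transform(train[["HomePlanet"]]).toarray()
test_encoded = encoder.transform(test[["HomePlanet"]]).toarray()

print(encoder.categories_[0])  # learned categories, in column order
print(test_encoded)
```

One-hot encoding is practical for columns with few categories such as `HomePlanet`. For a column like `Cabin`, with thousands of unique values, it would create a very wide and sparse matrix.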


### 3. Feature selection
This step is not mandatory, but it can be useful. More about that [here](https://scikit-learn.org/stable/modules/feature_selection.html).
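
As an illustration, univariate selection with `SelectKBest` scores each feature against the target and keeps the `k` best ones. The sketch below uses synthetic data (an assumption, not the challenge dataset) in which only the middle column carries information about the label. `f_classif` and `k=1` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one informative feature between two pure-noise features.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
informative = y + rng.normal(scale=0.3, size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X = np.column_stack([noise_a, informative, noise_b])

# Score every feature with the ANOVA F-test and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
X_selected = selector.transform(X)

print(selector.scores_)        # one score per input feature
print(selector.get_support())  # boolean mask of the kept features
```

As with scaling and encoding, the selector should be fitted on the training set and then applied unchanged to the test set.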

### 4. Model analysis
This is what we did previously, but now the choice of the model and of its configuration is up to you.
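
A possible starting point is a cross-validated grid search followed by a check on held-out data. The sketch below runs on synthetic data. The `RandomForestClassifier` and the small parameter grid are assumptions chosen for illustration, not the recommended model for this challenge.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem standing in for the preprocessed data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a validation set that the search never sees.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Try every combination in the grid with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# By default the best configuration is refitted on the whole training split.
val_accuracy = accuracy_score(y_val, search.predict(X_val))
print(search.best_params_, val_accuracy)
```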

### 5. Send the results

To send the results, you have to export a CSV file with your solution and go to the [Hands On 3 Leaderboard](http://mp3.polito.it:8989/).
Assuming you have an array `y_pred` holding the predictions **ordered** by PassengerId, run the code below.
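
If your test rows are not already sorted, one way to obtain an ordered `y_pred` is to attach the predictions to the ids and sort before extracting the array. This is a sketch on toy data: the DataFrame `test` and the array `raw_predictions` are hypothetical stand-ins for your test set and your model output.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values): ids in arbitrary order and the
# predictions aligned with those rows.
test = pd.DataFrame({"PassengerId": ["0018_01", "0013_01", "0015_01"]})
raw_predictions = np.array([True, False, True])

# Attach the predictions to their ids, then sort by PassengerId so that
# each prediction stays with its passenger.
ordered = test.assign(Transported=raw_predictions).sort_values("PassengerId")
y_pred = ordered["Transported"].to_numpy()

print(ordered)
```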

You need a private (...) key. To get it, there is a file in this folder where you can find it, or you can pass your **NAME** and the initial of your **SURNAME** to the function:

``get_mypk("NAME S.")``

In [None]:
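import pandas as pd

# y_pred must already hold the predictions, ordered by PassengerId
result = pd.DataFrame(y_pred, columns=['Transported'])
result.index.name = 'Id'
result.to_csv('output.csv')

# Look up your private key in the roster file
people = pd.read_csv('roster.aiis.tsv', delimiter='\t')

def get_mypk(name):
    """Return the private key of the student identified by `name`."""
    return people[people.student_id.eq(name)].private_key.item()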