# LAB | Intro to Machine Learning

**Load the data**

In this challenge, we will be working with Spaceship Titanic data. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [13]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tabulate import tabulate


In [15]:
# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
spaceship = pd.read_csv(r"D:\Lab\Week_16\Titanic_Dataset.csv")
print(tabulate(spaceship.head(), headers='keys', tablefmt='grid'))

+----+---------------+--------------+-------------+---------+---------------+-------+-------+---------------+-------------+----------------+-------+----------+-------------------+---------------+
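|    |   PassengerId | HomePlanet   | CryoSleep   | Cabin   | Destination   |   Age | VIP   |   RoomService |   FoodCourt |   ShoppingMall |   Spa |   VRDeck | Name              | Transported   |
+====+===============+==============+=============+=========+===============+=======+=======+===============+=============+================+=======+==========+===================+===============+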
|  0 |       0001_01 | Europa       | False       | B/0/P   | TRAPPIST-1e   |    39 | False |             0 |           0 |              0 |     0 |        0 | Maham Ofracculy   | False         |
+----+---------------+--------------+-------------+---------+---------------+-------+-------+---------------+-------------+----------------+-------+----------+-------------------+---------------+
|  1 |       0002_01 | Earth        | False       | F/0/S   | TRAPPIST-1e   |    24 | False |           109 |           9 |             25 |   549 |       44 | Juanna Vines      | True          |
+----+--------------

**Check the shape of your data**

In [20]:
# Check the shape of your data
print(tabulate([spaceship.shape], headers=['Rows', 'Columns'], tablefmt='grid'))


+--------+-----------+
|   Rows |   Columns |
+========+===========+
|   8693 |        14 |
+--------+-----------+


**Check for data types**

In [22]:
#Check for data types
print(tabulate(spaceship.dtypes.reset_index().values, headers=['Column', 'Data Type'], tablefmt='grid'))

+--------------+-------------+
| Column       | Data Type   |
+==============+=============+
| PassengerId  | object      |
+--------------+-------------+
| HomePlanet   | object      |
+--------------+-------------+
| CryoSleep    | object      |
+--------------+-------------+
| Cabin        | object      |
+--------------+-------------+
| Destination  | object      |
+--------------+-------------+
| Age          | float64     |
+--------------+-------------+
| VIP          | object      |
+--------------+-------------+
| RoomService  | float64     |
+--------------+-------------+
| FoodCourt    | float64     |
+--------------+-------------+
| ShoppingMall | float64     |
+--------------+-------------+
| Spa          | float64     |
+--------------+-------------+
| VRDeck       | float64     |
+--------------+-------------+
| Name         | object      |
+--------------+-------------+
| Transported  | bool        |
+--------------+-------------+


**Check for missing values**

In [25]:
#Check for missing values
print(tabulate(spaceship.isnull().sum().reset_index().values, headers=['Column', 'Missing Values'], tablefmt='grid'))

+--------------+------------------+
| Column       |   Missing Values |
+==============+==================+
| PassengerId  |                0 |
+--------------+------------------+
| HomePlanet   |              201 |
+--------------+------------------+
| CryoSleep    |              217 |
+--------------+------------------+
| Cabin        |              199 |
+--------------+------------------+
| Destination  |              182 |
+--------------+------------------+
| Age          |              179 |
+--------------+------------------+
| VIP          |              203 |
+--------------+------------------+
| RoomService  |              181 |
+--------------+------------------+
| FoodCourt    |              183 |
+--------------+------------------+
| ShoppingMall |              208 |
+--------------+------------------+
| Spa          |              183 |
+--------------+------------------+
| VRDeck       |              188 |
+--------------+------------------+
| Name         |              200 |
+--------------+------------

There are multiple strategies for handling missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.
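
To make the three strategies concrete, here is a minimal sketch on a toy DataFrame. The values are hypothetical and are not taken from the Spaceship Titanic file; `KNNImputer` is only one possible choice for the "algorithm" option and is an assumption, not something this lab requires.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with hypothetical values (not from the Spaceship Titanic file)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Spa": [0.0, 100.0, np.nan, 50.0],
    "VRDeck": [10.0, 20.0, 30.0, 40.0],
    "HomePlanet": ["Earth", "Mars", None, "Earth"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna(axis=0, how="any")

# Strategy 2: fill with one statistic per column
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())  # mean for continuous
filled["Spa"] = filled["Spa"].fillna(filled["Spa"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(
    filled["HomePlanet"].mode()[0]  # mode for categorical
)

# Strategy 3: fill with an algorithm (KNNImputer works on numeric columns only)
numeric_cols = ["Age", "Spa", "VRDeck"]
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy[numeric_cols]), columns=numeric_cols)

print("Rows left after dropna:", len(dropped))
print("Nulls left after fillna:", int(filled.isnull().sum().sum()))
print("Nulls left after KNNImputer:", int(imputed.isnull().sum().sum()))
```

Dropping rows is the simplest option but discards data; the two filling strategies keep every row at the cost of introducing estimated values.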

For this exercise, because each column has relatively few null values (roughly 2% of its rows), we will drop rows containing any missing value.

In [27]:
# Remove all rows containing missing data
spaceship.dropna(axis=0, how='any', inplace=True)

print(tabulate(spaceship.isnull().sum().reset_index().values, headers=['Column', 'Missing Values'], tablefmt='grid'))


+--------------+------------------+
| Column       |   Missing Values |
+==============+==================+
| PassengerId  |                0 |
+--------------+------------------+
| HomePlanet   |                0 |
+--------------+------------------+
| CryoSleep    |                0 |
+--------------+------------------+
| Cabin        |                0 |
+--------------+------------------+
| Destination  |                0 |
+--------------+------------------+
| Age          |                0 |
+--------------+------------------+
| VIP          |                0 |
+--------------+------------------+
| RoomService  |                0 |
+--------------+------------------+
| FoodCourt    |                0 |
+--------------+------------------+
| ShoppingMall |                0 |
+--------------+------------------+
| Spa          |                0 |
+--------------+------------------+
| VRDeck       |                0 |
+--------------+------------------+
| Name         |                0 |
+--------------+------------

In [28]:
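# Forward-fill any remaining missing values (a no-op here, because dropna already removed them).
# DataFrame.ffill replaces the deprecated fillna(method='ffill').
spaceship.ffill(inplace=True)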
print(tabulate(spaceship.isnull().sum().reset_index().values, headers=['Column', 'Missing Values'], tablefmt='grid'))

+--------------+------------------+
| Column       |   Missing Values |
+==============+==================+
| PassengerId  |                0 |
+--------------+------------------+
| HomePlanet   |                0 |
+--------------+------------------+
| CryoSleep    |                0 |
+--------------+------------------+
| Cabin        |                0 |
+--------------+------------------+
| Destination  |                0 |
+--------------+------------------+
| Age          |                0 |
+--------------+------------------+
| VIP          |                0 |
+--------------+------------------+
| RoomService  |                0 |
+--------------+------------------+
| FoodCourt    |                0 |
+--------------+------------------+
| ShoppingMall |                0 |
+--------------+------------------+
| Spa          |                0 |
+--------------+------------------+
| VRDeck       |                0 |
+--------------+------------------+
| Name         |                0 |
+--------------+------------



**KNN**

K Nearest Neighbors is a distance-based algorithm and requires all **input data to be numerical.**
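
To illustrate what "distance based" means, here is a small sketch that computes the Euclidean distance between the first two passengers shown in the preview at the top of this lab. Euclidean distance is assumed here because it is what scikit-learn's KNN uses by default; the variable names are illustrative only.

```python
import numpy as np

# Feature order: Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
# Values copied from rows 0 and 1 of the data preview
passenger_0 = np.array([39, 0, 0, 0, 0, 0], dtype=float)
passenger_1 = np.array([24, 109, 9, 25, 549, 44], dtype=float)

# Euclidean distance: square root of the summed squared differences
squared_diff = (passenger_0 - passenger_1) ** 2
distance = np.sqrt(squared_diff.sum())

# Fraction of the squared distance contributed by the Spa column (index 4)
spa_share = squared_diff[4] / squared_diff.sum()

print(f"Euclidean distance: {distance:.2f}")
print(f"Share of squared distance due to Spa: {spa_share:.1%}")
```

A subtraction like this is undefined for strings such as `"Europa"` or `"Earth"`, which is why categorical columns must be encoded or left out. It also shows that the column with the largest values dominates the distance, which is the motivation for scaling features before fitting KNN.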

Let's only select numerical columns as our features.

In [31]:
# Select only numerical columns
numerical_columns = spaceship.select_dtypes(include=['number']).columns
spaceship_numerical = spaceship[numerical_columns]

# Display the numerical columns
print("Numerical Columns:")
print(tabulate([[col] for col in numerical_columns], headers=['Column'], tablefmt='psql'))

# Display the first few rows of the numerical DataFrame
print("First few rows of the numerical DataFrame:")
print(tabulate(spaceship_numerical.head(), headers='keys', tablefmt='psql'))


Numerical Columns:
+--------------+
| Column       |
|--------------|
| Age          |
| RoomService  |
| FoodCourt    |
| ShoppingMall |
| Spa          |
| VRDeck       |
+--------------+
First few rows of the numerical DataFrame:
+----+-------+---------------+-------------+----------------+-------+----------+
|    |   Age |   RoomService |   FoodCourt |   ShoppingMall |   Spa |   VRDeck |
|----+-------+---------------+-------------+----------------+-------+----------|
|  0 |    39 |             0 |           0 |              0 |     0 |        0 |
|  1 |    24 |           109 |           9 |             25 |   549 |       44 |
|  2 |    58 |            43 |        3576 |              0 |  6715 |       49 |
|  3 |    33 |             0 |        1283 |            371 |  3329 |      193 |
|  4 |    16 |           303 |          70 |            151 |   565 |        2 |
+----+-------+---------------+-------------+----------------+-------+----------+


Let's also define our target.

In [32]:

df = spaceship  # Assign cleaned DataFrame to df

# Separate features (X) and target (y)
X = df.select_dtypes(include=['number'])  # Numerical features
y = df['Transported']  # Target variable

# Convert the target to numerical data
y = y.astype(int)

# Display the first few rows of the features (X) using tabulate
print("Features (X):")
print(tabulate(X.head(), headers='keys', tablefmt='pretty'))

# Display the first few values of the target (y) using tabulate
print("\nTarget (y):")
print(tabulate(pd.DataFrame(y.head()), headers='keys', tablefmt='pretty'))

Features (X):
+---+------+-------------+-----------+--------------+--------+--------+
|   | Age  | RoomService | FoodCourt | ShoppingMall |  Spa   | VRDeck |
+---+------+-------------+-----------+--------------+--------+--------+
| 0 | 39.0 |     0.0     |    0.0    |     0.0      |  0.0   |  0.0   |
| 1 | 24.0 |    109.0    |    9.0    |     25.0     | 549.0  |  44.0  |
| 2 | 58.0 |    43.0     |  3576.0   |     0.0      | 6715.0 |  49.0  |
| 3 | 33.0 |     0.0     |  1283.0   |    371.0     | 3329.0 | 193.0  |
| 4 | 16.0 |    303.0    |   70.0    |    151.0     | 565.0  |  2.0   |
+---+------+-------------+-----------+--------------+--------+--------+

Target (y):
+---+-------------+
|   | Transported |
+---+-------------+
| 0 |      0      |
| 1 |      1      |
| 2 |      0      |
| 3 |      0      |
| 4 |      1      |
+---+-------------+


**Train Test Split**

Now that we have split the data into **features** and **target** variables and imported the **train_test_split** function, split X and y into X_train, X_test, y_train, and y_test. 80% of the data should be in the training set and 20% in the test set.

In [33]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Optional: Display a few rows of the training sets using tabulate
print("\nX_train (first 5 rows):")
print(tabulate(X_train.head(), headers='keys', tablefmt='pretty'))
print("\ny_train (first 5 values):")
print(tabulate(pd.DataFrame(y_train.head()), headers='keys', tablefmt='pretty'))

X_train shape: (5284, 6)
X_test shape: (1322, 6)
y_train shape: (5284,)
y_test shape: (1322,)

X_train (first 5 rows):
+------+------+-------------+-----------+--------------+--------+--------+
|      | Age  | RoomService | FoodCourt | ShoppingMall |  Spa   | VRDeck |
+------+------+-------------+-----------+--------------+--------+--------+
| 7832 | 25.0 |     0.0     |  1673.0   |     0.0      | 642.0  | 612.0  |
| 5842 | 36.0 |     0.0     |  2624.0   |    1657.0    | 2799.0 |  1.0   |
| 3928 | 34.0 |     0.0     |    0.0    |     0.0      |  0.0   |  0.0   |
| 4091 | 37.0 |     0.0     |    0.0    |     0.0      |  0.0   |  0.0   |
| 7679 | 22.0 |     0.0     |    0.0    |     0.0      |  0.0   |  0.0   |
+------+------+-------------+-----------+--------------+--------+--------+

y_train (first 5 values):
+------+-------------+
|      | Transported |
+------+-------------+
| 7832 |      0      |
| 5842 |      0      |
| 3928 |      1      |
| 4091 |      1      |
| 7679 |      1      |
+------+-------------+
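
As a sanity check on how `test_size=0.2` behaves, here is a minimal sketch on hypothetical toy arrays. The `stratify` argument is an optional addition that this lab does not use; it keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, balanced binary target
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y_toy,     # optional: preserve the 50/50 class balance
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Positive rate in train:", y_tr.mean(), "in test:", y_te.mean())
```

The same arithmetic explains the shapes reported above: 5284 + 1322 = 6606 rows, and 20% of 6606 is 1321.2, which scikit-learn rounds up to 1322 test rows.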

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

You need to choose between a **Classifier** and a **Regressor**. Consider the type of the target variable when deciding.

Initialize a KNN instance without setting any hyperparameter.

In [34]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN Classifier
knn = KNeighborsClassifier()

# Display the initialized model (optional)
print(knn)

KNeighborsClassifier()
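
Since `Transported` takes only two values, a classifier is the appropriate choice. Initializing it without arguments means scikit-learn's defaults apply; the sketch below simply inspects a few of them with `get_params()` (the name `knn_default` is illustrative).

```python
from sklearn.neighbors import KNeighborsClassifier

# A fresh classifier with no hyperparameters set
knn_default = KNeighborsClassifier()
params = knn_default.get_params()

# Defaults: 5 neighbours, uniform vote weights, Minkowski metric with p=2 (Euclidean)
for name in ("n_neighbors", "weights", "metric", "p"):
    print(f"{name}: {params[name]}")
```

These are the values you would later tune, for example by trying several values of `n_neighbors`.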


Fit the model to your data.

In [35]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the KNN Classifier
knn = KNeighborsClassifier()

# Fit the model to the scaled training data
knn.fit(X_train_scaled, y_train)

print("KNN model fitted to the training data.")

KNN model fitted to the training data.
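
The cell above fits the scaler on the training set only and then reuses it on the test set, which avoids leaking test-set statistics into training. One way to bundle those two steps is a scikit-learn pipeline. The sketch below is an optional alternative, not part of the lab's required solution, and it uses synthetic stand-in data from `make_classification` rather than the Spaceship Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; 6 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from X_tr only, then fits KNN on the scaled data
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_tr, y_tr)

# predict() applies the training-set scaling to X_te before classifying
demo_pred = model.predict(X_te)
print("Predictions for the first 5 test rows:", demo_pred[:5])
```

With a pipeline there is no separate `X_train_scaled` or `X_test_scaled` to keep track of, so it is harder to pass unscaled data to the model by accident.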


Evaluate your model.

In [36]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assuming knn, X_test_scaled, and y_test are already defined

# Make predictions on the scaled test data
y_pred = knn.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print("Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Evaluation Metrics:
Accuracy: 0.7693
Precision: 0.7622
Recall: 0.7907
F1-score: 0.7762
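
To see where these metrics come from, here is a small worked sketch that derives them from a confusion matrix. The labels are hypothetical and were chosen so the counts are easy to verify by hand; they are not the lab's predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 4 actual negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision_manual = tp / (tp + fp)                  # how many predicted positives are right
recall_manual = tp / (tp + fn)                     # how many actual positives are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy: {accuracy_manual:.4f}")
print(f"Precision: {precision_manual:.4f}")
print(f"Recall: {recall_manual:.4f}")
print(f"F1-score: {f1_manual:.4f}")
```

The same relationship holds for the results above: the reported F1-score of 0.7762 is the harmonic mean of the precision (0.7622) and the recall (0.7907).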


**Congratulations, you have just developed your first Machine Learning model!**