# Defining the Target (Y) and the Features (X)

-First step:

## Import libraries.

In [6]:
import pandas as pd

-Second step:

## Import the data.

In [7]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

-Third step:

## Examine the data.

In [8]:
print(df.head())
print(df.shape)
print(df.dtypes)

  PassengerId HomePlanet CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  39.0  False   
1     0002_01      Earth     False  F/0/S  TRAPPIST-1e  24.0  False   
2     0003_01     Europa     False  A/0/S  TRAPPIST-1e  58.0   True   
3     0003_02     Europa     False  A/0/S  TRAPPIST-1e  33.0  False   
4     0004_01      Earth     False  F/1/S  TRAPPIST-1e  16.0  False   

   RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name  \
0          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy   
1        109.0        9.0          25.0   549.0    44.0       Juanna Vines   
2         43.0     3576.0           0.0  6715.0    49.0      Altark Susent   
3          0.0     1283.0         371.0  3329.0   193.0       Solam Susent   
4        303.0       70.0         151.0   565.0     2.0  Willy Santantines   

   Transported  
0        False  
1         True  
2        False  
3        False  
4         True  
(8

-Fourth step:

## Clean the data

In [9]:
df.isnull()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8689,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8690,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8691,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [10]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are several strategies for handling missing data:

- Removing all rows or all columns that contain missing data.
- Filling missing values with a fixed statistic (for example, the mean for continuous columns or the mode for categorical columns).
- Filling missing values with an imputation algorithm.
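The second strategy can be sketched as follows. This is only an illustration on a hypothetical toy frame (`toy` and its values are made up, not taken from the dataset), and it is not the approach used in the rest of this exercise.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "Age": [39.0, None, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

filled = toy.copy()

# Continuous column: fill the gap with the column mean.
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

# Categorical column: fill the gap with the mode (most frequent value).
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(filled)
```

Filling keeps every row, at the cost of introducing values that were never observed.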

For this exercise, because each column has relatively few null values, we will drop every row that contains a missing value.

In [11]:
df_cleaned = df.dropna()

In [13]:
df_cleaned.shape

(6606, 14)
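Missing values are usually spread over different rows, so dropping rows can remove more data than any single column's null count suggests. A minimal sketch of how to measure that loss, using a hypothetical toy frame (the names `toy`, `cleaned`, `rows_lost` and `share_lost` are illustrative):

```python
import pandas as pd

# Hypothetical toy data: each column has exactly one missing value,
# but the gaps fall in different rows.
toy = pd.DataFrame({
    "Age": [39.0, None, 58.0, 33.0, 16.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa", "Earth"],
})

cleaned = toy.dropna()

# Compare row counts before and after dropping.
rows_lost = len(toy) - len(cleaned)
share_lost = rows_lost / len(toy)

print("Nulls per column:", toy.isnull().sum().to_dict())
print("Rows lost:", rows_lost)
print("Share of rows lost:", share_lost)
```

The same comparison on `df` and `df_cleaned` shows how much of the real dataset the drop removes.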


-Fifth step:

## Define the target (y) and the features (X).

First decide which question you want to answer, because the column that holds the answer becomes the TARGET.

In this exercise the objective is to determine whether a passenger is transported or not, so the target is the column 'Transported': it is the one that contains the data answering the question:

Is the passenger transported?


In [14]:
target = df_cleaned['Transported']

The features will be chosen from the remaining columns.

In our case we will use only the columns with numerical values, so we create a subset of the data that contains just the numeric columns.


In [15]:
df_numeric = df_cleaned.select_dtypes(include = ["number"])

In [16]:
features = df_numeric

Let's check that the dimensions match:

In [18]:
print(df_numeric.shape)
print(target.shape)

(6606, 6)
(6606,)
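To make the selection step more concrete, here is a small sketch on a hypothetical toy frame (the column values are made up). It shows that select_dtypes with include=["number"] keeps integer and float columns and leaves out text and boolean columns, and that the features and the target still share the same row index.

```python
import pandas as pd

# Hypothetical toy data with text, numeric and boolean columns.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Age": [39.0, 24.0, 58.0],
    "Spa": [0.0, 549.0, 6715.0],
    "Transported": [False, True, False],
})

# Keep only the numeric columns as features.
numeric = toy.select_dtypes(include=["number"])
target_toy = toy["Transported"]

print(list(numeric.columns))
# Features and target must describe the same rows.
print(numeric.index.equals(target_toy.index))
```

Matching shapes are necessary but not sufficient; comparing the indexes confirms that each feature row lines up with its own target value.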


# Splitting the data into Test and Training

train_test_split is a function from scikit-learn (the sklearn.model_selection module), so we have to import it.

scikit-learn is a Python library used to build machine learning models.


In [19]:
from sklearn.model_selection import train_test_split


In [20]:
X = df_numeric
y = target

train_test_split does two things:

- Shuffles the rows at random.
- Splits them into a training set and a test set.

The result is:

- a training set: X_train, y_train
- a test set: X_test, y_test

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

(The next two parameters are optional.)

test_size = 0.2 means that 20% of the data is held out for testing.

random_state=42 is the seed used to shuffle the data; fixing it makes the split reproducible.


You can also set the proportion of data used for training instead:

train_size = 0.8
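In scikit-learn the parameter that sets the training fraction is called train_size. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) to show that train_size=0.8 and test_size=0.2 give the same set sizes, and that reusing the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Specify the training fraction...
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42
)
# ...or the test fraction.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
# Repeat the second call with the same seed.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(Xa_train.shape, Xa_test.shape)
print(Xb_train.shape, Xb_test.shape)
print(np.array_equal(Xb_test, Xc_test))
```

Without a fixed random_state, each run would shuffle differently and the rows in each set would change.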

Let's see how much data ends up in the training set and in the test set.

In [22]:
print("X_train size:", X_train.shape)
print("X_test size:", X_test.shape)
print("y_train size:", y_train.shape)
print("y_test size:", y_test.shape)

X_train size: (5284, 6)
X_test size: (1322, 6)
y_train size: (5284,)
y_test size: (1322,)
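These sizes can be checked by hand. The sketch below assumes that, for a fractional test_size, scikit-learn rounds the number of test samples up and assigns the remaining rows to the training set; that assumption is consistent with the shapes printed above.

```python
import math

n_samples = 6606   # rows in df_cleaned
test_size = 0.2    # fraction held out for testing

# Assumed rounding rule: test count is rounded up,
# and the training set gets whatever is left.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print("train:", n_train)
print("test:", n_test)
```

Since 6606 × 0.2 = 1321.2 is not a whole number, the test set ends up slightly larger than exactly 20% of the rows.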
