# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [32]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [33]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [34]:
spaceship.shape

(8693, 14)

In [35]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [36]:
spaceship.isnull()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8689,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8690,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8691,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [37]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [38]:
spaceship_cleaned = spaceship.dropna()

In [39]:
spaceship_cleaned.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

In [40]:
spaceship_cleaned.shape

(6606, 14)

In [41]:
spaceship_cleaned['Cabin'].unique()

array(['B/0/P', 'F/0/S', 'A/0/S', ..., 'G/1499/S', 'G/1500/S', 'E/608/S'],
      dtype=object)

In [42]:
# Extract only the leading letter of each value in the "Cabin" column
spaceship_cleaned['Cabin_section'] = spaceship_cleaned['Cabin'].str.extract(r'^([A-Z])')


print(spaceship_cleaned['Cabin_section'])


0       B
1       F
2       A
3       A
4       F
       ..
8688    A
8689    G
8690    G
8691    E
8692    E
Name: Cabin_section, Length: 6606, dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  spaceship_cleaned['Cabin_section'] = spaceship_cleaned['Cabin'].str.extract(r'^([A-Z])')
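The warning above appears because `spaceship_cleaned` was derived from `spaceship` and pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it is to take an explicit copy right after `dropna()`. The sketch below uses a tiny made-up frame (`raw` and `cleaned` are illustrative names, not part of the notebook) and assumes the same regular expression as the cell above.

```python
import pandas as pd

# Small stand-in frame; the notebook itself uses the Spaceship Titanic data.
raw = pd.DataFrame({
    "Cabin": ["B/0/P", None, "A/0/S", "G/1499/S"],
    "Age": [39.0, 24.0, None, 18.0],
})

# .copy() makes the cleaned frame independent of `raw`, so adding a column
# afterwards is an ordinary assignment on its own data.
cleaned = raw.dropna().copy()

# Extract the leading deck letter from values such as "B/0/P".
# expand=False returns a Series because the pattern has a single group.
cleaned["Cabin_section"] = cleaned["Cabin"].str.extract(r"^([A-Z])", expand=False)

print(cleaned)
```

In the notebook this would amount to writing `spaceship_cleaned = spaceship.dropna().copy()` in the cell that drops the missing values.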


In [43]:
spaceship_cleaned['Cabin_section'].value_counts()

Cabin_section
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64

In [44]:
spaceship_cleaned = spaceship_cleaned.drop(['PassengerId', 'Name'], axis=1)

In [45]:
spaceship_cleaned.shape

(6606, 13)

In [46]:
print(spaceship_cleaned.columns)

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'Cabin_section'],
      dtype='object')


In [47]:
spaceship_cleaned.dtypes

HomePlanet        object
CryoSleep         object
Cabin             object
Destination       object
Age              float64
VIP               object
RoomService      float64
FoodCourt        float64
ShoppingMall     float64
Spa              float64
VRDeck           float64
Transported         bool
Cabin_section     object
dtype: object

In [48]:
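# One-hot encode the non-numerical columns with get_dummies.
# Note: the raw 'Cabin' column gets one dummy per distinct cabin, which accounts
# for most of the 5323 feature columns reported after the train/test split.
spaceship_cleaned = pd.get_dummies(spaceship_cleaned, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Cabin_section'], drop_first=True)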

In [49]:
spaceship_cleaned.head(5)

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True,Cabin_section_B,Cabin_section_C,Cabin_section_D,Cabin_section_E,Cabin_section_F,Cabin_section_G,Cabin_section_T
0,39.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,...,False,True,False,True,False,False,False,False,False,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,False,...,False,True,False,False,False,False,False,True,False,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,...,False,True,True,False,False,False,False,False,False,False
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,...,False,True,False,False,False,False,False,False,False,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,False,...,False,True,False,False,False,False,False,True,False,False


In [50]:
target = spaceship_cleaned['Transported']
features = spaceship_cleaned.drop(columns = ['Transported'])

In [51]:
print(target)
print(features)

0       False
1        True
2       False
3       False
4        True
        ...  
8688    False
8689    False
8690     True
8691    False
8692     True
Name: Transported, Length: 6606, dtype: bool
       Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  \
0     39.0          0.0        0.0           0.0     0.0     0.0   
1     24.0        109.0        9.0          25.0   549.0    44.0   
2     58.0         43.0     3576.0           0.0  6715.0    49.0   
3     33.0          0.0     1283.0         371.0  3329.0   193.0   
4     16.0        303.0       70.0         151.0   565.0     2.0   
...    ...          ...        ...           ...     ...     ...   
8688  41.0          0.0     6819.0           0.0  1643.0    74.0   
8689  18.0          0.0        0.0           0.0     0.0     0.0   
8690  26.0          0.0        0.0        1872.0     1.0     0.0   
8691  32.0          0.0     1049.0           0.0   353.0  3235.0   
8692  44.0        126.0     4688.0           0.0     

**Perform Train Test Split**

In [52]:
from sklearn.model_selection import train_test_split

In [53]:
X = features
y = target

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [55]:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


X_train shape: (5284, 5323)
X_test shape: (1322, 5323)
y_train shape: (5284,)
y_test shape: (1322,)
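The 5323 feature columns come mostly from one-hot encoding the raw `Cabin` string, which is almost unique per passenger. A possible alternative, sketched here on a made-up four-row frame, is to drop `Cabin` once `Cabin_section` has been extracted. This is an assumption about what a leaner feature set could look like; it changes the features relative to the previous Lab, so the comparison with earlier results would no longer be like for like.

```python
import pandas as pd

# Illustrative data only; column names follow the notebook.
df = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Europa", "Mars"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"],
    "Age": [39.0, 24.0, 58.0, 16.0],
})
df["Cabin_section"] = df["Cabin"].str.extract(r"^([A-Z])", expand=False)

# Encoding the raw Cabin string creates one dummy per distinct cabin.
wide = pd.get_dummies(
    df, columns=["HomePlanet", "Cabin", "Cabin_section"], drop_first=True
)

# Dropping Cabin first keeps only the deck-letter dummies.
narrow = pd.get_dummies(
    df.drop(columns=["Cabin"]), columns=["HomePlanet", "Cabin_section"], drop_first=True
)

print("with Cabin dummies:   ", wide.shape)
print("without Cabin dummies:", narrow.shape)
```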


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [56]:
# Scaling

from sklearn.preprocessing import MinMaxScaler

In [57]:
normalizer = MinMaxScaler()

In [58]:
# Scale the input data X
features_scaled = normalizer.fit_transform(X)

# If y is a one-dimensional vector, convert it to 2D
if isinstance(y, pd.Series):
    y = y.values.reshape(-1, 1)
# Note: this refits the same scaler on y; the boolean target simply becomes 0.0/1.0
target_scaled = normalizer.fit_transform(y)

# Convert the normalized data to DataFrames to keep the original column labels
X_scaled_df = pd.DataFrame(features_scaled, columns=X.columns)
y_scaled_df = pd.DataFrame(target_scaled, columns=['target'])

In [59]:
print(X_scaled_df)
print(y_scaled_df)

           Age  RoomService  FoodCourt  ShoppingMall       Spa    VRDeck  \
0     0.493671     0.000000   0.000000      0.000000  0.000000  0.000000   
1     0.303797     0.010988   0.000302      0.002040  0.024500  0.002164   
2     0.734177     0.004335   0.119948      0.000000  0.299670  0.002410   
3     0.417722     0.000000   0.043035      0.030278  0.148563  0.009491   
4     0.202532     0.030544   0.002348      0.012324  0.025214  0.000098   
...        ...          ...        ...           ...       ...       ...   
6601  0.518987     0.000000   0.228726      0.000000  0.073322  0.003639   
6602  0.227848     0.000000   0.000000      0.000000  0.000000  0.000000   
6603  0.329114     0.000000   0.000000      0.152779  0.000045  0.000000   
6604  0.405063     0.000000   0.035186      0.000000  0.015753  0.159077   
6605  0.556962     0.012702   0.157247      0.000000  0.000000  0.000590   

      HomePlanet_Europa  HomePlanet_Mars  CryoSleep_True  Cabin_A/0/S  ...  \
0        

In [60]:
#your code here

In [73]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled_df, y_scaled_df, test_size=0.2, random_state=42
)
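The cells above scale the full data set before splitting, so the scaler sees the test rows. A common alternative is to split first, fit the scaler on the training rows only, and leave the boolean target unscaled. The sketch below illustrates that order on randomly generated stand-in data (the `Age` and `Spa` values here are synthetic, not the real columns).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the notebook's features and target.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Age": rng.uniform(0, 80, 200),
    "Spa": rng.uniform(0, 5000, 200),
})
y = pd.Series(rng.integers(0, 2, 200).astype(bool), name="Transported")

# Split first so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

# The target stays a 1-D boolean Series; it does not need scaling.
print(X_train_scaled.shape, X_test_scaled.shape, y_train.shape)
```

With this order, test values can fall slightly outside the [0, 1] range, which is expected because the minimum and maximum come from the training rows.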


**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [69]:
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

In [70]:
#your code here

bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

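In [72]:
# Fit on the training split. X_scaled_df has 6606 rows while y_train has 5284,
# so fitting on it raises "ValueError: Found input variables with inconsistent
# numbers of samples: [6606, 5284]".
bagging_reg.fit(X_train, y_train.values.ravel())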
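Because `Transported` is a boolean label, a classifier is arguably a more natural fit than `BaggingRegressor`. The sketch below swaps in `BaggingClassifier` and shows the difference between bagging (sampling with replacement, `bootstrap=True`) and pasting (sampling without replacement, `bootstrap=False`). It runs on synthetic data from `make_classification`, so the scores say nothing about the Spaceship Titanic data, and the hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data as a stand-in for the real features.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging: each tree sees 200 training rows drawn with replacement.
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=20),
    n_estimators=50, max_samples=200, bootstrap=True, random_state=42,
)

# Pasting: same setup, but rows are drawn without replacement.
pasting = BaggingClassifier(
    DecisionTreeClassifier(max_depth=20),
    n_estimators=50, max_samples=200, bootstrap=False, random_state=42,
)

scores = {}
for name, model in [("bagging", bagging), ("pasting", pasting)]:
    # X and y must come from the same split, otherwise fit() raises ValueError.
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name} test accuracy: {scores[name]:.3f}")
```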

- Random Forests

In [63]:
#your code here
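One possible starting point for this step is `RandomForestClassifier`, which combines bagging with a random subset of features at each split. This is a minimal sketch on synthetic data, with hyperparameters chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the notebook's X_train / y_train.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
forest.fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
print(f"Random forest test accuracy: {forest_accuracy:.3f}")

# Impurity-based importances, one value per feature, summing to 1.
importances = forest.feature_importances_
print(importances.round(3))
```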

- Gradient Boosting

In [64]:
#your code here
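A minimal sketch for this step, assuming `GradientBoostingClassifier` with its default-like settings, could look as follows. The data is synthetic and the values of `n_estimators`, `learning_rate` and `max_depth` are assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the notebook's X_train / y_train.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each new shallow tree is fitted to correct the errors of the current ensemble.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb.fit(X_train, y_train)

gb_accuracy = gb.score(X_test, y_test)
print(f"Gradient boosting test accuracy: {gb_accuracy:.3f}")
```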

- Adaptive Boosting

In [65]:
#your code here
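For adaptive boosting, one option is `AdaBoostClassifier` over decision stumps (trees of depth 1), where each round gives more weight to the samples the previous learners misclassified. The sketch below is illustrative only: synthetic data and untuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the notebook's X_train / y_train.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Decision stumps as weak learners, boosted for up to 100 rounds.
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=42
)
ada.fit(X_train, y_train)

ada_accuracy = ada.score(X_test, y_test)
print(f"AdaBoost test accuracy: {ada_accuracy:.3f}")
```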

Which model is the best and why?

In [66]:
#comment here
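To answer the question with evidence, one approach is to score all candidates with the same cross-validation folds and compare the mean accuracy. The sketch below assumes 5-fold cross-validation and classifier versions of the four ensembles; it runs on synthetic data, so the ranking it prints does not carry over to the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the notebook's scaled features and target.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

models = {
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=20), n_estimators=50, random_state=42
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# Mean accuracy over 5 folds for every model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(cv_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>18}: {score:.3f}")

best_name = max(cv_accuracy, key=cv_accuracy.get)
print("Highest mean CV accuracy:", best_name)
```

Accuracy alone may not settle the question; training time and the gap between training and validation scores are also worth noting in the written answer.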