_Developer: Seyedali Shohadaeolhosseini, 
@alishhde, MSc. Student at Unibo_

In [65]:
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from xgboost import XGBClassifier

In [83]:
# Load a dataset into a Pandas Dataframe
df = pd.read_csv('datasets/train.csv')
print("Full train dataset shape is {}".format(df.shape))

Full train dataset shape is (8693, 14)


As the output shows, the dataset has 14 columns. Not all of them are useful for predicting the target, however. For example, the passenger's "Name" and "PassengerId" carry little information for this classification problem, so we drop them.

In [84]:
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [85]:
df = df.drop(['Name', 'PassengerId'], axis=1)  # The axis 1 refers to the columns
print(df.shape)
df.head()

(8693, 12)


Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True


In pandas, isna() and isnull() both detect missing values (such as NaN, "Not a Number", or None) in a DataFrame or Series. isnull() is an alias of isna(), so the two are interchangeable and the choice between them is purely a matter of preference.
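
As a quick check of the equivalence described above, the following minimal sketch compares the two methods on a small toy frame. The frame and its values are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only) with one missing entry per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

mask_isna = toy.isna()
mask_isnull = toy.isnull()

# Both calls return the same boolean mask
masks_match = mask_isna.equals(mask_isnull)

# Summing the mask counts the missing values per column
missing_per_column = mask_isna.sum()

print(masks_match)
print(missing_per_column.to_dict())
```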

In [86]:
print("The number of Null values for each column is as follows:\n", df.isna().sum().sort_values(ascending=False))
print("\nThe total number of rows that contain at least one null value is: ", df.isna().any(axis=1).sum())

The number of Null values for each column is as follows:
 CryoSleep       217
ShoppingMall    208
VIP             203
HomePlanet      201
Cabin           199
VRDeck          188
FoodCourt       183
Spa             183
Destination     182
RoomService     181
Age             179
Transported       0
dtype: int64

The total number of rows that contain at least one null value is:  1929


Because the number of null values is considerable, we need to handle them in the preprocessing step. First, let's look at the types of the columns: for the training step, every feature must be numerical and must not contain missing values.
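
One way to list which columns are already numerical and which still need encoding is select_dtypes. The sketch below uses a small toy frame whose values are made up; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the mix of column types in the dataset
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", None],
    "Age": [39.0, np.nan, 58.0],
    "Transported": [False, True, False],
})

# Columns with a numeric dtype can be fed to a model once missing values are filled
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (text, booleans) needs encoding or casting first
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```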

In [87]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8492 non-null   object 
 1   CryoSleep     8476 non-null   object 
 2   Cabin         8494 non-null   object 
 3   Destination   8511 non-null   object 
 4   Age           8514 non-null   float64
 5   VIP           8490 non-null   object 
 6   RoomService   8512 non-null   float64
 7   FoodCourt     8510 non-null   float64
 8   ShoppingMall  8485 non-null   float64
 9   Spa           8510 non-null   float64
 10  VRDeck        8505 non-null   float64
 11  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(5)
memory usage: 755.7+ KB


We first fill the missing values and then split the data into training and test sets.
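
The notebook fills missing values with constants. As an alternative, here is a minimal sketch of KNNImputer applied to numerical columns only, on a made-up toy frame. It is an illustration under that assumption, not the method used to produce the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numerical frame (hypothetical values) with one missing Age
toy = pd.DataFrame({
    "Age": [20.0, 22.0, np.nan, 60.0],
    "Spa": [0.0, 10.0, 5.0, 500.0],
})

# Each missing value is replaced by the mean of that feature over the
# n_neighbors rows closest to it, measured on the features that are present
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
```

Because KNNImputer relies on distances, it only accepts numerical input, which is why categorical columns would have to be encoded or handled separately.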

In [88]:
# # Instantiate the KNNImputer
# # Specify the number of neighbors (n_neighbors) and the strategy for imputation
# imputer = KNNImputer(n_neighbors=2, weights="uniform")

# # Fit and transform the dataset
# # Since KNNImputer works with numerical values, we need to encode the categorical variables first
# df_encoded = pd.get_dummies(df, columns=['HomePlanet', 'Cabin', 'Destination'])

# # Fill missing values
# imputed_data = imputer.fit_transform(df_encoded)

# # Convert the array back to DataFrame
# df = pd.DataFrame(imputed_data, columns=df_encoded.columns)

# # Decode the categorical variables
# df['HomePlanet'] = df.filter(like='HomePlanet').idxmax(axis=1).str.split('_').str[1]
# df['Cabin'] = df.filter(like='Cabin').idxmax(axis=1).str.split('_').str[1]
# df['Destination'] = df.filter(like='Destination').idxmax(axis=1).str.split('_').str[1]

# df.head(20)

In [89]:
df = df.drop(['Cabin'], axis=1)

# Fill missing numerical and boolean-like values with 0
zero_fill_cols = ['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age', 'RoomService']
df[zero_fill_cols] = df[zero_fill_cols].fillna(value=0)

# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which pandas warns will stop working in pandas 3.0
df['Destination'] = df['Destination'].fillna('Unknown')
df['HomePlanet'] = df['HomePlanet'].fillna('Unknown')
df['CryoSleep'] = df['CryoSleep'].astype(int)
df['VIP'] = df['VIP'].astype(int)


In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8693 non-null   object 
 1   CryoSleep     8693 non-null   int64  
 2   Destination   8693 non-null   object 
 3   Age           8693 non-null   float64
 4   VIP           8693 non-null   int64  
 5   RoomService   8693 non-null   float64
 6   FoodCourt     8693 non-null   float64
 7   ShoppingMall  8693 non-null   float64
 8   Spa           8693 non-null   float64
 9   VRDeck        8693 non-null   float64
 10  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), int64(2), object(2)
memory usage: 687.8+ KB


In [91]:
df.isna().any()

HomePlanet      False
CryoSleep       False
Destination     False
Age             False
VIP             False
RoomService     False
FoodCourt       False
ShoppingMall    False
Spa             False
VRDeck          False
Transported     False
dtype: bool

In [92]:
y = df[['Transported']].astype(int)
X = df.drop(['Transported'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)

(6954, 10) (1739, 10)


```python
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'  # passthrough numerical features as they are
)
```

In this part of the code, we are creating a ColumnTransformer object named preprocessor. ColumnTransformer is a utility in scikit-learn that allows you to apply different transformations to different columns in your dataset.

- transformers: This parameter takes a list of tuples, where each tuple consists of three elements:
    - The first element is a string identifier for the transformation.
    - The second element is the transformer object, which defines how to transform the data.
    - The third element is a list of column names (or indices) that the transformation should be applied to.

In our case:

- We're using the identifier 'cat' for the transformation.
- We're using OneHotEncoder() as the transformer, which will perform one-hot encoding on categorical features.
- categorical_features is the list of column names that contain categorical features.

The remainder parameter specifies what to do with the columns that were not explicitly transformed. In this case, 'passthrough' means that the numerical features will be passed through without any transformation.

After defining the preprocessor, we apply it to the training and testing sets:

```python
X_train_enc = preprocessor.fit_transform(X_train)
X_test_enc = preprocessor.transform(X_test)
```
- fit_transform(X_train): This method fits the transformer to the training data and then transforms it. It learns how to transform the data based on the training set.
- transform(X_test): This method applies the learned transformation to the test set. It's important to note that we only transform the test set using the learned parameters from the training set. We do not refit the transformer on the test set, as this can lead to data leakage.
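
The fit/transform split can be seen on a small toy example. In this sketch the frames are made up, and handle_unknown="ignore" is an addition that the notebook does not use; it is included here only to show what happens when the test set contains a category that was never seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "Mars" appears only in the test frame
train = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Earth"]})
test = pd.DataFrame({"HomePlanet": ["Europa", "Mars"]})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")

# The categories are learned from the training frame only
train_enc = encoder.fit_transform(train).toarray()

# The test frame is encoded with the categories learned above
test_enc = encoder.transform(test).toarray()

learned_categories = encoder.categories_[0].tolist()
print(learned_categories)
print(test_enc)
```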

Overall, the ColumnTransformer allows us to apply different preprocessing steps to different columns in our dataset efficiently, making it a powerful tool for data preprocessing in scikit-learn pipelines.

In [93]:
# Define which features are categorical
categorical_features = ['HomePlanet', 'Destination']

# Apply one-hot encoding to categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'  # passthrough numerical features as they are
)

X_train_enc_array = preprocessor.fit_transform(X_train)
X_test_enc_array = preprocessor.transform(X_test)

# Convert the transformed arrays back to DataFrames
X_train = pd.DataFrame(X_train_enc_array, columns=preprocessor.get_feature_names_out())
X_test = pd.DataFrame(X_test_enc_array, columns=preprocessor.get_feature_names_out())


## Column names after the previous encoding step


In [94]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6954 entries, 0 to 6953
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   cat__HomePlanet_Earth           6954 non-null   float64
 1   cat__HomePlanet_Europa          6954 non-null   float64
 2   cat__HomePlanet_Mars            6954 non-null   float64
 3   cat__HomePlanet_Unknown         6954 non-null   float64
 4   cat__Destination_55 Cancri e    6954 non-null   float64
 5   cat__Destination_PSO J318.5-22  6954 non-null   float64
 6   cat__Destination_TRAPPIST-1e    6954 non-null   float64
 7   cat__Destination_Unknown        6954 non-null   float64
 8   remainder__CryoSleep            6954 non-null   float64
 9   remainder__Age                  6954 non-null   float64
 10  remainder__VIP                  6954 non-null   float64
 11  remainder__RoomService          6954 non-null   float64
 12  remainder__FoodCourt            69

In [95]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1739 entries, 0 to 1738
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   cat__HomePlanet_Earth           1739 non-null   float64
 1   cat__HomePlanet_Europa          1739 non-null   float64
 2   cat__HomePlanet_Mars            1739 non-null   float64
 3   cat__HomePlanet_Unknown         1739 non-null   float64
 4   cat__Destination_55 Cancri e    1739 non-null   float64
 5   cat__Destination_PSO J318.5-22  1739 non-null   float64
 6   cat__Destination_TRAPPIST-1e    1739 non-null   float64
 7   cat__Destination_Unknown        1739 non-null   float64
 8   remainder__CryoSleep            1739 non-null   float64
 9   remainder__Age                  1739 non-null   float64
 10  remainder__VIP                  1739 non-null   float64
 11  remainder__RoomService          1739 non-null   float64
 12  remainder__FoodCourt            17

In [96]:
import numpy as np

# Define individual classifiers
random_forest = RandomForestClassifier(n_estimators=100)
svm = SVC(probability=True)  # probability=True enables predict_proba, which soft voting needs
gbm = GradientBoostingClassifier()
naive_bayes = GaussianNB()
knn = KNeighborsClassifier()
xgb = XGBClassifier()

# Create a voting classifier with the individual classifiers
ensemble_clf = VotingClassifier(estimators=[
    ('random_forest', random_forest),
    ('svm', svm),
    ('gradient_boosting', gbm),
    ('naive_bayes', naive_bayes),
    ('knn', knn),
    ('XGBoost', xgb)
], voting='soft')  # 'soft' voting averages the predicted class probabilities

# Flatten the single-column target DataFrames into 1-D arrays, as scikit-learn expects
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)


# Train the ensemble classifier
ensemble_clf.fit(X_train, y_train)

# Make predictions
y_pred = ensemble_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Ensemble Classifier Accuracy:", accuracy)

Ensemble Classifier Accuracy: 0.7786083956296722
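
To make the soft-voting rule concrete, the sketch below reproduces it by hand on synthetic data: it averages the class probabilities of the fitted base models and picks the most probable class. The two base models and the synthetic data are chosen only to keep the example small and deterministic; they are not the ensemble trained above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("log_reg", LogisticRegression(max_iter=1000)),
        ("naive_bayes", GaussianNB()),
    ],
    voting="soft",
)
voter.fit(X, y)

# The fitted copies of the base models are stored in estimators_
manual_proba = np.mean([est.predict_proba(X) for est in voter.estimators_], axis=0)

# The predicted label is the class with the highest averaged probability
manual_labels = voter.classes_[np.argmax(manual_proba, axis=1)]

matches_proba = np.allclose(manual_proba, voter.predict_proba(X))
matches_labels = np.array_equal(manual_labels, voter.predict(X))
print(matches_proba, matches_labels)
```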


This accuracy is not the best we can achieve. One way to improve it is to preprocess the data more carefully.
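
One possible direction, sketched here only as an assumption, is to keep the "Cabin" column instead of dropping it and split values such as "B/0/P" into separate parts. The column names "Deck", "CabinNum", and "Side" are hypothetical, and whether these features actually improve the accuracy has not been tested in this notebook.

```python
import pandas as pd

# Toy frame with Cabin values in the same format as the dataset, plus one missing value
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", None]})

# Split "deck/number/side" into three columns; missing cabins stay missing
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]  # hypothetical names

# The middle part is a number, so convert it from text
parts["CabinNum"] = pd.to_numeric(parts["CabinNum"])

toy = pd.concat([toy.drop(columns="Cabin"), parts], axis=1)
print(toy)
```

The resulting "Deck" and "Side" columns are categorical and could be one-hot encoded alongside "HomePlanet" and "Destination", after their missing values are filled.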

In [97]:
# Load the test dataset
test_df = pd.read_csv('datasets/test.csv')
submission_id = test_df.PassengerId

## Preprocessing Step 1: same cleaning as for the training data
test_df = test_df.drop(['Cabin'], axis=1)
zero_fill_cols = ['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age', 'RoomService']
test_df[zero_fill_cols] = test_df[zero_fill_cols].fillna(value=0)
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which pandas warns will stop working in pandas 3.0
test_df['Destination'] = test_df['Destination'].fillna('Unknown')
test_df['HomePlanet'] = test_df['HomePlanet'].fillna('Unknown')
test_df['CryoSleep'] = test_df['CryoSleep'].astype(int)
test_df['VIP'] = test_df['VIP'].astype(int)

## Preprocessing Step 2: reuse the encoder fitted on the training data
test_df_arr = preprocessor.transform(test_df)

# Convert the transformed array back to a DataFrame
test_df = pd.DataFrame(test_df_arr, columns=preprocessor.get_feature_names_out())

# Get the predictions for the test data; predict() returns 0/1 labels, so cast them to bool
predictions = ensemble_clf.predict(test_df)
n_predictions = predictions.astype(bool)
output = pd.DataFrame({'PassengerId': submission_id,
                       'Transported': n_predictions})

output.head()


Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True


In [98]:
sample_submission_df = pd.read_csv('datasets/sample_submission.csv')
sample_submission_df['Transported'] = n_predictions
sample_submission_df.to_csv('datasets/submission.csv', index=False)
sample_submission_df.head()

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True
