<a href="https://www.kaggle.com/code/willfeeney/dealing-with-missing-data-spaceship-titanic?scriptVersionId=146243417" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

In this short kernel I consider different ways of dealing with missing data in the Spaceship Titanic challenge.  Having only recently joined Kaggle I was curious as to how much impact different imputation methods could have on the accuracy of my final model.  This isn't an exhaustive analysis of every method but hopefully it gives you a few ideas to think about and experiment with!

I'll look at five different imputing methods:
- Option 1: Replace missing numerical data with median and missing categorical data with most frequent
- Option 2: Replace missing numerical values with 0 and 'ignore' missing categorical data
- Option 3: Use a K-NN (K nearest-neighbour) imputation method 
- Option 4: Use the Iterative Imputer method
- Option 5: Apply some pre-determined rules first, and then complete imputation on any remaining missing data with one of the other options 

As I said above this is just a small subset of the possible imputing methods that could be used.  And even within these options there is the possibility for further tweaking, for instance changing the number of neighbours we look at in the K-NN approach or using mean rather than median in Option 1.  

To avoid getting sidetracked I've kept my feature engineering to a minimum and only used a Gradient Boosting Classifier model.  However, I'd love to hear if anyone has carried out further analysis, for example if certain models work better with certain imputing methods.  

# Set Up

First import the libraries...

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer

Then load up the training and test data, and create a validation data set so that we can later compare how well the different imputing methods work.

In [2]:
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
X_test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")

In [3]:
X_train, X_val, y_train, y_val = train_test_split(train_data.iloc[:,:-1], train_data.iloc[:,-1], test_size=0.20, random_state=7)

As the focus of this kernel is the impact of different imputation methods I have avoided carrying out too much feature engineering.  That being said, I have made a few tweaks commonly seen in other workbooks:

- Splitting out the Cabin feature into Deck, Cabin Number and Side
- Dropping the Cabin feature following this split
- Dropping the Name and PassengerId features
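
As a quick illustration of the Cabin split, here is a minimal sketch on a made-up three-row Series (the cabin values are illustrative, not taken from the data). It shows that a missing Cabin simply becomes a missing Deck, CabinNum and Side, so the split itself does not remove any missing data.

```python
import numpy as np
import pandas as pd

# Toy cabin values in the "Deck/Number/Side" format; the middle one is missing.
cabins = pd.Series(["B/0/P", np.nan, "F/1/S"], name="Cabin")

# Same call as used in the notebook: one new column per "/"-separated part.
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]

print(parts)
```

Note that CabinNum comes back as text rather than a number, so it will be picked up as a categorical feature unless it is converted.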



In [4]:
X_train[["Deck", "CabinNum", "Side"]] = X_train["Cabin"].str.split("/", expand=True)
X_val[["Deck", "CabinNum", "Side"]] = X_val["Cabin"].str.split("/", expand=True)

X_train.drop(['PassengerId', 'Name', 'Cabin'], inplace=True, axis=1)
X_val.drop(['PassengerId', 'Name', 'Cabin'], inplace=True, axis=1)

X_test[["Deck", "CabinNum", "Side"]] = X_test["Cabin"].str.split("/", expand=True)
X_test.drop(['PassengerId', 'Name', 'Cabin'], inplace=True, axis=1)


We can then take a look at how much data is missing in our training data.  From the table we can see that every feature shown has missing values, roughly 2% of its rows, across both numerical and categorical data.

In [5]:
def df_info(df):
    info_df = pd.DataFrame(df.dtypes, columns=['dtypes'])
    info_df["Nan"] = df.isna().sum()
    info_df["Nan %"] = df.isna().sum() / len(df)
    info_df["Nunique"] = df.nunique()
    info_df["count"] = df.count()
    print(f"Shape: {df.shape}")
    return info_df.style.background_gradient(cmap='Reds')

df_info(X_train)

Shape: (6954, 13)


| Feature | dtypes | Nan | Nan % | Nunique | count |
|---|---|---|---|---|---|
| HomePlanet | object | 166 | 0.023871 | 3 | 6788 |
| CryoSleep | object | 165 | 0.023727 | 2 | 6789 |
| Destination | object | 153 | 0.022002 | 3 | 6801 |
| Age | float64 | 154 | 0.022146 | 80 | 6800 |
| VIP | object | 166 | 0.023871 | 2 | 6788 |
| RoomService | float64 | 146 | 0.020995 | 1131 | 6808 |
| FoodCourt | float64 | 148 | 0.021283 | 1272 | 6806 |
| ShoppingMall | float64 | 158 | 0.022721 | 983 | 6796 |
| Spa | float64 | 152 | 0.021858 | 1126 | 6802 |
| VRDeck | float64 | 151 | 0.021714 | 1124 | 6803 |


In [6]:
numeric_features = X_train.select_dtypes(exclude='object').columns
categorical_features = X_train.select_dtypes(include='object').columns

# Option 1: Replace missing numerical data with median and missing categorical data with most frequent

Under this option we replace all missing numerical data with the median value for that feature. We also replace all missing categorical data with the most frequent value for that feature.

In each case I'll apply One-Hot Encoding to the categorical data before fitting the model.
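
To make the two strategies concrete, here is a plain-Python sketch of what they compute for a single column. The `impute` helper and the toy values are hypothetical and only for illustration; the notebook itself uses `SimpleImputer`.

```python
from statistics import median, mode

def impute(values, strategy):
    """Fill None entries with the median or the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [fill if v is None else v for v in values]

# Made-up columns with one missing entry each.
room_service = [0.0, 0.0, 25.0, None, 4000.0]
home_planet = ["Earth", "Europa", None, "Earth", "Mars"]

print(impute(room_service, "median"))        # gap filled with 12.5
print(impute(home_planet, "most_frequent"))  # gap filled with "Earth"
```

The median is barely affected by the single large spender, whereas the mean of the observed values here would be 1006.25, which is one reason to prefer it for skewed spending features.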

In [7]:
numeric_preprocessor1 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])

categorical_preprocessor1 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  
    
])

preprocessor1 = ColumnTransformer(transformers=[
    ('numeric', numeric_preprocessor1, numeric_features),
    ('categorical', categorical_preprocessor1, categorical_features)
])


model_pipeline1 = Pipeline(steps=[
    ('preprocessor', preprocessor1),
    ('model', GradientBoostingClassifier())
])

model_pipeline1


# Option 2: Replace missing numerical values with 0 and 'ignore' missing categorical data

Using this option we replace all missing numerical data with 0.  

No imputer is applied to the categorical features, so missing values go straight into the one-hot encoder. Recent scikit-learn releases treat a missing value as a category of its own, so these rows get a dedicated indicator column; categories that were never seen during fitting are encoded as all zeros because of `handle_unknown='ignore'`. Either way, this effectively results in a 'missing data' category rather than a guessed value.
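
The small sketch below checks how the encoder handles a missing value on a made-up column. It assumes a scikit-learn release recent enough to accept missing values in `OneHotEncoder`, and it calls `.toarray()` on the default sparse output instead of passing `sparse_output=False` so that it does not depend on that parameter name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy HomePlanet-style column with one missing entry (values are illustrative).
train_col = np.array([["Earth"], ["Mars"], [np.nan], ["Earth"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(train_col).toarray()

print(encoder.categories_[0])  # the learned categories, with the missing value listed last
print(encoded)                 # the row with the missing value has a 1 in the last column

# A category never seen during fit is encoded as all zeros.
unseen = encoder.transform(np.array([["Europa"]], dtype=object)).toarray()
print(unseen)
```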

In [8]:
numeric_preprocessor2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
])

categorical_preprocessor2 = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor2 = ColumnTransformer(transformers=[
    ('numeric', numeric_preprocessor2, numeric_features),
    ('categorical', categorical_preprocessor2, categorical_features)
])


model_pipeline2 = Pipeline(steps=[
    ('preprocessor', preprocessor2),
    ('model', GradientBoostingClassifier())
])

model_pipeline2


# Option 3: Use a K-NN imputation method 

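Using this option, when a value is missing we look for the most similar individuals (the nearest neighbours) and fill the gap with the average of their values for that feature. There are different possible orders in which to apply the K-NN imputer and the One-Hot Encoding. I've settled on the pipeline below as it performed well in testing. Note that in this arrangement the imputer on the categorical branch runs after the encoder, whose output contains no missing values, so in practice only the numerical features are K-NN imputed.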
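
Here is a minimal sketch of `KNNImputer` on a tiny made-up matrix, using the same `n_neighbors=2` as the pipeline below. Each gap is filled with the average of that feature over the two closest rows that have it, where closeness is measured only on the features both rows have in common.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data with two missing entries (values are illustrative).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)

print(filled)
# Row 0, last column: mean of 3 and 5 (its two nearest rows)  -> 4.0
# Row 2, first column: mean of 3 and 8 (its two nearest rows) -> 5.5
```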

In [9]:
numeric_preprocessor3 = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2)),
])

categorical_preprocessor3 = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),  
    ('imputer', KNNImputer(n_neighbors=2))
])

preprocessor3 = ColumnTransformer(transformers=[
    ('numeric', numeric_preprocessor3, numeric_features),
    ('categorical', categorical_preprocessor3, categorical_features)
])


model_pipeline3 = Pipeline(steps=[
    ('preprocessor', preprocessor3),
    ('model', GradientBoostingClassifier())
])

model_pipeline3


# Option 4: Apply Iterative Imputer method

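This method models each feature with missing values as a function of the other features and imputes them in a round-robin fashion. By default scikit-learn's `IterativeImputer` uses a Bayesian ridge regression for this. Because the categorical branch below applies the imputer after the one-hot encoder, whose output has no missing values, only the numerical features are actually imputed this way.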
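
As a rough sketch of the idea, the toy example below has a second column that is approximately twice the first, with one entry missing. The data and the `max_iter=10` setting are assumptions for illustration (the pipeline below uses `max_iter=1`); the imputer should fill the gap with a value close to 8.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data: column 1 is roughly 2 * column 0, with one gap.
X = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, np.nan],
    [5.0, 9.8],
    [6.0, 12.1],
])

imputer = IterativeImputer(max_iter=10, random_state=7)
filled = imputer.fit_transform(X)

# The missing entry is predicted from column 0 by the fitted regression.
print(filled[3])
```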

In [10]:
numeric_preprocessor4 = Pipeline(steps=[
    ('imputer', IterativeImputer(max_iter=1, n_nearest_features=5, random_state=7))
])

categorical_preprocessor4 = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ('imputer', IterativeImputer(max_iter=1, n_nearest_features=5, random_state=7)),
    
])

preprocessor4 = ColumnTransformer(transformers=[
    ('numeric', numeric_preprocessor4, numeric_features),
    ('categorical', categorical_preprocessor4, categorical_features)
])


model_pipeline4 = Pipeline(steps=[
    ('preprocessor', preprocessor4),
    ('model', GradientBoostingClassifier())
])

model_pipeline4

# Option 5: Apply set rules first

Sometimes it might be possible to spot some sensible rules to apply.  For this challenge we can make the following assumptions:
- If someone is in CryoSleep then they won't have spent any money onboard.  Therefore, we can set the values of RoomService, FoodCourt, ShoppingMall, Spa and VRDeck to 0.
- Similarly, if someone has spent money then we can assume that they aren't in CryoSleep.

This is a short, simple list of rules and many others have been suggested by fellow Kagglers.  Again I recommend experimenting!

Generally, this will only impute some of the missing values and so will need to be combined with another method to fully impute all missing values.  In this instance I have combined it with Option 1.
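
Before wrapping the rules in a transformer, it can help to see their effect on a few rows. The sketch below uses a hypothetical `apply_rules` helper and a made-up three-passenger DataFrame; it shows that the rules fill some gaps and leave others for the follow-up imputer.

```python
import numpy as np
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

def apply_rules(df):
    """Return a copy of df with the two CryoSleep/spending rules applied."""
    out = df.copy()
    # Rule: anyone with recorded spending cannot have been in CryoSleep.
    spent = (out[SPEND_COLS] > 0).any(axis=1)
    out.loc[spent, "CryoSleep"] = False
    # Rule: anyone in CryoSleep cannot have spent money.
    asleep = out["CryoSleep"] == True
    out.loc[asleep, SPEND_COLS] = 0
    return out

# Toy passengers (illustrative values only).
toy = pd.DataFrame({
    "CryoSleep":    [np.nan, True,   np.nan],
    "RoomService":  [120.0,  np.nan, np.nan],
    "FoodCourt":    [0.0,    np.nan, 0.0],
    "ShoppingMall": [np.nan, 0.0,    0.0],
    "Spa":          [0.0,    0.0,    0.0],
    "VRDeck":       [0.0,    np.nan, 0.0],
})

fixed = apply_rules(toy)
print(fixed)
```

The first passenger's CryoSleep becomes False, the second passenger's missing spending becomes 0, and the third passenger is untouched because neither rule applies.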

In [11]:
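class ApplyImputeRules(TransformerMixin):
    """Apply the CryoSleep/spending rules before the generic imputers run."""

    spend_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

    def fit(self, X, y=None):
        # Nothing to learn: the rules are fixed.
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame (e.g. X_train) is not modified in place.
        X = X.copy()

        # Anyone who has spent money cannot have been in CryoSleep.
        X.loc[(X[self.spend_cols] > 0).any(axis=1), 'CryoSleep'] = False

        # Anyone in CryoSleep cannot have spent money.
        X.loc[X['CryoSleep'] == True, self.spend_cols] = 0

        return X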


numeric_preprocessor5 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])

categorical_preprocessor5 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    
])

preprocessor5 = ColumnTransformer(transformers=[
    ('numeric', numeric_preprocessor5, numeric_features),
    ('categorical', categorical_preprocessor5, categorical_features)
])


model_pipeline5 = Pipeline(steps=[
    ('apply_rules', ApplyImputeRules()),
    ('preprocessor', preprocessor5),
    ('model', GradientBoostingClassifier())
])

model_pipeline5


# Comparison of imputation methods
We can now test out each of the five different methods on our validation data set using the evaluation function below.

In [12]:
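def evaluate(pipe, model_name, X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, random_state=7):
    """Fit `pipe` on the training split and print its accuracy on the validation split.

    `model_name` and `random_state` are accepted but not currently used.
    """
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_val)

    accuracy = accuracy_score(y_val, y_pred)

    print("Accuracy: ", accuracy)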

In [13]:
# Option 1: Replace missing numerical data with median and missing categorical data with most frequent
evaluate(model_pipeline1, 'Gradient Boosting Classifier')

Accuracy:  0.8062104657849338


In [14]:
#Option 2: Replace missing numerical values with 0 and 'ignore' missing categorical data
evaluate(model_pipeline2, 'Gradient Boosting Classifier')

Accuracy:  0.8085106382978723


In [15]:
# Option 3: Use a K-NN imputation method
evaluate(model_pipeline3, 'Gradient Boosting Classifier')

Accuracy:  0.8010350776308223


In [16]:
# Option 4: Apply Iterative Imputer method
evaluate(model_pipeline4, 'Gradient Boosting Classifier')



Accuracy:  0.80448533640023


In [17]:
# Option 5: Apply set rules first, and then complete imputation on any remaining missing data with Option 1 method
evaluate(model_pipeline5, 'Gradient Boosting Classifier')

Accuracy:  0.8079355951696378


# Conclusion

As can be seen, the accuracy of the model varies depending on which imputation method is used. Across various runs, the difference in accuracy between the worst and best methods tended to be about 0.7 percentage points.
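
To put that gap in context, the sketch below converts the accuracies printed above back into numbers of correctly classified passengers. It assumes a validation set of 1,739 rows, which is what a 20% split leaves alongside the 6,954 training rows shown earlier, and it uses a simple binomial standard error as a rough yardstick for sampling noise.

```python
from math import sqrt

n_val = 1739  # assumed validation size (20% split next to 6,954 training rows)

# Validation accuracies as printed in the cells above.
accuracy = {
    "Option 1": 0.8062104657849338,
    "Option 2": 0.8085106382978723,
    "Option 3": 0.8010350776308223,
    "Option 4": 0.80448533640023,
    "Option 5": 0.8079355951696378,
}

# Number of validation passengers each pipeline classified correctly.
correct = {name: round(acc * n_val) for name, acc in accuracy.items()}
spread = max(correct.values()) - min(correct.values())

# Rough binomial standard error of an accuracy measured on n_val rows.
p = sum(accuracy.values()) / len(accuracy)
std_error = sqrt(p * (1 - p) / n_val)

print(correct)
print("Best minus worst:", spread, "passengers")
print("Approx. standard error:", round(std_error, 4))
```

Under these assumptions the best and worst methods differ by 13 passengers, and the standard error of a single accuracy estimate is a little under one percentage point, so the differences between methods on one validation split should be read with some caution.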

Some results were surprising. For example, in this run applying the rules before imputing gave only a marginal gain over plain Option 1 (0.8079 vs 0.8062) and did not beat the simpler Option 2. However, I suspect that for other problems and datasets the impact of the different methods will vary.

Interestingly, the highest score I have achieved in the competition (0.805) was when using the K-NN method.

In any case, there are a lot more variations and alternative imputation methods that could be explored. Through these examples I hope I have demonstrated the impact that different imputation methods can have and encourage you to try out more!

# Final submission

In [18]:
sample_submissions = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')

test_predictions = model_pipeline3.predict(X_test)
sample_submissions['Transported'] = test_predictions
sample_submissions.to_csv('submission.csv', index = False)