# My Approach to the Kaggle Challenge   

Challenge: https://www.kaggle.com/competitions/playground-series-s3e7/overview   

This notebook only offers some hints on how to approach the challenge; the final solution is not reported here.

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (15, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

import warnings
warnings.simplefilter(action='ignore')

import opendatasets as od

# Importing Data

In [2]:
od.download('https://www.kaggle.com/competitions/playground-series-s3e7/data?select=train.csv')

Skipping, found downloaded files in ".\playground-series-s3e7" (use force=True to force download)


In [3]:
od.download('https://www.kaggle.com/datasets/gauravduttakiit/reservation-cancellation-prediction?select=train__dataset.csv')

Skipping, found downloaded files in ".\reservation-cancellation-prediction" (use force=True to force download)


In [4]:
url = './playground-series-s3e7/'

train = pd.read_csv(url+'train.csv').drop(columns='id')
test = pd.read_csv(url+'test.csv').drop(columns='id')
submission = pd.read_csv(url + 'sample_submission.csv')

original = pd.read_csv('./reservation-cancellation-prediction/train__dataset.csv')

# EDA and Feature Engineering

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42100 entries, 0 to 42099
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          42100 non-null  int64  
 1   no_of_children                        42100 non-null  int64  
 2   no_of_weekend_nights                  42100 non-null  int64  
 3   no_of_week_nights                     42100 non-null  int64  
 4   type_of_meal_plan                     42100 non-null  int64  
 5   required_car_parking_space            42100 non-null  int64  
 6   room_type_reserved                    42100 non-null  int64  
 7   lead_time                             42100 non-null  int64  
 8   arrival_year                          42100 non-null  int64  
 9   arrival_month                         42100 non-null  int64  
 10  arrival_date                          42100 non-null  int64  
 11  market_segment_

Checking if the target variable is unbalanced.

In [6]:
train.booking_status.value_counts()

0    25596
1    16504
Name: booking_status, dtype: int64

The classes are only mildly imbalanced (roughly 61% vs. 39%). Checking whether the same holds for the original dataset.

In [7]:
original.booking_status.value_counts()

0    12195
1     5942
Name: booking_status, dtype: int64
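Raw counts are easier to compare as shares. The sketch below only reuses the counts printed above; the frame name `counts` is made up for illustration. On the real data, `train.booking_status.value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Class counts copied from the two value_counts() outputs above.
counts = pd.DataFrame(
    {"train": [25596, 16504], "original": [12195, 5942]},
    index=pd.Index([0, 1], name="booking_status"),
)

# Divide each column by its total to turn counts into class shares.
shares = counts / counts.sum()
print(shares.round(3))
```

The positive class makes up about 39% of the competition data and about 33% of the original data, so the two sets are similar but not identical in balance.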

## Feature Creation Ideas  

I tried several new features in this section, but almost none of them improved the score.  
I kept only the one step that helped:

- removing incorrect information from the date features   

### Removing Incorrect Information  

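Many rows contain impossible dates (for example, day 31 in a 30-day month), so they cannot be turned into a valid calendar date. The rows with such dates are dropped.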

In [8]:
train.drop(index=train[(train.arrival_month == 2) & (train.arrival_date > 28)].index, inplace=True)
train.drop(index=train[(train.arrival_month == 4) & (train.arrival_date > 30)].index, inplace=True) 
train.drop(index=train[(train.arrival_month == 6) & (train.arrival_date > 30)].index, inplace=True) 
train.drop(index=train[(train.arrival_month == 9) & (train.arrival_date > 30)].index, inplace=True)
train.drop(index=train[(train.arrival_month == 11) & (train.arrival_date > 30)].index, inplace=True)
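As an alternative to one `drop` per month, pandas can be asked to validate the dates itself. This is a minimal sketch on a made-up frame (`demo` and its values are hypothetical; only the column names come from the notebook). Unlike the `> 28` rule above, it keeps 29 February in leap years.

```python
import pandas as pd

# Toy frame using the notebook's column names; the values are made up.
demo = pd.DataFrame({
    "arrival_year":  [2018, 2018, 2018, 2018],
    "arrival_month": [2, 2, 4, 12],
    "arrival_date":  [28, 30, 31, 31],
})

# to_datetime can assemble dates from year/month/day columns;
# errors="coerce" turns impossible combinations into NaT.
parsed = pd.to_datetime(
    demo.rename(columns={
        "arrival_year": "year",
        "arrival_month": "month",
        "arrival_date": "day",
    }),
    errors="coerce",
)

# Keep only the rows whose date could be parsed.
valid = parsed.notna()
demo_clean = demo[valid]
print(demo_clean)
```

Here 30 February and 31 April are rejected, while 28 February and 31 December are kept.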

Checking whether the class balance has changed after dropping those rows.

In [9]:
train.booking_status.value_counts()

0    25559
1    16491
Name: booking_status, dtype: int64

Only 50 rows were dropped, so the class balance is essentially unchanged.

## Creating X and y 

In [10]:
X = train.drop(columns='booking_status')
y = train.booking_status

## Defining Categorical Columns

In [11]:
from sklearn.preprocessing import OneHotEncoder

In [12]:
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']

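# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on recent versions use sparse_output=False.
encoder = OneHotEncoder(sparse=False).fit(X[categorical_cols])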

encoded_cols = list(encoder.get_feature_names_out(categorical_cols))

X[encoded_cols] = encoder.transform(X[categorical_cols])
test[encoded_cols] = encoder.transform(test[categorical_cols])

## Scaling Values

In [13]:
from sklearn.preprocessing import RobustScaler

In [14]:
# Note: this list also contains the one-hot columns created above, so they are scaled as well.
numerical_cols = X.drop(columns=categorical_cols).columns.tolist()

scaler = RobustScaler().fit(X[numerical_cols])

X[numerical_cols] = scaler.transform(X[numerical_cols])
test[numerical_cols] = scaler.transform(test[numerical_cols])
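In the cells above, the raw category codes stay in `X` and the one-hot columns are scaled together with the numerical ones. If you would rather encode and scale strictly separate column groups, scikit-learn's `ColumnTransformer` is one option. The sketch below is not what this notebook used for its scores; `X_demo`, `cat_cols`, `num_cols` and `preprocess` are hypothetical names and the values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data: two of the notebook's column names, made-up values.
X_demo = pd.DataFrame({
    "type_of_meal_plan": [0, 1, 0, 2],
    "lead_time": [10, 200, 35, 90],
})
cat_cols = ["type_of_meal_plan"]
num_cols = ["lead_time"]

# Each transformer sees only its own columns; sparse_threshold=0
# forces a dense array regardless of the encoder's default output.
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("scale", RobustScaler(), num_cols),
    ],
    sparse_threshold=0,
)

X_ready = preprocess.fit_transform(X_demo)
print(X_ready.shape)  # 3 one-hot columns + 1 scaled column
```

The fitted `preprocess` object can then be applied to the test set with `preprocess.transform(...)`, which keeps train and test processing identical.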

# Model   

The following model is only one of the many attempts I made to achieve a good score in the challenge.  

Incidentally, I obtained my best score without the original dataset and without new variables (I had created some in my first run), while using the robust scaler.

In [15]:
from xgboost import XGBClassifier 
from sklearn.model_selection import RandomizedSearchCV

Best mean test score: 0.913138

The search below is an example of where I started.  

I found RandomizedSearchCV really helpful.

In [16]:
classification = RandomizedSearchCV(XGBClassifier(n_jobs=-1, n_estimators=500, random_state=42),{
    'max_depth': [1,2,3,4,5,6,7,8,9],
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
    'min_child_weight': [1,3,5,7,9,11,13,15]
}, cv=5, return_train_score=False, scoring='roc_auc', n_iter=10) 

classification.fit(X,y) 

pd.DataFrame(classification.cv_results_).sort_values('rank_test_score').head(3)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_child_weight | param_max_depth | param_learning_rate | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 4.893324 | 0.019715 | 0.029577 | 0.001829 | 15 | 3 | 0.2 | {'min_child_weight': 15, 'max_depth': 3, 'lear... | 0.891888 | 0.901862 | 0.899508 | 0.902134 | 0.894353 | 0.897949 | 0.004121 | 1 |
| 0 | 7.255245 | 0.527666 | 0.037725 | 0.006714 | 7 | 5 | 0.2 | {'min_child_weight': 7, 'max_depth': 5, 'learn... | 0.892689 | 0.901285 | 0.898089 | 0.900741 | 0.893493 | 0.897259 | 0.00358 | 2 |
| 3 | 7.692775 | 0.036258 | 0.038852 | 0.00169 | 1 | 5 | 0.2 | {'min_child_weight': 1, 'max_depth': 5, 'learn... | 0.891567 | 0.900633 | 0.898214 | 0.899204 | 0.892556 | 0.896435 | 0.003666 | 3 |


A good idea is to report the best parameters of each run and try new values close to them.
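One way to do this is to build the next search grid from values around the current best ones. The helper `around` below is a hypothetical illustration, and the step sizes and lower bounds are arbitrary choices. In the notebook, the `best` dictionary would come from `classification.best_params_`.

```python
def around(best, step, low=None, high=None):
    """Candidate values centred on `best`, kept inside [low, high]."""
    values = [round(best + k * step, 6) for k in (-1, 0, 1)]
    if low is not None:
        values = [v for v in values if v >= low]
    if high is not None:
        values = [v for v in values if v <= high]
    return values

# Best parameters from the top-ranked row of the results table above.
best = {"min_child_weight": 15, "max_depth": 3, "learning_rate": 0.2}

# Step sizes and bounds are arbitrary choices for illustration.
next_grid = {
    "min_child_weight": around(best["min_child_weight"], 2, low=1),
    "max_depth": around(best["max_depth"], 1, low=1),
    "learning_rate": around(best["learning_rate"], 0.05, low=0.01),
}
print(next_grid)
```

The resulting `next_grid` can be passed to a new `RandomizedSearchCV` in place of the original parameter dictionary.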

## Model Creation  

Once the best parameters have been found, they can be plugged into the model.

The values used here are only an example from one of my runs.

In [25]:
final_model = XGBClassifier(n_jobs=-1, n_estimators=500, max_depth=6, learning_rate=0.19, min_child_weight=13, colsample_bytree=0.42, random_state=42) 

final_model.fit(X,y)  

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.42,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.19, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=13,
              missing=nan, monotone_constraints='()', n_estimators=500,
              n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=42,
              reg_alpha=0, reg_lambda=1, ...)

# Submission  

Preparing the submission file for the challenge.

In [29]:
submission.head()

| | id | booking_status |
|---|---|---|
| 0 | 42100 | 0.392 |
| 1 | 42101 | 0.392 |
| 2 | 42102 | 0.392 |
| 3 | 42103 | 0.392 |
| 4 | 42104 | 0.392 |


In [22]:
preds = final_model.predict_proba(test)[:,1]

In [252]:
submission['booking_status'] = preds 

Hint: put the attempt number in the file name.

In [253]:
submission.to_csv('tenth_attempt.csv', index=False)
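The attempt number can also be generated automatically, so that an earlier file is never overwritten by mistake. This is a small sketch; the helper `next_attempt_path` and the `attempt_NNN.csv` naming scheme are assumptions for illustration, not something the notebook used.

```python
from pathlib import Path
import re

def next_attempt_path(folder=".", prefix="attempt_"):
    """Return a path like attempt_003.csv, one past the highest number found."""
    pattern = re.compile(rf"{re.escape(prefix)}(\d+)\.csv$")
    numbers = [
        int(m.group(1))
        for p in Path(folder).glob(f"{prefix}*.csv")
        if (m := pattern.match(p.name))
    ]
    # With no earlier attempts, numbering starts at 1.
    return Path(folder) / f"{prefix}{max(numbers, default=0) + 1:03d}.csv"
```

It could then be used as `submission.to_csv(next_attempt_path(), index=False)`.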

# Conclusion  

I applied this workflow to Kaggle Playground challenges 6 and 7 and obtained very good results. Of course, this does not mean it can be applied to every situation.