Findings and Insights
Handling Missing Values: KNNImputer (n_neighbors=25) filled each missing value with the mean of that feature over the 25 nearest rows.
Correlation Handling: Dropping every feature whose absolute correlation with an earlier column exceeded 0.8 cut the training frame from 79 to 58 columns, which removes redundant inputs and may reduce overfitting.
Model Optimization: Bayesian optimization with BayesSearchCV (50 iterations, 3-fold cross-validation, ROC AUC scoring) selected the XGBoost hyperparameters; the best configuration reached a cross-validated AUC of about 0.961.
Model Performance: After hyperparameter tuning, the model achieved a validation accuracy of about 0.998 on the 30% hold-out split, which is consistent with the tuned configuration generalizing well.
Final Submission: Predicted probabilities for the test set were saved to bayesian_optimized_submission.csv, ready for further evaluation.
In conclusion, the combination of data preprocessing, correlation-based feature selection, and Bayesian hyperparameter optimization produced a robust XGBoost model for the given binary classification task.

In [1]:
import pandas as pd
from sklearn.impute import KNNImputer

# Load the competition data
train_df = pd.read_csv('/kaggle/input/imlcomp1/train_set.csv')
test_df = pd.read_csv('/kaggle/input/imlcomp1/test_set.csv')

# Preview both frames
print("Training set")
print(train_df.head())
print("test set")
print(test_df.head())

Training set
   RecordId         X2         X3  X4  X5  X6          X7  X8   X9  X10  ...  \
0         1  87.000000  34.118411   0   2   0  165.100000   1  829    2  ...   
1         2  82.372284  31.573280   0   0   1  162.983897   1  724    0  ...   
2         3  50.000000  27.771653   0   0   1  165.100000   1  895    2  ...   
3         4  66.236109  26.515922   0   0   1  167.009549   1  637    0  ...   
4         5  81.303299  20.843691   0   0   1  158.165419   0  564    0  ...   

        X70  X71  X72  X73  X74  X75  X76  X77  X78  Y  
0  0.040000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0  
1  0.033431  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0  
2  0.010000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0  
3  0.039363  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0  
4  0.069242  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0  

[5 rows x 79 columns]
test set
   RecordId         X2         X3  X4  X5  X6          X7  X8   X9  X10  ...  \
0    300001  79.000000  17.122318   0   0   1  170.2

In [2]:
imputer = KNNImputer(n_neighbors=25)
train_df.loc[:, train_df.columns != 'Y'] = imputer.fit_transform(train_df.loc[:, train_df.columns != 'Y'])
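
To make the imputation step concrete, here is a minimal sketch on a toy array (the array and the names `toy`, `toy_imputer`, and `toy_filled` are illustrative and not taken from the competition data). With the default uniform weights, each missing entry is replaced by the mean of that feature over the `n_neighbors` rows that are closest under a NaN-aware Euclidean distance.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical): 4 rows, 3 features, two missing entries.
toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Fit on the toy data and fill the gaps from the 2 nearest rows.
toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy)
print(toy_filled)
```

The notebook uses the same call with `n_neighbors=25`, fitted on every training column except `Y`, and later reuses the fitted imputer on the test set.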

In [3]:
dropped_columns = []

In [4]:
import pandas as pd
import numpy as np  # Explicitly import numpy

In [5]:
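# Calculate the absolute correlation matrix of the non-target columns
# (Y is excluded so the target itself can never be flagged for removal)
correlation_matrix = train_df.drop(columns=['Y']).corr().abs()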

In [6]:
# Threshold for removing correlated features (previously 0.85)
correlation_threshold = 0.8

In [7]:
# Select upper triangle of the correlation matrix
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

In [8]:
# Identify features to drop based on the correlation threshold
to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > correlation_threshold)]

In [9]:
# Store the names of dropped columns
dropped_columns = to_drop.copy()

In [10]:
# Drop the highly correlated features from the training set
train_df_reduced = train_df.drop(columns=to_drop)
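
The correlation filter built up over the previous cells can be wrapped in one helper and checked on toy data. This is a sketch under assumptions: the function name `find_correlated_columns` and the toy frame are hypothetical, and the logic mirrors the upper-triangle approach used above.

```python
import numpy as np
import pandas as pd

def find_correlated_columns(df, threshold=0.8):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only entries strictly above the diagonal so each pair is seen once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exact multiple of "a"
    "c": [5, 1, 4, 2, 3],   # only weakly correlated with "a" and "b"
})
toy_to_drop = find_correlated_columns(toy, threshold=0.8)
print(toy_to_drop)
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one flagged, so column order decides which of the two survives.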

In [11]:
# Redefine Features and Target with reduced features
Features_reduced = train_df_reduced.drop(columns=['RecordId', 'Y'])
Target = train_df_reduced['Y']

In [12]:
# Print the reduced feature set
print(f"Features reduced from {train_df.shape[1]} to {train_df_reduced.shape[1]}")
print(Features_reduced.head())
print(dropped_columns)

Features reduced from 79 to 58
          X2         X3  X4  X5  X6          X7  X8   X9  X10  X11  ...  \
0  87.000000  34.118411   0   2   0  165.100000   1  829    2    7  ...   
1  82.372284  31.573280   0   0   1  162.983897   1  724    0    4  ...   
2  50.000000  27.771653   0   0   1  165.100000   1  895    2    7  ...   
3  66.236109  26.515922   0   0   1  167.009549   1  637    0    7  ...   
4  81.303299  20.843691   0   0   1  158.165419   0  564    0    5  ...   

        X68       X69  X71  X72  X73  X74  X75  X76  X77  X78  
0  4.200000  0.110000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
1  3.718976  0.100292  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
2  3.800000  0.020000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
3  4.285677  0.108249  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
4  3.769194  0.164645  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  

[5 rows x 56 columns]
['X16', 'X29', 'X30', 'X31', 'X33', 'X34', 'X35', 'X36', 'X43', 'X44', 'X49', 'X50', 'X52', 'X53', 'X54', 'X55'

In [13]:
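# Scaling
# Note: the scaled array `Features` is not used in later cells; the train/validation
# split works on the unscaled Features_reduced. Tree-based models such as XGBoost
# are insensitive to monotonic rescaling of individual features.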
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
Features = scaler.fit_transform(Features_reduced)
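
For reference, `RobustScaler` with default settings centers each feature on its median and divides by its interquartile range, which keeps a single outlier from dominating the scale. A small sketch on a hypothetical one-column array, compared against the manual formula:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column (hypothetical) with one large outlier.
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

toy_scaled = RobustScaler().fit_transform(toy)

# Manual equivalent: (x - median) / (75th percentile - 25th percentile).
iqr = np.percentile(toy, 75) - np.percentile(toy, 25)
manual = (toy - np.median(toy)) / iqr

print(toy_scaled.ravel())
print(manual.ravel())
```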

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import xgboost as xgb
import random

In [15]:
# Assumes Features_reduced and Target are already defined
# Step 1: Perform a 70/30 train/validation split
train_f, val_f, train_t, val_t = train_test_split(Features_reduced, Target, test_size=0.30, random_state=42)
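
The class balance of `Y` is not shown in this notebook. If the positive class turns out to be rare, one option is to pass `stratify` to `train_test_split` so both splits keep the same class proportions. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) to show the effect; it is not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 rows, 4 of them positive (20%).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

# stratify=y_toy keeps the 20% positive rate in both halves.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_va.mean())
```

In the notebook the equivalent change would be adding `stratify=Target` to the existing call.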

In [16]:
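# Step 2: Set up data for XGBoost's native API
# (not used below: the scikit-learn wrapper XGBClassifier takes the DataFrames directly)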
train_data = xgb.DMatrix(train_f, label=train_t)
val_data = xgb.DMatrix(val_f, label=val_t)

In [17]:
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

In [18]:
# Define the search space
param_space = {
    'max_depth': Integer(10, 15),
    'learning_rate': Real(0.01, 0.1, 'log-uniform'),
    'n_estimators': Integer(500, 750),
    'gamma': Real(0, 0.5),
    'min_child_weight': Integer(1, 10),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.5, 1.0),
    'alpha': Real(0, 10),
    'lambda': Real(0, 10)
}
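
The `'log-uniform'` prior on `learning_rate` means candidate values are spread evenly on a logarithmic scale, so the interval from 0.01 to about 0.032 receives as much attention as the interval from 0.032 to 0.1. The NumPy sketch below illustrates that idea only; `sample_log_uniform` is a hypothetical helper and is not how scikit-optimize implements its sampler.

```python
import numpy as np

def sample_log_uniform(low, high, size, seed=0):
    """Draw values whose logarithm is uniformly distributed on [log(low), log(high))."""
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

samples = sample_log_uniform(0.01, 0.1, size=10_000)

# About half of the draws fall below the geometric midpoint sqrt(0.01 * 0.1).
geometric_mid = np.sqrt(0.01 * 0.1)
frac_below_mid = np.mean(samples < geometric_mid)
print(round(geometric_mid, 4), frac_below_mid)
```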

In [19]:
# Step 3: Number of boosting rounds for the native XGBoost API
# (unused below: the search tunes n_estimators on the scikit-learn wrapper instead)

num_rounds = 350

In [20]:
# Define the model and optimization parameters
xgb_model = XGBClassifier(objective='binary:logistic', eval_metric='auc', use_label_encoder=False)

In [21]:
# Bayesian search with cross-validation
opt = BayesSearchCV(
    estimator=xgb_model,
    search_spaces=param_space,
    scoring='roc_auc',  # AUC is used for evaluation
    n_iter=50,  # Number of optimization iterations
    cv=3,       # 3-fold cross-validation
    random_state=42,
    verbose=2
)

In [22]:
# Run the optimization
opt.fit(train_f, train_t)

Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] END alpha=4.101039588533139, colsample_bytree=0.8638628715886625, gamma=0.46643399942391695, lambda=3.1579959348704874, learning_rate=0.04678945087112739, max_depth=12, min_child_weight=4, n_estimators=685, subsample=0.6522316555182531; total time=  13.2s
[CV] END alpha=4.101039588533139, colsample_bytree=0.8638628715886625, gamma=0.46643399942391695, lambda=3.1579959348704874, learning_rate=0.04678945087112739, max_depth=12, min_child_weight=4, n_estimators=685, subsample=0.6522316555182531; total time=  12.4s
[CV] END alpha=4.101039588533139, colsample_bytree=0.8638628715886625, gamma=0.46643399942391695, lambda=3.1579959348704874, learning_rate=0.04678945087112739, max_depth=12, min_child_weight=4, n_estimators=685, subsample=0.6522316555182531; total time=  13.5s
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] END alpha=8.373883555532844, colsample_bytree=0.9416576386904312, gamma=0.1517050549420875, la

In [23]:
# Get best parameters and score
print("Best Parameters: ", opt.best_params_)
print("Best AUC Score: ", opt.best_score_)

Best Parameters:  OrderedDict([('alpha', 0.0), ('colsample_bytree', 0.5), ('gamma', 0.5), ('lambda', 0.0), ('learning_rate', 0.016144585584455588), ('max_depth', 10), ('min_child_weight', 10), ('n_estimators', 750), ('subsample', 0.9170187983999881)])
Best AUC Score:  0.9609606038431645


In [24]:
# Train the final model with the best parameters on the 70% training split (train_f)
final_model = XGBClassifier(**opt.best_params_, objective='binary:logistic', eval_metric='auc', use_label_encoder=False)
final_model.fit(train_f, train_t)

In [25]:
# Make predictions on the validation set
val_predictions_bayes = final_model.predict(val_f)
val_accuracy_bayes = accuracy_score(val_t, val_predictions_bayes)
print("Validation Accuracy with Bayesian Optimization:", val_accuracy_bayes)

Validation Accuracy with Bayesian Optimization: 0.9977111746143532
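
The search above optimizes ROC AUC, while this cell reports accuracy at the default 0.5 threshold. The two metrics can diverge, for instance when one class is rare, so it may be worth reporting both on the hold-out split. The toy arrays below are hypothetical and only illustrate the gap; they say nothing about this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels (hypothetical): 9 negatives and 1 positive.
y_true = np.array([0] * 9 + [1])
# Toy scores: the positive (0.2) ranks above 5 negatives and below 4.
y_score = np.array([0.1] * 5 + [0.3] * 4 + [0.2])

# Accuracy uses hard labels at a 0.5 threshold; AUC uses the ranking of scores.
y_pred = (y_score >= 0.5).astype(int)
toy_accuracy = accuracy_score(y_true, y_pred)
toy_auc = roc_auc_score(y_true, y_score)
print(toy_accuracy, toy_auc)
```

In the notebook, the matching validation AUC would come from `roc_auc_score(val_t, final_model.predict_proba(val_f)[:, 1])`.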


In [26]:
# Prepare the test data: impute with the imputer fitted on the training set,
# then drop RecordId and the same correlated columns that were removed in training
test_df_imputed = imputer.transform(test_df)
test_features = pd.DataFrame(test_df_imputed, columns=test_df.columns)
test_features = test_features.drop(columns=['RecordId'] + dropped_columns)
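
Since the model is sensitive to feature names and order, a quick guard before predicting is to align the test features to the training columns explicitly. The helper `align_to_training_columns` below is a hypothetical sketch shown on toy data; it fails loudly if a training column is missing and discards any extras.

```python
import pandas as pd

def align_to_training_columns(test_features, training_columns):
    """Return test_features restricted to, and ordered like, training_columns."""
    missing = [c for c in training_columns if c not in test_features.columns]
    if missing:
        raise ValueError(f"Test set is missing training columns: {missing}")
    return test_features[list(training_columns)]

# Toy example (hypothetical): shuffled order plus one extra column.
toy_train_cols = ["X2", "X3", "X5"]
toy_test = pd.DataFrame({"X5": [1.0], "X3": [2.0], "X2": [3.0], "X99": [0.0]})

toy_aligned = align_to_training_columns(toy_test, toy_train_cols)
print(list(toy_aligned.columns))
```

In the notebook this would be called as `align_to_training_columns(test_features, Features_reduced.columns)`.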

In [27]:
# Predictions on the test set
test_pred_bayes = final_model.predict_proba(test_features)[:, 1]

In [28]:
# Prepare submission
kaggle_submission_bayes = pd.DataFrame({'RecordId': test_df['RecordId'], 'Y': test_pred_bayes})
kaggle_submission_bayes.to_csv('bayesian_optimized_submission.csv', index=False)
print('Bayesian Optimized Submission file created successfully.')

Bayesian Optimized Submission file created successfully.
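
Before uploading, a few sanity checks on the submission frame can catch common mistakes. The sketch below assumes the expected layout is exactly the two columns written above (`RecordId`, `Y`) with probabilities in [0, 1]; the competition's actual requirements are not stated in this notebook, and `check_submission` is a hypothetical helper.

```python
import pandas as pd

def check_submission(df, id_col="RecordId", target_col="Y"):
    """Return a list of problems found in a submission frame (empty if none)."""
    expected = [id_col, target_col]
    if list(df.columns) != expected:
        return [f"expected columns {expected}, got {list(df.columns)}"]
    problems = []
    if df[id_col].duplicated().any():
        problems.append("duplicate ids")
    if df[target_col].isna().any():
        problems.append("missing predictions")
    # Assumption: Y holds probabilities, so values must lie in [0, 1].
    if not df[target_col].dropna().between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems

# Toy frames (hypothetical values).
good = pd.DataFrame({"RecordId": [300001, 300002], "Y": [0.02, 0.97]})
bad = pd.DataFrame({"RecordId": [300001, 300001], "Y": [0.5, 1.2]})
print(check_submission(good))
print(check_submission(bad))
```

In the notebook the call would be `check_submission(kaggle_submission_bayes)`.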
