<a href="https://colab.research.google.com/github/chl-eo/CCADMACL_EXERCISES_COM222/blob/main/Exercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise 2: Use Gradient Boost for Regression

Instructions:

- Use the Dataset File to train your model
- Use the Test File to generate your results
- Use the Sample Submission file as the template for your output format
- Submit your results to: https://www.kaggle.com/competitions/playground-series-s4e12/overview



In [3]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

In [1]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

## Dataset
The train, test, and sample submission files can be found at this link:
https://www.kaggle.com/competitions/playground-series-s4e12/data

## 1. Load the Data

In [4]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

## 2. Perform Data Preprocessing

In [6]:
for col in train.columns:
   if col in test.columns:
       if train[col].dtype in ["float64", "int64"]:
           train[col] = train[col].fillna(0)
           test[col] = test[col].fillna(0)
       elif train[col].dtype == "object":
           train[col] = train[col].fillna("null")
           test[col] = test[col].fillna("null")
   else:
       print(f"Column '{col}' not found in the test. Ignore")

Column 'Premium Amount' not found in the test. Ignore
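
The loop above fills numeric gaps with 0 and text gaps with the string "null", skipping any column that exists only in the training set. A minimal sketch of the same idea on toy data follows; the frames, their column values, and the helper name `fill_missing` are hypothetical and only there to make the effect visible without the competition files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train.csv / test.csv.
toy_train = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0],
    "Occupation": ["Employed", None, "Self-Employed"],
    "Premium Amount": [1000.0, 850.0, 1200.0],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 31.0],
    "Occupation": [None, "Employed"],
})


def fill_missing(train_df, test_df):
    """Fill NaNs in columns shared by both frames: 0 for numeric, "null" for text."""
    train_df, test_df = train_df.copy(), test_df.copy()
    for col in train_df.columns:
        if col not in test_df.columns:
            continue  # e.g. the target column, which exists only in train
        if pd.api.types.is_numeric_dtype(train_df[col]):
            train_df[col] = train_df[col].fillna(0)
            test_df[col] = test_df[col].fillna(0)
        else:
            train_df[col] = train_df[col].fillna("null")
            test_df[col] = test_df[col].fillna("null")
    return train_df, test_df


filled_train, filled_test = fill_missing(toy_train, toy_test)
missing_after = int(filled_train.isna().sum().sum() + filled_test.isna().sum().sum())
print(filled_train)
print("Missing values left:", missing_after)
```

Filling with 0 is a simple choice, not necessarily the best one: a 0 in a column such as age is indistinguishable from a real value, so a median fill or an explicit missing-value indicator may be worth comparing.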


In [8]:
target_column = "Premium Amount"
X = train.drop(columns=[target_column, 'id', 'Group', 'Year', 'Month', 'Day', 'Week'], errors='ignore')
y = train[target_column]

In [9]:
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()
numerical_features = X.select_dtypes(include=["float64", "int64"]).columns.tolist()

In [10]:
preprocessor = ColumnTransformer(
   transformers=[
       ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
   ],
   remainder="passthrough",  # pass the remaining (numeric) columns through unchanged
)

## 3. Create a Pipeline

In [11]:
xgb_model = XGBRegressor(
   n_estimators=500,
   max_depth=4,
   learning_rate=0.01,
   random_state=42,
)

In [12]:
pipeline = Pipeline(
   steps=[
       ("preprocessor", preprocessor),
       ("regressor", xgb_model),
   ]
)
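
`GridSearchCV` is imported at the top of the notebook but never used. The sketch below shows one way it could be combined with a pipeline to tune hyperparameters. It is an illustration under assumptions, not the notebook's method: it uses synthetic data, scikit-learn's `GradientBoostingRegressor` as a stand-in model, a placeholder `StandardScaler` step, and an arbitrary small grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data so the sketch runs without the competition files.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

demo_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),  # placeholder preprocessing step
        ("regressor", GradientBoostingRegressor(random_state=42)),
    ]
)

# "<step name>__<parameter>" routes each value to the right pipeline step.
param_grid = {
    "regressor__n_estimators": [50, 100],
    "regressor__max_depth": [2, 3],
}

search = GridSearchCV(
    demo_pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.4f}")
```

Because the preprocessing lives inside the pipeline, it is refitted on each training fold, so the cross-validated score does not leak information from the held-out fold.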

## 4. Train the Model

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [14]:
pipeline.fit(X_train, y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [15]:
# The pipeline was already fitted in the previous cell, so only predict here
y_val_pred = pipeline.predict(X_val)

## 5. Evaluate the Model

In [16]:
# squared=False is deprecated since scikit-learn 1.4, so take the square root of the MSE instead
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
print(f"Validation RMSE: {val_rmse:.4f}")

Validation RMSE: 847.2163
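
An RMSE value is hard to judge in isolation. One common sanity check is to compare it with the RMSE of a baseline that always predicts the mean of the target. The sketch below computes both by hand with NumPy; the arrays are hypothetical and unrelated to the competition data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical targets and model predictions.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])

# Baseline: predict the mean of the targets for every row.
baseline_pred = np.full_like(y_true_demo, y_true_demo.mean())

model_rmse = rmse(y_true_demo, y_pred_demo)
baseline_rmse = rmse(y_true_demo, baseline_pred)

print(f"Model RMSE:    {model_rmse:.4f}")
print(f"Baseline RMSE: {baseline_rmse:.4f}")
```

In the notebook, the same comparison could be made by evaluating a constant prediction of `y_train.mean()` against `y_val`. A model RMSE close to that baseline would suggest the features are contributing little.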




## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [20]:
test_features = test.drop(columns=['id', 'Group', 'Year', 'Month', 'Day', 'Week'], errors='ignore')
test_predictions = pipeline.predict(test_features)
submission_df = pd.DataFrame({
   "id": test["id"],
   "Premium Amount": test_predictions,
})
submission_df.to_csv("submission_file.csv", index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
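
Before uploading, it can help to confirm that the generated file has the same layout as the sample submission. The helper below is a hypothetical sketch of such a check (the function name `check_submission` and the toy frames are assumptions); in the notebook it could be called with `submission_df` and the sample submission file loaded via `pd.read_csv`.

```python
import pandas as pd


def check_submission(submission, sample):
    """Return a list of format problems; an empty list means the layouts match."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append(
            f"columns differ: {list(submission.columns)} vs {list(sample.columns)}"
        )
    if len(submission) != len(sample):
        problems.append(f"row count differs: {len(submission)} vs {len(sample)}")
    elif (
        "id" in submission.columns
        and "id" in sample.columns
        and not submission["id"].equals(sample["id"])
    ):
        problems.append("id values differ")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems


# Hypothetical toy frames standing in for the real files.
sample = pd.DataFrame({"id": [0, 1, 2], "Premium Amount": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [0, 1, 2], "Premium Amount": [950.0, 1010.5, 870.25]})
bad = good.drop(columns=["id"])

problems_good = check_submission(good, sample)
problems_bad = check_submission(bad, sample)
print("good:", problems_good)
print("bad: ", problems_bad)
```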
