## Exercise 2: Use Gradient Boosting for Regression

Instructions:

- Use the Dataset File to train your model
- Use the Test File to generate your results
- Use the Sample Submission file to generate a submission in the same format

Submit your results to:
https://www.kaggle.com/competitions/playground-series-s4e12/overview
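
Before submitting, it can help to confirm that the generated file really has the same format as the sample submission. The sketch below is one possible check, not part of the exercise template: `check_submission_format` is a hypothetical helper, and the toy frames stand in for the real files. The column names `id` and `Premium Amount` are assumed from the submission cell at the end of this notebook.

```python
import pandas as pd


def check_submission_format(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of format problems; an empty list means the formats match."""
    problems = []
    # Column names and their order must match the sample file
    if list(submission.columns) != list(sample.columns):
        problems.append(
            f"columns differ: {list(submission.columns)} vs {list(sample.columns)}"
        )
    # One prediction per row of the sample file
    if len(submission) != len(sample):
        problems.append(f"row count differs: {len(submission)} vs {len(sample)}")
    elif "id" in submission.columns and "id" in sample.columns:
        same_ids = submission["id"].reset_index(drop=True).equals(
            sample["id"].reset_index(drop=True)
        )
        if not same_ids:
            problems.append("id values differ")
    # Missing predictions are worth flagging before uploading
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems


# Toy stand-ins for sample_submission.csv and a generated submission (assumed layout)
sample = pd.DataFrame({"id": [10, 11, 12], "Premium Amount": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [10, 11, 12], "Premium Amount": [950.5, 1200.0, 801.2]})
bad = pd.DataFrame({"id": [10, 11], "premium": [950.5, 1200.0]})

print(check_submission_format(good, sample))
print(check_submission_format(bad, sample))
```

With the real files, the two arguments would be the submission DataFrame and the sample submission loaded with `pd.read_csv`.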



In [30]:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

## Dataset
The train, test, and sample submission files can be found at this link:
https://www.kaggle.com/competitions/playground-series-s4e12/data

## 1. Load the Data

In [31]:
# Load the training and test sets from the competition CSV files
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [32]:
print("Train Data Head:")
print(train_data.head())
print("Test Data Head:")
print(test_data.head())

Train Data Head:
   id   Age  Gender  Annual Income Marital Status  Number of Dependents  \
0   0  19.0  Female        10049.0        Married                   1.0   
1   1  39.0  Female        31678.0       Divorced                   3.0   
2   2  23.0    Male        25602.0       Divorced                   3.0   
3   3  21.0    Male       141855.0        Married                   2.0   
4   4  21.0    Male        39651.0         Single                   1.0   

  Education Level     Occupation  Health Score  Location  ... Previous Claims  \
0      Bachelor's  Self-Employed     22.598761     Urban  ...             2.0   
1        Master's            NaN     15.569731     Rural  ...             1.0   
2     High School  Self-Employed     47.177549  Suburban  ...             1.0   
3      Bachelor's            NaN     10.938144     Rural  ...             1.0   
4      Bachelor's  Self-Employed     20.376094     Rural  ...             0.0   

   Vehicle Age  Credit Score  Insurance Durat

In [33]:
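# Fill missing values in columns shared by both files:
# 0 for numeric columns, "missing" for text columns
for col in train_data.columns: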
    if col in test_data.columns:
        if train_data[col].dtype in ["float64", "int64"]:
            train_data[col] = train_data[col].fillna(0)
            test_data[col] = test_data[col].fillna(0)
        elif train_data[col].dtype == "object":
            train_data[col] = train_data[col].fillna("missing")
            test_data[col] = test_data[col].fillna("missing")
    else:
        print(f"Column '{col}' not found in test_data. Skipping")

Column 'Premium Amount' not found in test_data. Skipping
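
The loop above fills gaps by hand with fixed values. `SimpleImputer` is imported at the top of the notebook but never used, so here is a minimal sketch of how the same step could be done with it instead. This is an alternative, not what the notebook does: the toy columns are made up for illustration, and the median strategy is an assumption swapped in for the constant 0 used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data standing in for the real train/test files (hypothetical values)
toy_train = pd.DataFrame({
    "Age": [19.0, np.nan, 23.0, 21.0],
    "Occupation": pd.Series(
        ["Self-Employed", np.nan, "Employed", "Self-Employed"], dtype=object
    ),
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Occupation": pd.Series([np.nan, "Employed"], dtype=object),
})

imputer = ColumnTransformer(
    transformers=[
        # Numeric gaps: median learned from the training data (assumed strategy)
        ("num", SimpleImputer(strategy="median"), ["Age"]),
        # Text gaps: the same "missing" label the loop above uses
        ("cat", SimpleImputer(strategy="constant", fill_value="missing"), ["Occupation"]),
    ]
)

# Fit on the training data only, then apply the learned fill values to the test data
imputer.fit(toy_train)
filled_test = pd.DataFrame(imputer.transform(toy_test), columns=["Age", "Occupation"])
print(filled_test)
```

Fitting on the training data alone means the fill values never depend on the test set, and the imputer can later be placed inside a `Pipeline` so the same step is repeated automatically at prediction time.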


## 2. Perform Data preprocessing

In [34]:
target_column = "Premium Amount"
X = train_data.drop(columns=[target_column, 'id', 'Group', 'Year', 'Month', 'Day', 'Week'], errors='ignore')
y = train_data[target_column]

In [35]:
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()
numerical_features = X.select_dtypes(include=["float64", "int64"]).columns.tolist()

In [36]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    remainder="passthrough"  # Keeps the numerical columns untouched
)

## 3. Create a Pipeline

In [37]:
xgb_model = XGBRegressor(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.01,
    random_state=42,
)

In [38]:
pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", xgb_model),
    ]
)
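
`GridSearchCV` is imported at the top of the notebook but not used. As a rough sketch of how a pipeline like this one could be tuned, the example below searches over two hyperparameters using the `<step name>__<parameter>` naming convention. Everything in it is an assumption for illustration: the data is synthetic, the grid values are arbitrary, and scikit-learn's `GradientBoostingRegressor` stands in for `XGBRegressor` so the example runs without xgboost installed (both accept `n_estimators`, `max_depth`, and `learning_rate`).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (hypothetical): the target depends mostly on Age plus noise
rng = np.random.default_rng(42)
n_rows = 120
toy_X = pd.DataFrame({
    "Location": rng.choice(["Urban", "Rural", "Suburban"], size=n_rows),
    "Age": rng.integers(18, 65, size=n_rows).astype("float64"),
})
toy_y = 500 + 10 * toy_X["Age"] + rng.normal(0, 25, size=n_rows)

toy_pipeline = Pipeline(
    steps=[
        ("preprocessor", ColumnTransformer(
            transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Location"])],
            remainder="passthrough",
        )),
        # Stand-in for XGBRegressor
        ("regressor", GradientBoostingRegressor(random_state=42)),
    ]
)

# Keys are "<pipeline step name>__<parameter name>"
param_grid = {
    "regressor__n_estimators": [50, 100],
    "regressor__max_depth": [2, 4],
}

search = GridSearchCV(
    toy_pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(toy_X, toy_y)

print(search.best_params_)
print(f"Best cross-validated RMSE: {-search.best_score_:.4f}")
```

Because the preprocessor sits inside the pipeline, it is refit on each cross-validation fold, so the encoder never sees the held-out rows of that fold.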

## 4. Train the Model

In [39]:
# Hold out 20% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [40]:
pipeline.fit(X_train, y_train)

# Predict on the validation set
y_val_pred = pipeline.predict(X_val)

## 5. Evaluate the Model

In [41]:
# RMSE is the square root of the MSE; the `squared=False` argument
# was deprecated in scikit-learn 1.4 and removed in 1.6
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
print(f"Validation RMSE: {val_rmse:.4f}")

Validation RMSE: 847.2163
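
For reference, RMSE can be computed by hand, and it is often easier to interpret next to R² (the notebook imports `r2_score` but does not use it). The sketch below implements both with NumPy on a small made-up example. The numbers are illustrative only and unrelated to the validation result above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Hypothetical premiums: every prediction is off by exactly 100
toy_true = [1000.0, 1200.0, 800.0, 1500.0]
toy_pred = [1100.0, 1100.0, 900.0, 1400.0]

print(f"RMSE: {rmse(toy_true, toy_pred):.4f}")  # 100.0000
print(f"R^2:  {r2(toy_true, toy_pred):.4f}")
```

RMSE is expressed in the same units as the target, so it reads as a typical error size, while R² compares the model against always predicting the mean.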




## Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [42]:
test_features = test_data.drop(columns=['id', 'Group', 'Year', 'Month', 'Day', 'Week'], errors='ignore')
test_predictions = pipeline.predict(test_features)

submission_df = pd.DataFrame({
    "id": test_data["id"],
    "Premium Amount": test_predictions,
})

submission_df.to_csv("submission_file.csv", index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
