<a href="https://colab.research.google.com/github/beaasuncion/CCADMACL_EXERCISES_COM222ML/blob/main/EXERCISE_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise 2: Use Gradient Boost for Regression

Instructions:

- Use the dataset file to train your model.
- Use the test file to generate your results.
- Use the sample submission file as a template so that your output has the same format.

Submit your results to:
https://www.kaggle.com/competitions/playground-series-s4e12/overview



In [26]:
import xgboost as xgb
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

## Dataset
The train, test, and sample submission files can be found at this link:
https://www.kaggle.com/competitions/playground-series-s4e12/data

## 1. Load the Data

In [27]:
def load_data(file_path):
    return pd.read_csv(file_path)

In [28]:
train_df = load_data('train.csv')
test_df = load_data('test.csv')
sample_submission = load_data('sample_submission.csv')

## 2. Perform Data Preprocessing

In [29]:
X = train_df.drop(columns=['Premium Amount', 'id'])
y = train_df['Premium Amount']

In [30]:
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()
numerical_features = X.select_dtypes(include=["float64", "int64"]).columns.tolist()
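
One thing worth checking with the dtype-based split above is that listing `"float64"` and `"int64"` explicitly only matches those two dtypes. The sketch below uses a hypothetical toy frame (the column names are illustrative, not the competition schema) to show how a column stored as `int32` would be left out of both lists, and how `include="number"` would catch it.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "age": pd.Series([25, 40], dtype="int32"),
    "income": pd.Series([1000.0, 2500.0], dtype="float64"),
    "city": pd.Series(["A", "B"], dtype="object"),
})

# Same selection as in the notebook: only float64 and int64 are matched.
strict_numeric = toy.select_dtypes(include=["float64", "int64"]).columns.tolist()

# Broader selection: every numeric dtype, whatever its width.
broad_numeric = toy.select_dtypes(include="number").columns.tolist()

categorical = toy.select_dtypes(include=["object"]).columns.tolist()

print(strict_numeric)  # the int32 column "age" is not matched
print(broad_numeric)
print(categorical)
```

A column that lands in neither list is dropped by `ColumnTransformer` under its default `remainder="drop"`, so comparing `len(categorical_features) + len(numerical_features)` against `X.shape[1]` is a cheap sanity check.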

In [31]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features),
        ("num", Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]), numerical_features),
    ]
)
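
To see what this preprocessor produces, here is a minimal sketch that rebuilds the same two branches on a hypothetical two-column toy frame (`Gender` and `Age` are stand-ins, not necessarily the competition columns). It shows the imputation filling missing values and `handle_unknown='ignore'` encoding an unseen category as all zeros.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column.
toy_train = pd.DataFrame({
    "Gender": pd.Series(["Male", "Female", np.nan, "Female"], dtype="object"),
    "Age": [25.0, np.nan, 40.0, 31.0],
})
# A new row whose category was never seen during fitting.
toy_new = pd.DataFrame({
    "Gender": pd.Series(["Other"], dtype="object"),
    "Age": [50.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Gender"]),
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), ["Age"]),
    ]
)

def to_dense(matrix):
    """Return a dense NumPy array whether the output is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

train_out = to_dense(toy_preprocessor.fit_transform(toy_train))
new_out = to_dense(toy_preprocessor.transform(toy_new))

print(train_out.shape)  # two one-hot columns plus one scaled numeric column
print(new_out[0, :2])   # the unseen category "Other" encodes as all zeros
```

Tree-based boosters generally do not need feature scaling, so the `StandardScaler` step is harmless here rather than required.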

In [32]:
params = {
    "n_estimators": 100,
    "max_depth": 3,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "n_jobs": -1,
}

## 3. Create a Pipeline

In [33]:
pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("regressor", xgb.XGBRegressor(**params, random_state=42)),
    ]
)

## 4. Train the Model

In [36]:
cv = KFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='neg_root_mean_squared_error')
print(f"RMSE Scores: {-cv_scores}")
print(f"Mean CV RMSE: {-np.mean(cv_scores)}")

RMSE Scores: [853.84174404 856.08658693 854.21470854]
Mean CV RMSE: 854.7143465052582


In [34]:
pipeline.fit(X, y)

## 5. Evaluate the Model

In [35]:
test_features = test_df.drop(columns=['id'])
y_test_pred = pipeline.predict(test_features)

## Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [24]:
sf = pd.read_csv('test.csv')  # Load the test data
test_ids = sf.pop('id')  # Remove the 'id' column and keep it for the submission
y_pred = pipeline.predict(sf)

# Build the submission DataFrame
submission_df = pd.DataFrame({
    'id': test_ids,
    'Premium Amount': y_pred
})

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
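
Since the instructions ask for the same format as the sample submission file, a quick check before uploading can catch mismatched columns or row counts. The following is a minimal sketch; `check_submission` is a hypothetical helper, and the toy frames only stand in for the real files.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append(
            f"columns differ: {list(submission.columns)} vs {list(sample.columns)}"
        )
    if len(submission) != len(sample):
        problems.append(f"row count differs: {len(submission)} vs {len(sample)}")
    elif "id" in submission.columns and "id" in sample.columns:
        same_ids = submission["id"].reset_index(drop=True).equals(
            sample["id"].reset_index(drop=True)
        )
        if not same_ids:
            problems.append("id values or their order differ")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems

# Toy stand-ins for sample_submission.csv and two candidate submissions.
toy_sample = pd.DataFrame({"id": [1, 2, 3], "Premium Amount": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [1, 2, 3], "Premium Amount": [900.5, 1100.0, 750.25]})
bad = pd.DataFrame({"id": [1, 2], "prediction": [900.5, 1100.0]})

print(check_submission(good, toy_sample))
print(check_submission(bad, toy_sample))
```

With the notebook's own variables, the call would be `check_submission(submission_df, sample_submission)`.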
