# XGBoost Notebook

In this notebook I will train an XGBoost model end to end.

### Description

This notebook targets the April 2025 Kaggle competition on predicting podcast listening time.

The goal is to analyze and predict the average listening duration of podcast episodes based on various features.

### Files
1. train.csv
2. test.csv
3. sample_submission.csv

### Evaluation

The evaluation metric is the RMSE.
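
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. The snippet below is a minimal sketch of that formula; the helper name `rmse` and the toy numbers are illustrative assumptions, not competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (hypothetical): errors are -2, 2 and -3 minutes
example_rmse = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.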

### Submission File

For each id in the test set, you must predict the number of minutes listened. The file should contain a header and have the following format:

- id,Listening_Time_minutes
- 26570,0.2
- 26571,0.1
- 26572,0.9
- etc.

## Package Importing

In [1]:
# general python libraries
import time
import sys
import datetime
import math
import numpy as np

# dataframe and data manipulation library
import pandas as pd

# visualisation and EDA libraries
import matplotlib.pyplot as plt
import ydata_profiling
import seaborn as sns

# machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn import metrics
import xgboost as xgb



## Data Importing

In [2]:
folder_path = '../data/raw'
df_train = pd.read_csv(f'{folder_path}/train.csv')
df_test = pd.read_csv(f'{folder_path}/test.csv')
sample = pd.read_csv(f'{folder_path}/sample_submission.csv')

In [3]:
TARGET_COLUMN = 'Listening_Time_minutes'

## Data Cleaning

### Plan for data cleaning for first model
0.  id
    - drop
1.  Podcast_Name
    - drop
2.  Episode_Title
    - parse out episode number
3.  Episode_Length_minutes
    - impute with mean
4.  Genre
    - drop
5.  Host_Popularity_percentage
    - impute with mean
6.  Publication_Day
    - drop
7.  Publication_Time
    - drop
8.  Guest_Popularity_percentage
    - impute with mean
9.  Number_of_Ads
    - impute one missing value with mean
10.  Episode_Sentiment
    - drop
11.  Listening_Time_minutes - target
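
The plan above marks four columns as "impute with mean", while the `preprocessing` function in this notebook only drops columns and parses the episode number. The sketch below shows one way the imputation step could look. The helper names `fit_imputation_means` and `apply_imputation` are hypothetical, and the toy frames stand in for the real data. The means are computed on the training rows only and then reused for validation or test rows, so no information leaks from held-out data.

```python
import numpy as np
import pandas as pd

# Columns the plan marks as "impute with mean"
MEAN_IMPUTE_COLUMNS = [
    'Episode_Length_minutes',
    'Host_Popularity_percentage',
    'Guest_Popularity_percentage',
    'Number_of_Ads',
]

def fit_imputation_means(df, columns=MEAN_IMPUTE_COLUMNS):
    """Compute per-column means (NaN ignored) from the training rows only."""
    return df[columns].mean()

def apply_imputation(df, means):
    """Return a copy of df with missing values replaced by the given means."""
    df = df.copy()
    df[means.index] = df[means.index].fillna(means)
    return df

# Toy frames (hypothetical values) standing in for a training and a validation fold
toy_train = pd.DataFrame({
    'Episode_Length_minutes': [10.0, np.nan, 30.0],
    'Host_Popularity_percentage': [50.0, 70.0, np.nan],
    'Guest_Popularity_percentage': [np.nan, 20.0, 40.0],
    'Number_of_Ads': [0.0, 2.0, np.nan],
})
toy_valid = pd.DataFrame({
    'Episode_Length_minutes': [np.nan],
    'Host_Popularity_percentage': [np.nan],
    'Guest_Popularity_percentage': [np.nan],
    'Number_of_Ads': [np.nan],
})

means = fit_imputation_means(toy_train)
toy_train_imputed = apply_imputation(toy_train, means)
toy_valid_imputed = apply_imputation(toy_valid, means)
```

XGBoost can also handle missing values natively, so this step is optional for this model; whether it helps is something to check with cross-validation.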

In [4]:
def feature_engineering(df):
    
    # Parse episode number
    df['Episode_Number'] = (
        df
            ['Episode_Title']
            .str.split(' ') # split on the space: 'Episode 12' -> ['Episode', '12']
            .apply(lambda lst: lst[1]) # keep the second element, the episode number
            .astype('float64')
    )
    
    df = df.drop(columns='Episode_Title')

    return df
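
The split-and-index approach above assumes every title has the form 'Episode 12'. As a hedged alternative, the sketch below pulls the first run of digits with a regular expression, which yields NaN instead of an error when a title is missing or has no number. The helper name `parse_episode_number` and the toy titles are assumptions for illustration.

```python
import pandas as pd

def parse_episode_number(titles):
    """Extract the first run of digits from each title as float; NaN if none."""
    return titles.str.extract(r'(\d+)', expand=False).astype('float64')

# Toy titles (hypothetical), including a missing one
toy_titles = pd.Series(['Episode 12', 'Episode 7', None])
episode_numbers = parse_episode_number(toy_titles)
```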

In [5]:
def preprocessing(df):
    
    # Drop columns that this first model does not use
    columns_to_drop = ['Podcast_Name', 'Genre', 'Publication_Time', 'Episode_Sentiment', 'id', 'Publication_Day']
    
    df = df.drop(columns=columns_to_drop)

    df = feature_engineering(df)

    return df # Enabled this to stop warnings



## Model fitting

### Train Test Split

The training data is split into 5 folds with KFold. Each fold is held out once for validation while the model is fit on the remaining four.
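
One detail worth keeping in mind: `KFold.split` yields positional indices (0 to n-1), not index labels. Selecting rows with `.loc` works in this notebook because `pd.read_csv` gives `df_train` a default integer index, but `.iloc` is the safer selector in general. The sketch below illustrates this on a toy frame with a non-default index. The toy data, and the use of `shuffle=True` with a fixed `random_state`, are assumptions for illustration and not what the cell below uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame whose index labels (100..109) differ from row positions (0..9)
toy = pd.DataFrame(
    {
        'feature': np.arange(10, dtype=float),
        'target': np.arange(10, dtype=float) * 2,
    },
    index=np.arange(100, 110),
)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold_number, (train_idx, valid_idx) in enumerate(kfold.split(toy), 1):
    # split() yields positions, so .iloc selects the intended rows
    train_part = toy.iloc[train_idx]
    valid_part = toy.iloc[valid_idx]
    fold_sizes.append((len(train_part), len(valid_part)))
```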

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import root_mean_squared_error

NUMBER_OF_SPLITS = 5
    
outer_kfold = KFold(n_splits=NUMBER_OF_SPLITS)

list_train_rmse = []
list_test_rmse = []

for fold_number, (infold_training_indices, infold_test_indices) in enumerate(outer_kfold.split(df_train), 1):

    # Pre-processing of training data in kfold
    X_train = df_train.loc[infold_training_indices,df_train.columns != TARGET_COLUMN]
    X_train = preprocessing(X_train)
    
    y_train = df_train.loc[infold_training_indices,TARGET_COLUMN].to_numpy()

    # Pre-processing of the held-out fold used for in-fold validation
    X_test = df_train.loc[infold_test_indices,df_train.columns != TARGET_COLUMN]
    X_test = preprocessing(X_test)
    
    y_test = df_train.loc[infold_test_indices,TARGET_COLUMN].to_numpy()

    hyperparameters = {
        "learning_rate": 0.015,
        "max_depth": 6,
        "n_estimators": 700,
        "random_state": 42,
        "objective": 'reg:squarederror'
    }

    # Build the XGBoost regressor from the hyperparameters above
    xgboost_model = xgb.XGBRegressor(
        **hyperparameters
    )

    xgboost_model.fit(
        X_train,
        y_train,
        verbose=500,
        eval_set=[(X_train,y_train)]
    )

    y_train_preds = xgboost_model.predict(X_train)
    train_rmse = root_mean_squared_error(y_true=y_train,y_pred=y_train_preds)
    list_train_rmse.append(train_rmse)

    y_test_preds = xgboost_model.predict(X_test)
    test_rmse = root_mean_squared_error(y_true=y_test,y_pred=y_test_preds)
    list_test_rmse.append(test_rmse)

print('The average training RMSE across folds is ', sum(list_train_rmse)/len(list_train_rmse))

[0]	validation_0-rmse:26.83238
[500]	validation_0-rmse:13.05441
[699]	validation_0-rmse:13.01262
[0]	validation_0-rmse:26.83152
[500]	validation_0-rmse:13.05674
[699]	validation_0-rmse:13.01681
[0]	validation_0-rmse:26.82684
[500]	validation_0-rmse:13.06990
[699]	validation_0-rmse:13.02741
[0]	validation_0-rmse:26.82313
[500]	validation_0-rmse:13.07434
[699]	validation_0-rmse:13.03110
[0]	validation_0-rmse:26.83227
[500]	validation_0-rmse:13.07045
[699]	validation_0-rmse:13.02841



In [7]:
print('The average cross-validated RMSE is ', sum(list_test_rmse)/len(list_test_rmse))

The average cross-validated RMSE is  13.094737572066606


In [9]:
# Training on entire dataset
X_train = df_train.loc[:,df_train.columns != TARGET_COLUMN]
X_train = preprocessing(X_train)

y_train = df_train.loc[:,TARGET_COLUMN].to_numpy()

In [10]:
xgboost_model = xgb.XGBRegressor(
    **hyperparameters
)

xgboost_model.fit(X_train,y_train)

In [20]:
import mlflow
from mlflow.models import infer_signature

# Set our tracking server uri for logging
mlflow.set_tracking_uri("http://localhost:5000")

# Create a new MLflow Experiment
mlflow.set_experiment("Kaggle S5E4")

# Start an MLflow run
with mlflow.start_run(run_id='fe80dcac823b4f17b522a63c4ed1067d'):

    # Log the hyperparameters
    mlflow.log_params(hyperparameters)

    # Log the loss metric
    mlflow.log_metric("cv_neg_root_mean_squared_error", sum(list_test_rmse)/len(list_test_rmse))
    mlflow.log_metric("kaggle leaderboard", 13.17167)

    # Infer the model signature
    signature = infer_signature(
        model_input=X_train,
        model_output=y_train,
    )

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=xgboost_model,
        artifact_path="xgboost_model",
        signature=signature,
        input_example=X_train.head(5),  # a few rows are enough for an input example
    )


🏃 View run clean-toad-84 at: http://localhost:5000/#/experiments/3/runs/fe80dcac823b4f17b522a63c4ed1067d
🧪 View experiment at: http://localhost:5000/#/experiments/3


## Test Set Prediction

In [12]:
X_test = preprocessing(df_test)

In [13]:
y_preds = xgboost_model.predict(X_test)

In [14]:
df_submission = pd.DataFrame(
    y_preds,
    columns=['Listening_Time_minutes'],
    index=df_test.id
)

In [17]:
# write the csv to the submissions folder
df_submission.to_csv('../submissions/xgboost_impute_with_mlflow_pipeline.csv')
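
Before uploading, it can help to check the submission against the sample file loaded earlier as `sample`. The sketch below is one possible check; the helper name `check_submission` and the toy frames are hypothetical, and it assumes the sample file has an `id` column followed by the target column, as described in the submission format above.

```python
import numpy as np
import pandas as pd

def check_submission(df_submission, df_sample, target='Listening_Time_minutes'):
    """Raise AssertionError if the submission does not line up with the sample."""
    flat = df_submission.reset_index()  # move the 'id' index back into a column
    assert list(flat.columns) == ['id', target], 'unexpected columns'
    assert len(flat) == len(df_sample), 'row count differs from sample'
    assert np.array_equal(flat['id'].to_numpy(), df_sample['id'].to_numpy()), 'ids differ from sample'
    assert flat[target].notna().all(), 'missing predictions'
    return True

# Toy frames (hypothetical values) standing in for the real sample and submission
toy_sample = pd.DataFrame({'id': [1, 2, 3], 'Listening_Time_minutes': [0.0, 0.0, 0.0]})
toy_submission = pd.DataFrame(
    [12.5, 30.0, 45.25],
    columns=['Listening_Time_minutes'],
    index=pd.Index([1, 2, 3], name='id'),
)

submission_ok = check_submission(toy_submission, toy_sample)
```

In the notebook itself the call would be `check_submission(df_submission, sample)`.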

In [16]:
df_submission

id,Listening_Time_minutes
750000,56.563034
750001,18.518909
750002,49.576340
750003,77.886902
750004,47.933655
...,...
999995,11.649667
999996,58.536015
999997,6.864640
999998,72.083359
