# XGBoost Notebook

In this notebook I will train an XGBoost model end to end.

### Description

This notebook is for the April 2025 Kaggle competition (S5E4) on predicting podcast listening time.

The goal is to analyze and predict the average listening duration of podcast episodes based on various features.

### Files
1. train.csv
2. test.csv
3. sample_submission.csv

### Evaluation

Submissions are scored by root mean squared error (RMSE) between the predicted and actual listening times.
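For reference, RMSE is the square root of the mean squared difference between predictions and true values. A minimal NumPy sketch follows; the function name `rmse` and the toy arrays are illustrative only and are not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: the errors are 0, 0 and 2, so RMSE = sqrt(4 / 3)
example_score = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example_score)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.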

### Submission File

For each id in the test set, you must predict the number of minutes listened. The file should contain a header and have the following format:

- id,Listening_Time_minutes
- 26570,0.2
- 26571,0.1
- 26572,0.9
- etc.

## Package Importing

In [19]:
# general python libraries
import time
import sys
import datetime
import math
import numpy as np

# dataframe and data manipulation library
import pandas as pd

# visualisation and EDA libraries
import matplotlib.pyplot as plt
import ydata_profiling
import seaborn as sns

# machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn import metrics
import xgboost as xgb

## Data Importing

In [20]:
folder_path = '../data/raw'
df_train = pd.read_csv(f'{folder_path}/train.csv')
df_test = pd.read_csv(f'{folder_path}/test.csv')
sample = pd.read_csv(f'{folder_path}/sample_submission.csv')

In [21]:
TARGET_COLUMN = 'Listening_Time_minutes'

## Data Cleaning

### Plan for data cleaning for first model
0.  id
    - drop
1.  Podcast_Name
    - drop
2.  Episode_Title
    - parse out episode number
3.  Episode_Length_minutes
    - impute with mean
4.  Genre
    - drop
5.  Host_Popularity_percentage
    - impute with mean
6.  Publication_Day
    - drop
7.  Publication_Time
    - drop
8.  Guest_Popularity_percentage
    - impute with mean
9.  Number_of_Ads
    - impute one missing value with mean
10. Episode_Sentiment
    - drop
11. Listening_Time_minutes
    - target
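The code cells in this notebook do not implement the "impute with mean" steps from the plan; XGBoost accepts missing values directly. If mean imputation is wanted, one possible sketch is below. It computes the means on the training rows only and reuses them for validation or test rows, so no information leaks from held-out data. The helper names `fit_mean_imputer` and `apply_mean_imputer` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Columns the plan marks as "impute with mean"
NUMERIC_COLUMNS = [
    "Episode_Length_minutes",
    "Host_Popularity_percentage",
    "Guest_Popularity_percentage",
    "Number_of_Ads",
]

def fit_mean_imputer(df_fit, columns):
    """Return the per-column means, computed on the fitting data only."""
    return df_fit[columns].mean()

def apply_mean_imputer(df, means):
    """Return a copy of df with missing values replaced by the stored means."""
    df = df.copy()
    cols = list(means.index)
    df[cols] = df[cols].fillna(means)
    return df

# Toy data standing in for one training fold and one validation fold
toy_train = pd.DataFrame({
    "Episode_Length_minutes": [30.0, np.nan, 60.0],
    "Host_Popularity_percentage": [50.0, 70.0, np.nan],
    "Guest_Popularity_percentage": [np.nan, 20.0, 40.0],
    "Number_of_Ads": [0.0, 2.0, np.nan],
})
toy_valid = pd.DataFrame({
    "Episode_Length_minutes": [np.nan],
    "Host_Popularity_percentage": [10.0],
    "Guest_Popularity_percentage": [np.nan],
    "Number_of_Ads": [1.0],
})

means = fit_mean_imputer(toy_train, NUMERIC_COLUMNS)
toy_valid_imputed = apply_mean_imputer(toy_valid, means)
print(toy_valid_imputed)
```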

In [54]:
def feature_engineering(df):
    
    # Parse episode number
    df['Episode_Number'] = (
        df
            ['Episode_Title']
            .str.split(' ') # split based on space so that each element is a list ['Episode','12']
            .apply(lambda lst: lst[1])
            .astype('float64')
    )
    
    df = df.drop(columns='Episode_Title')

    df['is_weekend']   = df['Publication_Day'].isin(['Saturday', 'Sunday']).astype('float64')

    return df
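The split-and-index approach above assumes every title has the form `Episode N`. A more defensive variant, sketched here as an alternative rather than the notebook's method, uses `str.extract` with a regular expression so that a title without a trailing number becomes `NaN` instead of raising an error. The function name `parse_episode_number` and the toy titles are hypothetical.

```python
import pandas as pd

def parse_episode_number(titles):
    """Extract the trailing integer from titles such as 'Episode 12' as floats."""
    # One capture group with expand=False returns a Series of matched digits
    extracted = titles.str.extract(r"(\d+)\s*$", expand=False)
    return pd.to_numeric(extracted, errors="coerce").astype("float64")

toy_titles = pd.Series(["Episode 12", "Episode 7", "Bonus"])
episode_numbers = parse_episode_number(toy_titles)
print(episode_numbers.tolist())
```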

In [55]:
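def preprocessing(df):

    # Work on a copy so the caller's frame (often a slice of df_train) is not
    # modified in place; this avoids pandas' SettingWithCopyWarning
    df = df.copy()

    # Cast categorical columns so XGBoost can use them with enable_categorical=True
    CAT_COLUMNS = ["Genre","Publication_Day","Episode_Sentiment"]
    for col in CAT_COLUMNS:
        df[col] = df[col].astype('category')

    # Add engineered features (Episode_Number, is_weekend)
    df = feature_engineering(df)

    # Drop non-important columns
    columns_to_drop = ['Podcast_Name','Publication_Time','id']
    df = df.drop(columns=columns_to_drop)

    return df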
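One caveat with casting to `'category'` separately on each frame: pandas infers the category set from the values present, so the integer code behind the same label can differ between the training and test frames when their label sets differ. Depending on the XGBoost version, the model may rely on those codes. A possible guard, sketched under the assumption that the training data defines the valid labels, is to build a `CategoricalDtype` from the training column and reuse it everywhere. The helper names and the toy genre labels are hypothetical.

```python
import pandas as pd

def fit_category_dtypes(df_fit, columns):
    """Build one CategoricalDtype per column from the fitting data."""
    return {
        col: pd.CategoricalDtype(categories=sorted(df_fit[col].dropna().unique()))
        for col in columns
    }

def apply_category_dtypes(df, dtypes):
    """Return a copy of df cast to the stored dtypes; unseen labels become NaN."""
    df = df.copy()
    for col, dtype in dtypes.items():
        df[col] = df[col].astype(dtype)
    return df

# Toy frames: the test frame is missing one of the training labels
toy_train = pd.DataFrame({"Genre": ["Comedy", "News", "Sports"]})
toy_test = pd.DataFrame({"Genre": ["Sports", "Comedy"]})

genre_dtypes = fit_category_dtypes(toy_train, ["Genre"])
train_cat = apply_category_dtypes(toy_train, genre_dtypes)
test_cat = apply_category_dtypes(toy_test, genre_dtypes)

# The same label maps to the same integer code in both frames
print(train_cat["Genre"].cat.codes.tolist())
print(test_cat["Genre"].cat.codes.tolist())
```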



## Model fitting

### Train Test Split

Split the training data into 5 folds. Each fold is held out once for validation while the model is fitted on the remaining four.

In [56]:
from sklearn.model_selection import KFold, cross_validate, cross_val_score
from sklearn.metrics import root_mean_squared_error

NUMBER_OF_SPLITS = 5
    
outer_kfold = KFold(n_splits=NUMBER_OF_SPLITS)

list_train_rmse = []
list_test_rmse = []

for fold_number, (infold_training_indices, infold_test_indices) in enumerate(outer_kfold.split(df_train), 1):

    # Pre-processing of training data in kfold
    X_train = df_train.loc[infold_training_indices,df_train.columns != TARGET_COLUMN]
    X_train = preprocessing(X_train)
    
    y_train = df_train.loc[infold_training_indices,TARGET_COLUMN].to_numpy()

    # Pre-processing of the held-out fold used for in-fold validation
    X_test = df_train.loc[infold_test_indices,df_train.columns != TARGET_COLUMN]
    X_test = preprocessing(X_test)
    
    y_test = df_train.loc[infold_test_indices,TARGET_COLUMN].to_numpy()

    hyperparameters = {
        "learning_rate": 0.015,
        "max_depth": 6,
        "n_estimators": 700,
        "random_state": 42,
        "objective": 'reg:squarederror',
        "enable_categorical":True,
    }

    # Defining XGBoost Parameters
    xgboost_model=xgb.XGBRegressor(
        **hyperparameters,
    )

    xgboost_model.fit(
        X_train,
        y_train,
    )

    y_train_preds = xgboost_model.predict(X_train)
    train_rmse = root_mean_squared_error(y_true=y_train,y_pred=y_train_preds)
    list_train_rmse.append(train_rmse)

    y_test_preds = xgboost_model.predict(X_test)
    test_rmse = root_mean_squared_error(y_true=y_test,y_pred=y_test_preds)
    list_test_rmse.append(test_rmse)

    print(f'--- Fold {fold_number} Completed ---')
    print('train_rmse, test_rmse - ',train_rmse,test_rmse)

print('--- Training_Completed ---')
print('The average test cross neg_root_mean_squared_error is ', sum(list_test_rmse)/len(list_test_rmse))

--- Fold 1 Completed ---
train_rmse, test_rmse -  12.957847301006906 13.120053416255455
--- Fold 2 Completed ---
train_rmse, test_rmse -  12.963251420438514 13.097700897339704
--- Fold 3 Completed ---
train_rmse, test_rmse -  12.980167229885394 13.03547836210973
--- Fold 4 Completed ---
train_rmse, test_rmse -  12.98427055651127 13.027620431794702
--- Fold 5 Completed ---
train_rmse, test_rmse -  12.978954985486117 13.042091848078236
--- Training_Completed ---
The average test cross neg_root_mean_squared_error is  13.064588991115565
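The per-fold scores can also be summarised by their spread and by the gap between training and validation RMSE, which gives a rough indication of overfitting. The sketch below copies the values from the output above; the variable names are illustrative.

```python
import numpy as np

# Per-fold scores copied from the cell output above
fold_train_rmse = np.array([
    12.957847301006906, 12.963251420438514, 12.980167229885394,
    12.98427055651127, 12.978954985486117,
])
fold_test_rmse = np.array([
    13.120053416255455, 13.097700897339704, 13.03547836210973,
    13.027620431794702, 13.042091848078236,
])

mean_test_rmse = fold_test_rmse.mean()
std_test_rmse = fold_test_rmse.std(ddof=1)  # sample standard deviation across folds
mean_gap = (fold_test_rmse - fold_train_rmse).mean()

print(f"CV RMSE: {mean_test_rmse:.4f} +/- {std_test_rmse:.4f}")
print(f"Mean train/validation gap: {mean_gap:.4f}")
```

A small gap relative to the score itself suggests the model is not strongly overfitting at these settings.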


In [58]:
print('The average cross neg_root_mean_squared_error is ', sum(list_test_rmse)/len(list_test_rmse))

The average cross neg_root_mean_squared_error is  13.064588991115565


In [38]:
# Training on entire dataset
X_train = df_train.loc[:,df_train.columns != TARGET_COLUMN]
X_train = preprocessing(X_train)

y_train = df_train.loc[:,TARGET_COLUMN].to_numpy()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Genre"] = df["Genre"].astype("category")


In [39]:
xgboost_model = xgb.XGBRegressor(
    **hyperparameters
)

xgboost_model.fit(X_train,y_train)

In [51]:
# CHECKLIST BEFORE RUNNING
# 1. Is this a new run (start_run run_id empty) or are you inserting into an old run (start_run run_id populated)?
# 2. Do you know the Kaggle leaderboard metric? If not, set it to 999
# 3. Is this a leaderboard model? If not, disable the model logging at the end
# This takes 2 minutes to run

import mlflow
from mlflow.models import infer_signature

# Set our tracking server uri for logging
mlflow.set_tracking_uri("http://localhost:5000")

# Create a new MLflow Experiment
mlflow.set_experiment("Kaggle S5E4")

# Start an MLflow run
with mlflow.start_run(run_id='963c92f7aefc4587b116f92d16cfd4f4'):

    # Log the hyperparameters
    mlflow.log_params(hyperparameters)

    # Log the loss metric
    mlflow.log_metric("cv_neg_root_mean_squared_error", sum(list_test_rmse)/len(list_test_rmse))
    mlflow.log_metric("kaggle leaderboard", 13.15760)

    # # Infer the model signature
    # signature = infer_signature(
    #     model_input=X_train,
    #     model_output=y_train,
    # )

    # # Log the model
    # model_info = mlflow.sklearn.log_model(
    #     sk_model=xgboost_model,
    #     artifact_path="xgboost_model",
    #     signature=signature,
    #     input_example=X_train,
    # )


🏃 View run intrigued-skunk-907 at: http://localhost:5000/#/experiments/3/runs/963c92f7aefc4587b116f92d16cfd4f4
🧪 View experiment at: http://localhost:5000/#/experiments/3


## Test Set Prediction

In [43]:
X_test = preprocessing(df_test)

In [44]:
y_preds = xgboost_model.predict(X_test)

In [45]:
df_submission = pd.DataFrame(
    y_preds,
    columns=['Listening_Time_minutes'],
    index=df_test.id
)

In [46]:
# write the csv to the submissions folder
df_submission.to_csv('../submissions/xgboost_add_weekend_indicator_and_genre.csv')
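Before uploading, it can help to check that the frame matches the submission format described in the Evaluation section. The sketch below is one possible check; the helper name `check_submission` is hypothetical, and the toy frame stands in for the real predictions.

```python
import numpy as np
import pandas as pd

def check_submission(df_submission, expected_ids, target="Listening_Time_minutes"):
    """Raise ValueError if the submission frame does not match the expected format."""
    if list(df_submission.columns) != [target]:
        raise ValueError(f"expected a single column named {target!r}")
    if df_submission.index.name != "id":
        raise ValueError("index must be named 'id' so it is written as the id column")
    if not df_submission.index.equals(pd.Index(expected_ids, name="id")):
        raise ValueError("ids do not match the test set")
    if not np.isfinite(df_submission[target]).all():
        raise ValueError("predictions contain NaN or infinite values")
    return True

# Toy submission with three ids
toy_ids = [750000, 750001, 750002]
toy_submission = pd.DataFrame(
    {"Listening_Time_minutes": [56.5, 18.5, 49.6]},
    index=pd.Index(toy_ids, name="id"),
)
submission_ok = check_submission(toy_submission, toy_ids)
print(submission_ok)
```

In this notebook the equivalent call would be `check_submission(df_submission, df_test["id"])`, run just before `to_csv`.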

In [16]:
df_submission

id,Listening_Time_minutes
750000,56.563034
750001,18.518909
750002,49.576340
750003,77.886902
750004,47.933655
...,...
999995,11.649667
999996,58.536015
999997,6.864640
999998,72.083359
