# 🎧 Podcast Listening Time Prediction
_A data science project to predict [podcast listening duration](https://www.kaggle.com/competitions/playground-series-s5e4/overview)_

---

### **Created by: Antonio Kevin**
🌐 [**Kaggle**](https://www.kaggle.com/akkevin) | 💼 [**LinkedIn**](https://www.linkedin.com/in/antonio-kevin/) | 🧑‍💻 [**GitHub**](https://github.com/akkevinn)

---

## **Table of Contents**
1. [Problem Understanding](#Problem-Understanding)
2. [Approach](#Approach)
3. [Data Preprocessing](#Data-Preprocessing)
4. [Feature Engineering](#Feature-Engineering)
5. [Model Training](#Model-Training)
6. [Prediction and Submission](#Prediction-and-Submission)

---

## **Problem Understanding**

### **Business Objective**:
The objective of this model is to **predict podcast listening durations** to help content creators improve their podcasts and increase engagement. This prediction will assist with:

- **Optimizing episode length**: Helping creators decide on the ideal episode duration.
- **Improving publishing schedules**: Suggesting the best times and days for publishing.
- **Guest selection**: Determining which types of guests or topics result in higher engagement.

### **Evaluation Metric**:
The model is evaluated with **Root Mean Squared Error (RMSE)**: the square root of the mean squared difference between predicted and actual listening durations, expressed in **minutes**. Because errors are squared before averaging, large misses are penalized more heavily than small ones. Lower values indicate more accurate predictions.
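
For reference, the metric can be computed by hand in a few lines. This is a minimal sketch with made-up numbers (the arrays below are illustrative, not competition data), and the `rmse` helper is not part of the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (minutes here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only
actual = [30.0, 45.0, 60.0]
predicted = [33.0, 41.0, 60.0]
score = rmse(actual, predicted)
print(f"RMSE: {score:.2f} minutes")
```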

---

## **Approach**
This notebook will walk through the following steps:

1. **Data Preprocessing**: Clean and prepare the data for modeling.
2. **Feature Engineering**: Create meaningful features for model training.
3. **Model Training**: Train and fine-tune a suitable model for the prediction task.
4. **Evaluation**: Assess model performance using RMSE.
5. **Prediction on Test Data and Submission**: 
    - Use the trained model to make predictions on the test dataset.
    - Format the predictions for submission according to the competition requirements.
    - Save the results to a CSV file for submission.

---

In [1]:
import pandas as pd
import numpy as np
from category_encoders import TargetEncoder
from sklearn.model_selection import KFold
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import sklearn
sklearn.set_config(transform_output="pandas")

## **Data Preprocessing**

In [2]:
# Import dataset
train_data = pd.read_csv('train.csv')
train_data['dataset'] = 'train'

test_data = pd.read_csv('test.csv')
test_data['dataset'] = 'test'

# Combine both train & test data for feature engineering
data = pd.concat([train_data, test_data]).reset_index(drop=True)
data

| | id | Podcast_Name | Episode_Title | Episode_Length_minutes | Genre | Host_Popularity_percentage | Publication_Day | Publication_Time | Guest_Popularity_percentage | Number_of_Ads | Episode_Sentiment | Listening_Time_minutes | dataset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Mystery Matters | Episode 98 | NaN | True Crime | 74.81 | Thursday | Night | NaN | 0.0 | Positive | 31.41998 | train |
| 1 | 1 | Joke Junction | Episode 26 | 119.80 | Comedy | 66.95 | Saturday | Afternoon | 75.95 | 2.0 | Negative | 88.01241 | train |
| 2 | 2 | Study Sessions | Episode 16 | 73.90 | Education | 69.97 | Tuesday | Evening | 8.97 | 0.0 | Negative | 44.92531 | train |
| 3 | 3 | Digital Digest | Episode 45 | 67.17 | Technology | 57.22 | Monday | Morning | 78.70 | 2.0 | Positive | 46.27824 | train |
| 4 | 4 | Mind & Body | Episode 86 | 110.51 | Health | 80.07 | Monday | Afternoon | 58.68 | 3.0 | Neutral | 75.61031 | train |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | 999995 | Mind & Body | Episode 100 | 21.05 | Health | 65.77 | Saturday | Evening | 96.40 | 3.0 | Negative | NaN | test |
| 999996 | 999996 | Joke Junction | Episode 85 | 85.50 | Comedy | 41.47 | Saturday | Night | 30.52 | 2.0 | Negative | NaN | test |
| 999997 | 999997 | Joke Junction | Episode 63 | 12.11 | Comedy | 25.92 | Thursday | Evening | 73.69 | 1.0 | Neutral | NaN | test |
| 999998 | 999998 | Market Masters | Episode 46 | 113.46 | Business | 43.47 | Friday | Night | 93.59 | 3.0 | Positive | NaN | test |


In [3]:
# Number of Ads
## Fill missing (NULL) values with the median of 'Number_of_Ads' column
## Censor outlier values by setting a maximum threshold of 3 for any values greater than 3
data['Number_of_Ads'] = data['Number_of_Ads'].fillna(data['Number_of_Ads'].median())
data['Number_of_Ads'] = data['Number_of_Ads'].clip(upper=3)

# Guest Popularity Percentage
## Fill missing (NULL) values with 0, assuming no guest in the podcast if the value is missing
## Censor outlier values by setting a maximum threshold of 100 for any values greater than 100
data['Guest_Popularity_percentage'] = data['Guest_Popularity_percentage'].fillna(0)
data['Guest_Popularity_percentage'] = data['Guest_Popularity_percentage'].clip(upper=100)

# Episode Length Minutes
## Fill missing (NULL) values in the 'Episode_Length_minutes' column with the median value for each podcast
## Censor outlier values by setting a maximum threshold of 120 minutes for any values greater than 120
data['Episode_Length_minutes'] = \
    data.groupby('Podcast_Name')['Episode_Length_minutes'] \
    .transform(lambda x: x.fillna(x.median()))
data['Episode_Length_minutes'] = data['Episode_Length_minutes'].clip(upper=120)

# Host Popularity Percentage
## Censor outlier values by setting a maximum threshold of 100 for any values greater than 100
data['Host_Popularity_percentage'] = data['Host_Popularity_percentage'].clip(upper=100)

# Create episode number feature
## Extract episode number from the 'Episode_Title' column (assumes episode number is in the title)
## Convert the extracted value to a float and fill any missing episode numbers with 0
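## A raw string avoids the invalid-escape warning for '\d'; expand=False returns a Series rather than a one-column DataFrame
data['episode_num'] = data['Episode_Title'].str.extract(r'(\d+)', expand=False).astype(float).fillna(0)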
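
To see what the per-podcast median fill for `Episode_Length_minutes` does, here is a small self-contained sketch on a toy frame (the podcast names and lengths are made up). It also shows one caveat: a podcast whose lengths are all missing stays missing. The global-median fallback at the end is an addition for illustration, not something the cell above does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Podcast_Name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Episode_Length_minutes': [10.0, np.nan, 30.0, 50.0, np.nan, np.nan],
})

# Same pattern as the notebook: fill each gap with that podcast's median
filled = toy.groupby('Podcast_Name')['Episode_Length_minutes'] \
    .transform(lambda x: x.fillna(x.median()))

# Podcast 'C' has no observed lengths, so its median is NaN and the gap remains.
# Hypothetical fallback: use the global median for anything still missing.
filled_with_fallback = filled.fillna(toy['Episode_Length_minutes'].median())

print(filled.tolist())
print(filled_with_fallback.tolist())
```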

## **Feature Engineering**

In [4]:
# Genre
## Apply One-Hot Encoding to the 'Genre' column
## This converts each unique genre into a separate binary column (1 or 0)
data = pd.get_dummies(data, columns=['Genre'], prefix='genre')

# Publication Day
## Use sine and cosine functions to encode the days of the week cyclically
## These columns will represent the periodic nature of the 'Publication_Day' feature
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 
             'Friday', 'Saturday', 'Sunday']
data['day_num'] = data['Publication_Day'].map(lambda x: day_order.index(x))
data['day_sin'] = np.sin(2 * np.pi * data['day_num'] / 7)
data['day_cos'] = np.cos(2 * np.pi * data['day_num'] / 7)

# Publication Time
## Apply One-Hot Encoding to the 'Publication_Time' column
## This will convert each unique publication time into separate binary columns
data = pd.get_dummies(data, columns=['Publication_Time'], prefix='time')

# Sentiment
## Apply Ordinal Encoding to the 'Episode_Sentiment' column
## Map 'Negative' to 0, 'Neutral' to 1, and 'Positive' to 2 so the codes follow the sentiment order
sentiment_map = {'Positive': 2, 'Neutral': 1, 'Negative': 0}
data['sentiment_encoded'] = data['Episode_Sentiment'].map(sentiment_map)

# Create length_minute_bin
## Bin the 'Episode_Length_minutes' column into discrete categories based on episode length
## The bins are set as [0, 30, 60, 90, 120] minutes, and the corresponding labels are [0, 1, 2, 3]
bins = [0, 30, 60, 90, 120]
labels = [0, 1, 2, 3]
## include_lowest=True puts a length of exactly 0 into the first bin instead of producing NaN,
## which avoids a slow row-wise apply over the full dataset
data['length_bin'] = pd.cut(
    data['Episode_Length_minutes'], bins=bins, labels=labels,
    right=True, include_lowest=True
)

## Convert the 'length_bin' to an integer type
data['length_bin'] = data['length_bin'].astype(int)

# Create Is Weekend
## Create a binary column indicating whether the publication day is a weekend (Saturday or Sunday)
data['is_weekend'] = data['Publication_Day'].isin(['Saturday', 'Sunday']).astype(int)

# Compare Host & Guest Popularity
## Calculate the ratio of host popularity to guest popularity, with a small epsilon to avoid division by zero
data['host_guest_ratio'] = data['Host_Popularity_percentage'] / (data['Guest_Popularity_percentage'] + 1e-6)

## Calculate the difference between host and guest popularity
data['host_guest_diff'] = data['Host_Popularity_percentage'] - data['Guest_Popularity_percentage']

# Amplify popularity metrics based on sentiment
## Multiply the host popularity by the sentiment encoding to adjust for sentiment impact
data['host_pop_sentiment'] = data['Host_Popularity_percentage'] * data['sentiment_encoded']

## Multiply the guest popularity by the sentiment encoding to adjust for sentiment impact
data['guest_pop_sentiment'] = data['Guest_Popularity_percentage'] * data['sentiment_encoded']
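
The sine/cosine encoding used for `Publication_Day` places the seven days on a unit circle, so Sunday and Monday end up as close together as any other pair of neighbouring days, which a plain 0 to 6 integer code would not capture. A small sketch that checks this numerically (the `encode_day` and `distance` helpers are illustrative, not part of the notebook):

```python
import numpy as np

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

def encode_day(day):
    """Map a day name to its (sin, cos) position on the unit circle."""
    angle = 2 * np.pi * day_order.index(day) / 7
    return np.array([np.sin(angle), np.cos(angle)])

def distance(day_a, day_b):
    """Euclidean distance between two encoded days."""
    return float(np.linalg.norm(encode_day(day_a) - encode_day(day_b)))

sun_mon = distance('Sunday', 'Monday')
mon_tue = distance('Monday', 'Tuesday')
mon_thu = distance('Monday', 'Thursday')

print(f"Sunday-Monday:   {sun_mon:.3f}")
print(f"Monday-Tuesday:  {mon_tue:.3f}")
print(f"Monday-Thursday: {mon_thu:.3f}")
```

Neighbouring days are equidistant regardless of where the week "wraps", while days further apart in the week are further apart in the encoded space.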

In [5]:
data.columns

Index(['id', 'Podcast_Name', 'Episode_Title', 'Episode_Length_minutes',
       'Host_Popularity_percentage', 'Publication_Day',
       'Guest_Popularity_percentage', 'Number_of_Ads', 'Episode_Sentiment',
       'Listening_Time_minutes', 'dataset', 'episode_num', 'genre_Business',
       'genre_Comedy', 'genre_Education', 'genre_Health', 'genre_Lifestyle',
       'genre_Music', 'genre_News', 'genre_Sports', 'genre_Technology',
       'genre_True Crime', 'day_num', 'day_sin', 'day_cos', 'time_Afternoon',
       'time_Evening', 'time_Morning', 'time_Night', 'sentiment_encoded',
       'length_bin', 'is_weekend', 'host_guest_ratio', 'host_guest_diff',
       'host_pop_sentiment', 'guest_pop_sentiment'],
      dtype='object')
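
One engineered column in the list above deserves a quick check. Missing `Guest_Popularity_percentage` values were filled with 0 earlier, so for those rows `host_guest_ratio` is the host popularity divided by `1e-6`, which is on the order of 10^7. Tree models split on thresholds, so this mostly behaves like a "no guest" flag, but it is worth being aware of. The sketch below reproduces the effect using the first two rows of the training data shown earlier, and shows one possible alternative: adding 1 to the denominator, which is an assumption of this example rather than something the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    'Host_Popularity_percentage': [74.81, 66.95],
    'Guest_Popularity_percentage': [0.0, 75.95],  # 0.0 stands in for a filled missing value
})

# Ratio as computed in the notebook
toy['host_guest_ratio'] = toy['Host_Popularity_percentage'] / (toy['Guest_Popularity_percentage'] + 1e-6)

# Hypothetical alternative: a +1 offset keeps the ratio at or below the host popularity
toy['host_guest_ratio_alt'] = toy['Host_Popularity_percentage'] / (toy['Guest_Popularity_percentage'] + 1)

print(toy)
```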

In [6]:
# List of features for model training
features = [
    'day_sin', 'day_cos', 'sentiment_encoded',
    'episode_num', 'Episode_Length_minutes',
    'Host_Popularity_percentage', 'Guest_Popularity_percentage',
    'Number_of_Ads',
    'length_bin',
    'is_weekend',
    'host_guest_ratio',
    'host_guest_diff',
    'host_pop_sentiment',
    'guest_pop_sentiment',
    'Podcast_Name'
] + [col for col in data.columns if col.startswith('time_')] \
    + [col for col in data.columns if col.startswith('genre_')]
features

['day_sin',
 'day_cos',
 'sentiment_encoded',
 'episode_num',
 'Episode_Length_minutes',
 'Host_Popularity_percentage',
 'Guest_Popularity_percentage',
 'Number_of_Ads',
 'length_bin',
 'is_weekend',
 'host_guest_ratio',
 'host_guest_diff',
 'host_pop_sentiment',
 'guest_pop_sentiment',
 'Podcast_Name',
 'time_Afternoon',
 'time_Evening',
 'time_Morning',
 'time_Night',
 'genre_Business',
 'genre_Comedy',
 'genre_Education',
 'genre_Health',
 'genre_Lifestyle',
 'genre_Music',
 'genre_News',
 'genre_Sports',
 'genre_Technology',
 'genre_True Crime']

In [7]:
# Split train & test data
train_data = data[data['dataset'] == 'train'].copy()
test_data = data[data['dataset'] == 'test'].reset_index(drop=True).copy()

## **Model Training**

In [8]:
# Prepare features and target variable
X = train_data[features]
y = train_data['Listening_Time_minutes']

# Prepare test data
X_test = test_data[features]

# Training model
# Plain (ungrouped, unshuffled) 5-fold cross-validation
kf = KFold(n_splits=5)
rmse_scores = []

for (train_idx, val_idx) in kf.split(X, y):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    # Initialize target encoder
    encoder = TargetEncoder(cols=['Podcast_Name'], smoothing=10)
    
    # Fit and transform on training fold
    X_train_encoded = encoder.fit_transform(X_train, y_train)
    
    # Transform validation fold (using training fold's encoding)
    X_val_encoded = encoder.transform(X_val)
    
    # Drop original Podcast Name column
    X_train_encoded = X_train_encoded.drop(columns=['Podcast_Name'])
    X_val_encoded = X_val_encoded.drop(columns=['Podcast_Name'])
    
    # Train model
    model = xgb.XGBRegressor(
        objective='reg:squarederror',
        n_estimators=1000,
        learning_rate=0.02,
        max_depth=15,
        subsample=0.8,
        early_stopping_rounds=50,
        eval_metric='rmse'
    )
    model.fit(X_train_encoded, y_train, eval_set=[(X_val_encoded, y_val)], verbose=False)
    
    # Validate
    val_preds = model.predict(X_val_encoded)
    rmse = np.sqrt(mean_squared_error(y_val, val_preds))
    rmse_scores.append(rmse)
    print(f"Fold RMSE: {rmse:.2f} minutes")

print(f"\nAverage RMSE: {np.mean(rmse_scores):.2f} ± {np.std(rmse_scores):.2f}")


Fold RMSE: 12.71 minutes
Fold RMSE: 12.68 minutes
Fold RMSE: 12.61 minutes
Fold RMSE: 12.62 minutes
Fold RMSE: 12.63 minutes

Average RMSE: 12.65 ± 0.04
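
Fitting the target encoder inside each fold matters because the encoding is derived from the target: if it were fit on all rows first, each validation row's own listening time would leak into its `Podcast_Name` feature and the CV score would look better than it should. The sketch below illustrates the idea with a simplified smoothed mean written in plain pandas. It is not the exact formula `category_encoders.TargetEncoder` uses, and the toy data and the `smoothed_target_means` helper are made up for illustration.

```python
import pandas as pd

def smoothed_target_means(categories, target, smoothing=10):
    """Per-category target mean, shrunk towards the global mean.

    Simplified additive smoothing: (sum + smoothing * prior) / (count + smoothing).
    """
    prior = target.mean()
    stats = target.groupby(categories).agg(['sum', 'count'])
    encoding = (stats['sum'] + smoothing * prior) / (stats['count'] + smoothing)
    return encoding, prior

train = pd.DataFrame({
    'Podcast_Name': ['A', 'A', 'B', 'B'],
    'Listening_Time_minutes': [10.0, 20.0, 40.0, 50.0],
})
val = pd.DataFrame({'Podcast_Name': ['A', 'B', 'C']})  # 'C' is unseen in the training fold

# Fit on the training fold only
encoding, prior = smoothed_target_means(
    train['Podcast_Name'], train['Listening_Time_minutes'], smoothing=2
)

# Apply to the validation fold; unseen categories fall back to the global mean
val['podcast_encoded'] = val['Podcast_Name'].map(encoding).fillna(prior)
print(val)
```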


In [9]:
# Final model training (RMSE: 12.59510)
encoder = TargetEncoder(cols=['Podcast_Name'], smoothing=10)
X_encoded = encoder.fit_transform(X, y)
X_encoded = X_encoded.drop(columns=['Podcast_Name'])
X_test_encoded = encoder.transform(X_test).drop(columns=['Podcast_Name'])

final_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=1000,
    learning_rate=0.02,
    max_depth=15,
    subsample=0.8,
    eval_metric='rmse'
)
final_model.fit(X_encoded, y)

## **Prediction and Submission**

In [10]:
# Prepare submission
test_preds = final_model.predict(X_test_encoded)
submission_data = pd.DataFrame({
    'id': test_data['id'],
    'Listening_Time_minutes': test_preds
})
submission_data.to_csv('final_submission_v13.csv', index=False)
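
Before uploading, a few sanity checks on the submission frame can catch common mistakes such as missing predictions or duplicated ids. A minimal sketch, assuming the expected columns are `id` and `Listening_Time_minutes` as written above; the `check_submission` helper and the toy frames are hypothetical:

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_rows):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    problems = []
    if list(submission.columns) != ['id', 'Listening_Time_minutes']:
        problems.append('unexpected columns')
        return problems
    if len(submission) != expected_rows:
        problems.append('unexpected row count')
    if submission['id'].duplicated().any():
        problems.append('duplicate ids')
    if submission['Listening_Time_minutes'].isna().any():
        problems.append('missing predictions')
    if (submission['Listening_Time_minutes'] < 0).any():
        problems.append('negative predictions')
    return problems

good = pd.DataFrame({'id': [1, 2, 3], 'Listening_Time_minutes': [12.5, 40.0, 3.2]})
bad = pd.DataFrame({'id': [1, 1, 3], 'Listening_Time_minutes': [12.5, np.nan, -1.0]})

good_problems = check_submission(good, expected_rows=3)
bad_problems = check_submission(bad, expected_rows=3)
print(good_problems)
print(bad_problems)
```

In this notebook the call would look like `check_submission(submission_data, expected_rows=len(test_data))`.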

## **Hyperparameter Tuning with Optuna**

In [55]:
import optuna

X = train_data[features]
y = train_data['Listening_Time_minutes']

def objective(trial):
    # Define hyperparameter search space
    params = {
        'objective': 'reg:squarederror',
        'n_estimators': 1000,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 6, 15),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'early_stopping_rounds': 50,
        'eval_metric': 'rmse',
        'tree_method': 'hist',
        'device': 'cuda'
    }

    # Cross-validation setup
    kf = KFold(n_splits=5)
    cv_rmse_scores = []

    for train_idx, val_idx in kf.split(X, y):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        # Target encoding within fold
        encoder = TargetEncoder(cols=['Podcast_Name'], smoothing=10)
        X_train_encoded = encoder.fit_transform(X_train, y_train)
        X_val_encoded = encoder.transform(X_val)
        X_train_encoded = X_train_encoded.drop(columns=['Podcast_Name'])
        X_val_encoded = X_val_encoded.drop(columns=['Podcast_Name'])
        
        # Initialize and train model
        model = xgb.XGBRegressor(**params)
        model.fit(
            X_train_encoded, 
            y_train,
            eval_set=[(X_val_encoded, y_val)],
            verbose=False
        )
        
        # Evaluate
        val_preds = model.predict(X_val_encoded)
        fold_rmse = np.sqrt(mean_squared_error(y_val, val_preds))
        cv_rmse_scores.append(fold_rmse)
    
    return np.mean(cv_rmse_scores)

# Run optimization
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=25, timeout=3600)

# Best parameters
print("Best trial:")
trial = study.best_trial
print(f"RMSE: {trial.value:.4f}")
print("Best params:")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

[I 2025-04-24 07:22:21,654] A new study created in memory with name: no-name-b82a849f-8076-46cb-8987-cc48b61be1b9
[I 2025-04-24 07:25:54,086] Trial 0 finished with value: 12.730165089009223 and parameters: {'n_estimators': 1000, 'learning_rate': 0.05460702799905646, 'max_depth': 14, 'subsample': 0.8350017473557522, 'colsample_bytree': 0.7676632123955569}. Best is trial 0 with value: 12.730165089009223.
[I 2025-04-24 07:26:23,146] Trial 1 finished with value: 13.166586159796646 and parameters: {'n_estimators': 1000, 'learning_rate': 0.2698270778984419, 'max_depth': 14, 'subsample': 0.7465846151493415, 'colsample_bytree': 0.918159572529525}. Best is trial 0 with value: 12.730165089009223.
[I 2025-04-24 07:30:17,533] Trial 2 finished with value: 12.811868869173665 and parameters: {'n_estimators': 1000, 'learning_rate': 0.01750120182693745, 'max_depth': 11, 'subsample': 0.7610416700415726, 'colsample_bytree': 0.910950369321817}. Best is trial 0 with value: 12.730165089009223.
[W 2025-04-24

KeyboardInterrupt: 
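
Once a study has finished (or been interrupted, as here), the tuned values still need to be merged with the fixed settings before refitting. A minimal sketch in plain Python, using rounded Trial 0 values from the log above as stand-ins for `study.best_params`. The `build_final_params` helper is hypothetical, and dropping `early_stopping_rounds` assumes the final fit has no validation set, as in the final-training cell.

```python
# Fixed settings copied from the objective function above
fixed_params = {
    'objective': 'reg:squarederror',
    'n_estimators': 1000,
    'early_stopping_rounds': 50,
    'eval_metric': 'rmse',
    'tree_method': 'hist',
    'device': 'cuda',
}

# Tuned values as reported for Trial 0 in the log (rounded for readability)
tuned_params = {
    'learning_rate': 0.0546,
    'max_depth': 14,
    'subsample': 0.835,
    'colsample_bytree': 0.768,
}

def build_final_params(fixed, tuned, drop=('early_stopping_rounds',)):
    """Merge fixed and tuned settings, removing keys that only make sense with an eval_set."""
    merged = {**fixed, **tuned}
    return {k: v for k, v in merged.items() if k not in drop}

final_params = build_final_params(fixed_params, tuned_params)
print(final_params)
```

The resulting dictionary could then be passed on as `xgb.XGBRegressor(**final_params)` for the final fit.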