## 1.1. Regression Models for Predicting Number of Views and Likes per Video

This notebook explores the target variables in the datasets under `data/processed/` by fitting regression models to them.

See [1.1 Regression Models](https://eddelojeda.github.io/youtube_trends/1.1_Regression_Models.html) to interact with the dynamic plots in this notebook.

In [1]:
import os
import joblib
import warnings
import numpy as np
import pandas as pd
import torch.nn as nn
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from youtube_trends.config import PROCESSED_DATA_DIR, MODELS_DIR
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pio.renderers.default = 'notebook_connected'
warnings.filterwarnings('ignore')

2025-05-26 02:39:10.144 | INFO     | youtube_trends.config:<module>:11 - PROJ_ROOT path is: C:\Users\eddel\OneDrive\Documents\MCD\AAA\youtube_trends\venv\src\youtube-trends


#### Preparing training, validation, and testing datasets

Loading processed datasets `df_train`, `df_val` and `df_test`.

In [2]:
df_train = pd.read_csv(PROCESSED_DATA_DIR / 'train_dataset.csv', low_memory=False)
df_val = pd.read_csv(PROCESSED_DATA_DIR / 'val_dataset.csv', low_memory=False)
df_test = pd.read_csv(PROCESSED_DATA_DIR / 'test_dataset.csv', low_memory=False)

In [3]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

(46337, 199)
(7626, 199)
(6736, 199)


Feature selection.

In [5]:
drop_cols = ['video_view_count', 'video_like_count', 'days_to_trend', 'video_published_at']

X_train = df_train.drop(columns=drop_cols)
X_val = df_val.drop(columns=drop_cols)
X_test = df_test.drop(columns=drop_cols)

Only numeric columns are retained for the regression models. If `--translate=True` was specified during dataset processing, the `video_title_translated` column is available for interpretation, but it is not used by the models in this notebook.

In [6]:
X_train = X_train.select_dtypes(include=np.number)
X_val = X_val.select_dtypes(include=np.number)
X_test = X_test.select_dtypes(include=np.number)
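
As a minimal sketch of what this filter does, the toy frame below (its values are made up for illustration and are not taken from the dataset) shows `select_dtypes(include=np.number)` keeping integer and float columns and dropping a text column such as a translated title.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "video_duration": [36, 160, 984],            # integer column -> kept
    "sentiment_score": [0.0, -0.5, 0.9],         # float column   -> kept
    "video_title_translated": ["a", "b", "c"],   # text column    -> dropped
})

numeric_only = toy.select_dtypes(include=np.number)
dropped = [c for c in toy.columns if c not in numeric_only.columns]

print(list(numeric_only.columns))
print(dropped)
```

Comparing the column lists before and after the filter is a quick way to confirm that no intended feature was silently removed.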

In [7]:
X_train.describe()

| | video_duration | video_comment_count | channel_view_count | channel_subscriber_count | published_dayofweek | published_hour | video_title_length | video_tag_count | sentiment_score | sentiment_negative | ... | lang_pca_0 | lang_pca_1 | lang_pca_2 | lang_pca_3 | lang_pca_4 | video_category_pca_0 | video_category_pca_1 | video_category_pca_2 | video_category_pca_3 | days_published |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | ... | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 | 46337.0 |
| mean | 1012.372273 | 3836.278438 | 6040277000.0 | 11979750.0 | 3.197898 | 12.679932 | 9.143212 | 0.655329 | 0.049619 | 0.144204 | ... | -7.851132e-17 | -3.680218e-17 | 1.242074e-17 | -3.291112e-17 | 3.649070e-17 | 2.576153e-17 | 4.968294e-17 | 2.453479e-17 | 2.085457e-17 | 21.900986 |
| std | 2819.329071 | 8609.104544 | 15784950000.0 | 39893810.0 | 1.958441 | 5.814503 | 4.036997 | 0.493445 | 0.281998 | 0.351301 | ... | 0.4589096 | 0.2054438 | 0.158567 | 0.1435839 | 0.141411 | 0.4796688 | 0.3860894 | 0.3538891 | 0.3258885 | 4.506948 |
| min | 10.0 | 0.0 | 216347.0 | 56.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.8979 | 0.0 | ... | -0.8912856 | -0.3484205 | -0.2960471 | -0.578804 | -0.5823643 | -0.3966943 | -0.704218 | -0.4341486 | -0.6524535 | 13.0 |
| 25% | 36.0 | 335.0 | 239754000.0 | 610000.0 | 1.0 | 9.0 | 6.0 | 0.0 | 0.0 | 0.0 | ... | -0.693732 | 0.00783135 | 0.001800121 | 0.0002321304 | -0.0002709102 | -0.3926991 | -0.007191368 | -0.4194579 | -0.03912901 | 18.0 |
| 50% | 160.0 | 1138.0 | 1181903000.0 | 2650000.0 | 3.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | ... | 0.2746552 | 0.00783135 | 0.001800121 | 0.0002321304 | -0.0002709102 | -0.253623 | -0.002479025 | -0.06956865 | -0.01005693 | 23.0 |
| 75% | 984.0 | 3205.0 | 4687050000.0 | 10900000.0 | 5.0 | 17.0 | 12.0 | 1.0 | 0.0 | 0.0 | ... | 0.2746552 | 0.00783135 | 0.001800121 | 0.0002321304 | -0.0002709102 | 0.7745709 | 0.002370603 | 0.1999355 | 0.01919796 | 25.0 |
| max | 42901.0 | 82964.0 | 297055600000.0 | 396000000.0 | 6.0 | 23.0 | 25.0 | 4.0 | 0.9446 | 1.0 | ... | 0.2746552 | 0.8052141 | 0.8589263 | 0.7906126 | 0.7749721 | 0.7745709 | 0.7099129 | 0.6165521 | 0.7542877 | 29.0 |


Standardization of the numeric features.

In [8]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

joblib.dump(scaler, MODELS_DIR / 'scaler_regression.pkl')

['C:\\Users\\eddel\\OneDrive\\Documents\\MCD\\AAA\\youtube_trends\\venv\\src\\youtube-trends\\models\\scaler_regression.pkl']
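
The pattern above (fit on the training split only, then reuse the same statistics for the other splits and persist the scaler) can be sketched on toy arrays. The arrays and the temporary file name below are assumptions for illustration; the notebook itself writes to `MODELS_DIR`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for the training and validation features.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_val = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train only
toy_val_scaled = toy_scaler.transform(toy_val)          # reuses the training statistics

# Round trip through joblib, as one might do before inference.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_scaler.pkl")  # hypothetical file name
    joblib.dump(toy_scaler, path)
    reloaded = joblib.load(path)

print(toy_train_scaled.mean(axis=0))  # close to zero for every column
print(toy_val_scaled)
print(np.allclose(reloaded.transform(toy_val), toy_val_scaled))
```

Because the validation and test splits are transformed with the training mean and standard deviation, their scaled columns will generally not have zero mean or unit variance, which is expected.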

#### Target features

In [9]:
y_train_dict = {'video_view_count': df_train['video_view_count'], 'video_like_count': df_train['video_like_count']}
y_val_dict = {'video_view_count': df_val['video_view_count'], 'video_like_count': df_val['video_like_count']}
y_test_dict = {'video_view_count': df_test['video_view_count'], 'video_like_count': df_test['video_like_count']}

**Note:** Since the number of likes per video is strongly correlated with the number of views, we assume an explicit dependency between these two targets: the number of likes depends on the number of views, and the likes are less than or equal to the views. This assumption rests on the premise that YouTube counts any play of a video as a view, regardless of how long it was played. The premise might not be entirely true, since YouTube counts views based on a combination of factors, including the number of times a video is watched, the duration of each view, and user interactions. While the exact algorithm is not disclosed, a view generally requires a viewer to watch a portion of the video, often a minimum of 30 seconds, and to actively engage with the content. In practice, this means the likes models below receive the actual `video_view_count` as an additional, unscaled input feature.

In [10]:
X_train_like = np.concatenate([X_train_scaled, df_train[['video_view_count']].values], axis=1)
X_val_like = np.concatenate([X_val_scaled, df_val[['video_view_count']].values], axis=1)
X_test_like = np.concatenate([X_test_scaled, df_test[['video_view_count']].values], axis=1)
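
A minimal sketch of this step on toy data (the numbers are illustrative assumptions): the double brackets in `df[['video_view_count']]` yield a 2-D array of shape `(n, 1)`, which is what `np.concatenate(..., axis=1)` needs in order to append the views as one extra column. The appended column stays in raw view counts, while the other columns are standardized.

```python
import numpy as np
import pandas as pd

# Stand-in for X_train_scaled: three rows, two standardized features.
toy_scaled = np.array([[-1.2, 0.3], [0.0, -0.7], [1.2, 0.4]])
toy_df = pd.DataFrame({"video_view_count": [1_000, 50_000, 2_000_000]})

views_2d = toy_df[["video_view_count"]].values  # shape (3, 1): a column
views_1d = toy_df["video_view_count"].values    # shape (3,): not usable with axis=1 here

toy_like = np.concatenate([toy_scaled, views_2d], axis=1)

print(views_2d.shape, views_1d.shape, toy_like.shape)
print(toy_like[:, -1])  # last column holds the unscaled view counts
```

For the linear models, an unscaled column of this magnitude simply receives a correspondingly small coefficient; for Ridge and Lasso it also changes how strongly the penalty acts on that feature.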

#### Regression models

To predict the number of views and likes we use the following regression models.

In [11]:
model_classes = {
    'Linear_Regression': LinearRegression,
    'Ridge': Ridge,
    'Lasso': Lasso,
    'Decision_Tree': DecisionTreeRegressor,
    'Random_Forest': RandomForestRegressor,
    'XGBoost': XGBRegressor
}
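
The dictionary stores classes rather than fitted instances, so each target can get its own freshly constructed model. The sketch below illustrates that pattern with two of the classes and made-up data; `toy_classes`, `y_a`, and `y_b` are hypothetical names used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

toy_classes = {"Linear_Regression": LinearRegression, "Ridge": Ridge}

# Two made-up targets that depend on the same single feature.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_a = 2.0 * X.ravel() + 1.0
y_b = -1.0 * X.ravel()

fitted = {}
for name, ModelClass in toy_classes.items():
    model_a = ModelClass().fit(X, y_a)  # new instance for the first target
    model_b = ModelClass().fit(X, y_b)  # separate instance for the second target
    fitted[name] = (model_a, model_b)

for name, (model_a, model_b) in fitted.items():
    print(name, model_a.coef_, model_b.coef_)
```

Calling `ModelClass()` with no arguments uses each estimator's default hyperparameters, which is also what the training loop in this notebook does.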

Training and visualization of predicted values versus actual values.

In [12]:
results = {}
results_test = {}
n_target = 0

for name, ModelClass in model_classes.items():
    results[name] = {}
    results_test[name] = {}

    # video_view_count
    model_vv = ModelClass()
    model_vv.fit(X_train_scaled, y_train_dict['video_view_count'])
    y_pred_vv = model_vv.predict(X_val_scaled)
    mse_vv = mean_squared_error(y_val_dict['video_view_count'], y_pred_vv)
    r2_vv = r2_score(y_val_dict['video_view_count'], y_pred_vv)
    results[name]['video_view_count'] = {'MSE': mse_vv, 'R^2': r2_vv}
    joblib.dump(model_vv, MODELS_DIR /  f'{name}_views.pkl')

    # Evaluate on test set
    y_test_pred_vv = model_vv.predict(X_test_scaled)
    mse_test_vv = mean_squared_error(y_test_dict['video_view_count'], y_test_pred_vv)
    r2_test_vv = r2_score(y_test_dict['video_view_count'], y_test_pred_vv)
    results_test[name]['video_view_count'] = {'MSE': mse_test_vv, 'R^2': r2_test_vv}

    # video_like_count
    model_vl = ModelClass()
    model_vl.fit(X_train_like, y_train_dict['video_like_count'])
    y_pred_vl = model_vl.predict(X_val_like)
    mse_vl = mean_squared_error(y_val_dict['video_like_count'], y_pred_vl)
    r2_vl = r2_score(y_val_dict['video_like_count'], y_pred_vl)
    results[name]['video_like_count'] = {'MSE': mse_vl, 'R^2': r2_vl}
    joblib.dump(model_vl, MODELS_DIR / f'{name}_likes.pkl')

    # Evaluate on test set
    y_test_pred_vl = model_vl.predict(X_test_like)
    mse_test_vl = mean_squared_error(y_test_dict['video_like_count'], y_test_pred_vl)
    r2_test_vl = r2_score(y_test_dict['video_like_count'], y_test_pred_vl)
    results_test[name]['video_like_count'] = {'MSE': mse_test_vl, 'R^2': r2_test_vl}

    # --- Visualizations ---
    for target, y_pred, y_true in zip(['video_view_count', 'video_like_count'], [y_pred_vv, y_pred_vl], [y_val_dict['video_view_count'], y_val_dict['video_like_count']]):
        if n_target % 2 == 0:
            print(f'\nModel: {name}\n')
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=y_true, y=y_pred, mode='markers', name='Predictions'))
        fig.add_trace(go.Scatter(x=y_true, y=y_true, mode='lines', name='Ideal'))
        fig.update_layout(
            title=f'{name} for {target} (Validation)',
            xaxis_title='True Values',
            yaxis_title='Predicted Values',
            template='plotly_white'
        )
        fig.show()

        filename = f'figure_{100+n_target}.html'
        os.makedirs('iframe_figures', exist_ok=True)  # write_html fails if the folder is missing
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)

        n_target += 1


Model: Linear_Regression




Model: Ridge




Model: Lasso




Model: Decision_Tree




Model: Random_Forest




Model: XGBoost



#### Results visualization (MSE and R^2 metrics)

In [13]:
for name, targets in results.items():
    print(f"\nValidation Results for model {name}:\n")
    for target, metrics in targets.items():
        print(f"{target} — MSE: {metrics['MSE']:.2f}, R^2: {metrics['R^2']:.4f}")


Validation Results for model Linear_Regression:

video_view_count — MSE: 220067369319779.00, R^2: 0.5874
video_like_count — MSE: 94812489802.34, R^2: 0.9218

Validation Results for model Ridge:

video_view_count — MSE: 219975338122107.88, R^2: 0.5875
video_like_count — MSE: 94817511120.03, R^2: 0.9218

Validation Results for model Lasso:

video_view_count — MSE: 219859576257921.16, R^2: 0.5878
video_like_count — MSE: 94841756119.72, R^2: 0.9218

Validation Results for model Decision_Tree:

video_view_count — MSE: 462869121291732.50, R^2: 0.1321
video_like_count — MSE: 353415379979.87, R^2: 0.7086

Validation Results for model Random_Forest:

video_view_count — MSE: 434305780210382.31, R^2: 0.1857
video_like_count — MSE: 448914757838.19, R^2: 0.6298

Validation Results for model XGBoost:

video_view_count — MSE: 442006682144723.19, R^2: 0.1712
video_like_count — MSE: 412630223603.98, R^2: 0.6598
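
The MSE values above are expressed in squared units (views squared or likes squared), which makes them hard to read. As a small worked example using numbers copied from the output above, taking the square root gives the RMSE in the original units: roughly 14.8 million views and roughly 308 thousand likes for Linear_Regression.

```python
import math

# Validation MSE values copied from the output above.
val_mse = {
    "video_view_count": {
        "Linear_Regression": 220067369319779.00,
        "Lasso": 219859576257921.16,
    },
    "video_like_count": {
        "Linear_Regression": 94812489802.34,
        "Lasso": 94841756119.72,
    },
}

# RMSE = sqrt(MSE), expressed in the same units as the target.
val_rmse = {
    target: {model: math.sqrt(mse) for model, mse in models.items()}
    for target, models in val_mse.items()
}

for target, models in val_rmse.items():
    for model, rmse in models.items():
        print(f"{target} | {model}: RMSE = {rmse:,.0f}")
```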


In [14]:
for name, targets in results_test.items():
    print(f"\nTest Results for model {name}:\n")
    for target, metrics in targets.items():
        print(f"{target} — MSE: {metrics['MSE']:.2f}, R^2: {metrics['R^2']:.4f}")


Test Results for model Linear_Regression:

video_view_count — MSE: 204913013085667.66, R^2: 0.0079
video_like_count — MSE: 90109435650.77, R^2: 0.4257

Test Results for model Ridge:

video_view_count — MSE: 152187488131724.31, R^2: 0.2632
video_like_count — MSE: 71009355495.35, R^2: 0.5474

Test Results for model Lasso:

video_view_count — MSE: 119586250075047.30, R^2: 0.4210
video_like_count — MSE: 61543212851.28, R^2: 0.6078

Test Results for model Decision_Tree:

video_view_count — MSE: 278077070901919.84, R^2: -0.3463
video_like_count — MSE: 68356951988.16, R^2: 0.5643

Test Results for model Random_Forest:

video_view_count — MSE: 159092453641556.69, R^2: 0.2298
video_like_count — MSE: 30681849900.07, R^2: 0.8045

Test Results for model XGBoost:

video_view_count — MSE: 134152469665445.97, R^2: 0.3505
video_like_count — MSE: 29951198438.09, R^2: 0.8091
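
A negative $R^2$, as reported for Decision_Tree on the test views, means that the model's squared error is larger than that of a constant prediction equal to the mean of the true values. The sketch below uses made-up numbers and the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$; `r2_manual` is a hypothetical helper written for this illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_manual(y_true, y_true)               # exact predictions
r2_mean = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
r2_bad = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])     # worse than the mean baseline

print(r2_perfect, r2_mean, r2_bad)
```

Read this way, the test-set $R^2$ of 0.0079 for Linear_Regression on views is barely better than predicting the mean, and -0.3463 for Decision_Tree is worse than that baseline.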


In [15]:
for metric in ["MSE", "R^2"]:
    for target in ["video_view_count", "video_like_count"]:
        values = [results[model][target][metric] for model in model_classes]
        fig = go.Figure(data=[
            go.Bar(x=list(model_classes.keys()), y=values)
        ])
        fig.update_layout(
            title=f"Validation {metric.upper()} Comparison for {target}",
            xaxis_title="Model",
            yaxis_title=metric.upper(),
            template="plotly_white"
        )
        fig.show()

        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)

        n_target += 1

In [16]:
for metric in ["MSE", "R^2"]:
    for target in ["video_view_count", "video_like_count"]:
        values = [results_test[model][target][metric] for model in model_classes]
        fig = go.Figure(data=[
            go.Bar(x=list(model_classes.keys()), y=values)
        ])
        fig.update_layout(
            title=f"Test {metric.upper()} Comparison for {target}",
            xaxis_title="Model",
            yaxis_title=metric.upper(),
            template="plotly_white"
        )
        fig.show()

        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)

        n_target += 1

#### Conclusions

Based on the validation results, the best models for predicting the number of likes (`video_like_count`) are the linear models. Ridge and Lasso regression, along with ordinary Linear Regression, achieved $R^2 \approx 0.92$, indicating strong predictive performance, whereas the tree-based models reached only 0.63 to 0.71. For predicting the number of views (`video_view_count`), the linear models showed moderate performance ($R^2 \approx 0.59$) and the tree-based models performed poorly ($R^2$ between 0.13 and 0.19); Lasso regression slightly outperformed the others with an $R^2$ of 0.5878. Therefore, Ridge and Lasso are the most reliable models for predicting likes, while Lasso is the most suitable choice for estimating views during validation.
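
The comparison in this paragraph can be reproduced from the printed metrics. The sketch below copies the validation $R^2$ values from the output of cell `In [13]`, picks the top model per target, and lists every model within a small tolerance of the best; the tolerance `tol = 0.001` is an assumption chosen for illustration.

```python
# Validation R^2 values copied from the output of cell In [13].
val_r2 = {
    "video_view_count": {
        "Linear_Regression": 0.5874, "Ridge": 0.5875, "Lasso": 0.5878,
        "Decision_Tree": 0.1321, "Random_Forest": 0.1857, "XGBoost": 0.1712,
    },
    "video_like_count": {
        "Linear_Regression": 0.9218, "Ridge": 0.9218, "Lasso": 0.9218,
        "Decision_Tree": 0.7086, "Random_Forest": 0.6298, "XGBoost": 0.6598,
    },
}

tol = 0.001  # hypothetical tolerance for treating two scores as equivalent

best = {}
contenders = {}
for target, scores in val_r2.items():
    top = max(scores.values())
    best[target] = max(scores, key=scores.get)  # first model reaching the top score
    contenders[target] = [m for m, s in scores.items() if top - s <= tol]

print(best)
print(contenders)
```

At four decimal places the three linear models are indistinguishable for likes and nearly so for views, which is why validation alone does not separate Ridge from Lasso.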

The training, validation, and test data were split by the videos' publication date. Taking this into account, together with how Ridge and Lasso generalize to the test dataset (Lasso reaches $R^2 = 0.4210$ for views and $0.6078$ for likes, against $0.2632$ and $0.5474$ for Ridge), the best model for predicting both the number of views and likes is Lasso, although XGBoost performed better on the test dataset for likes ($R^2 = 0.8091$). This is because, if we choose to use a regression model to predict views and likes per video, we will actually only be using the training and validation datasets to select it. In this notebook, we use the test dataset only to help us choose between Ridge and Lasso.
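
The split by publication date is performed upstream of this notebook, so its exact rules are not shown here. The following is only a sketch of how such a chronological split could look; the toy frame, the cutoff dates, and the assumption that the oldest videos go to training are all hypothetical.

```python
import pandas as pd

# Toy frame; dates and view counts are made up for illustration.
toy = pd.DataFrame({
    "video_published_at": pd.to_datetime([
        "2025-04-01", "2025-04-05", "2025-04-10",
        "2025-04-15", "2025-04-20", "2025-04-25",
    ]),
    "video_view_count": [100, 200, 300, 400, 500, 600],
})

val_start = pd.Timestamp("2025-04-12")   # hypothetical cutoff
test_start = pd.Timestamp("2025-04-18")  # hypothetical cutoff

published = toy["video_published_at"]
toy_train = toy[published < val_start]
toy_val = toy[(published >= val_start) & (published < test_start)]
toy_test = toy[published >= test_start]

print(len(toy_train), len(toy_val), len(toy_test))
```

With a split of this kind, each later split contains videos published after those in the earlier ones, so differences between validation and test scores may partly reflect changes in the data over time rather than model quality alone.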