## Goal
The aim of this project is to develop a predictive model that estimates performance metrics—such as likes, dislikes, views, and other engagement measures—for my YouTube content based solely on the video title. My channel is still in its early stages, with fewer than 100 YouTube Shorts published over the past year. I do not expect high accuracy or perfect results at this stage; rather, my objective is to build a simple model and chat-based interface where I can input a potential video title and receive an estimated performance forecast. Over time, as my channel grows and I update the dataset, I anticipate that the model’s predictions will become increasingly accurate.

### Cleaning and Importing the Dataset

In [1]:
#libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings

from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [2]:
#This dataset was pulled from my YouTube channel analytics page
df = pd.read_csv('Table data.csv')

In [3]:
#I have videos on my channel that are not YouTube shorts and I do not want to include them in my analysis
#Therefore, they are outliers and need to be removed
df = df.loc[df['Duration'] < 100]

In [4]:
#These were columns that either had only NA values or irrelevant data
df.drop(['Content','Regular viewers',
       'Casual viewers', 'Average views per viewer', 'New viewers',
       'Unique viewers'], axis=1, inplace=True)

In [5]:
#converting the Average view duration column from str to int to be usable in the model
sec = []
for i in range(len(df['Average view duration'])):
    x = df['Average view duration'].iloc[i][-2:]
    x = int(x)
    sec.append(x)
    
df['Average Seconds Viewed'] = sec

In [6]:
df['Date_Column'] = pd.to_datetime(df['Video publish time'])

In [7]:
df['DayOfWeek_Num'] = df['Date_Column'].dt.dayofweek

In [8]:
df.drop(['Video publish time', 'Average view duration', 'Date_Column'], axis=1, inplace=True)

In [9]:
df

Unnamed: 0,Video title,Duration,Likes,Dislikes,Shares,YouTube Premium views,Stayed to watch (%),Engaged views,Views,Watch time (hours),Subscribers,Impressions,Impressions click-through rate (%),Average Seconds Viewed,DayOfWeek_Num
1,Pointallism Portrait of Lady Gaga #art #ladygaga,60.0,144,6,2,239,47.48,2766,2775,19.2306,7,728,4.12,25,5
2,Day 8 of painting a portrait every day till Ch...,59.0,74,3,0,128,49.84,1248,1248,10.9447,5,440,3.86,31,5
3,Pointallism Portrait of Benson Boone #art #por...,58.0,62,2,8,247,44.78,997,2070,7.7034,2,330,3.33,26,0
4,Pointallism Portrait of Billie Eilish #art #bi...,60.0,56,1,0,70,39.61,815,818,6.2381,1,409,1.96,27,1
5,Pointallism Portrait of Lady Gaga #art #portrait,56.0,36,2,1,81,39.09,773,779,5.6763,4,319,1.57,26,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,Chappell Roan on the Grammys Red Carpet Painti...,74.0,3,0,0,6,12.52,65,66,0.2255,0,114,1.75,12,1
66,Painting Challenge Part 2 #paintingchallenge #...,77.0,3,0,0,5,15.20,38,38,0.1914,0,80,0.00,18,2
68,Bubble Wrap Painting #artist #asmr,61.0,1,0,0,2,100.00,5,5,0.0246,0,110,3.64,17,5
69,How I start my paintings over #art #aprilfools,60.0,3,0,0,0,75.00,4,5,0.0212,1,56,5.36,18,1


### Creating the Model

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

#vectorize the titles to use as our imput value for our model
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(df["Video title"])

In [11]:
#target values
y = df[[
    'Duration', 'Likes', 'Dislikes', 'Shares',
       'YouTube Premium views', 'Stayed to watch (%)', 'Engaged views',
       'Views', 'Watch time (hours)', 'Subscribers', 'Impressions',
       'Impressions click-through rate (%)', 'Average Seconds Viewed',
       'DayOfWeek_Num'
]]

In [12]:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [13]:
#funcation 
def predict_metrics(title):
    x = vectorizer.transform([title])
    prediction = model.predict(x)
    return dict(zip(y.columns, prediction[0]))

predict_metrics("Lady Gaga")

{'Duration': 47.597142857142856,
 'Likes': 38.64583333333333,
 'Dislikes': 2.685,
 'Shares': 0.64,
 'YouTube Premium views': 48.06,
 'Stayed to watch (%)': 33.92959047619048,
 'Engaged views': 828.76,
 'Views': 694.6080000000001,
 'Watch time (hours)': 3.8081216261904784,
 'Subscribers': 3.35,
 'Impressions': 166.13766666666666,
 'Impressions click-through rate (%)': 2.5309000000000013,
 'Average Seconds Viewed': 17.29666666666667,
 'DayOfWeek_Num': 2.1069999999999998}

### Fine Tuning the Model

In [14]:
from sklearn.model_selection import GridSearchCV

Define the parameter grid for the underlying RandomForestRegressor
param_grid = {
    'estimator__n_estimators': [10, 20, 50, 75, 100, 150, 200],  # for RandomForestRegressor's n_estimators
    'estimator__max_depth': [10, 15, 20, 25, 30, 35, None],     # for RandomForestRegressor's max_depth
    'n_jobs': [-1] # For MultiOutputRegressor's n_jobs to use all available cores
}

Create the GridSearchCV object
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)

Fit the model
grid_search.fit(X_train, y_train) #  Replace with your training data

In [15]:
print("Best parameters:", grid_search.best_params_)
print("Best negative mean squared error:", grid_search.best_score_)

Best parameters: {'estimator__max_depth': 20, 'estimator__n_estimators': 50, 'n_jobs': -1}
Best negative mean squared error: -0.5716103978524455


In [16]:
model2 = MultiOutputRegressor(RandomForestRegressor(n_estimators=200, max_depth=30))
model2.fit(X_train, y_train)

y_pred2 = model2.predict(X_test)

def predict_metrics2(title):
    x = vectorizer.transform([title])
    prediction = model2.predict(x)
    return dict(zip(y.columns, prediction[0]))

predict_metrics2("Lady Gaga")

{'Duration': 47.84207142857144,
 'Likes': 33.160833333333336,
 'Dislikes': 2.7201666666666666,
 'Shares': 0.623,
 'YouTube Premium views': 43.53869047619047,
 'Stayed to watch (%)': 32.59073333333334,
 'Engaged views': 844.3991666666666,
 'Views': 688.7257777777778,
 'Watch time (hours)': 5.20941594761905,
 'Subscribers': 3.51,
 'Impressions': 149.54673809523808,
 'Impressions click-through rate (%)': 2.506300000000002,
 'Average Seconds Viewed': 17.690833333333334,
 'DayOfWeek_Num': 1.970625}

In [17]:
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R2) Score: {r2:.4f}")

Mean Squared Error (MSE): 44360.6726
R-squared (R2) Score: -0.8609


In [18]:
mse = mean_squared_error(y_test, y_pred2)
r2 = r2_score(y_test, y_pred2)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R2) Score: {r2:.4f}")

Mean Squared Error (MSE): 44518.5438
R-squared (R2) Score: -0.8342


The accuracy is not optimal for either version of the model, so the next step is to improve performance by adjusting the vectorizer.

In [22]:
vectorizer = TfidfVectorizer(max_features=1000)

X = vectorizer.fit_transform(df["Video title"])
y = df[[
    'Duration', 'Likes', 'Dislikes', 'Shares',
       'YouTube Premium views', 'Stayed to watch (%)', 'Engaged views',
       'Views', 'Watch time (hours)', 'Subscribers', 'Impressions',
       'Impressions click-through rate (%)', 'Average Seconds Viewed',
       'DayOfWeek_Num'
]]

In [23]:
model3 = MultiOutputRegressor(RandomForestRegressor(n_estimators=200, max_depth=30))
model3.fit(X_train, y_train)

y_pred3 = model2.predict(X_test)

def predict_metrics2(title):
    x = vectorizer.transform([title])
    prediction = model2.predict(x)
    return dict(zip(y.columns, prediction[0]))

predict_metrics2("Lady Gaga")

{'Duration': 47.84207142857144,
 'Likes': 33.160833333333336,
 'Dislikes': 2.7201666666666666,
 'Shares': 0.623,
 'YouTube Premium views': 43.53869047619047,
 'Stayed to watch (%)': 32.59073333333334,
 'Engaged views': 844.3991666666666,
 'Views': 688.7257777777778,
 'Watch time (hours)': 5.20941594761905,
 'Subscribers': 3.51,
 'Impressions': 149.54673809523808,
 'Impressions click-through rate (%)': 2.506300000000002,
 'Average Seconds Viewed': 17.690833333333334,
 'DayOfWeek_Num': 1.970625}

In [24]:
mse = mean_squared_error(y_test, y_pred3)
r2 = r2_score(y_test, y_pred3)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R2) Score: {r2:.4f}")

Mean Squared Error (MSE): 44518.5438
R-squared (R2) Score: -0.8342


## Simple Chatbot

In [29]:
#the chatbot takes the user_input (the video title) and run it through the model 
#The predicted results will be returned until the user enters exit

def simple_chatbot(user_input):
    user_input = user_input.lower()
    return predict_metrics2(user_input)

while True:
    user_message = input("You: ")
    if user_message.lower() == "exit":
        break
    bot_response = simple_chatbot(user_message)
    print(f"Bot: {bot_response}")

You: Lady Gaga
Bot: {'Duration': 47.84207142857144, 'Likes': 33.160833333333336, 'Dislikes': 2.7201666666666666, 'Shares': 0.623, 'YouTube Premium views': 43.53869047619047, 'Stayed to watch (%)': 32.59073333333334, 'Engaged views': 844.3991666666666, 'Views': 688.7257777777778, 'Watch time (hours)': 5.20941594761905, 'Subscribers': 3.51, 'Impressions': 149.54673809523808, 'Impressions click-through rate (%)': 2.506300000000002, 'Average Seconds Viewed': 17.690833333333334, 'DayOfWeek_Num': 1.970625}
You: #art
Bot: {'Duration': 52.78672619047618, 'Likes': 15.334166666666668, 'Dislikes': 1.5717499999999998, 'Shares': 0.28125, 'YouTube Premium views': 78.68912499999999, 'Stayed to watch (%)': 35.68410813492064, 'Engaged views': 535.1745476190476, 'Views': 882.4942777777777, 'Watch time (hours)': 3.107991453571426, 'Subscribers': 0.24333333333333332, 'Impressions': 162.4800119047619, 'Impressions click-through rate (%)': 2.001091666666667, 'Average Seconds Viewed': 20.308611111111112, 'Da