<h1 style="text-align:center;"><font color='red' size=10><b> Before you start! </b></font></h1>

This project is divided into 3 parts:

1) ETL, where we will Extract, Transform and Load data into an AWS RDS.

https://colab.research.google.com/drive/1w9bPc49joLrceMWAF_RlEgXo73s1Eeee?usp=sharing

<br><br>

2) Data Analysis: exploratory data analysis to identify key features.

https://colab.research.google.com/drive/1a_Etj5kwEaq5epwoV9TVPNS3ShRRd7wU?usp=sharing

<br><br>

**3) Prediction models: model building and comparison. <font color='red'>-> you are here</font>**

https://colab.research.google.com/drive/1Nbj6TM5HaK2krMRa9R1oiolDhDronYQB?usp=sharing

<br><br>

Summary of this project: https://colab.research.google.com/drive/1CUjP7SdFGldPjuSVSIbHAk9UPWYp_RYz?usp=sharing

<br><br>

A summary of the data can be visualized on this <font color='red'>**Power BI dashboard:**</font> https://app.powerbi.com/view?r=eyJrIjoiNTkzZjNmY2UtNmQ5Mi00MTJhLTliNzgtZGU2NzRlYzQ5NDA1IiwidCI6IjE2OGQ0MTM3LWQ2ZjYtNDVmOC1hYWE3LWQxYTcwMjMzMDk1ZSIsImMiOjR9&pageName=ReportSection4f69a4c8629ea033a165

# **Analyzing the YouTube channels I am subscribed to, using the YouTube API**

I love watching YouTube. It's a very diverse streaming service (even before that word was cool), with videos ranging from comedy to science and curiosities, short or long, from channels with millions of subscribers to small communities with just a few hundred.

But content creators suffer from "punitive algorithms" (in their words) that raise or lower their visibility based on a plethora of metrics, which are more often than not unknown to them. From YouTube's perspective this is understandable, since many viewers are migrating to other platforms (TikTok, Twitch and an ever-growing number of streaming services). YouTube needs to maximize the number of viewers and the time spent on each video to justify that it is a good platform for placing ads.

With this in mind, I decided to use the YouTube API to collect some data and try to predict a video's view count at the time it is published. Several models were built using video title features, thumbnail color, the day and hour of publication, and so on. If successful, such a model could be used, for example, to fine-tune ad placement based on the expected return. If unsuccessful, it can at least be improved in future iterations.

# **0) Import libraries**

In [1]:
# !pip install psycopg2
import psycopg2 as ps
import pandas.io.sql as sqlio
import pandas as pd
import numpy as np

# Datetime libraries
from datetime import datetime
from pytz import country_timezones


# Scikit learn libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn import tree
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor

# **1) Import data from AWS**

## 1.1) Define connection method

In [2]:
def connect_to_db(host_name, database, port, username, password):
    """ Connects to a database and returns the connection.
    host_name = host of the database (typically an AWS RDS endpoint or IP)
    database = the name of the database to connect to
    port = the port that gives access to the db
    username = username of the db
    password = password of the db
    """
    try:
        conn = ps.connect(host=host_name, database=database, user=username, password=password, port=port)
    except ps.OperationalError as e:
        raise e
    else:
        print("Connected!")
    return conn

## 1.2) Import data

In [3]:
host_name = "datascience-youtube.cuxnfdexw55l.us-east-1.rds.amazonaws.com"
database = "postgres"
port = "5432"
username = "main_read_only" # read only user of the DB
password = "read_only_123"
conn = None

conn = connect_to_db(host_name, database, port, username, password)

Connected!


In [4]:
df_videos = pd.read_sql_query("SELECT * FROM videos", con=conn)
df_videos.head()

Unnamed: 0,video_id,video_title,view_count,like_count,comment_count,channel_id,thumbnail_url,upload_date_time,n_words_title,n_question_marks_title,n_exclamation_marks_title,n_ellipsis_title,thumb_red,thumb_green,thumb_blue,video_duration
0,XxXKz9GtACg,LFA - Aula 13 - Dia 1/04/2022,350,11,3,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/XxXKz9GtACg/hqdefault.jpg,2022-04-01 18:58:04+00:00,7,0,0,0,57,58,57,6170
1,g63dqkcDOG4,Mergesort (corretude e tempo),823,37,1,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/g63dqkcDOG4/hqdefault.jpg,2022-02-28 13:35:14+00:00,4,0,0,0,32,27,25,1424
2,pfCjJqTKvFY,Insertion Sort,1190,48,1,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/pfCjJqTKvFY/hqdefault.jpg,2022-02-18 18:19:10+00:00,2,0,0,0,33,27,25,3204
3,xRkq-wMRHqk,Tempo de execução com notação assintótica,1225,50,0,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/xRkq-wMRHqk/hqdefault.jpg,2022-02-08 19:02:58+00:00,6,0,0,0,20,22,21,1366
4,JlHfhLQ2qrc,Tempo de execução (análise de casos),1285,66,2,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/JlHfhLQ2qrc/hqdefault.jpg,2022-02-08 14:30:03+00:00,6,0,0,0,24,24,21,729


In [5]:
df_channels = pd.read_sql_query("SELECT * FROM channels", con=conn)
df_channels.head()

Unnamed: 0,channel_id,channel_title,channel_country,subscriber_count,video_count,topic_categories,total_view_count
0,UCHnyfMqiRRG1u-2MsSQLbXA,Veritasium,US,13100000,343,"{https://en.wikipedia.org/wiki/Knowledge,https...",1891837444
1,UCYO_jab_esuFRV4b17AJtAw,3Blue1Brown,US,4840000,127,{https://en.wikipedia.org/wiki/Knowledge},311356579
2,UCJLtfES4K2aYCLFz-i5teFA,Tá Gravando,BR,1600000,321,"{https://en.wikipedia.org/wiki/Entertainment,h...",257307314
3,UCQxWq7wL4HY40mqbr3f0Z2A,UMotivo,BR,103000,2528,{https://en.wikipedia.org/wiki/Strategy_video_...,27752172
4,UCBFLqK7PAP9DQ3JpIrWFI7w,Pipocando,BR,4320000,1343,"{https://en.wikipedia.org/wiki/Entertainment,h...",689510386


In [6]:
youtube_df = df_videos.merge(df_channels, on='channel_id', how='left')
youtube_df

Unnamed: 0,video_id,video_title,view_count,like_count,comment_count,channel_id,thumbnail_url,upload_date_time,n_words_title,n_question_marks_title,...,thumb_red,thumb_green,thumb_blue,video_duration,channel_title,channel_country,subscriber_count,video_count,topic_categories,total_view_count
0,XxXKz9GtACg,LFA - Aula 13 - Dia 1/04/2022,350,11,3,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/XxXKz9GtACg/hqdefault.jpg,2022-04-01 18:58:04+00:00,7,0,...,57,58,57,6170,Carla Quem Disse,BR,2950,231,{https://en.wikipedia.org/wiki/Knowledge},273933
1,g63dqkcDOG4,Mergesort (corretude e tempo),823,37,1,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/g63dqkcDOG4/hqdefault.jpg,2022-02-28 13:35:14+00:00,4,0,...,32,27,25,1424,Carla Quem Disse,BR,2950,231,{https://en.wikipedia.org/wiki/Knowledge},273933
2,pfCjJqTKvFY,Insertion Sort,1190,48,1,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/pfCjJqTKvFY/hqdefault.jpg,2022-02-18 18:19:10+00:00,2,0,...,33,27,25,3204,Carla Quem Disse,BR,2950,231,{https://en.wikipedia.org/wiki/Knowledge},273933
3,xRkq-wMRHqk,Tempo de execução com notação assintótica,1225,50,0,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/xRkq-wMRHqk/hqdefault.jpg,2022-02-08 19:02:58+00:00,6,0,...,20,22,21,1366,Carla Quem Disse,BR,2950,231,{https://en.wikipedia.org/wiki/Knowledge},273933
4,JlHfhLQ2qrc,Tempo de execução (análise de casos),1285,66,2,UCYJv-TfmSU0xUuKI7N0zkJw,https://i.ytimg.com/vi/JlHfhLQ2qrc/hqdefault.jpg,2022-02-08 14:30:03+00:00,6,0,...,24,24,21,729,Carla Quem Disse,BR,2950,231,{https://en.wikipedia.org/wiki/Knowledge},273933
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2852,uo8-4cC87D0,"FOI MAL, TAVA DOIDÃO! 🥴",848421,117096,1228,UCm2CE2YfpmobBmF8ARLPzAw,https://i.ytimg.com/vi/uo8-4cC87D0/hqdefault.jpg,2020-02-11 13:00:15+00:00,4,0,...,111,104,109,505,Caue Moura,BR,5200000,875,{https://en.wikipedia.org/wiki/Lifestyle_(soci...,726810307
2853,JMhTeemmnYs,OS MELHORES XINGAMENTOS CONTRA MIM,720839,105143,1846,UCm2CE2YfpmobBmF8ARLPzAw,https://i.ytimg.com/vi/JMhTeemmnYs/hqdefault.jpg,2020-01-11 19:57:30+00:00,5,0,...,110,102,127,444,Caue Moura,BR,5200000,875,{https://en.wikipedia.org/wiki/Lifestyle_(soci...,726810307
2854,YtscNYNOz6k,AZORIUS DRAW GO → Partidas ÉPICAS e INESQUECÍV...,15396,1071,85,UCQxWq7wL4HY40mqbr3f0Z2A,https://i.ytimg.com/vi/YtscNYNOz6k/hqdefault.jpg,2019-11-07 15:03:08+00:00,9,0,...,106,115,124,3680,UMotivo,BR,103000,2528,{https://en.wikipedia.org/wiki/Strategy_video_...,27752172
2855,5mClePLmQOc,Uma boa opção para o formato ARTESÃO! (Magic A...,5982,512,73,UCQxWq7wL4HY40mqbr3f0Z2A,https://i.ytimg.com/vi/5mClePLmQOc/hqdefault.jpg,2019-11-05 11:19:43+00:00,9,0,...,77,82,81,2498,UMotivo,BR,103000,2528,{https://en.wikipedia.org/wiki/Strategy_video_...,27752172


## 1.3) Simple transformations on data
Get the name of the day and the local hour it was published (considering the timezone).

### 1.3.1) Get name of the day

In [7]:
youtube_df['upload_day_name'] = youtube_df['upload_date_time'].dt.day_name()
youtube_df[['video_title','upload_date_time','upload_day_name']].head()

Unnamed: 0,video_title,upload_date_time,upload_day_name
0,LFA - Aula 13 - Dia 1/04/2022,2022-04-01 18:58:04+00:00,Friday
1,Mergesort (corretude e tempo),2022-02-28 13:35:14+00:00,Monday
2,Insertion Sort,2022-02-18 18:19:10+00:00,Friday
3,Tempo de execução com notação assintótica,2022-02-08 19:02:58+00:00,Tuesday
4,Tempo de execução (análise de casos),2022-02-08 14:30:03+00:00,Tuesday


### 1.3.2) Get hour
We need to pay close attention when dealing with time. Some channels are located in Brazil, while others are based in the US. The YouTube API returns upload times in UTC, so we need to convert them to the local upload hour: for a video from a Brazilian channel, that is the upload hour in Brazil (we will simplify and assume that Brazil has a single timezone). Similarly, for a video from a US channel, it is the upload hour in the US (again, assuming a single timezone).


In [8]:
# Group by country, convert the timezone to channel's country time, convert to string removing the timezone suffix and convert back to pandas datetime
youtube_df['upload_hour_local'] = (pd.to_datetime(\
                                                youtube_df.groupby('channel_country')['upload_date_time']\
                                                .transform(lambda x: x.dt.tz_convert(country_timezones(x.name)[0]))\
                                                .astype(str).str[:-6]))
youtube_df['upload_hour_local'] = youtube_df['upload_hour_local'].dt.strftime('%H').add(':00')
youtube_df[['video_title','upload_date_time','upload_hour_local']]

Unnamed: 0,video_title,upload_date_time,upload_hour_local
0,LFA - Aula 13 - Dia 1/04/2022,2022-04-01 18:58:04+00:00,16:00
1,Mergesort (corretude e tempo),2022-02-28 13:35:14+00:00,11:00
2,Insertion Sort,2022-02-18 18:19:10+00:00,16:00
3,Tempo de execução com notação assintótica,2022-02-08 19:02:58+00:00,17:00
4,Tempo de execução (análise de casos),2022-02-08 14:30:03+00:00,12:00
...,...,...,...
2852,"FOI MAL, TAVA DOIDÃO! 🥴",2020-02-11 13:00:15+00:00,11:00
2853,OS MELHORES XINGAMENTOS CONTRA MIM,2020-01-11 19:57:30+00:00,17:00
2854,AZORIUS DRAW GO → Partidas ÉPICAS e INESQUECÍV...,2019-11-07 15:03:08+00:00,13:00
2855,Uma boa opção para o formato ARTESÃO! (Magic A...,2019-11-05 11:19:43+00:00,09:00
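The sample output above shows a two-hour offset for the Brazilian videos (18:58 UTC becomes 16:00), which is what one would expect if the first zone listed for `'BR'` is a UTC-2 zone rather than Brasília time (UTC-3). A minimal sketch of an alternative, assuming we pick one representative timezone per country by hand. The `COUNTRY_TZ` mapping, the `local_upload_hour` helper and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Assumption: one hand-picked representative timezone per country code.
COUNTRY_TZ = {"BR": "America/Sao_Paulo", "US": "America/New_York"}

def local_upload_hour(df, tz_by_country=COUNTRY_TZ):
    """Return the local upload hour ('HH:00') for each row of df."""
    hours = pd.Series(index=df.index, dtype=object)
    for country, group in df.groupby("channel_country"):
        # Convert the UTC timestamps of this country to its chosen timezone.
        local = group["upload_date_time"].dt.tz_convert(tz_by_country[country])
        hours.loc[group.index] = local.dt.strftime("%H:00")
    return hours

# Toy data: one Brazilian and one US upload, both at 18:58 UTC.
toy = pd.DataFrame({
    "channel_country": ["BR", "US"],
    "upload_date_time": pd.to_datetime(
        ["2022-04-01 18:58:04+00:00", "2022-07-01 18:58:04+00:00"], utc=True
    ),
})
toy["upload_hour_local"] = local_upload_hour(toy)
print(toy)
```

With an explicit mapping, the choice of timezone is visible in the code instead of depending on the order in which a library lists the zones of a country.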


# **2) Create a prediction model of view count**
We will create and compare different models:


*   A dummy model (median of view count by channel)
*   Baseline model: Decision Tree Regressor
*   Challenger model: Histogram Gradient Boosting Regressor
*   Challenger model: Random Forest



## 2.0) Will the model output y or f(y)?
If the model outputs f(y), you should set f(x) and f_inv(x) to the desired values.

If the model outputs y, you should set f(x) = x and f_inv(x) = x.

In [9]:
# def f(x):
#     return x

# def f_inv(x):
#     return x

f = np.log
f_inv = np.exp
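Whatever pair is chosen, `f_inv` should undo `f`. A minimal sketch of that sanity check with a few illustrative view counts. The `np.log1p` / `np.expm1` pair is an assumption, shown only as a possible alternative in case some video has zero views, where `np.log` would return `-inf`.

```python
import numpy as np

views = np.array([160.0, 8137.0, 15180905.0])  # illustrative view counts

# The pair used in this notebook: log to transform, exp to go back.
f, f_inv = np.log, np.exp
round_trip = f_inv(f(views))
print(np.allclose(round_trip, views))

# Alternative pair that is defined at zero: log1p(0) == 0, whereas log(0) is -inf.
f_safe, f_safe_inv = np.log1p, np.expm1
zero_round_trip = f_safe_inv(f_safe(np.array([0.0, 160.0])))
print(zero_round_trip)
```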

## 2.1) Separate train and test data

In [10]:
youtube_X = youtube_df.drop('view_count', axis=1)
youtube_y = youtube_df['view_count']

youtube_y_log = f(youtube_y)

In [11]:
yt_X_train, yt_X_test, yt_y_train, yt_y_test = train_test_split(youtube_X, youtube_y_log, random_state=42) # fixed seed for reproducibility
yt_y_train_exp = f_inv(yt_y_train)
yt_y_test_exp = f_inv(yt_y_test)

## 2.2) Dummy model

Assumptions:


*   View count depends on the channel the video is published on
*   For each channel, the predicted view count is simply the **median** view count of that channel's training videos



### 2.2.1) "Train" model

In [12]:
dummy_model_df = yt_X_train[['channel_title']].join(yt_y_train).groupby('channel_title').median()
dummy_model_df.style.format(thousands=',', precision=0)

Unnamed: 0_level_0,view_count
channel_title,Unnamed: 1_level_1
3Blue1Brown,14
Carla Quem Disse,7
Caue Moura,13
Pipocando,13
"Sao Paulo Nas Alturas , por Raul Juste Lores",10
SmarterEveryDay,15
StrataScratch,8
Tá Gravando,14
UMotivo,9
Veritasium,16


### 2.2.2) Make predictions

In [13]:
def dummy_predictor_view_count(video, model):
    channel_title = video['channel_title']
    
    predicted_views = model.loc[channel_title]['view_count']

    return predicted_views
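Note that the lookup above raises a `KeyError` if a video belongs to a channel that does not appear in the training data. A minimal vectorized sketch with a fallback to the global median. The `dummy_predict` helper and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def dummy_predict(X, X_train, y_train):
    """Predict the per-channel median of y_train, falling back to the global median."""
    medians = y_train.groupby(X_train["channel_title"]).median()
    fallback = y_train.median()
    # map() returns NaN for channels unseen during training; fill those with the fallback.
    return X["channel_title"].map(medians).fillna(fallback)

# Toy data: two known channels and one unseen channel ("C").
X_train_toy = pd.DataFrame({"channel_title": ["A", "A", "A", "B", "B"]})
y_train_toy = pd.Series([1.0, 2.0, 9.0, 4.0, 6.0])
X_new_toy = pd.DataFrame({"channel_title": ["A", "B", "C"]})

preds_toy = dummy_predict(X_new_toy, X_train_toy, y_train_toy)
print(preds_toy.tolist())
```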

In [14]:
def predictions_pretty_print_df(predictions, f_inv):
    """ Creates a pandas df with video title, channel title, 
    ground truth view count, predicted view count and error.
    
    If the predictions are logs of the real value, it will take the exponential
    of the prediction to output the real value without log.

    Returns a pandas df.

    """
    predictions_df = pd.DataFrame(columns=['video_title', 'channel_title', 'view_count', 'view_count_pred'])
    predictions_df[['video_title','channel_title']] = yt_X_test[['video_title','channel_title']]

    predictions_df['view_count'] = f_inv(yt_y_test)
    predictions_df['view_count_pred'] = f_inv(predictions)
    predictions_df['error'] = predictions_df['view_count_pred'] - f_inv(yt_y_test)

    return predictions_df

predictions_dummy = yt_X_test.apply(lambda x: dummy_predictor_view_count(x, dummy_model_df), axis=1) # make predictions
predictions_dummy_pretty = predictions_pretty_print_df(predictions_dummy, f_inv)
predictions_dummy_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,2833793.0,-12347110.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,11508.5,3371.5
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,11508.5,-4541.5
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,673.8932,513.8932
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,7747475.0,5889392.0


### 2.2.3) Prediction metrics
We use the coefficient of determination R^2, which the scikit-learn documentation defines as follows: "The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0."

Note that the score is computed on the transformed target f(y), i.e., on the log of the view count.

In [15]:
rss = ((yt_y_test - predictions_dummy)**2).sum() # residual sum of squares
tss = ((yt_y_test - yt_y_test.mean())**2).sum() # total sum of squares

model_score_dummy = (1-rss/tss)

In [16]:
print(model_score_dummy)

0.8986743311526758
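The manual computation above should agree with `sklearn.metrics.r2_score`. A minimal sketch of that check on made-up numbers (the arrays below are illustrative, not project data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

rss = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
tss = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
manual_r2 = 1 - rss / tss

sklearn_r2 = r2_score(y_true, y_pred)
print(manual_r2, sklearn_r2)
```

Using the same definition for the dummy model and for `model.score` keeps the scores of all models comparable.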


## 2.3) Baseline model

We will use a Decision Tree Regressor.

In this model we will **not** use the information of which channel the video was published on. This model relies simply on video title features, thumbnail color, video duration and upload day and hour.

### 2.3.1) Let's select the features that we can actually use
Note that like_count and comment_count, for example, cannot be used because that would be data leakage. These features are highly correlated with view count because likes and comments only come **after** a video is watched, so they are not available at publication time.


In [17]:
X_train_baseline1 = yt_X_train[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                                            'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                                            'video_duration', 'upload_day_name', 'upload_hour_local']]

X_test_baseline1 = yt_X_test[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                                            'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                                            'video_duration', 'upload_day_name', 'upload_hour_local']]
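One way to make the leakage rule explicit is to keep the feature list in a single place and reject leaky columns before selecting. A minimal sketch, where `LEAKY_COLUMNS`, `BASELINE1_FEATURES`, `select_features` and the toy row are hypothetical names introduced only for illustration:

```python
import pandas as pd

# Columns that are only known after a video has been watched.
LEAKY_COLUMNS = {"like_count", "comment_count"}

BASELINE1_FEATURES = [
    "n_words_title", "n_question_marks_title", "n_exclamation_marks_title",
    "n_ellipsis_title", "thumb_red", "thumb_green", "thumb_blue",
    "video_duration", "upload_day_name", "upload_hour_local",
]

def select_features(X, features, leaky=LEAKY_COLUMNS):
    """Return X[features], refusing any feature that would leak the target."""
    overlap = leaky.intersection(features)
    if overlap:
        raise ValueError(f"Leaky columns requested as features: {sorted(overlap)}")
    return X[features]

# Toy row with every column present, including the leaky ones.
toy_X = pd.DataFrame([{**{c: 0 for c in BASELINE1_FEATURES}, "like_count": 11, "comment_count": 3}])
selected = select_features(toy_X, BASELINE1_FEATURES)
print(list(selected.columns))
```

The same helper could then be reused for the train and test sets so that both always share one column list.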

### 2.3.2) Build the model pipeline

In [18]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns_baseline1 = numerical_columns_selector(X_train_baseline1)
categorical_columns_baseline1 = categorical_columns_selector(X_train_baseline1)

In [19]:
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor_baseline1 = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns_baseline1),
    ('standard_scaler', numerical_preprocessor, numerical_columns_baseline1)])

In [20]:
model_baseline1 = make_pipeline(preprocessor_baseline1, DecisionTreeRegressor(random_state=0))
model_baseline1

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['upload_day_name',
                                                   'upload_hour_local']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['n_words_title',
                                                   'n_question_marks_title',
                                                   'n_exclamation_marks_title',
                                                   'n_ellipsis_title',
                                                   'thumb_red', 'thumb_green',
                                                   'thumb_blue',
                                                   'video_duration'])])),
        

### 2.3.3) Train model

In [21]:
_ = model_baseline1.fit(X_train_baseline1, yt_y_train)

### 2.3.4) Make predictions

In [22]:
predictions_baseline1 = model_baseline1.predict(X_test_baseline1) # make predictions
predictions_baseline1_pretty = predictions_pretty_print_df(predictions_baseline1, f_inv)
predictions_baseline1_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,3357319.0,-11823586.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,11522.0,3385.0
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,17638.0,1588.0
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,147.0,-13.0
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,7246.0,-1850837.0


### 2.3.5) Prediction metrics

In [23]:
model_score_baseline1 = model_baseline1.score(X_test_baseline1, yt_y_test)
model_score_baseline1

0.130234156464053

With an R^2 of about 0.13, this is clearly a bad model that we don't want to use. Let's try to improve it a little bit.

## 2.4) Baseline model 2

We will use a Decision Tree Regressor once again, but this time we will use the information of which channel the video was published on.

### 2.4.1) Let's select the features that we can actually use
Note that like_count and comment_count, for example, cannot be used because that would be data leakage. These features are highly correlated with view count because likes and comments only come **after** a video is watched, so they are not available at publication time.

In [24]:
X_train_baseline2 = yt_X_train[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

X_test_baseline2 = yt_X_test[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

### 2.4.2) Build the model pipeline

In [25]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns_baseline2 = numerical_columns_selector(X_train_baseline2)
categorical_columns_baseline2 = categorical_columns_selector(X_train_baseline2)

In [26]:
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor_baseline2 = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns_baseline2),
    ('standard_scaler', numerical_preprocessor, numerical_columns_baseline2)])

In [27]:
model_baseline2 = make_pipeline(preprocessor_baseline2, DecisionTreeRegressor(random_state=0))
model_baseline2

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['upload_day_name',
                                                   'upload_hour_local',
                                                   'channel_title']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['n_words_title',
                                                   'n_question_marks_title',
                                                   'n_exclamation_marks_title',
                                                   'n_ellipsis_title',
                                                   'thumb_red', 'thumb_green',
                                                   'thumb_blue',
              

### 2.4.3) Train model

In [28]:
_ = model_baseline2.fit(X_train_baseline2, yt_y_train)

### 2.4.4) Make predictions

In [29]:
predictions_baseline2 = model_baseline2.predict(X_test_baseline2) # make predictions
predictions_baseline2_pretty = predictions_pretty_print_df(predictions_baseline2, f_inv)
predictions_baseline2_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,13478418.0,-1702487.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,20214.0,12077.0
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,6022.0,-10028.0
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,147.0,-13.0
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,21720570.0,19862487.0


### 2.4.5) Prediction metrics

In [30]:
model_score_baseline2 = model_baseline2.score(X_test_baseline2, yt_y_test)
model_score_baseline2

0.8767191632447643

This is a much better model (R^2 of about 0.88), but it is still worse than the dummy model (about 0.90), so we clearly don't want to use it either.

Now we will try to use another model, a little bit more sophisticated.

## 2.5) Challenger model 1

So far we have only used Decision Tree Regressors. Now we will use a Histogram Gradient Boosting Regressor.

### 2.5.1) Let's select the features that we can actually use

In [31]:
X_train_challenger1 = yt_X_train[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

X_test_challenger1 = yt_X_test[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

### 2.5.2) Build the model pipeline

In [32]:
categorical_columns_selector = selector(dtype_include=object)

categorical_columns_challenger1 = categorical_columns_selector(X_train_challenger1)

In [33]:
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)

preprocessor_challenger1 = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns_challenger1)],
    remainder="passthrough")
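To illustrate what this encoder configuration does, here is a minimal sketch on a few channel names. The unseen channel name is made up for the example. Categories seen during `fit` get integer codes in sorted order, and any category unseen at `transform` time is mapped to `unknown_value`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on three known channels; codes follow sorted order:
# Pipocando -> 0, UMotivo -> 1, Veritasium -> 2.
encoder.fit(np.array([["Veritasium"], ["UMotivo"], ["Pipocando"]], dtype=object))

# "Some New Channel" was not seen during fit, so it is encoded as -1.
codes = encoder.transform(np.array([["UMotivo"], ["Some New Channel"]], dtype=object))
print(codes.ravel().tolist())
```

Unlike one-hot encoding, this keeps one column per categorical feature, which tree-based models can split on directly.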

In [34]:
model_challenger1 = make_pipeline(preprocessor_challenger1, HistGradientBoostingRegressor(loss='squared_error', max_iter=50, min_samples_leaf=30))
model_challenger1

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('categorical',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['upload_day_name',
                                                   'upload_hour_local',
                                                   'channel_title'])])),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(max_iter=50,
                                               min_samples_leaf=30))])

### 2.5.3) Train model

In [35]:
_ = model_challenger1.fit(X_train_challenger1, yt_y_train)

### 2.5.4) Make predictions

In [36]:
predictions_challenger1 = model_challenger1.predict(X_test_challenger1) # make predictions
predictions_challenger1_pretty = predictions_pretty_print_df(predictions_challenger1, f_inv)
predictions_challenger1_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,5120980.0,-10059930.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,13791.6,5654.598
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,14225.0,-1825.002
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,491.0519,331.0519
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,3132567.0,1274484.0


### 2.5.5) Prediction metrics


In [37]:
model_score_challenger1 = model_challenger1.score(X_test_challenger1, yt_y_test)
model_score_challenger1

0.9161204265820342

This is the first model that is actually better than the dummy model (R^2 of about 0.92 versus 0.90), but only by a small margin. This means that, so far, implementing the dummy model is still the better choice, i.e., it offers the best compromise between effort and value.

## 2.6) Challenger model 2

Histogram Gradient Boosting Regressor removing features associated with main thumbnail color.

### 2.6.1) Let's select the features that we can actually use
Note that like_count and comment_count, for example, cannot be used because that would be data leakage. These features are highly correlated with view count because likes and comments only come **after** a video is watched, so they are not available at publication time.

In [38]:
X_train_challenger2 = yt_X_train[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'video_duration', 'upload_day_name', 
                   'upload_hour_local', 'channel_title']]

X_test_challenger2 = yt_X_test[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'video_duration', 'upload_day_name', 
                   'upload_hour_local', 'channel_title']]

### 2.6.2) Build the model pipeline

In [39]:
categorical_columns_selector = selector(dtype_include=object)

categorical_columns_challenger2 = categorical_columns_selector(X_train_challenger2)

In [40]:
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)

preprocessor_challenger2 = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns_challenger2)],
    remainder="passthrough")

In [90]:
#Best Hyperparameters: {'HGBRegr__loss': 'absolute_error', 'HGBRegr__max_iter': 200, 'HGBRegr__min_samples_leaf': 5}
model_challenger2 = make_pipeline(preprocessor_challenger2, HistGradientBoostingRegressor(loss = 'absolute_error', max_iter=200, min_samples_leaf=5))
model_challenger2

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('categorical',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['upload_day_name',
                                                   'upload_hour_local',
                                                   'channel_title'])])),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(loss='absolute_error',
                                               max_iter=200,
                                               min_samples_leaf=5))])

### 2.6.3) Train model

In [91]:
_ = model_challenger2.fit(X_train_challenger2, yt_y_train)

### 2.6.4) Make predictions

In [92]:
predictions_challenger2 = model_challenger2.predict(X_test_challenger2) # make predictions
predictions_challenger2_pretty = predictions_pretty_print_df(predictions_challenger2, f_inv)
predictions_challenger2_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,3505088.0,-11675820.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,10247.87,2110.873
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,12896.14,-3153.863
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,362.791,202.791
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,5092161.0,3234078.0


### 2.6.5) Prediction metrics
We will use "model.score", i.e., coefficient of determination R^2.

In [93]:
model_score_challenger2 = model_challenger2.score(X_test_challenger2, yt_y_test)
model_score_challenger2

0.9263967731200753

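Another small improvement here! The main thumbnail color was actually a bad feature for this model, although the hyperparameters also changed between the two models, so part of the gain may come from tuning. The idea still seems good, so maybe the implementation is what falls short (we could use the main color that CONTRASTS with the background, or a palette of colors).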
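To compare the models evaluated so far side by side, the R^2 scores reported in the outputs above (rounded to four decimals here) can be collected in one small table. This is only a sketch. The labels are made up for readability, and in the notebook the values would come from `model_score_dummy`, `model_score_baseline1` and so on.

```python
import pandas as pd

# Test-set R^2 scores copied (rounded) from the outputs of the previous sections.
scores = pd.Series({
    "Dummy (median by channel)": 0.8987,
    "Baseline 1 (tree, no channel)": 0.1302,
    "Baseline 2 (tree, with channel)": 0.8767,
    "Challenger 1 (HGB, all features)": 0.9161,
    "Challenger 2 (HGB, no thumbnail color)": 0.9264,
}, name="r2_test")

comparison = scores.sort_values(ascending=False).to_frame()
# Gain over the dummy model shows how much each model adds beyond the per-channel median.
comparison["gain_over_dummy"] = comparison["r2_test"] - scores["Dummy (median by channel)"]
print(comparison)
```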

## 2.7) Challenger model 3

Histogram Gradient Boosting Regressor using only video duration and channel title.

### 2.7.1) Let's select the features that we can actually use
Note that like_count and comment_count, for example, cannot be used because that would be data leakage. These features are highly correlated with view count because likes and comments only come **after** a video is watched, so they are not available at publication time.

In [45]:
X_train_challenger3 = yt_X_train[['video_duration', 'channel_title']]

X_test_challenger3 = yt_X_test[['video_duration', 'channel_title']]

### 2.7.2) Build the model pipeline

In [46]:
categorical_columns_selector = selector(dtype_include=object)

categorical_columns_challenger3 = categorical_columns_selector(X_train_challenger3)

In [47]:
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)

preprocessor_challenger3 = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns_challenger3)],
    remainder="passthrough")

In [48]:
# Best Hyperparameters: {'HGBRegr__loss': 'absolute_error', 'HGBRegr__max_iter': 50, 'HGBRegr__min_samples_leaf': 10} -> from Grid Search
model_challenger3 = make_pipeline(preprocessor_challenger3, HistGradientBoostingRegressor(loss='absolute_error', max_iter=50, min_samples_leaf=10))
model_challenger3

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('categorical',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['channel_title'])])),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(loss='absolute_error',
                                               max_iter=50,
                                               min_samples_leaf=10))])

### 2.7.3) Train model

In [49]:
_ = model_challenger3.fit(X_train_challenger3, yt_y_train)

### 2.7.4) Make predictions

In [50]:
predictions_challenger3 = model_challenger3.predict(X_test_challenger3) # make predictions
predictions_challenger3_pretty = predictions_pretty_print_df(predictions_challenger3, f_inv)
predictions_challenger3_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,3771049.0,-11409860.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,11178.18,3041.18
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,12030.54,-4019.456
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,518.4557,358.4557
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,5468570.0,3610487.0


### 2.7.5) Prediction metrics
We will use "model.score", i.e., coefficient of determination R^2.

In [51]:
model_score_challenger3 = model_challenger3.score(X_test_challenger3, yt_y_test)
model_score_challenger3

0.9146466707112778
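
For reference, the score reported by `model.score` on a scikit-learn regressor is R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on made-up numbers and checks it against `sklearn.metrics.r2_score`; the arrays are illustrative and unrelated to the YouTube data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative targets and predictions (not taken from the notebook).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

A model that always predicts the mean of the test targets scores 0, and the score is computed on whatever scale the target is stored in, so it is worth keeping in mind if the target was transformed before training.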

## 2.8) Challenger model 4

We now switch to Random Forest, using all features.

### 2.8.1) Let's select the features that we can actually use

In [52]:
X_train_challenger4 = yt_X_train[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

X_test_challenger4 = yt_X_test[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title', 'thumb_red', 'thumb_green', 'thumb_blue', 
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

### 2.8.2) Build the model pipeline

In [102]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns_challenger4 = numerical_columns_selector(X_train_challenger4)
categorical_columns_challenger4 = categorical_columns_selector(X_train_challenger4)

In [103]:
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
numerical_preprocessor = StandardScaler()

preprocessor_challenger4 = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns_challenger4),
    ('standard_scaler', numerical_preprocessor, numerical_columns_challenger4)])

In [112]:
# Best Hyperparameters: {'RFRegr__criterion': 'absolute_error', 'RFRegr__n_estimators': 75}
model_challenger4 = make_pipeline(preprocessor_challenger4, RandomForestRegressor(criterion = 'absolute_error', n_estimators = 75, random_state=123))
model_challenger4

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('categorical',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['upload_day_name',
                                                   'upload_hour_local',
                                                   'channel_title']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['n_words_title',
                                                   'n_question_marks_title',
                                                   'n_exclamation_marks_title',
                                                   'n_ellipsis_title',
                                                   'thumb_red', 'th

### 2.8.3) Train model

In [113]:
_ = model_challenger4.fit(X_train_challenger4, yt_y_train)

### 2.8.4) Make predictions

In [114]:
predictions_challenger4 = model_challenger4.predict(X_test_challenger4) # make predictions
predictions_challenger4_pretty = predictions_pretty_print_df(predictions_challenger4, f_inv)
predictions_challenger4_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,4645073.0,-10535830.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,12078.5,3941.502
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,12009.38,-4040.622
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,226.739,66.73901
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,5454851.0,3596768.0


### 2.8.5) Prediction metrics
We will use "model.score", i.e., coefficient of determination R^2.

In [115]:
model_score_challenger4 = model_challenger4.score(X_test_challenger4, yt_y_test)
model_score_challenger4

0.9143962874084941

## 2.9) Challenger model 5

Random Forest using all features but thumbnail color.

### 2.9.1) Let's select the features that we can actually use

In [59]:
X_train_challenger5 = yt_X_train[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title',
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

X_test_challenger5 = yt_X_test[['n_words_title', 'n_question_marks_title', 'n_exclamation_marks_title',
                      'n_ellipsis_title',
                      'video_duration', 'upload_day_name', 'upload_hour_local', 'channel_title']]

### 2.9.2) Build the model pipeline

In [60]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns_challenger5 = numerical_columns_selector(X_train_challenger5)
categorical_columns_challenger5 = categorical_columns_selector(X_train_challenger5)

In [61]:
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
numerical_preprocessor = StandardScaler()

preprocessor_challenger5 = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns_challenger5),
    ('standard_scaler', numerical_preprocessor, numerical_columns_challenger5)])

In [78]:
# Best Hyperparameters: {'RFRegr__criterion': 'absolute_error', 'RFRegr__n_estimators': 75}
model_challenger5 = make_pipeline(preprocessor_challenger5, RandomForestRegressor(criterion = 'absolute_error', n_estimators = 75, random_state=123))
model_challenger5

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('categorical',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['upload_day_name',
                                                   'upload_hour_local',
                                                   'channel_title']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['n_words_title',
                                                   'n_question_marks_title',
                                                   'n_exclamation_marks_title',
                                                   'n_ellipsis_title',
                                                   'video_duration'

### 2.9.3) Train model

In [79]:
_ = model_challenger5.fit(X_train_challenger5, yt_y_train)

### 2.9.4) Make predictions

In [80]:
predictions_challenger5 = model_challenger5.predict(X_test_challenger5) # make predictions
predictions_challenger5_pretty = predictions_pretty_print_df(predictions_challenger5, f_inv)
predictions_challenger5_pretty.head()

Unnamed: 0,video_title,channel_title,view_count,view_count_pred,error
1583,"Trying to Catch a 1,000 MPH Baseball - Smarter...",SmarterEveryDay,15180905.0,5180683.0,-10000220.0
1745,⚡ TRIBAL DE RAIOS! → Izzet Spells no Alchemy c...,UMotivo,8137.0,10996.16,2859.162
772,"⚪ O NOVO WHITE WEENIE → Basri Ket Aggro, com C...",UMotivo,16050.0,11896.86,-4153.138
1728,"PI - Comando ""para""",Carla Quem Disse,160.0,304.591,144.591
387,The Best Test of General Relativity (by 2 Misp...,Veritasium,1858083.0,9852416.0,7994333.0


### 2.9.5) Prediction metrics
We will use "model.score", i.e., coefficient of determination R^2.

In [81]:
model_score_challenger5 = model_challenger5.score(X_test_challenger5, yt_y_test)
model_score_challenger5

0.9267171187408957

# **3) Grid search**
Now we will select the most promising models and run a grid search on them in order to tune their hyperparameters. Then we will train them again with the best parameters found.

## 3.1) Grid search for Challenger model 2

In [85]:
hgbr_regr = Pipeline(steps=[("preprocessor", preprocessor_challenger2), ("HGBRegr", HistGradientBoostingRegressor())])

In [86]:
parameters_hgbr = {'HGBRegr__max_iter': [10, 25, 50, 75, 100, 200],
                    'HGBRegr__min_samples_leaf': [5, 10, 20, 30],
                    'HGBRegr__loss': ('squared_error', 'absolute_error', 'poisson')} 

In [87]:
hgbr_regr_grid = GridSearchCV(estimator = hgbr_regr, 
                            param_grid = parameters_hgbr,
                            scoring='r2',         # score metric for regression
                            cv=20,
                            n_jobs=-1)

In [88]:
hgbr_regr_grid.fit(X_train_challenger2, yt_y_train)

GridSearchCV(cv=20,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('categorical',
                                                                         OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                                        unknown_value=-1),
                                                                         ['upload_day_name',
                                                                          'upload_hour_local',
                                                                          'channel_title'])])),
                                       ('HGBRegr',
                                        HistGradientBoostingRegressor())]),
             n_jobs=-1,
             param_grid={'HGBRegr__loss': ('squared_error', 'absolute_error',
         

In [89]:
print(f"Best Score: {hgbr_regr_grid.best_score_}")
print(f"Best Hyperparameters: {hgbr_regr_grid.best_params_}")

#pd.DataFrame(hgbr_regr_grid.cv_results_)

Best Score: 0.9329296034272913
Best Hyperparameters: {'HGBRegr__loss': 'absolute_error', 'HGBRegr__max_iter': 200, 'HGBRegr__min_samples_leaf': 5}


## 3.2) Grid search for Challenger model 4

In [97]:
rf_regr = Pipeline(steps=[("preprocessor", preprocessor_challenger4), ("RFRegr", RandomForestRegressor())])

In [98]:
parameters_rf = {'RFRegr__n_estimators': [10, 25, 50, 75, 100, 200],
              'RFRegr__criterion': ('squared_error', 'absolute_error', 'poisson')}

In [99]:
rf_regr_grid = GridSearchCV(estimator = rf_regr, 
                            param_grid = parameters_rf,
                            scoring='r2',         # score metric for regression
                            cv=20,
                            n_jobs=-1)

In [100]:
rf_regr_grid.fit(X_train_challenger4, yt_y_train)

GridSearchCV(cv=20,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('categorical',
                                                                         OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                                        unknown_value=-1),
                                                                         ['upload_day_name',
                                                                          'upload_hour_local',
                                                                          'channel_title']),
                                                                        ('standard_scaler',
                                                                         StandardScaler(),
                                                                         ['n_words_title',
                                                  

In [77]:
print(f"Best Score: {rf_regr_grid.best_score_}")
print(f"Best Hyperparameters: {rf_regr_grid.best_params_}")

#pd.DataFrame(rf_regr_grid.cv_results_)

Best Score: 0.9342203687930267
Best Hyperparameters: {'RFRegr__criterion': 'absolute_error', 'RFRegr__n_estimators': 75}


# **4) Compare models**

In [116]:
print(f"Dummy model score:\n {model_score_dummy}")
print(f"Baseline model 1 (decision tree, w/o channel_title feature) score:\n {model_score_baseline1}")       # Decision tree, w/o channel_title feature
print(f"Baseline model 2 (Decision tree, with channel_title feature) score:\n {model_score_baseline2}")       # Decision tree, with channel_title feature
print(f"Challenger model 1 (HGBR, all features) score:\n {model_score_challenger1}")   # HGBR, all features
print(f"Challenger model 2 (HGBR, all features but thumbnail color) score:\n {model_score_challenger2}")   # HGBR, all features but thumbnail color
print(f"Challenger model 3 (HGBR, only video duration and channel_title) score:\n {model_score_challenger3}")   # HGBR, only video duration and channel_title
print(f"Challenger model 4 (Random Forest, all features) score:\n {model_score_challenger4}")   # Random Forest, all features
print(f"Challenger model 5 (Random Forest, all features but thumbnail color) score:\n {model_score_challenger5}")   # Random Forest, all features but thumbnail color

Dummy model score:
 0.8986743311526758
Baseline model 1 (decision tree, w/o channel_title feature) score:
 0.130234156464053
Baseline model 2 (Decision tree, with channel_title feature) score:
 0.8767191632447643
Challenger model 1 (HGBR, all features) score:
 0.9161204265820342
Challenger model 2 (HGBR, all features but thumbnail color) score:
 0.9263967731200753
Challenger model 3 (HGBR, only video duration and channel_title) score:
 0.9146466707112778
Challenger model 4 (Random Forest, all features) score:
 0.9143962874084941
Challenger model 5 (Random Forest, all features but thumbnail color) score:
 0.9267171187408957


The best models are HGBR and Random Forest, both using all features but thumbnail color. That matches what we expected from the correlation matrix of the features: thumbnail color wasn't correlated with view count.

Based on that, and on how easy these techniques are to implement, I would choose either of them over the Dummy model. But in an ideal scenario, more features would need to be extracted in order to increase prediction quality.
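
To make the comparison easier to scan, the scores printed above can be ranked in a small table. The sketch below copies the printed values into a dictionary; the short labels and the `gain_over_dummy` column are additions for illustration.

```python
import pandas as pd

# R^2 scores copied from the output above.
scores = {
    "Dummy": 0.8986743311526758,
    "Baseline 1 (DT, w/o channel_title)": 0.130234156464053,
    "Baseline 2 (DT, with channel_title)": 0.8767191632447643,
    "Challenger 1 (HGBR, all features)": 0.9161204265820342,
    "Challenger 2 (HGBR, no thumbnail color)": 0.9263967731200753,
    "Challenger 3 (HGBR, duration + channel)": 0.9146466707112778,
    "Challenger 4 (RF, all features)": 0.9143962874084941,
    "Challenger 5 (RF, no thumbnail color)": 0.9267171187408957,
}

# Rank models from best to worst and measure the margin over the Dummy model.
ranking = pd.Series(scores, name="r2").sort_values(ascending=False)
comparison = pd.DataFrame({
    "r2": ranking,
    "gain_over_dummy": ranking - scores["Dummy"],
})

print(comparison.round(4))
```

Seen this way, the two best models beat the Dummy model by roughly 0.03 in R^2, while both decision tree baselines score below it.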