# MSA 2020 - AI & Advanced Analytics - Trending YouTube Video Statistics

By Abigail Sarmiento<br>
Last updated: 27 June 2020

### Read in file

In [1]:
import pandas as pd

df = pd.read_csv('US_youtube.csv', index_col=0)
df.head()

video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


### Remove unwanted columns

Columns were dropped from the DataFrame if they had many missing values (e.g. description), were not deemed relevant, or had a very large number of distinct values (e.g. tags). The last group was dropped mainly because it produced a high overhead when encoding the categorical values.

In [2]:
df.isnull().sum()

trending_date               0
title                       0
channel_title               0
category_id                 0
publish_time                0
tags                        0
views                       0
likes                       0
dislikes                    0
comment_count               0
thumbnail_link              0
comments_disabled           0
ratings_disabled            0
video_error_or_removed      0
description               570
dtype: int64

In [3]:
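# .copy() makes the subset independent, so later column assignments do not raise SettingWithCopyWarning
df = df[["channel_title", "category_id", "publish_time", "views", "likes", "dislikes", "comment_count", "comments_disabled", "ratings_disabled", "video_error_or_removed"]].copy()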
df.isnull().sum()

channel_title             0
category_id               0
publish_time              0
views                     0
likes                     0
dislikes                  0
comment_count             0
comments_disabled         0
ratings_disabled          0
video_error_or_removed    0
dtype: int64
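
The two measurable criteria above (missing values and number of distinct values) can be checked side by side. The following is a minimal sketch on a made-up toy frame, not the notebook's data; the `toy` and `summary` names are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "channel_title": ["A", "B", "A", "C"],
    "tags": ["x|y", "p|q", "m|n", "r|s"],
    "description": ["text", None, None, "more text"],
    "views": [10, 20, 30, 40],
})

# One row per column: count of missing values and count of distinct non-null values.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "unique": toy.nunique(),
})
print(summary)
```

A column whose `unique` count approaches the number of rows, like `tags` here, would produce nearly one dummy column per row if it were one-hot encoded.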

### Parse publish_time column

The original 'publish_time' column, given as an ISO 8601 timestamp string, was split into its year, month and hour. These three factors were deemed the most likely to affect the model. For example, a video published 5 years ago may have more views than one published 1 month ago, and uploading at a particular time of day might make a video more popular than uploading at any other time.

In [4]:
df['publish_time'] = df['publish_time'].astype(str)
df['publish_date'] = df['publish_time'].str.split('T').str[0]
df['publish_t'] = df['publish_time'].str.split('T').str[1]

df['publish_year'] = df['publish_date'].str.split('-').str[0].astype(int)
df['publish_month'] = df['publish_date'].str.split('-').str[1].astype(int)
df['publish_hour'] = df['publish_t'].str.split(':').str[0].astype(int)

df = df.drop(['publish_time', 'publish_date', 'publish_t'], axis=1)
df.head()

video_id,channel_title,category_id,views,likes,dislikes,comment_count,comments_disabled,ratings_disabled,video_error_or_removed,publish_year,publish_month,publish_hour
2kyS6SvSYSE,CaseyNeistat,22,748374,57527,2966,15954,False,False,False,2017,11,17
1ZAPwfrtAFY,LastWeekTonight,24,2418783,97185,6146,12703,False,False,False,2017,11,7
5qpjK5DgCt4,Rudy Mancuso,23,3191434,146033,5339,8181,False,False,False,2017,11,19
puqaWrEC7tY,Good Mythical Morning,24,343168,10172,666,2146,False,False,False,2017,11,11
d380meD0W0M,nigahiga,24,2095731,132235,1989,17518,False,False,False,2017,11,18
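
As an alternative to string splitting, the same three fields can be read from a parsed timestamp. This is a sketch of a different technique from the one used in the cell above, using `pd.to_datetime` and the `.dt` accessor; the `toy` frame holds two timestamps copied from the sample rows.

```python
import pandas as pd

# Two timestamps copied from the sample rows shown above.
toy = pd.DataFrame({
    "publish_time": ["2017-11-13T17:13:01.000Z", "2017-11-13T07:30:00.000Z"],
})

# Parse the ISO 8601 strings once, then read components through the .dt accessor.
parsed = pd.to_datetime(toy["publish_time"])
toy["publish_year"] = parsed.dt.year
toy["publish_month"] = parsed.dt.month
toy["publish_hour"] = parsed.dt.hour  # hour in UTC, as in the source strings

toy = toy.drop(columns=["publish_time"])
print(toy)
```

Parsing also fails loudly on malformed timestamps, which string splitting would pass through silently.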


### Encode categorical columns

Categorical values were one-hot encoded: each category becomes its own binary column (1 if that category applies, 0 if not). This nominal encoding was chosen over ordinal encoding because the categories have no inherent order and should be weighted equally.

In [5]:
df = pd.get_dummies(df, columns=['channel_title', 'category_id', 'comments_disabled', 'ratings_disabled', 'video_error_or_removed'])
df.head()

video_id,views,likes,dislikes,comment_count,publish_year,publish_month,publish_hour,channel_title_12 News,channel_title_1MILLION Dance Studio,channel_title_1theK (원더케이),...,category_id_27,category_id_28,category_id_29,category_id_43,comments_disabled_False,comments_disabled_True,ratings_disabled_False,ratings_disabled_True,video_error_or_removed_False,video_error_or_removed_True
2kyS6SvSYSE,748374,57527,2966,15954,2017,11,17,0,0,0,...,0,0,0,0,1,0,1,0,1,0
1ZAPwfrtAFY,2418783,97185,6146,12703,2017,11,7,0,0,0,...,0,0,0,0,1,0,1,0,1,0
5qpjK5DgCt4,3191434,146033,5339,8181,2017,11,19,0,0,0,...,0,0,0,0,1,0,1,0,1,0
puqaWrEC7tY,343168,10172,666,2146,2017,11,11,0,0,0,...,0,0,0,0,1,0,1,0,1,0
d380meD0W0M,2095731,132235,1989,17518,2017,11,18,0,0,0,...,0,0,0,0,1,0,1,0,1,0
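
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up four-row frame. The `dtype=int` argument is an addition for this example: newer pandas versions otherwise return boolean columns rather than the 0/1 values shown above.

```python
import pandas as pd

# Made-up rows; category ids reuse values seen in the sample output.
toy = pd.DataFrame({"category_id": [22, 24, 23, 24], "views": [100, 200, 300, 400]})

# Each distinct category_id becomes its own 0/1 column; 'views' is left untouched.
encoded = pd.get_dummies(toy, columns=["category_id"], dtype=int)
print(encoded)

# Exactly one dummy column is set per row.
dummy_cols = [c for c in encoded.columns if c.startswith("category_id_")]
print(encoded[dummy_cols].sum(axis=1).tolist())
```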


### Apply feature selection techniques

A filter method was applied that keeps only the columns whose absolute Pearson correlation with 'views' is above 0.025. Because the removed columns had a relatively low correlation with the target variable, including them would most likely not have made a significant difference to the result. Reducing the feature list also improves runtime.

<font color=red> NOTE: This takes a few minutes to complete.</font>

In [6]:
corr = df.corr()
# Absolute Pearson correlation of every column with the target
corr_target = corr['views'].abs()
relevant_features = corr_target[corr_target > 0.025]
print(relevant_features)
# Keep the columns that passed the threshold ('views' itself is among them)
features_list = relevant_features.index.tolist()
df = df[features_list]

df.shape

views             1.000000
likes             0.849177
dislikes          0.472213
comment_count     0.617621
publish_year      0.075538
                    ...   
category_id_23    0.036159
category_id_25    0.060810
category_id_26    0.062509
category_id_27    0.045752
category_id_28    0.030653
Name: views, Length: 77, dtype: float64


(40949, 77)

This filtering reduced the DataFrame from 2366 features to 77 columns, one of which is the target 'views'.
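
The filter can also be written as a small reusable function. The sketch below is a variant, not the notebook's code: it uses `DataFrame.corrwith` to correlate each column with the target directly instead of building the full correlation matrix, which may help with the runtime noted above. The function name `select_by_correlation`, the toy data and the 0.5 threshold are all hypothetical.

```python
import pandas as pd

def select_by_correlation(frame, target, threshold):
    """Return columns whose absolute Pearson correlation with `target` exceeds `threshold`."""
    corr_with_target = frame.corrwith(frame[target]).abs()
    return corr_with_target[corr_with_target > threshold].index.tolist()

# Made-up data: 'likes' tracks 'views' exactly, 'noise' is uncorrelated with it.
toy = pd.DataFrame({
    "views": [1, 2, 3, 4, 5, 6],
    "likes": [2, 4, 6, 8, 10, 12],
    "noise": [1, -1, 0, 0, -1, 1],
})

selected = select_by_correlation(toy, target="views", threshold=0.5)
print(selected)
filtered = toy[selected]
```

The target column always survives the filter because its correlation with itself is 1.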

### Split data into training and testing sets

In [7]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(df.drop('views', axis=1), df['views'], test_size=0.3)

### Perform a 5-fold cross validation to find the most fitting model

A k-fold cross validation, using k = 5, was run on the training data set to determine the most effective regression algorithm out of LinearRegression(), ElasticNet(), Ridge(), DecisionTreeRegressor() and KNeighborsRegressor(). The value of k was set to 5 because it is a common choice in machine learning that gives a reasonable trade-off between bias and variance in the performance estimate.

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

regressors = [LinearRegression(), ElasticNet(), Ridge(), DecisionTreeRegressor(), KNeighborsRegressor()]
regressor_accuracy_list = []

for regressor in regressors:
    print(regressor)
    # With no `scoring` argument, cross_val_score uses the estimator's score(), which is R^2 for regressors
    accuracies = cross_val_score(regressor, train_x, train_y, cv=5)
    regressor_accuracy_list.append((accuracies.mean(), type(regressor).__name__))

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
               

In [9]:
regressor_accuracy_list = sorted(regressor_accuracy_list, reverse=True)

print("===== Results of 5-fold cross validation =====")
for item in regressor_accuracy_list:
    print(item[1], ": ", item[0])

===== Results of 5-fold cross validation =====
DecisionTreeRegressor :  0.9359952714203537
LinearRegression :  0.8821029939905065
Ridge :  0.8819170567565869
KNeighborsRegressor :  0.8771409355060081
ElasticNet :  0.7907320373541147


The results showed that the highest performing model was DecisionTreeRegressor() with a mean cross-validated R^2 score of 0.94 (the default score that cross_val_score reports for regressors). Therefore, I decided to train the model using DecisionTreeRegressor().
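
For reference, the comparison can be written with the metric and the fold layout spelled out. This is a sketch on synthetic data from `make_regression`, not the YouTube data; the `KFold` settings, the `random_state` values and the `results` dictionary are assumptions added for reproducibility.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Five shuffled folds with a fixed seed so repeated runs give the same splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    # scoring="r2" names the metric explicitly instead of relying on the default.
    results[type(model).__name__] = cross_val_score(model, X, y, cv=cv, scoring="r2")

for name, scores in results.items():
    print(name, scores.mean())
```

Naming the metric makes it clear that the reported numbers are R^2 values rather than classification accuracies.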

### Train the model

In [10]:
dtr = DecisionTreeRegressor()
dtr.fit(train_x, train_y)
y_pred = dtr.predict(test_x)

### Evaluate the model

In [11]:
results = pd.DataFrame({'Actual': test_y, 'Predicted' : y_pred, 'Difference' : abs(test_y - y_pred)})
results.head(10)

video_id,Actual,Predicted,Difference
z91xQ5u-_wY,47299,50628.0,3329.0
wI89nVn6LHk,2003345,1993400.0,9945.0
u5Jw7D-c3II,48870,32788.0,16082.0
5NaAwSpswJE,2763694,2093287.0,670407.0
v85VNXT0Euk,368627,319773.0,48854.0
K1uiJIl614c,57035,52591.0,4444.0
4LlQwTgB5Rc,302623,292725.0,9898.0
doP7xKdGOKs,671618,919231.0,247613.0
NJAtY43vsl8,2985521,8843946.0,5858425.0
BxZYIjdgGtg,15478,84534.0,69056.0


In [12]:
from sklearn.metrics import mean_squared_log_error, r2_score

msle = mean_squared_log_error(test_y, y_pred)
r2 = r2_score(test_y, y_pred)

print("Mean squared log error: " + str(msle))
print("R^2: " + str(r2))

Mean squared log error: 0.2777144785437705
R^2: 0.9539979158288119


Two metrics, mean squared log error (MSLE) and the coefficient of determination (R^2), were used to evaluate the regression model on the test set. MSLE penalises relative rather than absolute errors, which suits a target such as view counts that spans several orders of magnitude; it was 0.28. The R^2 score, which measures the proportion of variance in the target that the predictions explain, was 0.95, close to the maximum of 1. According to these two scores, the model seems to have performed well.
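
To show what the two metrics compute, the sketch below evaluates them by hand with NumPy and prints the scikit-learn values alongside. It uses only the first three Actual/Predicted pairs from the results table, so the numbers it prints are for illustration and will not match the full test-set scores.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

# First three Actual/Predicted pairs from the results table above.
actual = np.array([47299, 2003345, 48870], dtype=float)
predicted = np.array([50628.0, 1993400.0, 32788.0])

# MSLE: mean squared difference of log(1 + value).
msle_manual = np.mean((np.log1p(actual) - np.log1p(predicted)) ** 2)

# R^2: 1 minus the residual sum of squares over the total sum of squares.
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(msle_manual, mean_squared_log_error(actual, predicted))
print(r2_manual, r2_score(actual, predicted))
```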