
### Project Overview

The goal of this project is to give YouTube content creators a way to gauge how effective a video title will be and to provide insight into which titles work best. Video titles are one of the main factors that determine whether people click on a video. A ‘formula’ for creating effective video titles would therefore be advantageous to content creators. In Phase I, our goal is to automate this process by using unsupervised and supervised learning to determine whether there is a relationship between the structure of a title and the number of views.

We reviewed similar projects as part of our literature review and found most of the approaches that did ........ 

The common finding was that ..... 

We intend to use unsupervised learning to extract an underlying structure of video titles and use those features for prediction. This will require us to represent the features of video titles appropriately such that those features can be used by ML algorithms.  

- sarcasm score
- boring score
- remove outliers (UMAP & LOF)
- choose top_n vs. bottom_n to reduce noise from average-performing videos
- build multiple models for subgroups, as titles vary too much across broad categories
- because previous video views were found to be the strongest predictor of future views, try restricting to channels with a low subscriber count
- views/subscriber count (normalizes views by subscriber count to estimate new views)
- comments/subscribers (highly engaged viewers)
- change the target to engagement, which is a combination of views, likes and comments
- use topic modeling and calculate similarity to trending topics or topics in video comments
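
The two ratio features in the list above could be computed along these lines. This is only a sketch: the `subscribers` column is hypothetical (the dataframe built later in this notebook does not carry subscriber counts), the helper name `add_ratio_features` is invented, and the toy numbers are made up for illustration.

```python
import pandas as pd

# Toy data; the `subscribers` column is an assumption, not part of the notebook's dataframe.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "views": [1000, 50000, 200],
    "comments": [10, 100, 0],
    "subscribers": [500, 100000, 0],
})

def add_ratio_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with views and comments normalized by subscriber count."""
    out = frame.copy()
    # Treat a zero subscriber count as missing to avoid division by zero.
    subs = out["subscribers"].where(out["subscribers"] > 0)
    out["views_per_subscriber"] = out["views"] / subs
    out["comments_per_subscriber"] = out["comments"] / subs
    return out

ratios = add_ratio_features(toy)
print(ratios[["views_per_subscriber", "comments_per_subscriber"]])
```

Channels with no subscribers end up with missing ratios here; whether to drop or impute them is a separate modeling decision.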


In this notebook, I will focus on extracting and preparing the data to be used by the ML algorithms. I will also establish baseline performance by training some naive regressors and a linear regression model.

### Data Extraction & Exploration 

#### Data Extraction

Extract the data from all JSON files and combine it into a single dataframe.

In [3]:
import os, json
import pandas as pd
import numpy as np
import glob
from datetime import datetime

data_dir = 'data'  # renamed from `dir` to avoid shadowing the built-in
path = os.path.join(data_dir, '**/*.json')
file_list = glob.glob(path)
print('Total number of channels ' + str(len(file_list)))

dfs = list()

for file in file_list:
    with open(file, 'r') as f:
        data = json.load(f)
    # Each file holds a single channel: {channel_id: {...}}
    channel_id, channel_data = data.popitem()
    channel_stats = channel_data["channel_statistics"]  # not used yet
    video_stats = channel_data["video_data"]
    # The category is the name of the folder containing the file (works on any OS)
    cat = os.path.basename(os.path.dirname(file))
    rows = []
    for video_id, video in video_stats.items():
        title = video["title"]
        try:
            views = video["viewCount"]
            likes = video["likeCount"]
            duration = video['duration']
            tags = video['tags']
            description = video['description']
            comments = video["commentCount"]
            channel = video['channelTitle']
            published = video['publishedAt'].split('T')[0]
        except KeyError:
            # NOTE: when a field is missing, any field not yet read keeps the
            # value from the previous video, so that row carries stale values.
            pass
        rows.append([title, views, published, likes, comments, duration, tags, description, channel, cat])
    vid_df = pd.DataFrame(rows, columns=["title", "views", 'published', "likes", "comments", 'duration', 'tag', 'description', 'channel', 'category'])
    dfs.append(vid_df)


df = pd.concat(dfs, ignore_index=True)


Total number of channels 86
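
For reference, the extraction loop implies that each file holds a single channel keyed by its ID, with `channel_statistics` and `video_data` underneath. The sketch below parses one such record with `dict.get`, so a missing field becomes `None` instead of raising. The sample values and the helper name `video_rows` are invented for illustration; only the key names come from the notebook's code.

```python
import json

# Made-up sample mirroring the key names used in the extraction loop.
sample = json.loads("""
{
  "CHANNEL_ID": {
    "channel_statistics": {},
    "video_data": {
      "vid1": {"title": "First", "viewCount": "10", "likeCount": "2",
               "commentCount": "1", "duration": "PT5M", "tags": ["x"],
               "description": "d", "channelTitle": "Demo",
               "publishedAt": "2021-06-02T12:00:00Z"},
      "vid2": {"title": "No stats"}
    }
  }
}
""")

FIELDS = ["viewCount", "publishedAt", "likeCount", "commentCount",
          "duration", "tags", "description", "channelTitle"]

def video_rows(channel_json):
    """Yield one dict per video; absent fields are set to None."""
    (channel_id, payload), = channel_json.items()  # exactly one channel per file
    for video_id, video in payload["video_data"].items():
        row = {"video_id": video_id, "title": video.get("title")}
        for field in FIELDS:
            row[field] = video.get(field)
        if row["publishedAt"] is not None:
            row["publishedAt"] = row["publishedAt"].split("T")[0]  # keep the date part
        yield row

rows = list(video_rows(sample))
print(rows[1])
```

A list of such dicts can be passed straight to `pd.DataFrame`, in which case the gaps would show up in `df.isna().sum()` rather than being hidden.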


In [4]:
df.head()

Unnamed: 0,title,views,published,likes,comments,duration,tag,description,channel,category
0,"Bartending, Improv, and Other Trends of 2011 |...",438729,2021-12-22,17207,628,PT10M1S,"[Collegehumor, CH originals, comedy, sketch co...","The news crew talks corgi butts, National Call...",CollegeHumor,Comedy
1,Inspector Gadget's Death Sparks Oscar Buzz | N...,479899,2021-11-10,16614,605,PT5M,"[Collegehumor, CH originals, comedy, sketch co...",Jeffrey Self and Grant question the wetness of...,CollegeHumor,Comedy
2,Delicious Kitchen Fire Caramelizes Dozens of P...,740322,2021-08-25,28629,779,PT4M33S,"[Collegehumor, CH originals, comedy, sketch co...",Stamps Dotcom & Marc Maron report that Russian...,CollegeHumor,Comedy
3,"Handshakes For Men, Hugs For Women | No Laugh ...",1642663,2021-07-14,71402,1507,PT8M16S,"[Collegehumor, CH originals, comedy, sketch co...",Amy shows off some new dances. Brennan is a 12...,CollegeHumor,Comedy
4,True Stories From the CollegeHumor Office | No...,1828336,2021-06-02,66660,1368,PT6M4S,"[Collegehumor, CH originals, comedy, sketch co...",Trapp & Katie read a riveting report from the ...,CollegeHumor,Comedy


In [5]:
df.tail()

Unnamed: 0,title,views,published,likes,comments,duration,tag,description,channel,category
33691,Downward Dog - Downward Facing Dog Yoga Pose,1327691,2012-12-12,13997,565,PT7M58S,"[downward dog, down dog, downward facing dog, ...",Learn Downward Dog yoga pose with Adriene. If ...,Yoga With Adriene,Yoga
33692,Reclined Twist Yoga Pose - Yoga With Adriene,162174,2012-11-14,2397,116,PT8M40S,"[yoga, adriene mishler, yoga with adriene, yog...",Learn the Reclined Twist Yoga Pose with Adrien...,Yoga With Adriene,Yoga
33693,Corpse Pose - Yoga With Adriene,336507,2012-10-31,4589,249,PT9M27S,"[yoga for beginners, yoga with adriene, founda...",Learn how to do the Corpse Pose with Adriene! ...,Yoga With Adriene,Yoga
33694,Extended Child's Pose - Yoga With Adriene,1451610,2012-10-24,11649,373,PT6M14S,"[extended child's pose, yoga for beginners, ch...",Learn Extended Child's Pose with Adriene! This...,Yoga With Adriene,Yoga
33695,Runner's Lunge - Foundations of Yoga,247577,2012-10-10,3364,165,PT6M4S,"[yoga for beginners, yoga with adriene, adrien...",Learn Runner's Lunge with Adriene! (Not just f...,Yoga With Adriene,Yoga


In [36]:
df.isna().sum()
#df.views = (df.views).astype(int)

title          0
views          0
published      0
likes          0
comments       0
duration       0
tag            0
description    0
channel        0
category       0
dtype: int64

In [39]:
df.dtypes

title                  object
views                  object
published      datetime64[ns]
likes                  object
comments               object
duration               object
tag                    object
description            object
channel                object
category               object
dtype: object
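
The `duration` column is still an `object` column holding strings such as `PT10M1S`, which look like ISO 8601 durations. If a numeric duration is wanted as a feature, a conversion along the following lines might work. This is a sketch that assumes only hours, minutes and seconds appear; the helper name `duration_to_seconds` is hypothetical.

```python
import re

# Matches PT<h>H<m>M<s>S where each component is optional.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")

def duration_to_seconds(value):
    """Convert strings like 'PT10M1S' to seconds; return None if the format is unexpected."""
    match = _DURATION.match(value)
    if match is None:
        return None
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds

for raw in ["PT10M1S", "PT5M", "PT4M33S"]:
    print(raw, duration_to_seconds(raw))
```

It could then be applied column-wise, for example with `df['duration'].apply(duration_to_seconds)`.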

In [40]:
df['views'] = df['views'].apply(int)

In [41]:
df.published = pd.to_datetime(df.published, format='%Y-%m-%d')
df.shape

(33696, 10)

In [42]:
def top10():
    
    """This function sorts the data by views and title and selects the top 10 titles"""
    
    df_sorted = df.sort_values(['views','title'], ascending=[False,True])[:10]
    
    return df_sorted[['title','views']]

top10()  # mostly popular songs at the top. We will probably have to remove this category 

Unnamed: 0,title,views
15873,Ed Sheeran - Shape of You (Official Music Video),5809870270
15893,Ed Sheeran - Thinking Out Loud (Official Music...,3499179932
15848,Ed Sheeran - Perfect (Official Music Video),3256243151
15695,Bruno Mars - The Lazy Song (Official Music Video),2416952328
15658,Bruno Mars - That’s What I Like [Official Musi...,2061152302
15705,Bruno Mars - Just The Way You Are (Official Mu...,1752542816
15669,Bruno Mars - 24K Magic (Official Music Video),1532258215
15882,Ed Sheeran - Photograph (Official Music Video),1238010919
15676,Bruno Mars - When I Was Your Man (Official Mus...,1182488125
15701,Bruno Mars - Grenade (Official Music Video),1098346906


In [43]:
def bottom10():
    
    """This function sorts the data by views and title and selects the 10 titles with the fewest views"""
    
    df_sorted = df.sort_values(['views','title'], ascending=[True,True])[:10]
    
    return df_sorted[['title','views']]

bottom10() # we will have to consider removing titles with no views. 

Unnamed: 0,title,views
11284,10-Minute Dance Cardio Workout With Charlize G...,0
11283,10-Minute Feel-Good Standing Workout With Rach...,0
11286,"10-Minute, No-Equipment Cardio HIIT With Ranei...",0
16129,ROUND 2 LDN,0
15708,"Ed Sheeran, Pokémon - Celestial [Official Video]",9
23908,SMASHING COLOR FILLED BALLOON AT BOSTON FENWAY...,305
19013,FDI net inflows slow for 4th straight month in...,365
18472,"Pamilya ng nasagasaang street sweeper, desidid...",377
18986,Manhit: It is govt's role to protect PH mariti...,399
18639,Hottest new films to stream from thrillers to ...,458


#### Splitting our data into train, dev and test sets 

In [44]:
# Splitting the data into 80%/10%/10% train, dev and test sets

RANDOM_SEED = 42

train_df, dev_df, test_df = np.split(df.sample(frac=1, random_state=RANDOM_SEED),[int(.8*len(df)), int(.9*len(df))])

print(len(train_df), len(dev_df), len(test_df))

26956 3370 3370
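
`np.split` cuts the shuffled frame at the two row positions passed in the list. The arithmetic behind the three printed sizes can be checked with plain Python, using only the row count reported by `df.shape`:

```python
n_rows = 33696  # from df.shape

first_cut = int(0.8 * n_rows)   # end of the training slice
second_cut = int(0.9 * n_rows)  # end of the dev slice

train_size = first_cut
dev_size = second_cut - first_cut
test_size = n_rows - second_cut

print(train_size, dev_size, test_size)
```

Because `int()` truncates, any leftover rows from rounding end up in the later slices.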


#### Converting the titles to features 

In [45]:
from sklearn.dummy import DummyRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fitting a vectorizer.
# min_df=50 keeps only terms that appear in at least 50 titles (document frequency).
# English stop words are removed, and both unigrams and bigrams are used.

vectorizer = TfidfVectorizer(min_df=50,stop_words='english',ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_df.title)

In [46]:
X_train.shape  # 619 unigram and bigram features remain after filtering

(26956, 619)
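
`min_df` is a threshold on document frequency, that is, the number of titles a term occurs in, not on its total count. A toy corpus (made up for illustration, with `min_df=2` instead of 50 and unigrams only) shows the difference:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_titles = [
    "yoga yoga yoga",      # "yoga" three times in one title
    "morning yoga pose",
    "dance dance dance",   # "dance" three times, but only in this title
]

toy_vectorizer = TfidfVectorizer(min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# Only terms present in at least two titles survive.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.shape)
```

Here "dance" is dropped despite occurring three times, because all three occurrences are in a single title.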

In [47]:
# getting list of labels 
y_train = list(train_df.views)

#### Fit a LinearRegression model on unigram and bigram features

In [48]:
reg = LinearRegression()
reg.fit(X_train,y_train)

LinearRegression()

#### Generating development data 

In [49]:
X_dev = vectorizer.transform(dev_df.title)
y_dev = list(dev_df.views)

#### Create Dummy Regressors

In [50]:
dummy_clf_mean = DummyRegressor(strategy="mean")
dummy_clf_mean.fit(X_train, y_train)
dummy_clf_median = DummyRegressor(strategy="median")
dummy_clf_median.fit(X_train, y_train)

DummyRegressor(strategy='median')

#### Generating all the predictions 

In [54]:
lr_dev_preds = reg.predict(X_dev)
mean_dev_preds = dummy_clf_mean.predict(X_dev)
median_dev_preds = dummy_clf_median.predict(X_dev)

#### Scoring the predictions

In [55]:
lr_mse = mean_squared_error(y_dev, lr_dev_preds)
mean_mse = mean_squared_error(y_dev, mean_dev_preds)
median_mse = mean_squared_error(y_dev, median_dev_preds)

In [56]:
print(lr_mse)
print(mean_mse)
print(median_mse)

3613698765604540.0
3836618484685100.0
3851671859597287.5
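
Raw MSE values of this size are hard to read. Taking the square root gives an error in units of views, and dividing by the mean-baseline MSE gives the relative improvement of the linear model. The sketch below only reuses the three numbers printed above:

```python
import math

lr_mse = 3613698765604540.0
mean_mse = 3836618484685100.0
median_mse = 3851671859597287.5

for name, mse in [("linear", lr_mse), ("mean", mean_mse), ("median", median_mse)]:
    # RMSE is in the same unit as the target (views).
    print(f"{name:>6} RMSE: {math.sqrt(mse):,.0f} views")

# Fraction of the mean-baseline MSE removed by the linear model.
reduction = 1 - lr_mse / mean_mse
print(f"MSE reduction vs. mean baseline: {reduction:.1%}")
```

On these figures the linear model is off by roughly 60 million views in RMSE terms, and its MSE is only about 6% lower than that of always predicting the training mean.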
