# YouTube Video Analysis and Classification

This project analyzes trending YouTube videos from August 2020 to the present. It explores the attributes of trending videos, such as popular channels, categories, and keywords. It also trains a linear SVM classifier (scikit-learn's `SGDClassifier`) to classify videos by category, using tokens extracted from video titles and tags with NLTK.

To run the notebook, download the trending data for your preferred country, along with its category ID file, [here](https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset).

## Import Libraries

In [2]:
# imports for basic data processing
import pandas as pd
import numpy as np
import json
import string
from datetime import datetime, timedelta

# imports for NLP
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer

# imports for categorization
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer

# imports for notebook readability
from IPython.display import Markdown, display

## Read Data
If using data from countries other than the US, be sure to update the corresponding country code in the following cell.

In [3]:
# read data
with open('US_category_id.json') as f:
    category_ids = json.load(f)
data = pd.read_csv('US_youtube_trending_data.csv')

total_size = len(data)

display(Markdown(
    '### Data Overview \n'
    f'There are {len(data)} entries in the dataset before cleaning. <br>'
    'Here\'s an example row:'
))
display(data.sample())
display(Markdown(
    f'<br>The columns are <ul style="columns: 3;"><li>`{"`</li><li>`".join(data.columns)}`</li></ul><br>'
    'Some of these columns are irrelevant. We will remove these in the next cell.'
))

### Data Overview 
There are 214388 entries in the dataset before cleaning. <br>Here's an example row:

| | video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 19182 | UfqV3dAoiUQ | Among Us Logic 7 \| Cartoon Animation | 2020-11-07T23:30:01Z | UCToxKVrkEuAONR4rFIJ_DyQ | GameToons | 1 | 2020-11-16T00:00:00Z | among us\|among us logic\|animation\|funny animat... | 10274283 | 277089 | 5864 | 18132 | https://i.ytimg.com/vi/UfqV3dAoiUQ/default.jpg | False | False | ► SUBSCRIBE to GameTunes! -https://www.youtube... |


<br>The columns are <ul style="columns: 3;"><li>`video_id`</li><li>`title`</li><li>`publishedAt`</li><li>`channelId`</li><li>`channelTitle`</li><li>`categoryId`</li><li>`trending_date`</li><li>`tags`</li><li>`view_count`</li><li>`likes`</li><li>`dislikes`</li><li>`comment_count`</li><li>`thumbnail_link`</li><li>`comments_disabled`</li><li>`ratings_disabled`</li><li>`description`</li></ul><br>Some of these columns are irrelevant. We will remove these in the next cell.
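
Both file names in the cell above start with the country code, so switching countries could be reduced to changing one variable. The following is a minimal sketch under the assumption that every country's files follow the same `<CODE>_category_id.json` / `<CODE>_youtube_trending_data.csv` pattern as the US files; `COUNTRY` and `data_paths` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical helper: derive both file names from a single country code.
# Assumption: other countries' files are named like the US files used above.
COUNTRY = 'US'


def data_paths(country):
    """Return (category_json_path, trending_csv_path) for a country code."""
    code = country.upper()
    return f'{code}_category_id.json', f'{code}_youtube_trending_data.csv'


category_path, csv_path = data_paths(COUNTRY)
print(category_path, csv_path)
```

The two returned paths could then be passed to `json.load` and `pd.read_csv` in place of the hard-coded strings.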

## Clean and preprocess data
Because of how the data is collected, the dataset contains many duplicate entries for the same video. Remove them, keeping the first row for each `video_id`, and keep only the columns needed for the analysis.

In [4]:
# remove duplicates
data.drop_duplicates(subset='video_id', keep='first', inplace=True)

# fill in category from ID
category_ids = {c['id']: c['snippet']['title'] for c in category_ids['items']}
data['category'] = [category_ids[str(i)] for i in data.categoryId]

# isolate relevant columns, and rename to follow snake_case naming conventions for consistency
data = data[['title', 'channelTitle', 'category', 'publishedAt', 'tags', 'view_count']].rename(columns={'channelTitle': 'channel_title', 'publishedAt': 'published_at'})

# convert string to datetime object
data['published_at'] = [datetime.strptime(time, '%Y-%m-%dT%H:%M:%SZ') for time in data['published_at']]

display(Markdown(
    '### After Cleaning \n'
    f'Number of unique videos indexed: {len(data)} <br><br>'
    f'Number of duplicates removed: {total_size - len(data)} <br>'
    f'This means {round(100.0 * (total_size - len(data)) / total_size, 1)}% of entries were duplicates! \n'
    '### Most recent trending video'
))
display(data.tail(1))

### After Cleaning 
Number of unique videos indexed: 38747 <br><br>Number of duplicates removed: 175641 <br>This means 81.9% of entries were duplicates! 
### Most recent trending video

| | title | channel_title | category | published_at | tags | view_count |
| --- | --- | --- | --- | --- | --- | --- |
| 214265 | MrBeast's Diamond Play Button is on Ebay! | JackSucksAtLife | Entertainment | 2023-07-15 14:31:23 | jacksucksatlife\|JackSucksAtLife YouTube\|JackSu... | 492203 |
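
To make the cleaning steps concrete, here is a small sketch on made-up data. It assumes the category file has the `items` / `id` / `snippet.title` layout that the cleaning cell relies on; the rows and the names `toy`, `toy_categories`, and `lookup` are illustrative only.

```python
import pandas as pd

# Made-up rows: video 'a' appears twice, as if it had trended on two days.
toy = pd.DataFrame({
    'video_id': ['a', 'a', 'b', 'c'],
    'categoryId': [10, 10, 20, 10],
    'view_count': [100, 250, 40, 7],
})

# Made-up category file with the same nesting the cleaning cell expects.
toy_categories = {'items': [
    {'id': '10', 'snippet': {'title': 'Music'}},
    {'id': '20', 'snippet': {'title': 'Gaming'}},
]}

total_size = len(toy)

# keep='first' retains the first row seen for each video_id.
deduped = toy.drop_duplicates(subset='video_id', keep='first')

# The JSON stores IDs as strings, so the integer column is cast before mapping.
lookup = {c['id']: c['snippet']['title'] for c in toy_categories['items']}
deduped = deduped.assign(category=deduped.categoryId.astype(str).map(lookup))

removed = total_size - len(deduped)
pct_duplicates = round(100.0 * removed / total_size, 1)
print(deduped)
print(f'{removed} duplicate(s) removed, {pct_duplicates}% of rows')
```

Note that with `keep='first'` the retained `view_count` is whatever the first row for that video recorded (100 here, not 250). If the latest count were preferred, `keep='last'` would be the option to try, assuming the file is ordered by trending date.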


## Explore Data
Find the most popular videos, channels, and categories by view count.

In [5]:
# sort by views
data.sort_values('view_count', ascending=False, inplace=True)
display(Markdown('### Top Videos of All Time'))
display(data.head())

# get videos published in the last month
recent_data = data[data.published_at >= (datetime.today() - timedelta(days=31))]
display(Markdown('\n### Top Videos This Month'))
display(recent_data.head())

### Top Videos of All Time

| | title | channel_title | category | published_at | tags | view_count |
| --- | --- | --- | --- | --- | --- | --- |
| 212189 | Salaar Teaser \| Prabhas, Prashanth Neel, Prith... | Hombale Films | Entertainment | 2023-07-05 23:41:10 | salaar update\|salaar teaser\|salaar\|salaar teas... | 91463891 |
| 80193 | LISA - 'LALISA' M/V | BLACKPINK | Music | 2021-09-10 04:00:13 | YG Entertainment\|YG\|와이지\|K-pop\|BLACKPINK\|블랙핑크\|블... | 85890366 |
| 100194 | Crazy #alluarjun #painting #shorts #viral #tr... | Dr.Harrsha Artist | Film & Animation | 2021-12-08 13:16:02 | [None] | 79283769 |
| 51 | Cardi B - WAP feat. Megan Thee Stallion [Offic... | Cardi B | Music | 2020-08-07 04:00:10 | Cardi B\|Cardi\|Atlantic Records\|rap\|hip hop\|tra... | 76805026 |
| 114216 | Hey man, we are Italian 🇮🇹😅🤷🏼‍♀️#shorts #funny... | Jessi & Sean | People & Blogs | 2022-02-20 20:42:28 | [None] | 71401624 |



### Top Videos This Month

| | title | channel_title | category | published_at | tags | view_count |
| --- | --- | --- | --- | --- | --- | --- |
| 212189 | Salaar Teaser \| Prabhas, Prashanth Neel, Prith... | Hombale Films | Entertainment | 2023-07-05 23:41:10 | salaar update\|salaar teaser\|salaar\|salaar teas... | 91463891 |
| 212991 | Jawan \|Official Hindi Prevue \|Shah Rukh Khan \|... | Red Chillies Entertainment | Entertainment | 2023-07-10 04:58:09 | SRK\|Shah rukh khan\|shahruh khan\|Srk movies\|red... | 51798724 |
| 213788 | 정국 (Jung Kook) 'Seven (feat. Latto)' Official MV | HYBE LABELS | Music | 2023-07-14 04:00:00 | HYBE\|HYBE LABELS\|하이브\|하이브레이블즈\|정국\|Jung Kook\|Seven | 41185831 |
| 211790 | #RockyAurRaniKiiPremKahaani - OFFICIAL TRAILER... | Dharma Productions | Film & Animation | 2023-07-04 06:30:09 | rocky aur rani\|ranveer singh\|ranveer singh new... | 36842906 |
| 212588 | Train Vs Giant Pit | MrBeast | Entertainment | 2023-07-08 16:00:00 | [None] | 33142241 |
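
The "this month" filter compares `published_at` against `datetime.today()`, so the table above depends on the day the notebook is run. A minimal sketch of the same 31-day cutoff with the reference date pinned, so the result is repeatable; the helper `published_within` and the toy rows are hypothetical.

```python
from datetime import datetime, timedelta

import pandas as pd

# Made-up videos with publication timestamps.
toy = pd.DataFrame({
    'title': ['old', 'recent', 'edge'],
    'published_at': pd.to_datetime([
        '2023-05-01 12:00:00',
        '2023-07-10 04:58:09',
        '2023-06-14 00:00:00',
    ]),
})


def published_within(df, days, now):
    """Keep rows published no more than `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return df[df.published_at >= cutoff]


# Pinning the reference date makes the filter independent of the run date.
reference = datetime(2023, 7, 15)
recent = published_within(toy, days=31, now=reference)
print(recent)
```

Because the comparison uses `>=`, a video published exactly at the cutoff (the `edge` row) is still included.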


In [6]:
# get popular channels
channels = data.groupby('channel_title').agg(category=('category', lambda x: x.mode()[0]), video_count=('title', 'count'), total_views=('view_count', 'sum'))
channels.sort_values('total_views', ascending=False, inplace=True)

display(Markdown('### Top 10 Most Popular Channels'))
display(channels.head(10))

# get popular categories
categories = data.groupby('category').agg(top_channel=('channel_title', lambda x: x.mode()[0]), video_count=('title', 'count'), total_views=('view_count', 'sum'))
categories.sort_values('total_views', ascending=False, inplace=True)

display(Markdown('### Top 10 Most Popular Categories'))
display(categories.head(10))

### Top 10 Most Popular Channels

| channel_title | category | video_count | total_views |
| --- | --- | --- | --- |
| MrBeast | Entertainment | 65 | 1291342767 |
| NBA | Sports | 367 | 719057959 |
| HYBE LABELS | Music | 67 | 703937472 |
| BLACKPINK | Music | 57 | 689071265 |
| SMTOWN | Music | 74 | 601840367 |
| NFL | Sports | 329 | 529529265 |
| JYP Entertainment | Music | 75 | 501748170 |
| BANGTANTV | Music | 74 | 492688501 |
| MrBeast Gaming | Gaming | 77 | 477439867 |
| SSSniperWolf | Entertainment | 117 | 441121235 |


### Top 10 Most Popular Categories

| category | top_channel | video_count | total_views |
| --- | --- | --- | --- |
| Entertainment | SSSniperWolf | 7609 | 11319529027 |
| Music | JYP Entertainment | 6126 | 10805752685 |
| Gaming | SSundee | 7706 | 7705949654 |
| Sports | NBA | 4850 | 5430000082 |
| People & Blogs | Ryland vlogs | 3399 | 3274057443 |
| Film & Animation | The Film Theorists | 1474 | 2038739612 |
| Comedy | The Try Guys | 1926 | 1759602894 |
| Science & Technology | SpaceX | 1160 | 1638284204 |
| News & Politics | TODAY | 1423 | 1379830580 |
| Education | Veritasium | 909 | 845022489 |
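
The channel and category tables are built with pandas named aggregation, where each output column is declared as `name=(source_column, function)`. A small sketch of the same pattern on made-up rows, to show what each column means:

```python
import pandas as pd

# Made-up videos: channel 'A' has three videos, channel 'B' has one.
toy = pd.DataFrame({
    'channel_title': ['A', 'A', 'A', 'B'],
    'category': ['Music', 'Music', 'Comedy', 'Sports'],
    'title': ['t1', 't2', 't3', 't4'],
    'view_count': [10, 20, 5, 100],
})

channels = toy.groupby('channel_title').agg(
    # most frequent category among the channel's videos
    category=('category', lambda s: s.mode()[0]),
    # number of videos per channel
    video_count=('title', 'count'),
    # views summed over the channel's videos
    total_views=('view_count', 'sum'),
).sort_values('total_views', ascending=False)

print(channels)
```

`Series.mode()` returns its values in sorted order, so `[0]` picks the alphabetically first value when there is a tie. The same applies to `top_channel` in the category table: it is the channel with the most trending videos in that category, not the one with the most views.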


## Extract Keywords
Reduce videos to their relevant keywords: tokenize video titles and tags, remove stopwords, and lemmatize tokens.

In [7]:
# setup
wnl = WordNetLemmatizer()
punct = set(string.punctuation)
to_remove = set(stopwords.words('english')) | punct | set(['–', '—', '...'])

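# one row per token; rows are added with .loc below, so the row label is the token itself
token_freq = pd.DataFrame(columns=['token', 'video_count', 'view_count'])

# return the set of cleaned tokens for one video; as a side effect,
# update the occurrence count and total views of each token in token_freq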
def analyze_tokens(to_tokenize_views):
    to_tokenize, views = to_tokenize_views
    tokens = wordpunct_tokenize(to_tokenize.lower().replace('"', ''))
    cleaned = set()
    
    for t in tokens:
        # skip unwanted tokens
        if t in to_remove:
            continue
        
        # convert word to base form
        t = wnl.lemmatize(t)
        
        # track number of times each token appears
        # (rows are labeled by token, so membership is checked against the index)
        if t in token_freq.index:
            token_freq.loc[t, 'video_count'] += 1
            token_freq.loc[t, 'view_count'] += views
        else:
            token_freq.loc[t] = [t, 1, views]
            
        # add to output list of tokens
        cleaned.add(t)
    # end loop
        
    return cleaned
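
The function above increments a token's counters every time the token occurs, so a word that appears in both the title and the tags of one video is counted more than once. The sketch below shows one possible alternative that counts each keyword once per video. It uses only the standard library: the regular expression mirrors the one documented for NLTK's `wordpunct_tokenize`, a tiny hand-written stopword set stands in for NLTK's English list, every all-punctuation token is dropped, and lemmatization is skipped. All of these are simplifying assumptions, and `keywords`, `video_count`, and `view_count` are illustrative names.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for stopwords.words('english').
STOPWORDS = {'the', 'in', 'a', 'of'}
# Words, or runs of punctuation (same idea as wordpunct_tokenize).
TOKEN_PATTERN = re.compile(r'\w+|[^\w\s]+')
PUNCT_ONLY = re.compile(r'[^\w\s]+')


def keywords(text):
    """Return the set of lowercase keywords in a title/tag string."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return {
        t for t in tokens
        if t not in STOPWORDS and not PUNCT_ONLY.fullmatch(t)
    }


# Made-up (title, tags, views) triples.
videos = [
    ('realistic lava in minecraft', 'minecraft|lava|mod', 300),
    ('Minecraft, but the floor is lava', 'minecraft|challenge', 200),
]

video_count = Counter()  # number of videos containing each keyword
view_count = Counter()   # total views of those videos

for title, tags, views in videos:
    # Iterating over a set means each keyword is counted once per video.
    for token in keywords(title + ' ' + tags):
        video_count[token] += 1
        view_count[token] += views

print(video_count.most_common(3))
```

With this counting, `video_count` can never exceed the number of videos, which makes it easier to interpret than a raw occurrence count.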

In [8]:
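# tokenize title and tags
# note: the analyzer runs once per video in fit() and again in transform(), so every
# token occurrence is recorded twice in token_freq (video_count and view_count are doubled)
x = data.apply(lambda x: (x.title + ' ' + x.tags, x.view_count), axis=1)
transformer = CountVectorizer(analyzer=analyze_tokens).fit(x)
x = transformer.transform(x)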
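
Passing a callable as `analyzer` means `CountVectorizer` hands each raw document (here a `(text, views)` tuple) to that function and uses whatever tokens it returns. The sketch below, on made-up documents, counts how often the analyzer is invoked: calling `fit` and then `transform` runs it twice per document, while `fit_transform` runs it once. `counting_analyzer` and `calls` are hypothetical names, and the whitespace split is a stand-in for the NLTK pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

calls = {'n': 0}


def counting_analyzer(doc):
    """Split the text part of a (text, views) tuple and count invocations."""
    calls['n'] += 1
    text, _views = doc
    return set(text.lower().split())


docs = [
    ('lava in minecraft', 10),
    ('minecraft mod', 20),
    ('nba highlights', 30),
]

# Two passes over the documents: one in fit(), one in transform().
vectorizer = CountVectorizer(analyzer=counting_analyzer).fit(docs)
matrix = vectorizer.transform(docs)
calls_fit_then_transform = calls['n']

# One pass: fit_transform() learns the vocabulary and builds the matrix together.
calls['n'] = 0
matrix_single_pass = CountVectorizer(analyzer=counting_analyzer).fit_transform(docs)
calls_fit_transform = calls['n']

print(calls_fit_then_transform, calls_fit_transform)
print(sorted(vectorizer.vocabulary_))
```

Both routes produce the same document-term matrix. The difference only matters when the analyzer has side effects, such as updating a frequency table.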

Look at the most common keywords in trending videos.

In [9]:
# add a column for average views per video
token_freq['avg_views'] = token_freq.view_count / token_freq.video_count

# explore common keywords
token_freq.sort_values('video_count', ascending=False, inplace=True)
display(Markdown('### Top 10 Most Common Keywords'))
display(token_freq.head(10))

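# get videos with most common keyword
# (tokens are lowercase and the match is case-sensitive, so only titles
# containing the lowercase keyword are selected)
top_token = token_freq.token.iloc[0]
display(Markdown(f'### Top {top_token.capitalize()} Videos\nThe most common keyword was {top_token}.'))
display(data[[(top_token in title) for title in data.title]].head())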

### Top 10 Most Common Keywords

| | token | video_count | view_count | avg_views |
| --- | --- | --- | --- | --- |
| minecraft | minecraft | 34830 | 33580390908 | 964122.6 |
| video | video | 24278 | 34699189288 | 1429244.0 |
| game | game | 22004 | 26083874834 | 1185415.0 |
| new | new | 19028 | 22807741340 | 1198641.0 |
| v | v | 17784 | 23170427768 | 1302881.0 |
| highlight | highlight | 17544 | 20988444796 | 1196332.0 |
| official | official | 14652 | 23290492236 | 1589578.0 |
| music | music | 13432 | 19947386876 | 1485065.0 |
| none | none | 12852 | 18813573170 | 1463863.0 |
| fortnite | fortnite | 12404 | 9355179560 | 754206.7 |


### Top Minecraft Videos
The most common keyword was minecraft.

| | title | channel_title | category | published_at | tags | view_count |
| --- | --- | --- | --- | --- | --- | --- |
| 83636 | realistic lava vs water in minecraft | steveee | Gaming | 2021-09-27 07:00:10 | minecraft\|realistic\|physics\|water\|shaders\|mine... | 4264951 |
| 81799 | realistic lava in minecraft | steveee | Gaming | 2021-09-18 07:00:30 | minecraft\|realistic\|physics\|water\|snapshot\|mod... | 3870630 |
| 122015 | when minecraft removed the inventory... (april... | camman18 | Entertainment | 2022-04-09 15:00:01 | camman18\|camman18 minecraft\|minecraft\|minecraf... | 2885109 |
| 119244 | revisiting old minecraft textures | camman18 | Entertainment | 2022-03-26 15:00:17 | camman18\|camman18 minecraft\|minecraft\|minecraf... | 2274038 |
| 116391 | what if minecraft didn't have wood... | camman18 | Entertainment | 2022-03-12 16:00:21 | camman18\|camman18 minecraft\|minecraft\|minecraf... | 2224013 |
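
Every title in the table above is lowercase because the keyword is lowercase and Python's `in` test is case-sensitive, so a title such as "Minecraft Speedrun" would not match. If a case-insensitive match is wanted, `Series.str.contains` with `case=False` is one option. A minimal sketch on made-up titles:

```python
import pandas as pd

# Made-up titles: two mention the keyword, one with a capital letter.
toy = pd.DataFrame({'title': [
    'realistic lava in minecraft',
    'Minecraft Speedrun',
    'NBA highlights',
]})
keyword = 'minecraft'

# Case-sensitive substring test, as in the cell above.
case_sensitive = toy[[keyword in title for title in toy.title]]

# Case-insensitive literal match (regex=False treats the keyword as plain text).
case_insensitive = toy[toy.title.str.contains(keyword, case=False, regex=False)]

print(len(case_sensitive), len(case_insensitive))
```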


## Classify videos
Train a model to predict a video's category using its extracted tokens.

In [20]:
# train model
y = data.category
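# SGDClassifier accepts the sparse matrix directly, so no dense conversion is needed
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
# with its default hinge loss, SGDClassifier fits a linear SVM
model = SGDClassifier()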
model.fit(x_train, y_train)

# test accuracy of model
predictions = model.predict(x_test)
print(f'Classifies videos with {round(accuracy_score(y_test, predictions) * 100, 2)} % accuracy')



Classifies videos with 80.55 % accuracy
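
The reported accuracy varies from run to run because neither the split nor the classifier is seeded. The sketch below shows the same train/test flow on a tiny made-up corpus with fixed seeds and a stratified split. The texts, labels, and seed values are illustrative assumptions, and the default `CountVectorizer` tokenizer replaces the NLTK analyzer to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up titles with made-up category labels.
texts = [
    'minecraft lava mod', 'fortnite season trailer',
    'minecraft speedrun record', 'fortnite live event',
    'nba finals highlights', 'nfl week one highlights',
    'nba draft recap', 'nfl playoffs recap',
]
labels = ['Gaming'] * 4 + ['Sports'] * 4

# Sparse document-term matrix; it is passed to the model without densifying.
features = CountVectorizer().fit_transform(texts)

# random_state makes the split repeatable; stratify keeps class proportions.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.5, random_state=0, stratify=labels
)

# loss='hinge' is the default and corresponds to a linear SVM.
model = SGDClassifier(loss='hinge', random_state=0)
model.fit(x_train, y_train)

predictions = model.predict(x_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy on the toy split: {round(accuracy * 100, 2)}%')
```

On a corpus this small the accuracy figure is not meaningful. The point is only that seeding the split and the model makes the number repeatable between runs.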
