In [1]:
import pandas as pd
import json
import os

Data exploration on US_trending_videos.json to further clean it

In [2]:
trending_file_path = 'C:/Users/TKN/Downloads/New-Youtube-Scraper-v3/data/24.11.12_US_trending_videos.json'
with open(trending_file_path, "r", encoding="utf-8") as f:
    trending_videos = json.load(f)

trending_df = pd.DataFrame.from_dict(trending_videos, orient="index")

In [3]:
trending_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, mcvLKldPM08 to ccTtAOI_kP4
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fetchedDate      200 non-null    object 
 1   publishedAt      200 non-null    object 
 2   elapsedDays      200 non-null    float64
 3   title            200 non-null    object 
 4   description      200 non-null    object 
 5   channelTitle     200 non-null    object 
 6   channelId        200 non-null    object 
 7   tags             150 non-null    object 
 8   category         200 non-null    object 
 9   duration         200 non-null    object 
 10  licensedContent  200 non-null    bool   
 11  viewCount        200 non-null    int64  
 12  avgDailyViews    200 non-null    float64
 13  likeCount        200 non-null    int64  
 14  commentCount     200 non-null    int64  
 15  topicCategories  200 non-null    object 
dtypes: bool(1), float64(2), int64(3), object(10)
memo

Drop `channelId`, since we no longer need it

In [4]:
trending_df = trending_df.drop(columns=['channelId'])

Drop the MrBeast rows for a more representative trending analysis. Since all of his videos reach Trending with top-of-the-line numbers in every metric, they are treated as outliers.

In [5]:
trending_df = trending_df[~trending_df['channelTitle'].isin(['MrBeast', 'MrBeast 2'])]

trending_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 197 entries, mcvLKldPM08 to ccTtAOI_kP4
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fetchedDate      197 non-null    object 
 1   publishedAt      197 non-null    object 
 2   elapsedDays      197 non-null    float64
 3   title            197 non-null    object 
 4   description      197 non-null    object 
 5   channelTitle     197 non-null    object 
 6   tags             150 non-null    object 
 7   category         197 non-null    object 
 8   duration         197 non-null    object 
 9   licensedContent  197 non-null    bool   
 10  viewCount        197 non-null    int64  
 11  avgDailyViews    197 non-null    float64
 12  likeCount        197 non-null    int64  
 13  commentCount     197 non-null    int64  
 14  topicCategories  197 non-null    object 
dtypes: bool(1), float64(2), int64(3), object(9)
memory usage: 23.3+ KB


Goal: Determine what influences YouTube's Trending page (Trending criteria) and the Trending properties of a video in the US

#### Determine trending criteria
We want only the videos that satisfy the 'trending' definition:

##### Gains a significant amount of attention over a short amount of time.

For this, we have the `avgDailyViews` column. It is the number of views gained per day for each video: `viewCount` (the view count of the video up to the day the data was collected) divided by `elapsedDays` (the number of days elapsed from the day the video was published to the day the data was collected, 11/12).

This number determines the 'amount of attention' a video received over time. Since the YouTube Data API doesn't facilitate continuous crawling (and our quota is limited), we assume a uniform distribution of views per day for all videos (by taking the mean).

Doing this gives weight to old videos that received massive attention in the past (popular videos). For example, a video posted 6 years ago with 2 billion views has an `avgDailyViews` of around 0.9 million (2 billion / 2,190 days). In our case, it is equivalent to a video posted 1 day ago with 0.9 million views (which would be on Trending).

This metric also 'punishes' videos that were popular in the past. For example, a video posted 10 years ago with 300 million views may sound popular, but it has an `avgDailyViews` of only around 80k. As a result, it may not be on Trending today.
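The metric described above can be sketched in a few lines. This is an illustrative sketch only: the helper name `avg_daily_views` is hypothetical (the dataset's `avgDailyViews` column is already computed upstream), and a year is assumed to be 365 days.

```python
def avg_daily_views(view_count: int, elapsed_days: float) -> float:
    """Mean views per day, assuming views are spread uniformly over time."""
    return round(view_count / elapsed_days, 2)


# A 10-year-old video with 300 million views (assuming 365 days per year)
old_video = avg_daily_views(300_000_000, 10 * 365)

# A 1-day-old video with 1.1 million views
new_video = avg_daily_views(1_100_000, 1)

print(f"Old video: {old_video:,.2f} views/day")
print(f"New video: {new_video:,.2f} views/day")
```

Under this metric the old video scores far lower than the new one, even though its total view count is much higher.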

This metric, together with `viewCount`, will be the most significant factor when determining which videos can trend (be featured on YouTube's Trending page). From examining the Trending pages of different countries, Trending videos tend to have a significant `viewCount` over a short time (the Trending page mainly has videos posted within the past week). The Trending page appears to consider only the video itself, because even channels with relatively few subscribers and low overall view counts can still appear on it.

##### Has a good engagement rate.

`likeCount` and `commentCount` will also be considered to determine whether the video has a good viewer base (good engagement), indicating genuine interest, via `engagementRate`:

$$\text{engagementRate} = \frac{\text{likeCount} + \text{commentCount}}{\text{viewCount}}$$

As stated above, view-related statistics are the most significant for Trending. However, many videos accumulate their views through artificial means (e.g. bots, ads), and those are less likely to represent trending content. Videos with high engagement are more likely to be discussed, which is vital to trending content. The industry benchmark for `engagementRate` is around 2%, but since we are analyzing US trending videos as of Dec.24, we will adjust it to ensure a fair rate across the dataset (quantile).

New metric to determine the cut-off: engagement rate. It ensures we only keep videos with a relatively good amount of engagement.

In [6]:
trending_df['engagementRate'] = ((trending_df['likeCount'] + trending_df['commentCount']) / trending_df['viewCount']).round(5)
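
As a worked example of the same calculation, the sketch below applies it to two hypothetical videos and compares the result with the roughly 2% benchmark mentioned above. The frame `toy`, its values, and the `aboveBenchmark` column are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical videos, for illustration only
toy = pd.DataFrame(
    {
        "viewCount": [1_000_000, 500_000],
        "likeCount": [28_000, 2_000],
        "commentCount": [2_000, 500],
    },
    index=["video_a", "video_b"],
)

# Same formula as the notebook cell: (likes + comments) / views
toy["engagementRate"] = (
    (toy["likeCount"] + toy["commentCount"]) / toy["viewCount"]
).round(5)

# Compare against the ~2% benchmark
toy["aboveBenchmark"] = toy["engagementRate"] >= 0.02

print(toy[["engagementRate", "aboveBenchmark"]])
```

Here `video_a` has 30,000 interactions on 1,000,000 views (3%), while `video_b` has 2,500 on 500,000 views (0.5%).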

`view_stats` function to view the 10 lowest values in each of the vital columns.

In [7]:
def view_stats(column_name):
    # Print the 10 lowest values of the column (avoids shadowing the built-in `list`)
    lowest = sorted(trending_df[column_name].tolist())
    print(lowest[:10])

In [8]:
view_stats('viewCount')
view_stats('likeCount')
view_stats('commentCount')
view_stats('engagementRate')
view_stats('avgDailyViews')

[29676, 71511, 95789, 96578, 113637, 121701, 132751, 137043, 139946, 146784]
[0, 0, 564, 1322, 1412, 1448, 2243, 2381, 2403, 2504]
[0, 0, 60, 155, 166, 210, 220, 285, 300, 373]
[0.0, 0.00084, 0.00253, 0.00415, 0.00447, 0.00614, 0.00675, 0.00684, 0.00799, 0.00904]
[7932.0, 15740.23, 17741.33, 21616.32, 26639.74, 31071.0, 37357.4, 37960.11, 39428.83, 47480.16]


`view_lowest` function to view the videos with the lowest interactions (view count, like count, comment count, average daily views)

In [8]:
def view_lowest(column_name):
    # Row holding the smallest value of the given column
    row = trending_df.loc[trending_df[column_name].idxmin()]
    print(f"Title: {row['title']}")
    print(f"Published At: {row['publishedAt']}")
    print(f"View Count: {row['viewCount']}")
    print(f"Like Count: {row['likeCount']}")
    print(f"Comment Count: {row['commentCount']}")
    print(f"Avg Daily Views: {row['avgDailyViews']}")
    print(f"Engagement Rate: {row['engagementRate']}")

In [None]:
view_lowest('viewCount')

#### Cut-off criteria

To further filter the dataset and ensure only the most relevant videos qualify as trending, we will use percentile-based cut-offs for key metrics. Specifically, we will eliminate videos that fall below the 20th percentile in any of the following features: `'viewCount', 'likeCount', 'commentCount', 'engagementRate', 'avgDailyViews'`.

In [9]:
percentiles = trending_df[['viewCount', 'likeCount', 'commentCount', 'engagementRate', 'avgDailyViews']].quantile(0.2)
print("20th percentiles:")
print(percentiles)

20th percentiles:
viewCount         390311.600000
likeCount           8898.800000
commentCount         972.000000
engagementRate         0.016104
avgDailyViews     117688.160000
Name: 0.2, dtype: float64


We use rounded thresholds for the filtered dataset (using the exact percentile values would not change the number of data points). Only the videos that satisfy all of the conditions remain in the dataset, ensuring that all videos are suitable representations of trending topics.

In [10]:
trending_df = trending_df[
    (trending_df["viewCount"] >= 390000)
    & (trending_df['likeCount'] >= 9000)
    & (trending_df['commentCount'] >= 1000)
    & (trending_df['engagementRate'] >= 0.015)
    & (trending_df['avgDailyViews'] >= 100000)
]

In [11]:
trending_df.shape[0]

108
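
For comparison, the filter could also be driven directly by the computed percentiles instead of hand-rounded thresholds. The sketch below shows that variant on a small made-up frame; `toy`, its values, and the two-column `metrics` list are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame(
    {
        "viewCount": [100, 200, 300, 400, 500],
        "likeCount": [50, 10, 30, 40, 20],
    }
)
metrics = ["viewCount", "likeCount"]

# 20th percentile of each metric (a Series indexed by column name)
thresholds = toy[metrics].quantile(0.2)

# Keep only rows that meet or exceed the threshold in every metric;
# the comparison aligns the Series index with the DataFrame columns
mask = (toy[metrics] >= thresholds).all(axis=1)
filtered = toy[mask]

print(thresholds)
print(filtered)
```

Because a row must pass every metric at once, the share of rows removed is usually larger than 20%, which matches the drop from 197 to 108 videos seen above.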

#### Assigning trending labels

We will add one more property to the dataset (to serve the prediction part):

- `isTrending`: Indicates whether a video is featured on the Trending page. `1` for yes, `0` otherwise. All videos in this dataset will be marked as `1` (since they're all featured).

- The lowest values among the remaining videos for each criterion will be the benchmark to determine whether a video from the channel dataset will be on Trending.

#### Deprecated
- `trendingPercentile`: If a video is on Trending, which percentile it belongs to. We will order the dataset by `avgDailyViews`, then `viewCount`, `likeCount` and `commentCount` (the major indicators of Trending videos), bin the dataset into 10 parts, then assign 10 different percentiles, from `0.05` (top 95%) to `0.95` (top 5%).

In [12]:
trending_df['isTrending'] = 1   

In [13]:
benchmarks = trending_df.groupby('isTrending').agg({
    'avgDailyViews': 'min',
    'viewCount': 'min',
    'likeCount': 'min',
    'commentCount': 'min'
}).reset_index()

print("Trending Benchmarks:")
print(benchmarks)

Trending Benchmarks:
   isTrending  avgDailyViews  viewCount  likeCount  commentCount
0           1      102733.75     397425       9619          1086
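
One possible way to apply these benchmarks to the channel dataset is sketched below. The benchmark values are copied from the output above; the frame `channel_df` and its two videos are hypothetical, and the rule "meets every minimum" is an assumption about how the benchmark would be used.

```python
import pandas as pd

# Minimum values taken from the Trending Benchmarks output
benchmark = {
    "avgDailyViews": 102733.75,
    "viewCount": 397425,
    "likeCount": 9619,
    "commentCount": 1086,
}

# Hypothetical channel videos, for illustration only
channel_df = pd.DataFrame(
    {
        "avgDailyViews": [250_000.0, 50_000.0],
        "viewCount": [1_500_000, 900_000],
        "likeCount": [40_000, 20_000],
        "commentCount": [3_000, 1_500],
    },
    index=["video_a", "video_b"],
)

# A video is labeled trending only if it meets every benchmark minimum
meets_all = pd.Series(True, index=channel_df.index)
for column, minimum in benchmark.items():
    meets_all &= channel_df[column] >= minimum

channel_df["isTrending"] = meets_all.astype(int)
print(channel_df["isTrending"])
```

In this sketch `video_b` is labeled `0` because its `avgDailyViews` falls below the benchmark, even though its other metrics pass.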


In [14]:
trending_df_rep = trending_df[['publishedAt','elapsedDays', 'title', 'channelTitle', 
                           'category','topicCategories', 'duration', 'licensedContent',
                           'viewCount', 'likeCount', 'commentCount', 'avgDailyViews',
                           'engagementRate', 'isTrending']]
trending_df_rep.info()

<class 'pandas.core.frame.DataFrame'>
Index: 108 entries, mcvLKldPM08 to LlBzNiQeeXM
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   publishedAt      108 non-null    object 
 1   elapsedDays      108 non-null    float64
 2   title            108 non-null    object 
 3   channelTitle     108 non-null    object 
 4   category         108 non-null    object 
 5   topicCategories  108 non-null    object 
 6   duration         108 non-null    object 
 7   licensedContent  108 non-null    bool   
 8   viewCount        108 non-null    int64  
 9   likeCount        108 non-null    int64  
 10  commentCount     108 non-null    int64  
 11  avgDailyViews    108 non-null    float64
 12  engagementRate   108 non-null    float64
 13  isTrending       108 non-null    int64  
dtypes: bool(1), float64(3), int64(4), object(6)
memory usage: 11.9+ KB


In [18]:
output_dir_2 = 'C:/Users/TKN/Downloads/New-Youtube-Scraper-v3/data/yt_processed_data'
combined_json = trending_df.to_json(orient="index", force_ascii=False, indent=4)

with open(os.path.join(output_dir_2, "processed_US_trending_data.json"), "w", encoding="utf-8") as f:
    f.write(combined_json)    
print(f"Combined channel_data JSON saved to {output_dir_2}.")

Combined channel_data JSON saved to C:/Users/TKN/Downloads/New-Youtube-Scraper-v2_turn-in - Copy/data/yt_processed_data.
