In [1]:
import pandas as pd
import json
import os

For the combined channel videos dataset, we apply a percentile-based cleaning technique similar to the one used for the trending dataset, but this time only the bottom 10% is cut off (those videos are unlikely to be trending as of Dec. 2024).

In [2]:
channel_file_path = "C:/Users/TKN/Downloads/New-Youtube-Scraper-v3/data/combined_channel_data.json"
with open(channel_file_path, "r", encoding="utf-8") as f:
    channel_videos = json.load(f)

# Each top-level key is a video ID, so it becomes the DataFrame index.
channel_df = pd.DataFrame.from_dict(channel_videos, orient="index")

In [3]:
channel_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17651 entries, V0CniCFbxLs to uF1YHaeAHEw
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fetchedDate      17651 non-null  object 
 1   publishedAt      17651 non-null  object 
 2   elapsedDays      17651 non-null  float64
 3   title            17651 non-null  object 
 4   description      17651 non-null  object 
 5   channelTitle     17651 non-null  object 
 6   tags             13662 non-null  object 
 7   category         17651 non-null  object 
 8   duration         17651 non-null  object 
 9   licensedContent  17651 non-null  bool   
 10  viewCount        17651 non-null  int64  
 11  avgDailyViews    17651 non-null  float64
 12  likeCount        17651 non-null  int64  
 13  commentCount     17651 non-null  int64  
 14  engagementRate   17651 non-null  float64
 15  topicCategories  17651 non-null  object 
dtypes: bool(1), float64(3), int64(3), object(9)
mem

`view_stats` function to view the ten lowest values in each of the key columns.

In [4]:
def view_stats(column_name):
    # Print the ten smallest values in the given column.
    sorted_values = sorted(channel_df[column_name].tolist())
    print(sorted_values[:10])

In [5]:
view_stats('viewCount')
view_stats('likeCount')
view_stats('commentCount')
view_stats('engagementRate')
view_stats('avgDailyViews')

[112, 114, 117, 119, 136, 140, 171, 171, 173, 176]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.02, 0.02, 0.02, 0.02, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03]


`view_lowest` function to view the video with the lowest value in a given column (view count, like count, comment count, average daily views, or engagement rate).

In [6]:
def view_lowest(column_name):
    # Select the row holding the minimum value of the given column.
    lowest_row = channel_df.loc[channel_df[column_name].idxmin()]
    print(f"Title: {lowest_row['title']}")
    print(f"Published At: {lowest_row['publishedAt']}")
    print(f"View Count: {lowest_row['viewCount']}")
    print(f"Like Count: {lowest_row['likeCount']}")
    print(f"Comment Count: {lowest_row['commentCount']}")
    print(f"Avg Daily Views: {lowest_row['avgDailyViews']}")
    print(f"Engagement Rate: {lowest_row['engagementRate']}")

In [37]:
view_lowest('viewCount')

Title: Rustin LA stage
Published At: 2010-06-26T20:54:32Z
View Count: 112
Like Count: 5
Comment Count: 0
Avg Daily Views: 0.02
Engagement Rate: 0.04464


In [4]:
percentiles = channel_df[['viewCount', 'likeCount', 'commentCount', 'engagementRate', 'avgDailyViews']].quantile(0.1)
print("10th percentiles:")
print(percentiles)

10th percentiles:
viewCount         89309.0000
likeCount          2298.0000
commentCount         70.0000
engagementRate        0.0077
avgDailyViews       170.6000
Name: 0.1, dtype: float64


We use rounded thresholds for the filtered dataset (using the exact percentile values instead would not change the number of data points). Only the videos that satisfy all the conditions remain in the dataset.

This time we lower the `engagementRate` threshold for the channel videos dataset to 0.001, since many videos (mostly music videos and Shorts) have a low `engagementRate`. 0.001 is taken here as the industry-standard engagement rate for large channels (around 1-10 million views).

In [5]:
channel_df = channel_df[
    (channel_df["viewCount"] >= 90000)
    & (channel_df["likeCount"] >= 2300)
    & (channel_df["commentCount"] >= 70)
    & (channel_df["engagementRate"] >= 0.001)
    & (channel_df["avgDailyViews"] >= 170)
].copy()  # copy so later column assignments do not act on a view

In [6]:
channel_df.shape[0]

14744
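The same all-conditions filter can also be expressed with a dictionary of minimum values, which keeps the thresholds in one place. This is only a minimal sketch on a toy frame: the helper name `filter_by_minimums` and the sample rows are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd


def filter_by_minimums(df, minimums):
    """Keep only rows where every listed column is >= its minimum."""
    mask = pd.Series(True, index=df.index)
    for column, minimum in minimums.items():
        mask &= df[column] >= minimum
    return df[mask].copy()


# Toy data (hypothetical values), indexed by made-up video IDs.
toy_df = pd.DataFrame(
    {
        "viewCount": [120_000, 50_000, 95_000],
        "likeCount": [3_000, 4_000, 2_000],
        "commentCount": [80, 90, 100],
        "engagementRate": [0.02, 0.08, 0.02],
        "avgDailyViews": [200.0, 300.0, 400.0],
    },
    index=["vid_a", "vid_b", "vid_c"],
)

# Same rounded thresholds as in the filter cell.
minimums = {
    "viewCount": 90_000,
    "likeCount": 2_300,
    "commentCount": 70,
    "engagementRate": 0.001,
    "avgDailyViews": 170,
}

filtered = filter_by_minimums(toy_df, minimums)
print(filtered.index.tolist())  # only vid_a meets every minimum
```

Because the conditions are combined with AND, the filter removes more than 10% of the rows: 14744 of the original 17651 videos remain, about 83.5%.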

#### Assigning trending labels

As with the trending dataset, we will assign both `isTrending` and `trendingPercentile`, but `isTrending` has to be assigned first. The result is a dataset in which `isTrending` is either `1` or `0`.

The videos labeled `1` (trending) are then separated from the videos labeled `0`.

Videos labeled `1` are concatenated with the trending videos so that `trendingPercentile` can be assigned for use in prediction.

Videos labeled `0` may also get a `nonTrendingPercentile`, where the top means not trending but still significant, and the bottom means not trending and of little significance.
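One way the separation, concatenation, and percentile steps could look is sketched below. It assumes the percentile is a rank over `avgDailyViews`, which the notebook does not specify; `toy_trending_df`, `toy_channel_df`, and all sample values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the trending dataset and the labeled channel dataset.
toy_trending_df = pd.DataFrame(
    {"avgDailyViews": [500_000.0, 150_000.0], "isTrending": [1, 1]},
    index=["t1", "t2"],
)
toy_channel_df = pd.DataFrame(
    {"avgDailyViews": [300_000.0, 900.0, 200.0], "isTrending": [1, 0, 0]},
    index=["c1", "c2", "c3"],
)

# Separate the channel videos by label.
channel_trending = toy_channel_df[toy_channel_df["isTrending"] == 1]
channel_non_trending = toy_channel_df[toy_channel_df["isTrending"] == 0].copy()

# Trending channel videos join the trending dataset before ranking.
all_trending = pd.concat([toy_trending_df, channel_trending])
all_trending["trendingPercentile"] = all_trending["avgDailyViews"].rank(pct=True)

# Non-trending videos are ranked among themselves (assumed metric: avgDailyViews).
channel_non_trending["nonTrendingPercentile"] = (
    channel_non_trending["avgDailyViews"].rank(pct=True)
)

print(all_trending)
print(channel_non_trending)
```

With `rank(pct=True)`, the highest value in each group gets a percentile of 1.0, so the two percentile columns are only comparable within their own group.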

From the trending dataset, we obtained the following benchmarks for trending videos:
```
Trending Benchmarks:
   isTrending  avgDailyViews  viewCount  likeCount  commentCount
0           1      102733.75     397425       9619          1086
```
Videos that satisfy all of these conditions are marked as trending (`1`); all others are marked as not trending (`0`).

In [8]:
trending_benchmarks = {
    'avgDailyViews': 102733.75,
    'viewCount': 397425,
    'likeCount': 9619,
    'commentCount': 1086
}

channel_df['isTrending'] = (
    (channel_df['avgDailyViews'] >= trending_benchmarks['avgDailyViews']) &
    (channel_df['viewCount'] >= trending_benchmarks['viewCount']) &
    (channel_df['likeCount'] >= trending_benchmarks['likeCount']) &
    (channel_df['commentCount'] >= trending_benchmarks['commentCount'])
).astype(int)

In [9]:
trending_count_df = channel_df[channel_df['isTrending'] == 1]

trending_count_df.shape[0]  # number of channel videos marked as trending

613
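To see which benchmark is the most restrictive, each condition can be counted on its own before they are combined. The following is a minimal sketch on toy data: the sample rows are hypothetical, while the benchmark values are the ones from the trending dataset.

```python
import pandas as pd

trending_benchmarks = {
    "avgDailyViews": 102733.75,
    "viewCount": 397425,
    "likeCount": 9619,
    "commentCount": 1086,
}

# Toy data (hypothetical values), indexed by made-up video IDs.
toy_df = pd.DataFrame(
    {
        "avgDailyViews": [150_000.0, 500.0, 120_000.0],
        "viewCount": [2_000_000, 900_000, 450_000],
        "likeCount": [50_000, 20_000, 10_000],
        "commentCount": [3_000, 1_500, 400],
    },
    index=["vid_a", "vid_b", "vid_c"],
)

# One boolean column per benchmark: True where the video meets it.
passes = pd.DataFrame(
    {col: toy_df[col] >= minimum for col, minimum in trending_benchmarks.items()}
)

print(passes.sum())              # videos passing each condition separately
print(passes.all(axis=1).sum())  # videos passing all conditions together
```

In this toy frame, `avgDailyViews` and `commentCount` each exclude one video, so only one video passes all four conditions.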

In [10]:
channel_df_rep = channel_df[['publishedAt','elapsedDays', 'title', 'channelTitle', 
                           'category','topicCategories', 'duration', 'licensedContent',
                           'viewCount', 'likeCount', 'commentCount', 'avgDailyViews',
                           'engagementRate', 'isTrending']]
channel_df_rep.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14744 entries, V0CniCFbxLs to 9Nx849WhPFc
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   publishedAt      14744 non-null  object 
 1   elapsedDays      14744 non-null  float64
 2   title            14744 non-null  object 
 3   channelTitle     14744 non-null  object 
 4   category         14744 non-null  object 
 5   topicCategories  14744 non-null  object 
 6   duration         14744 non-null  object 
 7   licensedContent  14744 non-null  bool   
 8   viewCount        14744 non-null  int64  
 9   likeCount        14744 non-null  int64  
 10  commentCount     14744 non-null  int64  
 11  avgDailyViews    14744 non-null  float64
 12  engagementRate   14744 non-null  float64
 13  isTrending       14744 non-null  int32  
dtypes: bool(1), float64(3), int32(1), int64(3), object(6)
memory usage: 2.0+ MB


In [17]:
output_dir_2 = 'C:/Users/TKN/Downloads/New-Youtube-Scraper-v3/data/yt_processed_data'
combined_json = channel_df.to_json(orient="index", force_ascii=False, indent=4)

with open(os.path.join(output_dir_2, "processed_channel_videos.json"), "w", encoding="utf-8") as f:
    f.write(combined_json)    
print(f"Combined channel_data JSON saved to {output_dir_2}.")

Combined channel_data JSON saved to C:/Users/TKN/Downloads/New-Youtube-Scraper-v3/data/yt_processed_data.
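A quick round-trip check can confirm that the saved file loads back the same way the input was loaded. This is a minimal sketch that writes a toy frame to a temporary directory instead of the real output folder; the sample rows are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

# Toy data (hypothetical values), including a non-ASCII title.
toy_df = pd.DataFrame(
    {
        "title": ["Vidéo une", "Video two"],
        "viewCount": [120_000, 95_000],
        "isTrending": [1, 0],
    },
    index=["vid_a", "vid_b"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_channel_videos.json")

    # Write with the same options as the export cell.
    with open(path, "w", encoding="utf-8") as f:
        f.write(toy_df.to_json(orient="index", force_ascii=False, indent=4))

    # Read back the same way the notebook loads its input file.
    with open(path, "r", encoding="utf-8") as f:
        reloaded = pd.DataFrame.from_dict(json.load(f), orient="index")

print(reloaded.shape == toy_df.shape)
print(reloaded.loc["vid_a", "title"])
```

With `force_ascii=False`, accented and other non-ASCII characters in titles are written as-is, which is why both the write and the read specify `encoding="utf-8"`.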
