#### imports

In [1]:
import pandas as pd
import numpy as np
import polars as pl
from matplotlib import pyplot as plt
from analysis_functions import *

### structure of different datasets

#### df_channels_en.tsv

TSV (tab-separated) file with the following structure:
| **category_cc** | **join_date** | **channel**              | **name_cc**             | **subscribers_cc** | **videos_cc** | **subscriber_rank_sb** | **weights** |
|-----------------|---------------|--------------------------|-------------------------|--------------------|---------------|-----------------------|-------------|
| News & Politics | 2013-03-11    | UCcRgZlgsk5m-aDQa_d6BTkQ | NorthWestLibertyNews... | 16700              | 845           | 639043.0              | 10.0035     |
| Gaming          |    2012-01-15 | UCnnXR0VIJVpeL1wEr-bBaRw | Felix Guaman            | 112000             | 703           | 137318.0              | 5.4915      |

With:
- `category_cc`: most frequent category of the channel. One of: ['Gaming', 'Education', 'Entertainment', 'Howto & Style', 'Sports', 'Music', 'Film and Animation', 'Comedy', 'Nonprofits & Activism', 'People & Blogs', 'News & Politics', 'Science & Technology', 'Pets & Animals', 'Autos & Vehicles', 'Travel & Events', nan]
- `join_date`: join date of the channel.
- `channel`: unique channel id.
- `name_cc`: name of the channel.
- `subscribers_cc`: number of subscribers.
- `videos_cc`: number of videos.
- `subscriber_rank_sb`: rank in terms of number of subscribers.
- `weights`: weights cal



#### yt_metadata_helper.feather   (yt_metadata_helper.feather.csv, filtered_yt_metadata_helper.feather.csv)
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>categories</th>
      <th>channel_id</th>
      <th>dislike_count</th>
      <th>display_id</th>
      <th>duration</th>
      <th>like_count</th>
      <th>upload_date</th>
      <th>view_count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Film &amp; Animation</td>
      <td>UCy6sWF4taso5GtrfDGhwpBA</td>
      <td>0.0</td>
      <td>EXOviJ_EJDo</td>
      <td>68</td>
      <td>0.0</td>
      <td>2011-12-07</td>
      <td>76.0</td>
    </tr>
    <tr>
      <td>Gaming</td>
      <td>UCEPYwwuGhgA9wfO2It11OXw</td>
      <td>0.0</td>
      <td>xSKA6VX7Tdo</td>
      <td>125</td>
      <td>6.0</td>
      <td>2016-10-04</td>
      <td>198.0</td>
    </tr>
    <tr>
      <td>News &amp; Politics</td>
      <td>UCojNA7ZvnmGuIvYnm44wl3Q</td>
      <td>NaN</td>
      <td>FsucWMijKA4</td>
      <td>130</td>
      <td>NaN</td>
      <td>2010-11-18</td>
      <td>106.0</td>
    </tr>
  </tbody>
</table>
</div>

With (values were crawled from YouTube between 2019-10-29 and 2019-11-23):
- `categories`: category of the video (chosen by the uploader when uploading to YouTube).
- `channel_id`: unique channel id.
- `dislike_count`: number of dislikes of the video.
- `display_id`: unique video id.
- `duration`: duration of the video.
- `like_count`: number of likes of the video.
- `upload_date`: upload date of the video.
- `view_count`: number of views of the video.


#### yt_metadata.jsonl
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>categories</th>
      <th>channel_id</th>
      <th>crawl_date</th>
      <th>description</th>
      <th>dislike_count</th>
      <th>display_id</th>
      <th>duration</th>
      <th>like_count</th>
      <th>tags</th>
      <th>title</th>
      <th>upload_date</th>
      <th>view_count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Film &amp; Animation</td>
      <td>UCy6sWF4taso5GtrfDGhwpBA</td>
      <td>2019-10-29</td>
      <td>description</td>
      <td>0.0</td>
      <td>EXOviJ_EJDo</td>
      <td>68</td>
      <td>0.0</td>
      <td>tags</td>
      <td>title</td>
      <td>2011-12-07</td>
      <td>76.0</td>
    </tr>
    <tr>
      <td>Gaming</td>
      <td>UCEPYwwuGhgA9wfO2It11OXw</td>
      <td>2019-10-29</td>
      <td>description</td>
      <td>0.0</td>
      <td>xSKA6VX7Tdo</td>
      <td>125</td>
      <td>6.0</td>
      <td>tags</td>
      <td>title</td>
      <td>2016-10-04</td>
      <td>198.0</td>
    </tr>
    <tr>
      <td>News &amp; Politics</td>
      <td>UCojNA7ZvnmGuIvYnm44wl3Q</td>
      <td>2019-10-29</td>
      <td>description</td>
      <td>NaN</td>
      <td>FsucWMijKA4</td>
      <td>130</td>
      <td>NaN</td>
      <td>tags</td>
      <td>title</td>
      <td>2010-11-18</td>
      <td>106.0</td>
    </tr>
  </tbody>
</table>
</div>

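With (values were crawled from YouTube between 2019-10-29 and 2019-11-23):
- `categories`: category of the video (chosen by the uploader when uploading to YouTube).
- `channel_id`: unique channel id.
- `crawl_date`: date on which the video was crawled.
- `description`: description of the video.
- `dislike_count`: number of dislikes of the video.
- `display_id`: unique video id.
- `duration`: duration of the video.
- `like_count`: number of likes of the video.
- `tags`: tags of the video.
- `title`: title of the video.
- `upload_date`: upload date of the video.
- `view_count`: number of views of the video.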

#### df_timeseries_en.tsv

| **channel**              | **category**       | **datetime** | **views**   | **delta_views** | **subs** | **delta_subs** | **videos** | **delta_videos** | **activity** |
|--------------------------|--------------------|--------------|-------------|-----------------|----------|----------------|------------|------------------|--------------|
| UCBJuEqXfXTdcPSbGO9qqn1g | Film and Animation | 2017-07-03 | 202495  |           0 |  650 |   0        |      5 |            0 |        3 |
| UCBJuEqXfXTdcPSbGO9qqn1g | Film and Animation | 2017-07-10 | 394086  |      191591 | 1046 | 396        |      6 |            1 |        1 |

With:
- `channel`: channel id.
- `category`: category of the channel as assigned by `socialblade.com` according to the last 10 videos at time of crawl.
- `datetime`: Week related to the data point.
- `views`: Total number of views the channel had this week.
- `delta_views`: Delta views obtained this week.
- `subs`: Total number of subscribers the channel had this week.
- `delta_subs`: Delta subscribers obtained this week.
- `videos`: Total number of videos the channel had this week.
- `delta_videos`: Delta videos obtained this week.
- `activity`: Number of videos published in the last 15 days.
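
The sample rows above suggest that each `delta_*` column is the difference between consecutive weekly totals of the same channel. A minimal sketch that checks this on the two rows shown, in plain Python (the helper name `weekly_deltas` is hypothetical):

```python
# Two consecutive weekly rows for one channel, copied from the sample table above.
rows = [
    {"datetime": "2017-07-03", "views": 202495, "subs": 650, "videos": 5},
    {"datetime": "2017-07-10", "views": 394086, "subs": 1046, "videos": 6},
]

def weekly_deltas(previous, current, columns=("views", "subs", "videos")):
    """Return the week-over-week change for each cumulative column."""
    return {f"delta_{col}": current[col] - previous[col] for col in columns}

deltas = weekly_deltas(rows[0], rows[1])
print(deltas)
```

The result matches the `delta_views`, `delta_subs` and `delta_videos` values of the second sample row.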


#### youtube_comments.tsv

| **author** | **video_id**      |  **likes** |  **replies** |
|------|--------------|-------|---------|
| 1      | Gkb1QMHrGvA   |  2     |  0       |
| 1      | CNtp0xqoods   |  0     |  0       |
| 1      | 249EEzQmVmQ   |  1     |  0       |

With (data obtained at crawl time between 2019-09-12 and 2019-09-17):
- `author`: anonymized, unique author id
- `video_id`: unique id of the video the comment was written on
- `likes`: likes for the comment
- `replies`: replies for the comment

In [None]:
# channels_df_f = pd.read_csv("../Youniverse/df_channels_en.tsv", sep="\t")

pl_df_f = pl.read_csv("../Youniverse/df_channels_en.tsv", separator="\t")

filtered_df_ch = pl_df_f.filter(pl.col("category_cc") == "News & Politics")
pl_df_f.sample(5)

In [None]:
# df_vd_f = pd.read_feather("../Youniverse/yt_metadata_helper.feather")

# df_vd_f.sample(5)

In [None]:
# save to csv
# df_vd_f.to_csv("../Youniverse/yt_metadata_helper.feather.csv", sep="\t")

In [None]:
pl_df_f.filter(pl.col("channel").is_in(filtered_df_ch["channel"]))

In [None]:
# load the video metadata in chunks and keep the videos whose channel is in filtered_df_ch
# use df.filter(pl.col("categories") == "News & Politics") instead to filter by video category
reader = pl.read_csv_batched(
    "../Youniverse/yt_metadata_helper.feather.csv",
    separator="\t",
    batch_size=5000
)


batches = reader.next_batches(5)  
i = 0
while batches:
    for df in batches:
        if i == 0:
            df.filter(pl.col("channel_id").is_in(filtered_df_ch["channel"])).write_csv("../Youniverse/filtered_yt_metadata_helper.feather.csv", include_header=True)
        else:
            with open("../Youniverse/filtered_yt_metadata_helper.feather.csv", "a") as fh:
                fh.write(df.filter(pl.col("channel_id").is_in(filtered_df_ch["channel"])).write_csv(file=None, include_header=False))
        i = i+1
        print(f"batch {i}\r", end='')
    batches = reader.next_batches(5)
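
The loop above streams a large file in batches, keeps only the rows whose channel is in the selection, and writes the header once. The same pattern can be sketched with the standard library, keeping a single output handle open instead of reopening the file for every batch. The data, the column names and the helper `filter_rows` are illustrative assumptions, not part of the dataset tooling:

```python
import csv
import io

def filter_rows(src, dst, wanted_channels, key="channel_id", delimiter="\t"):
    """Stream rows from src to dst, keeping rows whose key is in wanted_channels."""
    reader = csv.DictReader(src, delimiter=delimiter)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()          # the header is written exactly once
    kept = 0
    for row in reader:            # rows are read lazily, one at a time
        if row[key] in wanted_channels:
            writer.writerow(row)
            kept += 1
    return kept

# Tiny in-memory stand-in for the real TSV file.
source = io.StringIO("channel_id\tdisplay_id\nA\tv1\nB\tv2\nA\tv3\n")
target = io.StringIO()
n_kept = filter_rows(source, target, wanted_channels={"A"})
print(n_kept)
print(target.getvalue())
```

Passing the wanted channels as a `set` keeps each membership test cheap, which matters when the input has millions of rows.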

In [None]:
filtered_df_metadata_feather = pl.read_csv("../Youniverse/filtered_yt_metadata_helper.feather.csv")
filtered_df_metadata_feather.sample(5)

In [None]:
len(filtered_df_metadata_feather)

In [None]:
list(filtered_df_metadata_feather["categories"].unique())

In [None]:
filtered_df_metadata_feather["categories"].value_counts().sort(by="count", descending=True)

In [None]:
# stream the full metadata file and keep the videos whose channel is in filtered_df_ch
wanted_channels = set(filtered_df_ch["channel"].to_list())
chunks = pd.read_json("../Youniverse/yt_metadata_en.jsonl", lines=True, chunksize=500)
for i, c in enumerate(chunks):
    c = c[c["channel_id"].isin(wanted_channels)]
    if i == 0:
        print(c)
        c.to_csv("../Youniverse/filtered_yt_metadata.csv", header=True, index=False)
    else:
        with open("../Youniverse/filtered_yt_metadata.csv", "a") as fh:
            fh.write(c.to_csv(path_or_buf=None, header=False, index=False))
    print(f"batch {i} / 145390 \r", end="")

In [None]:
# load the comments in chunks and keep only the comments of a single video (test filter)
# the commented-out line instead keeps the comments of all videos in filtered_df_metadata_feather
batch_size = 10000
reader = pl.read_csv_batched(
    "../Youniverse/youtube_comments.tsv",
    separator="\t",
    batch_size=batch_size
)

# approximate number of batches, assuming about 8.6 billion comment rows
total_batches = 8_600_000_000 // batch_size

batches = reader.next_batches(5)  
i = 0
while batches:
    for df in batches:
        if i == 0:
            # df.filter(pl.col("video_id").is_in(filtered_df_metadata_feather["display_id"])).write_csv("../Youniverse/filtered_youtube_comments.tsv", include_header=True)
            df.filter(pl.col("video_id") == "fXN0ABkfZ7M").write_csv("../Youniverse/filtered_youtube_comments.tsv", include_header=True)
        else:
            with open("../Youniverse/filtered_youtube_comments.tsv", "a") as fh:
                fh.write(df.filter(pl.col("video_id") == "fXN0ABkfZ7M").write_csv(file=None, include_header=False))
        i = i+1
        print(f"batch {i} / {total_batches} \r", end='')
    batches = reader.next_batches(5)

In [None]:
# peek at the first batches of num_comments_authors.tsv
reader = pl.read_csv_batched(
    "../Youniverse/num_comments_authors.tsv",
    separator="\t",
    batch_size=5000
)  
batches = reader.next_batches(5)  
for df in batches:  
    print(df)

In [None]:
list(pl_df_f["category_cc"].unique())

## Filtering of activity

In [None]:
df_timeseries = pl.read_csv("../Youniverse/df_timeseries_en.tsv", separator="\t")
df_timeseries.sample(5)

In [None]:
df_timeseries.filter(pl.col("channel").is_in(filtered_df_ch["channel"])).write_csv("../Youniverse/filtered_df_timeseries_en.tsv", include_header=True, separator="\t")

In [None]:
filtered_df_timeseries = pl.read_csv("../Youniverse/filtered_df_timeseries_en.tsv", separator="\t")
filtered_df_timeseries.sample(5)

In [None]:
grouped_df = filtered_df_timeseries.group_by('channel').agg(pl.col('activity').mean().alias('mean_activity'))

# Extract the mean activity values into a list
mean_activities = grouped_df['mean_activity'].to_list()

# Plot histogram of the mean activity values
plt.hist(mean_activities, bins=100, edgecolor="black", alpha=0.7)
plt.xlabel('Mean Activity')
plt.ylabel('Frequency')
plt.title('Histogram of Mean Activity by Channel')
# plt.xscale("log")
plt.yscale("log")
plt.grid(True, which="both")
plt.show()


In [None]:
# a mean activity of 56 corresponds to about 4 videos per day over a two-week window
len(grouped_df.filter(pl.col("mean_activity")>56))
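
The threshold can be related to an upload rate by dividing `activity` by the length of its window. The comment above implies a 14-day window, while the column description says 15 days, so this sketch shows both (the helper name is hypothetical):

```python
def videos_per_day(activity, window_days):
    """Convert an activity count over a window into an average daily upload rate."""
    return activity / window_days

rate_14 = videos_per_day(56, 14)   # two-week window
rate_15 = videos_per_day(56, 15)   # 15-day window, as in the column description
print(rate_14, round(rate_15, 2))
```

Either way, the threshold selects channels that publish roughly four videos per day on average.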

In [None]:
high_activity_channels = pl_df_f.filter(pl.col("channel").is_in(grouped_df.filter(pl.col("mean_activity")>56)["channel"]))
# merge high_activity_channels with grouped_df on the channel column
high_activity_channels = high_activity_channels.join(grouped_df, on="channel", how="inner")
high_activity_channels.sort(by="mean_activity", descending=True).head(10)

## Saving the high_activity_channels to a csv

In [None]:
def write_polars_to_csv(polars_dataframe, name):
    """Write a polars DataFrame to `<name>.csv` (the ".csv" extension is appended)."""
    # Convert to pandas DataFrame
    pandas_dataframe = polars_dataframe.to_pandas()

    # Write the DataFrame to a CSV file
    pandas_dataframe.to_csv(f"{name}.csv", index=False)

In [None]:
import keys
YOUTUBE_KEY = keys.YOUTUBE_API_KEY
OPEN_API_KEY = keys.OPENAI_API_KEY

# Get country of channel

In [None]:
from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey=YOUTUBE_KEY)

def get_channel_country(channel_id):
    # Make the API request to get the channel details
    request = youtube.channels().list(
        part="snippet",
        id=channel_id
    )
    
    # Execute the request and get the response
    response = request.execute()
    
    # Check if the response contains the necessary information
    if "items" in response and len(response["items"]) > 0:
        # Extract country information from the channel snippet
        country = response["items"][0]["snippet"].get("country", "Country not available")
        return country
    else:
        return "Channel not found"
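
Looking up one channel per request uses one API call per row. The `id` parameter of `channels().list` takes a comma-separated list of channel ids, so several channels can be requested at once. A minimal sketch of the batching step only, with no network call; the batch size of 50 and the helper names are assumptions to check against the API reference:

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_id_parameters(channel_ids, batch_size=50):
    """Return one comma-separated id string per API request."""
    return [",".join(batch) for batch in chunked(list(channel_ids), batch_size)]

# Hypothetical ids, only to show the batching.
example_ids = [f"UC{i:03d}" for i in range(120)]
id_parameters = build_id_parameters(example_ids)
print(len(id_parameters))                  # number of requests needed
print(id_parameters[-1].count(",") + 1)    # number of ids in the last request
```

Each string would be passed as `id=...` in one request. The returned items may not come back in request order, so it is safer to map countries back to channels through each item's `id` field.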


In [None]:
high_activity_channels = high_activity_channels.with_columns(
    pl.col("channel").map_elements(get_channel_country, return_dtype=pl.Utf8).alias("Channel_country")
)

In [None]:
print(high_activity_channels.sample(100))
write_polars_to_csv(high_activity_channels, "high_activity_channels_with_country_test_2")

# Identifying English-speaking channels with the ChatGPT API

In [None]:
# Initialize the chunk reader
chunk_reader = pd.read_csv("../Youniverse/filtered_yt_metadata.csv", chunksize=5000)

# Read the first chunk and print the first few rows
chunk = next(chunk_reader)  # Get the first chunk
print(chunk.head())  # Display the first few rows of the first chunk

In [None]:
# Load the second dataset (with the list of channel IDs to compare against)
channels_df = pd.read_csv("high_activity_channels_with_country.csv") 
channel_ids = set(channels_df['channel'].unique())  
# Initialize the chunk reader for the large CSV file
chunk_reader = pd.read_csv("../Youniverse/filtered_yt_metadata.csv", chunksize=5000)

matching_videos = []
# Dictionary to track how many videos are saved for each channel
channel_video_count = {channel_id: 0 for channel_id in channel_ids}

for chunk in chunk_reader:
    # Filter rows where channel_id in chunk is in the set of channel_ids from channels_df
    matching_rows = chunk[chunk['channel_id'].isin(channel_ids)]
    for channel_id, group in matching_rows.groupby('channel_id'):
        # If we've already saved 5 videos for this channel, skip it
        if channel_video_count[channel_id] >= 5:
            continue
        # Get the first 5 videos for this channel (or fewer if there are less than 5)
        first_5_videos = group.head(5 - channel_video_count[channel_id])  # Adjust to avoid exceeding 5
        # Add the number of videos saved for this channel
        channel_video_count[channel_id] += len(first_5_videos)
        matching_videos.append(first_5_videos)

# Convert the list into DataFrame
final_df = pd.concat(matching_videos, ignore_index=True)
final_df.to_csv('matching_videos.csv', index=False)


In [None]:
file = pd.read_csv("./data/matching_videos.csv")
print(file.shape)

In [None]:
import pandas as pd
from googleapiclient.discovery import build
from openai import OpenAI
import time

client = OpenAI(api_key=OPEN_API_KEY)

# YouTube API credentials and setup
youtube = build('youtube', 'v3', developerKey=YOUTUBE_KEY)

# Define the function to detect the language using ChatGPT
def check_video_language(video_title, video_description, closed_captions=""):
    # Combine title, description, and captions to form the text to be checked
    print("title: ", video_title)
    print("description: ", video_description)

    messages = [
        {"role": "system", "content": "You are a helpful assistant who only focuses on language identification."},
        {"role": "user", "content": f"""
        Given the title and description of a YouTube video, please determine if the text is in English. Ignore URLs and non-English symbols. 
        Respond with "yes" if you think the text is in English, and "no" if you think it is not.

        Title: "{video_title}"
        Description: "{video_description}"

        Is the text in English?
        """}
    ]

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=messages
    )

    response = completion.choices[0].message.content

    print("Chat response: ", response)
    # Check if the response includes "yes", regardless of capitalization
    return "yes" in (response or "").lower()

def check_channel_english(channel_id):
    videos = final_df.loc[final_df['channel_id'] == channel_id]
    # print(videos)
    for index, video in videos.iterrows():
        # Check if the text is in English using CHATGPT API
        is_english = check_video_language(video_title=video['title'], video_description=video['description'])
        if not is_english:
            print("channel is not english")
            return False  # If any video is not English, return False
        time.sleep(0.5)
    print("channel is english")
    return True  # If all videos checked are English, return True
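
Model replies can vary in capitalization and punctuation ("Yes.", "no"), so it can help to normalize the reply before interpreting it. A minimal sketch, assuming the model answers with "yes" or "no" as its first word; the helper `parse_yes_no` is hypothetical and returns `None` when the reply is unclear:

```python
import string

def parse_yes_no(response):
    """Return True for a 'yes' answer, False for 'no', and None if unclear."""
    if not response:
        return None
    words = response.strip().lower().split()
    if not words:
        return None
    # Drop quotes and punctuation around the first word, e.g. '"Yes."' -> 'yes'.
    first = words[0].strip(string.punctuation)
    if first == "yes":
        return True
    if first == "no":
        return False
    return None

for answer in ["Yes.", '"no"', "NO, it is Spanish", "maybe", ""]:
    print(repr(answer), parse_yes_no(answer))
```

Returning `None` for unclear replies lets the caller decide whether to retry or to treat the video as undetermined, instead of silently counting it as non-English.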

In [None]:
## TESTING 

check_channel_english("UClMs26ViHFMy7MS897Alcxw")

In [None]:
# Check each channel in the polars dataframe and store whether it is English
high_activity_channels = high_activity_channels.with_columns(
    pl.col("channel").map_elements(check_channel_english, return_dtype=pl.Boolean).alias("Is_English")
)


In [None]:
write_polars_to_csv(high_activity_channels, "high_activity_channels_country_and_english")

In [None]:
filtered = pd.read_csv("./data/high_activity_channels_country_and_english.csv")

In [None]:
# filtered["Is_English"].value_counts()
english = filtered[filtered["Is_English"] == True]
print(len(english))
print("english")
print(english["Channel_country"].value_counts())

not_english = filtered[filtered["Is_English"] == False]
print("not_english")
print(not_english["Channel_country"].value_counts())

# Analysis

In [54]:
channels_df = pd.read_csv("./data/high_activity_channels_country_and_english.csv")

In [56]:
videos_df = pd.read_csv("./../data/filtered_yt_metadata.csv")

In [57]:
videos_df

Unnamed: 0,categories,channel_id,crawl_date,description,dislike_count,display_id,duration,like_count,tags,title,upload_date,view_count
0,News & Politics,UCzUV5283-l5c0oKRtyenj6Q,2019-11-22 08:47:10.520209,👕 Order your shirts here: https://Teespring.co...,195.0,MBgzne7djFU,378,47027.0,"Funny,Entertainment,Fun,Laughing,Educational,L...",Elizabeth Warren Gets a Big Surprise at the Ai...,2019-10-03 00:00:00,374711.0
1,News & Politics,UCzUV5283-l5c0oKRtyenj6Q,2019-11-22 08:46:16.481889,👕 Order your shirts here: https://Teespring.co...,114.0,AbH3pJnFgY8,278,36384.0,"Funny,Entertainment,Fun,Laughing,Educational,L...",No More Twitter? 😂,2019-10-02 00:00:00,245617.0
2,News & Politics,UCzUV5283-l5c0oKRtyenj6Q,2019-11-22 08:46:17.137786,👕 Order your shirts here: https://Teespring.co...,143.0,QBuwj_h1SH4,385,40597.0,"Funny,Entertainment,Fun,Laughing,Educational,L...",The Only Thing Stopping Them 😂,2019-10-01 00:00:00,299535.0
3,News & Politics,UCzUV5283-l5c0oKRtyenj6Q,2019-11-22 08:46:17.823119,👕 Order your shirts here: https://Teespring.co...,193.0,Reogq26-KpI,419,42658.0,"Funny,Entertainment,Fun,Laughing,Educational,L...",Speaking of Losers...,2019-09-30 00:00:00,357126.0
4,News & Politics,UCzUV5283-l5c0oKRtyenj6Q,2019-11-22 08:46:18.497042,👕 Order your shirts here: https://Teespring.co...,136.0,uBY9OtlSnX8,414,44246.0,"Funny,Entertainment,Laughing,Educational,Learn...",The Circus Continues!,2019-09-27 00:00:00,297704.0
...,...,...,...,...,...,...,...,...,...,...,...,...
9503605,News & Politics,UCrwE8kVqtIUVUzKui2WVpuQ,2019-11-01 23:46:02.294620,Shri Manoj Kumar Tiwari's speech during Motion...,3,YQLoxwLpjSU,270,67,"BJP,Bharatiya Janata Party,BJP videos,Yuva TV,...",Shri Manoj Kumar Tiwari's speech during Motion...,2017-02-06 00:00:00,4409.0
9503606,News & Politics,UCrwE8kVqtIUVUzKui2WVpuQ,2019-11-01 23:46:06.401481,Shri La Ganesan's speech during Motion of Than...,0,mINQHg1QBcg,878,21,"BJP,Bharatiya Janata Party,BJP videos,Yuva TV,...",Shri La Ganesan's speech during Motion of Than...,2017-02-06 00:00:00,1172.0
9503607,News & Politics,UCrwE8kVqtIUVUzKui2WVpuQ,2019-11-01 23:46:09.530822,Shri Mukhtar Abbas Naqvi's speech during Motio...,2,x20aNOWh1yI,1003,35,"BJP,Bharatiya Janata Party,BJP videos,Yuva TV,...",Shri Mukhtar Abbas Naqvi's speech during Motio...,2017-02-06 00:00:00,1898.0
9503608,News & Politics,UCrwE8kVqtIUVUzKui2WVpuQ,2019-11-01 23:46:00.080054,BJP submitted complaint to EC against Chief Se...,0,-Nn6FL2gqEw,755,27,"BJP,Bharatiya Janata Party,BJP videos,Yuva TV,...",BJP submitted complaint to EC against Chief Se...,2017-02-06 00:00:00,726.0


In [None]:
event_channels = []
event_videos = []
event_timeseries = []

## Filter out a certain event

In [None]:
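#load all dataframes
#channels_df and videos_df were read with pandas above; convert them to polars for the steps below
channels_df = pl.from_pandas(channels_df)
videos_df = pl.from_pandas(videos_df)
timeseries_df = pl.read_csv("./../data/filtered_df_timeseries_en.tsv", separator='\t')
num_comments = pl.read_csv("./../data/num_comments.tsv", separator='\t')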

In [None]:
#join videos dataset with num comment dataset for easier use
videos_df = videos_df.join(num_comments, on='display_id')

In [None]:
#rename channel_id columns to all have the same name
channels_df = channels_df.rename({'channel':'channel_id'})
timeseries_df = timeseries_df.rename({'channel':'channel_id'})

In [None]:
#filter by date
min_date = pl.datetime(2017,1,1)
max_date = pl.datetime(2018,1,1)

timeseries_df = timeseries_df.with_columns(pl.col('datetime').str.to_datetime())
videos_df = videos_df.with_columns(pl.col('upload_date').str.to_datetime())

timeseries_df = timeseries_df.filter((pl.col('datetime') >= min_date) & (pl.col('datetime') <= max_date))
videos_df = videos_df.filter((pl.col('upload_date') >= min_date) & (pl.col('upload_date') <= max_date))
channels_df = channels_df.filter(pl.col('channel_id').is_in(videos_df['channel_id']))

##### !!! Note : Keyword filtering needs to be added.

In [None]:
#create dictionary to more easily navigate between channel name and channel id
channel_dict = dict(channels_df[['name_cc','channel_id']].iter_rows())
inv_channel_dict = {v: k for k, v in channel_dict.items()}

## Identify holes in the data
- channels that don’t report on specific events
- videos with too few comments (videos with fewer than 50 comments are not in the comment dataset)

In [None]:
#store data from next event
event_timeseries.append(timeseries_df)
event_videos.append(videos_df)
event_channels.append(channels_df)

### Looking for channels that do not report on certain events

In [None]:
# channels that report on event1 but not on event2
event_channels[0].filter(~pl.col('channel_id').is_in(event_channels[1]['channel_id']))

In [None]:
# channels that report on event2 but not on event1
event_channels[1].filter(~pl.col('channel_id').is_in(event_channels[0]['channel_id']))

### Locating videos with too few comments

In [None]:
#the videos with too few comments can be excluded by filtering
comment_threshold = 100

too_few_comms = videos_df.filter(pl.col('num_comms') < comment_threshold)

## Compare channels
- this channel's videos have these characteristics, or perform well with these subjects

### Channels with correlated video performances

In [None]:
#get general statistics for all channels
#gives information on the general performance characteristics of the videos from each channel
vid_count, vid_mean, vid_std, vid_med = get_general_ch_statistics(videos_df, cols_to_keep=['dislike_count','like_count','view_count','num_comms'])

In [None]:
vid_mean_performance = vid_mean.drop('duration')
cov = plot_covariance (vid_mean_performance,'Covariance matrix between channels', 'Histogram of covariances')
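
Covariance depends on the scale of the underlying metrics, so channels with large raw counts dominate and thresholds end up being very large numbers. A correlation coefficient rescales covariance to the range [-1, 1]. The sketch below contrasts the two on made-up per-channel means; how `plot_covariance` computes its matrix is not shown here, so this is only an illustration of the general idea:

```python
import math

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Made-up per-channel means for (dislikes, likes, views); illustration only.
channel_small = [2.0, 40.0, 1_000.0]
channel_large = [200.0, 4_000.0, 100_000.0]   # same profile at 100x the scale

cov_value = covariance(channel_small, channel_large)
corr_value = correlation(channel_small, channel_large)
print(cov_value, corr_value)
```

The two channels have the same profile, so their correlation is 1 up to rounding, while their covariance grows with the size of the larger channel.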

In [None]:
threshold = 1e10
correlated_channels = get_correlated_channels(vid_mean_performance,threshold)
correlated_channels

### Channels with correlated video characteristics (length, key words?, ...)

##### !!! Note : Keyword analysis needs to be added for fully functioning code.

In [None]:
vid_mean_characteristics = vid_mean.drop(['num_comms','like_count','dislike_count','view_count'])
cov = plot_covariance(vid_mean_characteristics, 'Covariance matrix of video characteristics', 'Histogram of covariances')

In [None]:
#print channels that have similar videos
threshold = 1e10
correlated_channels = get_correlated_channels(vid_mean_characteristics, threshold)
correlated_channels

### More in-depth comparison between two given channels

Optional procedure to analyse the relation between two channels in more depth

##### Based on video dataframe

In [None]:
# ttest : checks the null hypothesis that two independent channels have an identical mean number of views, likes etc...
# used to check whether the means of two samples differ significantly

ttest_between_two_channels(videos_df, channel_dict['BBC News'],channel_dict["CNBC"], 'num_comms')
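
The implementation of `ttest_between_two_channels` lives in `analysis_functions` and is not shown here. As an illustration of what a two-sample t-test measures, this sketch computes Welch's t statistic (the variant that does not assume equal variances) on made-up comment counts; the helper in this notebook may use a different variant:

```python
import math
import statistics

def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Made-up comment counts for two channels, for illustration only.
channel_a = [120, 150, 130, 170, 160]
channel_b = [200, 220, 210, 250, 240]
t_stat = welch_t_statistic(channel_a, channel_b)
print(round(t_stat, 3))
```

A statistic far from zero means the difference between the two means is large relative to the spread within each sample; the p-value then follows from the t distribution.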

In [None]:
# F test : tests the null hypothesis that two channels have the same variance
# used to check whether the variances of two samples differ significantly

Ftest_between_two_channels(videos_df, channel_dict['BBC News'],channel_dict["CNBC"], 'num_comms')
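
An F-test for equal variances is built on the ratio of the two sample variances. The sketch below computes that ratio and the associated degrees of freedom on made-up values; `Ftest_between_two_channels` is defined in `analysis_functions` and may differ in detail:

```python
import statistics

def f_statistic(sample_a, sample_b):
    """Ratio of the two sample variances, plus the degrees of freedom of each sample."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1

# Made-up values with the same mean but a different spread.
channel_a = [10, 12, 14, 16, 18]
channel_b = [13, 14, 14, 14, 15]
f_value, dof_a, dof_b = f_statistic(channel_a, channel_b)
print(f_value, dof_a, dof_b)
```

A ratio close to 1 is consistent with equal variances. The p-value comes from the F distribution with the two degrees of freedom.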

##### Based on timeseries dataframe

In [None]:
ts_count, ts_mean, ts_std, ts_med = get_general_ch_statistics(timeseries_df,cols_to_keep=['views', 'delta_views', 'subs','delta_subs','videos','delta_videos','activity'])

In [None]:
ttest_between_two_channels(timeseries_df, channel_dict['BBC News'],channel_dict["CNBC"], 'activity')

In [None]:
Ftest_between_two_channels(timeseries_df, channel_dict['BBC News'],channel_dict["CNBC"], 'activity')

### Compare channel's video performance when normalized by size (number of subscribers or number of views)

##### Normalize by subs

In [None]:
normalized_videos_df = normalize_vids_with_timeseries(videos_df, timeseries_df, 'subs') #not working because of video_df issues
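
How `normalize_vids_with_timeseries` matches videos to the time series is not shown here. One plausible approach, sketched under that assumption, is to divide each video metric by the channel's subscriber count from the latest weekly snapshot on or before the upload date. The helper `subscribers_at` and the example video are hypothetical:

```python
import bisect
from datetime import date

def subscribers_at(upload_day, weeks, subs):
    """Return the subscriber count of the latest weekly snapshot on or before upload_day."""
    position = bisect.bisect_right(weeks, upload_day) - 1
    if position < 0:
        return None  # the video predates the first snapshot
    return subs[position]

# Weekly snapshots for one channel, taken from the sample rows of df_timeseries_en.tsv.
weeks = [date(2017, 7, 3), date(2017, 7, 10)]
subs = [650, 1046]

# Hypothetical video: uploaded on 2017-07-05 with 130 views.
video = {"upload_date": date(2017, 7, 5), "view_count": 130}
channel_subs = subscribers_at(video["upload_date"], weeks, subs)
views_per_subscriber = video["view_count"] / channel_subs
print(channel_subs, views_per_subscriber)
```

The `weeks` list must be sorted for `bisect` to work. Videos uploaded before the first snapshot get `None` and need to be dropped or handled separately.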

In [None]:
cols_to_keep = ['dislike_count','like_count','view_count','num_comms','duration']
vid_count, vid_mean, vid_std, vid_med = get_general_ch_statistics(normalized_videos_df, cols_to_keep)

In [None]:
cov = plot_covariance (vid_mean.drop('duration'),'Covariance matrix between channels normalized by subscribers', 'Histogram of covariances')

In [None]:
threshold = 25000
correlated_channels = get_correlated_channels(vid_mean,threshold)

##### Normalize by views

In [None]:
normalized_videos_df = normalize_vids_with_timeseries(videos_df, timeseries_df, 'views') #not working because of video_df issues

In [None]:
cols_to_keep = ['dislike_count','like_count','view_count','num_comms','duration']
vid_count, vid_mean, vid_std, vid_med = get_general_ch_statistics(normalized_videos_df, cols_to_keep)

In [None]:
cov = plot_covariance(vid_mean.drop('duration'),'Covariance matrix between channels normalized by views', 'Histogram of covariances')

In [None]:
threshold = 25000
correlated_channels = get_correlated_channels(vid_mean,threshold)

### Compare channel performance across events

In [None]:
videos_event_1 = event_videos[0]
videos_event_2 = event_videos[1]

ts_event_1 = event_timeseries[0]
ts_event_2 = event_timeseries[1]

In [None]:
#calculate general statistics for both events
vid_count_1, vid_mean_1, vid_std_1, vid_med_1 = get_general_ch_statistics(videos_event_1, cols_to_keep=['dislike_count','like_count','view_count','num_comms'])

vid_count_2, vid_mean_2, vid_std_2, vid_med_2 = get_general_ch_statistics(videos_event_2, cols_to_keep=['dislike_count','like_count','view_count','num_comms'])


ts_count_1, ts_mean_1, ts_std_1, ts_med_1 = get_general_ch_statistics(ts_event_1,cols_to_keep=['views', 'delta_views', 'subs','delta_subs','videos','delta_videos','activity'])
ts_count_2, ts_mean_2, ts_std_2, ts_med_2 = get_general_ch_statistics(ts_event_2,cols_to_keep=['views', 'delta_views', 'subs','delta_subs','videos','delta_videos','activity'])

##### Compare general channel performance between multiple events

In [None]:
event_performance = pl.concat([vid_mean_1.mean(),vid_mean_2.mean()])

In [None]:
#Covariance matrix for the channel performance to identify channels that perform similarly for a given event.
cov = plot_covariance(event_performance,'Covariance across the mean performance of all channels for different events','Histogram of the covariance between events')

##### Compare a given channel statistic between two events

In [None]:
# ttest : checks the null hypothesis that a given parameter has the same mean across two events
# used to check whether two means differ significantly

ttest_between_events(ts_mean_1['activity'], ts_mean_2['activity'])

In [None]:
# Ftest : checks the null hypothesis that a given parameter has the same variance across two events
# used to check whether two variances differ significantly

Ftest_between_events(ts_mean_1['activity'], ts_mean_2['activity'])

## Compare between kinds of events and where events are from
- how many videos
- how many views
- interactions: likes, comments

### Compare number of videos

In [None]:
#compare number of videos between two events

compare_overall_vid_count_between_events(vid_count_1, vid_count_2)

In [None]:
# Compare average number of videos per channel between two events
ttest_between_events(vid_count_1['counts'], vid_count_2['counts'])

In [None]:
# Compare variance of the number of videos per channel between two events
Ftest_between_events(vid_count_1['counts'], vid_count_2['counts'])

### Analyse each event by videos (number of views, number of likes/dislikes)

In [None]:
v_means_1,v_stdevs_1,v_medians_1 = get_general_vid_statistics(videos_event_1)
v_means_2,v_stdevs_2,v_medians_2 = get_general_vid_statistics(videos_event_2)

In [None]:
pl.concat([v_means_1,v_means_2]).insert_column(0,pl.Series(['event_1','event_2']))

In [None]:
compare_video_statistics_between_events(videos_event_1,videos_event_2)

In [None]:
ttest_between_events(videos_event_1['view_count'], videos_event_2['view_count'])