# YouNiverse Project
By **AdamantiumForgers**

*Study of political polarization across various topics, based on the followers of major media channels: analyze the vocabulary of posted videos (title and description), how the videos are received (likes, dislikes, views), the list of user_id values and their behaviour on the platform, and the coverage of events (elections, war).*

# Dataset
YouNiverse comprises metadata from over 136k channels and 72.9M videos (English-Speaking only!) published between May 2005 and October 2019, as well as channel-level time-series data with weekly subscriber and view counts.

Source: https://zenodo.org/record/4650046 \
Github: https://github.com/epfl-dlab/YouNiverse \
Size: 111 GB (compressed, in total)

**List of files**
* df_channels_en.tsv.gz (6MB): List of the 136'471 channels with some information about each (state as of October 2019)
* df_timeseries_en.tsv.gz (571MB): Weekly timeseries for each channel, from 03 July 2017 to 23 October 2019
* num_comments.tsv.gz (755MB): List of videos (display_id) with their number of comments
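* num_comments_authors.tsv.gz (1.4GB): One row per comment author (`author`) with an integer count stored in the `video_id` column; the exact meaning of that count is still to be confirmed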
* youtube_comments.tsv.gz (77.2GB): ~8.6B comments made by ~449M users on 20.5M videos. Each row is one comment: user id, video id, and the number of replies and likes the comment received
* yt_metadata_en.jsonl.gz (13.6GB): Metadata for ~73M videos from ~137k channels
* yt_metadata_helper.feather (2.8GB): Same as the jsonl file, without the description, tags, and title fields (the largest ones)

# Packages

In [39]:
import numpy as np
import pandas as pd
import seaborn as sns

import os
import json
#import glob
#import gzip
#import swifter
#import langdetect
#import zstandard as zstd
#import matplotlib as mpl
#import scipy.stats as stats
#import matplotlib.pyplot as plt
#import matplotlib.ticker as mtick
#from matplotlib.lines import Line2D
#import matplotlib.font_manager as font_manager

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'



In [40]:
#class zreader:
#
#    def __init__(self, file, chunk_size=16384):
#        self.fh = open(file, 'rb')
#        self.chunk_size = chunk_size
#        self.dctx = zstd.ZstdDecompressor()
#        self.reader = self.dctx.stream_reader(self.fh)
#        self.buffer = ''
#
#    def readlines(self):
#        while True:
#            chunk = self.reader.read(self.chunk_size).decode("utf-8", errors="ignore")
#            if not chunk:
#                break
#            lines = (self.buffer + chunk).split("\n")
#
#            for line in lines[:-1]:
#                yield line
#
#            self.buffer = lines[-1]

#reader = zreader("PATH_COMMENTS", chunk_size=16384)
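
The files listed above are gzip-compressed rather than zstandard-compressed, so a streaming reader can be sketched with the standard library alone. The helper name `iter_jsonl_gz` and its `limit` parameter are hypothetical, and the demo runs on a tiny synthetic file instead of the real metadata file.

```python
import gzip
import json
import os
import tempfile


def iter_jsonl_gz(path, limit=None):
    """Yield one parsed JSON object per line of a gzip-compressed JSONL file.

    Stops after `limit` lines have been read (None means read everything).
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Demo on a tiny synthetic file (made-up records, not real data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8") as fh:
    for k in range(5):
        fh.write(json.dumps({"display_id": f"vid{k}", "view_count": k * 10}) + "\n")

first_three = list(iter_jsonl_gz(demo_path, limit=3))
```

Because the function is a generator, only one line is held in memory at a time, which is the property that matters for the 13.6GB metadata file.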

## Paths

In [41]:
## Directories for the data.
## Large files can be stored on an external drive and accessed through a symlink to a "large" directory inside the data folder
## ln -s target_path link_path

DIR = "data/"
DIR_LARGE = "data/large/"

## Path for each file
PATH_TIME_SERIES = DIR + "df_timeseries_en.tsv.gz"
PATH_CHANNELS = DIR + "df_channels_en.tsv.gz"
PATH_NUM_COMMENTS = DIR + "num_comments.tsv.gz"
PATH_NUM_COMMENTS_AUTHORS = DIR + "num_comments_authors.tsv.gz"
PATH_METADATA = DIR_LARGE + "yt_metadata_en.jsonl.gz"
PATH_METADATA_HELPER = DIR + "yt_metadata_helper.feather"
PATH_COMMENTS = DIR_LARGE + "youtube_comments.tsv.gz"
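
Since part of the data may sit behind a symlink, it can help to check which files are reachable before loading anything. The following is a minimal sketch; `missing_files` is a hypothetical helper, and the paths are repeated only so that the snippet runs on its own.

```python
import os

# Same layout as above, repeated so the snippet is self-contained
DIR = "data/"
DIR_LARGE = "data/large/"

paths = {
    "PATH_TIME_SERIES": DIR + "df_timeseries_en.tsv.gz",
    "PATH_CHANNELS": DIR + "df_channels_en.tsv.gz",
    "PATH_NUM_COMMENTS": DIR + "num_comments.tsv.gz",
    "PATH_NUM_COMMENTS_AUTHORS": DIR + "num_comments_authors.tsv.gz",
    "PATH_METADATA": DIR_LARGE + "yt_metadata_en.jsonl.gz",
    "PATH_METADATA_HELPER": DIR + "yt_metadata_helper.feather",
    "PATH_COMMENTS": DIR_LARGE + "youtube_comments.tsv.gz",
}


def missing_files(paths):
    """Return the names of the entries whose path does not exist on disk."""
    # os.path.exists follows symlinks, so a broken link counts as missing
    return [name for name, path in paths.items() if not os.path.exists(path)]


for name in missing_files(paths):
    print(f"Missing: {name} -> {paths[name]}")
```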

## Load data

In [42]:
## Small enough to be read entirely (nrows=10000 only limits this preview)
df_channels = pd.read_csv(PATH_CHANNELS, compression="infer", sep="\t", nrows=10000)
df_timeseries = pd.read_csv(PATH_TIME_SERIES, compression="infer", sep="\t", nrows=10000) # Bigger, but can still be read entirely

## Feather files not useful (use the whole json file instead, it's simpler since read_feather does not have any nrows or chunksize options...)
#df_metadata_light = pd.read_feather(PATH_METADATA_HELPER, use_threads=False, columns=["channel_id", "categories", "dislike_count", "like_count", "duration", "upload_date", "view_count"])
#df_metadata_light = pd.read_feather(PATH_METADATA_HELPER,[1,5]) #Only 1:channel_id and 5:like_count

## Too big to load at once: read only nrows here, then process by chunks
df_comments = pd.read_csv(PATH_COMMENTS, compression="infer", sep="\t", nrows=10000)
df_num_comments = pd.read_csv(PATH_NUM_COMMENTS, compression="infer", sep="\t", nrows=10000)
df_num_comments_authors = pd.read_csv(PATH_NUM_COMMENTS_AUTHORS, compression="infer", sep="\t", nrows=10000)
df_metadata = pd.read_json(PATH_METADATA, compression='infer', lines=True, nrows=10000)

## Preprocess date fields
df_channels["join_date"] = pd.to_datetime(df_channels["join_date"])
df_timeseries["datetime"] = pd.to_datetime(df_timeseries["datetime"])


## Work with the data

In [67]:
## Define a function to apply to a dataframe
def function_to_apply(df):
    # For example, filter all channels with more than 10M subs
    return df[df['subscribers_cc'] > 1e7]


## Then apply it to the whole dataset in a loop, using the "chunksize" argument
filtered_chunks = []

for chunk in pd.read_csv(PATH_CHANNELS, compression="infer", sep="\t", chunksize=10000):
    filtered_chunks.append(function_to_apply(chunk))

## Concatenate once at the end instead of re-copying the result at every iteration
result = pd.concat(filtered_chunks)

df_result = pd.DataFrame(data=result)

type(result) 
## Look at the result
df_result.head(5)
print(len(df_result))

pandas.core.frame.DataFrame

Unnamed: 0,category_cc,join_date,channel,name_cc,subscribers_cc,videos_cc,subscriber_rank_sb,weights
0,Gaming,2010-04-29,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,101000000,3956,3.0,2.087
1,Education,2006-09-01,UCbCmjCuTUZos6Inko4u57UQ,Cocomelon - Nursery ...,60100000,458,7.0,2.087
2,Entertainment,2006-09-20,UCpEhnqL0y41EpW2TvWAHD7Q,SET India,56018869,32661,8.0,2.087
3,Howto & Style,2016-11-15,UC295-Dw_tDNtZXFeAPAW6Aw,5-Minute Crafts,60600000,3591,9.0,2.087
4,Sports,2007-05-11,UCJ5v_MCY6GNUBTO8-D3XoAg,WWE,48400000,43421,11.0,2.087


296
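
The chunked loop above filters rows, but the same pattern can also aggregate a file that does not fit in memory. The sketch below is an illustration under assumptions: it uses a small synthetic table with the same column names as `df_comments` (`author`, `video_id`, `likes`, `replies`) and sums likes and replies per video across chunks.

```python
import io

import pandas as pd

# Synthetic stand-in for youtube_comments.tsv.gz (made-up rows, same columns)
demo_tsv = io.StringIO(
    "author\tvideo_id\tlikes\treplies\n"
    "1\tvidA\t2\t0\n"
    "1\tvidB\t0\t0\n"
    "2\tvidA\t1\t3\n"
    "3\tvidB\t5\t1\n"
    "3\tvidA\t0\t0\n"
)

partials = []
for chunk in pd.read_csv(demo_tsv, sep="\t", chunksize=2):
    # Reduce each chunk to one row per video before keeping it
    partials.append(chunk.groupby("video_id")[["likes", "replies"]].sum())

# A video can appear in several chunks, so sum the partial sums once more
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

On the real file, the `io.StringIO` object would be replaced by `PATH_COMMENTS` with `compression="infer"` and a much larger `chunksize`; only the per-chunk partial sums stay in memory.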


## Overview of all dataframes

In [46]:
df_channels.head()
df_channels[df_channels['subscribers_cc'] > 1e7]

Unnamed: 0,category_cc,join_date,channel,name_cc,subscribers_cc,videos_cc,subscriber_rank_sb,weights
0,Gaming,2010-04-29,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,101000000,3956,3.0,2.087
1,Education,2006-09-01,UCbCmjCuTUZos6Inko4u57UQ,Cocomelon - Nursery ...,60100000,458,7.0,2.087
2,Entertainment,2006-09-20,UCpEhnqL0y41EpW2TvWAHD7Q,SET India,56018869,32661,8.0,2.087
3,Howto & Style,2016-11-15,UC295-Dw_tDNtZXFeAPAW6Aw,5-Minute Crafts,60600000,3591,9.0,2.087
4,Sports,2007-05-11,UCJ5v_MCY6GNUBTO8-D3XoAg,WWE,48400000,43421,11.0,2.087


Unnamed: 0,category_cc,join_date,channel,name_cc,subscribers_cc,videos_cc,subscriber_rank_sb,weights
0,Gaming,2010-04-29,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,101000000,3956,3.0,2.087
1,Education,2006-09-01,UCbCmjCuTUZos6Inko4u57UQ,Cocomelon - Nursery ...,60100000,458,7.0,2.087
2,Entertainment,2006-09-20,UCpEhnqL0y41EpW2TvWAHD7Q,SET India,56018869,32661,8.0,2.087
3,Howto & Style,2016-11-15,UC295-Dw_tDNtZXFeAPAW6Aw,5-Minute Crafts,60600000,3591,9.0,2.087
4,Sports,2007-05-11,UCJ5v_MCY6GNUBTO8-D3XoAg,WWE,48400000,43421,11.0,2.087
...,...,...,...,...,...,...,...,...
324,Music,2012-01-23,UCpx_k19S2vUutWUUM9qmXEg,ImagineDragonsVEVO,10300000,108,533.0,2.087
331,Entertainment,2013-04-24,UCwjoPtSoNLAoX2sLBaKLYng,Toys And Funny Kids ...,10200000,3655,543.0,2.087
333,Entertainment,2009-06-08,UCc6W7efUSkd9YYoxOnctlFg,Bethany Mota,10200000,483,545.0,2.087
339,Comedy,2006-12-01,UCazMm3tOCkYrIGE_17j0mVg,Bart Baker,10100000,288,553.0,2.087


In [60]:
df_metadata_light.head()

Unnamed: 0,channel_id,like_count
0,UCzWrhkg9eK5I8Bm3HfV-unA,8.0
1,UCzWrhkg9eK5I8Bm3HfV-unA,23.0
2,UCzWrhkg9eK5I8Bm3HfV-unA,1607.0
3,UCzWrhkg9eK5I8Bm3HfV-unA,227.0
4,UCzWrhkg9eK5I8Bm3HfV-unA,105.0


In [61]:
df_comments.head()

Unnamed: 0,author,video_id,likes,replies
0,1,Gkb1QMHrGvA,2,0
1,1,CNtp0xqoods,0,0
2,1,249EEzQmVmQ,1,0
3,1,_U443T2K_Bs,0,0
4,1,rJbjhm0weYc,0,0


In [62]:
df_num_comments.head()

Unnamed: 0,display_id,num_comms
0,SBqSc91Hn9g,0.0
1,UuugEl86ESY,0.0
2,oB4c-yvnbjs,48.0
3,ZaV-gTCMV8E,6.0
4,cGvL7AvMfM0,5.0


In [63]:
df_num_comments_authors.head()

Unnamed: 0,author,video_id
0,1,5
1,2,3
2,3,2
3,4,6
4,5,3


In [64]:
df_metadata.head(3)

Unnamed: 0,categories,channel_id,crawl_date,description,dislike_count,display_id,duration,like_count,tags,title,upload_date,view_count
0,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.270363,Lego City Police Lego Firetruck Cartoons about...,1.0,SBqSc91Hn9g,1159,8.0,"lego city,lego police,lego city police,lego ci...",Lego City Police Lego Firetruck Cartoons about...,2016-09-28 00:00:00,1057
1,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.914516,Lego Marvel SuperHeroes Lego Hulk Smash Iron-M...,1.0,UuugEl86ESY,2681,23.0,"Lego superheroes,lego hulk,hulk smash,lego mar...",Lego Marvel SuperHeroes Lego Hulk Smash Iron-M...,2016-09-28 00:00:00,12894
2,Film & Animation,UCzWrhkg9eK5I8Bm3HfV-unA,2019-10-31 20:19:26.531203,Lego City Police Lego Fireman Cartoons about L...,779.0,oB4c-yvnbjs,1394,1607.0,"lego city,lego police,lego city police,lego fi...",Lego City Police Lego Fireman Cartoons about L...,2016-09-28 00:00:00,1800602


In [65]:
# Sort by date
df_channels.sort_values(ascending=[True],by=['join_date']).head(5)

Unnamed: 0,category_cc,join_date,channel,name_cc,subscribers_cc,videos_cc,subscriber_rank_sb,weights
192,Entertainment,2005-06-16,UCvC4D8onUfXzvjTOM-dBfEA,Marvel Entertainment...,13200000,6332,302.0,2.087
2396,Film and Animation,2005-06-20,UCiCnPY--pbn5S8JkJdV2PbQ,Bengali Movies - Ang...,2312489,9659,5101.0,2.3905
299,Science & Technology,2005-06-22,UCE_M8A5yxnLfW0KghEeajjw,Apple,9970000,291,489.0,2.087
1764,Comedy,2005-07-19,UC7RGwnsSAEClqqxJ9MxUhyQ,Lucas,3100000,403,3648.0,2.2735
9219,Gaming,2005-08-04,UCbrLdiQyWBiFxWPLW3oI9Sw,Mikey Gaming,676000,1216,23820.0,2.948
