# Characterizing Patronage on YouTube

## 0. Files and a brief description of each

All data is located in `/dlabdata1/youtube_large/`

In [None]:
DATA_FOLDER = "/dlabdata1/youtube_large/"
LOCAL_DATA_FOLDER = "local_data/"

**YouNiverse dataset:**

- `df_channels_en.tsv.gz`: channel metadata.
- `df_timeseries_en.tsv.gz`: channel-level time-series.
- `yt_metadata_en.jsonl.gz`: raw video metadata.
- `youtube_comments.tsv.gz`: user-comment matrices.
- `youtube_comments.ndjson.zst`: raw comments — this is a HUGE file.

**Graphtreon dataset:**
- `creators.csv`: list with all creator names.
- `final_processed_file.jsonl.gz`: all Graphtreon time-series.
- `pages.zip`: raw HTML of the pages in Graphtreon.

#### Library imports

In [None]:
# !conda list

In [None]:
import os 
import io
import pandas as pd
import json
import re
import zstandard
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter
import numpy as np
import seaborn as sns
import gzip
from tqdm import tqdm
import timeit
import ast
import math
import ruptures as rpt

In [None]:
# list all files in current directory
# !ls -lh

In [None]:
# list all files in DATA_FOLDER
!ls -lh {DATA_FOLDER}

## 1. Exploratory Data Analysis (EDA)

### 1.1. YouNiverse dataset

#### 1.1.1 Channel metadata
Metadata associated with the 136,470 channels: **channel ID**, **join date**, **country**, **number of subscribers**, **most frequent category**, and the **channel’s position** in socialblade.com’s subscriber ranking. \
The number of subscribers is provided both as obtained from channelcrawler.com (between 2019-09-12 and 2019-09-17) and as crawled from socialblade.com (2019-09-27). Additionally, we also provide a set of **weights** (derived from socialblade.com’s subscriber rankings) that can be used to partially correct sample biases in our dataset.

- `category_cc`: category of the channel (majority based)
- `join_date`: join date of the channel
- `channel`: channel id
- `name_cc`: name of the channel.
- `subscribers_cc`: number of subscribers
- `videos_cc`: number of videos
- `subscriber_rank_sb`: rank in terms of number of subscribers (channel’s position in socialblade.com’s subscriber ranking)
- `weights`: set of weights derived from socialblade.com’s subscriber rankings; can be used to partially correct sample biases in the dataset (a correction for representativeness)

In [None]:
!ls -lh {DATA_FOLDER}df_channels_en.tsv.gz

In [None]:
# channel metadata
# df_yt_channels = pd.read_csv(DATA_FOLDER+'df_channels_en.tsv.gz', sep="\t", compression='gzip', nrows=10)
df_yt_channels = pd.read_csv(DATA_FOLDER+'df_channels_en.tsv.gz', sep="\t", compression='gzip')
df_yt_channels

Facts about this data (taken from the [YouNiverse GitHub page](https://github.com/epfl-dlab/YouNiverse)):

- This dataframe has 136,470 rows, each corresponding to a different channel.
- All channels with >10k subscribers and >10 videos were obtained from channelcrawler.com on 27 October 2019.
- Additionally, channels that were not in English, judging by their video metadata, were filtered out (see 'Raw Channels').

##### Summary statistics

In [None]:
print('Number of unique categories:         {:,}'.format(df_yt_channels['category_cc'].nunique()))
print('Number of unique channels:      {:,}'.format(df_yt_channels['channel'].nunique()))
print('Number of unique channel names: {:,}'.format(df_yt_channels['name_cc'].nunique()))

print('\nNote: there are more unique channels than unique names, so some channels might have the same name!')
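
The note above can be checked directly by listing the names shared by several channels. The following is a minimal sketch on a toy frame: the channel ids and names are invented for illustration, and only the `channel` and `name_cc` columns described earlier are assumed.

```python
import pandas as pd

# Toy stand-in for df_yt_channels (ids and names are invented for illustration)
toy_channels = pd.DataFrame({
    "channel": ["UC_a", "UC_b", "UC_c", "UC_d"],
    "name_cc": ["Cooking Fun", "Daily Vlogs", "Cooking Fun", "Tech Talk"],
})

# keep=False marks every row whose name occurs more than once
shared_names = toy_channels[toy_channels.duplicated(subset="name_cc", keep=False)]

# number of distinct channels behind each shared name
name_counts = shared_names.groupby("name_cc")["channel"].nunique()
print(name_counts)
```

Running the same two lines on `df_yt_channels` should show which names are shared and by how many channels.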

##### Distribution of videos and subscribers per channel

In [None]:
selected_cols = ['videos_cc', 'subscribers_cc']

# plot with linear scale for x axis and log scale for y axis
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
    sns.histplot(data=df_yt_channels[col], ax=ax, bins=50, kde=False, color=f'C{i}')
    ax.set(title=f'Distribution of {col}')
    ax.set_ylabel("Count - number of channels (log scale)")
    ax.set(yscale="log")
    # ax.set(xscale="log")
plt.tight_layout()
plt.show()


# plot with log scale for x axis 
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

xlabels = [r'$\log_{10}(videos)$', r'$\log_{10}(subscribers)$']

for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
    sns.histplot(data=np.log10(df_yt_channels[col]), ax=ax, bins=50, kde=False, cumulative=False, color=f'C{i}')

    ax.set(title=f'Distribution of {col} (log10 x-axis)')
    ax.set_xlabel(xlabels[i])
    ax.set_ylabel("Count - number of channels")

    # ax.set(yscale="log")
    # ax.set(xscale="log")
plt.tight_layout()
plt.show()


# plot with linear scale for both axes 
# fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

# for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
#     sns.histplot(data=df_yt_channels[col], ax=ax, bins=50, kde=False, color=f'C{i}')
#     ax.set(title=f'Distribution of {col}')
#     ax.set_ylabel("Count - number of channels")
#     # ax.set(yscale="log")
#     ax.set(xscale="log")
# plt.tight_layout()
# plt.show()

# # plot with log scale for x axis (distplot)
# fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

# for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
#     sns.distplot(np.log10(df_yt_channels[col]), hist_kws=kwargs, kde=False, kde_kws=kwargs, ax=ax, norm_hist=True)

#     ax.set(title=f'Distribution of {col} (log-log scale)')
#     ax.set_ylabel("Count - number of channels")
#     # ax.set(yscale="log")
#     # ax.set(xscale="log")
# plt.tight_layout()
# plt.show()


# descriptive statistics table
df_yt_channels[selected_cols].describe().T

**Discussion:** \
From the above graphs and table, we can see that the _videos_ and _subscribers_ distributions across YouTube channels are heavy-tailed and consistent with a **power law**: most channels have only a few videos and subscribers, while a few channels have a great many.

More specifically:
- 50% of the YouTube channels have fewer than 175 videos
- 50% of the YouTube channels have fewer than 42,400 subscribers

_Note: only channels with at least 10 videos and 10,000 subscribers were considered for this study._
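
One way to put a number on "a few channels hold most of the mass" is to compare the median with the share of the total held by the top 1% of channels. The sketch below uses a synthetic sample, not the real data, and `tail_summary` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
import pandas as pd

def tail_summary(values: pd.Series, top_share: float = 0.01) -> dict:
    """Return the median and the fraction of the total held by the top `top_share` of rows."""
    ordered = values.sort_values(ascending=False)
    n_top = max(1, int(np.ceil(len(ordered) * top_share)))
    return {
        "median": float(ordered.median()),
        "top_fraction_of_total": float(ordered.iloc[:n_top].sum() / ordered.sum()),
    }

# Synthetic heavy-tailed sample (NOT the real data): 99 small channels and 1 very large one
toy_subscribers = pd.Series([10_000] * 99 + [9_010_000])
summary = tail_summary(toy_subscribers)
print(summary)
```

Applied to `df_yt_channels['subscribers_cc']`, a large top-1% share next to a small median would support the heavy-tail reading; it does not by itself establish a power law.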

##### Group by categories

In [None]:
data_per_cat_chan = df_yt_channels.groupby(['category_cc', 'channel'])[['videos_cc', 'subscribers_cc']].agg(['max'])

# set the columns to the top level of the multi-index
data_per_cat_chan.columns = data_per_cat_chan.columns.get_level_values(0)
data_per_cat_chan

In [None]:
data_per_cat_chan.reset_index(inplace=True)
data_per_cat_chan

##### Number of channels per category

In [None]:
chan_per_cat = data_per_cat_chan.groupby('category_cc')[['channel']].count().sort_values('channel', ascending=False)

In [None]:
chan_per_cat.plot(kind='bar')
plt.title("Number of channels per category")
plt.xlabel("Categories")
plt.ylabel("Number of channels")
plt.show()
chan_per_cat['channel']

In [None]:
# data_per_cat = data_per_cat_chan.groupby('category')['videos_cc','subscribers_cc'].agg(['min', 'max', 'count', 'sum'])
data_per_cat = data_per_cat_chan.groupby('category_cc')[['videos_cc','subscribers_cc']].agg(['sum'])
data_per_cat.columns = data_per_cat.columns.get_level_values(0)
data_per_cat = data_per_cat.add_suffix('_sum')
data_per_cat

##### Number of videos per category

In [None]:
data_per_cat['videos_cc_sum'].sort_values(ascending=False).plot(kind='bar')
plt.title("Number of videos per category")
plt.xlabel("Categories")
plt.ylabel("Number of videos")
plt.show()

data_per_cat['videos_cc_sum'].sort_values(ascending=False)

##### Number of subscribers per category

In [None]:
data_per_cat['subscribers_cc_sum'].sort_values(ascending=False).plot(kind='bar')
plt.title("Number of subscribers per category")
plt.xlabel("Categories")
plt.ylabel("Number of subscribers")
plt.show()

data_per_cat['subscribers_cc_sum'].sort_values(ascending=False)

#### 1.1.2 YouTube Channels time-series data
Weekly number of viewers and subscribers. We have a data point for each channel and each week.

Time series of channel activity at **weekly granularity**. The span of time series varies by channel depending on when socialblade.com started tracking the channel. On average, it contains **2.8 years of data per channel** for **133k channels** (notice that this means there are roughly 4k channels for which there is no time-series data). \
Each data point includes the **number of views** (`views`) and **subscribers** (`subs`) obtained in the given week, as well as the **number of videos** (`videos`) posted by the **channel** (`channel`). The number of videos is calculated using the video upload dates in our video metadata, such that videos that were unavailable at crawl time are not accounted for. 

---

Time series related to each channel.\
These come from a mix of YouTube data and time series crawled from [socialblade.com](https://socialblade.com/):
- From the former (YouTube data): derived weekly time series indicating **how many videos each channel had posted per week**. 
- From the latter (socialblade.com): crawled weekly statistics on the **number of viewers** `views` and **subscribers** `subs` per channel `channel`. This data was available for around 153k channels.

    - `channel`: unique channel ID, which is the numbers and letters at the end of the URL.
    - `category`: category of the channel as assigned by [socialblade.com](https://socialblade.com/) according to the last 10 videos at time of crawl (categories organize channels and videos on YouTube and help creators, advertisers, and channel managers identify with content and audiences they wish to associate with).
    - `datetime`: First day of the week the data point refers to.
    - `views`: Total number of views the channel had this week.
    - `delta_views`: Views gained this week (difference in the number of views between the current and the previous week). (Interpolation)
    - `subs`: Total number of subscribers the channel had this week.
    - `delta_subs`: Subscribers gained this week (difference in the number of subscribers between the current and the previous week).
    - `videos`: Number of videos posted by the channel to date.
    - `delta_videos`: Videos posted this week (difference in the number of videos posted by the channel between the current and the previous week).
    - `activity`: Number of videos published in the last 15 days.
    
    
Note: a channel can be viewed by appending its channel ID to the URL, e.g. https://www.youtube.com/channel/UCBJuEqXfXTdcPSbGO9qqn1g
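
The relation between the cumulative columns and their `delta_*` counterparts can be illustrated on a toy frame. This is a sketch under the assumption that a delta is a plain week-over-week difference within each channel. The values are invented, and the real `delta_*` columns may deviate from this where the source data was interpolated.

```python
import pandas as pd

# Toy weekly series for two channels (values invented for illustration)
toy_ts = pd.DataFrame({
    "channel": ["UC_a"] * 3 + ["UC_b"] * 3,
    "datetime": pd.to_datetime(["2019-01-07", "2019-01-14", "2019-01-21"] * 2),
    "views": [100, 150, 225, 1000, 1000, 1300],
})

# Differences must be taken within each channel, in chronological order
toy_ts = toy_ts.sort_values(["channel", "datetime"])
toy_ts["delta_views_check"] = toy_ts.groupby("channel")["views"].diff()
print(toy_ts)
```

The first week of each channel has no previous week, so its difference is `NaN`.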


In [None]:
!ls -lh {DATA_FOLDER}df_timeseries_en.tsv.gz

In [None]:
# load channel-level time-series. (takes btw 50 secs and 2 mins)
df_yt_timeseries = pd.read_csv(DATA_FOLDER+'df_timeseries_en.tsv.gz', sep="\t", compression='gzip', parse_dates=['datetime'])
df_yt_timeseries

##### Summary statistics

In [None]:
# df_yt_timeseries.describe().T

In [None]:
yt_ts_uniq_chan_cnt = df_yt_timeseries['channel'].nunique()

print('Timeseries data was gathered between {} and {}'.format(df_yt_timeseries['datetime'].min().strftime('%B %d, %Y'),
                                                         df_yt_timeseries['datetime'].max().strftime('%B %d, %Y')))
print('Total number of datapoints across all channels:  {:>12,}'.format(len(df_yt_timeseries)))
data_points_dist = df_yt_timeseries['channel'].value_counts()
print('Average number of datapoints per channel:       {:>12.0f} weeks (≈{:,.1f} years)'.format(data_points_dist.mean(), data_points_dist.mean()/52))
print('Number of unique categories:                     {:>12,}'.format(df_yt_timeseries['category'].nunique()))
print('Number of unique channels:                       {:>12,}'.format(yt_ts_uniq_chan_cnt))

##### Datetime points per channel

Not all channel time series start and end at the same time, so the number of data points differs from channel to channel.

In [None]:
datetime_data = df_yt_timeseries.groupby('channel')['datetime'].agg(['min', 'max'])
datetime_data.head()

In [None]:
# datetime_data.describe().T

##### Datetime points per year

In [None]:
yt_ts_year_cnt = df_yt_timeseries.groupby(df_yt_timeseries.datetime.dt.year).size()

In [None]:
print('Timeseries data was gathered between {} and {}'.format(df_yt_timeseries['datetime'].min().strftime('%B %d, %Y'),
                                                         df_yt_timeseries['datetime'].max().strftime('%B %d, %Y')))
yt_ts_year_cnt.plot(kind='bar')
plt.title("Number of datapoints per year across all channels")
plt.xlabel("Year")
plt.ylabel("Count (datapoints)")
plt.show()

yt_ts_year_cnt

##### Datetime points per month

In [None]:
yt_ts_month_cnt = df_yt_timeseries.groupby([df_yt_timeseries.datetime.dt.year, df_yt_timeseries.datetime.dt.month]).size()
yt_ts_month_cnt.head()

In [None]:
# using pandas.Grouper
# yt_ts_month_cnt_grouper = df_yt_timeseries.groupby(pd.Grouper(key='datetime', freq='M')).count().channel
# yt_ts_month_cnt_grouper.head()

In [None]:
# plot number of datapoints per month
plt.figure(figsize=(15,2))
yt_ts_month_cnt.plot(kind='bar')
plt.title("Number of datapoints per month across all channels (using regular group by method)")
plt.xlabel("Month")
plt.ylabel("Count (datapoints)")
plt.show()

# plot number of datapoints per month using grouper
# plt.figure(figsize=(15,2))
# yt_ts_month_cnt_grouper.plot(kind='bar')
# plt.title("Number of datapoints per month across all channels (using grouper)")
# plt.xlabel("Month")
# plt.ylabel("Count (datapoints)")
# plt.show()

In [None]:
# only consider unique values per channel
yt_ts_month_unique_cnt = df_yt_timeseries.groupby(pd.Grouper(key='datetime', freq='M')).agg({"channel": pd.Series.nunique})
yt_ts_month_unique_cnt.head()

In [None]:
(df_yt_timeseries.groupby(['datetime', 'channel']).count() > 1).sum()

In [None]:
# Number of channels with timeseries (only consider unique values per channel) --> see https://stackoverflow.com/questions/38309729/count-unique-values-per-groups-with-pandas

years = mdates.YearLocator()   # every year
months = mdates.MonthLocator()  # every month

fig, ax = plt.subplots(1, figsize=(7,3), sharey=True, sharex=True,
                       gridspec_kw={"wspace": 0.05})

ax.plot(yt_ts_month_unique_cnt)

ax.set(title='Number of channels with timeseries')
ax.set_xlabel("Month")
ax.set_ylabel("# channels")
ax.xaxis.set_major_locator(years)
ax.xaxis.set_minor_locator(months)

##### Datetime points across channels

In [None]:
# Distribution of datapoints across channels

print('Total number of datapoints across all channels:  {:>12,}'.format(len(df_yt_timeseries)))
data_points_dist = df_yt_timeseries['channel'].value_counts()
print('Average number of datapoints per channel:        {:>12,.0f} weeks (≈{:,.1f} years)'.format(data_points_dist.mean(), data_points_dist.mean()/52))

ax = sns.histplot(data=data_points_dist, bins=50, kde=False, color=f'C{1}')

ax.set(title='Distribution of datapoints (weeks) across channels')
ax.set_xlabel('number of data points (weeks)')
ax.set_ylabel('number of channels')

# ax.set(yscale="log")
# plt.tight_layout()
plt.show()

In [None]:
# Aggregates per channel
sel_cols = ['datetime', 'views', 'delta_views', 'subs', 'delta_subs', 'videos', 'delta_videos', 'activity']
data_per_channel = df_yt_timeseries.groupby('channel')[sel_cols].agg(['min', 'max', 'count', 'mean'])
data_per_channel.head()

#####  Views per channel

In [None]:
data_per_channel['views'].head()

In [None]:
# Distribution of total views per channel
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(6,8))

sns.histplot(data=data_per_channel['views']['max'], ax=axs[0], bins=20, kde=False, color=f'C{1}')
axs[0].set(title=f'Distribution of total views per channel')
axs[0].set_xlabel('number of views (in billions)')
axs[0].set_ylabel('number of channels')
axs[0].set(yscale="log")
xlabels1 = ['{:,.0f}'.format(x) + 'bn' for x in axs[0].get_xticks()/1_000_000_000]
axs[0].set_xticklabels(xlabels1)

# Distribution of total views per channel (log scale)
sns.histplot(data=data_per_channel['views']['max'], ax=axs[1], bins=1000, kde=False, color=f'C{1}')
axs[1].set(title=f'Distribution of total views per channel (log-log scale)')
axs[1].set_xlabel('number of views (in millions)')
axs[1].set_ylabel('number of channels')
axs[1].set(yscale="log")
axs[1].set(xscale="log")
xlabels2 = ['{:,.0f}'.format(x) + 'M' for x in axs[1].get_xticks()/1_000_000]
axs[1].set_xticklabels(xlabels2)

plt.tight_layout()
plt.show()

data_per_channel['views'][['max']].describe().T

In [None]:
print("Top 10 channels with the most total views (in billions):")

for index, value in data_per_channel['views']['max'].sort_values(ascending=False)[:10].items():
    print('https://www.youtube.com/channel/{} : {:,.1f} bn views'.format(index, value/1_000_000_000))

##### Videos per channel

In [None]:
data_per_channel['videos'].head()

In [None]:
# Distribution of total videos per channel
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(6,8))
sns.histplot(data=data_per_channel['videos']['max'], ax=axs[0], bins=20, kde=False, color=f'C{1}')

axs[0].set(title=f'Distribution of total videos per channel')
axs[0].set_xlabel('number of videos')
axs[0].set_ylabel('number of channels')
axs[0].set(yscale="log")

# Distribution of total videos per channel (log-log scale)
sns.histplot(data=data_per_channel['videos']['max'], ax=axs[1], bins=100, kde=False, color=f'C{1}')

axs[1].set(title=f'Distribution of total videos per channel (log-log scale)')
axs[1].set_xlabel('number of videos')
axs[1].set_ylabel('number of channels')
axs[1].set(yscale="log")
axs[1].set(xscale="log")

plt.tight_layout()
plt.show()

data_per_channel['videos'][['max']].describe().T

In [None]:
print("Top 10 channels with the most total videos:")

for index, value in data_per_channel['videos']['max'].sort_values(ascending=False)[:10].items():
    print('https://www.youtube.com/channel/{} : {:,.0f} videos'.format(index, value))

##### Subscribers per channel

In [None]:
data_per_channel['subs'].head()

In [None]:
# Distribution of total subscribers per channel
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(6,8))
sns.histplot(data=data_per_channel['subs']['max'], ax=axs[0], bins=20, kde=False, color=f'C{1}')

axs[0].set(title=f'Distribution of total subscribers per channel')
axs[0].set_xlabel('number of subscribers (in millions)')
axs[0].set_ylabel('number of channels')
axs[0].set(yscale="log")
xlabels0 = ['{:,.0f}'.format(x) + 'M' for x in axs[0].get_xticks()/1_000_000]
axs[0].set_xticklabels(xlabels0)

# Distribution of total subscribers per channel (log-log scale)
sns.histplot(data=data_per_channel['subs']['max'], ax=axs[1], bins=500, kde=False, color=f'C{1}')

axs[1].set(title=f'Distribution of total subscribers per channel (log-log scale)')
axs[1].set_xlabel('number of subscribers')
axs[1].set_ylabel('number of channels')
axs[1].set(yscale="log")
axs[1].set(xscale="log")

plt.tight_layout()
plt.show()

data_per_channel['subs'][['max']].describe().T

In [None]:
data_per_channel['subs']['max'].sort_values(ascending=False)[:10]

In [None]:
print("Top 10 channels with the most total subscribers:")

for index, value in data_per_channel['subs']['max'].sort_values(ascending=False)[:10].items():
    print('https://www.youtube.com/channel/{} : {:,.1f}M subscribers'.format(index, value/1_000_000))

In [None]:
# set the columns to the top level of the multi-index
# data_per_channel.columns = data_per_channel.columns.get_level_values(0)
# data_per_channel

#### 1.1.3 Raw video metadata
The file `yt_metadata_en.jsonl.gz` contains metadata for ~73M videos from ~137k channels. Below we show the data recorded for each video.

In [None]:
!ls -lh {DATA_FOLDER}yt_metadata_en.jsonl.gz

In [None]:
# ! zcat {DATA_FOLDER}yt_metadata_en.jsonl.gz | head

In [None]:
df_yt_metadata = pd.read_json(DATA_FOLDER+'yt_metadata_en.jsonl.gz', compression='gzip', lines=True, nrows=2000)
df_yt_metadata.head(2)

#### 1.1.4 User-comment matrices

In [None]:
# !ls -lh {DATA_FOLDER}youtube_comments.tsv.gz

In [None]:
# user-comment matrices
df_yt_comments = pd.read_csv(DATA_FOLDER+'youtube_comments.tsv.gz', sep="\t", compression='gzip', nrows=100)
df_yt_comments.head()

#### 1.1.5 Raw comments

In [None]:
# !ls -lh {DATA_FOLDER}youtube_comments.ndjson.zst

In [None]:
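def line_jsonify(line): 
    """
    Turn one raw line of the comments file into a JSON array string.

    :param line: string to parse and jsonify
    :return: the cleaned line, wrapped in square brackets, ready for json.loads
    """    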
    
    # add square brackets around line
    line = "[" + line + "]"

    # remove quotes before and after square brackets   
    line = line.replace("\"[{", "[{")
    line = line.replace("}]\"", "}]")    
    
    # replace double double-quotes with single double-quotes
    line = line.replace("{\"\"", "{\"")
    line = line.replace("\"\"}", "\"}")
    line = line.replace("\"\":\"\"", "\":\"")
    line = line.replace(":\"\"", ":\"")
    line = line.replace("\"\":", "\":")
    
    # line = line.replace("\"\":", "\":")
    line = line.replace("\"\",\"\"", "\",\"")
    line = line.replace("\"\",\"\"", "\",\"")
    line = line.replace("\\\"\"", "\\\"")
    line = line.replace("\\\",[", "\\\\ \",[")
    
    line = re.sub(r',\"\"(?!\,)', ',\"', line)

    line = line.replace("true,\"\"", "true,\"")
    line = line.replace("false,\"\"", "false,\"")
    
    return line

In [None]:
class Zreader:
    '''Stream a zstandard-compressed text file line by line.'''

    def __init__(self, file, chunk_size=16384):
        '''Open the file and set up a streaming decompressor'''
        import codecs
        self.fh = open(file,'rb')
        print(f"reading {file} in chunks ...")
        self.chunk_size = chunk_size
        self.dctx = zstandard.ZstdDecompressor(max_window_size=2147483648)
        self.reader = self.dctx.stream_reader(self.fh)
        # incremental decoder: multi-byte characters split across two chunks are decoded correctly
        self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        self.buffer = ''

    def readlines(self):
        '''Generator method that yields one line of text at a time'''
        nb_chunk = 0
        while True:
            nb_chunk = nb_chunk + 1
            if nb_chunk % 5000 == 0:
                print("number of chunks read: ", nb_chunk)
                
            raw = self.reader.read(self.chunk_size)

            if not raw:
                break
            lines = (self.buffer + self.decoder.decode(raw)).split("\n")

            for line in lines[:-1]:
                yield line

            # keep the trailing partial line for the next chunk
            self.buffer = lines[-1]

        # yield the last line if the file does not end with a newline
        if self.buffer:
            yield self.buffer

In [None]:
NB_OF_LINES = 350000
lines_json = []
inp_file = DATA_FOLDER+"youtube_comments.ndjson.zst"
reader = Zreader(inp_file, chunk_size=4092)

for i, line in enumerate(reader.readlines()):
    if i > NB_OF_LINES:
        # print(line)
        break
    line_json = json.loads(line_jsonify(line))
    lines_json.append(line_json)

print("==> number of lines read:", len(lines_json))

df_yt_comments_raw = pd.DataFrame(data=lines_json[1:], columns=lines_json[0])
df_yt_comments_raw.head()

### 1.2. Graphtreon dataset

#### 1.2.1 List with all creator names.

In [None]:
# !ls -lh {DATA_FOLDER}creators.csv

In [None]:
# list with all creator names.
df_gt_creators = pd.read_csv(DATA_FOLDER+'creators.csv')
df_gt_creators.head()

#### 1.2.2 All graphtreon time-series

In [None]:
!ls -lh {DATA_FOLDER}final_processed_file.jsonl.gz

In [None]:
# final_processed_file.jsonl.gz all Graphtreon time-series.
df_gt_timeseries = pd.read_json(DATA_FOLDER+'final_processed_file.jsonl.gz', compression='gzip', lines=True, nrows=10)
df_gt_timeseries.head()

# df_gt_timeseries = pd.read_json(DATA_FOLDER+'final_processed_file.jsonl.gz', compression='gzip', lines=True)

# get only id and match them first
# 

##### Summary statistics

In [None]:
df_gt_timeseries['startDate'] = pd.to_datetime(df_gt_timeseries['startDate'])
df_gt_timeseries.head()

In [None]:
print('Number of unique creators:         {:,}'.format(df_gt_timeseries['creatorName'].nunique()))
print('Number of unique patreon ids:         {:,}'.format(df_gt_timeseries['patreon'].nunique()))

print('Timeseries data was gathered between {} and {}'.format(df_gt_timeseries['startDate'].min().strftime('%B %d, %Y'),
                                                         df_gt_timeseries['startDate'].max().strftime('%B %d, %Y')))
print('Total number of datapoints across all creators:  {:>12,}'.format(len(df_gt_timeseries)))

#### 1.2.3 Raw HTML of the pages in Graphtreon

In [None]:
!ls -lh {DATA_FOLDER}pages.zip

In [None]:
# pages.zip raw HTML of the pages in Graphtreon.

## 2. Match data

In [None]:
# DATA_FOLDER = "/dlabdata1/youtube_large/"

Files used in this section

**YouNiverse dataset:**

- (`df_channels_en.tsv.gz`: channel metadata.)
- `df_timeseries_en.tsv.gz`: channel-level time-series.
- `yt_metadata_en.jsonl.gz`: raw video metadata.

**Graphtreon dataset:**
- `final_processed_file.jsonl.gz`: all Graphtreon time-series.

### Library imports

In [None]:
# !conda list

In [None]:
# import os 
# import io
# import pandas as pd
# import json
# import re
# import zstandard
# import matplotlib.pyplot as plt
# import matplotlib.dates as mdates
# from matplotlib.ticker import FuncFormatter
# import numpy as np
# import seaborn as sns
# import gzip
# from tqdm import tqdm
# import timeit
# import ast
# import math
# import datetime
# import ruptures as rpt

### 2.1. Filter YouTube metadata containing patreon id
_Extract Patreon urls from YouTube metadata description (if they exist) and keep only those rows_

YT_metadata_filter_results_040422.jpg _(filter script in script/scripts.ipynb)_
<div>
    <img src="img/YT_metadata_filter_results_040422.jpg" alt="YT_metadata_filter_results_040422.jpg" />
</div>
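
The actual filter script lives in `script/scripts.ipynb` and is not reproduced here. As a rough illustration of the extraction step, the sketch below pulls a Patreon id out of a video description. The pattern `PATREON_RE`, the helper `extract_patreon_id`, and the example descriptions are all hypothetical and are not the author's implementation. The sketch skips the `posts` and `user` path segments, which are not creator pages and are removed in the clean-up step later in this section.

```python
import re
from typing import Optional

# Hypothetical pattern: captures the first path segment after patreon.com/
PATREON_RE = re.compile(r"patreon\.com/([A-Za-z0-9_\-]+)", flags=re.IGNORECASE)

# Path segments that are not creator pages
NON_CREATOR_SEGMENTS = {"posts", "user"}

def extract_patreon_id(description: str) -> Optional[str]:
    """Return 'patreon.com/<creator>' in lowercase, or None if no creator link is found."""
    for match in PATREON_RE.finditer(description or ""):
        segment = match.group(1).lower()
        if segment not in NON_CREATOR_SEGMENTS:
            return f"patreon.com/{segment}"
    return None

# Invented example descriptions
examples = [
    "Support me: https://www.patreon.com/SomeCreator",
    "Read https://patreon.com/posts/12345 then patreon.com/other_creator",
    "No link here",
]
print([extract_patreon_id(text) for text in examples])
```

Lowercasing inside the helper mirrors the de-duplication step applied to `patreon_id` below.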

In [None]:
# declare global variable for size of original YT dataset
DF_YT_METADATA_ROWS = 72_924_794

In [None]:
# YT metadata containing patreon ids in description
!ls -lh {LOCAL_DATA_FOLDER}yt_metadata_en_pt_040422.tsv.gz

In [None]:
# read filtered youtube metadata file (takes about 2 mins)
df_yt_metadata_pt = pd.read_csv(LOCAL_DATA_FOLDER+"yt_metadata_en_pt_040422.tsv.gz", sep="\t", lineterminator='\n', compression='gzip') 

In [None]:
# remove rows where patreon_ids = patreon.com/posts or patreon.com/user (in the future fix in regex)
df_yt_metadata_pt = df_yt_metadata_pt[df_yt_metadata_pt['patreon_id'] != 'patreon.com/posts']
df_yt_metadata_pt = df_yt_metadata_pt[df_yt_metadata_pt['patreon_id'] != 'patreon.com/user']

# lowercase all patreon ids to avoid duplicates
df_yt_metadata_pt['patreon_id'] = df_yt_metadata_pt['patreon_id'].str.lower()

In [None]:
df_yt_metadata_pt.head(1)

In [None]:
# stats 
print("[YouTube metadata] Total number of videos:                                                {:>10,}".format(DF_YT_METADATA_ROWS))
print("[Filtered YouTube metadata] number of videos that contain a patreon link in description:  {:>10,} ({:.1%} of total dataset)".format(len(df_yt_metadata_pt), len(df_yt_metadata_pt)/DF_YT_METADATA_ROWS))

# get list of all unique patreon ids in df_yt_metadata_pt
yt_patreon_list = df_yt_metadata_pt['patreon_id'].unique()
yt_pt_channel_list = df_yt_metadata_pt['channel_id'].unique()

print("[Filtered YouTube metadata] total number of unique patreon ids:                           {:>9,}".format(len(yt_patreon_list)))
print("[Filtered YouTube metadata] number of unique channels that contain a patreon account:     {:>9,}".format(len(yt_pt_channel_list)))

**Observation:** \
We can see that there are _**more patreon ids than channels**_. Let's investigate further:

#### Restrict to 1 patreon id per youtube channel

In [None]:
# group by channel_id AND patreon_id and count the number of unique videos (display_ids)
df_yt_metadata_pt_grp_chan = df_yt_metadata_pt.groupby(['channel_id','patreon_id']).agg(display_id_cnt=("display_id", pd.Series.nunique))
df_yt_metadata_pt_grp_chan.head()

In [None]:
# reset index
df_yt_metadata_pt_grp_chan = df_yt_metadata_pt_grp_chan.reset_index()
# df_yt_metadata_pt_grp_chan.head(4)

# count the number of patreon_ids per channel
pt_id_cnt_pr_chan = df_yt_metadata_pt_grp_chan.groupby('channel_id').count()['patreon_id'].sort_values(ascending=False)
pt_id_cnt_pr_chan = pt_id_cnt_pr_chan.to_frame(name='patreon_id_cnt')
pt_id_cnt_pr_chan.head()

In [None]:
# plot Distribution of patreon ids per channel
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(6,4))

# linear scale for x axis and log scale for y axis
sns.histplot(data=pt_id_cnt_pr_chan, ax=axs, bins=50, kde=False, legend=False, color=f'C{0}')
axs.set(title=f'Distribution of patreon ids per channel (log scale)')
axs.set_xlabel("Number of patreon ids")
axs.set_ylabel("Count of channels (log scale)")
axs.set(yscale="log")

# plt.tight_layout()
plt.show()

# descriptive statistics table
pt_id_cnt_pr_chan.describe().T

**Discussion:** \
As we observed earlier, some channels use more than 1 patreon id, and use different patreon urls for different videos. For example:
- [Patreon_Gaming](https://www.youtube.com/channel/UCAsLyFlWkbdhvri02tO6veA) uses 73 different patreon ids.
- [Artistic Maniacs](https://www.youtube.com/channel/UC3pcSD6_RRisNLaHGznemJA) uses 69 different patreon ids.

In [None]:
# example for Artistic Maniacs
df_yt_metadata_pt_grp_chan[df_yt_metadata_pt_grp_chan['channel_id'] == 'UC3pcSD6_RRisNLaHGznemJA'].head()

_Optional: Keep only most used patreon_id per channel (patreon_id with most videos for each channel)_

In [None]:
# sort metadata df by display_id_cnt within each channel_id group
df_yt_metadata_pt_grp_chan = df_yt_metadata_pt_grp_chan.sort_values(['channel_id','display_id_cnt'], ascending=[True, False])
df_yt_metadata_pt_grp_chan.head(5)

In [None]:
# count the rows that share a channel id with another row but have a different patreon id
dup_chan_id = df_yt_metadata_pt_grp_chan[df_yt_metadata_pt_grp_chan.duplicated(subset=['channel_id'], keep='first')]
print("Number of duplicate rows (same channel id with multiple patreon_ids): {:,}".format(len(dup_chan_id)))

In [None]:
# drop duplicate rows, keep the patreon ids with the most videos
df_yt_metadata_unique_pt = df_yt_metadata_pt_grp_chan.drop_duplicates(subset='channel_id', keep='first')
print('Removed {:,} rows'.format(len(df_yt_metadata_pt_grp_chan) - len(df_yt_metadata_unique_pt)))
df_yt_metadata_unique_pt.head()

#### "Match" dataframe (channel/patreon)

Linking rules for a channel and a Patreon account:
- [TODO] Link them only if a Patreon link appears in more than 10% of the channel's videos and the second most common Patreon link occurs in fewer than 2-3 videos.
- [TODO] Remove channels whose patreon ids are not unique.
- Match each YouTube channel to the Patreon id that appears in most of its videos.
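
The first TODO rule is not implemented in this notebook. A possible shape for it is sketched below on toy data. The function `link_channels`, its thresholds (`min_share=0.10`, `max_second_videos=3`, one reading of "2-3 videos"), and the `total_videos` series are assumptions made for illustration. The per-channel total video count in particular would have to come from the full metadata rather than the filtered file.

```python
import pandas as pd

def link_channels(pair_counts, total_videos, min_share=0.10, max_second_videos=3):
    """Keep a channel's top patreon id if it covers > min_share of the channel's videos
    and the runner-up id appears in fewer than max_second_videos videos.

    pair_counts: columns channel_id, patreon_id, display_id_cnt
    total_videos: Series indexed by channel_id (total videos per channel)
    """
    ranked = pair_counts.sort_values(["channel_id", "display_id_cnt"], ascending=[True, False])
    ranked = ranked.assign(rank=ranked.groupby("channel_id").cumcount())

    top = ranked[ranked["rank"] == 0].set_index("channel_id")
    second = ranked[ranked["rank"] == 1].set_index("channel_id")["display_id_cnt"]
    second = second.reindex(top.index, fill_value=0)  # channels with a single id

    share = top["display_id_cnt"] / total_videos.reindex(top.index)
    keep = (share > min_share) & (second < max_second_videos)
    return top.loc[keep, ["patreon_id"]].reset_index()

# Toy data (invented): UC_a passes, UC_b has a strong runner-up, UC_c has too low a share
toy_pairs = pd.DataFrame({
    "channel_id": ["UC_a", "UC_a", "UC_b", "UC_b", "UC_c"],
    "patreon_id": ["patreon.com/a", "patreon.com/x", "patreon.com/b", "patreon.com/y", "patreon.com/c"],
    "display_id_cnt": [40, 1, 30, 25, 2],
})
toy_totals = pd.Series({"UC_a": 100, "UC_b": 100, "UC_c": 100})

linked = link_channels(toy_pairs, toy_totals)
print(linked)
```

The input layout matches `df_yt_metadata_pt_grp_chan`, so the same function could be tried on it once per-channel totals are available.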

In [None]:
# store into new "matched" dataframe
df_matched_channel_patreon = df_yt_metadata_unique_pt[['channel_id', 'patreon_id']]
df_matched_channel_patreon.head()

In [None]:
# save "matched" dataframe to LOCAL SCRATCH FOLDER as a compressed tsv
# output_file_path = LOCAL_DATA_FOLDER+"df_matched_channel_patreon.tsv.gz"
# df_matched_channel_patreon.to_csv(output_file_path, index=False, sep='\t', compression='gzip')

##### [_Ignore for now_] Further Observation

**Further Observation:** \
When grouping YouTube metadata by `channel_id` and `patreon_id`, we also notice that we have more rows than the total number of unique patreon ids. \
This is because some `patreon_id` are used on multiple channels. 

In [None]:
print("total rows:                        {:,}".format(len(df_yt_metadata_pt_grp_chan)))
print("total number of unique patreon ids {:,}".format(df_yt_metadata_pt.patreon_id.nunique()))

In [None]:
# show patreon_id that are used on multiple channels.
df_yt_metadata_pt_grp_chan[df_yt_metadata_pt_grp_chan.duplicated(subset=['patreon_id'], keep=False)].sort_values(by='patreon_id')

In [None]:
print("[Filtered YouTube metadata] number of channels per patreon id:")

# count distinct channels (not videos) per patreon id
chan_cnt_per_patreon_id = df_yt_metadata_pt.groupby('patreon_id')\
                                            .agg(channel_id_count=('channel_id', 'nunique'))\
                                            .sort_values(by=['channel_id_count'], ascending=False)
chan_cnt_per_patreon_id
# chan_cnt_per_patreon_id.reset_index()

##### [_Ignore for now_] Number of videos per patreon id

In [None]:
# group by patreon_id and count the number of unique display_ids
vids_cnt_per_patreon_id = df_yt_metadata_pt.groupby('patreon_id').agg({"display_id": pd.Series.nunique}).sort_values(by='display_id', ascending=False)
vids_cnt_per_patreon_id.rename(columns={'display_id':'display_id_cnt'}, inplace=True)

print("[Filtered YouTube metadata] number of videos per patreon id:")
vids_cnt_per_patreon_id

In [None]:
# histogram of the number of videos per patreon id (linear x axis, log y axis)
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(6,4))

sns.histplot(data=vids_cnt_per_patreon_id, ax=axs, bins=50, kde=False, color='C0')
axs.set(title='Distribution of videos per patreon id (log scale on y axis)')
axs.set_xlabel("Number of videos")
axs.set_ylabel("# patreon ids (log scale)")
axs.set(yscale="log")

plt.tight_layout()
plt.show()

# descriptive statistics table
vids_cnt_per_patreon_id.describe().T

**Discussion:** \
From the graph and table above, the distribution of _videos_ across patreon ids is heavy-tailed, consistent with a **power law**: most Patreon accounts have only a few videos, while a few accounts have a very large number of videos.

More specifically:
- 25% of the Patreon accounts have 1 video
- 50% of the Patreon accounts have fewer than 4 videos
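
The quartiles quoted above are the ones reported by `describe()`. As a hedged illustration, they can also be read directly with `quantile`, together with the share of videos held by the largest accounts. The series below is made-up toy data, not the notebook's counts.

```python
import pandas as pd

# Toy video counts per Patreon id (hypothetical values with a heavy right tail)
toy_counts = pd.Series([1, 1, 1, 2, 3, 4, 6, 10, 40, 500], name="display_id_cnt")

# 25th, 50th and 75th percentiles, as in the describe() table
quartiles = toy_counts.quantile([0.25, 0.5, 0.75])
print(quartiles)

# share of all videos contributed by the top 10% of accounts
n_top = max(1, len(toy_counts) // 10)
top_share = toy_counts.nlargest(n_top).sum() / toy_counts.sum()
print(f"top 10% of accounts hold {top_share:.1%} of the videos")
```

A median far below the mean, combined with a large top-decile share, is what one would expect from a heavy-tailed distribution; it does not by itself establish a power law.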

### 2.2 Filter YouTube timeseries - Restrict YouTube channels (4 filters)
Restrict YouTube channels according to the following criteria (filters are applied sequentially):
- Filter 1: Keep only YouTube channels that are in YouTube Timeseries dataset AND linked to a patreon account 
- Filter 2: At least 2 years between first and last video
- Filter 3: At least 20 videos with patreon ids
- Filter 4: At least 250k subscribers at data crawling time
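
The sequential filtering described above can be sketched on toy data as follows. The frames `ts` and `patreon_vid_cnt`, the helper `keep_channels`, and all values are assumptions for illustration. Like the notebook, the sketch uses the maximum observed subscriber count as a stand-in for "subscribers at data crawling time".

```python
import pandas as pd

# Toy stand-ins (hypothetical data) for the channel time-series
ts = pd.DataFrame({
    "channel": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "datetime": pd.to_datetime([
        "2016-01-01", "2019-01-01",   # A: about 3 years of data
        "2018-06-01", "2019-01-01",   # B: about 7 months of data
        "2015-01-01", "2019-01-01",   # C: about 4 years of data
        "2015-01-01", "2019-01-01",   # D: long history, but no Patreon link
    ]),
    "subs": [100_000, 300_000, 400_000, 500_000, 10_000, 20_000, 900_000, 1_000_000],
})
# channel -> number of videos containing a Patreon link (hypothetical)
patreon_vid_cnt = pd.Series({"A": 35, "B": 50, "C": 60})

def keep_channels(df, channels):
    """Keep only the rows whose channel is in `channels`."""
    return df[df["channel"].isin(channels)]

# Filter 1: channel is linked to a Patreon account
f1 = keep_channels(ts, patreon_vid_cnt.index)

# Filter 2: at least 730 days between first and last data point
grouped = f1.groupby("channel")["datetime"]
span = grouped.max() - grouped.min()
f2 = keep_channels(f1, span[span >= pd.Timedelta(days=730)].index)

# Filter 3: at least 20 videos with a Patreon id
f3 = keep_channels(f2, patreon_vid_cnt[patreon_vid_cnt >= 20].index)

# Filter 4: at least 250k subscribers (maximum over the time-series)
max_subs = f3.groupby("channel")["subs"].max()
f4 = keep_channels(f3, max_subs[max_subs >= 250_000].index)

for name, df in [("original", ts), ("filter 1", f1), ("filter 1+2", f2),
                 ("filter 1+2+3", f3), ("filter 1+2+3+4", f4)]:
    print(f"{name:<16} channels: {df['channel'].nunique()}")
```

Because each filter consumes the output of the previous one, the order matters only for the intermediate counts; the final set of channels is the intersection of all four conditions.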

In [None]:
!ls -lh {DATA_FOLDER}df_timeseries_en.tsv.gz

In [None]:
# load channel-level time-series. (takes about 50 secs)
df_yt_timeseries = pd.read_csv(DATA_FOLDER+'df_timeseries_en.tsv.gz', sep="\t", compression='gzip', parse_dates=['datetime'])

In [None]:
df_yt_timeseries.head(3)

In [None]:
# Define global values for filters
MIN_DAYS_DELTA = "730 day"    # filter 2
NB_PATREON_VIDS = 20          # filter 3
NB_SUBS = 250_000             # filter 4

In [None]:
# Nb of channels of original YT timeseries dataset (need to first load df_yt_timeseries in 1.1.2)
yt_ts_uniq_chan_cnt = df_yt_timeseries['channel'].nunique()
print("[YouTube Timeseries] Nb of rows of original dataset:                  {:>10,}".format(len(df_yt_timeseries)))
print("[YouTube Timeseries] Nb of channels of original dataset:              {:>10,}".format(yt_ts_uniq_chan_cnt))

---
##### **• Filter 1:** Keep only YouTube channels that are in YouTube Timeseries dataset AND linked to a patreon account

In [None]:
# Apply filter 1: retain only the YT channels that exist in the filtered YT metadata dataset (need to first load df_yt_metadata_pt and yt_pt_channel_list in 2.2.1)
df_yt_timeseries_filt1 = df_yt_timeseries[df_yt_timeseries['channel'].isin(yt_pt_channel_list)]
chan_list_filt1 = df_yt_timeseries_filt1['channel'].unique()
chan_list_filt1_cnt = len(chan_list_filt1)

print("[YouTube Timeseries] Nb of rows after applying filter 1:              {:>10,} ({:5.1%} of original dataset)".format(len(df_yt_timeseries_filt1), len(df_yt_timeseries_filt1)/len(df_yt_timeseries)))
print("[YouTube Timeseries] Nb of channels after applying filter 1:          {:>10,} ({:5.1%} of original dataset)".format(chan_list_filt1_cnt, chan_list_filt1_cnt/yt_ts_uniq_chan_cnt))

---
##### **• Filter 2:** At least 2 years between first and last video

In [None]:
# among filter-1 channels, compute the time span between the first and the last data point of each channel
datetime_data = df_yt_timeseries_filt1.groupby('channel').agg(datetime_min=('datetime', 'min'),
                                                              datetime_max=('datetime', 'max'))
datetime_data['delta_datetime'] = datetime_data['datetime_max'] - datetime_data['datetime_min']

# keep channels whose data spans at least MIN_DAYS_DELTA
datetime_data_filt2 = datetime_data[datetime_data['delta_datetime'] >= pd.Timedelta(MIN_DAYS_DELTA)]

# Apply filter on YT Timeseries dataset: retain only those channels whose data spans at least MIN_DAYS_DELTA
df_yt_timeseries_filt2 = df_yt_timeseries_filt1[df_yt_timeseries_filt1['channel'].isin(datetime_data_filt2.index)]

chan_list_filt2 = df_yt_timeseries_filt2['channel'].unique()
chan_list_filt2_cnt = len(chan_list_filt2)

print("[YouTube Timeseries] Nb of rows after applying filter 1+2:            {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 1 dataset)".format(len(df_yt_timeseries_filt2), len(df_yt_timeseries_filt2)/len(df_yt_timeseries), len(df_yt_timeseries_filt2)/len(df_yt_timeseries_filt1)))
print("[YouTube Timeseries] Nb of channels after applying filter 1+2:        {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 1 channels)".format(chan_list_filt2_cnt, chan_list_filt2_cnt/yt_ts_uniq_chan_cnt, chan_list_filt2_cnt/chan_list_filt1_cnt))

___

##### **• Filter 3:** At least 20 videos with patreon ids per channel 

In [None]:
# df_yt_metadata_pt_grp_chan (from point 2.2.1) holds the number of unique videos (display_ids) per (channel_id, patreon_id) pair.
# Keep the pairs that have at least NB_PATREON_VIDS videos (display_ids)
df_yt_metadata_pt_grp_chan_filt3 = df_yt_metadata_pt_grp_chan[df_yt_metadata_pt_grp_chan['display_id_cnt'] >= NB_PATREON_VIDS]

# get list of unique channels satisfying filter 3
chan_list_filt_3 = df_yt_metadata_pt_grp_chan_filt3['channel_id'].unique()

# Apply filter on YT Timeseries dataset: retain only those channels from filt 2 that are in the chan_list_filt_3
df_yt_timeseries_filt3 = df_yt_timeseries_filt2[df_yt_timeseries_filt2['channel'].isin(chan_list_filt_3)]

chan_list_filt3 = df_yt_timeseries_filt3['channel'].unique()
chan_list_filt3_cnt = len(chan_list_filt3)

print("[YouTube Timeseries] Nb of rows after applying filter 1+2+3:          {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 2 dataset)".format(len(df_yt_timeseries_filt3), len(df_yt_timeseries_filt3)/len(df_yt_timeseries), len(df_yt_timeseries_filt3)/len(df_yt_timeseries_filt2)))
print("[YouTube Timeseries] Nb of channels after applying filter 1+2+3:      {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 2 channels)".format(chan_list_filt3_cnt, chan_list_filt3_cnt/yt_ts_uniq_chan_cnt, chan_list_filt3_cnt/chan_list_filt2_cnt))

---
##### **• Filter 4:** At least 250k subscribers at data crawling time

In [None]:
# Aggregates per channel
subs_aggr_per_channel = df_yt_timeseries_filt3.groupby('channel')\
                                               .agg(min_subs=('subs', 'min'),
                                                    max_subs=('subs', 'max'))\
                                                .sort_values(by=['max_subs'], ascending=False)\
                                                .reset_index()
# subs_aggr_per_channel.head()

In [None]:
# Keep channels whose maximum subscriber count (from subs_aggr_per_channel in the previous cell) reaches NB_SUBS
subs_per_channel_filt4 = subs_aggr_per_channel[subs_aggr_per_channel['max_subs'] >= NB_SUBS]

# get list of unique channels satisfying filter 4
chan_list_filt_4 = subs_per_channel_filt4['channel'].unique()

# Apply filter on YT Timeseries dataset: retain only those channels from filt_3 that are in the chan_list_filt_4
df_yt_timeseries_filt4 = df_yt_timeseries_filt3[df_yt_timeseries_filt3['channel'].isin(chan_list_filt_4)]

chan_list_filt4 = df_yt_timeseries_filt4['channel'].unique()
chan_list_filt4_cnt = len(chan_list_filt4)

print("[YouTube Timeseries] Nb of rows after applying filter 1+2+3+4:        {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 3 dataset)".format(len(df_yt_timeseries_filt4), len(df_yt_timeseries_filt4)/len(df_yt_timeseries), len(df_yt_timeseries_filt4)/len(df_yt_timeseries_filt3)))
print("[YouTube Timeseries] Nb of channels after applying filter 1+2+3+4:    {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 3 channels)".format(chan_list_filt4_cnt, chan_list_filt4_cnt/yt_ts_uniq_chan_cnt, chan_list_filt4_cnt/chan_list_filt3_cnt))

---
##### **• Filter 4b**: At least 50k subscribers in the first 6 months

___
___
**• Filters summary**

In [None]:
print("[YouTube Timeseries] Stats before and after filters:")
print()

print("Filter 1 = \"keep only YouTube channels that are in YouTube Timeseries dataset AND linked to a patreon account\"")
print("Filter 2 = \"at least {:.1f} years ({} days) between first and last video\"".format(pd.Timedelta(MIN_DAYS_DELTA).days/365, pd.Timedelta(MIN_DAYS_DELTA).days))
print("Filter 3 = \"at least {:,} videos with patreon ids per channel\"".format(NB_PATREON_VIDS))
print("Filter 4 = \"at least {:,} subscribers at data crawling time\"".format(NB_SUBS))
print()




print("[YouTube Timeseries] Nb of rows of original dataset:                  {:>10,}".format(len(df_yt_timeseries)))
print("[YouTube Timeseries] Nb of rows after applying filter 1:              {:>10,} ({:5.1%} of original dataset)".format(len(df_yt_timeseries_filt1), len(df_yt_timeseries_filt1)/len(df_yt_timeseries)))
print("[YouTube Timeseries] Nb of rows after applying filter 1+2:            {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 1 dataset)".format(len(df_yt_timeseries_filt2), len(df_yt_timeseries_filt2)/len(df_yt_timeseries), len(df_yt_timeseries_filt2)/len(df_yt_timeseries_filt1)))
print("[YouTube Timeseries] Nb of rows after applying filter 1+2+3:          {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 2 dataset)".format(len(df_yt_timeseries_filt3), len(df_yt_timeseries_filt3)/len(df_yt_timeseries), len(df_yt_timeseries_filt3)/len(df_yt_timeseries_filt2)))
print("[YouTube Timeseries] Nb of rows after applying filter 1+2+3+4:        {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 3 dataset)".format(len(df_yt_timeseries_filt4), len(df_yt_timeseries_filt4)/len(df_yt_timeseries), len(df_yt_timeseries_filt4)/len(df_yt_timeseries_filt3)))


print()

print("[YouTube Timeseries] Nb of channels of original dataset:              {:>10,}".format(yt_ts_uniq_chan_cnt))
print("[YouTube Timeseries] Nb of channels after applying filter 1:          {:>10,} ({:5.1%} of original dataset)".format(chan_list_filt1_cnt, chan_list_filt1_cnt/yt_ts_uniq_chan_cnt))
print("[YouTube Timeseries] Nb of channels after applying filter 1+2:        {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 1 channels)".format(chan_list_filt2_cnt, chan_list_filt2_cnt/yt_ts_uniq_chan_cnt, chan_list_filt2_cnt/chan_list_filt1_cnt))
print("[YouTube Timeseries] Nb of channels after applying filter 1+2+3:      {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 2 channels)".format(chan_list_filt3_cnt, chan_list_filt3_cnt/yt_ts_uniq_chan_cnt, chan_list_filt3_cnt/chan_list_filt2_cnt))
print("[YouTube Timeseries] Nb of channels after applying filter 1+2+3+4:    {:>10,} ({:5.1%} of original dataset, {:5.1%} of filter 3 channels)".format(chan_list_filt4_cnt, chan_list_filt4_cnt/yt_ts_uniq_chan_cnt, chan_list_filt4_cnt/chan_list_filt3_cnt))
print()


print('[YouTube Timeseries] Time range of original dataset:           between {} and {}'.format(df_yt_timeseries['datetime'].min().strftime('%B %d, %Y'),
                                                              df_yt_timeseries['datetime'].max().strftime('%B %d, %Y')))

print('[YouTube Timeseries] Time range after applying filter 1+2+3+4: between {} and {}'.format(df_yt_timeseries_filt4['datetime'].min().strftime('%B %d, %Y'),
                                                              df_yt_timeseries_filt4['datetime'].max().strftime('%B %d, %Y')))

display(df_yt_timeseries_filt4.head())

print("Restricted list of channels after 4 filters (count = {:,}):".format(chan_list_filt4_cnt))
print(chan_list_filt4)

##### [_Ignore for now_] Match patreon_ids and channel_ids

In [None]:
# # filter YT metadata dataset by list of filtered channels from YT timeseries above
# df_yt_metadata_pt_restr = df_yt_metadata_pt[df_yt_metadata_pt['channel_id'].isin(chan_list_filt4)]

# # get unique channels for youtube metadata (original and restricted)
# yt_metadata_uniq_chan = df_yt_metadata_pt['channel_id'].unique()
# yt_metadata_uniq_chan_restr = df_yt_metadata_pt_restr['channel_id'].unique()

# # get unique patreon ids for youtube metadata (original and restricted)
# yt_metadata_uniq_pat = df_yt_metadata_pt['patreon_id'].unique()
# yt_metadata_uniq_pat_restr = df_yt_metadata_pt_restr['patreon_id'].unique()

# print("[YouTube Metadata]:")
# print()
# print("Restriction = \"keep only YouTube channels that are in YouTube Timeseries filtered (filters 1-4) dataset\"")
# print()
# # print("[YouTube Metadata] Nb of videos in original dataset:                                   {:>10,}".format(DF_YT_METADATA_ROWS))
# # print("[YouTube Metadata] Nb of videos in pre-filtered (containing patreon id) dataset:       {:>10,}".format(len(df_yt_metadata_pt)))
# # print("[YouTube Metadata] Nb of videos after filtering by restricted channels:                {:>10,} ({:5.1%} of pre-filtered dataset)".format(len(df_yt_metadata_pt_restr), len(df_yt_metadata_pt_restr)/len(df_yt_metadata_pt)))
# # print()
# print("[YouTube Metadata] Nb of channels in pre-filtered (containing patreon id) dataset:     {:>10,}".format(len(yt_metadata_uniq_chan)))
# print("[YouTube Metadata] Nb of channels after filtering by restricted channels:              {:>10,} ({:5.1%} of pre-filtered dataset)".format(len(yt_metadata_uniq_chan_restr), len(yt_metadata_uniq_chan_restr)/len(yt_metadata_uniq_chan)))
# print()
# print("[YouTube Metadata] Nb of patreon ids in pre-filtered (containing patreon id) dataset:  {:>10,}".format(len(yt_metadata_uniq_pat)))
# print("[YouTube Metadata] Nb of patreon ids after filtering by restricted channels:           {:>10,} ({:5.1%} of pre-filtered dataset)".format(len(yt_metadata_uniq_pat_restr), len(yt_metadata_uniq_pat_restr)/len(yt_metadata_uniq_pat)))


### 2.3 Filter Graphtreon to keep only the accounts matching a patreon id

GT_timeseries_filter_results_032622.jpg _(filter script in scripts/scripts.ipynb)_
<div>
    <img src="img/GT_timeseries_filter_results_032622.jpg" alt="GT_timeseries_filter_results_032622.jpg" />
</div>

In [None]:
# declare global variable for size of original GT dataset
GT_final_processed_file_ROWS = 232_269

In [None]:
!ls -lh {LOCAL_DATA_FOLDER}df_gt_timeseries_filtered.tsv.gz

In [None]:
df_gt_timeseries_filtered = pd.read_csv(LOCAL_DATA_FOLDER+"df_gt_timeseries_filtered.tsv.gz", sep="\t", compression='gzip')
df_gt_timeseries_filtered.head(3)

In [None]:
print("Statistics of loaded pre-filtered Graphtreon Timeseries file:")
print("[Graphtreon Timeseries] Total number of patreon ids:                                                   {:>9,}".format(GT_final_processed_file_ROWS))
print("[Graphtreon Timeseries] Nb of patreon ids that exist in both GT Timeseries and YT metadata:            {:>9,} ({:.1%} of GT timeseries dataset)".format(len(df_gt_timeseries_filtered), len(df_gt_timeseries_filtered)/GT_final_processed_file_ROWS))


#### 2.3.1. Join GT timeseries with matched channel_id
Join the filtered Graphtreon time series with the matched dataframe, so that each Patreon account carries its matched YouTube `channel_id`.
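
Since `merge` performs an inner join by default, Graphtreon accounts without a matched channel are dropped silently. A small sketch of how `indicator=True` and `validate` can make this visible; the toy frames `gt` and `matched_toy` and their values are assumptions, only the column names mirror the notebook.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the Graphtreon rows and the match table
gt = pd.DataFrame({"patreon": ["patreon.com/a", "patreon.com/b", "patreon.com/z"],
                   "creatorName": ["A", "B", "Z"]})
matched_toy = pd.DataFrame({"channel_id": ["chanA", "chanB"],
                            "patreon_id": ["patreon.com/a", "patreon.com/b"]})

# left join keeps every Graphtreon row; `_merge` records where each row came from,
# and validate raises if a patreon_id appears more than once in the match table
merged = gt.merge(matched_toy, left_on="patreon", right_on="patreon_id",
                  how="left", indicator=True, validate="many_to_one")

unmatched = merged[merged["_merge"] == "left_only"]
joined = merged[merged["_merge"] == "both"].drop(columns="_merge")

print("matched rows:  ", len(joined))
print("unmatched rows:", len(unmatched))
```

The `joined` frame is equivalent to the result of the default inner join, while `unmatched` lists the accounts that would otherwise disappear without notice.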

In [None]:
# join GT timeseries and matched channels
df_gt_timeseries_merged = df_gt_timeseries_filtered.merge(df_matched_channel_patreon, left_on='patreon', right_on='patreon_id')
df_gt_timeseries_merged.head(3)

#### 2.3.2. Filter/Restrict GT timeseries further
We now reduce the Graphtreon dataset by keeping only the rows whose channel is in the restricted list of channels (`chan_list_filt4`).

In [None]:
# filter Graphtreon dataset by keeping only rows in filtered list of channels (chan_list_filt4)
df_gt_timeseries_restricted = df_gt_timeseries_merged[df_gt_timeseries_merged['channel_id'].isin(chan_list_filt4)]

print("[Graphtreon Timeseries] Total number of patreon ids:                                                   {:>9,}".format(GT_final_processed_file_ROWS))
print("[Graphtreon Timeseries] Nb of patreon ids that exist in both GT Timeseries and YT metadata:            {:>9,} ({:.1%} of GT timeseries dataset)".format(len(df_gt_timeseries_filtered), len(df_gt_timeseries_filtered)/GT_final_processed_file_ROWS))
print("[Graphtreon Timeseries] Nb of patreon ids that exist in both GT Timeseries and YT metadata restricted  {:>9,} ({:.1%} of GT timeseries dataset)".format(len(df_gt_timeseries_restricted), len(df_gt_timeseries_restricted)/GT_final_processed_file_ROWS))


#### 2.3.3 Extract the date and daily earnings per patreon account
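
In the Graphtreon file, `dailyGraph_earningsSeriesData` is stored as a string encoding a list of `[date, value]` pairs, which the extraction script below parses with `ast.literal_eval`. The core of that step can be sketched on a single made-up record. The helper `explode_series` and the record values are assumptions; the dates are taken to be epoch milliseconds, as the later `unit='ms'` conversion suggests. Unlike the script, this sketch emits no row when the field is missing instead of a NaN row.

```python
import ast
import pandas as pd

def explode_series(record, field, value_name):
    """Hypothetical helper: turn one Graphtreon record into one row per day."""
    raw = record.get(field)
    # the field is a string such as "[[1577836800000, 120.5], ...]"
    pairs = ast.literal_eval(raw) if raw else []
    return [
        {"patreon": record.get("patreon"), "date": pair[0], value_name: pair[1]}
        for pair in pairs
        if isinstance(pair, list) and len(pair) == 2
    ]

# Made-up record with two daily data points
record = {
    "patreon": "patreon.com/example",
    "dailyGraph_earningsSeriesData": "[[1577836800000, 120.5], [1577923200000, 118.0]]",
}

rows = explode_series(record, "dailyGraph_earningsSeriesData", "earning")
df_example = pd.DataFrame(rows)
df_example["date"] = pd.to_datetime(df_example["date"], unit="ms")
print(df_example)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a crawled file.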

In [None]:
# get list of all unique patreon ids in df_gt_timeseries_restricted
yt_gt_patreon_list_restricted = df_gt_timeseries_restricted.patreon.unique()
print("list of restricted patreon ids", yt_gt_patreon_list_restricted)
print("number of restricted patreon ids", len(yt_gt_patreon_list_restricted))

In [None]:
df_gt_timeseries_restricted.head(1)

In [None]:
# example of NaN value
# df_gt_timeseries_sample[df_gt_timeseries_sample['creatorName'] == 'Comedy Trap House']

In [None]:
# # From the Graphtreon dataset, for each channel, extract the date and earnings from “dailyGraph_earningsSeriesData” (takes about 3 mins)
# input_file_path = DATA_FOLDER+"/final_processed_file.jsonl.gz"

# MAX_ITER = 100

# nb_rows_read = 0
# valid_predicate_count = 0
# JSONDecodeErrors_cnt = 0 
# dailyEarningsError_cnt = 0 
# lines_json = []    

# compressed_file_size = os.stat(input_file_path).st_size
# print("Compressed file size is :                 {:>8,.2f} GB".format(compressed_file_size / 2**30))

# uncompressed_file_size = 13_310_000_000
# print("Estimated Uncompressed file size is :     {:>8,.2f} GB".format(uncompressed_file_size / 2**30))

# start = timeit.default_timer()

# # Load tqdm with size counter instead of file counter
# with tqdm(total=uncompressed_file_size, unit='B', unit_scale=True, unit_divisor=1024) as pbar:
#     with gzip.open(input_file_path, "r") as f:
#         for i, line in enumerate(f): 

#             read_bytes = len(line)
#             if read_bytes:
#                 pbar.set_postfix(file=input_file_path[len(DATA_FOLDER)+1:], refresh=False)
#                 pbar.update(read_bytes)

#             nb_rows_read += 1
            
#             # set a maximum iteration for tests
#             if nb_rows_read >= MAX_ITER:
#                 break
    
#             try:
#                 line_json = json.loads(line)
#             except Exception as e:
#                 JSONDecodeErrors_cnt += 1
#                 continue
                
#             # add line if the patreon id exists in the restricted list (yt_gt_patreon_list_restricted)
#             if line_json['patreon'] in yt_gt_patreon_list_restricted:
#                 valid_predicate_count += 1
                
#                 # Use ast.literal_eval to convert string of lists, to list of list
#                 dailyGraph_earningsSeriesData = line_json.get('dailyGraph_earningsSeriesData')
                
#                 if dailyGraph_earningsSeriesData:
#                     daily_earnings = ast.literal_eval(dailyGraph_earningsSeriesData)
#                 else:
#                     daily_earnings = [[np.nan, np.nan]]
                                            
#                 for daily_earning in daily_earnings:
#                     # case where there are multiple tuples per row
#                     if isinstance(daily_earning, list):
#                         date = daily_earning[0]
#                         earning = daily_earning[1]
#                         lines_json.append({
#                             'creatorName':   line_json.get('creatorName'), 
#                             'creatorRange':  line_json.get('creatorRange'), 
#                             'startDate':     line_json.get('startDate'),
#                             'categoryTitle': line_json.get('categoryTitle'),
#                             'patreon':       line_json.get('patreon'),
#                             'date':          date,
#                             'earning':       earning
#                         })
#                     else:
#                         dailyEarningsError_cnt += 1
#                         print(">>>> dailyEarningsError - skipped line value: ")
#                         print(line_json.get('creatorName'), line_json.get('creatorRange'), line_json.get('startDate'), line_json.get('categoryTitle'), line_json.get('patreon'), daily_earnings)

# stop = timeit.default_timer()
# time_diff = stop - start

# print()
# print("==> total time to read and filter graphtreon time series:                      {:>10.0f} min. ({:.0f}s.)".format(time_diff/60, time_diff)) 
# print("==> number of rows read:                                                       {:>10,}".format(nb_rows_read))
# print("==> number of patreon ids that exist in both GTts and restricted YT metadata:  {:>10,} ({:.2%})".format(valid_predicate_count, valid_predicate_count/nb_rows_read ))
# print("==> number of skipped rows (JSONDecodeErrors):                                 {:>10,}".format(JSONDecodeErrors_cnt))
# print("==> number of skipped rows (dailyEarningsError):                               {:>10,}".format(dailyEarningsError_cnt))

# # create new dataframe with the filtered lines
# df_dailyGraph_earningsSeries = pd.DataFrame(data=lines_json)

GT_timeseries_date_earnings_extract_040422.jpg _(filter script above)_
<div>
    <img src="img/GT_timeseries_date_earnings_extract_040422.jpg" alt="GT_timeseries_date_earnings_extract_040422.jpg" />
</div>

In [None]:
# check for NaN values
# df_dailyGraph_earningsSeries[df_dailyGraph_earningsSeries.isna().any(axis=1)]

In [None]:
# save filtered data to LOCAL SCRATCH FOLDER as a compressed tsv (5.3Mb)
# output_file_path = LOCAL_DATA_FOLDER+"dailyGraph_earningsSeries.tsv.gz"
# df_dailyGraph_earningsSeries.to_csv(output_file_path, index=False, sep='\t', compression='gzip')

#### 2.3.4 Extract the date and daily patrons per patreon account

In [None]:
# # From the Graphtreon dataset, for each channel, extract the date and patrons from “dailyGraph_patronSeriesData” (takes about 3 mins)
# input_file_path = DATA_FOLDER+"/final_processed_file.jsonl.gz"

# # MAX_ITER = 1000

# nb_rows_read = 0
# valid_predicate_count = 0
# JSONDecodeErrors_cnt = 0 
# dailyPatronsError_cnt = 0 
# lines_json = []    

# compressed_file_size = os.stat(input_file_path).st_size
# print("Compressed file size is :                 {:>8,.2f} GB".format(compressed_file_size / 2**30))

# uncompressed_file_size = 13_310_000_000
# print("Estimated Uncompressed file size is :     {:>8,.2f} GB".format(uncompressed_file_size / 2**30))

# start = timeit.default_timer()

# # Load tqdm with size counter instead of file counter
# with tqdm(total=uncompressed_file_size, unit='B', unit_scale=True, unit_divisor=1024) as pbar:
#     with gzip.open(input_file_path, "r") as f:
#         for i, line in enumerate(f): 

#             read_bytes = len(line)
#             if read_bytes:
#                 pbar.set_postfix(file=input_file_path[len(DATA_FOLDER)+1:], refresh=False)
#                 pbar.update(read_bytes)

#             nb_rows_read += 1
            
#             # set a maximum iteration for tests
#             # if nb_rows_read >= MAX_ITER:
#             #     break
    
#             try:
#                 line_json = json.loads(line)
#             except Exception as e:
#                 JSONDecodeErrors_cnt += 1
#                 continue
                
#             # add line if the patreon id exists in the restricted list (yt_gt_patreon_list_restricted)
#             if line_json['patreon'] in yt_gt_patreon_list_restricted:
#                 valid_predicate_count += 1
                
#                 # Use ast.literal_eval to convert string of lists, to list of list
#                 dailyGraph_patronSeriesData = line_json.get('dailyGraph_patronSeriesData')
                
#                 if dailyGraph_patronSeriesData:
#                     daily_patrons = ast.literal_eval(dailyGraph_patronSeriesData)
#                 else:
#                     daily_patrons = [[np.nan, np.nan]]
                                            
#                 for daily_patron in daily_patrons:
#                     # case where there are multiple tuples per row
#                     if isinstance(daily_patron, list):
#                         date = daily_patron[0]
#                         patrons = daily_patron[1]
#                         lines_json.append({
#                             'creatorName':   line_json.get('creatorName'), 
#                             'creatorRange':  line_json.get('creatorRange'), 
#                             'startDate':     line_json.get('startDate'),
#                             'categoryTitle': line_json.get('categoryTitle'),
#                             'patreon':       line_json.get('patreon'),
#                             'date':          date,
#                             'patrons':       patrons
#                         })
#                     else:
#                         dailyPatronsError_cnt += 1
#                         print(">>>> dailyPatronsError - skipped line value: ")
#                         print(line_json.get('creatorName'), line_json.get('creatorRange'), line_json.get('startDate'), line_json.get('categoryTitle'), line_json.get('patreon'), daily_patrons)

# stop = timeit.default_timer()
# time_diff = stop - start

# print()
# print("==> total time to read and filter graphtreon time series:                      {:>10.0f} min. ({:.0f}s.)".format(time_diff/60, time_diff)) 
# print("==> number of rows read:                                                       {:>10,}".format(nb_rows_read))
# print("==> number of patreon ids that exist in both GTts and restricted YT metadata:  {:>10,} ({:.2%})".format(valid_predicate_count, valid_predicate_count/nb_rows_read ))
# print("==> number of skipped rows (JSONDecodeErrors):                                 {:>10,}".format(JSONDecodeErrors_cnt))
# print("==> number of skipped rows (dailyPatronsError):                               {:>10,}".format(dailyPatronsError_cnt))

# # create new dataframe with the filtered lines
# df_dailyGraph_patronsSeries = pd.DataFrame(data=lines_json)

GT_timeseries_date_patrons_extract_042922.jpg _(filter script above)_
<div>
    <img src="img/GT_timeseries_date_patrons_extract_042922.jpg" alt="GT_timeseries_date_patrons_extract_042922.jpg" />
</div>

In [None]:
# check for NaN values
# df_dailyGraph_patronsSeries[df_dailyGraph_patronsSeries.isna().any(axis=1)]

In [None]:
# save filtered data to LOCAL SCRATCH FOLDER as a compressed tsv (7.1Mb)
# output_file_path = LOCAL_DATA_FOLDER+"dailyGraph_patronsSeries.tsv.gz"
# df_dailyGraph_patronsSeries.to_csv(output_file_path, index=False, sep='\t', compression='gzip')

#### 2.3.5 Merge extracted time series of daily earnings and daily patrons
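
When `merge` is called without `on`, pandas joins on every column the two frames share. A hedged sketch of the same outer join with explicit keys, on toy frames whose values are assumptions, showing why the `patrons` column needs the nullable `Int64` dtype afterwards:

```python
import pandas as pd

# Toy frames (hypothetical values): the two series only partly overlap in time
earnings = pd.DataFrame({"patreon": ["patreon.com/a", "patreon.com/a"],
                         "date": [1577836800000, 1577923200000],
                         "earning": [10.0, 11.0]})
patrons = pd.DataFrame({"patreon": ["patreon.com/a", "patreon.com/a"],
                        "date": [1577923200000, 1578009600000],
                        "patrons": [5, 6]})

# outer join keeps dates that appear in only one of the two series
merged = (earnings.merge(patrons, on=["patreon", "date"], how="outer")
                  .sort_values(["patreon", "date"])
                  .reset_index(drop=True))

# the missing values turned `patrons` into floats; Int64 restores integers and keeps <NA>
merged["patrons"] = merged["patrons"].astype("Int64")
print(merged)
print(merged.dtypes)
```

Naming the keys explicitly avoids surprises if the two frames ever disagree on a descriptive column such as `creatorName`, which would otherwise split one day into two rows.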

In [None]:
!ls -lh {LOCAL_DATA_FOLDER}dailyGraph_earningsSeries.tsv.gz

In [None]:
# read dailyGraph_earningsSeries file from disk (dates are kept as raw timestamps; they are converted in section 2.4)
df_dailyGraph_earningsSeries = pd.read_csv(LOCAL_DATA_FOLDER+"dailyGraph_earningsSeries.tsv.gz", sep="\t", compression='gzip')
# df_dailyGraph_earningsSeries.date = pd.to_datetime(df_dailyGraph_earningsSeries.date, unit='ms')
df_dailyGraph_earningsSeries

In [None]:
!ls -lh {LOCAL_DATA_FOLDER}dailyGraph_patronsSeries.tsv.gz

In [None]:
# read dailyGraph_patronsSeries from disk (dates are kept as raw timestamps; they are converted in section 2.4)
df_dailyGraph_patronsSeries = pd.read_csv(LOCAL_DATA_FOLDER+"dailyGraph_patronsSeries.tsv.gz", sep="\t", compression='gzip')
# df_dailyGraph_patronsSeries.date = pd.to_datetime(df_dailyGraph_patronsSeries.date, unit='ms')
df_dailyGraph_patronsSeries

In [None]:
# join dailyGraph_earningsSeries with df_dailyGraph_patronsSeries
df_dailyGraph_patrons_and_earnings_Series = df_dailyGraph_earningsSeries.merge(df_dailyGraph_patronsSeries, how='outer')

# convert patrons column to Int64 so it can hold NaN values after outer join
df_dailyGraph_patrons_and_earnings_Series['patrons'] = df_dailyGraph_patrons_and_earnings_Series['patrons'].astype('Int64')
df_dailyGraph_patrons_and_earnings_Series.head()

In [None]:
# save filtered data to LOCAL SCRATCH FOLDER as a compressed tsv (6.2Mb)
# output_file_path = LOCAL_DATA_FOLDER+"dailyGraph_patrons_and_earnings_Series.tsv.gz"
# df_dailyGraph_patrons_and_earnings_Series.to_csv(output_file_path, index=False, sep='\t', compression='gzip')

### 2.4 Plots

In [None]:
!ls -lh {LOCAL_DATA_FOLDER}dailyGraph_patrons_and_earnings_Series.tsv.gz

In [None]:
# read merged dailyGraph_patrons_and_earnings_Series from disk
df_dailyGraph_patrons_and_earnings_Series = pd.read_csv(LOCAL_DATA_FOLDER+"dailyGraph_patrons_and_earnings_Series.tsv.gz", sep="\t", compression='gzip')
df_dailyGraph_patrons_and_earnings_Series['date'] = pd.to_datetime(df_dailyGraph_patrons_and_earnings_Series['date'], unit='ms')
df_dailyGraph_patrons_and_earnings_Series['patrons'] = df_dailyGraph_patrons_and_earnings_Series['patrons'].astype('Int64')

print(df_dailyGraph_patrons_and_earnings_Series.dtypes)
df_dailyGraph_patrons_and_earnings_Series


#### 2.4.1 Plot Patreon Time Series

In [None]:
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y')

In [None]:
# RE-declare global variable for size of original GT dataset
GT_final_processed_file_ROWS = 232_269

##### Restrict to the top 20 highest-earning Patreon accounts

In [None]:
TOP_CNT = 20
# group by patreon account
dailyGraph_grp_patreon = df_dailyGraph_patrons_and_earnings_Series.groupby('patreon')\
                                                     .agg(date_cnt=('date', 'count'),
                                                          earliest_date=('date', 'min'),
                                                          lastest_date=('date', 'max'),
                                                          daily_earning_mean=('earning', 'mean'),
                                                          daily_earning_max=('earning', 'max'))\
                                                     .sort_values(by=['daily_earning_max'], ascending=False)\
                                                     .reset_index()\
                                                     .round(2)

# remove hours from dates
dailyGraph_grp_patreon.earliest_date = dailyGraph_grp_patreon.earliest_date.dt.date
dailyGraph_grp_patreon.lastest_date = dailyGraph_grp_patreon.lastest_date.dt.date

dailyGraph_grp_patreon

# extract the TOP_CNT highest-earning patreon accounts (ranked by max daily earnings)
top_patreons = dailyGraph_grp_patreon[:TOP_CNT]['patreon']

print("[Graphtreon Timeseries] Total number of patreon ids (original file):                      {:>9,}".format(GT_final_processed_file_ROWS))
print("[Graphtreon Timeseries] Nb of patreon ids in dailyGraph patreon + earnings time series:   {:>9,} ({:.1%} of original dataset)".format(len(dailyGraph_grp_patreon), len(dailyGraph_grp_patreon)/GT_final_processed_file_ROWS))

print()

dailyGraph_grp_patreon[:TOP_CNT].style.set_caption(f"Top {TOP_CNT} highest-earning Patreon accounts (sorted by max daily earnings)")



In [None]:
df_top_pt_daily_earnings = df_dailyGraph_patrons_and_earnings_Series[df_dailyGraph_patrons_and_earnings_Series['patreon'].isin(top_patreons)]
df_top_pt_daily_earnings

In [None]:
years = mdates.YearLocator()   # every year
months = mdates.MonthLocator()  # every month

# plot Patreon daily earningsSeriesData for top patreon accounts
fig, axs = plt.subplots(TOP_CNT // 2, 2, figsize=(12, TOP_CNT*1.2), sharey=False, sharex=False)
for idx, patreon in enumerate(tqdm(top_patreons)):
    row = idx // 2
    col = idx % 2
    ax1 = axs[row, col]
    
    # ax1.scatter(x[:4], y[:4], s=10, c='b', marker="s", label='first')
    # ax1.scatter(x[40:],y[40:], s=10, c='r', marker="o", label='second')

    tmp_df = df_top_pt_daily_earnings[df_top_pt_daily_earnings['patreon'] == patreon]

    # sbplt = axs[idx, 0]
    

    color = 'tab:blue'
    patrons, = ax1.plot(tmp_df['date'], tmp_df['patrons'], color=color, label='patrons')
    ax1.set_ylabel('# Patrons', color=color) 
    ax1.tick_params(axis='y', labelcolor=color)
    ax1.set(title=patreon)
    
    
    color = 'tab:orange'
    ax2 = ax1.twinx()  # Create a twin Axes sharing the xaxis.
    earnings, = ax2.plot(tmp_df['date'], tmp_df['earning'], color=color, label='earnings')
    ax2.set_ylabel("Earnings per month", color=color) 
    ax2.tick_params(axis='y', labelcolor=color)
    
    ax1.xaxis.set_major_locator(years)
    ax1.xaxis.set_major_formatter(years_fmt)
    ax1.xaxis.set_minor_locator(months)
    # ax1.legend(handles=[earnings, patrons], loc='upper left');
    
fig.suptitle(f'Timeseries of the top {TOP_CNT} highest-earning Patreon accounts \n (earnings per month in dollars)', fontweight="bold")
fig.text(0.5,0, 'Month')
# fig.text(0,0.5, 'Earnings per month ($)', rotation = 90)
fig.tight_layout(pad=3, w_pad=5, h_pad=2)

**Observation:**
Earnings drop at the beginning of each month, presumably because patrons unsubscribe at that time.
Averaging the series over a longer window would smooth out these dips.
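
One way to do the averaging mentioned above is a centered rolling mean over the daily series. The sketch below is only an illustration on synthetic data: the dates, the earnings values, and the 30-day window are assumptions, not values taken from the Graphtreon dataset.

```python
import pandas as pd

# Hypothetical daily earnings with a dip on the first three days of each month
dates = pd.date_range("2020-01-01", "2020-03-31", freq="D")
values = [700.0 if d.day <= 3 else 1000.0 for d in dates]
earning = pd.Series(values, index=dates)

# 30-day centered rolling mean; min_periods=1 keeps the edges defined
smoothed = earning.rolling(window=30, center=True, min_periods=1).mean()

print(earning.min(), round(smoothed.min(), 1))
```

The window length trades smoothness against responsiveness: a longer window hides the start-of-month dips more completely but also blurs genuine shocks.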

In [None]:
# analyse 1 account in detail
patreon_account = 'patreon.com/tinymeatgang'

with pd.option_context('display.max_rows', 90, 'display.min_rows', 90):
    display(df_top_pt_daily_earnings[(df_top_pt_daily_earnings['patreon'] == patreon_account) 
                                     # & (df_top_pt_daily_earnings['date'] > pd.Timestamp('2021-01-01'))
                                    ].head(20))
        
df_top_pt_daily_earnings.dtypes

# check for NaN values
# df_top_pt_daily_earnings[df_top_pt_daily_earnings.isna().any(axis=1)]

##### Detect breaks / shocks
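
As a rough illustration of what a single-break search with an L1 cost does, the sketch below brute-forces the split that minimises the summed absolute deviation from each segment's median. It is a simplified stand-in written with the standard library, not the `ruptures` implementation; `l1_cost`, `best_single_break`, `min_size`, and the `signal` values are hypothetical names and data chosen for this example.

```python
from statistics import median


def l1_cost(segment):
    """Sum of absolute deviations from the segment median."""
    m = median(segment)
    return sum(abs(v - m) for v in segment)


def best_single_break(signal, min_size=2):
    """Return the index k that minimises l1_cost(signal[:k]) + l1_cost(signal[k:])."""
    candidates = range(min_size, len(signal) - min_size + 1)
    return min(candidates, key=lambda k: l1_cost(signal[:k]) + l1_cost(signal[k:]))


# Hypothetical patron counts: flat around 100, then a jump to around 150
signal = [100, 101, 99, 100, 102, 150, 151, 149, 152, 150]
k = best_single_break(signal)
print(k, signal[k - 1], signal[k])
```

The returned index is the position of the first sample of the second segment, so the last sample before the break is `signal[k-1]`. The cells in this notebook apply the same `i-1` shift when mapping break indices to dates.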

In [None]:
# n_breaks = 3
# model = rpt.Dynp(model="l1")

# # plot Patreon daily earningsSeriesData for top patreon accounts
# fig, axs = plt.subplots(int(TOP_CNT/2), 2, figsize=(12, TOP_CNT*1.2), sharey=False, sharex=False)
# for idx, patreon in enumerate(top_patreons):
#     print(patreon)
#     row = math.floor(idx/2)
#     col = idx % 2
#     sbplt = axs[row, col]

#     tmp_df = df_top_pt_daily_earnings[df_top_pt_daily_earnings['patreon'] == patreon]

#     # convert the dataframe into a time series.
#     ts_df = tmp_df.set_index(tmp_df['date'])
#     ts = ts_df['earning']

#     y = np.array(ts.tolist())

#     model.fit(y)
#     breaks = model.predict(n_bkps=n_breaks-1)
    
#     breaks_rpt = []
#     for i in breaks:
#         breaks_rpt.append(ts.index[i-1])
#     breaks_rpt = pd.to_datetime(breaks_rpt)
    
#     sbplt.plot(ts, label='data')
#     sbplt.set(title=patreon)
#     print_legend = True
#     for i in breaks_rpt:
#         if print_legend:
#             sbplt.axvline(i, color='red',linestyle='dashed', label='breaks')
#             print_legend = False
#         else:
#             sbplt.axvline(i, color='red',linestyle='dashed')

#     sbplt.xaxis.set_major_locator(years)
#     sbplt.xaxis.set_major_formatter(years_fmt)
#     sbplt.xaxis.set_minor_locator(months)
#     sbplt.xaxis.grid(color="#CCCCCC", ls=":")
#     sbplt.yaxis.grid(color="#CCCCCC", ls=":")
    
    
# fig.suptitle(f'Timeseries of the top {TOP_CNT} highest-earning Patreon accounts \n (earnings per month in dollars)', fontweight="bold")
# fig.text(0.5,0, 'Month')
# fig.text(0,0.5, 'Earnings per month ($)', rotation = 90)
# fig.tight_layout(pad=3, w_pad=5, h_pad=2)

#### 2.4.2 Plot YouTube timeseries for channels matching top Patreon accounts

In [None]:
# load matching dataframe
df_matched_channel_patreon = pd.read_csv(LOCAL_DATA_FOLDER+"df_matched_channel_patreon.tsv.gz", sep="\t", compression="gzip")

In [None]:
# add patreon_id column to YT timeseries
df_yt_timeseries_filt4_merged = df_yt_timeseries_filt4.merge(df_matched_channel_patreon, left_on='channel', right_on='channel_id')
df_yt_timeseries_filt4_merged.head(1)

In [None]:
# filter channels matching top patreon accounts
df_yt_timeseries_top_pt = df_yt_timeseries_filt4_merged[df_yt_timeseries_filt4_merged['patreon_id'].isin(top_patreons)]


print('[YouTube Timeseries] Time range after applying filter 1+2+3+4              {} and {}'.format(df_yt_timeseries_filt4['datetime'].min().strftime('%B %d, %Y'),
                                                                                                    df_yt_timeseries_filt4['datetime'].max().strftime('%B %d, %Y')))

print('[YouTube Timeseries] Time range after matching top patreon accounts        {} and {}'.format(df_yt_timeseries_top_pt['datetime'].min().strftime('%B %d, %Y'),
                                                                                                    df_yt_timeseries_top_pt['datetime'].max().strftime('%B %d, %Y')))

top_yt_patreons = df_yt_timeseries_top_pt.patreon_id.unique()
top_yt_patreons

In [None]:
df_yt_timeseries_top_pt.groupby(['patreon_id', 'channel_id'])\
                                                     .agg(datetime_cnt=('datetime', 'count'),
                                                          date_min=('datetime', 'min'),
                                                          date_max=('datetime', 'max'),
                                                          views_max=('views', 'max'),
                                                          subs_max=('subs', 'max'),
                                                          videos_max=('videos', 'max'))\
                                                     .sort_values(by=['videos_max'], ascending=False)
                                                     #      \
                                                     # .reset_index()\
                                                     # .round(2)

In [None]:
df_yt_timeseries_top_pt.head(3)

In [None]:
# # plot YT cumulative views timeseries for top patreon accounts
# fig, axs = plt.subplots(int(math.ceil(len(top_yt_patreons)/2)), 2, figsize=(12, len(top_yt_patreons)*1.2), sharey=False, sharex=False)
# for idx, patreon in enumerate(top_yt_patreons):
#     row = math.floor(idx/2)
#     col = idx % 2
#     sbplt = axs[row, col]

#     tmp_df = df_yt_timeseries_top_pt[df_yt_timeseries_top_pt['patreon_id'] == patreon]

#     sbplt.plot(tmp_df['datetime'], tmp_df['views'])
#     sbplt.set(title=patreon+"\n"+tmp_df['channel'].iloc[0])
#     sbplt.xaxis.set_major_locator(years)
#     sbplt.xaxis.set_major_formatter(years_fmt)
#     sbplt.xaxis.set_minor_locator(months)
    
    
# fig.suptitle(f'YouTube timeseries of the channels corresponding to the top {TOP_CNT} highest-earning Patreon accounts \n (YT views per week)', fontweight="bold")
# fig.text(0.5,0, 'Week')
# fig.text(0,0.5, 'Views', rotation = 90)
# fig.tight_layout(pad=3, w_pad=5, h_pad=2)

In [None]:
# # plot YT views timeseries for top patreon accounts
# print(f'YouTube views per week timeseries per channel (for the top {TOP_CNT} highest-earning Patreon accounts)')

# for idx, patreon in enumerate(top_yt_patreons):
#     fig, axs = plt.subplots(1, 2, figsize=(12, 3), sharey=False, sharex=True)
#     row = idx    

#     tmp_df = df_yt_timeseries_top_pt[df_yt_timeseries_top_pt['patreon_id'] == patreon]

#     # delta views per week
#     sbplt = axs[0]
#     sbplt.plot(tmp_df['datetime'], tmp_df['delta_views'])
#     sbplt.set(title="YouTube delta views per week")
#     sbplt.set_xlabel('Week')
#     sbplt.set_ylabel('Delta Views')
#     sbplt.xaxis.set_major_locator(years)
#     sbplt.xaxis.set_major_formatter(years_fmt)
#     sbplt.xaxis.set_minor_locator(months)
    
#     # cumulative views per week
#     sbplt = axs[1]
#     sbplt.plot(tmp_df['datetime'], tmp_df['views'])
#     sbplt.set(title="YouTube cumulative views per week")
#     sbplt.set_xlabel('Week')
#     sbplt.set_ylabel('Views')
#     sbplt.xaxis.set_major_locator(years)
#     sbplt.xaxis.set_major_formatter(years_fmt)
#     sbplt.xaxis.set_minor_locator(months)
    
    
#     print(f"{idx+1}: https://youtube.com/channel/{tmp_df['channel'].iloc[0]} \t\t {patreon}")
#     fig.suptitle(f"{patreon}\n{tmp_df['channel'].iloc[0]}", fontweight="bold") 
#     fig.tight_layout(w_pad=5)
    

In [None]:
# # explore the descending drop in cumulative views per week for sonicether channel 
# with pd.option_context('display.max_rows', 90, 'display.min_rows', 90):
#     display(df_yt_timeseries_top_pt[
#         (df_yt_timeseries_top_pt['channel_id'] == 'UCAbpj6UljjAz7JvJt-yJIjg') 
#      & (df_yt_timeseries_top_pt['datetime'] > pd.Timestamp('2018-04-08'))
#      & (df_yt_timeseries_top_pt['datetime'] < pd.Timestamp('2018-06-01'))
#     ].head(90))

In [None]:
# # plot YT subscriptions timeseries for top patreon accounts
# fig, axs = plt.subplots(int(math.ceil(len(top_yt_patreons)/2)), 2, figsize=(12, len(top_yt_patreons)*1.2), sharey=False, sharex=False)
# for idx, patreon in enumerate(top_yt_patreons):
#     row = math.floor(idx/2)
#     col = idx % 2
#     sbplt = axs[row, col]

#     tmp_df = df_yt_timeseries_top_pt[df_yt_timeseries_top_pt['patreon_id'] == patreon]

#     sbplt.plot(tmp_df['datetime'], tmp_df['subs'])
#     sbplt.set(title=patreon+"\n"+tmp_df['channel'].iloc[0])
#     sbplt.xaxis.set_major_locator(years)
#     sbplt.xaxis.set_major_formatter(years_fmt)
#     sbplt.xaxis.set_minor_locator(months)
    
    
# fig.suptitle(f'YouTube timeseries of the channels corresponding to the top {TOP_CNT} highest-earning Patreon accounts \n (YT subscriptions per week)', fontweight="bold")
# fig.text(0.5,0, 'Week')
# fig.text(0,0.5, 'Subscriptions', rotation = 90)
# fig.tight_layout(pad=3, w_pad=5, h_pad=2)

In [None]:
# # plot YT # videos timeseries for top patreon accounts

# fig, axs = plt.subplots(int(math.ceil(len(top_yt_patreons)/2)), 2, figsize=(12, len(top_yt_patreons)*1.2), sharey=False, sharex=False)
# for idx, patreon in enumerate(top_yt_patreons):
#     row = math.floor(idx/2)
#     col = idx % 2
#     sbplt = axs[row, col]

#     tmp_df = df_yt_timeseries_top_pt[df_yt_timeseries_top_pt['patreon_id'] == patreon]

#     sbplt.plot(tmp_df['datetime'], tmp_df['videos'])
#     sbplt.set(title=patreon+"\n"+tmp_df['channel'].iloc[0])
#     sbplt.xaxis.set_major_locator(years)
#     sbplt.xaxis.set_major_formatter(years_fmt)
#     sbplt.xaxis.set_minor_locator(months)
    
# fig.suptitle(f'YouTube timeseries of the channels corresponding to the top {TOP_CNT} highest-earning Patreon accounts \n (YT videos per week)', fontweight="bold")
# fig.text(0.5,0, 'Week')
# fig.text(0,0.5, 'Videos', rotation = 90)
# fig.tight_layout(pad=3, w_pad=5, h_pad=2)

#### 2.4.3 Compare YouTube and top Patreon timeseries
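
The comparison in this section is restricted to Patreon accounts that are linked to exactly one YouTube channel. A minimal sketch of that filter on toy data, assuming the same `patreon_id` and `channel_id` column names (the `toy` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical matches: account "a" maps to two channels, "b" and "c" to one each
toy = pd.DataFrame({
    "patreon_id": ["a", "a", "a", "b", "b", "c"],
    "channel_id": ["ch1", "ch1", "ch2", "ch3", "ch3", "ch4"],
})

# count distinct channels per account, then keep accounts with exactly one
channels_per_account = toy.groupby("patreon_id")["channel_id"].nunique()
single_channel_accounts = channels_per_account[channels_per_account == 1].index

print(list(single_channel_accounts))
```

Counting distinct channels rather than rows matters here, because each channel contributes one row per week to the time series.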

In [None]:
# keep only patreon accounts that are matched to exactly 1 youtube channel
df_yt_timeseries_top_pt_chan_id_cnt = df_yt_timeseries_top_pt.groupby('patreon_id').agg(channel_id_cnt=("channel_id", "nunique"))
df_yt_timeseries_top_pt_unique_chan = df_yt_timeseries_top_pt_chan_id_cnt[df_yt_timeseries_top_pt_chan_id_cnt['channel_id_cnt']==1]

top_patreons_unique_chan = df_yt_timeseries_top_pt_unique_chan.index
top_patreons_unique_chan.size

In [None]:
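def KM(x, pos):
    """Format a tick value with a K/M suffix; the two args are the value and tick position."""
    # use abs() so that negative deltas are abbreviated as well
    if abs(x) > 999_999:
        return '%2.1fM' % (x * 1e-6)
    elif abs(x) > 999:
        return '%2.1fK' % (x * 1e-3)
    else:
        return '%3.0f ' % (x)
KM_formatter = FuncFormatter(KM)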

In [None]:
# compare YouTube and Patreon timeseries for top patreon accounts
n_breaks = 1
model = rpt.Dynp(model="l1")
date_offset = pd.DateOffset(months=1)
week_offset = pd.DateOffset(weeks=1)

for idx, patreon in enumerate(top_patreons_unique_chan):
    fig, axs = plt.subplots(5, 4, figsize=(26, 10), sharey=False, sharex=False)

    # patreon earnings and users
    tmp_df_pt = df_top_pt_daily_earnings[df_top_pt_daily_earnings['patreon'] == patreon]
    # tmp_df_pt = df_top_pt_daily_earnings[(df_top_pt_daily_earnings['patreon'] == patreon) & (df_top_pt_daily_earnings['earning'].notna())]
    
    # youtube videos
    tmp_df_yt = df_yt_timeseries_top_pt[df_yt_timeseries_top_pt['patreon_id'] == patreon]

    # set min and max dates for plots   
    date_min = max([tmp_df_yt['datetime'].min(), tmp_df_pt['date'].min()])
    date_max = min([tmp_df_yt['datetime'].max(), tmp_df_pt['date'].max()])

    # restrict datasets between min and max dates
    tmp_df_pt = tmp_df_pt[(tmp_df_pt['date'] >= date_min) & (tmp_df_pt['date'] <= date_max)]
    tmp_df_yt = tmp_df_yt[(tmp_df_yt['datetime'] >= date_min) & (tmp_df_yt['datetime'] <= date_max)]
    
    
    # patreon earnings with breaks/shocks
    # code inspired by https://towardsdatascience.com/getting-started-with-breakpoints-analysis-in-python-124471708d38

    # convert the dataframe into a time series.
    ts_pt_df = tmp_df_pt.set_index(tmp_df_pt['date'])
    ts_pt = ts_pt_df['patrons']
    y = np.array(ts_pt.tolist())
    
    # train the model
    model.fit(y)
    
    # get the breakpoints
    breaks = model.predict(n_breaks)
    breaks_rpt = []
    for i in breaks[:-1]:
        breaks_rpt.append(ts_pt.index[i-1])

        
    # plot number of patrons (twice the same plot: 1 for each column)
    axs[0,0].plot(ts_pt)
    axs[0,0].set(title="Number of patrons")
    axs[0,0].set_ylabel("# Patrons")    
    
    axs[0,1].plot(ts_pt)
    axs[0,1].set(title="Number of patrons")
    axs[0,1].set_ylabel("# Patrons")
    

    # plot vertical lines for each breakpoint
    print_legend = True
    for breakpoint in breaks_rpt:
        for i in range(axs.shape[0]):
            for j in range(axs.shape[1]):
                if print_legend:
                    axs[i,j].axvline(breakpoint, color='red', linestyle='--', label='breaks', linewidth=2.5)
                    axs[i,j].axvline(breakpoint - date_offset*2, color='pink', linestyle=':', label='- ' + str(2*date_offset.months)+' months', linewidth=2)
                    axs[i,j].axvline(breakpoint - date_offset, color='green', linestyle=':', label='- ' + str(date_offset.months)+' months', linewidth=2)
                    axs[i,j].axvline(breakpoint + date_offset, color='orange', linestyle=':', label='+' + str(date_offset.months)+' months', linewidth=2)          
                    print_legend = False
                else:
                    axs[i,j].axvline(breakpoint, color='red', linestyle='--', linewidth=2.5)
                    axs[i,j].axvline(breakpoint - date_offset*2, color='pink', linestyle=':', linewidth=2)
                    axs[i,j].axvline(breakpoint - date_offset, color='green', linestyle=':', linewidth=2)
                    axs[i,j].axvline(breakpoint + date_offset, color='orange', linestyle=':', linewidth=2)

    axs[0,0].legend()
    # axs[0,1].legend()

    # model_2 = "l2"
    # signal = ts_pt.values
    # algo = rpt.n(model=model_2).fit(signal)
    # my_bkps = algo.predict(n_bkps=1)
    # rpt.show.display(signal, my_bkps, figsize=(6, 2))

    # plot patreon earnings (twice the same plot: 1 for each column)
    axs[1,0].plot(ts_pt_df['date'], ts_pt_df['earning'])
    axs[1,0].set(title="Patreon earnings per month")
    axs[1,0].set_ylabel("Earnings")    
    
    axs[1,1].plot(ts_pt_df['date'], ts_pt_df['earning'])
    axs[1,1].set(title="Patreon earnings per month")
    axs[1,1].set_ylabel("Earnings")
    
    
    # youtube videos (delta)
    # axs[2,0].plot(tmp_df_yt['datetime'], tmp_df_yt['delta_videos'], 'r')
    # axs[2,0].scatter(tmp_df_yt['datetime'], tmp_df_yt['delta_videos'], c='r', s=3, marker='o')
    axs[2,0].scatter(tmp_df_yt['datetime'], tmp_df_yt['delta_videos'], c='r', s=30, marker='+')
    axs[2,0].set(title="YouTube delta videos per week")
    axs[2,0].set_ylabel("Δ Videos")

    # youtube videos (cumulative)
    axs[2,1].plot(tmp_df_yt['datetime'], tmp_df_yt['videos'], 'r')
    axs[2,1].set(title="YouTube cumulative videos")
    axs[2,1].set_ylabel("# Videos")

    
    # youtube views (delta)
    # axs[3,0].plot(tmp_df_yt['datetime'], tmp_df_yt['delta_views'], 'g')
    axs[3,0].scatter(tmp_df_yt['datetime'], tmp_df_yt['delta_views'], c='g', s=30, marker='+')
    axs[3,0].set(title="YouTube delta views per week")
    axs[3,0].set_ylabel("Δ Views")

    # youtube views (cumulative)
    axs[3,1].plot(tmp_df_yt['datetime'], tmp_df_yt['views'], 'g')
    axs[3,1].set(title="YouTube cumulative views")
    axs[3,1].set_ylabel("# Views")

    
    # youtube subs (delta)
    # axs[4,0].plot(tmp_df_yt['datetime'], tmp_df_yt['delta_subs'], 'm')
    axs[4,0].scatter(tmp_df_yt['datetime'], tmp_df_yt['delta_subs'], c='m', s=30, marker='+')

    axs[4,0].set(title="YouTube delta subscriptions per week")
    axs[4,0].set_ylabel("Δ Subscriptions")

    # youtube subs (cumulative)
    axs[4,1].plot(tmp_df_yt['datetime'], tmp_df_yt['subs'], 'm')
    axs[4,1].set(title="YouTube cumulative subscriptions")
    axs[4,1].set_ylabel("# Subscriptions")

    
        
    ################################### ZOOM IN THE BREAK PERIOD ###################################

    date_min_zoom = breaks_rpt[0] - date_offset*2 - week_offset
    date_max_zoom = breaks_rpt[0] + date_offset*1 + week_offset
            
    # restrict datasets between min and max dates
    tmp_df_pt_zoomed = tmp_df_pt[(tmp_df_pt['date'] >= date_min_zoom) & (tmp_df_pt['date'] <= date_max_zoom)]
    tmp_df_yt_zoomed = tmp_df_yt[(tmp_df_yt['datetime'] >= date_min_zoom) & (tmp_df_yt['datetime'] <= date_max_zoom)]
    
    
    # zoomed in patrons
    axs[0,2].plot(tmp_df_pt_zoomed['date'], tmp_df_pt_zoomed['patrons'])
    axs[0,2].set(title="Number of patrons (zoomed in)")
    axs[0,2].set_ylabel("# Patrons")
    
    
    axs[0,3].plot(tmp_df_pt_zoomed['date'], tmp_df_pt_zoomed['patrons'])
    axs[0,3].set(title="Number of patrons (zoomed in)")
    axs[0,3].set_ylabel("# Patrons")
    
            
    # zoomed in patreon earnings (twice the same plot: 1 for each column)
    axs[1,2].plot(tmp_df_pt_zoomed['date'], tmp_df_pt_zoomed['earning'])
    axs[1,2].set(title="Patreon earnings per month (zoomed in)")
    axs[1,2].set_ylabel("Earnings")    
    
    axs[1,3].plot(tmp_df_pt_zoomed['date'], tmp_df_pt_zoomed['earning'])
    axs[1,3].set(title="Patreon earnings per month (zoomed in)")
    axs[1,3].set_ylabel("Earnings")
    
    
    # zoomed in youtube videos (delta)
    # axs[2,0].plot(tmp_df_yt['datetime'], tmp_df_yt['delta_videos'], 'r')
    # axs[2,0].scatter(tmp_df_yt['datetime'], tmp_df_yt['delta_videos'], c='r', s=3, marker='o')
    axs[2,2].scatter(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['delta_videos'], c='r', s=30, marker='+')
    axs[2,2].set(title="YouTube delta videos per week (zoomed in)")
    axs[2,2].set_ylabel("Δ Videos")

    # zoomed in youtube videos (cumulative)
    axs[2,3].plot(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['videos'], 'r')
    axs[2,3].set(title="YouTube cumulative videos (zoomed in)")
    axs[2,3].set_ylabel("# Videos")

    
    # zoomed in youtube views (delta)
    # axs[3,2].plot(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['delta_views'], 'g')
    axs[3,2].scatter(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['delta_views'], c='g', s=30, marker='+')
    axs[3,2].set(title="YouTube delta views per week (zoomed in)")
    axs[3,2].set_ylabel("Δ Views")

    # zoomed in youtube views (cumulative)
    axs[3,3].plot(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['views'], 'g')
    axs[3,3].set(title="YouTube cumulative views (zoomed in)")
    axs[3,3].set_ylabel("# Views")

    
    # zoomed in youtube subs (delta)
    # axs[4,2].plot(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['delta_subs'], 'm')
    axs[4,2].scatter(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['delta_subs'], c='m', s=30, marker='+')

    axs[4,2].set(title="YouTube delta subscriptions per week (zoomed in)")
    axs[4,2].set_ylabel("Δ Subscriptions")

    # zoomed in youtube subs (cumulative)
    axs[4,3].plot(tmp_df_yt_zoomed['datetime'], tmp_df_yt_zoomed['subs'], 'm')
    axs[4,3].set(title="YouTube cumulative subscriptions (zoomed in)")
    axs[4,3].set_ylabel("# Subscriptions")
            
            
            
            
    # format the axes
    for i in range(axs.shape[0]):
        for j in range(axs.shape[1]):
            if j < 2:
                axs[i,j].set_xlim([date_min, date_max])
                axs[i,j].xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
                axs[i,j].xaxis.set_major_locator(mdates.YearLocator())
                axs[i,j].xaxis.set_minor_locator(mdates.MonthLocator())
            if j >= 2:
                axs[i,j].set_xlim([date_min_zoom, date_max_zoom])
                axs[i,j].xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
                axs[i,j].xaxis.set_major_locator(mdates.MonthLocator())
                # axs[i,j].xaxis.set_minor_locator(mdates.WeekdayLocator())
            axs[i,j].xaxis.grid(color="#CCCCCC", ls=":")
            axs[i,j].yaxis.grid(color="#CCCCCC", ls=":")
            
            axs[i,j].yaxis.set_major_formatter(KM_formatter)
            
            
            
    # print YT channels URL, patreon ids, and titles of plots
    ch_ids = tmp_df_yt['channel'].unique()

    print(f"\t\033[1m Rank {idx+1}: {patreon[12:]} \033[0m")
    print(f"\t https://www.{patreon}")
    print(f"\t https://graphtreon.com/creator/{patreon[12:]}")
    
    # print channel(s) related to this patreon account
    for ch_id in ch_ids:
        print(f"\t https://youtube.com/channel/{ch_id}")
    print()
    
    
#     nb_patrons_bkpoint =        tmp_df_pt[tmp_df_pt['date'] == breakpoint].patrons.values[0] 
#     nb_patrons_bkpoint_befor = tmp_df_pt[tmp_df_pt['date'] == breakpoint - date_offset].patrons.values[0] 
#     nb_patrons_bkpoint_after = tmp_df_pt[tmp_df_pt['date'] == breakpoint + date_offset].patrons.values[0] 

#     diff_after_break = nb_patrons_bkpoint_after - nb_patrons_bkpoint
#     diff_before_break = nb_patrons_bkpoint - nb_patrons_bkpoint_befor

#     print("Number of patrons 1 month before breakpoint:    {:2}".format(nb_patrons_bkpoint_befor))
#     print("Increase of patrons 1 month before breakpoint:  {:2}".format(diff_before_break))
#     print("----------------------------------------------------")
#     print("Number of patrons at breakpoint:                {:2}".format(nb_patrons_bkpoint))
#     print("Increase of patrons 1 month after breakpoint:   {:2}".format(diff_after_break))
#     print("----------------------------------------------------")
#     print("Number of patrons 1 month after breakpoint:     {:2}".format(nb_patrons_bkpoint_after))
#     print()
#     print("change of increase between 1 month before and 1 month after: {:.1%}".format((diff_after_break/diff_before_break) - 1))


    tmp_df_pt_prior = tmp_df_pt[(tmp_df_pt['date'] >= breakpoint - 2*date_offset) & (tmp_df_pt['date'] <= breakpoint - date_offset)]
    # display(tmp_df_pt_prior)

    tmp_df_pt_befor = tmp_df_pt[(tmp_df_pt['date'] >= breakpoint - date_offset) & (tmp_df_pt['date'] <= breakpoint)]
    # display(tmp_df_pt_befor)

    tmp_df_pt_after = tmp_df_pt[(tmp_df_pt['date'] >= breakpoint) & (tmp_df_pt['date'] <= breakpoint + date_offset)]
    # display(tmp_df_pt_after)

    patrons_prior_mean = tmp_df_pt_prior['patrons'].mean()
    patrons_befor_mean = tmp_df_pt_befor['patrons'].mean()
    patrons_after_mean = tmp_df_pt_after['patrons'].mean()
    
    # plot the means   
    axs[0,2].plot(tmp_df_pt_prior['date'].mean(), patrons_prior_mean, marker='o', color='pink')
    axs[0,2].plot(tmp_df_pt_befor['date'].mean(), patrons_befor_mean, marker='o', color='green')
    axs[0,2].plot(tmp_df_pt_after['date'].mean(), patrons_after_mean, marker='o', color='orange')
    
    diff_mean_prior_to_befor = patrons_befor_mean - patrons_prior_mean
    diff_mean_befor_to_after = patrons_after_mean - patrons_befor_mean

    print("\t Mean number of patrons in the PRIOR window:               {:5.1f}".format(patrons_prior_mean))
    print("\t Increase of patrons from PRIOR mean to BEFORE mean        {:5.1f}".format(diff_mean_prior_to_befor))
    print("\t ------------------------------------------------------------------")
    print("\t Mean number of patrons in the BEFORE window:              {:5.1f}".format(patrons_befor_mean))
    print("\t Increase of patrons from BEFORE mean to AFTER mean        {:5.1f}".format(diff_mean_befor_to_after))
    print("\t ------------------------------------------------------------------")
    print("\t Mean number of patrons in the AFTER window:               {:5.1f}".format(patrons_after_mean))
    print()
    print("\t Relative change in growth, (BEFORE to AFTER) vs (PRIOR to BEFORE): {:.1%}".format((diff_mean_befor_to_after/diff_mean_prior_to_befor) - 1))
    print("\n\n")

    
    # fig.suptitle(f'{patreon} (rank {idx+1}) \nyoutube.com/channel/{ch_id}', fontweight="bold")
    fig.tight_layout(w_pad=0)
    plt.show()
    print("\n\n\n")


##### Toy example
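
The quantity reported in the toy cells of this section compares the mean growth before and after a breakpoint. A minimal sketch, assuming inclusive windows of equal width as in those cells (`growth_change` and `width` are hypothetical names introduced here, not part of the notebook):

```python
from statistics import mean


def growth_change(times, patrons, break_time, width):
    """Relative change in mean growth across a breakpoint.

    Windows (inclusive): PRIOR  = [break - 2*width, break - width]
                         BEFORE = [break - width,   break]
                         AFTER  = [break,           break + width]
    """
    def window_mean(lo, hi):
        return mean(p for t, p in zip(times, patrons) if lo <= t <= hi)

    prior = window_mean(break_time - 2 * width, break_time - width)
    before = window_mean(break_time - width, break_time)
    after = window_mean(break_time, break_time + width)
    return (after - before) / (before - prior) - 1


times = [1, 2, 3, 4, 5, 6, 7, 8]
steady = [2, 4, 6, 8, 10, 12, 14, 16]   # linear growth, no shock
shocked = [2, 4, 6, 8, 10, 18, 14, 16]  # spike right after the breakpoint

print(growth_change(times, steady, break_time=5, width=2))
print(growth_change(times, shocked, break_time=5, width=2))
```

For the steady series the window means are 4, 8, and 12, so growth is unchanged and the result is 0. For the series with a spike the AFTER mean rises to 14, which gives 6/4 - 1 = 0.5, a 50% increase in growth.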

In [None]:
# toy example 1: steady linear growth (no shock around the breakpoint)
data = {'channel': ["baba", "baba", "baba", "baba", "baba", "baba", "baba", "baba"], 'time': [1, 2, 3, 4, 5, 6, 7, 8], 'patrons': [2,4,6,8,10,12,14,16]}
df = pd.DataFrame(data)
    
breakpoint_date = 5
    
plt.plot(data['time'], data['patrons'], marker='o')
plt.xlabel("time")
plt.ylabel("# patrons")

plt.axvline(breakpoint_date, color='red', linestyle='--', label='breakpoint', linewidth=2.5)
plt.axvline(breakpoint_date-2, color='green', linestyle='--', label='-2 months', linewidth=2.5)
plt.axvline(breakpoint_date+2, color='orange', linestyle='--', label='+2 months', linewidth=2.5)
# plt.fill_between(df['time'], 4, 12, color='C0', alpha=0.3)

plt.fill_betweenx(df['patrons'], breakpoint_date-2, breakpoint_date, facecolor='green', alpha=0.3)
plt.fill_betweenx(df['patrons'], breakpoint_date+2, breakpoint_date, facecolor='orange', alpha=0.3)


plt.legend()
plt.title("Break detection toy example")
plt.show()




patrons_bkpoint = df[df['time'] == breakpoint_date].patrons.values[0] 
patrons_bkpoint_plus_2 = df[df['time'] == breakpoint_date+2].patrons.values[0] 
patrons_bkpoint_minus_2 = df[df['time'] == breakpoint_date-2].patrons.values[0] 

diff_after_break = patrons_bkpoint_plus_2 - patrons_bkpoint
diff_before_break = patrons_bkpoint - patrons_bkpoint_minus_2

print("Number of patrons 2 months before breakpoint:   {:2}".format(patrons_bkpoint_minus_2))
print("Increase of patrons 2 months before breakpoint: {:2}".format(diff_before_break))
print("----------------------------------------------------")
print("Number of patrons at breakpoint:                {:2}".format(patrons_bkpoint))
print("Increase of patrons 2 months after breakpoint:  {:2}".format(diff_after_break))
print("----------------------------------------------------")
print("Number of patrons 2 months after breakpoint:    {:2}".format(patrons_bkpoint_plus_2))
print()
print("Change of increase between 2 months before and 2 months after: {:.1%}".format((diff_after_break/diff_before_break) - 1))


df_prior = df[(df['time'] >= breakpoint_date-4) & (df['time'] <= breakpoint_date-2)]
display(df_prior)

df_befor = df[(df['time'] >= breakpoint_date-2) & (df['time'] <= breakpoint_date)]
display(df_befor)

df_after = df[(df['time'] >= breakpoint_date) & (df['time'] <= breakpoint_date+2)]
display(df_after)

patrons_prior_mean = df_prior['patrons'].mean()
patrons_befor_mean = df_befor['patrons'].mean()
patrons_after_mean = df_after['patrons'].mean()

diff_mean_prior_to_befor = patrons_befor_mean - patrons_prior_mean
diff_mean_befor_to_after = patrons_after_mean - patrons_befor_mean

print("Mean number of patrons in the PRIOR window:               {:5.1f}".format(patrons_prior_mean))
print("Increase of patrons from PRIOR mean to BEFORE mean        {:5.1f}".format(diff_mean_prior_to_befor))
print("------------------------------------------------------------------")
print("Mean number of patrons in the BEFORE window:              {:5.1f}".format(patrons_befor_mean))
print("Increase of patrons from BEFORE mean to AFTER mean        {:5.1f}".format(diff_mean_befor_to_after))
print("------------------------------------------------------------------")
print("Mean number of patrons in the AFTER window:               {:5.1f}".format(patrons_after_mean))
print()

print("Relative change in growth, (BEFORE to AFTER) vs (PRIOR to BEFORE): {:.1%}".format((diff_mean_befor_to_after/diff_mean_prior_to_befor) - 1))


In [None]:
# toy example 2: same series with a spike right after the breakpoint (time 6)
data = {'channel': ["baba", "baba", "baba", "baba", "baba", "baba", "baba", "baba"], 'time': [1, 2, 3, 4, 5, 6, 7, 8], 'patrons': [2,4,6,8,10,18,14,16]}
df = pd.DataFrame(data)
    
breakpoint_date = 5
    
plt.plot(data['time'], data['patrons'], marker='o')
plt.xlabel("time")
plt.ylabel("# patrons")

plt.axvline(breakpoint_date, color='red', linestyle='--', label='breakpoint', linewidth=2.5)
plt.axvline(breakpoint_date-2, color='green', linestyle='--', label='-2 months', linewidth=2.5)
plt.axvline(breakpoint_date+2, color='orange', linestyle='--', label='+2 months', linewidth=2.5)
# plt.fill_between(df['time'], 4, 12, color='C0', alpha=0.3)

plt.fill_betweenx(df['patrons'], breakpoint_date-2, breakpoint_date, facecolor='green', alpha=0.3)
plt.fill_betweenx(df['patrons'], breakpoint_date+2, breakpoint_date, facecolor='orange', alpha=0.3)


plt.legend()
plt.title("Break detection toy example")
plt.show()




patrons_bkpoint = df[df['time'] == breakpoint_date].patrons.values[0] 
patrons_bkpoint_plus_2 = df[df['time'] == breakpoint_date+2].patrons.values[0] 
patrons_bkpoint_minus_2 = df[df['time'] == breakpoint_date-2].patrons.values[0] 

diff_after_break = patrons_bkpoint_plus_2 - patrons_bkpoint
diff_before_break = patrons_bkpoint - patrons_bkpoint_minus_2

print("Number of patrons 2 months before breakpoint:   {:2}".format(patrons_bkpoint_minus_2))
print("Increase of patrons 2 months before breakpoint: {:2}".format(diff_before_break))
print("----------------------------------------------------")
print("Number of patrons at breakpoint:                {:2}".format(patrons_bkpoint))
print("Increase of patrons 2 months after breakpoint:  {:2}".format(diff_after_break))
print("----------------------------------------------------")
print("Number of patrons 2 months after breakpoint:    {:2}".format(patrons_bkpoint_plus_2))
print()
print("Change of increase between 2 months before and 2 months after: {:.1%}".format((diff_after_break/diff_before_break) - 1))


df_prior = df[(df['time'] >= breakpoint_date-4) & (df['time'] <= breakpoint_date-2)]
display(df_prior)

df_befor = df[(df['time'] >= breakpoint_date-2) & (df['time'] <= breakpoint_date)]
display(df_befor)

df_after = df[(df['time'] >= breakpoint_date) & (df['time'] <= breakpoint_date+2)]
display(df_after)

patrons_prior_mean = df_prior['patrons'].mean()
patrons_befor_mean = df_befor['patrons'].mean()
patrons_after_mean = df_after['patrons'].mean()

diff_mean_prior_to_befor = patrons_befor_mean - patrons_prior_mean
diff_mean_befor_to_after = patrons_after_mean - patrons_befor_mean

print("Mean number of patrons in the PRIOR window:               {:5.1f}".format(patrons_prior_mean))
print("Increase of patrons from PRIOR mean to BEFORE mean        {:5.1f}".format(diff_mean_prior_to_befor))
print("------------------------------------------------------------------")
print("Mean number of patrons in the BEFORE window:              {:5.1f}".format(patrons_befor_mean))
print("Increase of patrons from BEFORE mean to AFTER mean        {:5.1f}".format(diff_mean_befor_to_after))
print("------------------------------------------------------------------")
print("Mean number of patrons in the AFTER window:               {:5.1f}".format(patrons_after_mean))
print()

print("Relative change in growth, (BEFORE to AFTER) vs (PRIOR to BEFORE): {:.1%}".format((diff_mean_befor_to_after/diff_mean_prior_to_befor) - 1))


### 2.5 Plots (II)

#### 2.5.1 Plot YouTube timeseries for channels matching top Patreon accounts

In [None]:
# load matching dataframe
df_matched_channel_patreon = pd.read_csv(LOCAL_DATA_FOLDER+"df_matched_channel_patreon.tsv.gz", sep="\t", compression="gzip")

In [None]:
# add patreon_id column to YT timeseries
df_yt_timeseries_filt4_merged = df_yt_timeseries_filt4.merge(df_matched_channel_patreon, left_on='channel', right_on='channel_id')
df_yt_timeseries_filt4_merged.head(1)

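The merge above is an inner join, so channels without a row in the matching table are dropped silently. A small sketch of how one might list the unmatched keys with the `indicator` option of `DataFrame.merge`; the helper `unmatched_keys` and the toy frames are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

def unmatched_keys(left, right, left_on, right_on):
    """Return the values of `left_on` in `left` that have no partner in `right`."""
    merged = left.merge(right, left_on=left_on, right_on=right_on,
                        how='left', indicator=True)
    # rows flagged 'left_only' found no match in the right frame
    return merged.loc[merged['_merge'] == 'left_only', left_on].unique().tolist()

# toy frames standing in for the timeseries and the matching table
toy_ts = pd.DataFrame({'channel': ['a', 'a', 'b', 'c'], 'views': [1, 2, 3, 4]})
toy_match = pd.DataFrame({'channel_id': ['a', 'b'], 'patreon_id': ['p1', 'p2']})
missing = unmatched_keys(toy_ts, toy_match, 'channel', 'channel_id')
```

Running the same check on the real frames before merging would show how many channels the join discards.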
In [None]:
# filter channels matching top patreon accounts
# df_yt_timeseries_top_pt = df_yt_timeseries_filt4_merged[df_yt_timeseries_filt4_merged['patreon_id'].isin(top_patreons)]
# top_yt_patreons = df_yt_timeseries_top_pt.patreon_id.unique()
# top_yt_patreons

In [None]:
# aggregate per channel and keep the 10 channels with the highest maximum `views` value
df_yt_ts_top_views = df_yt_timeseries_filt4_merged.groupby(['patreon_id', 'channel_id'])\
                                                     .agg(datetime_cnt=('datetime', 'count'),
                                                          views_max=('views', 'max'),
                                                          subs_max=('subs', 'max'),
                                                          videos_mean=('videos', 'mean'))\
                                                     .sort_values(by=['views_max'], ascending=False)\
                                                     .reset_index()[:10]
df_yt_ts_top_views

In [None]:
# get top 10 youtube channels in terms of views
top_channel_ids = df_yt_ts_top_views['channel_id'].unique()
top_channel_ids

In [None]:
# plot YT timeseries for top youtube accounts
fig, axs = plt.subplots(int(math.ceil(len(top_channel_ids)/2)), 2, figsize=(12, len(top_channel_ids)*1.2), sharey=False, sharex=False)
for idx, chan_id in enumerate(top_channel_ids):
    row = math.floor(idx/2)
    col = idx % 2
    sbplt = axs[row, col]

    tmp_df = df_yt_timeseries_filt4_merged[df_yt_timeseries_filt4_merged['channel_id'] == chan_id]

    sbplt.plot(tmp_df['datetime'], tmp_df['views'])
    sbplt.set(title=tmp_df['channel'].iloc[0])
    sbplt.xaxis.set_major_locator(years)
    sbplt.xaxis.set_major_formatter(years_fmt)
    sbplt.xaxis.set_minor_locator(months)
    
    
fig.suptitle('Timeseries of the YouTube channels with most views\n(views per week)', fontweight="bold")
fig.text(0.5,0, 'Week')
fig.text(0,0.5, 'Views', rotation = 90)
fig.tight_layout(pad=3, w_pad=5, h_pad=2)

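The plotting cell above lays the channels out on a two-column grid, computing the row count with `math.ceil` and the position of each subplot from its index. A minimal sketch of that arithmetic as two small helpers; `grid_shape` and `grid_position` are hypothetical names introduced here for illustration.

```python
import math

def grid_shape(n_items, n_cols=2):
    """Rows and columns needed to fit n_items subplots (at least one row)."""
    return max(1, math.ceil(n_items / n_cols)), n_cols

def grid_position(idx, n_cols=2):
    """(row, col) of the idx-th subplot when filling the grid row by row."""
    return divmod(idx, n_cols)

# 10 channels fit a 5x2 grid; 7 channels need 4 rows, leaving one empty cell
shape_even = grid_shape(10)
shape_odd = grid_shape(7)
last_pos = grid_position(6)
```

When only one or two channels are plotted, `plt.subplots` returns a 1-D array of axes by default, so `axs[row, col]` fails; passing `squeeze=False` keeps the array two-dimensional.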
#### 2.5.2 Plot Patreon Time Series

In [None]:
# RE-declare global variable for size of original GT dataset
GT_final_processed_file_ROWS = 232_269

In [None]:
!ls -lh {LOCAL_DATA_FOLDER}dailyGraph_earningsSeries.tsv.gz

In [None]:
# read file from disk
df_dailyGraph_earningsSeries = pd.read_csv(LOCAL_DATA_FOLDER+"dailyGraph_earningsSeries.tsv.gz", sep="\t", compression='gzip')
df_dailyGraph_earningsSeries.date = pd.to_datetime(df_dailyGraph_earningsSeries.date, unit='ms')
df_dailyGraph_earningsSeries

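The `unit='ms'` argument in the cell above tells pandas to read the stored `date` values as milliseconds since the Unix epoch. The same conversion for a single value, using only the standard library, can be sketched as follows; the helper `ms_to_utc` is a hypothetical name and the sample timestamp is illustrative.

```python
from datetime import datetime, timezone

def ms_to_utc(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# 1_577_836_800_000 ms corresponds to midnight on 1 January 2020 (UTC)
sample = ms_to_utc(1_577_836_800_000)
```

A quick spot check of this kind helps confirm that the raw column is in milliseconds rather than seconds before the whole series is converted.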
##### Restrict to the top 10 Patreon accounts

In [None]:
# TOP_CNT = 10
# # group by patreon account
# dailyGraph_grp_patreon = df_dailyGraph_earningsSeries.groupby('patreon')\
#                                                      .agg(date_cnt=('date', 'count'),
#                                                           earliest_date=('date', 'min'),
#                                                          latest_date=('date', 'max'),
#                                                          daily_earning_mean=('earning', 'mean'),
#                                                          daily_earning_max=('earning', 'max'))\
#                                                      .sort_values(by=['daily_earning_max'], ascending=False)\
#                                                      .reset_index()\
#                                                      .round(2)

# # remove hours from dates
# dailyGraph_grp_patreon.earliest_date = dailyGraph_grp_patreon.earliest_date.dt.date
# dailyGraph_grp_patreon.latest_date = dailyGraph_grp_patreon.latest_date.dt.date

# dailyGraph_grp_patreon

# # extract the top 10 most profitable patreon accounts
# top_patreons = dailyGraph_grp_patreon[:TOP_CNT]['patreon']

# print("[Graphtreon Timeseries] Total number of patreon ids (original file):            {:>9,}".format(GT_final_processed_file_ROWS))
# print("[Graphtreon Timeseries] Nb of patreon ids in dailyGraph earnings time series:   {:>9,} ({:.1%} of original dataset)".format(len(dailyGraph_grp_patreon), len(dailyGraph_grp_patreon)/GT_final_processed_file_ROWS))

# print()

# dailyGraph_grp_patreon[:TOP_CNT].style.set_caption(f"Top {TOP_CNT} highest-earning Patreon accounts (sorted by max daily earnings)")



In [None]:
# join to get the corresponding channel_id
df_dailyGraph_earningsSeries_merged = df_dailyGraph_earningsSeries.merge(df_matched_channel_patreon, left_on='patreon', right_on='patreon_id')
df_dailyGraph_earningsSeries_merged.head(3)

In [None]:
top_channel_ids

In [None]:
df_dailyGraph_earningsSeries_merged.groupby(['channel_id', 'patreon_id']).count()

In [None]:
# plot Patreon daily earnings for the accounts matching the top YouTube channels
# (math.ceil keeps a row for the last channel when the count is odd)
fig, axs = plt.subplots(int(math.ceil(len(top_channel_ids)/2)), 2, figsize=(12, len(top_channel_ids)*1.2), sharey=False, sharex=False)
for idx, chan_id in enumerate(top_channel_ids):
    row = math.floor(idx/2)
    col = idx % 2
    sbplt = axs[row, col]

    tmp_df = df_dailyGraph_earningsSeries_merged[df_dailyGraph_earningsSeries_merged['channel_id'] == chan_id]
    display(tmp_df)
    sbplt.plot(tmp_df['date'], tmp_df['earning'])
    sbplt.set(title=chan_id)
    sbplt.xaxis.set_major_locator(years)
    sbplt.xaxis.set_major_formatter(years_fmt)
    sbplt.xaxis.set_minor_locator(months)
    
    
fig.suptitle('Timeseries of the Patreon accounts corresponding to the YouTube channels with most views\n(earnings per day in dollars)', fontweight="bold")
fig.text(0.5,0, 'Day')
fig.text(0,0.5, 'Earnings per day ($)', rotation = 90)
fig.tight_layout(pad=3, w_pad=5, h_pad=2)

In [None]:
# # analyse 1 account in detail
# patreon_account = 'patreon.com/tinymeatgang'

# with pd.option_context('display.max_rows', 90, 'display.min_rows', 90):
#     display(df_top_pt_daily_earnings[(df_top_pt_daily_earnings['patreon'] == patreon_account) 
#                                      # & (df_top_pt_daily_earnings['date'] > pd.Timestamp('2021-01-01'))
#                                     ].head(90))
        
# df_top_pt_daily_earnings.dtypes

# # check for NaN values
# df_top_pt_daily_earnings[df_top_pt_daily_earnings.isna().any(axis=1)]

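The YouTube plots above are labelled as views per week, while the Patreon earnings are daily, so comparing them directly mixes two granularities. One way to put them on the same footing is to sum the daily values per week. The sketch below uses only the standard library; the helper `weekly_totals` is hypothetical, and the choice of Monday as the week start is an assumption that may not match the week boundaries of the YouTube timeseries.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_totals(daily):
    """Sum (date, value) pairs into weeks keyed by the Monday that starts each week."""
    totals = defaultdict(float)
    for day, value in daily:
        # weekday() is 0 for Monday, so this steps back to the start of the week
        week_start = day - timedelta(days=day.weekday())
        totals[week_start] += value
    return dict(sorted(totals.items()))

# toy daily earnings: two days in one week, one day in the next
toy_daily = [(date(2020, 1, 6), 1.0), (date(2020, 1, 7), 2.0), (date(2020, 1, 13), 5.0)]
toy_weekly = weekly_totals(toy_daily)
```

On the real dataframe, the pandas `resample` method on a date-indexed series would be the more natural tool; the loop here only makes the bucketing explicit.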
#### 2.5.3 Compare YouTube and top Patreon timeseries

In [None]:
# compare YouTube and Patreon timeseries for top patreon accounts
# NOTE: expects top_patreons, df_top_pt_daily_earnings and df_yt_timeseries_top_pt to be defined by earlier cells
for patreon in top_patreons:

    fig, axs = plt.subplots(4, 1, figsize=(6, 8), sharey=False, sharex=True)

    # select the rows of this patreon account once
    tmp_df_pt = df_top_pt_daily_earnings[df_top_pt_daily_earnings['patreon'] == patreon]
    tmp_df_yt = df_yt_timeseries_top_pt[df_yt_timeseries_top_pt['patreon_id'] == patreon]

    # one panel per metric: (dataframe, x column, y column, title, y label)
    panels = [
        (tmp_df_pt, 'date', 'earning', "Patreon earnings", "Earnings ($)"),
        (tmp_df_yt, 'datetime', 'views', "YouTube views", "Views"),
        (tmp_df_yt, 'datetime', 'subs', "YouTube subscriptions", "Subscriptions"),
        (tmp_df_yt, 'datetime', 'videos', "YouTube videos", "Videos"),
    ]

    for ax, (tmp_df, x_col, y_col, title, ylabel) in zip(axs, panels):
        ax.plot(tmp_df[x_col], tmp_df[y_col])
        ax.set(title=title)
        ax.set_ylabel(ylabel)

        ax.xaxis.set_major_locator(years)
        ax.xaxis.set_major_formatter(years_fmt)
        ax.xaxis.set_minor_locator(months)

    ch_id = tmp_df_yt['channel'].unique()[0]
    fig.suptitle(f'{patreon}\nchannel id(s): {ch_id}', fontweight="bold")
    fig.tight_layout()