# Characterizing Patronage on YouTube - EDA

## 0. Files and brief description

All data is located in `/dlabdata1/youtube_large/`

In [None]:
DATA_FOLDER = "/dlabdata1/youtube_large/"
LOCAL_DATA_FOLDER = "local_data/"

**YouNiverse dataset:**

- `df_channels_en.tsv.gz`: channel metadata.
- `df_timeseries_en.tsv.gz`: channel-level time-series.
- `yt_metadata_en.jsonl.gz`: raw video metadata.
- `youtube_comments.tsv.gz`: user-comment matrices.
- `youtube_comments.ndjson.zst`: raw comments — this is a HUGE file.

**Graphtreon dataset:**
- `creators.csv`: list with all creator names.
- `final_processed_file.jsonl.gz`: all Graphtreon time-series.
- `pages.zip`: raw HTML of the pages in Graphtreon.

#### Library imports

In [None]:
# !conda list

In [None]:
import os 
import io
import pandas as pd
import json
import re
import zstandard
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter
import numpy as np
import seaborn as sns
import gzip
from tqdm import tqdm
import timeit
import ast
import math
import ruptures as rpt

In [None]:
# list all files in current directory
# !ls -lh

In [None]:
# list all files in DATA_FOLDER
!ls -lh {DATA_FOLDER}

## 1. Exploratory Data Analysis (EDA)

### 1.1. YouNiverse dataset

#### 1.1.1 Channel metadata
Metadata associated with the 136,470 channels: **channel ID**, **join date**, **country**, **number of subscribers**, **most frequent category**, and the **channel’s position** in socialblade.com’s subscriber ranking. \
The number of subscribers is provided both as obtained from channelcrawler.com (between 2019-09-12 and 2019-09-17) and as crawled from socialblade.com (2019-09-27). Additionally, we also provide a set of **weights** (derived from socialblade.com’s subscriber rankings) that can be used to partially correct sample biases in our dataset.

- `category_cc`: category of the channel (majority based)
- `join_date`: join date of the channel
- `channel`: channel id
- `name_cc`: name of the channel.
- `subscribers_cc`: number of subscribers
- `videos_cc`: number of videos
- `subscriber_rank_sb`: rank in terms of number of subscribers (channel’s position in socialblade.com’s subscriber ranking)
- `weights`: set of weights derived from socialblade.com’s subscriber rankings, which can be used to partially correct sample biases in our dataset (i.e. a correction for representation)
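
As a rough illustration of how the `weights` column might be used, the sketch below compares an unweighted and a weighted mean of subscribers on a toy frame. The values are invented, and treating `weights` as plain multiplicative sample weights is an assumption, not the dataset authors' documented procedure.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_yt_channels; the values are invented for illustration only.
toy_channels = pd.DataFrame({
    "channel": ["a", "b", "c"],
    "subscribers_cc": [10_000, 50_000, 1_000_000],
    "weights": [10.0, 5.0, 1.0],
})

# Unweighted mean: every sampled channel counts equally.
plain_mean = toy_channels["subscribers_cc"].mean()

# Weighted mean: channels with larger weights count more (assumed usage of `weights`).
weighted_mean = np.average(toy_channels["subscribers_cc"], weights=toy_channels["weights"])

print(f"unweighted: {plain_mean:,.0f}  weighted: {weighted_mean:,.0f}")
```

If the weights up-weight small channels, as in this toy frame, the weighted mean ends up well below the unweighted one.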

In [None]:
!ls -lh {DATA_FOLDER}df_channels_en.tsv.gz

In [None]:
# channel metadata
# df_yt_channels = pd.read_csv(DATA_FOLDER+'df_channels_en.tsv.gz', sep="\t", compression='gzip', nrows=10)
df_yt_channels = pd.read_csv(DATA_FOLDER+'df_channels_en.tsv.gz', sep="\t", compression='gzip')
df_yt_channels

Facts about this data (taken from [YouNiverse github page](https://github.com/epfl-dlab/YouNiverse)) 

- This dataframe has 136,470 rows, where each one corresponds to a different channel.
- We obtained all channels with >10k subscribers and >10 videos from channelcrawler.com on 27 October 2019.
- Additionally, we filtered out all channels that were not in English given their video metadata (see `Raw Channels`).

##### Summary statistics

In [None]:
print('Number of unique categories:         {:,}'.format(df_yt_channels['category_cc'].nunique()))
print('Number of unique channels:      {:,}'.format(df_yt_channels['channel'].nunique()))
print('Number of unique channel names: {:,}'.format(df_yt_channels['name_cc'].nunique()))

print('\nNote: there are more unique channels than unique names, so some channels might have the same name!')

##### Distribution of videos and subscribers per channel

In [None]:
selected_cols = ['videos_cc', 'subscribers_cc']

# plot with linear scale for x axis and log scale for y axis
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
    sns.histplot(data=df_yt_channels[col], ax=ax, bins=50, kde=False, color=f'C{i}')
    ax.set(title=f'Distribution of {col}')
    ax.set_ylabel("Count - number of channels (log scale)")
    ax.set(yscale="log")
    # ax.set(xscale="log")
plt.tight_layout()
plt.show()


# plot with log scale for x axis 
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

xlabels = [r'$\log_{10}(videos)$', r'$\log_{10}(subscribers)$']

for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
    sns.histplot(data=np.log10(df_yt_channels[col]), ax=ax, bins=50, kde=False, cumulative=False, color=f'C{i}')

    ax.set(title=f'Distribution of {col} (log scale on x axis)')
    ax.set_xlabel(xlabels[i])
    ax.set_ylabel("Count - number of channels")

    # ax.set(yscale="log")
    # ax.set(xscale="log")
plt.tight_layout()
plt.show()


# plot with linear scale for both axes 
# fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

# for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
#     sns.histplot(data=df_yt_channels[col], ax=ax, bins=50, kde=False, color=f'C{i}')
#     ax.set(title=f'Distribution of {col}')
#     ax.set_ylabel("Count - number of channels")
#     # ax.set(yscale="log")
#     ax.set(xscale="log")
# plt.tight_layout()
# plt.show()

# # plot with log scale for x axis (distplot)
# fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

# for i,(col,ax) in enumerate(zip(selected_cols, axs.flatten())):
#     sns.distplot(np.log10(df_yt_channels[col]), hist_kws=kwargs, kde=False, kde_kws=kwargs, ax=ax, norm_hist=True)

#     ax.set(title=f'Distribution of {col} (log-log scale)')
#     ax.set_ylabel("Count - number of channels")
#     # ax.set(yscale="log")
#     # ax.set(xscale="log")
# plt.tight_layout()
# plt.show()


# descriptive statistics table
df_yt_channels[selected_cols].describe().T

**Discussion:** \
From the above graphs and table, we can see that the _videos_ and _subscribers_ distributions among YouTube channels are heavy-tailed, which is consistent with a **power law**: most channels have only a few videos and a few subscribers, while a few of them have a lot of videos and a lot of subscribers.

More specifically:
- 50% of the YouTube channels have fewer than 175 videos
- 50% of the YouTube channels have fewer than 42,400 subscribers

_Note: only channels with at least 10 videos and 10,000 subscribers were considered for this study._
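
A histogram alone cannot confirm a power law, since other heavy-tailed shapes such as a log-normal can look similar. One hedged way to probe the claim is to estimate the tail exponent by maximum likelihood. The sketch below does this on synthetic Pareto data, not on the YouTube data; the helper `powerlaw_alpha_mle` and the choice of `xmin` are illustrative assumptions.

```python
import numpy as np

def powerlaw_alpha_mle(values, xmin):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= xmin."""
    tail = np.asarray(values, dtype=float)
    tail = tail[tail >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Synthetic heavy-tailed sample, NOT the YouTube data:
# classical Pareto with density exponent alpha = 2.5 and lower cut-off xmin.
rng = np.random.default_rng(0)
xmin = 10_000
sample = xmin * (1.0 + rng.pareto(1.5, size=100_000))

alpha_hat = powerlaw_alpha_mle(sample, xmin)
print(f"estimated alpha: {alpha_hat:.2f}")
```

On real data the estimate depends strongly on `xmin`, so it would need to be chosen with care, for example at the 10,000-subscriber inclusion threshold or above.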

##### Group by categories

In [None]:
data_per_cat_chan = df_yt_channels.groupby(['category_cc', 'channel'])[['videos_cc', 'subscribers_cc']].agg(['max'])

# set the columns to the top level of the multi-index
data_per_cat_chan.columns = data_per_cat_chan.columns.get_level_values(0)
data_per_cat_chan

In [None]:
data_per_cat_chan.reset_index(inplace=True)
data_per_cat_chan

##### Number of channels per category

In [None]:
chan_per_cat = data_per_cat_chan.groupby('category_cc')[['channel']].count().sort_values('channel', ascending=False)

In [None]:
chan_per_cat.plot(kind='bar')
plt.title("Number of channels per category")
plt.xlabel("Categories")
plt.ylabel("Number of channels")
plt.show()
chan_per_cat['channel']

In [None]:
# data_per_cat = data_per_cat_chan.groupby('category')['videos_cc','subscribers_cc'].agg(['min', 'max', 'count', 'sum'])
data_per_cat = data_per_cat_chan.groupby('category_cc')[['videos_cc','subscribers_cc']].agg(['sum'])
data_per_cat.columns = data_per_cat.columns.get_level_values(0)
data_per_cat = data_per_cat.add_suffix('_sum')
data_per_cat

##### Number of videos per category

In [None]:
data_per_cat['videos_cc_sum'].sort_values(ascending=False).plot(kind='bar')
plt.title("Number of videos per category")
plt.xlabel("Categories")
plt.ylabel("Number of videos")
plt.show()

data_per_cat['videos_cc_sum'].sort_values(ascending=False)

##### Number of subscribers per category

In [None]:
data_per_cat['subscribers_cc_sum'].sort_values(ascending=False).plot(kind='bar')
plt.title("Number of subscribers per category")
plt.xlabel("Categories")
plt.ylabel("Number of subscribers")
plt.show()

data_per_cat['subscribers_cc_sum'].sort_values(ascending=False)

#### 1.1.2 YouTube Channels time-series data
Weekly number of viewers and subscribers. We have a data point for each channel and each week.

Time series of channel activity at **weekly granularity**. The span of time series varies by channel depending on when socialblade.com started tracking the channel. On average, it contains **2.8 years of data per channel** for **133k channels** (notice that this means there are roughly 4k channels for which there is no time-series data). \
Each data point includes the **number of views** (`views`) and **subscribers** (`subs`) obtained in the given week, as well as the **number of videos** (`videos`) posted by the **channel** (`channel`). The number of videos is calculated using the video upload dates in our video metadata, such that videos that were unavailable at crawl time are not accounted for. 

---

Time series related to each channel.\
These come from a mix of YouTube data and time series crawled from [socialblade.com](https://socialblade.com/):
- From the former (YouTube data): derived weekly time series indicating **how many videos each channel had posted per week**. 
- From the latter (socialblade.com): crawled weekly statistics on the **number of viewers** `views` and **subscribers** `subs` per channel `channel`. This data was available for around 153k channels.

    - `channel`: unique channel ID, which is the numbers and letters at the end of the URL.
    - `category`: category of the channel as assigned by [socialblade.com](https://socialblade.com/) according to the last 10 videos at time of crawl (categories organize channels and videos on YouTube and help creators, advertisers, and channel managers identify with content and audiences they wish to associate with).
    - `datetime`: First day of the week related to the data point
    - `views`: Total number of views the channel had this week.
    - `delta_views`: Delta views obtained this week (difference of nb of views between current and former week). (Interpolation)
    - `subs`: Total number of subscribers the channel had this week.
    - `delta_subs`: Delta subscribers obtained this week (difference of nb of subscribers between current and former week)
    - `videos`: Number of videos posted by the channel to date
    - `delta_videos`:  Delta videos obtained this week (difference of number of videos posted by the channel between current and former week).
    - `activity`: Number of videos published in the last 15 days.
    
    
Note: A channel can be viewed by appending its channel ID to the URL, e.g. https://www.youtube.com/channel/UCBJuEqXfXTdcPSbGO9qqn1g
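
The relation between the cumulative columns and their `delta_*` counterparts can be sketched on a toy frame: a week-over-week difference computed within each channel. The values and the column name `delta_views_check` are invented for illustration, and the dataset's own deltas may be computed differently (the description mentions interpolation), so an exact match is not guaranteed.

```python
import pandas as pd

# Toy cumulative counts for two hypothetical channels (invented values).
toy_ts = pd.DataFrame({
    "channel": ["a", "a", "a", "b", "b"],
    "datetime": pd.to_datetime(
        ["2019-01-07", "2019-01-14", "2019-01-21", "2019-01-07", "2019-01-14"]
    ),
    "views": [100, 150, 180, 1000, 1300],
})

# Sort so that diff() compares each week with the previous week of the same channel.
toy_ts = toy_ts.sort_values(["channel", "datetime"])

# Week-over-week difference within each channel; the first week of a channel has no predecessor (NaN).
toy_ts["delta_views_check"] = toy_ts.groupby("channel")["views"].diff()
print(toy_ts)
```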


In [None]:
!ls -lh {DATA_FOLDER}df_timeseries_en.tsv.gz

In [None]:
# load channel-level time-series. (takes btw 50 secs and 2 mins)
df_yt_timeseries = pd.read_csv(DATA_FOLDER+'df_timeseries_en.tsv.gz', sep="\t", compression='gzip', parse_dates=['datetime'])
df_yt_timeseries

##### Summary statistics

In [None]:
# df_yt_timeseries.describe().T

In [None]:
yt_ts_uniq_chan_cnt = df_yt_timeseries['channel'].nunique()

print('Timeseries data was gathered between {} and {}'.format(df_yt_timeseries['datetime'].min().strftime('%B %d, %Y'),
                                                         df_yt_timeseries['datetime'].max().strftime('%B %d, %Y')))
print('Total number of datapoints across all channels: {:>12,}'.format(len(df_yt_timeseries)))
data_points_dist = df_yt_timeseries['channel'].value_counts()
print('Average number of datapoints per channel:       {:>12.0f} weeks (≈{:,.1f} years)'.format(data_points_dist.mean(), data_points_dist.mean()/52))
print('Number of unique categories:                     {:>12,}'.format(df_yt_timeseries['category'].nunique()))
print('Number of unique channels:                       {:>12,}'.format(yt_ts_uniq_chan_cnt))

##### Datetime points per channel

Not all channels' time series start and end at the same time, so the number of datapoints differs from channel to channel.
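
Beyond the first and last dates, it can be useful to check whether a channel's series has gaps. The sketch below does this on a toy frame, assuming strictly weekly datapoints; the columns `expected_weeks` and `missing_weeks` are introduced here for illustration only.

```python
import pandas as pd

# Toy weekly series for two hypothetical channels; channel "b" is missing one week.
toy_ts = pd.DataFrame({
    "channel": ["a", "a", "a", "b", "b"],
    "datetime": pd.to_datetime(
        ["2019-01-07", "2019-01-14", "2019-01-21", "2019-01-07", "2019-01-21"]
    ),
})

coverage = toy_ts.groupby("channel")["datetime"].agg(["min", "max", "count"])

# Number of weekly points expected if there were no gaps between the first and last week.
coverage["expected_weeks"] = (coverage["max"] - coverage["min"]).dt.days // 7 + 1
coverage["missing_weeks"] = coverage["expected_weeks"] - coverage["count"]
print(coverage)
```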

In [None]:
datetime_data = df_yt_timeseries.groupby('channel')['datetime'].agg(['min', 'max'])
datetime_data.head()

In [None]:
# datetime_data.describe().T

##### Datetime points per year

In [None]:
yt_ts_year_cnt = df_yt_timeseries.groupby(df_yt_timeseries.datetime.dt.year).size()

In [None]:
print('Timeseries data was gathered between {} and {}'.format(df_yt_timeseries['datetime'].min().strftime('%B %d, %Y'),
                                                         df_yt_timeseries['datetime'].max().strftime('%B %d, %Y')))
yt_ts_year_cnt.plot(kind='bar')
plt.title("Nb of datapoints per year across all channels")
plt.xlabel("Year")
plt.ylabel("Count (datapoints)")
plt.show()

yt_ts_year_cnt

##### Datetime points per month

In [None]:
yt_ts_month_cnt = df_yt_timeseries.groupby([df_yt_timeseries.datetime.dt.year, df_yt_timeseries.datetime.dt.month]).size()
yt_ts_month_cnt.head()

In [None]:
# using pandas.Grouper
# yt_ts_month_cnt_grouper = df_yt_timeseries.groupby(pd.Grouper(key='datetime', freq='M')).count().channel
# yt_ts_month_cnt_grouper.head()

In [None]:
# plot number of datapoints per month
plt.figure(figsize=(15,2))
yt_ts_month_cnt.plot(kind='bar')
plt.title("Number of datapoints per month across all channels (using regular group by method)")
plt.xlabel("Month")
plt.ylabel("Count (datapoints)")
plt.show()

# plot number of datapoints per month using grouper
# plt.figure(figsize=(15,2))
# yt_ts_month_cnt_grouper.plot(kind='bar')
# plt.title("Number of datapoints per month across all channels (using grouper)")
# plt.xlabel("Month")
# plt.ylabel("Count (datapoints)")
# plt.show()

In [None]:
# only consider unique values per channel
yt_ts_month_unique_cnt = df_yt_timeseries.groupby(pd.Grouper(key='datetime', freq='M')).agg({"channel": pd.Series.nunique})
yt_ts_month_unique_cnt.head()

In [None]:
# sanity check: count the (datetime, channel) pairs that occur more than once, per column
(df_yt_timeseries.groupby(['datetime', 'channel']).count() > 1).sum()

In [None]:
# Number of channels with timeseries (only consider unique values per channel) --> see https://stackoverflow.com/questions/38309729/count-unique-values-per-groups-with-pandas

years = mdates.YearLocator()   # every year
months = mdates.MonthLocator()  # every month

fig, ax = plt.subplots(1, figsize=(7,3), sharey=True, sharex=True,
                       gridspec_kw={"wspace": 0.05})

ax.plot(yt_ts_month_unique_cnt)

ax.set(title='Number of channels with timeseries')
ax.set_xlabel("Month")
ax.set_ylabel("# channels")
ax.xaxis.set_major_locator(years)
ax.xaxis.set_minor_locator(months)

##### Datetime points across channels

In [None]:
# Distribution of datapoints across channels

print('Total number of datapoints across all channels: {:>12,}'.format(len(df_yt_timeseries)))
data_points_dist = df_yt_timeseries['channel'].value_counts()
print('Average number of datapoints per channel:        {:>12,.0f} weeks (≈{:,.1f} years)'.format(data_points_dist.mean(), data_points_dist.mean()/52))

ax = sns.histplot(data=data_points_dist, bins=50, kde=False, color=f'C{1}')

ax.set(title=f'Distribution of datapoints (weeks) across channels')
ax.set_xlabel('number of data points (weeks)')
ax.set_ylabel('number of channels')

# ax.set(yscale="log")
# plt.tight_layout()
plt.show()

In [None]:
# Aggregates per channel
sel_cols = ['datetime', 'views', 'delta_views', 'subs', 'delta_subs', 'videos', 'delta_videos', 'activity']
data_per_channel = df_yt_timeseries.groupby('channel')[sel_cols].agg(['min', 'max', 'count', 'mean'])
data_per_channel.head()

#####  Views per channel

In [None]:
data_per_channel['views'].head()

In [None]:
# Distribution of total views per channel
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(6,8))

sns.histplot(data=data_per_channel['views']['max'], ax=axs[0], bins=20, kde=False, color=f'C{1}')
axs[0].set(title=f'Distribution of total views per channel')
axs[0].set_xlabel('number of views (in billions)')
axs[0].set_ylabel('number of channels')
axs[0].set(yscale="log")
xlabels1 = ['{:,.0f}'.format(x) + 'bn' for x in axs[0].get_xticks()/1_000_000_000]
axs[0].set_xticklabels(xlabels1)

# Distribution of total views per channel (log scale)
sns.histplot(data=data_per_channel['views']['max'], ax=axs[1], bins=1000, kde=False, color=f'C{1}')
axs[1].set(title=f'Distribution of total views per channel (log-log scale)')
axs[1].set_xlabel('number of views (in millions)')
axs[1].set_ylabel('number of channels')
axs[1].set(yscale="log")
axs[1].set(xscale="log")
xlabels2 = ['{:,.0f}'.format(x) + 'M' for x in axs[1].get_xticks()/1_000_000]
axs[1].set_xticklabels(xlabels2)

plt.tight_layout()
plt.show()

data_per_channel['views'][['max']].describe().T

In [None]:
print("Top 10 channels with the most total views (in billions):")

for index, value in data_per_channel['views']['max'].sort_values(ascending=False)[:10].items():
    print('https://www.youtube.com/channel/{} : {:,.1f} bn views'.format(index, value/1_000_000_000))

##### Videos per channel

In [None]:
data_per_channel['videos'].head()

In [None]:
# Distribution of total videos per channel
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(6,8))
sns.histplot(data=data_per_channel['videos']['max'], ax=axs[0], bins=20, kde=False, color=f'C{1}')

axs[0].set(title=f'Distribution of total videos per channel')
axs[0].set_xlabel('number of videos')
axs[0].set_ylabel('number of channels')
axs[0].set(yscale="log")

# Distribution of total videos per channel (log scale)
sns.histplot(data=data_per_channel['videos']['max'], ax=axs[1], bins=100, kde=False, color=f'C{1}')

axs[1].set(title=f'Distribution of total videos per channel (log-log scale)')
axs[1].set_xlabel('number of videos')
axs[1].set_ylabel('number of channels')
axs[1].set(yscale="log")
axs[1].set(xscale="log")

plt.tight_layout()
plt.show()

data_per_channel['videos'][['max']].describe().T

In [None]:
print("Top 10 channels with the most total videos:")

for index, value in data_per_channel['videos']['max'].sort_values(ascending=False)[:10].items():
    print('https://www.youtube.com/channel/{} : {:,.0f} videos'.format(index, value))

##### Subscribers per channel

In [None]:
data_per_channel['subs'].head()

In [None]:
# Distribution of total subscribers per channel
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(6,8))
sns.histplot(data=data_per_channel['subs']['max'], ax=axs[0], bins=20, kde=False, color=f'C{1}')

axs[0].set(title=f'Distribution of total subscribers per channel')
axs[0].set_xlabel('number of subscribers (in millions)')
axs[0].set_ylabel('number of channels')
axs[0].set(yscale="log")
xlabels0 = ['{:,.0f}'.format(x) + 'M' for x in axs[0].get_xticks()/1_000_000]
axs[0].set_xticklabels(xlabels0)

# Distribution of total subscribers per channel (log scale)
sns.histplot(data=data_per_channel['subs']['max'], ax=axs[1], bins=500, kde=False, color=f'C{1}')

axs[1].set(title=f'Distribution of total subscribers per channel (log-log scale)')
axs[1].set_xlabel('number of subscribers')
axs[1].set_ylabel('number of channels')
axs[1].set(yscale="log")
axs[1].set(xscale="log")

plt.tight_layout()
plt.show()

data_per_channel['subs'][['max']].describe().T

In [None]:
data_per_channel['subs']['max'].sort_values(ascending=False)[:10]

In [None]:
print("Top 10 channels with the most total subscribers:")

for index, value in data_per_channel['subs']['max'].sort_values(ascending=False)[:10].items():
    print('https://www.youtube.com/channel/{} : {:,.1f}M subscribers'.format(index, value/1_000_000))

In [None]:
# set the columns to the top level of the multi-index
# data_per_channel.columns = data_per_channel.columns.get_level_values(0)
# data_per_channel

#### 1.1.3 Raw video metadata
The file `yt_metadata_en.jsonl.gz` contains metadata related to ~73M videos from ~137k channels. Below we show the data recorded for each video.
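
With tens of millions of records, loading the whole file at once may not fit in memory. A possible alternative is to stream it in chunks with the `chunksize` argument of `pd.read_json` and keep only running aggregates. The sketch below runs on a tiny temporary file rather than the real dataset, and the field names `channel_id` and `view_count` are hypothetical placeholders.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed JSON Lines file so the example runs without the real dataset.
# The field names below are hypothetical, not taken from the actual file.
records = [{"channel_id": f"ch{i % 3}", "view_count": i * 10} for i in range(10)]
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_metadata.jsonl.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# Stream the file in chunks of 4 lines and keep only per-chunk partial sums in memory.
n_rows = 0
partials = []
for chunk in pd.read_json(tmp_path, compression="gzip", lines=True, chunksize=4):
    n_rows += len(chunk)
    partials.append(chunk.groupby("channel_id")["view_count"].sum())

# Combine the partial sums into one total per channel.
views_per_channel = pd.concat(partials).groupby(level=0).sum()

print(n_rows)
print(views_per_channel)
```

For the real file, a much larger `chunksize` would be a more reasonable starting point than the 4 lines used here.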

In [None]:
!ls -lh {DATA_FOLDER}yt_metadata_en.jsonl.gz

In [None]:
# ! zcat {DATA_FOLDER}yt_metadata_en.jsonl.gz | head

In [None]:
df_yt_metadata = pd.read_json(DATA_FOLDER+'yt_metadata_en.jsonl.gz', compression='gzip', lines=True, nrows=2000)
df_yt_metadata.head(2)

#### 1.1.4 User-comment matrices

In [None]:
# !ls -lh {DATA_FOLDER}youtube_comments.tsv.gz

In [None]:
# user-comment matrices
df_yt_comments = pd.read_csv(DATA_FOLDER+'youtube_comments.tsv.gz', sep="\t", compression='gzip', nrows=100)
df_yt_comments.head()

#### 1.1.5 Raw comments

In [None]:
# !ls -lh {DATA_FOLDER}youtube_comments.ndjson.zst

In [None]:
def line_jsonify(line): 
    """

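    Wrap a raw line in square brackets and undo its doubled double-quotes
    so that the result can be parsed with json.loads.

    :param line: string to parse and jsonify
    :return: the cleaned line, as a string holding a JSON array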
    """    
    
    # add square brackets around line
    line = "[" + line + "]"

    # remove quotes before and after square brackets   
    line = line.replace("\"[{", "[{")
    line = line.replace("}]\"", "}]")    
    
    # replace double double-quotes with single double-quotes
    line = line.replace("{\"\"", "{\"")
    line = line.replace("\"\"}", "\"}")
    line = line.replace("\"\":\"\"", "\":\"")
    line = line.replace(":\"\"", ":\"")
    line = line.replace("\"\":", "\":")
    
    # line = line.replace("\"\":", "\":")
    line = line.replace("\"\",\"\"", "\",\"")
    line = line.replace("\"\",\"\"", "\",\"")
    line = line.replace("\\\"\"", "\\\"")
    line = line.replace("\\\",[", "\\\\ \",[")
    
    line = re.sub(r',\"\"(?!\,)', ',\"', line)

    line = line.replace("true,\"\"", "true,\"")
    line = line.replace("false,\"\"", "false,\"")
    
    return line

In [None]:
class Zreader:

    def __init__(self, file, chunk_size=16384):
        '''Open a zstd-compressed file for chunked, line-by-line reading'''
        import codecs
        self.fh = open(file,'rb')
        print(f"reading {file} in chunks ...")
        self.chunk_size = chunk_size
        self.dctx = zstandard.ZstdDecompressor(max_window_size=2147483648)
        self.reader = self.dctx.stream_reader(self.fh)
        # incremental decoder: a multi-byte character cut by a chunk boundary is still decoded correctly
        self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        self.buffer = ''

    def readlines(self):
        '''Generator method that yields each line of the decompressed file'''
        nb_chunk = 0
        while True:
            nb_chunk = nb_chunk + 1
            if nb_chunk % 5000 == 0:
                print("number of chunks read: ", nb_chunk)

            raw = self.reader.read(self.chunk_size)

            if not raw:
                break
            chunk = self.decoder.decode(raw)
            lines = (self.buffer + chunk).split("\n")

            # all pieces except the last one are complete lines
            for line in lines[:-1]:
                yield line

            # the last piece may be incomplete: keep it for the next chunk
            self.buffer = lines[-1]

        # yield the last line if the file does not end with a newline
        if self.buffer:
            yield self.buffer
            self.buffer = ''
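
The chunk-and-buffer pattern behind `Zreader.readlines` can be tried out without the compressed file. The sketch below applies it to an in-memory byte stream; the helper `iter_lines` is hypothetical and relies on `codecs.getincrementaldecoder` so that characters split across two chunks survive intact.

```python
import codecs
import io

def iter_lines(stream, chunk_size=8):
    """Yield text lines from a binary stream that is read in fixed-size chunks."""
    # The incremental decoder holds back the bytes of a character cut by a chunk boundary.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    buffer = ""
    while True:
        raw = stream.read(chunk_size)
        if not raw:
            break
        pieces = (buffer + decoder.decode(raw)).split("\n")
        # All pieces except the last are complete lines; the last may continue in the next chunk.
        yield from pieces[:-1]
        buffer = pieces[-1]
    # Flush whatever the decoder still holds, then emit a final line that has no trailing newline.
    buffer += decoder.decode(b"", final=True)
    if buffer:
        yield buffer

# In-memory stand-in for the decompressed stream (the real one would come from zstandard).
data = "id,text\n1,héllo wörld\n2,last line".encode("utf-8")
lines = list(iter_lines(io.BytesIO(data), chunk_size=8))
print(lines)
```

The result should not depend on `chunk_size`, which makes a quick consistency check possible by reading the same stream with different chunk sizes.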

In [None]:
NB_OF_LINES = 350000
lines_json = []
inp_file = DATA_FOLDER+"youtube_comments.ndjson.zst"
reader = Zreader(inp_file, chunk_size=4092)

for i, line in enumerate(reader.readlines()):
    if i > NB_OF_LINES:
        # print(line)
        break
    line_json = json.loads(line_jsonify(line))
    lines_json.append(line_json)

print("==> number of lines read:", len(lines_json))

df_yt_comments_raw = pd.DataFrame(data=lines_json[1:], columns=lines_json[0])
df_yt_comments_raw.head()

### 1.2. Graphtreon dataset

#### 1.2.1 List with all creator names.

In [None]:
# !ls -lh {DATA_FOLDER}creators.csv

In [None]:
# list with all creator names.
df_gt_creators = pd.read_csv(DATA_FOLDER+'creators.csv')
df_gt_creators.head()

#### 1.2.2 All Graphtreon time-series

In [None]:
!ls -lh {DATA_FOLDER}final_processed_file.jsonl.gz

In [None]:
# final_processed_file.jsonl.gz: all Graphtreon time-series.
df_gt_timeseries = pd.read_json(DATA_FOLDER+'final_processed_file.jsonl.gz', compression='gzip', lines=True, nrows=10)
df_gt_timeseries.head()

# df_gt_timeseries = pd.read_json(DATA_FOLDER+'final_processed_file.jsonl.gz', compression='gzip', lines=True)

# get only id and match them first
# 

##### Summary statistics

In [None]:
df_gt_timeseries['startDate'] = pd.to_datetime(df_gt_timeseries['startDate'])
df_gt_timeseries.head()

In [None]:
print('Number of unique creators:         {:,}'.format(df_gt_timeseries['creatorName'].nunique()))
print('Number of unique patreon ids:         {:,}'.format(df_gt_timeseries['patreon'].nunique()))

print('Timeseries data was gathered between {} and {}'.format(df_gt_timeseries['startDate'].min().strftime('%B %d, %Y'),
                                                         df_gt_timeseries['startDate'].max().strftime('%B %d, %Y')))
print('Total number of datapoints across all creators: {:>12,}'.format(len(df_gt_timeseries)))

#### 1.2.3 Raw HTML of the pages in Graphtreon

In [None]:
!ls -lh {DATA_FOLDER}pages.zip

In [None]:
# pages.zip: raw HTML of the pages in Graphtreon.