# Processing Notebook

The purpose of this notebook is to process and aggregate the *.csv* files provided by `twitch_stream_bot.py`.


In [1]:
import pandas as pd
import os

Twitch's API changed at the end of 2022 and now provides a list of tags, not just a list of tag IDs, for each stream.

As a result, the stream filter in `twitch_stream_bot.py` now checks whether a stream has the Vtuber tag ID or has the word `'Vtuber'` in its tag list. This broader filter picks up many more streams than before, so the data collected before and after the change is not comparable.

So I decided to split the data into two sets, one per API version, and process them separately.
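The filtering rule described above can be sketched as follows. This is only an illustration, not the code from `twitch_stream_bot.py`: the function name `is_vtuber_stream` is hypothetical, `VTUBER_TAG_ID` is a placeholder because the real tag ID is not given in this notebook, and the case-insensitive comparison is an assumption. The field names `tag_ids` and `tags` are assumed to match the stream objects returned by the API.

```python
# Hypothetical placeholder: the real Vtuber tag ID is not stated in this notebook.
VTUBER_TAG_ID = "<vtuber-tag-id>"


def is_vtuber_stream(stream: dict) -> bool:
    """Return True if the stream carries the Vtuber tag ID or a 'Vtuber' tag."""
    # Either field may be missing or None, so fall back to an empty list.
    tag_ids = stream.get("tag_ids") or []
    tags = stream.get("tags") or []

    has_tag_id = VTUBER_TAG_ID in tag_ids
    # Assumption: tag names are compared case-insensitively.
    has_tag_name = any(tag.lower() == "vtuber" for tag in tags)
    return has_tag_id or has_tag_name


old_style = {"id": "1", "tag_ids": [VTUBER_TAG_ID], "tags": None}
new_style = {"id": "2", "tag_ids": [], "tags": ["English", "Vtuber"]}
unrelated = {"id": "3", "tag_ids": [], "tags": ["Speedrun"]}

print([is_vtuber_stream(s) for s in (old_style, new_style, unrelated)])
```

A stream such as `new_style`, which has the tag name but not the tag ID, would have been missed by an ID-only filter, which is why the two collection periods yield different volumes.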

In [2]:
processing_new_twitch_API_streams = True

All the *.csv* files are located in the `sync` folder.

In [3]:
if processing_new_twitch_API_streams:
    csv_path = './sync/tagIds_tags/' # For new Twitch API
else:
    csv_path = './sync/tagIds/' # For old Twitch API

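# Load every CSV file in the folder (sorted for a reproducible load order).
csv_files = sorted(file for file in os.listdir(csv_path) if file.endswith('.csv'))
csv_dfs = [pd.read_csv(os.path.join(csv_path, file)) for file in csv_files]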

df = pd.concat(csv_dfs, ignore_index=True)

print(f'There are {len(df) - len(df.id.drop_duplicates())} duplicates')

print('Deleting duplicates...')
# For each stream id, keep only the row with the latest `_custom_ended_at`.
df = df.sort_values('_custom_ended_at').drop_duplicates(subset='id', keep='last').reset_index(drop=True)

print(f'There are {len(df) - len(df.id.drop_duplicates())} duplicates')

df.sort_values('started_at', inplace=True, ignore_index=True)


There are 37 duplicates
Deleting duplicates...
There are 0 duplicates
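The deduplication rule used above is "for each stream `id`, keep the row with the latest `_custom_ended_at`". Here is a minimal sketch of that rule on toy data, written with `drop_duplicates` as one way to express it. The values and the `viewer_count` column are made up for illustration, and the timestamps are assumed to be ISO 8601 strings in a single format, so sorting them as text matches chronological order.

```python
import pandas as pd

# Toy data: stream 1 was recorded twice, stream 2 once.
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "_custom_ended_at": [
        "2023-01-05T14:00:00Z",
        "2023-01-05T15:00:00Z",
        "2023-01-06T10:00:00Z",
    ],
    "viewer_count": [10, 25, 7],
})

# Sort by end time, then keep the last (most recent) row of each id.
deduped = (
    toy.sort_values("_custom_ended_at")
       .drop_duplicates(subset="id", keep="last")
       .reset_index(drop=True)
)

print(deduped)
```

Only the later of the two rows for stream 1 survives, so each stream appears exactly once in the result.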


Printing some information about the data.

In [4]:
print(f'''{len(df)} streams
{df['user_id'].nunique()} different broadcasters
first started live : {df.iloc[0]['started_at']}
last started live : {df.iloc[-1]['started_at']}
''')

2410 streams
818 different broadcasters
first started live : 2023-01-05T13:18:16Z
last started live : 2023-01-11T20:25:05Z



Saving the dataframe in the `../Data` folder for further use.

In [5]:
if processing_new_twitch_API_streams:
    df.to_csv('../Data/twitch_streams_vtuber_tagIds_tags.csv', index=False)
else:
    df.to_csv('../Data/twitch_streams_vtuber_tagIds.csv', index=False)
