# Processing Notebook

This notebook aggregates the *.csv* files produced by `twitch_stream_bot.py` and retrieves information about the streamers.

In [13]:
import pandas as pd
import os

## Aggregating streams

Twitch's API changed at the end of 2022: each stream now comes with a list of tags, not just a list of tag IDs.

As a result, the stream filter in `twitch_stream_bot.py` now keeps a stream if it has the Vtuber tag ID or if the word `'Vtuber'` appears in its tag list. This broader filter catches many more streams than before, so the data collected before and after the change is not comparable.

For that reason, I decided to split the data into two sets, one per version of the filter.
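
The filtering rule described above can be sketched as follows. This is not the code from `twitch_stream_bot.py`, which is not shown in this notebook: the function name, the `VTUBER_TAG_ID` placeholder, and the assumption that each stream is a dict with `tag_ids` and `tags` fields are all illustrative.

```python
# Hypothetical placeholder: the real Vtuber tag ID is not given in this notebook.
VTUBER_TAG_ID = "<vtuber-tag-id>"


def is_vtuber_stream(stream: dict) -> bool:
    """Return True if the stream carries the Vtuber tag ID or the 'Vtuber' tag."""
    # Either field may be missing or None depending on the API version.
    tag_ids = stream.get("tag_ids") or []
    tags = stream.get("tags") or []
    return VTUBER_TAG_ID in tag_ids or "Vtuber" in tags


# Toy records, made up for the example.
old_style = {"tag_ids": [VTUBER_TAG_ID], "tags": None}
new_style = {"tag_ids": [], "tags": ["English", "Vtuber"]}
other = {"tag_ids": [], "tags": ["English"]}

matches = [is_vtuber_stream(s) for s in (old_style, new_style, other)]
```

The second record is the kind of stream that only the new filter catches, which is why the two collection periods are kept apart.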

In [14]:
processing_new_twitch_API_streams = True

All the *.csv* files are located in the `sync` folder.

In [15]:
if processing_new_twitch_API_streams:
    csv_path = '../Data/scattered/tagIds_tags/' # For new Twitch API
else:
    csv_path = '../Data/scattered/tagIds/' # For old Twitch API

csv_dfs = [pd.read_csv(csv_path + file) for file in os.listdir(path=csv_path) if file.endswith('csv')]

df = pd.concat(csv_dfs, ignore_index=True)

print(f'There are {len(df) - len(df.id.drop_duplicates())} duplicates')

print('Deleting duplicates...')
df.sort_values('_custom_ended_at', inplace=True)
df = df.drop_duplicates(subset='id', keep='last').reset_index(drop=True)  # keep the most recent record of each stream

print(f'There are {len(df) - len(df.id.drop_duplicates())} duplicates')

df.sort_values('started_at', inplace=True, ignore_index=True)


There are 68 duplicates
Deleting duplicates...
There are 0 duplicates
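
To illustrate the deduplication step above, here is a minimal sketch on a toy DataFrame. The `viewer_count` column and all the values are made up for the example; only the `id` and `_custom_ended_at` column names come from the notebook. It sorts by `_custom_ended_at` and keeps the last row per `id`, so the most recent record of each stream survives.

```python
import pandas as pd

# Toy data: streams 1 and 2 were each recorded twice.
toy = pd.DataFrame({
    "id": [1, 2, 1, 3, 2],
    "_custom_ended_at": [
        "2023-01-05T10:00:00Z",
        "2023-01-05T11:00:00Z",
        "2023-01-05T12:00:00Z",
        "2023-01-05T09:00:00Z",
        "2023-01-05T10:30:00Z",
    ],
    "viewer_count": [10, 20, 15, 5, 18],  # assumed column, for illustration only
})

# Number of surplus rows, counted the same way as in the notebook.
n_dupes = len(toy) - len(toy.id.drop_duplicates())

# Sort chronologically, then keep the last (most recent) row of each id.
latest = (
    toy.sort_values("_custom_ended_at")
       .drop_duplicates(subset="id", keep="last")
       .reset_index(drop=True)
)
```

ISO 8601 timestamps in the same timezone sort correctly as plain strings, which is why the column does not need to be parsed to datetimes first.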


Print some summary information about the data.

In [16]:
print(f'''{len(df)} streams
{len(df.groupby('user_id'))} different broadcasters
first started live : {df.iloc[0]['started_at']}
last started live : {df.iloc[-1]['started_at']}
''')

4612 streams
1006 different broadcasters
first started live : 2023-01-05T13:18:16Z
last started live : 2023-01-17T04:57:30Z



Save the DataFrame in the `../Data/aggregated` folder for later use.

In [17]:
if processing_new_twitch_API_streams:
    df.to_csv('../Data/aggregated/twitch_streams_vtuber_tagIds_tags.csv', index=False)
else:
    df.to_csv('../Data/aggregated/twitch_streams_vtuber_tagIds.csv', index=False)


## Aggregating videos

The process is much the same for videos, with more attention paid to memory usage.

In [18]:
csv_path = '../Data/scattered/videos/'

# Start from an empty DataFrame and add the files one at a time (loop below),
# instead of loading every CSV into memory at once.
df = pd.DataFrame()

for file in os.listdir(path=csv_path):
    if file.endswith('csv'):
        videos = pd.read_csv(csv_path + file).sort_values('view_count')
        df = pd.concat([df, videos]).drop_duplicates()

print(f'There are {len(df) - len(df.id.drop_duplicates())} id duplicates')

print('Deleting duplicates...')
df.sort_values('view_count', inplace=True)
df = df.drop_duplicates(subset='id', keep='last').reset_index(drop=True)  # keep the record with the highest view count

print(f'There are {len(df) - len(df.id.drop_duplicates())} id duplicates')

df.sort_values('created_at', inplace=True, ignore_index=True)

There are 6026 id duplicates
Deleting duplicates...
There are 0 id duplicates
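
The memory-conscious aggregation above could also deduplicate by `id` after each file, so the running DataFrame never holds more than one row per video. The sketch below is one possible way to do this, not the notebook's method: `aggregate_csvs` is a hypothetical helper, and the two CSV files are toy data written to a temporary folder.

```python
import os
import tempfile

import pandas as pd


def aggregate_csvs(folder: str, key: str, order_by: str) -> pd.DataFrame:
    """Hypothetical helper: read CSVs one at a time, keeping per `key`
    the row with the highest `order_by` value seen so far."""
    result = None
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".csv"):
            continue
        chunk = pd.read_csv(os.path.join(folder, name))
        merged = chunk if result is None else pd.concat([result, chunk], ignore_index=True)
        # Deduplicate right away so the running frame stays small.
        result = (
            merged.sort_values(order_by)
                  .drop_duplicates(subset=key, keep="last")
                  .reset_index(drop=True)
        )
    return result if result is not None else pd.DataFrame()


with tempfile.TemporaryDirectory() as folder:
    # Toy files: video 2 appears in both, with a higher view count in b.csv.
    pd.DataFrame({"id": [1, 2], "view_count": [10, 5]}).to_csv(
        os.path.join(folder, "a.csv"), index=False)
    pd.DataFrame({"id": [2, 3], "view_count": [8, 1]}).to_csv(
        os.path.join(folder, "b.csv"), index=False)
    videos = aggregate_csvs(folder, key="id", order_by="view_count")
```

The trade-off is an extra sort per file in exchange for a smaller peak memory footprint.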


In [19]:
print(f'''{len(df)} videos
{len(df.groupby('user_id'))} different broadcasters
first created video : {df.iloc[0]['created_at']}
last created video : {df.iloc[-1]['created_at']}
''')

31859 videos
1022 different broadcasters
first created video : 2013-02-05T13:47:00Z
last created video : 2023-01-17T05:09:49Z



In [20]:
df.to_csv('../Data/aggregated/twitch_videos_vtuber.csv', index=False)

## Querying users info

In [21]:
import requests
import json

# Load the .json file holding the CLIENT_ID and the TOKEN
with open('../twitch_credentials.json', mode='r') as f:
    twitch_credentials = json.load(f)

# Twitch API variables
CLIENT_ID = twitch_credentials['CLIENT_ID']
TOKEN = twitch_credentials['TOKEN']

headers = {"Authorization": "Bearer " + TOKEN, "Client-Id": CLIENT_ID}

oldAPIStreams = pd.read_csv('../Data/aggregated/twitch_streams_vtuber_tagIds.csv')
newAPIStreams = pd.read_csv('../Data/aggregated/twitch_streams_vtuber_tagIds_tags.csv')

streams = pd.concat([oldAPIStreams, newAPIStreams], ignore_index=True)
users_id = streams.loc[:, ['user_id']].drop_duplicates()

We use two requests: `Get Users` for basic user information and `Get Channel Information` for the channel *tags*. Both accept up to 100 IDs per call, so the IDs are sent in batches of 100.

In [22]:
users_id_list = list(users_id.user_id.drop_duplicates())

channels_descr = pd.DataFrame()
users_descr = pd.DataFrame()

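# Get Channel Information: query the channels in batches of 100 IDs
for i in range(0, len(users_id_list), 100):
    url = 'https://api.twitch.tv/helix/channels?broadcaster_id=' + "&broadcaster_id=".join(map(str, users_id_list[i:i+100]))
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on an HTTP error (e.g. an expired token)
    channels_descr = pd.concat([pd.DataFrame(response.json()["data"]), channels_descr])

# Get Users: same batching for the basic user information
for i in range(0, len(users_id_list), 100):
    url = 'https://api.twitch.tv/helix/users?id=' + "&id=".join(map(str, users_id_list[i:i+100]))
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on an HTTP error (e.g. an expired token)
    users_descr = pd.concat([pd.DataFrame(response.json()["data"]), users_descr])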
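
The batching used in the two loops above can be isolated and checked without calling the API. In this sketch, `batched` and `build_url` are hypothetical helper names and the IDs are made up. The query string is built with the standard library's `urlencode` instead of string joins, and no request is sent.

```python
from urllib.parse import urlencode


def batched(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_url(endpoint, param, ids):
    """Build a Helix URL that repeats `param` once per ID."""
    query = urlencode({param: ids}, doseq=True)
    return f"https://api.twitch.tv/helix/{endpoint}?{query}"


# 250 made-up IDs split into batches of 100, 100 and 50.
ids = list(range(1, 251))
batches = list(batched(ids))
batch_sizes = [len(batch) for batch in batches]

# URL for the first three IDs only, to keep the example readable.
first_url = build_url("users", "id", batches[0][:3])
```

With `requests`, a similar result can be obtained by passing a list of `(param, value)` tuples through the `params` argument, which avoids building the URL by hand.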

We merge the two DataFrames into a single one.

In [23]:
channels_descr_relevant = channels_descr.loc[:,['broadcaster_id', 'broadcaster_login', 'tags']].rename(columns={'broadcaster_id':'id','broadcaster_login':'login', 'tags':'_channel_tags'})
users_descr_tags = pd.merge(users_descr, channels_descr_relevant, how='outer', on=['id','login'])
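
To see what the outer merge above does with users that appear in only one of the two responses, here is a minimal sketch on toy data. The logins, names, and tags are made up. The `indicator=True` option is not used in the notebook and is added here only to show where each row came from.

```python
import pandas as pd

# Toy stand-ins for the two API responses.
users = pd.DataFrame({
    "id": ["1", "2"],
    "login": ["alice", "bob"],
    "display_name": ["Alice", "Bob"],
})
channels = pd.DataFrame({
    "id": ["2", "3"],
    "login": ["bob", "carol"],
    "_channel_tags": [["Vtuber"], ["English"]],
})

# The outer join keeps every user, matched or not.
# The extra `_merge` column tells which side(s) each row came from.
merged = pd.merge(users, channels, how="outer", on=["id", "login"], indicator=True)
origin = merged.set_index("id")["_merge"].astype(str).to_dict()
```

Rows marked `left_only` or `right_only` have missing values in the columns of the other DataFrame, which is worth checking before relying on `_channel_tags`.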

In [24]:
users_descr_tags.to_csv('../Data/aggregated/twitch_users_channelTags_vtuber.csv', index=False)

In [28]:
type(users_descr_tags.iloc[0]._channel_tags)

list
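
The check above confirms that `_channel_tags` holds Python lists in memory. One caveat, which the notebook does not cover, is that a list column written with `to_csv` comes back as a string when the file is read again. The sketch below shows one way to restore the lists, using `ast.literal_eval` on toy data and an in-memory buffer instead of a file.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, as in users_descr_tags.
frame = pd.DataFrame({"id": ["1"], "_channel_tags": [["Vtuber", "English"]]})

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the cell is the string representation of the list.
raw_value = reloaded.loc[0, "_channel_tags"]

# Parse the strings back into Python lists.
parsed = reloaded["_channel_tags"].apply(ast.literal_eval)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of parsing.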