<a id = "top"></a>

# Data Gathering and Formatting
---

This notebook documents how chat message data was pulled from a live Twitch broadcast using Python sockets. The raw messages are written to a .log file along with a timestamp accurate to the second. The messages are then parsed into a pandas dataframe using a regex, and messages-per-second data is calculated from this dataframe. For a more technical description of how the data was collected and formatted, see `twitch_chat_scrape.py` and `twitch_chat_format.py` in the repository.
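
As a rough illustration of the regex step, the sketch below parses log lines into a time-indexed dataframe. The log layout (a timestamp, a space, then the raw IRC line), the pattern, and the name `parse_chat_log_lines` are assumptions made for this example; the real logic lives in `twitch_chat_format.py` and may well differ. Only the `PRIVMSG` shape of a Twitch chat line is taken as given.

```python
import re

import pandas as pd

# Assumed (hypothetical) log layout: "<YYYY-MM-DD HH:MM:SS> <raw IRC line>".
# The IRC part follows the Twitch chat PRIVMSG shape:
#   :user!user@user.tmi.twitch.tv PRIVMSG #channel :message
LINE_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r":(?P<username>\w+)!\S+ PRIVMSG #(?P<channel>\w+) :(?P<message>.*)$"
)


def parse_chat_log_lines(lines):
    """Turn raw log lines into a dataframe indexed by timestamp."""
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line.rstrip("\r\n"))
        if match:  # lines that are not chat messages (e.g. PING) are skipped
            rows.append(match.groupdict())
    chat = pd.DataFrame(rows, columns=["time", "username", "channel", "message"])
    chat["time"] = pd.to_datetime(chat["time"], format="%Y-%m-%d %H:%M:%S")
    return chat.set_index("time")


sample_lines = [
    "2019-04-30 08:56:59 :laudon!laudon@laudon.tmi.twitch.tv "
    "PRIVMSG #admiralbulldog :gachiHYPER",
    "2019-04-30 08:57:00 PING :tmi.twitch.tv",
    "2019-04-30 08:57:00 :nevervvinterr!nevervvinterr@nevervvinterr.tmi.twitch.tv "
    "PRIVMSG #admiralbulldog :gachiHYPER",
]
chat = parse_chat_log_lines(sample_lines)
```

Named groups keep the column names next to the pattern that extracts them, so a change to the log layout only touches `LINE_PATTERN`.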

After running the scrape, I found some issues with the Python socket object. Namely, the socket can get "backed up" and start returning multiple messages on the same timestamp. For the most part this is not a serious issue, as the clogs sort themselves out relatively quickly. However, it becomes a much larger problem when the chat is extremely busy, as it was in the second scrape I attempted. The data that I have is functional, but were I to conduct this scrape again, I would take the chat from the VOD (Video-On-Demand) instead, as all of the video editing was done on the VOD anyway.
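
When a socket hands back several messages in one read, one common way to separate them is to buffer the bytes and split on the IRC line delimiter `\r\n`. The sketch below shows that idea only; `split_irc_messages` is a hypothetical helper, not something taken from `twitch_chat_scrape.py`. It separates bundled messages but cannot recover when each one was actually sent, so they would still share the timestamp of the read.

```python
def split_irc_messages(buffer, chunk):
    # Append the newly received bytes to whatever was left over last time.
    buffer += chunk
    # IRC messages end in CR LF. Everything after the final delimiter is an
    # incomplete message that has to wait for the next recv() call.
    *complete, remainder = buffer.split(b"\r\n")
    # Decode only complete lines, so a multi-byte character that was cut in
    # half by the read boundary is never decoded early.
    messages = [line.decode("utf-8", errors="replace") for line in complete if line]
    return messages, remainder


# In a real loop the chunk would come from something like sock.recv(2048).
messages, leftover = split_irc_messages(b"", b"first\r\nsecond\r\nthi")
more_messages, leftover = split_irc_messages(leftover, b"rd\r\n")
```

Here the first call yields two complete messages and keeps the partial third one in `leftover` until the rest of it arrives.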

The actual scraping code has been commented out so that the .log files will not be overwritten if you run all of this code from the top.

**This Notebook:**

- [Scraping from AdmiralBulldog](#bulldog)
- [Scraping from xQcOW](#xqcow)



**Other Notebooks:**

- [Anomaly Detection](02_anomaly_detection.ipynb)
- [Topic Modeling](03_topic_modeling.ipynb)
- [Video Editing](04_video_editing.ipynb)

### Importing
---

In [5]:
import pandas as pd
from twitch_chat_scrape import twitch_chat_scrape
import twitch_chat_format as tcf

In [2]:
# Supply your own oauth token, which can be 
# generated here: https://twitchapps.com/tmi/ 

oauth_path = "../twitch_oauth_token/token.txt"
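
For context on where the token ends up: Twitch chat is reached over IRC, and a client logs in by sending `PASS`, `NICK`, and `JOIN` lines. The sketch below only builds those lines; `build_login_commands` is a hypothetical helper, and whether `twitch_chat_scrape` formats the handshake this way is an assumption.

```python
def build_login_commands(nickname, token, channel):
    # A token read from a file often carries a trailing newline.
    token = token.strip()
    # Twitch expects the password in the form "oauth:<token>".
    if not token.startswith("oauth:"):
        token = "oauth:" + token
    commands = [
        f"PASS {token}",
        f"NICK {nickname.lower()}",
        f"JOIN #{channel.lower()}",
    ]
    # Each IRC command is terminated by CR LF and sent as bytes.
    return [(command + "\r\n").encode("utf-8") for command in commands]


# Placeholder token for illustration only.
login = build_login_commands("ticklebits", "oauth:xxxxxxxx\n", "admiralbulldog")

# With a live connection, the commands would be sent roughly like this:
# import socket
# sock = socket.create_connection(("irc.chat.twitch.tv", 6667))
# for command in login:
#     sock.sendall(command)
```

Keeping the command building separate from the socket makes the handshake easy to check without a network connection.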

<a id = "bulldog"></a>

## Scraping from AdmiralBulldog
---

AdmiralBulldog's stream averages 4,000 to 5,000 concurrent viewers and just under 3 chat messages per second. This makes his chat ideal for this analysis: it is busy enough to be highly responsive to on-stream stimuli but not so busy that it causes errors with the Python socket.

In [3]:
# with open(oauth_path, "r", encoding = "utf-8") as oauth:
#     twitch_chat_scrape(nickname = "ticklebits",
#                        token = oauth.read().strip(),
#                        channel = "admiralbulldog",
#                        minutes = 240,
#                        path = "../data/logs/chat_admiralbulldog_4_30.log")

----------------------------------------
Checkpoint number 1 at 24.0 minutes.
4374 messages logged!
----------------------------------------
Checkpoint number 2 at 48.0 minutes.
8236 messages logged!
----------------------------------------
Checkpoint number 3 at 72.02 minutes.
12172 messages logged!
----------------------------------------
Checkpoint number 4 at 96.0 minutes.
15803 messages logged!
----------------------------------------
Checkpoint number 5 at 120.01 minutes.
19337 messages logged!
----------------------------------------
Checkpoint number 6 at 144.0 minutes.
23840 messages logged!
----------------------------------------
Checkpoint number 7 at 168.0 minutes.
26465 messages logged!
----------------------------------------
Checkpoint number 8 at 192.03 minutes.
28522 messages logged!
----------------------------------------
Scrape interrupted by user after 194.09 minutes.
28635 messages logged!
----------------------------------------


> The scrape was stopped early because the stream ended.

In [3]:
admiralbulldog_4_30 = tcf.twitch_chat_format("../data/logs/chat_admiralbulldog_4_30.log")

In [4]:
admiralbulldog_4_30.shape

(28844, 3)

In [5]:
admiralbulldog_4_30.head()

| time | username | channel | message |
|---|---|---|---|
| 2019-04-30 08:56:59 | collectcalled | admiralbulldog | IF HENRIK WAS AN AniMAL |
| 2019-04-30 08:56:59 | laudon | admiralbulldog | gachiHYPER |
| 2019-04-30 08:57:00 | hyper_brah | admiralbulldog | WutFace WutFace WutFace WutFace WutFace WutFac... |
| 2019-04-30 08:57:00 | felianjo | admiralbulldog | I LOVE THEM, JUST LET THEM IN CAGES Pepega Clap |
| 2019-04-30 08:57:00 | nevervvinterr | admiralbulldog | gachiHYPER |


In [6]:
admiralbulldog_4_30.isna().sum()

username    0
channel     0
message     0
dtype: int64

In [7]:
# Removing the last 250 messages because 
# they occur after the stream ends

admiralbulldog_4_30 = admiralbulldog_4_30.iloc[:-250]

In [8]:
# Removing messages sent by the Auto-moderator, a chat bot,
# as well as the bot commands sent to it.

admiralbulldog_4_30 = tcf.filter_bot_messages(admiralbulldog_4_30, 
                                              bot_name = "admiralbullbot")
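
A filter of this kind can be sketched with two boolean masks. This is not the code behind `tcf.filter_bot_messages`; `drop_bot_traffic` is a hypothetical name, and the sketch assumes that bot commands are the messages starting with `!`, which is a common Twitch convention but not something confirmed for this chat.

```python
import pandas as pd


def drop_bot_traffic(chat, bot_name, command_prefix="!"):
    # Rows written by the bot itself (usernames compared case-insensitively).
    from_bot = chat["username"].str.lower() == bot_name.lower()
    # Rows that look like commands addressed to the bot (assumed "!" prefix).
    is_command = chat["message"].str.startswith(command_prefix)
    return chat[~(from_bot | is_command)]


example_chat = pd.DataFrame(
    {
        "username": ["laudon", "admiralbullbot", "felianjo"],
        "channel": ["admiralbulldog"] * 3,
        "message": ["gachiHYPER", "Points awarded", "!points"],
    }
)
filtered = drop_bot_traffic(example_chat, bot_name="admiralbullbot")
```

Dropping the commands along with the bot's replies matters for the rate data, since both are chat traffic that does not reflect a reaction to the stream.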

In [9]:
# Counting the messages sent during each second of the stream

admiralbulldog_4_30_mps = tcf.messages_per_second(admiralbulldog_4_30)

Processing 188 minutes of chat messages...
20 out of 188 minutes of messages processed.
40 out of 188 minutes of messages processed.
60 out of 188 minutes of messages processed.
80 out of 188 minutes of messages processed.
100 out of 188 minutes of messages processed.
120 out of 188 minutes of messages processed.
140 out of 188 minutes of messages processed.
160 out of 188 minutes of messages processed.
180 out of 188 minutes of messages processed.
188 out of 188 minutes of messages processed.
...All messages processed.
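
For comparison, a per-second count can also be computed with pandas resampling. This is an alternative sketch, not the implementation of `tcf.messages_per_second`; the name `count_messages_per_second` is hypothetical, and it assumes the dataframe is indexed by message timestamp as shown in the `head()` output above.

```python
import pandas as pd


def count_messages_per_second(chat):
    # Bucket the rows into one-second bins and count the messages in each.
    # Seconds with no messages appear in the result with a count of 0.
    per_second = chat["message"].resample("1s").count()
    return per_second.rename("mps")


times = pd.to_datetime(
    ["2019-04-30 08:56:59", "2019-04-30 08:56:59", "2019-04-30 08:57:01"]
)
example_chat = pd.DataFrame({"message": ["a", "b", "c"]}, index=times)
mps = count_messages_per_second(example_chat)
```

Because resampling fills in the quiet seconds, the resulting series has one entry for every second between the first and last message.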


In [10]:
# One data frame has the messages and timestamps... 
admiralbulldog_4_30.to_csv("../data/formatted/admiralbulldog_4_30.csv", 
                           index = True)

# ...and the other has the messages per second.
admiralbulldog_4_30_mps.to_csv("../data/formatted/admiralbulldog_4_30_mps.csv", 
                               index = True,
                               header = ["mps"])

<a id = "xqcow"></a>
## Scraping from xQcOW
---

xQcOW's stream is much larger, with an average concurrent viewership of over 10,000. These viewers average over 7 messages per second, more than double the rate of AdmiralBulldog's chat. This data ended up being unused, as the issues with the Python socket are exacerbated by the busier chat.

In [7]:
# with open(oauth_path, "r", encoding = "utf-8") as oauth:
#     twitch_chat_scrape(nickname = "ticklebits",
#                        token = oauth.read().strip(),
#                        channel = "xqcow",
#                        minutes = 240,
#                        path = "../data/logs/chat_xqcow_5_03.log")

> Scrape completed with over 100,000 messages logged

In [13]:
xqcow_5_03 = tcf.twitch_chat_format("../data/logs/chat_xqcow_5_03.log")

In [14]:
xqcow_5_03.shape

(107109, 3)

In [15]:
xqcow_5_03.head()

| time | username | channel | message |
|---|---|---|---|
| 2019-05-03 08:46:43 | mythikow | xqcow | "LET ME TELL YOU WHAT YOU MEAN" WeirdChamp "LE... |
| 2019-05-03 08:46:43 | thisispaule | xqcow | easy troll chat BIG Kappa |
| 2019-05-03 08:46:43 | epho__ | xqcow | xqcL |
| 2019-05-03 08:46:43 | end_my_suffering_xd | xqcow | FeelsStrongMan DONO |
| 2019-05-03 08:46:43 | eoin_2 | xqcow | THIS PVC DUDE :face_with_tears_of_joy: :OK_han... |


In [16]:
xqcow_5_03.isna().sum()

username    0
channel     0
message     0
dtype: int64

In [17]:
# Removing messages sent by the Auto-moderator, a chat bot.
xqcow_5_03 = tcf.filter_bot_messages(xqcow_5_03, 
                                      bot_name = "schnozebot")

# xQcOW's chat is so busy it actually has multiple bots
xqcow_5_03 = tcf.filter_bot_messages(xqcow_5_03, 
                                      bot_name = "fossabot")

In [18]:
xqcow_5_03_mps = tcf.messages_per_second(xqcow_5_03)

Processing 240 minutes of chat messages...
20 out of 240 minutes of messages processed.
40 out of 240 minutes of messages processed.
60 out of 240 minutes of messages processed.
80 out of 240 minutes of messages processed.
100 out of 240 minutes of messages processed.
120 out of 240 minutes of messages processed.
140 out of 240 minutes of messages processed.
160 out of 240 minutes of messages processed.
180 out of 240 minutes of messages processed.
200 out of 240 minutes of messages processed.
220 out of 240 minutes of messages processed.
240 out of 240 minutes of messages processed.
...All messages processed.


In [19]:
xqcow_5_03.to_csv("../data/formatted/xqcow_5_03.csv", 
                  index = True)

xqcow_5_03_mps.to_csv("../data/formatted/xqcow_5_03_mps.csv", 
                      index = True,
                      header = ["mps"])

---
[Back to top](#top)