When my friends and I were in college and had more flexible schedules, we all got into watching English Premier League football. Thanks to my friend Duncan, who was already a fan, many of us wound up as fans of Tottenham (Hotspur, or just "Spurs" for those who don't follow soccer; not to be confused with the basketball team from San Antonio - here's looking at you, Google game results). Unfortunately, once we all got jobs, many of the weekday games landed right in the middle of our American workdays due to the time difference. Since it's harder to get away with watching a stream than scrolling reddit at work, I often found myself on the [r/coys](https://www.reddit.com/r/coys) subreddit watching the match thread for big events rather than watching the game itself.

Anyone who has spent time on a sports-related subreddit knows that redditor fans can be fickle and extreme. If a player allows a goal one game they may be pilloried in the comments section, while a mere game later they may be held up as MOTM (man of the match) if they do something the fan base likes. As a relatively casual fan, it's also not always clear to me _why_ the fans on reddit are angry or happy with a game. Sometimes I'll watch a game and think it's fine, only to find the online fans going berserk; other times I'll watch the match thread go wild with excitement (over Gareth Bale breathing, for instance) and the match will look pretty shoddy to me.

I was discussing this with some friends, and it led me to wonder how much the comments I read while "watching" the match thread actually reflected the content of the game. Specifically, __if I was only basing my emotional experience of the game on the match thread, would I get a similar feel for how things were going as if I was watching?__ My friend Duncan (who, incidentally, is one of the three people this blog derives its name from) suggested that it could be cool to try to automatically analyze the sentiment of match threads and see how well they reflected actual in-game events. I thought it was a neat idea, and had been wanting to mess about with some NLP, so I gave it a go.

Initially I attempted to use the native Reddit API, but I found it clunky and slow. Instead, I opted to use the [pushshift endpoints](https://pushshift.io/api-parameters/). This approach has its drawbacks - notably, it's not updated in real time - but for a simple proof of concept it's speedy, RESTful and easy to use for matches that have already occurred.

First we want to get the data for a thread from reddit. We accomplish this using the pushshift.io endpoints. When I first wrote this as a script, I used the [KF Shkendija-Spurs game](https://www.reddit.com/r/coys/comments/iz2b68/match_thread_kf_shkendija_vs_spurs_24_sep_2020/) from September 2020 as an example, since it was relatively recent at the time. At the end of this post I'll show some other examples.

In [2]:
import requests
import pandas as pd
import re 

# INPUT: URL that you want to scrape goes here. The notebook does the rest.
url = "https://www.reddit.com/r/coys/comments/iz2b68/match_thread_kf_shkendija_vs_spurs_24_sep_2020/"

# This is just a regex that cuts out the submission id from the url above.
# Incidentally, this is just the little bit after /comments/
# #TheMoreYouKnow
m = re.search('/comments/(.+?)/', url)
if m is None:
    # Stop here: without an id, `found` would be undefined in the request below.
    raise ValueError("Didn't find a thread id in the above url.")
found = m.group(1)

# This requests the submission comments from the pushshift API.
# Pushshift doesn't update super frequently, so this won't work with threads
# that are only a few hours or minutes old. To do that, use PRAW and the Reddit API
# or just scrape the data yourself; PRAW is much slower though
request = 'https://api.pushshift.io/reddit/comment/search/?link_id=' + found + '&limit=50000'

comments = pd.DataFrame(requests.get(request).json()['data'])
comments.shape

(3337, 43)

The above code chunk returns up to 50000 comments from the specified thread as a dataframe. That's way more than we need as in this case there are only 3337 comments, but hey why not be excessive? 
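If you want to reuse the id extraction and query building outside the notebook, a minimal sketch might look like the following. The helper names (`extract_submission_id`, `build_comment_query`) are illustrative choices rather than anything from the notebook, and the code only builds the URL - it doesn't actually send a request to pushshift.

```python
import re
from urllib.parse import urlencode

# Same pushshift comment-search endpoint used in the cell above.
PUSHSHIFT_COMMENT_SEARCH = "https://api.pushshift.io/reddit/comment/search/"


def extract_submission_id(thread_url):
    """Return the submission id that follows /comments/ in a reddit thread URL."""
    match = re.search(r"/comments/([^/]+)/", thread_url)
    if match is None:
        raise ValueError(f"No submission id found in {thread_url!r}")
    return match.group(1)


def build_comment_query(thread_url, limit=50000):
    """Build the comment-search URL for a thread (hypothetical helper; sends nothing)."""
    params = {"link_id": extract_submission_id(thread_url), "limit": limit}
    return PUSHSHIFT_COMMENT_SEARCH + "?" + urlencode(params)


example_url = (
    "https://www.reddit.com/r/coys/comments/iz2b68/"
    "match_thread_kf_shkendija_vs_spurs_24_sep_2020/"
)
query = build_comment_query(example_url)
print(query)
```

Raising an error when the id is missing means a bad URL fails straight away rather than further down when the request is built.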

Just to give you an idea what kind of info you can get out of pushshift, here are the fields we're dealing with:

In [3]:
comments.columns

Index(['all_awardings', 'approved_at_utc', 'associated_award', 'author',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
       'banned_at_utc', 'body', 'can_mod_post', 'collapsed',
       'collapsed_because_crowd_control', 'collapsed_reason', 'comment_type',
       'created_utc', 'distinguished', 'edited', 'gildings', 'id',
       'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id',
       'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied',
       'subreddit', 'subreddit_id', 'top_awarded_type',
       'total_awards_received', 'treatment_tags', 'author_cakeday'],
      dtype='object')

I went ahead and converted created_utc to a DateTime object because unix time isn't actually that helpful. Also, I was interested in sentiment solely in match threads, which open (roughly) when a game begins and close shortly after it ends. So it made sense to think of time in terms of time since the start of the game (i.e. game starts at approximately _t_=0). I created a variable ```match_time``` to represent this, and also used it to filter out comments that were from more than 150 minutes after the game started, as these were not really pertinent to the sentiment of comments as the game progressed:

In [4]:
comments['datetime_posted'] = pd.to_datetime(comments['created_utc'], unit='s')
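# Whole minutes since the first comment; total_seconds() works across pandas versions,
# whereas .astype('timedelta64[m]') is rejected by newer ones
comments['match_time'] = (comments.datetime_posted - comments.datetime_posted.min()).dt.total_seconds() // 60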
comments = comments.loc[comments['match_time'] < 150]
comments[['datetime_posted', 'match_time']].head()

| | datetime_posted | match_time |
|---|---|---|
| 3 | 2020-09-24 20:32:26 | 148.0 |
| 4 | 2020-09-24 20:31:28 | 147.0 |
| 5 | 2020-09-24 20:28:50 | 145.0 |
| 6 | 2020-09-24 20:28:20 | 144.0 |
| 7 | 2020-09-24 20:27:06 | 143.0 |
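To make the match-time logic concrete, here's a small sketch of the same idea in plain Python, using made-up timestamps rather than the real thread data. The function name `minutes_since_first` and the example offsets are assumptions for illustration only.

```python
from datetime import datetime, timezone


def minutes_since_first(created_utc):
    """Whole minutes between each unix timestamp and the earliest one in the list."""
    start = min(created_utc)
    return [(ts - start) // 60 for ts in created_utc]


# Hypothetical comments: posted 3 hours, 45 minutes, 90 seconds and 0 seconds
# after the thread opened (newest first, like the pushshift output).
opened = int(datetime(2020, 9, 24, 18, 3, 32, tzinfo=timezone.utc).timestamp())
created = [opened + 10800, opened + 2700, opened + 90, opened]

match_time = minutes_since_first(created)
kept = [m for m in match_time if m < 150]  # same 150-minute cut-off as above

print(match_time)  # [180, 45, 1, 0]
print(kept)        # [45, 1, 0]
```

The comment from three hours in is dropped, while everything posted during the game window survives.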


Reddit comments are somewhat complicated by the fact that people can reply to comments, creating long run-on threads that dwell on a particular topic. As a consequence, reddit comments don't really conform well to a linear list-like data structure - it makes more sense to think of comments as a tree of some sort. This complicates things temporally; if someone makes a particularly controversial comment, people might reply to that comment hours later with some pretty strong language; however, as the match time for these replies will be much later than the event they relate to, they can't really be seen as indicative of the sentiment at that point in the match.

Luckily, most comments in a match thread are "top-level", meaning that they aren't responding to any other comment, but rather are responding to the events of the match as they transpire, so for our analysis I only included top-level comments. I lost about 1/3 of the comments due to this, but still had 2051 to work with so no biggie:

In [5]:
# Chain reset_index instead of using inplace=True on a filtered slice
top_level_comments = comments.loc[comments['parent_id'] == comments['link_id']].reset_index(drop=True)
top_level_comments.shape

(2051, 45)
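The filter works because a top-level comment's parent is the submission itself, so its `parent_id` matches its `link_id`, whereas a reply's `parent_id` points at another comment. Here's a toy illustration on raw comment dicts (the kind of records the JSON response contains); the sample comments and the `top_level_only` name are invented for the example.

```python
def top_level_only(comments):
    """Keep comments whose parent is the submission rather than another comment."""
    return [c for c in comments if c["parent_id"] == c["link_id"]]


# Hypothetical comments. Reddit prefixes submission ids with t3_ and comment ids with t1_.
sample = [
    {"id": "a1", "link_id": "t3_iz2b68", "parent_id": "t3_iz2b68", "body": "COYS"},
    {"id": "a2", "link_id": "t3_iz2b68", "parent_id": "t1_a1", "body": "agreed"},
    {"id": "a3", "link_id": "t3_iz2b68", "parent_id": "t3_iz2b68", "body": "What a goal"},
]

print([c["id"] for c in top_level_only(sample)])  # ['a1', 'a3']
```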

Ultimately, what I wanted to do was to run a sentiment analysis of each of these top-level comments, and then look at how the overall sentiment of the comments was changing over time in response to the match events. Now, normally for sentiment analysis you'd want to train a bespoke model on your dataset, as people (surprise) use specialized language to talk about specialized topics. For instance, how is a program supposed to interpret "Shkendija is looking terrible"? In general, "terrible" is a word associated with negative sentiment, but in the context of a pro-Spurs-anti-Shkendija match thread, Shkendija looking terrible is probably best interpreted as a positive sentiment. I thought for about two seconds about building a program that would let me label comments from randomly selected Spurs threads as negative or positive and then using a few thousand of those to train a neural net to rate the positive/negative valence of an arbitrary comment. Then I realized that sounded time-consuming and terrible, and decided to use an off-the-shelf algorithm just to see how it would do. Others have pointed out [problems](http://www.nlp.town/blog/off-the-shelf-sentiment-analysis/) with this approach, but I just figured I'd try things the lazy way first.

After some reading, I decided to use the __[VADER algorithm](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf)__ in the python package [nltk](https://www.nltk.org/). __[This](https://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html)__ writeup is worth reading if you're interested. VADER is far from perfect - it misclassifies things reasonably often, as it was built for general social media text rather than football chatter. But it's general and easy to use, and I figured since I was looking at _overall_ sentiment of comments rather than sentiment of individual comments it was safe to use an algorithm that might be right _most_ of the time, even if it wasn't right _all_ (or even near all) of the time.

The code below loads up the nltk VADER package and computes sentiment scores for each individual top-level comment in our dataset:

In [6]:
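import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; this fetches it once if it isn't already installed
nltk.download('vader_lexicon')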

# Instantiate the VADER sentiment analyser
analyser = SentimentIntensityAnalyzer()

# Compute sentiment scores
snt = pd.DataFrame([analyser.polarity_scores(c) for c in top_level_comments['body']])
snt.shape

(2051, 4)

In [7]:
top_level_comments_sentiment = pd.concat([top_level_comments, snt], axis=1).sort_values(by='datetime_posted')
top_level_comments_sentiment.head()

Unnamed: 0,all_awardings,approved_at_utc,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,top_awarded_type,total_awards_received,treatment_tags,author_cakeday,datetime_posted,match_time,neg,neu,pos,compound
2050,[],,,Mikalov1,,num4,"[{'a': ':finale-04:', 'e': 'emoji', 'u': 'http...",aeb3130c-03ae-11e9-b827-0e38e0aa4b12,:finale-04: Alderweireld,dark,...,,0,[],,2020-09-24 18:03:32,0.0,0.0,0.33,0.67,0.7269
2049,[],,,ComradeStrong,,flair3,[],,,dark,...,,0,[],,2020-09-24 18:03:33,0.0,0.0,1.0,0.0,0.0
2048,[],,,HKane10,#ddbd37,legend,"[{'a': ':legend:', 'e': 'emoji', 'u': 'https:/...",cf7c09d0-0170-11e9-9a89-0ed163ef429c,:legend: Darren Anderton,light,...,,0,[],,2020-09-24 18:03:47,0.0,0.0,1.0,0.0,0.0
2047,[],,,Heor326,,,[],,,,...,,0,[],,2020-09-24 18:03:48,0.0,0.231,0.769,0.0,-0.0516
2046,[],,,TogashiIsIshida,,num10,"[{'a': ':finale-10:', 'e': 'emoji', 'u': 'http...",d651b530-03ae-11e9-b07a-0e7c7e3aa2fa,:finale-10: Kane,dark,...,,0,[],,2020-09-24 18:03:57,0.0,0.0,0.711,0.289,0.431


In this new dataframe, you can see that each comment has a "neg", "neu", "pos" and "compound" score. The first three correspond to how negative, neutral, or positive VADER thinks the valence of a comment is; the compound score is a single normalized summary of the comment's valence ranging from -1 to 1, where higher is more positive and lower is more negative.
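If you want a rough label rather than a number, one common convention (suggested in the VADER project's own documentation, and not something I used in this analysis) is to treat compound scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. A small sketch applied to the five compound scores shown above; `label_compound` is a made-up helper name:

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label using a symmetric threshold."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores from the first five rows of the dataframe above.
compound_scores = [0.7269, 0.0, 0.0, -0.0516, 0.431]
labels = [label_compound(score) for score in compound_scores]

print(labels)  # ['positive', 'neutral', 'neutral', 'negative', 'positive']
```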

To start with, I wanted to look at what the raw comment sentiment looked like over match time. I didn't expect to find much here, as any individual comment taken in isolation doesn't really indicate what the sentiment trend is at a moment in time, but I was interested to see what turned up anyway:

In [9]:
import plotly.express as px

fig = px.line(top_level_comments_sentiment, x = 'match_time', y = 'compound')
fig.show()

OK, that looks terrible, but to be fair I expected it to! To get a better idea of how general sentiment was changing over the course of a match, I decided it made more sense to take a moving average of some sort. Below I take the triangular moving average (basically, think of it as a moving average in which comments closer in time to a given instant are weighted more heavily) and plot it over the original time series:

In [11]:
import plotly.graph_objects as go
import numpy as np

def smoothed_triangle(data, degree):
    """ (Series, int) -> list
    
    This takes a Series of floats/ints and computes the triangular moving average of the series.
    Briefly, TMA is the simple moving average of a simple moving average; you can think of it
    as a moving average where points nearer to the timepoint in question are weighted more
    heavily than those further away in the window. The 'degree' sets the half-width of the
    window: each average uses 2 * degree + 1 consecutive points, with weights rising from 0
    to degree and back down to 0. The result is a plain list padded to the length of the input.
    """
    triangle=np.concatenate((np.arange(degree + 1), np.arange(degree)[::-1])) # up then down
    smoothed=[]

    for i in range(degree, len(data) - degree * 2):
        point=data[i:i + len(triangle)] * triangle
        smoothed.append(np.sum(point)/np.sum(triangle))
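    # Handle boundaries: pad the front with the first average and the back with the last.
    # Note that the front padding places each average about degree/2 positions before the
    # centre of the window it was computed from.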
    smoothed=[smoothed[0]]*int(degree + degree/2) + smoothed
    while len(smoothed) < len(data):
        smoothed.append(smoothed[-1])
    return smoothed

# Plot raw comment sentiment
fig = go.Figure()
fig.update_layout(plot_bgcolor = 'white')
fig.add_trace(go.Scatter(
    x=top_level_comments_sentiment['match_time'],
    y=top_level_comments_sentiment['compound'],
    mode='lines',
    name='Raw Sentiment'
))


# Plot TMA of comment sentiment:
fig.add_trace(go.Scatter(
    x=top_level_comments_sentiment['match_time'],
    y=smoothed_triangle(top_level_comments_sentiment['compound'],
                           50), # window size used for smoothing
    mode='lines',
    marker=dict(
        size=6,
        color='yellow',
        symbol='triangle-up'
    ),
    name='Rolling Average - Window=50'
))
fig.show()
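To see what the triangular weighting is doing, it can help to work through a single window by hand. The sketch below rebuilds the same weight pattern in plain Python for a tiny `degree` and averages one window of made-up compound scores; `triangle_weights` and `weighted_window_average` are illustrative names, not functions from the notebook.

```python
def triangle_weights(degree):
    """Weights 0, 1, ..., degree, ..., 1, 0 (2 * degree + 1 values in total)."""
    return list(range(degree + 1)) + list(range(degree - 1, -1, -1))


def weighted_window_average(window, weights):
    """Weighted mean of one window of scores."""
    return sum(v * w for v, w in zip(window, weights)) / sum(weights)


weights = triangle_weights(2)
window = [0.8, -0.4, 0.6, 0.0, -0.9]  # hypothetical compound scores

print(weights)  # [0, 1, 2, 1, 0]
print(weighted_window_average(window, weights))  # roughly 0.2
```

One thing this makes obvious is that the two outermost points in each window get a weight of zero, so they don't contribute to the average at all.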

We can see that taking the moving average pretty drastically decreases the variability. That's good, in that it is clearer when there's a trend one way or the other in terms of overall sentiment. But does this new plot actually capture anything meaningful about the game?

Long story short, yes! Zooming in on the moving average and annotating with a few major game events, we see clear peaks in general Spurs-fan sentiment when a goal is scored for Spurs and negative sentiment (indicating upset fans) when KFS scores. Interestingly, we also see a pretty huge trough in sentiment when Harry Winks is awarded a yellow card; it's worth noting that at the time of this game, Winks was a bit of a whipping boy for many Spurs fans, which may have contributed to the somewhat precipitous decline.

It's also worth noting that the TMA is maybe not the best metric to use, as it includes information about events that happen _after_ the point of interest as well as those that occur before - that is, if we're trying to assess sentiment at time _t_, our average is equally affected by the sentiment that occurs at time _t_+5 as at _t_-5. In this context, that doesn't make much sense; the sentiment at time _t_ _can't_ actually be affected by sentiment at time _t_ + _n_ because time _t_ + _n_ hasn't occurred yet! We can see the effect of including future sentiment in our average with the Winks Yellow Card trough: our estimate of sentiment begins to plummet well before he is awarded the Yellow Card, likely due to the impending negative comments.
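One way around the look-ahead problem would be a trailing average that only uses the current comment and the ones before it, with weights that rise towards the present. This is a different technique from the one used for the plots in this post, and the function below (`trailing_triangle_average`, a made-up name) is just a sketch of the idea:

```python
def trailing_triangle_average(values, degree):
    """Average each point with up to `degree` earlier points, weighting recent ones more.

    Weights rise linearly so the current point gets weight degree + 1 and the
    oldest point in a full window gets weight 1. No later points are ever used.
    """
    values = list(values)
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - degree)
        window = values[start:i + 1]
        # Near the start the window is shorter, so use only the largest weights.
        weights = range(degree + 2 - len(window), degree + 2)
        total = sum(v * w for v, w in zip(window, weights))
        smoothed.append(total / sum(weights))
    return smoothed


# Hypothetical scores with a single spike at index 3.
scores = [0.0, 0.0, 0.0, 1.0, 0.0]
print(trailing_triangle_average(scores, 2))
```

Because each output only depends on the present and the past, the entries before the spike stay at zero; the average can only react once the spike has actually happened.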

<img src="spurs_KFS_game.png">

Another interesting thing that we see here is that sentiment is generally positive. This particular game went well for Spurs so maybe that isn't too surprising. But even at the lowest point in the game, with the score all tied up right after halftime, sentiment barely drops below zero. While this is admittedly just one game, it does make me question the assumption I see expressed sometimes that Spurs match-threads are generally a negative, hostile environment.

In [13]:
print("Average sentiment = " + str(top_level_comments_sentiment.compound.mean()))
print("Sentiment SD = " + str(top_level_comments_sentiment.compound.std()))

Average sentiment = 0.10663924914675761
Sentiment SD = 0.4226588659672282


It's also worth noting that we probably can't just look at raw sentiment averages to see how well a match went. This match went pretty well, but the average overall sentiment really doesn't differ that much from 0. Guess that means I'll still have to dive into the details of the games. What a bummer.

More seriously, this is just one example. It's worth asking - how well does this approach generalize? Does VADER succeed in tracking sentiment/game quality over time in other matches?

Here's the analysis for a tie game against Wolverhampton on December 27th. 
<img src='spurs_wolves_game.png'>
Here we see positive and negative peaks for the goals again, though Winks' Yellow Card doesn't seem to get the negative reaction it did in the earlier game. Perhaps this is due to the halftime bump, where fans seem to be more positive in the match thread during halftime; it's hard to know for sure. In this match though we see a lot more variation and peakiness. I didn't watch the match highlights for this one, so I'm not sure what all of the peaks correspond to. However, it's telling that sentiment declines over time: Spurs were forecast to win this game, and they didn't really show up the way they needed to.

What about a game that didn't go so well? Here's the analysis for the recent 2-1 loss to Liverpool:
<img src="spurs_liverpool_game.png">
Again, we see peakiness for the goals. I delved a bit deeper into the highlights of this game, and we can see that some of the non-goal peaks correspond to chances. In particular, we can see a positive peak for Bergwijn's chance (that was tragically blocked by the post) and then a negative peak afterward in response to the missed shot. Overall though there's a nice give-and-take for the sentiment here, which I think corresponds to how the game felt watching it. Spurs played good football against a juggernaut of an opponent in this match, and there were many moments where it felt like things could go either way.

Overall, I think that the VADER analysis worked far better than I expected it to. It's noisy, sure, but it manages to capture the sentiment of the thread in a way that seems to correspond to real events. More importantly, it seems like the actual sentiment in the thread _does_ in fact reflect real events, so if you are like me and find yourself watching the match thread instead of the game... well, it's still really not the same, but the basic emotions you'll witness and experience may be similar.

__TL;DR: People trashing the match thread in the match thread. Here you fucking go. This is the match thread. No streaming links and no insightful commentary, but when you look at the overall sentiment, it does broadly track with the actual events in the match.__

TODOs:
If you've made it this far and are interested in messing about with this yourself, there are a few easy improvements I'd love to collab on:
1. Cleaning up the data. I didn't bother accounting for halftime, breaks, and extra time in the match time, so all the match times are slightly off what you'll find in a write up of the game. There are some clever ways to fix this based on the comments themselves, but I didn't bother with implementing them.
2. Changing the TMA algorithm so it only takes into account comments from the past/present moment being evaluated, not ones from _after_ the moment being evaluated (think of this as a "right-triangle moving average").
3. Figure out how to snag reddit comments in real time, so you can track sentiment of a match continuously while it's underway.
4. Implement this all into a dashboard so that people can analyze any ol' thread without having to muck about with code.
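As a starting point for the first item, here's a deliberately crude sketch that converts minutes since the thread opened into an approximate match-clock minute. It assumes a fixed kick-off offset, 45-minute halves and a 15-minute halftime, all of which are assumptions you'd want to tune per game; it is not the comment-based approach mentioned above, and `thread_minute_to_match_clock` is a made-up name.

```python
def thread_minute_to_match_clock(thread_minute, kickoff_offset=0,
                                 first_half_stoppage=0, halftime=15):
    """Approximate match-clock minute, or None before kick-off and during halftime.

    All parameters are assumptions: kickoff_offset is how long after the thread
    opened the game kicked off, and first_half_stoppage is added time in minutes.
    """
    elapsed = thread_minute - kickoff_offset
    if elapsed < 0:
        return None  # thread is open but the game hasn't started
    first_half_end = 45 + first_half_stoppage
    if elapsed <= first_half_end:
        return min(elapsed, 45)  # stoppage time is reported as minute 45
    if elapsed < first_half_end + halftime:
        return None  # halftime break
    return 45 + (elapsed - first_half_end - halftime)


print(thread_minute_to_match_clock(30))  # 30
print(thread_minute_to_match_clock(50))  # None (halftime)
print(thread_minute_to_match_clock(70))  # 55
```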

I'll be working on some/all of these at some point, but I'm not in any hurry, so if you beat me to it let me know; I'd love to see what you do! And as always, feel free to use any of the code here, just drop me a link if you do.