# Predicting Reddit thread participation for NBA Playoff games

My goal for this project was to develop models that predict whether a comment was posted by a fan whose team was playing in an NBA playoff game or by a fan of a non-participating team.

I collected comment data from the NBA subreddit's 'Official Game Threads', which are posted before every scheduled game and let users discuss events during the game as well as its result. I labeled each comment by its poster's affiliation and then developed several classification models that use the text content to predict that affiliation.

### Setting up PRAW and PushShiftAPI

Because the native Reddit API no longer allows querying of historical Reddit threads, I used the Pushshift API through its Python wrapper (psaw) to find r/NBA's Official Game Threads for the 2017-2018 NBA playoffs, and PRAW to gather the comments from those threads.

In [1]:
import pandas as pd
import praw
from psaw import PushshiftAPI
import datetime as dt

reddit=praw.Reddit(client_id='VsUMbnhDFXVWDQ',
                  client_secret='xf5IG_dfsIv94Aq2YqID7x6cnGw',
                  user_agent='tester')

api = PushshiftAPI()

### Get the urls for each game thread

Since the string 'Game Thread' is not unique to the official game threads (it is also used in postgame threads and non-game events such as the NBA draft or draft lottery), I used the string 'You must click this link' to identify them.

The NBA playoffs began on 4/14/2018, so I set my scrape to only look at game threads posted since then:

In [2]:
import datetime as dt

start_time=int(dt.datetime(2018, 4, 14).timestamp())
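
One caveat with this conversion: calling `.timestamp()` on a naive `datetime` interprets it in the local time zone of the machine, so the cutoff can shift by several hours depending on where the notebook runs. Below is a minimal sketch of a time-zone-independent alternative, assuming midnight UTC is an acceptable cutoff. The `utc_epoch` helper is a hypothetical name and not part of the original notebook.

```python
import datetime as dt

def utc_epoch(year, month, day):
    """Return the Unix timestamp (whole seconds) for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())

# Same cutoff date as above, but pinned to UTC rather than local time
start_time_utc = utc_epoch(2018, 4, 14)
print(start_time_utc)  # 1523664000
```

The resulting integer could be passed as `after=` in the same way as `start_time`.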

In [3]:
gamethreads = list(api.search_submissions(q='You must click this link',
                                  after=start_time,
                                  subreddit='NBA',
                                  filter=['author', 'url', 'title'],
                                  limit=120))

In [4]:
# Collect the URL of each game thread.
# Note: the slice starts at index 1, so the first search result is skipped.
url_list = [thread.url for thread in gamethreads[1:]]

### Scraping Comment Info

In [5]:
sub_list = []
for i in url_list:
    sub_list.append(reddit.submission(url=i))

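The `replace_more` method resolves the `MoreComments` placeholders (the "load more comments" links) in a thread's comment tree. Its `limit` argument caps how many placeholders are replaced, with each replacement costing one API request: `limit=1` expands a single placeholder per thread and drops the rest, while `limit=None` expands all of them so that PRAW can retrieve *every* comment in a thread. Because the loop iterates over `i.comments` directly, only top-level comments are collected.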

In [6]:
%%time
comment_list = []
for i in sub_list:
    i.comments.replace_more(limit=1)
    for comment in i.comments:
        comment_list.append({'ups': comment.ups,
                         'affiliation': comment.author_flair_css_class,
                         'thread_id': comment.parent_id,
                         'when_comment_posted': comment.created_utc,
                         'text': comment.body})

Wall time: 6min 38s
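
The loop above iterates over the top level of each comment tree, so replies are not collected. To make the difference between top-level comments and the full tree concrete, here is a toy model in plain Python. It is not PRAW itself: the `Node` class and `flatten` helper are hypothetical stand-ins, and the sample thread is invented for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical stand-in for a comment: a body plus its replies."""
    body: str
    replies: List["Node"] = field(default_factory=list)

def flatten(top_level):
    """Return every comment in the tree, visiting it breadth-first."""
    queue = deque(top_level)
    ordered = []
    while queue:
        node = queue.popleft()
        ordered.append(node)
        queue.extend(node.replies)  # visit replies after the current level
    return ordered

# Invented example thread: two top-level comments, one with nested replies
thread = [
    Node("first top-level", [Node("reply", [Node("nested reply")])]),
    Node("second top-level"),
]

top_level_bodies = [node.body for node in thread]
all_bodies = [node.body for node in flatten(thread)]
print(len(top_level_bodies), len(all_bodies))  # 2 4
```

In PRAW, the analogous call is `submission.comments.list()`, which returns the flattened tree. If replies were included, `comment.parent_id` would point at the parent comment rather than the thread, so `comment.link_id` would be the safer source for a thread identifier.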


In [7]:
len(comment_list)

43998

In [8]:
df = pd.DataFrame(comment_list)

In [9]:
df.head()

| | affiliation | text | thread_id | ups | when_comment_posted |
|---|---|---|---|---|---|
| 0 | Lakers1 | Washington a fool for attacking Canada in the ... | t3_8ca8gj | 132 | 1523745000.0 |
| 1 | Raptors2 | Fun fact: CJ Miles has more 3s in this quarter... | t3_8ca8gj | 82 | 1523745000.0 |
| 2 | Bucks2 | Casey uses Bebe like a fucking Yu-Gi-Oh trap c... | t3_8ca8gj | 78 | 1523749000.0 |
| 3 | Suns3 | Ian Mahinmi is one of those players that you h... | t3_8ca8gj | 73 | 1523743000.0 |
| 4 | Celtics1 | Holy fuck the ACC is loud godamn | t3_8ca8gj | 72 | 1523743000.0 |
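
The `affiliation` values in the sample rows are flair classes such as `Lakers1` and `Raptors2` rather than bare team names. Below is a minimal sketch of how they might be reduced to a team label during preprocessing. It assumes that the trailing digits only distinguish flair variants of the same team, which the data shown here does not confirm, and `team_from_flair` is a hypothetical helper name.

```python
import re
from typing import Optional

def team_from_flair(flair: Optional[str]) -> Optional[str]:
    """Strip a trailing run of digits from a flair class, e.g. 'Lakers1' -> 'Lakers'."""
    if not flair:
        return None  # missing or empty flair: no affiliation to report
    # Assumption: trailing digits are a variant suffix, not part of the team name
    return re.sub(r"\d+$", "", flair)

sample = ["Lakers1", "Raptors2", "Bucks2", "Suns3", "Celtics1", None]
print([team_from_flair(f) for f in sample])
# ['Lakers', 'Raptors', 'Bucks', 'Suns', 'Celtics', None]
```

Applied to the DataFrame, this could look like `df['affiliation'].map(team_from_flair)`. Only trailing digits are removed, so a name that begins with digits keeps them.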


#### Exporting the data to CSV for preprocessing

In [10]:
df.to_csv('../data/nbacomments_raw.csv', index=False)