## Colts Subreddit Predictive Analysis using Multiple Linear Regression

### Background 
I am an active member of Reddit. I often turn to Reddit for recommendations on travel, food, breweries, music, etc. As an Indianapolis Colts fan, one of my guilty pleasures is frequenting the Colts subreddit during the season and talking about the team with other fans.

A few months ago I was scrolling through Twitter and saw an article in which somebody computed the most-typed curse word in every NFL team's subreddit. While this information has no business or personal value, it was funny at the time and gave me an idea. I wanted to look into which Colts players are talked about the most in the Colts subreddit. Furthermore, I wanted to see if there was a correlation between their play on the field and how often they are mentioned on the subreddit. Lastly, if there was such a correlation, I wanted to build a model to predict how often a player will be mentioned based on their performance in a game. **For this project, I worked with only the defensive players on the Colts.**

### Method
1. Built datasets consisting of the defensive statistics of each player on the defense. This was done for Weeks 1-15 of the 2019 regular season. The statistics for each game/player were pulled from pff.com (Pro Football Focus).

2. Used the Python Reddit API Wrapper (PRAW) to connect to the Colts subreddit and extract the user comments from each game's **Game Day Thread and Post Game Day Thread**.

3. Computed how many times each player was talked about in the Game Day and Post Game Day threads combined, then merged those counts with the defensive statistics dataset for each week.

4. Once the Week 1-15 data was merged into one dataframe, I analyzed the correlation between the defensive statistics features and the Reddit counts. I ran sensitivities on which features to include in the model, aiming to reduce the root mean squared error and improve the R-squared.

5. After settling on a regression model, I was able to predict Reddit counts for individual players based on their statistical performance in a game, and to run sensitivities with the model.

6. Lagniappe - performed data analysis on position groups

### Data
1. The defensive statistics were pulled from pff.com (Pro Football Focus). The defensive statistics I used for each player were as follows: Total Snaps, Total Pressures, Sacks, Hurries, Total Tackles, Missed Tackles, Stops, Forced Fumbles, Interceptions, Pass Breakups, Penalties, Yards per Reception, and Receptions. I wanted to make sure to get a mixture of positive statistics (e.g. Sacks, Interceptions) and negative statistics (e.g. Missed Tackles, Penalties).

2. The user comments were extracted from the Game Thread and Post Game Thread for each game in the Colts subreddit. I chose those threads because they contain the most comments and the comments are very focused on how players are playing in that game. In other words, I know that if a player is being talked about in one of those threads, it is likely due to something they did on the field in the game, whether good or bad.




Okay, let's get to it!!

In [28]:
#import necessary libraries
import praw
import pandas as pd
import datetime as dt

### Connecting to Reddit and Extract Comment Information

In [29]:
#reddit info (real credentials redacted; uncomment and fill in your own values before running)
#client_secret = '3333'
#client_id = '54443'
#user_agent = 'dfwwefw'
#username = 'ddddddddddddd'
#pw = 'hhhh'

In [2]:
#First, connect to reddit by calling praw.Reddit and store as variable

reddit = praw.Reddit(client_id = client_id, client_secret = client_secret, user_agent = user_agent, username = username, password = pw)

In [30]:
#specify which subreddit we want to look at, in this case -- Colts Subreddit
subreddit = reddit.subreddit('Colts')
#Search the Colts subreddit for game threads; the query 'Game Thread' also matches 'Post-Game Thread' titles
search_subreddit = subreddit.search('Game Thread')

In [4]:
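#dictionary that will hold the title, score, id, link, number of comments, create date, and body of each post
topics_dict = {"title": [],
               "score": [],
               "id": [],
               "url": [],
               "comms_num": [],
               "created": [],
               "body": []}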

In [31]:
#fill the dictionary with the title, score, id, link, number of comments, create date, and body of each post found
for submission in search_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)

In [32]:
#Creates dataframe of above dictionary and stores as 'topics_data'
topics_data = pd.DataFrame(topics_dict)

In [33]:
#this function and following code converts the "created" time column into a timestamp, which is more user friendly
def get_date(created):
    return dt.datetime.fromtimestamp(created)

_timestamp = topics_data["created"].apply(get_date)
topics_data = topics_data.assign(timestamp = _timestamp)
topics_data.sort_values(by=['timestamp'],ascending = False)

| Unnamed: 0 | title | score | id | url | comms_num | created | body | timestamp |
|---|---|---|---|---|---|---|---|---|
| 107 | Post-Game Thread: Indianapolis Colts (6-7) at ... | 59 | e7zn5a | https://www.reddit.com/r/Colts/comments/e7zn5a... | 392 | 1.575869e+09 | | 2019-12-08 23:18:19 |
| 6 | Post-Game Thread: Indianapolis Colts (6-7) at ... | 52 | e7zn5a | https://www.reddit.com/r/Colts/comments/e7zn5a... | 385 | 1.575869e+09 | | 2019-12-08 23:18:19 |
| 101 | Game Thread: Indianapolis Colts (6-6) at Tampa... | 60 | e6pppi | https://www.reddit.com/r/Colts/comments/e6pppi... | 3151 | 1.575855e+09 | | 2019-12-08 19:30:14 |
| 1 | Game Thread: Indianapolis Colts (6-6) at Tampa... | 58 | e6pppi | https://www.reddit.com/r/Colts/comments/e6pppi... | 3114 | 1.575855e+09 | | 2019-12-08 19:30:14 |
| 7 | Post-Game Thread: Indianapolis Colts (6-6) @ T... | 55 | e4npfo | https://www.reddit.com/r/Colts/comments/e4npfo... | 447 | 1.575263e+09 | | 2019-12-01 23:04:27 |
| 108 | Post-Game Thread: Indianapolis Colts (6-6) @ T... | 58 | e4npfo | https://www.reddit.com/r/Colts/comments/e4npfo... | 447 | 1.575263e+09 | | 2019-12-01 23:04:27 |
| 0 | Game Thread: Tennessee Titans (6-5) at Indiana... | 55 | e4jbhn | https://www.reddit.com/r/Colts/comments/e4jbhn... | 3842 | 1.575245e+09 | LET'S GO | 2019-12-01 18:10:55 |
| 100 | Game Thread: Tennessee Titans (6-5) at Indiana... | 55 | e4jbhn | https://www.reddit.com/r/Colts/comments/e4jbhn... | 3841 | 1.575245e+09 | LET'S GO | 2019-12-01 18:10:55 |
| 24 | Game Thread: Tennessee Titans (6-5) at Indiana... | 28 | e4i4ul | https://www.reddit.com/r/Colts/comments/e4i4ul... | 25 | 1.575240e+09 | WHITE OUT! | 2019-12-01 16:33:06 |
| 124 | Game Thread: Tennessee Titans (6-5) at Indiana... | 27 | e4i4ul | https://www.reddit.com/r/Colts/comments/e4i4ul... | 25 | 1.575240e+09 | WHITE OUT! | 2019-12-01 16:33:06 |
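
One caveat about `get_date` above: `datetime.fromtimestamp` without a `tz` argument converts to the local time zone of whatever machine runs the notebook, so the timestamps shown depend on where the code was run. A minimal sketch of a time-zone-aware variant follows. The name `get_date_utc` is hypothetical and not part of the original notebook, and the epoch value in the example is made up.

```python
import datetime as dt

def get_date_utc(created):
    """Convert a Unix epoch (in seconds) into a time-zone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)

# Example with a made-up epoch value (not taken from the thread data above).
example = get_date_utc(0)
print(example.isoformat())  # 1970-01-01T00:00:00+00:00
```

Using UTC everywhere would make the `timestamp` column reproducible across machines, at the cost of no longer matching local kickoff times.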


### Game Thread Info

Now it's time to navigate to the Game Thread for Colts vs Tampa Bay Buccaneers and **obtain the id**

In [34]:
#the id for the game day thread is 'e6pppi'. 
game_submission = reddit.submission(id = 'e6pppi')
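
The id above was read off the table by hand. As a rough sketch, the same lookup could be done in code by de-duplicating on `id` and splitting on the title prefix. The helper name `split_threads` and the plain-dict row format are assumptions made for illustration, and the titles are kept truncated exactly as they appear in the table.

```python
def split_threads(rows):
    """Group unique submissions into game-day and post-game thread ids by title prefix."""
    seen = set()
    game, post = [], []
    for row in rows:
        if row["id"] in seen:  # the table above lists each thread twice
            continue
        seen.add(row["id"])
        if row["title"].startswith("Post-Game Thread"):
            post.append(row["id"])
        elif row["title"].startswith("Game Thread"):
            game.append(row["id"])
    return game, post

rows = [
    {"title": "Post-Game Thread: Indianapolis Colts (6-7) at ...", "id": "e7zn5a"},
    {"title": "Post-Game Thread: Indianapolis Colts (6-7) at ...", "id": "e7zn5a"},
    {"title": "Game Thread: Indianapolis Colts (6-6) at Tampa...", "id": "e6pppi"},
    {"title": "Game Thread: Indianapolis Colts (6-6) at Tampa...", "id": "e6pppi"},
]
game_ids, post_ids = split_threads(rows)
print(game_ids, post_ids)  # ['e6pppi'] ['e7zn5a']
```

When a game has more than one thread with the same prefix (as with `e4jbhn` and `e4i4ul` in the table), picking the one with the larger `comms_num` is probably the safer choice.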

Now that the Game Day Thread is saved to the `game_submission` variable, let's extract the user comments from it and store them as a list of strings (one per word).

In [10]:
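#load every comment, including those hidden behind "load more comments" links
game_submission.comments.replace_more(limit=None)
game_text = []
for comment in game_submission.comments.list():
    #split each comment on spaces and collect the pieces
    game_text.extend(comment.body.split(" "))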

In [39]:
#clean the results
game_clean = [i.strip('\n') for i in game_text]

game_clean = [i.replace('\n', "").replace('!','').replace('?','').replace('.','').replace(',','') for i in game_clean]

#make all elements lower case
game_clean = [x.lower() for x in game_clean]

#how many words were extracted from the Game day thread?
len(game_clean)

35061

In [38]:
#Example, first twenty words extracted from the Game Day Thread user comments
print(game_clean[0:20])

['bucs', "didn't", 'respond', 'to', 'the', 'sidebar', 'bet', 'a', 'deep', 'bomb', 'followed', 'by', 'an', 'extra', 'point', 'going', 'right', 'down', 'the', 'middle']
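
One thing to be aware of: because the comments are split on single spaces only, two words separated by a line break stay in one token, and removing `\n` afterwards glues them together (`'hooker\nleonard'` becomes `'hookerleonard'`). A regex-based tokenizer avoids this. The sketch below is an alternative, not the method used for the counts reported here, so it would give slightly different totals. `tokenize` and `TOKEN_RE` are hypothetical names.

```python
import re

# Runs of lowercase letters and digits, optionally joined by internal apostrophes or hyphens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase a comment and return its word tokens, dropping punctuation and whitespace."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("Hooker!\nLeonard didn't miss, Ya-Sin did."))
# ['hooker', 'leonard', "didn't", 'miss', 'ya-sin', 'did']
```

Keeping internal hyphens means a name part such as `ya-sin` still comes out as a single token.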


### Post Game Thread Info

Repeat the same process as above but for the Post Game Thread!

In [12]:
post_submission = reddit.submission(id = 'e7zn5a')

In [13]:
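#load every comment, including those hidden behind "load more comments" links
post_submission.comments.replace_more(limit=None)
post_text = []
for comment in post_submission.comments.list():
    #split each comment on spaces and collect the pieces
    post_text.extend(comment.body.split(" "))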

In [40]:
#clean the results
post_clean = [i.strip('\n') for i in post_text]

post_clean = [i.replace('\n', "").replace('!','').replace('?','').replace('.','').replace(',','') for i in post_clean]

#make all elements lower case
post_clean = [x.lower() for x in post_clean]

#How many words were extracted from the Post Game Thread?
len(post_clean)

8462

### Merging Game thread and Post game Info

In [15]:
merged_list = game_clean + post_clean

In [16]:
len(merged_list)

43523

### Import Game Stats

In [17]:
week14file = r'C:\Users\ddudley\Documents\Week14.xlsx'

In [93]:
#Create dataframe of Week 14 player stats and view results of first 5 players
#it is critical that the order of the players never change!
week14_df = pd.read_excel(week14file)
week14_df.head()

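| Unnamed: 0 | Player | T_Snaps | T_pressure | Sacks | Hurries | T_Tackles | Missed_Tackles | Stop | Forced_Fumble | Int | Pass_Breakup | Penalty | y_per_rec | rec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Anthony Walker | 62 | 1 | 0 | 0 | 3 | 3 | 2 | 0 | 0 | 0 | 0 | 15.5 | 4 |
| 1 | EJ Speed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 |
| 2 | Bobby Okereke | 27 | 1 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 |
| 3 | Darius Leonard | 78 | 3 | 0 | 3 | 6 | 1 | 4 | 0 | 2 | 0 | 0 | 7.2 | 5 |
| 4 | Zaire Franklin | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 |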


In [94]:
#Create an empty column which is where the reddit counts will go
week14_df['red_count'] = ""

In [95]:
#get list of players
playerlist = week14_df['Player'].to_list()

In [96]:
counts_dict_week14 = {}
for player in playerlist:
    #split the player's name and lowercase it, so a mention of either the first or the last name is recognized
    lowered = player.lower().split(" ")
    #count the number of times the player's first or last name is mentioned and sum the results
    counts_dict_week14[player] = merged_list.count(lowered[0]) + merged_list.count(lowered[1])
print(counts_dict_week14)

{'Anthony Walker': 2, 'EJ Speed': 6, 'Bobby Okereke': 7, 'Darius Leonard': 116, 'Zaire Franklin': 0, 'Matthew Adams': 0, 'Denico Autry': 3, 'Ben Banogu': 0, 'Justin Houston': 38, 'Tyquan Lewis': 0, 'Al-Quadin Muhammad': 1, 'Jabaal Sheard': 4, 'Trevon Coley': 0, 'Carl Davis': 0, 'Margus Hunt': 0, 'Grover Stewart': 1, 'Kenny Moore': 14, 'Shakial Taylor': 1, 'Marvell Tell': 15, 'Quincy Wilson': 14, 'Rock Ya-Sin': 9, 'Clayton Geathers': 3, 'Malik Hooker': 88, 'Rolan Milligan': 0, 'George Odum': 12, 'Khari Willis': 3, 'Kemoko Turay': 2, 'Pierre Desir': 19}


In [97]:
#creates list of the counts
countlist = list(counts_dict_week14.values())
print(countlist)

[2, 6, 7, 116, 0, 0, 3, 0, 38, 0, 1, 4, 0, 0, 0, 1, 14, 1, 15, 14, 9, 3, 88, 0, 12, 3, 2, 19]


In [98]:
#enter into the red_count column in dataframe
week14_df['red_count'] = countlist
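
Calling `merged_list.count(...)` twice per player rescans the whole word list each time. A possible alternative, sketched below under the assumption that summing over all space-separated name parts is acceptable (it matches the first-plus-last logic for the two-part names in the list above), is to tally all words once with `collections.Counter`. The function name `count_mentions` and the toy word list are illustrative only.

```python
from collections import Counter

def count_mentions(words, players):
    """Return {player: mentions}, summing the counts of each lowercase name part."""
    tally = Counter(words)  # one pass over the word list
    return {
        player: sum(tally[part] for part in player.lower().split(" "))
        for player in players
    }

# Toy word list, not real thread data.
words = ["leonard", "with", "the", "pick", "darius", "hooker", "leonard"]
print(count_mentions(words, ["Darius Leonard", "Malik Hooker", "EJ Speed"]))
# {'Darius Leonard': 3, 'Malik Hooker': 1, 'EJ Speed': 0}
```

Because the result is keyed by player name, it can be matched back to the dataframe by name rather than by row order.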

### Editing Marvell Tell and Justin Houston Manually

I have to hard-code the count for Marvell Tell because of his last name. The word 'tell' is used quite often, so I will search for Marvell Tell using just his first name. Justin Houston's last name also complicates things for this project. The Houston Texans are in the Colts' division, and Colts fans love to bring them up in the subreddit because they are leading the division. So I will hard-code his result by searching for just his first name.

In [100]:
merged_list.count('marvell')

0

In [101]:
week14_df.at[18,'red_count'] = 0

In [103]:
merged_list.count('justin')

5

In [104]:
week14_df.at[8,'red_count'] = 5
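
The two manual fixes above rely on row labels 18 and 8, which is one reason the player order must never change. A sketch of a name-based alternative follows: players whose last name is ambiguous are counted by first name only. `FIRST_NAME_ONLY` and `count_player` are hypothetical names, the word list is a toy example, and the helper assumes each name has exactly two parts.

```python
# Players whose last name collides with a common word or with another team's name.
FIRST_NAME_ONLY = {"Marvell Tell", "Justin Houston"}

def count_player(words, player):
    """Count mentions of a player, using only the first name when the last name is ambiguous."""
    first, last = player.lower().split(" ")  # assumes exactly two name parts
    if player in FIRST_NAME_ONLY:
        return words.count(first)
    return words.count(first) + words.count(last)

# Toy word list, not real thread data.
words = ["houston", "is", "leading", "the", "division",
         "justin", "got", "a", "sack", "tell", "me"]
print(count_player(words, "Justin Houston"))  # 1
print(count_player(words, "Marvell Tell"))    # 0
```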

### View Results

In [106]:
week14_df.sort_values(by = 'red_count',ascending = False)

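| Unnamed: 0 | Player | T_Snaps | T_pressure | Sacks | Hurries | T_Tackles | Missed_Tackles | Stop | Forced_Fumble | Int | Pass_Breakup | Penalty | y_per_rec | rec | red_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Darius Leonard | 78 | 3 | 0 | 3 | 6 | 1 | 4 | 0 | 2 | 0 | 0 | 7.2 | 5 | 116 |
| 22 | Malik Hooker | 78 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 1 | 0 | 0 | 14.0 | 4 | 88 |
| 27 | Pierre Desir | 78 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 1 | 0 | 25.7 | 3 | 19 |
| 16 | Kenny Moore | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 14 |
| 19 | Quincy Wilson | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12.0 | 1 | 14 |
| 24 | George Odum | 12 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 16.0 | 4 | 12 |
| 20 | Rock Ya-Sin | 75 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 20.5 | 2 | 9 |
| 2 | Bobby Okereke | 27 | 1 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0 | 7 |
| 1 | EJ Speed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 6 |
| 8 | Justin Houston | 45 | 3 | 1 | 2 | 2 | 0 | 3 | 0 | 0 | 0 | 1 | 13.0 | 1 | 5 |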
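
As a quick preview of the correlation step described in the Method section, the sketch below computes a Pearson correlation between interceptions (`Int`) and `red_count` using only the ten rows displayed above. This is an illustration on a small, top-sorted slice of one week, not the analysis itself, and the helper `pearson` is written out by hand rather than taken from any library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Int and red_count for the ten Week 14 rows shown above, in the same order.
interceptions = [2, 1, 0, 0, 0, 0, 0, 0, 0, 0]
red_count = [116, 88, 19, 14, 14, 12, 9, 7, 6, 5]
print(round(pearson(interceptions, red_count), 2))
```

On these ten rows the value comes out close to 1, driven almost entirely by Leonard and Hooker. The full Weeks 1-15 data may well behave differently.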


In [107]:
#Save to my computer and use in Part 2
week14_df.to_csv(r'C:\Users\ddudley\Documents\UpdatedRed\week14_upd.csv')

### This exact workflow was done for each NFL Week!
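
Since every week repeats the same steps, the per-week work could be wrapped in a loop driven by a small lookup table. The sketch below is one possible arrangement, not the code actually used: `WEEK_THREADS`, `run_week`, and the `fetch_words` callback (standing in for the PRAW download and cleaning steps) are all hypothetical, and only the Week 14 ids come from this notebook.

```python
# Week number -> (game thread id, post-game thread id). Only Week 14 is filled in here.
WEEK_THREADS = {14: ("e6pppi", "e7zn5a")}

def run_week(week, players, fetch_words):
    """Return {player: mentions} for one week, combining the words of both threads."""
    game_id, post_id = WEEK_THREADS[week]
    words = fetch_words(game_id) + fetch_words(post_id)
    counts = {}
    for player in players:
        first, last = player.lower().split(" ")  # assumes exactly two name parts
        counts[player] = words.count(first) + words.count(last)
    return counts

# Stand-in for the real download step so the sketch runs offline (toy words only).
def fake_fetch(thread_id):
    return {"e6pppi": ["leonard", "pick"], "e7zn5a": ["darius", "leonard"]}[thread_id]

print(run_week(14, ["Darius Leonard", "Malik Hooker"], fake_fetch))
# {'Darius Leonard': 3, 'Malik Hooker': 0}
```

Adding a week would then mean adding one entry to the lookup table instead of copying the notebook.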