# Bucci Overtime Challenge

When a professional (and sometimes college) hockey game is tied after regulation and goes to overtime, Buccigross posts a tweet from his Twitter account (@buccigross) using the hashtag \#bucciovertimechallenge.

To participate, tweet #bucciovertimechallenge followed by two hockey players (one from each team) who you think are most likely to score the game-winning goal for their respective teams in overtime.

Winners are chosen at random from the pool of participants who selected the correct player. Winners are sent a free t-shirt or another item from Buccigross' [site](https://www.bucciot.com).

**Idea**: This challenge sets up a [wisdom of the crowds](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd) scenario: the collective wisdom of hockey fans might be better at predicting the winner of a hockey game than a single professional. I am going to investigate this idea in this notebook.

In [None]:
import json
import requests
import pandas as pd
import base64
from tqdm.notebook import tqdm  # replaces the deprecated tqdm.tqdm_notebook
from collections import Counter
from datetime import datetime
import time
import glob

- http://benalexkeen.com/interacting-with-the-twitter-api-using-python/
- https://stackoverflow.com/questions/33308634/how-to-perform-oauth-when-doing-twitter-scraping-with-python-requests
- https://www.kevinsidwar.com/iot/2017/7/1/the-undocumented-nhl-stats-api

# Data Aggregation

In [None]:
def get_twitter_api_token(api_key, api_secret):
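    """
    Authenticates with the Twitter API using application-only (OAuth 2.0) auth.

    :param api_key: consumer API key of the Twitter app
    :param api_secret: consumer API secret of the Twitter app
    :return: bearer token to use with API requests
    """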
    key_secret = '{}:{}'.format(api_key, api_secret).encode('ascii')
    b64_encoded_key = base64.b64encode(key_secret).decode('ascii')

    base_url = 'https://api.twitter.com'
    auth_url = '{}/oauth2/token'.format(base_url)

    auth_headers = {
        'Authorization': 'Basic {}'.format(b64_encoded_key),
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'
    }

    auth_data = {
        'grant_type': 'client_credentials'
    }

    auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
    auth_resp.raise_for_status()
    return auth_resp.json()['access_token']

In [None]:
with open('config/twitter_creds.json') as f:
    creds = json.load(f)

base_url = 'https://api.twitter.com'
auth_token = get_twitter_api_token(creds["API_KEY"], creds["API_SECRET"])

search_headers = {
    'Authorization': 'Bearer {}'.format(auth_token)    
}
search_params = {
    'q': '#bucciovertimechallenge',
    'result_type': 'recent',
    'count' : 100
}

search_url = '{}/1.1/search/tweets.json'.format(base_url)

aggregated_results = []

# The first request uses search_params; later pages use the 'next_results'
# query string that the API returns in 'search_metadata'.
url, params = search_url, search_params
count = 0
while url is not None:
    search_resp = requests.get(url, headers=search_headers, params=params)
    search_resp.raise_for_status()
    count += 1
    if count % 10 == 0:
        # brief pause every 10 requests, plus a progress printout
        time.sleep(0.5)
        print(count)
    payload = search_resp.json()
    for s in payload['statuses']:
        aggregated_results.append({
            "screen_name" : s.get('user').get('screen_name'),
            "created_at" : s.get('created_at'),
            "text" : s.get('text')
        })
    next_results = payload['search_metadata'].get('next_results')
    # next_results already starts with '?', so append it directly to the search URL
    url = '{}{}'.format(search_url, next_results) if next_results is not None else None
    params = None

print('total requests: {}'.format(count))
print('total tweets retrieved: {}'.format(len(aggregated_results)))

# need to get around rate limits somehow
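
One possible way to handle the rate limit is to read the rate-limit headers that the v1.1 API returns with each response (`x-rate-limit-remaining` and `x-rate-limit-reset`) and sleep until the window resets. This is a minimal sketch, assuming those headers are present on the search responses; `seconds_to_wait` is a hypothetical helper and not part of the original notebook.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how many seconds to sleep before the next request (0 if none).

    `headers` is a mapping such as `response.headers` from requests.
    Assumes 'x-rate-limit-remaining' (requests left in the window) and
    'x-rate-limit-reset' (epoch seconds when the window resets) are present;
    if they are missing, no wait is applied.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get('x-rate-limit-remaining', 1))
    reset_at = int(headers.get('x-rate-limit-reset', 0))
    if remaining > 0:
        return 0.0
    # Window exhausted: wait until the reset time, plus one second of slack
    return max(0.0, reset_at - now) + 1.0
```

Inside the paging loop, this could be used as `time.sleep(seconds_to_wait(search_resp.headers))` after each request, in place of the fixed half-second pause.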

In [None]:
now = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
with open('data/raw/tweets-{}.json'.format(now), 'w') as outfile:
    json.dump(aggregated_results, outfile, indent=4)

# Data Cleaning

- figure out when (time of day) the game-winning goal (GWG) was scored, and by whom, from the NHL API
- filter out any tweet posted after the GWG was scored
- use Python's `collections.Counter` to count all the words in all the tweets, then pull out the counts for players (given the roster for each team)
- show/plot the data on the guesses and compare to who actually scored
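
The second step in the list (dropping tweets posted after the goal) could be sketched as follows. The goal time and tweets below are made-up placeholders, since the notebook has not yet pulled the real goal time from the NHL API, and `tweets_before_goal` is a hypothetical helper.

```python
from datetime import datetime

# Format of the 'created_at' field in the v1.1 search results (always UTC)
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def tweets_before_goal(tweets, goal_time_utc):
    """Keep only tweets created strictly before the goal.

    `goal_time_utc` is a naive datetime expressed in UTC, matching the
    naive datetimes parsed from 'created_at'.
    """
    kept = []
    for tweet in tweets:
        created = datetime.strptime(tweet['created_at'], TWITTER_TIME_FORMAT)
        if created < goal_time_utc:
            kept.append(tweet)
    return kept


# Placeholder data for illustration only
sample = [
    {'created_at': 'Fri Apr 26 02:10:00 +0000 2019', 'text': 'early guess'},
    {'created_at': 'Fri Apr 26 02:40:00 +0000 2019', 'text': 'late guess'},
]
goal_time = datetime(2019, 4, 26, 2, 30, 0)  # assumed goal time, in UTC
early = tweets_before_goal(sample, goal_time)
```

Comparing in UTC throughout also sidesteps the problem of games that run past midnight local time.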


Using a Counter below is greedy and not necessarily the best way to do this, because we end up counting many unrelated words that appear in the tweets. We could think about how to make this better (only count player names, etc.), but this may be harder than it's worth...
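
One way to avoid counting unrelated words is to look only for names on the two rosters. This is a rough sketch, assuming the rosters are available as a collection of last names (the names below are placeholders, and `count_roster_mentions` is a hypothetical helper). It strips punctuation with a regular expression and counts each player at most once per tweet; it does not solve the shared-last-name problem.

```python
import re
from collections import Counter


def count_roster_mentions(texts, roster):
    """Count how many tweets mention each roster last name."""
    roster = {name.lower() for name in roster}
    counts = Counter()
    for text in texts:
        # Replace anything that is not a letter, apostrophe, or hyphen with a
        # space. ASCII-only: accented names would need extra handling.
        words = set(re.sub(r"[^a-z'\-]+", " ", text.lower()).split())
        # Only words that are roster names are counted, once per tweet
        counts.update(words & roster)
    return counts


# Placeholder tweets and roster for illustration only
texts = [
    "#bucciovertimechallenge Smith/Jones",
    "#bucciovertimechallenge smith. and Brown",
    "Smith Smith Smith #bucciovertimechallenge",
]
counts = count_roster_mentions(texts, roster=["Smith", "Jones", "Brown"])
```

Counting once per tweet keeps a single enthusiastic tweet from inflating a player's total.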

In [None]:
#now = '2019-04-26-22-21-19'

In [None]:
with open('data/raw/tweets-{}.json'.format(now)) as f:
    aggregated = json.load(f)
    
counts = Counter()
for tweet in tqdm(aggregated):
    t = tweet['text']
    # things to account for:
    #  - "/" between names
    #  - "<name>." (trailing period)
    #  - players who share a last name (hard to do)
    #  - time of tweet (a guess can't come after the goal was scored)
    #  - tweets must be from the day of the game, including games that run past midnight
    #  - other punctuation: replace each mark with "/" and then "/" with a space
    tweet_time = datetime.strptime(tweet['created_at'], "%a %b %d %H:%M:%S +0000 %Y")

    difference = datetime.utcnow() - tweet_time

    # total_seconds() includes whole days; .seconds alone wraps around every 24 hours
    if difference.total_seconds() > 60*60*5:  # tweet older than five hours
        continue

    # split on periods and slashes for now; colons and other punctuation are not handled yet
    cleaned = t.lower().replace('.', ' ').replace('/', ' ').split()
    counts.update(cleaned)

In [None]:
counts.most_common()

We want to cut our data down to one file per game, since we scraped tweets multiple times during overtime periods.

The other option is to run the analysis by date and use whichever file was scraped closest to each game, but that is hard to do. It's hard to tell which file corresponds to which game because tweets don't name the game; we would need to look it up from the player names and the teams they play for.
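
The lookup from player names to teams could be sketched as below, assuming a mapping from each team to its roster of last names is available (the teams and names here are placeholders, and `guess_teams` is a hypothetical helper). It sums the word counts per team and returns the two teams mentioned most.

```python
from collections import Counter


def guess_teams(word_counts, rosters, n=2):
    """Return the n teams whose roster names appear most often in word_counts.

    `word_counts` is a Counter of lowercased words; `rosters` maps a team
    name to a list of last names. Teams with no mentions are left out.
    """
    team_scores = Counter()
    for team, names in rosters.items():
        # A Counter returns 0 for names that never appear
        team_scores[team] = sum(word_counts[name.lower()] for name in names)
    return [team for team, score in team_scores.most_common(n) if score > 0]


# Placeholder counts and rosters for illustration only
word_counts = Counter({'smith': 40, 'jones': 25, 'brown': 2, 'the': 90})
rosters = {'Team A': ['Smith'], 'Team B': ['Jones'], 'Team C': ['Brown', 'Lee']}
teams = guess_teams(word_counts, rosters)
```

This only works if a file is dominated by one game; a file scraped while two games were in overtime would need the tweets split by time first.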

In [None]:
df = pd.read_csv('data/2019-nhl-ot-goals.csv')

In [None]:
df.head()

In [None]:
df['datetime'] = pd.to_datetime(df["Date"], format='%Y-%m-%d')

In [None]:
df.head()

In [None]:
# For each day with an overtime goal, find the tweet files scraped that day (via glob)
# and keep the one with the most records. Next steps: load those records into a Counter,
# see what "place" the goal scorer holds by number of votes under the naive filter,
# and clean things up from there.
for i, r in df.iterrows():
    dt_s = r['datetime'].strftime("%Y-%m-%d")
    # raw files are saved above as data/raw/tweets-<timestamp>.json
    found = glob.glob(f'data/raw/tweets-{dt_s}*.json')
    if not found:
        continue
    best_file, best_records = None, []
    for path in found:
        with open(path) as f:
            records = json.load(f)
        if len(records) > len(best_records):
            best_file, best_records = path, records
    print(dt_s, best_file, len(best_records))

# Data Analysis

- https://statsapi.web.nhl.com/api/v1/teams
- https://github.com/erunion/sport-api-specifications/tree/master/nhl
- https://gitlab.com/dword4/nhlapi/blob/master/stats-api.md
- http://www.nhl.com/scores/htmlreports/20182019/GS030221.HTM

First we can get the overtime results of every game in the 2019 NHL playoffs from [here](https://www.hockey-reference.com/playoffs/NHL_2019.html#all_ots) by copying the table and saving it as a CSV locally. I have this saved in the `data/` directory.

After some exploratory data analysis (EDA), I found that the #bucciovertimechallenge tweets did not do very well at predicting winners or scorers for the overtime games. Not the exciting outcome we might hope for, but still useful to investigate.
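
One simple way to quantify how well the crowd did is the rank of the actual scorer among the counted guesses, for example whether the scorer was among the top few picks. This is a minimal sketch with placeholder counts; `scorer_rank` is a hypothetical helper and not the analysis behind the result above.

```python
from collections import Counter


def scorer_rank(counts, scorer):
    """Return the 1-based rank of `scorer` by count, or None if never mentioned.

    Ties are ordered as Counter.most_common() returns them.
    """
    ranked = [name for name, _ in counts.most_common()]
    scorer = scorer.lower()
    return ranked.index(scorer) + 1 if scorer in ranked else None


# Placeholder counts for illustration only
counts = Counter({'smith': 40, 'jones': 25, 'brown': 2})
rank = scorer_rank(counts, 'Jones')
```

Collecting this rank for every overtime game would give a distribution that can be compared against a baseline such as always picking each team's leading scorer.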