## NFL Matchup Predictor

I started this project to try to predict the winners of NFL games each week.  It definitely didn't have anything to do with me making terrible predictions in Fantasy, and was only about practicing Machine Learning... Yes, that's it...

Anyways, the idea is to predict the outcome of a given game using each team's weekly stats from earlier games that season.  These include things like Total Yards, QB Rating, Rushing Rank, and Points Allowed.  A regular season has 256 games (16 per team across 17 weeks, since each team gets one Bye), and the data start in 2002.  I am treating each team-week-season as a data point whose features are the averaged weekly stats of that team and its opponent, so every game contributes two rows, one per team.  Given these data, the goal is to train a predictive model that assigns each team in the matchup a Victory score between 0 and 1.  My assumption is that the team with the higher score should be the Winner.

Note: since Week 1 has no previous weeks in that season, this setup does not make predictions until Week 2.  It's possible I could use Week 17 of the previous season as a proxy, but I don't think the correlation between years is strong enough to warrant that (sounds like another future project to explore!).

The initial code follows below, though the work has so far been about constructing a valid pipeline; not much model tuning has been performed at this point.  Ideally, I plan to use this setup to learn more about neural networks, but that is not yet included.  For now it is using xgboost, as it is the prediction model I am most familiar with.

In [1]:
%autoreload 

import numpy as np
import pandas as pd

from load_data import get_stats, get_results
from xgb_fit import get_fit_predictions as xgb_pred

current_season = 2016
current_week   = 17

base_dir = "~/Documents/NFL-Predictor/"



### Data Loading

First step is to acquire the weekly stats for each team, as well as the results of the matchup.  I found the stats via www.foxsports.com, while the results are taken from www.nfl.com.  The process of scraping these data from each site can be found in the `load_data.py` script.  Most of it is fairly standard, although there were annoying things like dealing with 'JAX' transforming into 'JAC' briefly.  I am only taking data from 2002 on, as this is when the division realignment took place, putting the league into its current format.  
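The abbreviation clean-up mentioned above can be sketched roughly as follows. This is not the code in `load_data.py`; `TEAM_ALIASES` and `normalize_team` are hypothetical names, and only the 'JAC' to 'JAX' mapping comes from the text.

```python
# Hypothetical alias table: maps alternate abbreviations to one canonical form.
# Only the JAC -> JAX entry comes from the text; others would be added as found.
TEAM_ALIASES = {
    'JAC': 'JAX',
}

def normalize_team(abbr):
    """Return the canonical abbreviation for a team code."""
    abbr = abbr.strip().upper()
    return TEAM_ALIASES.get(abbr, abbr)

# Mixed spellings collapse to a single key per team
teams = [normalize_team(t) for t in ['JAC', 'jax', ' NE ', 'BUF']]
```

Normalizing at load time keeps every later lookup by team name consistent across sites and seasons.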

In [2]:
generate_stats = False

df_stats = {} # All statistics from each Week

# If generating files, usually only need to do so for the last two weeks
# (to update winners of previous week, and setup new matchups)
# Otherwise, load results from all previous seasons
seasons = range(2002, current_season + 1) if not generate_stats else [current_season]
for season in seasons:
    df_stats[season] = {}
    
    w = 18 if (season != current_season) else current_week
    weeks = range(1, w) if not generate_stats else [current_week - 1]
    for week in weeks:
        name_str = base_dir + 'data/stats/%d/stats_Week_%d_%d.csv' % (season, week, season)
        # If generating file, save it as a csv for easier loading
        if generate_stats:
            print('Generating Team Stats - Season %d - Week %2d' % (season, week))
            df_stats[season][week] = get_stats(season, week)
            df_stats[season][week].to_csv(name_str)
        # Otherwise, just load the created file (super fast!)
        else:
            df_stats[season][week] = pd.read_csv(name_str, index_col = 0)
    
# print df_stats[2016][6].head()

In [3]:
generate_results = False

df_results = {} # Game results from each Week

seasons = range(2002, current_season + 1) if not generate_results else [current_season]
for season in seasons:
    df_results[season] = {}
    
    w = 18 if (season != current_season) else current_week + 1
    weeks = range(1, w) if not generate_results else [current_week - 1, current_week]
    for week in weeks: 
        name_str = base_dir + 'data/results/%d/results_Week_%d_%d.csv' % (season, week, season)      
        # If generating file, save it as a csv for easier loading
        if generate_results:            
            print('Generating Team Results - Season %d - Week %2d' % (season, week))
            df_results[season][week] = get_results(season, week)
            df_results[season][week].to_csv(name_str)
        # Otherwise, just load the created file (super fast!)
        else:
            df_results[season][week] = pd.read_csv(name_str, index_col = 0)

# print df_results[2016][6]

### Averaging Functions

After the data are loaded, I need to construct methods to average the results of the previous weeks.  Currently, I am only using simple averages over N weeks prior.  From preliminary testing (not shown here), it seems the average over all previous weeks in the season is most effective, but that will need to be revisited once the model tuning begins.

Each of these functions runs on a given week-season pair and returns, for every team, averages of the stats from before the specified week.  In the case of a BYE week, the `avg_previous_week` function will look one week further back, but the other functions ignore Byes (so they may average over only N - 1 weeks).  I may fix this in the future if the other averaging methods start to look more promising.
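As a small illustration of the Bye effect described above, here is a sketch on made-up numbers. It uses `groupby` instead of the explicit per-team loop in the notebook; `avg_recent` and the single `Total Yards` column are stand-ins, not the real stat list.

```python
import pandas as pd

# Toy weekly stats, indexed by team (made-up numbers, one stand-in column).
# BUF has a Bye in week 2, so it is simply absent from that week's frame.
weekly = {
    1: pd.DataFrame({'Total Yards': [300.0, 400.0]}, index=['BUF', 'NE']),
    2: pd.DataFrame({'Total Yards': [350.0]},        index=['NE']),
    3: pd.DataFrame({'Total Yards': [320.0, 390.0]}, index=['BUF', 'NE']),
}

def avg_recent(weekly, week, cutoff=3):
    """Average each team's stats over up to `cutoff` weeks before `week`."""
    first = max(1, week - cutoff)
    df = pd.concat([weekly[w] for w in range(first, week)])
    # Group rows by team (the index) and average; a Bye just means fewer rows
    return df.groupby(level=0).mean()

avg = avg_recent(weekly, week=4, cutoff=3)
```

Here NE is averaged over three games while BUF is averaged over only two, which is the N - 1 case mentioned above.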

In [4]:
def get_avg(df):
    """ Returns a DataFrame of averaged values for each team.
    
    Inputs - (df: DataFrame)    
    Output - (DataFrame)
    """
    
    dict_vals = {}
    for t in df.index.unique():
        dict_vals[t] = df[df.index == t].mean()
        
    return pd.DataFrame(dict_vals).transpose()
    
def get_avg_previous_week(season, week):
    """ Returns stats corresponding to the most recently played week.
    
    Inputs - (season: int, week: int)
    Output - (DataFrame)
    """
    
    # If Week 1, look back to previous season (unclear if viable)
    if week == 1:
        if season > 2002:
            return df_stats[season - 1][17]
        # No earlier season is loaded, so return an empty frame with the same columns
        else:
            return pd.DataFrame(columns = df_stats[2002][1].columns)
    
    df = df_stats[season]
    # If the team was on a Bye last week, must look 2 weeks back for stats
    return pd.concat([df[week - 1][df[week - 1].index == team] \
                      if team in df[week - 1].index \
                      else df[week - 2][df[week - 2].index == team] \
                      for team in df_results[season][week].index])


def get_avg_recent(season, week, cutoff = 3):   
    """ Averages stats over the previous weeks (up to 'cutoff' total weeks).
    
    Inputs - (season: int, week: int, cutoff: int)
    Output - (DataFrame)
    """
    
    # No need to average with only 1 point
    if week == 2:
        return df_stats[season][1]
    # Take only as many weeks as possible if below cutoff
    elif week <= cutoff:
        df = pd.concat([df_stats[season][w] for w in range(1, week)])
    # Otherwise, grab everything (does not account for Byes)
    else:
        df = pd.concat([df_stats[season][w] for w in range(week - cutoff, week)])
        
    return get_avg(df)
    
def get_avg_recent_3(season, week):
    return get_avg_recent(season, week, 3)

def get_avg_recent_4(season, week):
    return get_avg_recent(season, week, 4)

def get_avg_recent_5(season, week):
    return get_avg_recent(season, week, 5)
    
def get_avg_all_season(season, week):   
    """ Averages stats over all previous weeks.
    
    Inputs - (season: int, week: int)
    Output - (DataFrame)
    """
    
    # No need to average with only 1 point
    if week == 2:
        return df_stats[season][1]
    
    # Grab total results from previous weeks    
    return get_avg(pd.concat([df_stats[season][w] for w in range(1, week)]))


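# List of potential averaging functions
# (each 'avg_recent_N' entry binds its own cutoff; pointing them all at the bare
#  get_avg_recent would silently use the default cutoff of 3 every time)
avg_funcs = {
    'avg_all_season':    get_avg_all_season,
    'avg_previous_week': get_avg_previous_week,
    'avg_recent_2':      lambda season, week: get_avg_recent(season, week, 2),
    'avg_recent_3':      lambda season, week: get_avg_recent(season, week, 3),
    'avg_recent_4':      lambda season, week: get_avg_recent(season, week, 4),
    'avg_recent_5':      lambda season, week: get_avg_recent(season, week, 5),
}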

### Matchup Construction

Next step is to convert all the information into a 'Matchup' row, which includes the team's stats and its opponent's stats.  For the current week, no information about the Victory is included.
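To make the row layout concrete, here is a minimal sketch on toy data. `build_matchups` is a hypothetical helper rather than the notebook's `parse_matchups`, and `Total Yards` stands in for the full set of stat columns.

```python
import pandas as pd

# Toy inputs (made-up values): one game, listed once per team
results = pd.DataFrame({
    'Team':     ['BUF', 'NE'],
    'Home':     [0, 1],
    'Opponent': ['NE', 'BUF'],
})
# Averaged stats indexed by team
stats = pd.DataFrame({'Total Yards': [310.0, 380.0]}, index=['BUF', 'NE'])

def build_matchups(results, stats):
    """Attach each team's stats and its opponent's stats to every result row."""
    team_vals = stats.loc[results['Team']].reset_index(drop=True)
    opp_vals  = stats.loc[results['Opponent']].reset_index(drop=True)
    # Prefix the opponent's columns so both sets can sit in the same row
    opp_vals  = opp_vals.add_prefix('Opponent ')
    return pd.concat([results.reset_index(drop=True), team_vals, opp_vals], axis=1)

matchups = build_matchups(results, stats)
```

Each game therefore appears twice, once from each team's point of view, with the stat columns swapped.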

In [5]:
def get_prev_records(season, week):
    """ Combines all previous results in the season into a DataFrame.
    
    Inputs - (season: int, week: int)
    Output - (DataFrame)
    """
    
    # No records for previous weeks at start of the season
    if week == 1:
        df = df_results[season][1].copy()
        df[['Wins', 'Losses', 'Ties']] = 0
        
        return df
    
    # Grab most recently played week (if on Bye)
    df = df_results[season]    
    return pd.concat([df[week - 1][df[week - 1].index == team] if team in df[week - 1].index \
                    else df[week - 2][df[week - 2].index == team] for team in df[week].index])
    
def parse_matchups(avg_func, season, week):
    """ Combines all previous results in the season for both teams in the matchup.
    
    Note: 'avg_func' must be specified in the 'avg_funcs' dict.
    
    Inputs - (avg_func: function, season: int, week: int)
    Output - (DataFrame)
    """
    
#     print 'Parsing Matchup Data - Season %d - Week %2d' % (season, week)

    # Grab columns related to matchups and teams involved
    df_res = df_results[season][week][['Team', 'Home', 'Opponent', 'Victory']]
    df_rec = get_prev_records(season, week)[['Wins', 'Losses', 'Ties']]
    df_avg = avg_func(season, week)    
    
    # Get all relevant columns for combining total DataFrame
    res_cols = df_res.columns
    rec_cols = df_rec.columns
    avg_cols = df_avg.columns

    # Team names to construct index for each row (TEAM_WEEK_SEASON)
    team_names = df_res['Team']
    opp_names  = df_res['Opponent']
    
    df_index = team_names.values + '_%d_%d' % (week, season)
    
    # Records from df_results (label-based lookup by team name)
    team_recs = df_rec.loc[team_names]
    opp_recs  = df_rec.loc[opp_names]

    # Values from df_stats
    team_vals = df_avg.loc[team_names]
    opp_vals  = df_avg.loc[opp_names]
    
    # Combine everything together, and handle missing values
    data = np.concatenate([df_res.values, team_recs, team_vals, opp_recs, opp_vals], axis = 1)
    cols = np.concatenate([res_cols, rec_cols, avg_cols, 'Opponent ' + rec_cols, 'Opponent ' + avg_cols])
            
    df = pd.DataFrame(data, columns = cols, index = df_index)    
    df = df.apply(lambda x: pd.to_numeric(x, errors = 'ignore'))
    df.fillna(0, inplace=True)
    
    return df

# print get_prev_records(2016, 1)

# print parse_matchups(get_avg_all_season, 2015, 5)

Again, we can save the results to `.csv` files to make it load significantly faster later on.  This function needs to be re-run at the start of each new week in order to update the current week and the one prior.
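The save-then-reload pattern can be sketched in isolation like this. Note the assumption: this version decides by checking whether the file exists, whereas the notebook uses an explicit `generate_results` flag. `load_or_generate` and `make_df` are hypothetical names, and the data are made up.

```python
import os
import tempfile
import pandas as pd

def load_or_generate(path, generate, force=False):
    """Load a cached DataFrame from `path`, or build it with `generate()` and cache it."""
    if force or not os.path.exists(path):
        df = generate()
        df.to_csv(path)
        return df
    return pd.read_csv(path, index_col=0)

# Usage with a throwaway directory and a stand-in for the expensive parsing step
cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, 'avg_all_season_Week_2_2016.csv')
make_df = lambda: pd.DataFrame({'Wins': [1, 0]}, index=['BUF_2_2016', 'NE_2_2016'])

first  = load_or_generate(path, make_df)   # generates and writes the file
second = load_or_generate(path, make_df)   # reads the cached copy
```

Passing `force=True` plays the role of regenerating the current week and the one prior once new results are in.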

In [6]:
def get_matchups_data(avg_func, func_name, generate_results = False):
    """ Combines data after averaging function into DataFrame.
    
    Note: Results are saved to .csv files for later use.
    
    Inputs - (avg_func: function, func_name: string, generate_results: bool)
    Output - (DataFrame) 
    """
    
#     print '\nGetting Matchups Data -', func_name
    df_matchups = {} # Game predictions for each Week

    seasons = range(2002, current_season + 1) if not generate_results else [current_season]
    for season in seasons:
        df_matchups[season] = {}
        
        if generate_results:
            print('Parsing Team Data - Season %d' % (season))

        w = 18 if (season != current_season) else current_week + 1
        weeks = range(2, w) if not generate_results else [current_week - 1, current_week]
        for week in weeks:  
            name_str = base_dir + 'data/%s/%d/%s_Week_%d_%d.csv' % (func_name, season, func_name, week, season)      
            if generate_results:        
                print('\tParsing Team Data - Season %d - Week %2d' % (season, week))
                df_matchups[season][week] = parse_matchups(avg_func, season, week)
                df_matchups[season][week].to_csv(name_str)
            else:
                df_matchups[season][week] = pd.read_csv(name_str, index_col = 0)
                
    return df_matchups

# df_matchups = get_matchups_data(get_avg_all_season, 'avg_all_season', True)
# df_matchups = get_matchups_data(get_avg_previous_week, 'avg_previous_week', True)
# df_matchups = get_matchups_data(lambda s, w: get_avg_recent(s, w, 2), 'avg_recent_2', True)
# df_matchups = get_matchups_data(get_avg_recent_3, 'avg_recent_3', True)
# df_matchups = get_matchups_data(get_avg_recent_4, 'avg_recent_4', True)
# df_matchups = get_matchups_data(get_avg_recent_5, 'avg_recent_5', True)

## Prediction Time!

This is where the actual work happens.  The function below calls the Matchup Construction functions above to build the train / test datasets.  It then feeds them into the specified model to obtain a predicted Victory score (from 0 to 1) for each team.  From these, it picks the winner of each matchup, and displays the overall results for a given season (up to that point).
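The winner-picking and accuracy tally can be illustrated on their own, separate from the model. This is a simplified sketch: `pick_winner` is a hypothetical helper, and the two games are copied from the 2016 output shown further down.

```python
def pick_winner(team, team_prob, opp, opp_prob):
    """Return the predicted winner and the margin between the two scores."""
    winner = team if team_prob > opp_prob else opp
    return winner, abs(team_prob - opp_prob)

# (away team, score, home team, score, actual winner), taken from the output below
games = [
    ('DAL', 0.6395, 'WSH', 0.4111, 'DAL'),
    ('NYJ', 0.4306, 'BUF', 0.5162, 'NYJ'),
]

correct = 0
for team, team_prob, opp, opp_prob, actual in games:
    predicted, margin = pick_winner(team, team_prob, opp, opp_prob)
    correct += int(predicted == actual)

accuracy = 1.0 * correct / len(games)
```

As in the full function, an exact tie in scores goes to the second team listed (the home team), since only a strictly higher away score picks the away team.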


In [7]:
def get_season_predictions(season, fit_model, avg_str,
                           data_start = 2002, 
                           show_results = False):
    """ Shows prediction results with specified model for matchups through week in season.
    
    Note: Prints results for each week with weekly running accuracy over season.
    
    Inputs - (season: int, fit_model: function, avg_str: str, 
            data_start: int, show_results: bool)
    Output - (season, accuracy)
    """
    
    try:
        df_matchups = get_matchups_data(avg_funcs[avg_str], avg_str)
        df_train = pd.concat([df_matchups[s][w] 
                           for w in range(2, 18) 
                           for s in range(data_start, season)])
    except Exception:
        print('Error: Unable to process DataFrame using function `%s`!' % (avg_str))
        return
    
    print('Predicting Season %i - %s\n' % (season, avg_str))
    
    total_correct = 0
    total_games   = 0
    
    games = {}
    for team in df_results[season][1].index:
        games[team] = 0
        
    start = 2
    last = 18 if (season != current_season) else current_week + 1
    for week in range (start, last):
        df_m = df_matchups[season][week]
        
        # Flag the rows to be predicted by setting Victory to -1 (this also hides the true result)
        df_pred = df_m.copy()
        df_pred['Victory'] = -1

        # Increase the training data using previous weeks in the season, if possible
        if week == start:
            df_prev = pd.DataFrame(columns = df_m.columns)
        else:        
            df_prev = pd.concat([df_matchups[season][w] for w in range(start, week)])

        # Package everything together for the model to sort out 
        df_total = pd.concat([df_pred, df_prev, df_train])   
        labels, uniques = pd.factorize(df_total['Team'])
        enum = dict(zip(df_total['Team'], labels))
                
        # Convert the team names to int values
        for col in ['Team', 'Opponent']:
            df_total[col] = df_total[col].apply(lambda x: enum[x])
            
        # Just in case any null values slipped in...
        df_total.fillna(0, inplace = True)
            
        # Here's where the actual Machine Learning comes in!
        df_model = fit_model(df_total)

        # Now, let's evaluate the results of the predictions (TODO: add more evaluation?)
        correct = 0
        
        away_teams = np.array(df_m.loc[df_m.Home == 0].Team)
        for team in away_teams:
            opp = df_m[df_m.Team == team].Opponent.iloc[0]

            # Get probabilities and difference between them
            # TODO: Use difference as team-dependent corrections to model behavior?
            team_prob = df_model[df_model.Team == enum[team]].Prediction.iloc[0]
            opp_prob  = df_model[df_model.Team == enum[opp]].Prediction.iloc[0]
            prob_diff = abs(team_prob - opp_prob)
                    
            # Take team with higher probability as winner
            prediction = team if (team_prob > opp_prob) else opp
                        
            # Check and see how we did!
            winner = team if (df_m[df_m.Team == team].Victory.iloc[0] == 1) else opp
            result = '✓' if (prediction == winner) else ''
            
            games[team] += 1
            games[opp]  += 1
                
            correct += int(prediction == winner)

            if (show_results):
                print('%3s %.4f - %3s %.4f --> Prediction: %3s (%.4f) --> Winner: %3s %1s '
                      % (team, team_prob, opp, opp_prob, prediction, prob_diff, winner, result))
                    
        n_away = len(away_teams)
                    
        total_correct += correct
        total_games   += n_away
        
        print('Week %2s - Correct: %2d / %2d (Week: %.3f - Season: %.3f)'
              % (week, correct, n_away, 1.0 * correct / n_away, 1.0 * total_correct / total_games))
        
        if (show_results):
            print('')
        
    print('\nTotal Correct: %d / %d (%.3f)\n'
          % (total_correct, total_games, 1.0 * total_correct / total_games))
        
    return season, 1.0 * total_correct / total_games

In [8]:
pred = get_season_predictions(2016, xgb_pred, 'avg_all_season', show_results = True)

Predicting Season 2016 - avg_all_season





NYJ 0.4306 - BUF 0.5162 --> Prediction: BUF (0.0856) --> Winner: NYJ   
 SF 0.4922 - CAR 0.4424 --> Prediction:  SF (0.0499) --> Winner: CAR   
DAL 0.6395 - WSH 0.4111 --> Prediction: DAL (0.2284) --> Winner: DAL ✓ 
CIN 0.4130 - PIT 0.6280 --> Prediction: PIT (0.2150) --> Winner: PIT ✓ 
 NO 0.5512 - NYG 0.4668 --> Prediction:  NO (0.0844) --> Winner: NYG   
MIA 0.3323 -  NE 0.6992 --> Prediction:  NE (0.3669) --> Winner:  NE ✓ 
 KC 0.4218 - HOU 0.4489 --> Prediction: HOU (0.0272) --> Winner: HOU ✓ 
TEN 0.6113 - DET 0.6622 --> Prediction: DET (0.0508) --> Winner: TEN   
BAL 0.6452 - CLE 0.2970 --> Prediction: BAL (0.3483) --> Winner: BAL ✓ 
SEA 0.5388 -  LA 0.3398 --> Prediction: SEA (0.1990) --> Winner:  LA   
 TB 0.5810 - ARZ 0.2930 --> Prediction:  TB (0.2880) --> Winner: ARZ   
JAX 0.4829 -  SD 0.6532 --> Prediction:  SD (0.1703) --> Winner:  SD ✓ 
ATL 0.5302 - OAK 0.3857 --> Prediction: ATL (0.1445) --> Winner: ATL ✓ 
IND 0.3487 - DEN 0.5247 --> Prediction: DEN (0.1760) --> Winner:

CLE 0.3073 - BAL 0.6767 --> Prediction: BAL (0.3694) --> Winner: BAL ✓ 
HOU 0.5000 - JAX 0.4546 --> Prediction: HOU (0.0455) --> Winner: HOU ✓ 
DEN 0.5036 -  NO 0.4404 --> Prediction: DEN (0.0633) --> Winner: DEN ✓ 
 LA 0.6067 - NYJ 0.4405 --> Prediction:  LA (0.1662) --> Winner:  LA ✓ 
ATL 0.5187 - PHI 0.4728 --> Prediction: ATL (0.0459) --> Winner: PHI   
 KC 0.5913 - CAR 0.2943 --> Prediction:  KC (0.2970) --> Winner:  KC ✓ 
CHI 0.5770 -  TB 0.4314 --> Prediction: CHI (0.1457) --> Winner:  TB   
MIN 0.3545 - WSH 0.6631 --> Prediction: WSH (0.3086) --> Winner: WSH ✓ 
 GB 0.4008 - TEN 0.6837 --> Prediction: TEN (0.2830) --> Winner: TEN ✓ 
MIA 0.4299 -  SD 0.5452 --> Prediction:  SD (0.1153) --> Winner: MIA   
 SF 0.3233 - ARZ 0.6472 --> Prediction: ARZ (0.3239) --> Winner: ARZ ✓ 
DAL 0.6640 - PIT 0.2388 --> Prediction: DAL (0.4252) --> Winner: DAL ✓ 
SEA 0.2230 -  NE 0.7420 --> Prediction:  NE (0.5190) --> Winner: SEA   
CIN 0.4348 - NYG 0.6000 --> Prediction: NYG (0.1651) --> Winner:

## Results

Currently, the success of the model is modest.  If every game is treated as a coin-flip, the baseline is 50% accuracy, so the model beats the naive worst case, though not by a wide margin.  However, it also turns out NFL games are hard to predict!  Even the best experts / algorithms rarely break 70% accuracy.  It's possible the data are just too noisy to find meaningful correlations (also known as 'Any Given Sunday').  Still, this result of ~65% accuracy is reasonably solid given the minimal amount of testing / tuning done so far.  The predictions also seem to improve as the season progresses, so the early weeks may suffer from low-statistics fluctuations that could be corrected for.  It will be interesting to see how much further exploration and other models can bolster the predictions.
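One rough way to judge how far an accuracy sits from the coin-flip baseline is to express the gap in standard errors. The sketch below is illustrative only: `z_vs_coin_flip` is a hypothetical helper, and the 156-of-240 figure is an assumed example (weeks 2-17 of a full season contain 240 games), not a measured result from this notebook.

```python
import math

def z_vs_coin_flip(correct, total):
    """Number of standard errors by which an observed accuracy exceeds 50%."""
    accuracy = 1.0 * correct / total
    # Standard deviation of the accuracy of a fair coin over `total` games
    std_err = math.sqrt(0.25 / total)
    return (accuracy - 0.5) / std_err

# Illustrative numbers only: 65% accuracy over 240 games
z = z_vs_coin_flip(correct=156, total=240)
```

A single week of 16 games has a much larger standard error (0.125), which is one reason week-to-week accuracy swings so widely.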