# 2019 NCAA Men's Basketball Tournament

**Update**: I ended up coming in third place in the competition below. 

This notebook holds my scratch work for deciding how I want to play the 2019 men's basketball tournament brackets.

Below are some links:

- [Awesome pdf with tons of stats sent to me by a friend at work](https://www.dropbox.com/s/rw8l2d6mv1sbe49/2019%20NCAA%20Tournament%20Binder%20-%20Reddit.pdf?dl=0)
- [Reddit article on the tournament](https://www.reddit.com/r/CollegeBasketball/comments/b35plp/eliminating_contenders_to_find_a_winner_for_the/)

## Game

Outside of making a bracket, my friends from school organize another NCAA tournament game (based on an assignment they had in a class). The rules of the game are:

- Submit an entry that contains the set of teams you want to buy.
- Each entry is allocated 100 points which must be used in its entirety.
- Every team has a cost assigned to them that is determined by their seed.
- The winning submission is the one with the most combined wins across all of its teams.
- Every win counts the same no matter what round it comes in.

The costs associated with each seed are as follows:

- 1 = 25
- 2 = 21
- 3 = 18
- 4 = 15
- 5 = 12
- 6 = 10
- 7 = 8
- 8 = 6
- 9 = 5
- 10 = 4
- 11 = 3
- 12 = 2
- 13 = 1
- 14 = 1 
- 15 = 1
- 16 = 1
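
As a quick sanity check on the rules, an entry's cost can be totaled from its seeds and compared against the budget. This is only a minimal sketch: `SEED_COST`, `entry_cost`, and `is_valid_entry` are names made up for illustration, and the seed lists are hypothetical entries.

```python
# Cost of each seed, copied from the list above.
SEED_COST = {
    1: 25, 2: 21, 3: 18, 4: 15, 5: 12, 6: 10, 7: 8, 8: 6,
    9: 5, 10: 4, 11: 3, 12: 2, 13: 1, 14: 1, 15: 1, 16: 1,
}
BUDGET = 100  # every entry must spend exactly this many points


def entry_cost(seeds):
    """Total cost of an entry given the seed of each team in it."""
    return sum(SEED_COST[seed] for seed in seeds)


def is_valid_entry(seeds):
    """An entry is valid only if it spends the whole budget."""
    return entry_cost(seeds) == BUDGET


print(is_valid_entry([1, 1, 1, 1]))  # four 1 seeds: 4 * 25
print(is_valid_entry([1, 1, 1, 2]))  # 96 points, leaves budget unspent
```

Because the budget has to be used in full, a cheap 13-16 seed is often the easiest way to soak up a leftover point.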

## Strategy

I want to gather a few different sources of data and build an S-curve ranking from each one. I define the S-curve ranking as the 1-64 ranking of every team in the tournament (instead of four 1 seeds, four 2 seeds, etc.), where the highest-ranked team has the best chance of winning. This is how the NCAA selection committee makes brackets. Then I will arbitrarily (I didn't say this would be rigorous) decide which S-curve rankings I like and combine them into one average S-curve ranking for the teams. Finally, I will compute a normalized S-curve difference for each team via
$$s_{diff} = \frac{s_a - s_e}{s_a}$$
where $s_{diff}$ is the normalized S-curve difference, $s_e$ is the expected S-curve rank (the aggregate rank I described above), and $s_a$ is the actual S-curve rank from the NCAA bracket. I will then *buy* (in the context of the game) the teams that have the highest $s_{diff}$. The logic is that these teams deserved a better seed than they received, so they are "cheap" for their seed, which determines their price. Because they are cheap I can afford more of them, and given their expected S-curve ranking they should perform like more expensive teams (get more wins), so hopefully I do better in the game.
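
To make the formula concrete, here is a small worked sketch. The function name `s_curve_diff` is made up for illustration; the Virginia and Iowa St. ranks are the ones that appear in the tables later in this notebook, and the last call uses hypothetical ranks.

```python
def s_curve_diff(actual_rank, expected_rank):
    """Normalized S-curve difference: (s_a - s_e) / s_a."""
    return (actual_rank - expected_rank) / actual_rank


# Virginia: actual S-curve rank 2, expected rank 1.
print(s_curve_diff(2, 1))    # 0.5

# Iowa St.: actual S-curve rank 24, expected rank 16.
print(round(s_curve_diff(24, 16), 6))

# Hypothetical overseeded team: actual rank 3, expected rank 5.
# The result is negative, so this team would not be a buy.
print(round(s_curve_diff(3, 5), 6))
```

One thing the example shows: dividing by $s_a$ means a one-spot gap near the top of the S-curve (Virginia) scores higher than an eight-spot gap further down (Iowa St.).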

I haven't put much thought into this, and I could probably backtest the plan against old tournaments, but I am doing this for fun and I like being data-driven (but not too data-driven). I don't follow college basketball closely enough to know about every team, but this approach may still fare better than random guessing, and it will hopefully be less overfitted than if I threw TensorFlow at the data. Hopefully my math above sounds and looks correct; it did to me.

I am also going to use the outputs generated below in my bracket pool at work but I will be more emotional and pick teams I want to root for too.

## Data

The main sources I am going to use are the tables in the PDF above (page 8, Advanced Ranking Metrics Systems) and the Vegas odds of each team winning the tournament from [SBNation](https://www.sbnation.com/2019/3/17/18270211/ncaa-tournament-odds-2019-national-title-futures-full-rankings-net-duke-houston) (I don't bet much, so I don't know the good bookies or places to get odds).

Other potential sources are [FiveThirtyEight](https://projects.fivethirtyeight.com/2019-march-madness-predictions/), the sources listed in FiveThirtyEight's [How this works](https://fivethirtyeight.com/features/how-our-march-madness-predictions-work-2/) article, the actual websites of the different rankings I use (vide infra), more gambling sites, etc. There is no shortage of data.

## Data cleaning and ingestion

In [1]:
import pandas as pd

For the NCAA tournament binder, I couldn't find the data from their PDF tables anywhere on their website (it exists on other sites, but I was lazy), so I copied and pasted everything into Google Sheets and did some manual cleaning to get it into decent shape. Then I downloaded two CSVs and stored them in the `data` folder that sits in this notebook's directory.

I also could have pulled rankings from specific sites instead of from the PDF; some of those are linked in this [article](https://fivethirtyeight.com/methodology/how-our-march-madness-predictions-work-2/).

Key (from the PDF):

- BPI: ESPN BPI Rankings
- LRMC: Logistic Regression/Markov Chain (Bayesian) Rankings
- NET: NCAA NET Rankings
- KPI: Kevin Pauga - KPI Sports 
- Pom: Ken Pomeroy Rankings
- SAG: Jeff Sagarin Rankings
- SRS: Basketball-Reference's Simple Rating System
- AVG: Average of the rankings EXCLUDING the maximum and minimum values (named binder-average)
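
The AVG definition can be checked by hand against the first rows of the binder table shown below. This is a small sketch, assuming exactly one maximum and one minimum are dropped even when a value repeats (that assumption reproduces the binder's numbers for these rows); `trimmed_average` is a name made up for illustration.

```python
def trimmed_average(ranks):
    """Mean of the rankings after dropping one maximum and one minimum."""
    if len(ranks) < 3:
        raise ValueError("need at least three rankings to trim both ends")
    return (sum(ranks) - max(ranks) - min(ranks)) / (len(ranks) - 2)


# Rankings in the order bpi, kpi, lrmc, net, pom, sag, srs.
virginia = [1, 2, 4, 1, 1, 2, 3]
gonzaga = [2, 11, 1, 2, 2, 5, 2]

print(round(trimmed_average(virginia), 1))  # binder-average lists 1.8
print(round(trimmed_average(gonzaga), 1))   # binder-average lists 2.6
```

Gonzaga is a good illustration of why the trimming matters: its KPI rank of 11 is an outlier and is dropped before averaging.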

In [2]:
df_binder = pd.read_csv('data/binder.csv')

In [3]:
df_binder.head()

Unnamed: 0,team,binder-average,bpi,kpi,lrmc,net,pom,sag,srs
0,Virginia,1.8,1,2,4,1,1,2,3
1,Duke,2.0,3,1,2,3,3,1,1
2,Gonzaga,2.6,2,11,1,2,2,5,2
3,Michigan St.,4.4,4,6,3,8,4,4,4
4,North Carolina,4.8,5,3,5,7,6,3,5


In [4]:
df_act_s = pd.read_csv('data/actual-s-curve.csv')

In [5]:
df_act_s.head()

Unnamed: 0,team,act
0,Duke,1
1,Virginia,2
2,North Carolina,3
3,Gonzaga,4
4,Tennessee,5


For the gambling data, I copied the table from [here](https://www.sbnation.com/2019/3/17/18270211/ncaa-tournament-odds-2019-national-title-futures-full-rankings-net-duke-houston) into Google Sheets, made sure the team names matched (for joins), and downloaded it as a CSV.

In [6]:
df_odds = pd.read_csv('data/odds.csv')

In [7]:
df_odds.head()

Unnamed: 0,team,odds
0,Virginia,8
1,Duke,2
2,Gonzaga,5
3,Michigan St.,14
4,North Carolina,6


## Implement strategy

In [8]:
df_all = pd.merge(df_binder, df_act_s, on='team')
df_all = pd.merge(df_all, df_odds, on='team')
df_all.head()

Unnamed: 0,team,binder-average,bpi,kpi,lrmc,net,pom,sag,srs,act,odds
0,Virginia,1.8,1,2,4,1,1,2,3,2,8
1,Duke,2.0,3,1,2,3,3,1,1,1,2
2,Gonzaga,2.6,2,11,1,2,2,5,2,4,5
3,Michigan St.,4.4,4,6,3,8,4,4,4,6,14
4,North Carolina,4.8,5,3,5,7,6,3,5,3,6
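
`pd.merge` defaults to an inner join, so a team whose name is spelled differently in one file is silently dropped. A hedged sketch of one way to check for that, using toy frames rather than the real CSVs; the `St. Mary's` spelling is a deliberately invented mismatch and the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for two of the source tables.
toy_binder = pd.DataFrame({
    'team': ['Virginia', 'Duke', "Saint Mary's"],
    'binder-average': [1.8, 2.0, 37.0],
})
toy_odds = pd.DataFrame({
    'team': ['Virginia', 'Duke', "St. Mary's"],  # mismatched on purpose
    'odds': [8, 2, 1000],
})

# An outer join keeps every row; indicator=True adds a `_merge` column
# saying whether each row came from the left frame, the right, or both.
merged = pd.merge(toy_binder, toy_odds, on='team', how='outer', indicator=True)
unmatched = merged.loc[merged['_merge'] != 'both', ['team', '_merge']]
print(unmatched)
```

An empty `unmatched` frame would mean every team name lines up across the two files.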


In [9]:
df_all['exp'] = df_all['binder-average'].rank() 
df_all['s_diff'] = (df_all['act']-df_all['exp'])/df_all['act']
df_all.sort_values(by=['s_diff'],ascending=False).head(40)

Unnamed: 0,team,binder-average,bpi,kpi,lrmc,net,pom,sag,srs,act,odds,exp,s_diff
0,Virginia,1.8,1,2,4,1,1,2,3,2,8,1.0,0.5
3,Michigan St.,4.4,4,6,3,8,4,4,4,6,14,4.0,0.333333
15,Iowa St.,16.2,15,18,15,21,16,15,17,24,40,16.0,0.333333
10,Virginia Tech,12.0,11,25,14,11,11,12,12,16,80,11.0,0.3125
27,Florida,29.4,30,49,30,31,28,27,28,40,200,28.0,0.3
12,Auburn,13.4,13,19,10,18,14,11,11,18,60,13.0,0.277778
31,Saint Mary's,37.0,34,42,39,32,31,38,42,44,1000,32.0,0.272727
13,Wisconsin,14.6,14,23,13,17,12,16,13,19,100,14.0,0.263158
2,Gonzaga,2.6,2,11,1,2,2,5,2,4,5,3.0,0.25
35,Oregon,40.4,38,45,32,51,43,32,44,48,200,36.0,0.25


Now the same thing, but with the gambling odds. I didn't end up using these, because I think the futures odds I got from SBNation weren't as accurate or as recent as those on other sites I was checking, but they are still interesting.

In [10]:
df_all['exp'] = df_all['odds'].rank(ascending=True) 
df_all['s_diff'] = (df_all['act']-df_all['exp'])/df_all['act']
df_all.sort_values(by=['s_diff'],ascending=False).head(10)

Unnamed: 0,team,binder-average,bpi,kpi,lrmc,net,pom,sag,srs,act,odds,exp,s_diff
2,Gonzaga,2.6,2,11,1,2,2,5,2,4,5,2.0,0.5
15,Iowa St.,16.2,15,18,15,21,16,15,17,24,40,12.5,0.479167
35,Oregon,40.4,38,45,32,51,43,32,44,48,200,28.0,0.416667
42,N Mexico St.,51.4,58,56,41,40,49,54,57,49,300,33.0,0.326531
21,Villanova,22.6,21,16,22,26,26,18,30,21,50,14.5,0.309524
14,Florida St.,15.2,16,14,16,16,15,13,15,15,30,10.5,0.3
27,Florida,29.4,30,49,30,31,28,27,28,40,200,28.0,0.3
7,Kentucky,7.4,8,7,9,6,7,7,8,7,12,5.0,0.285714
45,Murray St.,56.0,49,74,52,44,52,56,71,46,300,33.0,0.282609
29,Syracuse,35.0,33,36,38,42,34,33,34,30,100,23.5,0.216667
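
The fractional `exp` values above (12.5, 14.5, and so on) appear because many teams share the same odds and `rank()` defaults to `method='average'`, which gives tied entries the mean of the ranks they span. A small sketch on a four-team subset of the odds shown above; the ranks are relative to this subset only, so they differ from the full table.

```python
import pandas as pd

# Odds for a few teams from the table above; Florida and Oregon are tied.
odds = pd.Series({'Duke': 2, 'Gonzaga': 5, 'Florida': 200, 'Oregon': 200})

avg_rank = odds.rank()              # default method='average'
min_rank = odds.rank(method='min')  # tied teams all get the best rank

print(avg_rank['Florida'], avg_rank['Oregon'])  # 3.5 3.5
print(min_rank['Florida'], min_rank['Oregon'])  # 3.0 3.0
```

Which tie method is preferable here is a judgment call; `method='min'` would make every tied team look slightly more underseeded.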


So we see the most "underpriced" (or underseeded) teams are at the top. This gives me a baseline to make my selections for which teams I want to buy. Below I have a function to calculate the "cost" of different portfolios of teams, so I can choose what to buy and test different combinations.

This could be a cool application of a genetic algorithm (selecting which teams to buy, running populations of teams through multiple simulations). I'm sure there are lots of other optimization and modeling techniques you could use from this point on to test different combinations and their prices, but I am not worried about going that in depth.
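
For a small candidate pool, a plain brute-force search is already enough to try every combination. The sketch below is not what I actually did: it assumes the goal is to maximize the summed $s_{diff}$ of an entry that costs exactly 100, which is an objective invented for illustration. The candidates are the ten teams from the binder-based table above, with costs derived from their actual S-curve ranks.

```python
from itertools import combinations

# (team, cost implied by seed, s_diff) for the ten binder-table teams.
candidates = [
    ('Virginia', 25, 0.5),
    ('Michigan St.', 21, 0.333333),
    ('Iowa St.', 10, 0.333333),
    ('Virginia Tech', 15, 0.3125),
    ('Florida', 4, 0.3),
    ('Auburn', 12, 0.277778),
    ("Saint Mary's", 3, 0.272727),
    ('Wisconsin', 12, 0.263158),
    ('Gonzaga', 25, 0.25),
    ('Oregon', 2, 0.25),
]
BUDGET = 100


def best_entry(candidates, budget=BUDGET):
    """Try every subset; keep the exact-budget one with the highest total s_diff."""
    best, best_score = None, float('-inf')
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            if sum(cost for _, cost, _ in combo) != budget:
                continue  # the rules require spending the full budget
            score = sum(diff for _, _, diff in combo)
            if score > best_score:
                best, best_score = combo, score
    return best, best_score


entry, score = best_entry(candidates)
print([team for team, _, _ in entry], round(score, 3))
```

With ten candidates this is only 1,023 subsets; the full 64-team field would need something smarter, such as dynamic programming or the genetic algorithm mentioned above.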

In [11]:
import math

def calculate_seed(rank):
    # Map an S-curve rank to a seed: ranks 1-4 are 1 seeds, 5-8 are 2 seeds, etc.
    return str(math.floor((rank-1)/4)+1)

def cost_given_seed(seed):
    cost_map = {
        "1"  : 25,
        "2"  : 21,
        "3"  : 18,
        "4"  : 15,
        "5"  : 12,
        "6"  : 10,
        "7"  : 8,
        "8"  : 6,
        "9"  : 5,
        "10" : 4,
        "11" : 3,
        "12" : 2,
        "13" : 1,
        "14" : 1,
        "15" : 1,
        "16" : 1
    }
    return cost_map[seed]


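def calculate_cost(teams, df):
    # Total cost of a list of teams, pricing each by the seed implied by its S-curve rank.
    cost = 0
    for team in teams:
        df_row = df.loc[df['team'] == team]
        # Pull the single matching value explicitly; int() on a Series is deprecated in recent pandas.
        rank = int(df_row['act'].iloc[0])
        seed = calculate_seed(rank)
        cost += cost_given_seed(seed)
    return cost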

In [12]:
submission = [
    'Virginia',
    'Michigan St.',
    'Iowa St.',
    'Virginia Tech',
    'Florida',
    'Auburn',
    "Saint Mary's",
    'Buffalo'
]
calculate_cost(submission, df_act_s)

100