In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
rng = np.random.default_rng(2718)

import scrape_sbr
import fader
from win_loss_report import win_loss_report, win_loss_from_df

## DEMAR: Detecting Midrange Arbitrage on Roundball
an exploration by Casey Durfee <csdurfee@gmail.com>     
January 17, 2025

This is a part of a much larger thing I've been working on called "Don't Be A Sucker". It's a random walk through the math, psychology and philosophy of sports gambling. 

I'm not sure if this will make the final cut, but it's one of the more interesting things I've found, so I wanted to share it.

---------------

Sports betting has gotten really big in the US over the past few years, so it got me curious. I don't gamble and am a naturally skeptical person, so I had a suspicion that certain things weren't as balanced as most people assumed. 

Every potential bet is a math problem, whether the people betting know it or not.

The line is a prediction about the final point differential of the game. The difference between the prediction and reality is called the error. People are really betting on whether the error of the estimated point differential will be positive or negative. If the line is unbiased, the sign of the error should be random -- underdogs should beat the spread about as often as favorites do.
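
As a rough illustration of the "error" idea, here is a minimal sketch. It assumes the spread is quoted from the away team's side (so +6.5 means the away team is getting 6.5 points). The function names `ats_error` and `ats_winner` are made up for this example and are not part of the scraping code.

```python
def ats_error(away_score, home_score, away_line):
    """Error of the line, seen from the away team's side.

    Positive: the away team beat the spread. Negative: the home team did.
    """
    return away_score + away_line - home_score


def ats_winner(away_score, home_score, away_line):
    """Which side won against the spread (ATS)."""
    error = ats_error(away_score, home_score, away_line)
    if error > 0:
        return "AWAY"
    if error < 0:
        return "HOME"
    return "PUSH"  # only possible when the line is a whole number


# New York (+6.5) lost 109-132 at Boston, so Boston covered easily.
print(ats_error(109, 132, 6.5))   # -16.5
print(ats_winner(109, 132, 6.5))  # HOME
```

A bet on either side is really a bet on the sign of that number.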

I grabbed NBA odds and betting data for this season from [sportsbookreview.com](https://www.sportsbookreview.com/betting-odds/nba-basketball/). I was curious how betting on basketball operated as a market. Is it efficient and unbiased?


All numbers are as of Jan 6th, 2025 (except for the update at the end). The website is missing some data, approximately 10 days in total.

In [2]:
data = scrape_sbr.clean_data()

The first thing I notice is that underdogs are 252-225 against the spread this season.

In [3]:
data.fave_dog.value_counts()

fave_dog
DOG     252
FAVE    225
Name: count, dtype: int64

 It could just be coincidence, but it's enough of a difference that betting every underdog would've been slightly profitable.

In [4]:
win_loss_report(252, 225)

record:   252 - 225
full vig (-110) units: 4.5
reduced juice (-106) : 13.5
win pct: 52.83%, expected wins: 238.5 , excess: 13.5, profit %: 0.94
z test: 1.2362450755382013, std: 10.920164833920778 , p-value: 0.10818374012879906
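
The source of `win_loss_report` isn't shown here, so the following is only a guess at the statistics behind it: a one-sided z test using the normal approximation to a fair coin flip. That assumption does reproduce the z, std and p-value printed above. The function name `z_test` is hypothetical.

```python
import math


def z_test(wins, losses):
    """One-sided z test of a win/loss record against a 50/50 coin."""
    n = wins + losses
    expected = n * 0.5                # expected wins if every bet is a coin flip
    std = math.sqrt(n * 0.5 * 0.5)    # standard deviation of the win count
    z = (wins - expected) / std
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, std, p_value


z, std, p = z_test(252, 225)
print(round(z, 3), round(std, 2), round(p, 3))  # 1.236 10.92 0.108
```

Read the p-value as "how often a coin-flipping bettor would do at least this well by luck".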


The vig, or juice, is the extra amount you have to risk in order to place the bet and is how the sportsbook makes money. Standard vig is "-110" which means risking <span class="tex2jax_ignore">$110 to win $100</span>. This is kind of a silly way of writing the odds, but it's the American way.

<span class="tex2jax_ignore">The amount of vig can dramatically affect the profitability of the bettor. "units" is the number of bets won as profit, after the vig. So risking $106 to win $100 on every underdog would have netted 13.5 units * $100 = $1350 so far, but risking $110 to win $100 would have netted only $450. In case you had any doubts, [your Uncle Juice is not a good man](https://youtu.be/726Ujz_KOHE?si=Bv71Z0vBIMB17pHP&t=41).</span>
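
To make the arithmetic concrete, here is a minimal sketch of how units and the break-even winning percentage fall out of the odds. It assumes every bet is placed at the same negative American odds. The names `units_won` and `break_even_pct` are made up for illustration.

```python
def units_won(wins, losses, odds=-110):
    """Profit in units, where one unit is what a winning bet pays.

    Only handles negative American odds such as -110 or -106.
    """
    risk = -odds / 100  # -110 means risking 1.10 units to win 1
    return wins - losses * risk


def break_even_pct(odds=-110):
    """Winning percentage needed to break even at the given odds."""
    risk = -odds / 100
    return risk / (risk + 1)


print(round(units_won(252, 225, -110), 2))    # 4.5
print(round(units_won(252, 225, -106), 2))    # 13.5
print(round(100 * break_even_pct(-110), 2))   # 52.38
print(round(100 * break_even_pct(-106), 2))   # 51.46
```

Four cents of juice moves the break-even point by almost a full percentage point, which is the whole margin for a record like 252-225.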

There's also a slight bias towards away teams winning, though that wouldn't have been profitable at standard vig.

In [5]:
data.winner_ats.value_counts()

winner_ats
AWAY    248
HOME    229
Name: count, dtype: int64

In [6]:
win_loss_report(248,229)

record:   248 - 229
full vig (-110) units: -3.9
reduced juice (-106) : 5.26
win pct: 51.99%, expected wins: 238.5 , excess: 9.5, profit %: -0.82
z test: 0.8699502383416972, std: 10.920164833920778 , p-value: 0.19216379949402418


It seems to me that a sportsbook would make more money if the side that gets more bets on it usually loses. Legalizing sports betting has undoubtedly increased the number of new bettors. It seems possible that naive gamblers would have a bias towards betting on favorites and home teams.

In a Dutch Book situation (discussed elsewhere), there is an equal amount of money on both sides of the line so the sportsbook doesn't care who wins because they make money either way. But if they could stack the deck in their favor, why wouldn't they? Sportsbooks know which of their customers are winning and losing, and which side those customers are on for a particular bet. That could be a signal for when to leave the line alone even though the money is imbalanced. What side is the "smart money" on?

The Sportsbookreview data I collected has the percent of money bet on each side. They don't indicate what sportsbook(s) the data comes from. I will be using the closing lines from the MGM Grand to test against. 

I decided to test how a strategy of always "fading the public" would work. In other words, always bet on the side that gets less money bet on it, under the premise that it's the side with more value on it.

The basic algorithm I came up with was:
1. if the AWAY team has < 50% of the money, take the AWAY team
2. if the AWAY team has > 50% of the money, take the HOME team
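
The two rules above can be sketched in a few lines. This is not the code in `fader`, which isn't shown; the parameter names are borrowed from the `fade_the_public` call, and the function name `fade_pick` is hypothetical. Note that a game where the away team has exactly 50% of the money gets no bet under either rule.

```python
def fade_pick(away_percent, lower_thresh=50, upper_thresh=50):
    """Return the side to bet ("AWAY" or "HOME"), or None for no bet.

    away_percent is the share of money on the away team, from 0 to 100.
    """
    if away_percent < lower_thresh:
        return "AWAY"  # public is on the home team, so take the away team
    if away_percent > upper_thresh:
        return "HOME"  # public is on the away team, so take the home team
    return None        # money is between the thresholds: skip the game


for away_percent in (40, 65, 50):
    print(away_percent, fade_pick(away_percent))
# 40 AWAY
# 65 HOME
# 50 None
```

Pulling the two thresholds apart creates a band of games that get skipped entirely.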

Both the simplicity and the cynicism of it speak to me.

In [7]:
fade_bets = fader.fade_the_public(lower_thresh=50,upper_thresh=50)

In [8]:
fade_bets.head()

Unnamed: 0.1,Unnamed: 0,away_names,away_lines,away_scores,away_percents,away_opens,home_names,home_lines,home_scores,home_percents,...,open_with_line,winner_ats,fave_dog,winner_ats_name,loser_ats_name,money_winner,money_loser,fade,fade_pick,fade_vs
0,0,New York,+6.5-115,109,40,+4.5-105,Boston,-6.5-105,132,60,...,-18.5,HOME,FAVE,Boston,New York,Boston,New York,AWAY,New York,Boston
1,1,Minnesota,-1.5+105,103,65,+1.5-115,L.A. Lakers,+1.5-130,110,35,...,-5.5,HOME,DOG,L.A. Lakers,Minnesota,Minnesota,L.A. Lakers,HOME,L.A. Lakers,Minnesota
2,0,Indiana,-5.5-110,115,76,-5.5-110,Detroit,+5.5-110,109,24,...,0.5,AWAY,FAVE,Indiana,Detroit,Indiana,Detroit,HOME,Detroit,Indiana
3,1,Milwaukee,-4.5-105,124,57,+3.5-105,Philadelphia,+4.5-115,109,43,...,18.5,AWAY,FAVE,Milwaukee,Philadelphia,Milwaukee,Philadelphia,HOME,Philadelphia,Milwaukee
4,2,Cleveland,-6.5-110,136,58,-3.5-110,Toronto,+6.5-110,106,42,...,26.5,AWAY,FAVE,Cleveland,Toronto,Cleveland,Toronto,HOME,Toronto,Cleveland


In [9]:
win_loss_from_df(fade_bets)

record:   254 - 214
full vig (-110) units: 18.6
reduced juice (-106) : 27.16
win pct: 54.27%, expected wins: 234.0 , excess: 20.0, profit %: 3.97
z test: 1.8490006540840969, std: 10.816653826391969 , p-value: 0.032228859135337906


winner_ats  AWAY  HOME
fade                  
AWAY         137   105
HOME         109   117


Hey, it worked. 54.3% isn't great, though it's not bad compared to handicappers that sell picks on the internet for &#36;35 a pop.

Even at high volume, there is a lot of downside risk with such a low winning percentage. A random walk can take you to a lot of bad neighborhoods. The closer the winning percentage is to 50%, the dicier the whole thing is.
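
One way to put a number on that risk: suppose the strategy's true winning percentage really is 54.27% and every bet is independent (both are assumptions, not findings). Then the chance of still finishing in the red at -110 is an exact binomial sum. The function name `prob_losing_record` is made up for this sketch.

```python
import math


def prob_losing_record(n_bets, win_pct, odds=-110):
    """Exact binomial probability of ending n_bets with negative units."""
    risk = -odds / 100  # units risked per bet to win 1 unit
    total = 0.0
    for wins in range(n_bets + 1):
        losses = n_bets - wins
        if wins - losses * risk < 0:  # this record loses money after the vig
            total += math.comb(n_bets, wins) * win_pct**wins * (1 - win_pct)**losses
    return total


p = prob_losing_record(468, 0.5427)
print(f"chance of losing money over 468 bets: {p:.1%}")
```

Under those assumptions the answer comes out to roughly one in five, even though the bettor has a real edge.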

Bets on the home team are going 117-109 (note you have to flip the numbers from the bottom row of the crosstab). 

That's enough to make money on reduced vig, but not full vig. Uncle Juice strikes again!

In [10]:
win_loss_report(117,109)

record:   117 - 109
full vig (-110) units: -2.9
reduced juice (-106) : 1.46
win pct: 51.77%, expected wins: 113.0 , excess: 4.0, profit %: -1.28
z test: 0.5321520841901914, std: 7.516648189186454 , p-value: 0.297310333164887


Just taking the AWAY teams yields a 56.6% winning percentage, which is about as well as the best professional handicappers can do. Of course, this is in backtesting, and predictions about the past are always easier than ones about the future.

In [11]:
win_loss_report(137,105)

record:   137 - 105
full vig (-110) units: 21.5
reduced juice (-106) : 25.7
win pct: 56.61%, expected wins: 121.0 , excess: 16.0, profit %: 8.88
z test: 2.05703790890632, std: 7.7781745930520225 , p-value: 0.01984128975085442


The rule for this algorithm is dead simple:
1. if the AWAY team has < 50% of the money, take the AWAY team
2. never take the HOME team.

Since bets on HOME lose more frequently than AWAY, I decided to raise the threshold for taking HOME games.

I played around with the threshold till I found the optimal value -- only take the HOME team if the AWAY team has over 60% of the money, otherwise don't bet the game. Similarly, I found a lower threshold of 49 to work a little better (only take the away team if they get 48% or less.)

I'm probably overfitting, which could hurt predictions in the future. But it slightly improved the winning percentage on both HOME and AWAY bets and increased profit by 6.5 units at reduced juice (9.5 units at full vig), despite betting on far fewer games.

In [12]:
threshold_60  = fader.fade_the_public(lower_thresh=49,upper_thresh=60)

win_loss_from_df(threshold_60)

record:   181 - 139
full vig (-110) units: 28.1
reduced juice (-106) : 33.66
win pct: 56.56%, expected wins: 160.0 , excess: 21.0, profit %: 8.78
z test: 2.347871376374779, std: 8.94427190999916 , p-value: 0.009440520078049408


winner_ats  AWAY  HOME
fade                  
AWAY         134   101
HOME          38    47


I'm curious which teams this algorithm is betting for and against.

Every bet has two sides -- you're always betting for one team and against another. And unless the bet is even money, it's never a bet on which team is going to win, it's a bet on whether the line is too high or too low.

So I've combined each team's record when betting for and against them, and am sorting by the most units won. The teams at the top are the ones the algorithm "gets".
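
Here is a minimal sketch of that combination for a single team. It assumes units are scored at full vig (-110); `fader.analyze_fade` itself isn't shown, so the function name `team_units` and its arguments are hypothetical.

```python
def team_units(win, lose, wins_against, losses_against, odds=-110):
    """Total units won on bets involving one team.

    win / lose: our record when betting ON the team.
    wins_against / losses_against: our record when betting AGAINST it.
    """
    risk = -odds / 100
    units_for = win - lose * risk
    units_against = wins_against - losses_against * risk
    return round(units_for + units_against, 1)


# 8-5 betting on a team and 12-2 betting against it
print(team_units(8, 5, 12, 2))  # 12.3
```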

In [13]:
who_we_bet = fader.analyze_fade(threshold_60)

who_we_bet.sort_values('total_units', ascending=False)

Unnamed: 0,win,lose,wins_against,losses_against,win_pct,units,win_units_against,total_units
Sacramento,8,5,12,2,0.615385,2.5,9.8,12.3
Brooklyn,7,1,7,3,0.875,5.9,3.7,9.6
Phoenix,4,4,12,2,0.5,-0.4,9.8,9.4
Minnesota,3,3,11,2,0.5,-0.3,8.8,8.5
Detroit,10,4,5,2,0.714286,5.6,2.8,8.4
San Antonio,8,3,8,4,0.727273,4.7,3.6,8.3
Chicago,9,5,7,4,0.642857,3.5,2.6,6.1
Miami,8,5,5,2,0.615385,2.5,2.8,5.3
Milwaukee,4,2,8,5,0.666667,1.8,2.5,4.3
Denver,1,1,12,7,0.5,-0.1,4.3,4.2


### Which NBA teams get the most money bet on them?

Here are the top money-getters according to sportsbookreview data. I'm calculating that as the median % that a team gets when they're at home plus the median % they get on the road.
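
For a single team the calculation looks something like the sketch below. The percentages are made up for illustration and are not real data, and `money_percent` is a hypothetical name (the actual work happens inside `scrape_sbr.money_vs_ats`, which isn't shown).

```python
from statistics import median


def money_percent(home_percents, away_percents):
    """Median money % at home plus median money % on the road."""
    return median(home_percents) + median(away_percents)


# Made-up percentages for one hypothetical team.
home = [62, 58, 70, 55]
away = [51, 49, 60]
print(money_percent(home, away))  # 111.0
```

A team that typically draws exactly half the money both at home and away would score 100, so anything above 100 means the public leans toward that team.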

In [14]:
money_data = scrape_sbr.money_vs_ats()

money_data['money_percents'].sort_values(ascending=False)[:10]

Cleveland        118.0
Dallas           115.0
Denver           115.0
Memphis          114.0
Phoenix          113.0
Golden State     112.5
Oklahoma City    112.0
Minnesota        108.0
Houston          106.5
Indiana          106.5
Name: money_percents, dtype: float64



This is a pretty good list of who I think casual fans would assume are good teams to bet on even if they didn't really follow the league. Who was good last year? Who has the best record this year?

But we're not concerned with the best basketball team, we're concerned with performance against the spread.

These are the teams with the top winning percentages against the spread. Toronto and Brooklyn are definitely surprises to me, given they're trying to not win basketball games.

In [15]:

money_data.sort_values('ats_win_pct', ascending=False)[:10]

Unnamed: 0,winner,loser,ats_win_pct,money_percents
Cleveland,22,10,0.6875,118.0
Memphis,22,11,0.666667,114.0
Oklahoma City,22,11,0.666667,112.0
Toronto,20,12,0.625,98.0
Brooklyn,19,12,0.612903,106.0
Houston,20,13,0.606061,106.5
L.A. Lakers,18,13,0.580645,100.0
L.A. Clippers,18,13,0.580645,89.0
Detroit,17,14,0.548387,90.5
Charlotte,17,14,0.548387,81.0


There's some evidence that bettors are rational. Cleveland has been getting the highest amount of money and also has the highest winning percentage against the spread. So it seems a lot of people are winning money on Cleveland. Likewise for Memphis, Oklahoma City, and Houston.

On the other hand, Phoenix and Minnesota are top 10 for money %, but bottom 10 for wins against the spread. As an NBA fan, that tracks. Both teams were hyped as contenders at the beginning of the season but have not met expectations.

In [16]:
money_data.sort_values('ats_win_pct', ascending=True)[:10]

Unnamed: 0,winner,loser,ats_win_pct,money_percents
Phoenix,9,22,0.290323,113.0
Minnesota,10,21,0.322581,108.0
Philadelphia,11,19,0.366667,83.0
Atlanta,12,20,0.375,96.0
Washington,12,19,0.387097,101.5
Sacramento,13,20,0.393939,101.0
New Orleans,13,20,0.393939,76.0
Milwaukee,13,18,0.419355,104.0
Boston,14,18,0.4375,104.0
Miami,13,16,0.448276,80.0


The correlation between win % against the spread and money % bet is positive, but it is not statistically significant.

In [17]:
import scipy.stats as stats
stats.spearmanr(money_data.money_percents, money_data.ats_win_pct)

SignificanceResult(statistic=np.float64(0.1405658415296944), pvalue=np.float64(0.45876222093572816))

I noticed that the algorithm is doing the best for and against the "mid" teams. Looking at that data points out a couple of obvious truths: 
1) it's bad to bet against teams that usually win versus the spread
2) it's bad to bet in favor of teams that usually lose. 

This is pretty naive, because there's no good reason to believe that a team that has a good record against the spread now will end up with a good record at the end of the year.

So my new algorithm is the same logic as before, but eliminate any bets that are against the best 7 teams ATS, or for the 7 worst teams ATS as of the previous day.  A sliding window for record versus the spread might do a little better so it's only filtering out teams that have been good/bad recently, but this is simple. I picked top 7 because it worked a little better than top 5. That's definitely another arbitrary choice.
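
The extra filter can be sketched as follows. This is an illustration under assumptions, not the code in `fader.superfade`: it takes each team's ATS record as of the previous day as a plain dict, breaks ties arbitrarily, and uses hypothetical names (`rank_teams_ats`, `keep_bet`) apart from the `eliminate_top` parameter.

```python
def rank_teams_ats(records):
    """Sort team names from best to worst ATS winning percentage.

    records maps team -> (ats_wins, ats_losses) as of the previous day.
    """
    def win_pct(team):
        wins, losses = records[team]
        return wins / (wins + losses) if wins + losses else 0.5

    return sorted(records, key=win_pct, reverse=True)


def keep_bet(pick, opponent, records, eliminate_top=7):
    """Drop bets against the best ATS teams or on the worst ATS teams."""
    ranked = rank_teams_ats(records)
    best = set(ranked[:eliminate_top])
    worst = set(ranked[len(ranked) - eliminate_top:])
    return opponent not in best and pick not in worst


# Tiny made-up league, filtering only the single best and worst team.
records = {"A": (20, 10), "B": (15, 15), "C": (16, 14), "D": (9, 21)}
print(keep_bet("B", "A", records, eliminate_top=1))  # False: A is the best team
print(keep_bet("D", "B", records, eliminate_top=1))  # False: D is the worst team
print(keep_bet("B", "C", records, eliminate_top=1))  # True
```

Betting on a top team or against a bottom team is still allowed; only the two losing directions are filtered out.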

I think I'll call this algorithm [DEMAR](https://www.basketball-reference.com/players/d/derozde01.html) because it's shooting all middies. 

![derozan, monster of the midrange](img/demar.png)

In [18]:
newest_fade = fader.superfade(eliminate_top=7)

Skipping 2024-11-05, no data
Skipping 2024-11-28, no data
Skipping 2024-11-29, no data
Skipping 2024-11-30, no data
Skipping 2024-12-01, no data
Skipping 2024-12-02, no data
Skipping 2024-12-03, no data
Skipping 2024-12-04, no data
Skipping 2024-12-18, no data
Skipping 2024-12-24, no data


The results from the latest iteration of DEMAR are just silly:

In [19]:
win_loss_from_df(newest_fade)

record:   105 - 55
full vig (-110) units: 44.5
reduced juice (-106) : 46.7
win pct: 65.62%, expected wins: 80.0 , excess: 25.0, profit %: 27.81
z test: 3.952847075210474, std: 6.324555320336759 , p-value: 3.86133977526848e-05


winner_ats  AWAY  HOME
fade                  
AWAY          80    43
HOME          12    25


Looking at the record for/against each team, this managed to preserve almost all the wins while cutting down on losses to teams like Memphis and OKC.

In [20]:
fader.analyze_fade(newest_fade).sort_values('total_units', ascending=False)

Unnamed: 0,win,lose,wins_against,losses_against,win_pct,units,win_units_against,total_units
Sacramento,6,1,8,2,0.857143,4.9,5.8,10.7
Phoenix,0,0,10,2,,0.0,7.8,7.8
Detroit,7,2,3,1,0.777778,4.8,1.9,6.7
Minnesota,0,0,8,2,,0.0,5.8,5.8
Milwaukee,2,0,7,3,1.0,2.0,3.7,5.7
Brooklyn,5,0,0,0,1.0,5.0,0.0,5.0
Toronto,5,0,0,0,1.0,5.0,0.0,5.0
L.A. Lakers,5,2,3,1,0.714286,2.8,1.9,4.7
Miami,3,3,5,0,0.5,-0.3,5.0,4.7
Orlando,6,3,3,1,0.666667,2.7,1.9,4.6


It's making money on 25 teams and losing money on 5 teams.

## Forward testing

I started building the model on 1/6, and have avoided looking at the data since then. It's now 1/17.  

I suspect the hand-optimized parameter choices I made -- thresholds of 49%/60% for the money percentages, throwing out top 7/bottom 7 teams -- could hurt the results going forward.

In [21]:
#1. fetch data from missing window
#new_scrapes = scrape_sbr.fetch_data_range(start='2025-01-06', end='2025-01-17')

#2. generate picks.
## this is a bit of a hack -- need to get data since the beginning of the year to have 
## wins/loss ATS counts work right
later_data = scrape_sbr.clean_data(start="2024-10-22", end="2025-01-17")
superfade = fader.superfade(eliminate_top=7, base_picks=later_data)

print("\n")
# only score the games after jan 6th.
win_loss_from_df(superfade[superfade.game_date > '2025-01-06'])

missing scores from 1 games
Skipping 2024-11-05, no data
Skipping 2024-11-28, no data
Skipping 2024-11-29, no data
Skipping 2024-11-30, no data
Skipping 2024-12-01, no data
Skipping 2024-12-02, no data
Skipping 2024-12-03, no data
Skipping 2024-12-04, no data
Skipping 2024-12-18, no data
Skipping 2024-12-24, no data
missing scores from 1 games
Skipping 2025-01-11, no data


record:   17 - 10
full vig (-110) units: 6.0
reduced juice (-106) : 6.4
win pct: 62.96%, expected wins: 13.5 , excess: 3.5, profit %: 22.22
z test: 1.3471506281091268, std: 2.598076211353316 , p-value: 0.08896586263412731


winner_ats  AWAY  HOME
fade                  
AWAY          12     8
HOME           2     5


Without the "superfade magic"$^{(TM)}$, it doesn't do nearly as well, but is still profitable.

In [22]:
standard_fade = fader.fade_the_public(games=later_data)

win_loss_from_df(standard_fade[standard_fade.game_date > '2025-01-06'])

record:   30 - 27
full vig (-110) units: 0.3
reduced juice (-106) : 1.38
win pct: 52.63%, expected wins: 28.5 , excess: 1.5, profit %: 0.53
z test: 0.39735970711951313, std: 3.774917217635375 , p-value: 0.3455511119224214


winner_ats  AWAY  HOME
fade                  
AWAY          23    20
HOME           7     7


I was genuinely expecting it to crash and burn on new data, and it didn't, not yet, at least. Failure is always an option, though.

## Conclusion
Did I mention that this project is called "Don't Be A Sucker"? I haven't done anything (intentionally) dishonest or misleading here, other than choosing some arbitrary tuning parameters. 

The basic strategy of "if the AWAY team has < 50% of the money, take the AWAY team, otherwise don't bet" went 137-105, which is more than respectable on its own. So I think there's something to what I found.

But there are limits to how predictable the world is. Beating the line even 56% of the time is pretty unlikely, because humans are notoriously unreliable in both directions and there are more analytics than ever. Casual fans tend to concentrate on when a good team "chokes", but it seems to me just as many basketball games are decided because [a couple random guys went off](https://plaintextsports.com/nba/2025-01-15/atl-chi). Some nights it's Keaton Wallace's world, and the rest of us are just living in it. 

Should the money data tell you who is going to win a basketball game? Can analytics measure the size of the fight in the dog? Of course not. There are only probabilities. 

It can conceivably tell you when bets may be over- or underpriced. How inefficient would the market have to be to allow a simple strategy like this to win 62% of the time? It's possible to convert between winning percentages and the number of points on the line with [a conversion chart](https://www.boydsbets.com/nba-spread-to-moneyline-conversion/).

It's not really plausible to me that a person or an automated system could find an advantage of four points or more on the line. "A couple" points seems far more likely.

I'm gonna let DEMAR cook for a while. My guess is that the very basic strategy will continue to be (slightly) profitable, but the more complex one will fall off.

I'm still looking for good historical data on betting percentages. The sportsbookreview site has older data, but it's missing for most games and includes unlikely data (such as betting percentages of 100% on one team). Sportsbooks like DraftKings [offer daily percentages](https://dknetwork.draftkings.com/draftkings-sportsbook-betting-splits/) but no historical info. But I'll keep digging.

Next up: the basics of sports gambling, playing the parlays with Nephew Doug, and why Frank P. Ramsey was the Len Bias of mathematics.