# Pythag Correlation Adjusted for Blowouts
I've often wondered whether MLB teams' Pythagorean win estimators could be made more accurate
by reducing the weight of extra runs scored in blowouts. Once the game is out of hand, the losing
team is probably not throwing its best pitchers. It might even have a position player pitching.
My hypothesis is that running up the score in those circumstances is less indicative of true talent than
consistently winning by a solid but non-blowout margin.
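
For reference, this is the estimator in question, written in the form the code below uses. The exponent 1.83 comes from the Wikipedia article cited in the code, and the symbols $RS$ (runs scored), $RA$ (runs allowed), and $G$ (games played) are just shorthand chosen for this note:

$$
\text{Pythag wins} = G \cdot \frac{RS^{1.83}}{RS^{1.83} + RA^{1.83}} = \frac{G}{1 + (RA/RS)^{1.83}}
$$

Capping blowouts leaves the formula alone. It only changes the $RS$ and $RA$ totals that get plugged into it.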

## What is a blowout?
That's a great question. Someone could probably do the math to determine at what run differential, and at what
point in the game, a loss becomes a near certainty and the losing team shifts from trying to win to trying to
finish the game with the least damage to its pitching staff for the next game.

Someone could. But I didn't. In my view, if you need a grand slam or more to tie the game, you're usually
out of it. To simplify matters, I'm testing with run differential caps of 4, 5, and 6. Fewer than that and it's
still a close game. A save situation! More than that, and the sample is going to be too small to be meaningful.
Most teams aren't playing that many games decided by 7+ runs. And if they are, they're probably either amazing
or terrible and therefore winning/losing a lot of 4 and 5 run games too.
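
To make the cap concrete, here is a minimal sketch of what capping does to a final score. The helper name `cap_margin` and the sample scores are made up for illustration. The notebook's own version of this logic is `limited_run_differential` in the full code further down.

```python
def cap_margin(winner_runs, loser_runs, max_diff):
    """Return the winner's runs after limiting the margin to max_diff.

    Hypothetical helper for illustration: only the winner's total is
    reduced, and the loser's total is left alone.
    """
    return min(winner_runs, loser_runs + max_diff)


# A made-up 12-3 blowout under each cap tested here.
for cap in (4, 5, 6):
    print(f"cap {cap}: 12-3 counts as {cap_margin(12, 3, cap)}-3")

# A made-up 5-3 game is inside every cap, so it is left alone.
print(f"cap 4: 5-3 counts as {cap_margin(5, 3, 4)}-3")
```

With a cap of 4, the 12-3 game is counted as 7-3, so the five "extra" runs never reach the team's runs-scored total.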

## What I did
I pulled the entire 2024 season's game results from Baseball Reference's schedule page (https://www.baseball-reference.com/leagues/majors/2024-schedule.shtml)
and each team's actual win total from the standings page (https://www.baseball-reference.com/leagues/majors/2024-standings.shtml).
B-Ref wasn't being friendly about scraping with the Python `requests` library (which is their right). So I
copied the HTML right out of the browser's inspection tab and parsed it with `BeautifulSoup` to create the dataset.
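
As a rough sketch of the parsing step, the snippet below pulls the team shortcodes and scores out of a single game entry. The markup in `SAMPLE` is an assumption, modeled on what the full code expects (a `<p class="game">` with two team links and two scores in parentheses), and the score in it is made up. It also uses the standard library's `html.parser` in place of `BeautifulSoup` so that it runs without any extra installs. The `GameParser` class is a hypothetical name.

```python
from html.parser import HTMLParser
import re

# Assumed markup for one game entry; the score is made up.
SAMPLE = (
    '<p class="game">'
    '<a href="/teams/NYY/2024.shtml">New York Yankees</a> (5) @ '
    '<a href="/teams/BOS/2024.shtml">Boston Red Sox</a> (3)'
    '</p>'
)


class GameParser(HTMLParser):
    """Collect team shortcodes from link hrefs and keep the visible text."""

    def __init__(self):
        super().__init__()
        self.teams = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # "/teams/NYY/2024.shtml" splits into ['', 'teams', 'NYY', ...]
            parts = dict(attrs).get("href", "").split("/")
            if len(parts) > 2 and parts[1] == "teams":
                self.teams.append(parts[2])

    def handle_data(self, data):
        self.text.append(data)


parser = GameParser()
parser.feed(SAMPLE)

# Scores are the integers wrapped in parentheses, in page order.
scores = [int(s) for s in re.findall(r"\((\d+)\)", "".join(parser.text))]

print(parser.teams, scores)
```

The shortcode is the third piece of the link path, which is the same trick `parse_team_from_anchor` relies on in the full code.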

See the full code below:

In [3]:
from bs4 import BeautifulSoup
import re
import pandas as pd


shortcode_to_team = {
    "ARI": "Arizona Diamondbacks",
    "ATL": "Atlanta Braves",
    "BAL": "Baltimore Orioles",
    "BOS": "Boston Red Sox",
    "CHC": "Chicago Cubs",
    "CHW": "Chicago White Sox",
    "CIN": "Cincinnati Reds",
    "CLE": "Cleveland Guardians",
    "COL": "Colorado Rockies",
    "DET": "Detroit Tigers",
    "HOU": "Houston Astros",
    "KCR": "Kansas City Royals",
    "LAA": "Los Angeles Angels",
    "LAD": "Los Angeles Dodgers",
    "MIA": "Miami Marlins",
    "MIL": "Milwaukee Brewers",
    "MIN": "Minnesota Twins",
    "NYM": "New York Mets",
    "NYY": "New York Yankees",
    "OAK": "Oakland Athletics",
    "PHI": "Philadelphia Phillies",
    "PIT": "Pittsburgh Pirates",
    "SDP": "San Diego Padres",
    "SEA": "Seattle Mariners",
    "SFG": "San Francisco Giants",
    "STL": "St. Louis Cardinals",
    "TBR": "Tampa Bay Rays",
    "TEX": "Texas Rangers",
    "TOR": "Toronto Blue Jays",
    "WSN": "Washington Nationals"
}

team_to_shortcode = {v: k for k, v in shortcode_to_team.items()}

def parse_team_from_anchor(anchor):
    return anchor['href'].split('/')[2]

# Formula per Wikipedia: https://en.wikipedia.org/wiki/Pythagorean_expectation
def pythag_wins(runs_scored, runs_allowed, games_played):
    return 1 / (1 + ((runs_allowed / runs_scored) ** 1.83)) * games_played

def limited_run_differential(home_score, away_score, max_diff):
    """
    Limit the run differential of the game to reduce the impact of blowouts.
    The winning team's score is reduced and the losing team's score is left
    unchanged.
    """
    actual_diff = abs(home_score - away_score)
    if actual_diff > max_diff:
        if home_score > away_score:
            home_score -= (actual_diff - max_diff)
        else:
            away_score -= (actual_diff - max_diff)

    return {
        "home_score": home_score,
        "away_score": away_score
    }

# I copied this straight from the source of https://www.baseball-reference.com/leagues/majors/2024-schedule.shtml
with open('data/2024_game_results.html') as file:
    soup = BeautifulSoup(file, features='html.parser')

games = soup.find_all('p', class_='game')

run_differentials = {}
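# Per-team accumulator; every team starts from a copy of this.
# Note: 'diff' is never updated in the loop below, so it prints as 0.
skeleton = {
    'games': 0,
    'rs': 0,
    'ra': 0,
    'diff': 0,
    'rs_blowout4': 0,
    'ra_blowout4': 0,
    'ra_blowout5': 0,
    'rs_blowout5': 0,
    'ra_blowout6': 0,
    'rs_blowout6': 0,
}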

# Accumulate raw and blowout-capped run totals for both teams in every game.
for g in games:
    # As far as I can tell, each schedule entry reads
    # "Away Team (runs) @ Home Team (runs)", so the first link and score
    # belong to the visiting team. The totals below are symmetric, so this
    # labeling does not change any results.
    game_anchors = g.find_all('a')
    away_team = parse_team_from_anchor(game_anchors[0])
    home_team = parse_team_from_anchor(game_anchors[1])
    scores = re.findall(r'\((\d+)\)', g.text)
    away_score = int(scores[0])
    home_score = int(scores[1])

    if home_team not in run_differentials:
        run_differentials[home_team] = skeleton.copy()

    if away_team not in run_differentials:
        run_differentials[away_team] = skeleton.copy()

    run_differentials[home_team]['games'] += 1
    run_differentials[away_team]['games'] += 1
    run_differentials[home_team]['rs'] += home_score
    run_differentials[home_team]['ra'] += away_score
    run_differentials[away_team]['rs'] += away_score
    run_differentials[away_team]['ra'] += home_score

    # Repeat the bookkeeping with the winning margin capped at 4, 5, and 6 runs.
    for cap in (4, 5, 6):
        capped = limited_run_differential(home_score, away_score, cap)
        run_differentials[home_team][f'rs_blowout{cap}'] += capped['home_score']
        run_differentials[home_team][f'ra_blowout{cap}'] += capped['away_score']
        run_differentials[away_team][f'rs_blowout{cap}'] += capped['away_score']
        run_differentials[away_team][f'ra_blowout{cap}'] += capped['home_score']

for team_id, team_stats in run_differentials.items():
    run_differentials[team_id]['pythag'] = pythag_wins(team_stats['rs'], team_stats['ra'], team_stats['games'])
    run_differentials[team_id]['pythag4'] = pythag_wins(team_stats['rs_blowout4'], team_stats['ra_blowout4'], team_stats['games'])
    run_differentials[team_id]['pythag5'] = pythag_wins(team_stats['rs_blowout5'], team_stats['ra_blowout5'], team_stats['games'])
    run_differentials[team_id]['pythag6'] = pythag_wins(team_stats['rs_blowout6'], team_stats['ra_blowout6'], team_stats['games'])

# CITE: https://www.baseball-reference.com/leagues/majors/2024-standings.shtml
actual_team_results = pd.read_csv('data/2024_team_results.csv')

for row in actual_team_results.itertuples():
    # Don't process the last "Average" row
    if row.Tm == 'Average':
        continue
    team_shortcode = team_to_shortcode[row.Tm]
    rd = run_differentials[team_shortcode]
    rd['actual_wins'] = row.W

df = pd.DataFrame.from_dict(run_differentials, orient='index')
print(df)
print(df[['pythag', 'actual_wins']].corr())
print(df[['pythag4', 'actual_wins']].corr())
print(df[['pythag5', 'actual_wins']].corr())
print(df[['pythag6', 'actual_wins']].corr())

     games   rs   ra  diff  rs_blowout4  ra_blowout4  ra_blowout5  \
LAD    162  842  686     0          746          637          655   
SDP    162  760  669     0          682          617          638   
COL    162  682  929     0          644          810          850   
ARI    162  886  788     0          752          707          735   
LAA    162  635  797     0          599          704          738   
BAL    162  786  699     0          701          634          654   
DET    162  682  642     0          602          582          604   
CHW    162  507  813     0          481          703          740   
WSN    162  660  764     0          613          670          697   
CIN    162  699  694     0          631          625          647   
NYY    162  815  668     0          704          606          627   
HOU    161  740  649     0          666          599          622   
MIN    162  742  735     0          664          662          685   
KCR    162  735  644     0        

## Conclusion
Pythag is pretty darn good as it is!

There's a small bump in the correlation with actual wins when you cap blowouts at 4 runs, and an even
smaller bump at 5. At 6 runs, the "smarter" method falls below the results of just using Pythag alone.
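
To see how much the cap moves a single team's estimate, here is a small sketch that plugs the LAD row from the output above into the same Pythagorean formula. The run totals come straight from the printed table. The `exponent` keyword argument is an addition for illustration and is not part of the notebook's function.

```python
def pythag_wins(runs_scored, runs_allowed, games_played, exponent=1.83):
    """Pythagorean win estimate, same form as the notebook's function."""
    return games_played / (1 + (runs_allowed / runs_scored) ** exponent)


# LAD row from the printed DataFrame above.
games = 162
raw = pythag_wins(842, 686, games)       # rs, ra
capped4 = pythag_wins(746, 637, games)   # rs_blowout4, ra_blowout4

print(f"raw Pythag wins:       {raw:.1f}")
print(f"4-run-cap Pythag wins: {capped4:.1f}")
print(f"shift from capping:    {capped4 - raw:+.1f}")
```

For this row the capped estimate comes out lower than the raw one, because the cap trims more from runs scored than from runs allowed. Whether that shift tracks actual wins any better is exactly what the correlations are measuring.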

This iteration uses only the 2024 season (30 teams), so a larger sample would be more convincing. Maybe I'll
copy 2021-2023 as well and see if that changes anything.