# An Analysis of the Cleveland Guardians – June 10th, 2024
### By: Eddie Dew [@da_mountain_dew](https://x.com/da_mountain_dew)
[GitHub](https://github.com/EddietheProgrammer) [LinkedIn](https://www.linkedin.com/in/edde/)


**Disclaimer:** Shoutout to [pybaseball](https://github.com/jldbc/pybaseball) for scraping most of these webpages for me, making it easier to load data. Feel free to check them out.

# Introduction

The Cleveland Guardians boast an impressive 42-22 record through 64 games in this MLB season. In this article, we delve into the myriad reasons behind their success and explore a simple Monte Carlo simulation to project their potential final win tally based on their seasonal run productivity.

The primary aim of this article is to offer a comprehensive blend of in-depth code analysis and polished writing, catering to both programmers and fans alike. Extensive effort has been invested to achieve this balance. Enjoy the read!

In [1]:
########### Importing libraries ################################################################
########### If you are trying to run this cell and encounter a ModuleNotFoundError: ############
########### make a new cell and run %pip install [package name that produced error] ############

import pybaseball
import pandas as pd
import numpy as np
import datetime
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from tqdm import tqdm
from sklearn import linear_model
import os
import warnings

warnings.filterwarnings('ignore')

In [2]:
def make_ids(df: pd.DataFrame, param: str) -> pd.DataFrame:
    """
    Custom made ids by sorting the team names then autogenerating them. Will come in handy 
    with mapping tables to the Image.
    """
    df = df.sort_values(by = param)
    df['id'] = range(1, len(df) + 1)
    return df


In [3]:
# Setting up the directory to load in the team logos. These are uploaded on the GitHub link below:
# https://github.com/UNC-Charlotte-Sports-Analytics-Club/Vizs/blob/main/mlb-logos/{}.png?raw=true
# {} - image id number

directory = './mlb_logos'

# Formatting ID number to the respective png file for mapping purposes
ids = {int(logo.split('.')[0]): logo for logo in os.listdir(directory) if logo.endswith('.png')}
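
To make the mapping concrete, here is a minimal sketch that uses a made-up directory listing in place of `./mlb_logos`. The file names and the three-team list are assumptions for illustration only. It also assumes the logos are numbered in alphabetical order of team abbreviation, which is the convention `make_ids` relies on.

```python
# Hypothetical directory listing; the real one comes from os.listdir('./mlb_logos')
listing = ['2.png', '1.png', '3.png', 'README.md']

# Same comprehension as in the cell above: numeric file stem -> file name, PNGs only
ids = {int(logo.split('.')[0]): logo for logo in listing if logo.endswith('.png')}

# make_ids numbers teams 1..n after sorting by name, so sorted order is the join key
teams = ['CLE', 'ARI', 'ATL']  # illustrative subset, not the full league
team_ids = {team: i for i, team in enumerate(sorted(teams), start=1)}

# Look up each team's logo file through its generated id
logo_for_team = {team: ids[team_id] for team, team_id in team_ids.items()}
print(logo_for_team)
```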

# 1. Bullpen

The bullpen deserves significant praise for its exceptional performance this season. Despite a rocky start for the starting pitchers, the bullpen has maintained its dominance. Established names like Emmanuel Clase, Eli Morgan, and Nick Sandlin have been joined by newcomers like Hunter Gaddis and Cade Smith, who have integrated seamlessly into the bullpen. Cleveland's sustained success owes much to reliable pitching: the club ranks 6th in wins among all teams since 2010.

Consistency is the hallmark of their bullpen, which has ranked in the top 10 throughout the 2020s, and this season is no exception. The group currently leads the league in bullpen ERA and batting average against, and ranks third in strikeouts. From the 7th inning onward it has posted a remarkable 2.10 ERA, the best in baseball. That ability to keep games close and facilitate comebacks makes the Guardians formidable opponents.

Advanced statistics further highlight their prowess, with the team boasting the league's best FIP of 2.68 and a league-best 2.57 BB/9. Their knack for getting batters out while minimizing walks underscores their pitching excellence. Credit is due to the coaching staff and front office for effectively utilizing pitchers to their strengths.

![image](./Guardians-vizs/bullpen.png)
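
For readers unfamiliar with the two metrics, the sketch below shows how FIP and BB/9 are computed. The formula is the standard FanGraphs definition of FIP. The pitching line is hypothetical (it is not the Guardians' actual bullpen total), and the FIP constant is the 2024 `cFIP` value from the FanGraphs Guts table that is loaded later in this article.

```python
def fip(hr, bb, hbp, k, ip, c_fip):
    """Fielding Independent Pitching: only homers, walks, HBP and strikeouts count."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + c_fip


def bb_per_9(bb, ip):
    """Walks allowed per nine innings."""
    return 9 * bb / ip


# Hypothetical bullpen line, NOT real Guardians data.
# ip is true decimal innings (240.0), not box-score notation such as 240.1.
line = dict(hr=20, bb=72, hbp=8, k=260, ip=240.0)
C_FIP_2024 = 3.151  # cFIP from the FanGraphs Guts table

print(f"FIP:  {fip(c_fip=C_FIP_2024, **line):.2f}")
print(f"BB/9: {bb_per_9(line['bb'], line['ip']):.2f}")
```

Because strikeouts carry a negative weight and walks a positive one, a staff that misses bats and limits free passes is rewarded on both measures.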

While numerous players deserve recognition for their contributions, Emmanuel Clase stands out as the linchpin of the bullpen. The elite closer is poised for another outstanding season, currently ranking second in saves with 19. Sporting an impressive 0.57 ERA and 1.91 FIP, Clase has returned to peak form after a somewhat subdued performance last year. Notably, he has increased his outside zone swing percentage by 6% compared to last season, demonstrating improved pitch execution.
![image](./Guardians-vizs/clase.png)

A subtle adjustment to his cutter has yielded remarkable results, as evidenced by increased horizontal movement and decreased weak contact. With a Stuff+ score of 146 (Stuff+ is scaled so that 100 is league average), Clase's pitch arsenal grades out nearly **1.5** times as well as the average pitcher's, a testament to his exceptional abilities.

While Clase's performance stands out, it's important to acknowledge that the bullpen as a whole has been thriving this season. However, delving into Clase's exceptional numbers sheds light on the bullpen's overall success without overshadowing other noteworthy achievements of the team this season.


# 2. Timely Hitting

The Guardians have embraced the philosophy of small ball, prioritizing putting the ball in play and moving runners around over raw power. In today's game, characterized by many teams relying on sluggers who either strike out or hit home runs, there's a resurgence of power-speed players finding their place in the new era of hitting. Young hitters like Elly De La Cruz and Bobby Witt Jr. have made significant impacts with their talent for hitting and base running.

The Guardians have begun adopting this philosophy, reflected in their increased power numbers this year (albeit from a small sample size) compared to last year. Let's examine their hitting splits since 2021:

* 2021: .238 BA | .710 OPS | 109 SB 
* 2022: .254 BA | .699 OPS | 119 SB 
* 2023: .250 BA | .694 OPS | 151 SB 
* 2024: .240 BA | .718 OPS | 58 SB 

These stats illustrate the evolution of their hitting approach. In 2021, they were a power-hitting team with a higher strikeout rate (23%) and fewer stolen bases. In 2022, roster changes led to the formation of one of the youngest teams in baseball, emphasizing small ball and reducing strikeouts. Their playstyle has continued to evolve this season, resembling their 2021 form in terms of power hitting and on-base ability, while significantly enhancing their stolen base success (on pace for 147 stolen bases).
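
The "on pace" figure is a straight-line extrapolation from 64 games to a 162-game season. A minimal sketch of that arithmetic, using only the totals quoted in this article (the function name is my own):

```python
GAMES_IN_SEASON = 162


def on_pace(total_so_far, games_played, season_games=GAMES_IN_SEASON):
    """Linearly extrapolate a counting stat to a full season."""
    return total_so_far / games_played * season_games


print(round(on_pace(58, 64)))  # stolen bases through 64 games -> 147
print(round(on_pace(42, 64)))  # wins through 64 games -> 106
```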

Using Bill James's [Power/Speed Number](https://www.baseball-reference.com/bullpen/Power/speed_number#:~:text=Bill%20James%20invented%20the%20Power,stat%20with%20little%20analytical%20value.), the Guardians rank 8th, indicating a balanced approach between speed and power. This balance aligns with their hitting philosophy, emphasizing moving runners. They have also maintained a consistently lower strikeout rate, in line with previous seasons. This year, they've been efficient at scoring runs, as depicted in the plot below.
![image](./Guardians-vizs/team_scoring.png)
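
For reference, the Power/Speed Number is the harmonic mean of home runs and stolen bases, so a team only scores well when it has both. The sketch below uses the Guardians' 58 stolen bases quoted in this article. The home run total is a placeholder, since the team figure is not given here.

```python
def power_speed_number(hr, sb):
    """Bill James's Power/Speed Number: the harmonic mean of HR and SB."""
    if hr + sb == 0:
        return 0.0
    return 2 * hr * sb / (hr + sb)


# sb=58 comes from the article; hr=70 is a placeholder, not the real team total
print(round(power_speed_number(hr=70, sb=58), 1))

# The harmonic mean punishes imbalance: same HR + SB total, very different scores
print(power_speed_number(hr=64, sb=64), power_speed_number(hr=120, sb=8))
```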

Overall, the team's hitting performance has significantly improved compared to previous years. Notably, they excel with runners on base and in scoring position, leading all teams with a league-best .881 OPS and .372 wOBA in these situations. Being efficient with runners in scoring position is what generates World Series winning teams. From 2017 to 2022, teams that finished in the top 5 in wOBA with runners in scoring position won the pennant. The last time a team won a World Series without finishing top 10 in this metric was the San Francisco Giants in 2014. Timely hits matter. 

Notable Guardians hitters include Jose Ramirez, who has been on a tear after a slow start to the season, Steven Kwan, enjoying a bounce-back season with a .370 batting average, and David Fry, who has emerged out of nowhere with a .333 batting average. Josh Naylor has contributed power to the lineup with 16 home runs, complementing Ramirez's efforts. While I won't delve too deeply into their stats, their performances are worth further investigation.

While the Guardians excel in hitting, they also pose a threat on the basepaths, ranking 9th in baseball with 58 stolen bases. This is intriguing because they are not known for their speed; their average sprint speed is below average, which explains why they have been caught stealing 24 times. However, their aggressive approach on the basepaths demonstrates their commitment to maximizing scoring opportunities.
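
One way to judge whether that aggression pays off is to compare the success rate with the break-even rate implied by run values. The sketch below uses the 2024 `runSB` and `runCS` constants from the FanGraphs Guts table (the same table loaded later in this article) together with the 58 steals and 24 times caught quoted above. Treat it as a rough estimate: it ignores game situation entirely.

```python
RUN_SB = 0.2     # runs gained per stolen base (runSB, FanGraphs Guts 2024)
RUN_CS = -0.401  # runs lost per caught stealing (runCS, FanGraphs Guts 2024)

sb, cs = 58, 24  # Guardians' totals quoted above

success_rate = sb / (sb + cs)
# Solve p * RUN_SB + (1 - p) * RUN_CS = 0 for p
break_even = -RUN_CS / (RUN_SB - RUN_CS)
net_runs = sb * RUN_SB + cs * RUN_CS

print(f"success rate: {success_rate:.3f}")
print(f"break-even:   {break_even:.3f}")
print(f"net runs:     {net_runs:+.2f}")
```

Under these constants the running game comes out slightly positive: the success rate sits a few points above break-even, for a net gain of roughly two runs.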

# 3. Catching

The Guardians' defense has been solid so far. Their 30 Defensive Runs Saved rank 3rd in the majors, which is a big plus. The outfield has been the bright spot, keeping runners at bay with its arms. The infield is a little below average, which could be hurting the team's range factor, but the hitting compensates for this. That's being a little picky, but it is still worth mentioning.

What I want to highlight is their catching. Austin Hedges was their primary leader when he came over in 2020 and has instilled a positive culture in the dugout. His defensive prowess has rubbed off on young catcher Bo Naylor. While he has had some difficulties with the bat, Naylor has stepped up his defense behind the plate. He is getting a 50.5% called strike rate on pitches in the shadow zone (47.2% in 2023), **5th** best in the majors. Most of these calls come on outside pitches to right-handed hitters. Here's an overall look at where the Guardians' catchers get strikes called in the shadow zone.

![image](./Guardians-vizs/catching.png)

Austin Hedges has continued to do his job, throwing out 30% of would-be base stealers so far this season (Naylor is at 18%, for those wondering). While the fielding and catching statistics may not wow you, they matter for keeping the Guardians competitive in games, especially the stolen strikes that support their top-notch pitching.

# Monte Carlo Simulation

I've given you some analysis and background for the Guardians. I now want to have a little fun to see where they rank, team strength included, in run production compared to other teams. I then will show a Monte Carlo simulation using those run production values to try to predict how many wins the Guardians may have by the end of the season.

### Modeling Run Production – Adjusted for Opponent

I was inspired by an article written by [@BudDavis](https://x.com/JBudDavis) on using [Ridge Regression](https://blog.collegefootballdata.com/opponent-adjusted-stats-ridge-regression/) to adjust teams' stats for their opponents. He did it for college football using Expected Points Added (EPA) as the target to adjust. I wanted to do something similar for baseball but was unsure which stat to pick. Since EPA is directly related to the points a team produces, I picked the stat I felt best captures run creation: [Weighted Runs Above Average](https://library.fangraphs.com/offense/wraa/) (wRAA). Essentially, this stat measures the number of runs a player contributes to their team compared to the average player. Keep in mind that wRAA may benefit teams with more plate appearances, as they get more opportunities to score runs.

That takes me to the data cleansing for the model. I downloaded a CSV from the FanGraphs [Guts](https://www.fangraphs.com/guts.aspx?type=cn) page to get the weights for wOBA (weighted on-base average), which is fed into wRAA. The derivation is shown in the code below.

In [7]:
fg_batting = pybaseball.team_batting(2024) # Extract team data (specifically for this, the team abbreviations)

weights = pd.read_csv('/Users/eddie/Downloads/FanGraphs Leaderboard.csv').query('Season == 2024')
weights = weights.to_dict(orient='list')
weights = {key: value[0] for key, value in weights.items()}

teams = fg_batting['Team'].tolist()

weights

{'Season': 2024,
 'wOBA': 0.309,
 'wOBAScale': 1.267,
 'wBB': 0.693,
 'wHBP': 0.724,
 'w1B': 0.889,
 'w2B': 1.269,
 'w3B': 1.612,
 'wHR': 2.083,
 'runSB': 0.2,
 'runCS': -0.401,
 'R/PA': 0.116,
 'R/W': 9.592,
 'cFIP': 3.151}

In [8]:
# Function that derives wRAA for us using the weights.

def compute_wRAA(df, weights):
    def compute_woba(row):
        wBB = weights['wBB']
        wHBP = weights['wHBP']
        w1B = weights['w1B']
        w2B = weights['w2B']
        w3B = weights['w3B']
        wHR = weights['wHR']
        
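        # 'BB' already holds unintentional walks (IBB is subtracted when the game
        # logs are loaded), so IBB must not be subtracted a second time here.
        wOBA = (wBB * row['BB'] + wHBP * row['HBP'] + w1B * row['1B'] + 
                w2B * row['2B'] + w3B * row['3B'] + wHR * row['HR']) / (
                row['AB'] + row['BB'] + row['SF'] + row['HBP'])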
        return wOBA
    
    df['wOBA'] = df.apply(compute_woba, axis=1)
    
    league_wOBA = weights['wOBA']
    wOBA_scale = weights['wOBAScale']
    
    df['wRAA'] = ((df['wOBA'] - league_wOBA) / wOBA_scale) * df['PA']
    
    return df
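
As a sanity check on the formula, here is a worked example for a single hypothetical game line, using the 2024 constants from the `weights` dictionary printed above. The batting line is invented for illustration. `uBB` stands for unintentional walks (BB minus IBB), so the denominator is the usual AB + BB - IBB + SF + HBP.

```python
# 2024 constants copied from the weights dictionary printed above
weights = {'wOBA': 0.309, 'wOBAScale': 1.267, 'wBB': 0.693, 'wHBP': 0.724,
           'w1B': 0.889, 'w2B': 1.269, 'w3B': 1.612, 'wHR': 2.083}

# Hypothetical single-game team batting line (not real data)
game = {'AB': 34, 'uBB': 3, 'HBP': 1, 'SF': 1,
        '1B': 6, '2B': 2, '3B': 0, 'HR': 1, 'PA': 39}

# Weighted sum of the positive offensive events
numerator = (weights['wBB'] * game['uBB'] + weights['wHBP'] * game['HBP']
             + weights['w1B'] * game['1B'] + weights['w2B'] * game['2B']
             + weights['w3B'] * game['3B'] + weights['wHR'] * game['HR'])
denominator = game['AB'] + game['uBB'] + game['SF'] + game['HBP']

woba = numerator / denominator
# Runs above average: wOBA gap to league, rescaled to runs, times opportunities
wraa = (woba - weights['wOBA']) / weights['wOBAScale'] * game['PA']

print(f"wOBA = {woba:.3f}, wRAA = {wraa:+.2f}")
```

A game with a wOBA above the league mark of .309 produces a positive wRAA, and the size of that value grows with plate appearances, which is the caveat raised earlier.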

In [9]:
# Load the team schedules to obtain the team, opponent, and location data. This may take a bit.
hitting_schedule = pd.DataFrame()

for team in tqdm(teams, desc='Extracting Batting Data'):
    sub_df = pybaseball.team_game_logs(2024, team, "batting")
    sub_df.insert(2, 'team', team)
    sub_df['BB'] = sub_df['BB'] - sub_df['IBB']
    sub_df['1B'] = sub_df['H'] - sub_df['2B'] - sub_df['3B'] - sub_df['HR']
    hitting_schedule = pd.concat([hitting_schedule, sub_df], ignore_index = True)



Extracting Batting Data: 100%|██████████████████| 30/30 [02:54<00:00,  5.81s/it]


In [10]:
hitting_schedule = compute_wRAA(hitting_schedule, weights)

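# .copy() makes this an independent frame so the column added below is not written to a slice
hitting_schedule = hitting_schedule[['team', 'Home', 'Opp', 'wRAA']].copy()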

hitting_schedule['hfa'] = np.where(hitting_schedule['Home'] == True, 1, -1)
# No neutral site games until June 20th and August 18th, so that won't matter for this.

hitting_schedule.drop('Home', axis = 1, inplace = True)

Ridge Regression, in simple terms, applies an L2 regularization by introducing a penalty term (alpha in this model's case) to the square of coefficients, which mitigates issues through "shrinkage," pushing these coefficients towards 0. This technique is particularly useful for computing opponent-adjusted stats compared to averaging methods because it addresses multicollinearity, which can result in higher variance in the results. While the averaging method is effective and achieves the goal of normalizing teams based on their opponent's strength, Ridge Regression offers a more reliable approach to the normalization process. For a deeper understanding of why and how Ridge Regression functions in this context, I recommend reading the article authored by [@BudDavis](https://x.com/JBudDavis), linked above.
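
To illustrate the shrinkage idea on something small, the sketch below fits a ridge model to a made-up three-team league. It is not the code used in this article: it solves ridge in closed form with NumPy rather than scikit-learn, the game values are invented, and the column layout (offense dummies followed by opponent dummies) is just one convenient choice.

```python
import numpy as np

# Hypothetical team-games: (batting team index, opposing team index, run value)
games = [(0, 1, 2.0), (0, 2, 1.0), (1, 0, -1.0),
         (1, 2, 0.5), (2, 0, -2.0), (2, 1, -0.5)]
n_teams = 3

# One-hot design matrix: first n_teams columns = offense, next n_teams = opponent
X = np.zeros((len(games), 2 * n_teams))
y = np.zeros(len(games))
for i, (bat, opp, value) in enumerate(games):
    X[i, bat] = 1.0
    X[i, n_teams + opp] = 1.0
    y[i] = value


def ridge_fit(X, y, alpha):
    """Closed-form ridge with an unpenalized intercept (centre the data, then solve)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


coef_small, _ = ridge_fit(X, y, alpha=0.1)
coef_large, _ = ridge_fit(X, y, alpha=75.0)

print("offense coefficients, alpha=0.1:", coef_small[:n_teams].round(3))
print("offense coefficients, alpha=75: ", coef_large[:n_teams].round(3))
```

With a small penalty the coefficients track the raw differences between teams. With a large penalty they are pulled towards zero, which is the behavior that keeps a handful of lopsided games from dominating the adjusted ratings.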

In [11]:
df_dummies = pd.get_dummies(hitting_schedule[['team', 'Opp', 'hfa']])


# Tuning alpha parameter
ridge_tune = linear_model.RidgeCV(alphas = [25,50,75,100,125,150,175, 200], fit_intercept = True)
ridge_tune.fit(df_dummies, hitting_schedule['wRAA'])
alpha = ridge_tune.alpha_


ridge = linear_model.Ridge(alpha=alpha, fit_intercept = True)
ridge.fit(X = df_dummies, y = hitting_schedule['wRAA'])

In [12]:
df_results = pd.DataFrame({'coef_name' : df_dummies.columns.values, 'ridge_coef' : ridge.coef_})

# Add intercept back in to reg coef to get 'adjusted' value
df_results['ridge_reg_value'] = (df_results['ridge_coef']+ridge.intercept_)

print('Homefield Advantage: (alpha: '+str(alpha)+')')
# Select by position with .iloc so this does not depend on 'hfa' having index label 0
print('{:.3f}'.format(df_results.loc[df_results['coef_name'] == 'hfa', 'ridge_coef'].iloc[0]))

Homefield Advantage: (alpha: 75)
0.057


In [13]:
df_team = pd.DataFrame({'team' : teams})

In [14]:
df_adj_hitting = (df_results[df_results['coef_name'].str.slice(0, len('team')) == 'team'].rename(columns = {"ridge_reg_value": 'wRAA'}).reset_index(drop = True))

df_adj_hitting['coef_name'] = df_adj_hitting['coef_name'].str.replace('team'+'_','')
df_adj_hitting = df_adj_hitting.drop(columns=['ridge_coef'])

In [15]:
df_adj_pitching = (df_results[df_results['coef_name'].str.slice(0, len('Opp')) == 'Opp'].rename(columns = {"ridge_reg_value": 'wRAA'}).reset_index(drop = True))

df_adj_pitching['coef_name'] = df_adj_pitching['coef_name'].str.replace('Opp'+'_','')
df_adj_pitching = df_adj_pitching.drop(columns=['ridge_coef'])

In [16]:
df_team['raw_hitting'] = df_team.join(hitting_schedule.groupby('team')['wRAA'].mean(), on='team').wRAA 
df_team['adj_hitting'] = df_team.join(df_adj_hitting.set_index('coef_name'), on='team').wRAA 
df_team['raw_pitching'] = df_team.join(hitting_schedule.groupby('Opp')['wRAA'].mean(), on='team').wRAA
df_team['adj_pitching'] = df_team.join(df_adj_pitching.set_index('coef_name'), on='team').wRAA


df_team = df_team.round(3)
df_team = make_ids(df_team, 'team')

# Results

Here are the full standings showing where teams rank in wRAA adjusted for opponent.

![image](./Guardians-vizs/adjusted.png)

If you want to interpret this from the Guardians' perspective, their stats may be a little inflated due to a weaker strength of schedule. They are still above average in scoring and preventing runs.

### Monte Carlo Simulation

Now that the opponent-adjusted wRAA values have been derived, we can proceed with a Monte Carlo simulation. In simple terms, this involves running through numerous simulations on a computer to predict a certain outcome, in our case, wins.

To decide whether a team wins or loses, each simulated game draws a random run value for both the Guardians and their opponent from a normal (Gaussian) distribution centered on a value built from the two teams' opponent-adjusted numbers. If the Guardians' draw is higher than their opponent's in a given simulation, they are awarded a win.

The simulation will run 10,000 times. If you prefer not to wait around for approximately 3 minutes, you can skip running the cells below and proceed to the results at the end.
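
Because each game compares two independent normal draws, the single-game win probability also has a closed form, which makes a handy cross-check on the simulation. The sketch below is an illustration under stated assumptions rather than the article's code: the two mean values are hypothetical, the standard deviation of 1 matches NumPy's default for `np.random.normal`, and Python's built-in `random` module stands in for NumPy.

```python
import math
import random


def win_probability(mu_team, mu_opp, sigma=1.0):
    """P(X > Y) for independent X ~ N(mu_team, sigma^2) and Y ~ N(mu_opp, sigma^2)."""
    # X - Y is normal with mean (mu_team - mu_opp) and std sigma * sqrt(2)
    z = (mu_team - mu_opp) / (sigma * math.sqrt(2))
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def simulate_win_rate(mu_team, mu_opp, n=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(mu_team, 1.0) > rng.gauss(mu_opp, 1.0) for _ in range(n))
    return wins / n


# Hypothetical expected run values for the two sides
mu_cle, mu_opp = 0.4, -0.1
print(f"analytic:  {win_probability(mu_cle, mu_opp):.3f}")
print(f"simulated: {simulate_win_rate(mu_cle, mu_opp):.3f}")
```

Summing the analytic probabilities over the remaining schedule gives the expected number of wins directly. The simulation remains useful because it also shows the spread of outcomes around that expectation.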

In [18]:
# Obtain Cleveland's remaining schedule
cle = pybaseball.schedule_and_record(2024, "CLE")

schedule = cle.loc[(cle['W/L'].isna()), 'Opp'].tolist()

http://www.baseball-reference.com/teams/CLE/2024-schedule-scores.shtml


In [19]:
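def simulate_game(g_hitting, g_pitching, o_hitting, o_pitching):
    """Simulate one game; return 1 if the Guardians' random draw beats the opponent's, else 0."""
    # np.random.normal(loc) draws with the default standard deviation of 1.0
    g_score = np.random.normal(g_hitting - o_pitching)
    o_score = np.random.normal(o_hitting - g_pitching)
    return 1 if g_score > o_score else 0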


def monte_carlo(schedule, n_simulations = 10000):
    wins = []
    is_win = []
    for _ in tqdm(range(n_simulations), desc='Running Simulation'):
        win_count = 0
        sub_win_list = []
        for opponent in schedule:
            opponent_data = df_team[df_team['team'] == opponent].iloc[0]
            guardians_data = df_team[df_team['team'] == 'CLE'].iloc[0]
            
            win = simulate_game(
                guardians_data['adj_hitting'], 
                guardians_data['adj_pitching'], 
                opponent_data['adj_hitting'], 
                opponent_data['adj_pitching']
            )
            win_count += win
            sub_win_list.append(win)
        wins.append(win_count)
        is_win.append(sub_win_list)
    return wins, is_win
            

In [20]:
simulated_wins, win_trend = monte_carlo(schedule)

Running Simulation: 100%|█████████████████| 10000/10000 [02:58<00:00, 56.14it/s]


# Results

![image](./Guardians-vizs/montecarlo.png)
The simulation predicts the Guardians to win approximately 94 games in total this season based on their remaining schedule. While this provides a useful estimate of their performance, it's important to note that actual outcomes may vary due to factors such as injuries, changes in team dynamics, and unforeseen events during the season. Therefore, while the simulation offers valuable insights, it should not be solely relied upon for betting purposes.
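
To put that projection in context, the arithmetic below compares the implied rest-of-season record with the current pace. It uses only figures quoted in this article (the 42-22 record and the roughly 94-win projection), so the results are approximate.

```python
GAMES_IN_SEASON = 162
wins, losses = 42, 22   # record quoted in the introduction
projected_total = 94    # approximate simulation result quoted above

remaining = GAMES_IN_SEASON - (wins + losses)
projected_rest = projected_total - wins

print(f"current pace:   {wins / (wins + losses):.3f}")
print(f"remaining:      {remaining} games")
print(f"projected rest: {projected_rest}-{remaining - projected_rest} "
      f"({projected_rest / remaining:.3f})")
```

In other words, the model expects the Guardians to play at about a .531 clip the rest of the way, a noticeable step down from their .656 start.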

# Conclusion

I appreciate your time and attention in reading through this article. If you have any questions, comments, or concerns, please feel free to reach out to me via DM on Twitter (link above). I hope you found this article informative and enjoyable in some way. If you would like me to provide analysis for another team, please let me know.