### Abstract

In our project, we sought to predict whether umpires will call a pitch correctly or not. We used features of the pitch like its velocity and release position and features of the game situation like number of runners on base, pitch count, and inning to train a logistic regression model. We chose these features because we wanted to look for patterns in pitch calls that weren't related to where the pitch ended up over the plate. We fit a logistic regression model.... [add results here!]


[Find the source code here](https://github.com/ctkatz/ML_project)

### Introduction

The problem that we want to address in this project is improving fairness in baseball. We want to see if we can detect patterns in incorrect calls that highlight bias towards or against certain player demographics, game situations, or pitch types. This information could help teams understand when pitchers need to be more precise with their pitch locations, and help umpires focus on calling pitches extra carefully in situations where they have been found to be less consistent.

Statistics are an essential part of modern-day baseball, and all MLB teams have extensive data analytics infrastructure. Furthermore, because MLB statistics are publicly available, many people use statistical and machine learning methods to make predictions about baseball pitches. In particular, there has been a lot of work on using classification methods to predict pitch types. Glenn Sidle and Hien Tran, for example, used several multi-class classification methods to predict pitch types (@predictpitchtypes). They used post-processing techniques, like ranking the permuted variable delta error of each feature calculated during the construction of their random forest, to investigate the importance of each feature they used. They found pitch count and batter handedness to be some of the most important features. Even though they were asking a different question than we are, it was still useful to see which features impact the pitch type the most, because these features could also influence how the umpire calls the pitch. And if pitch type impacts call correctness, these features would matter for that reason as well. Additionally, they used categorical features like number of outs, inning, score, and time of day in their model, which helped us figure out how we might use similar features in ours.

In another paper that we looked at, Jasmine Barbee predicted the final pitch outcome of an at-bat (@predictfinaloutcome): that is, whether the last pitch of the at-bat was a strike (strikeout), a ball (walk), put in play for an out, or a hit. Barbee explored features like pitch speed and was able to predict the outcome of the final pitch with reasonable accuracy using the pitch count, the horizontal and vertical pitch locations, and the start and end speeds of the pitch. Her process highlighted how these features can be used in conjunction with each other to learn information about a pitch without knowing its position over the plate.

When we decided which features to use for our model, we drew on insights from other people's work on classifying pitches. In our research, we didn't find anyone trying to answer the specific question that we are asking. However, most papers mentioned motivations similar to ours: making the game more fair and helping teams optimize their pitching to put them in a better position in a game.


### Values Statement

This project could potentially be used by MLB teams and umpires. Teams are probably interested in understanding patterns in when pitches are likely to be called incorrectly, and umpires might be interested in using this technology to improve their call accuracy. [include something about potential results of looking for racial bias maybe]. This model could help improve baseball by helping umpires learn when they tend to call pitches incorrectly. However, a model's ability to find these patterns might eventually take jobs away from umpires: automatic ball-strike systems (electronic systems that use the ball's position to call pitches instead of umpires) could replace them if they are found to be much more accurate, which this type of work could help suggest.

Personally, all of our group members are big baseball fans. One of our favorite things about the game is how important statistics have become and how much analytical work is done to improve teams' performances. We were all really excited about the opportunity to work with baseball statistics. This technology could, in theory, make baseball more equitable. Even though this work isn't necessarily improving the world as a whole, a lot of people care about baseball, and making the game more fair might improve fans' experiences.

### Materials and Methods

#### Our Data
The data used in this project comes from Major League Baseball (MLB) pitch tracking data, which records information about pitches thrown during games. The data was collected from publicly available sources, specifically MLB’s official pitch tracking systems and data repositories. Each row in the dataset represents a single pitch thrown by a pitcher, including features such as pitch type, velocity, location, the batter's handedness (left or right), the umpire’s call (strike or ball), and whether the call was correct. The data also includes metadata about the game, the batter, and the pitcher. The dataset was downloaded from Kaggle. 

The primary focus of our analysis was to identify biases that affect whether the umpire makes the correct call on a pitch, with additional breakdowns by batter handedness (left vs. right) and call type (strikes vs. balls). The data was processed to handle missing values, standardize features, and generate additional variables, such as whether a pitch was on the edge of the strike zone. One potential limitation of the data is inherent bias from how pitch tracking systems are calibrated in different ballparks, which may affect the accuracy of the recorded locations. Additionally, the data does not account for environmental variables (such as lighting conditions or weather) that may influence umpire decisions. Since only pitches that were not hit are included in the analysis, the model may not generalize to situations where the batter makes contact.
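As an illustration of the derived "edge of the strike zone" variable, here is a minimal sketch of one way such a flag could be computed. The argument names (`px`, `pz`, `sz_bot`, `sz_top`, all in feet), the 0.25 ft margin, and the choice to ignore the ball's radius are assumptions made for this example, not necessarily the definition used in the project.

```python
import numpy as np

# Home plate is 17 inches wide; half-width expressed in feet.
PLATE_HALF_WIDTH_FT = 17.0 / 12.0 / 2.0


def on_zone_edge(px, pz, sz_bot, sz_top, margin=0.25):
    """Flag pitches whose plate location is within `margin` feet of the
    strike-zone boundary (inside or outside the zone).

    px: horizontal location relative to the center of the plate (assumed name)
    pz: height of the pitch at the plate (assumed name)
    sz_bot, sz_top: bottom and top of the batter's zone (assumed names)
    """
    px, pz, sz_bot, sz_top = (
        np.asarray(a, dtype=float) for a in (px, pz, sz_bot, sz_top)
    )
    # Signed per-axis offsets: positive means outside the zone on that axis.
    dx = np.abs(px) - PLATE_HALF_WIDTH_FT
    dz = np.maximum(sz_bot - pz, pz - sz_top)
    # Signed distance to the zone rectangle: > 0 outside, < 0 inside.
    outside = np.hypot(np.maximum(dx, 0.0), np.maximum(dz, 0.0))
    inside = np.minimum(np.maximum(dx, dz), 0.0)
    return np.abs(outside + inside) <= margin


edge = on_zone_edge(
    px=[0.0, 0.75, 0.70],
    pz=[2.5, 2.5, 5.0],
    sz_bot=[1.5, 1.5, 1.5],
    sz_top=[3.5, 3.5, 3.5],
)
```

Using the distance to the zone rectangle, rather than to each edge line separately, avoids flagging pitches that line up with one edge but miss the zone by a wide margin on the other axis.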

#### Our Approach
The analysis focused on evaluating the accuracy of umpire calls in Major League Baseball (MLB) games using two logistic regression models: a standard Scikit-Learn logistic regression model and a custom logistic regression trained with Newton's method. The primary features used as predictors included key pitch characteristics such as type, velocity, and horizontal and vertical position at the plate, along with batter attributes like handedness (left or right) and game context including pitch count, inning, and game situation. Importantly, the first iterations of the models excluded information about where the pitch ended up relative to the plate; this information was added back in later for pitches close to the edge of the strike zone. The target variable was whether the umpire's call was correct, represented as a binary outcome (correct or incorrect).
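For readers unfamiliar with Newton's method in this setting, the sketch below shows the general shape of a Newton-based logistic regression fit. It is not the project's implementation (see the linked repository for that); the function names, the small `ridge` term added to keep the Hessian invertible, and the stopping tolerance are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_newton(X, y, n_iter=25, ridge=1e-6, tol=1e-8):
    """Fit logistic regression weights with Newton's method.

    X: (n, d) feature matrix, y: (n,) array of 0/1 labels.
    Returns a weight vector of length d + 1 (intercept first).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)
        # Gradient of the mean negative log-likelihood.
        grad = Xb.T @ (p - y) / len(y)
        # Hessian: X^T diag(p * (1 - p)) X, averaged over samples.
        hess = (Xb.T * (p * (1.0 - p))) @ Xb / len(y)
        # Small ridge on the Hessian only (assumption): numerical safety.
        hess += ridge * np.eye(len(w))
        step = np.linalg.solve(hess, grad)
        w -= step
        if np.linalg.norm(step) < tol:
            break
    return w


def predict(X, w, threshold=0.5):
    """Return 0/1 predictions from fitted weights."""
    X = np.asarray(X, dtype=float)
    return (sigmoid(w[0] + X @ w[1:]) >= threshold).astype(int)
```

Each iteration solves a linear system in the Hessian instead of taking a fixed-size gradient step, which is why Newton's method usually needs far fewer iterations than gradient descent but costs more per iteration.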

Data preprocessing involved handling missing values by filling them with zeros, standardizing numerical features, and encoding categorical variables. To assess specific biases and performance differences, the dataset was further divided into subgroups, including Left-Handed vs. Right-Handed Batters to explore potential discrepancies based on handedness, and Balls vs. Strikes to evaluate consistency within the strike zone.
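The preprocessing steps described above could look roughly like the following pandas sketch. The column names in the usage example (`release_speed`, `stand`, `call_correct`) are hypothetical stand-ins, and the exact order of operations in the project may differ.

```python
import pandas as pd


def preprocess(df, numeric_cols, categorical_cols):
    """Zero-fill and standardize numeric columns, one-hot encode categoricals."""
    out = df.copy()
    # Fill missing numeric values with zeros, as described in the text.
    num = out[numeric_cols].astype(float).fillna(0.0)
    means = num.mean()
    stds = num.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero
    out[numeric_cols] = (num - means) / stds
    # One-hot encode; a missing category becomes an all-zero row.
    return pd.get_dummies(out, columns=categorical_cols, dtype=float)


def split_by(df, column):
    """Split a frame into subgroups, e.g. by batter handedness."""
    return {key: group for key, group in df.groupby(column)}


# Hypothetical toy data for illustration only.
pitches = pd.DataFrame(
    {
        "release_speed": [90.0, None, 95.0],
        "stand": ["L", "R", "R"],
        "call_correct": [1, 0, 1],
    }
)
subgroups = split_by(pitches, "stand")
features = preprocess(pitches, ["release_speed"], ["stand"])
```

Note that zero-filling before standardizing treats a missing velocity as a real value of zero, which shifts the mean and inflates the variance; whether that matters depends on how many values are missing.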

Both models were trained separately, with the Scikit-Learn model serving as a baseline comparison and the custom Newton-based model leveraging second-order optimization for potentially faster and more efficient parameter updates. Model evaluation was conducted using key metrics such as Accuracy, False Positive Rate (FPR), and False Negative Rate (FNR). Additionally, confusion matrices were generated for visual inspection of model performance across subgroups, and the predictions were benchmarked against actual umpire accuracy to identify significant deviations or biases in decision-making.
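The evaluation metrics can be computed directly from confusion-matrix counts. The sketch below assumes the positive class (1) means "the umpire's call was correct"; the helper names are illustrative rather than taken from the project code.

```python
import numpy as np


def _rate(numerator, denominator):
    # Return NaN when a subgroup has no examples of the relevant class.
    return float(numerator) / float(denominator) if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    """Accuracy, false positive rate, and false negative rate for 0/1 labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return {
        "accuracy": _rate(tp + tn, tp + tn + fp + fn),
        "fpr": _rate(fp, fp + tn),  # incorrect calls predicted as correct
        "fnr": _rate(fn, fn + tp),  # correct calls predicted as incorrect
    }


def metrics_by_group(y_true, y_pred, groups):
    """Compute the same metrics separately for each subgroup label."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        str(g): binary_metrics(y_true[groups == g], y_pred[groups == g])
        for g in np.unique(groups)
    }


overall = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
by_hand = metrics_by_group(
    [1, 1, 0, 0, 1], [1, 0, 0, 1, 1], ["L", "L", "R", "R", "R"]
)
```

Reporting FPR and FNR per subgroup alongside accuracy matters here because incorrect calls are the minority class, so a model can reach high accuracy while rarely identifying them.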

### Results
The Scikit-Learn model showed higher overall accuracy than the Custom Newton model. However, the Custom Newton model exhibited more stable convergence during training and demonstrated competitive performance in edge cases near the strike zone. The Scikit-Learn model struggled with overfitting to specific pitch types, particularly balls; the Newton model, however, struggled to converge on the data. Perhaps with more computing resources to run many more iterations of our Newton's method optimizer, it could have converged. The Newton model did, however, demonstrate much better generalizability across both balls and strikes and across left- vs. right-handed batters. Confusion matrices were generated for left-handed and right-handed batters, revealing that both models performed better for right-handed batters, with fewer false positives and a clearer decision boundary than for left-handed batters.

The Scikit-Learn model demonstrated slightly higher overall accuracy than the Custom Newton model. Specifically, the Scikit-Learn model maintained stronger stability across a broader range of pitch types and batter handedness. Its higher FPR and FNR scores were indicative of conservative decision-making, particularly in distinguishing borderline calls. Conversely, the Custom Newton model exhibited more stable convergence during training and demonstrated competitive performance in edge cases near the strike zone, where decision boundaries are more ambiguous. This suggests that Newton-based optimization, which uses second-order (curvature) information rather than the gradient alone, may allow for finer parameter updates, leading to better handling of difficult calls.

Confusion matrices were generated for left-handed and right-handed batters, revealing distinct performance differences. Both models showed improved accuracy for right-handed batters, with fewer false positives and a clearer decision boundary. This might suggest that the models are better calibrated for typical right-handed batting patterns or that there is an inherent bias in pitch tracking systems favoring right-handed batters. In contrast, left-handed batters saw slightly higher error rates, particularly in borderline pitch locations. This highlights the need for model adjustments to handle the variability introduced by left-handed batting stances.

Further analysis focused on balls and strikes separately, indicating that the models were generally more consistent for identifying balls correctly but faced challenges with borderline strike calls. For balls, the Scikit-Learn model had a lower False Positive Rate (FPR) compared to the Custom Newton model, reflecting more conservative predictions in non-strike situations. However, this also meant the Scikit-Learn model was more likely to miss strikes that were marginally on the edge. On the other hand, for strikes, the Custom Newton model exhibited slightly higher sensitivity, capturing more borderline calls as correct, though sometimes at the expense of increased FPR. This suggests that the Newton model is more aggressive in edge-zone classification, potentially catching calls that the Scikit-Learn model would miss.

### Concluding Discussion

### Group Contributions Statement

### Personal Reflection