![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Baseball

<img src=https://upload.wikimedia.org/wikipedia/commons/5/59/Baseball_diamond_marines.jpg width=800>
<p>
<a href='https://en.wikipedia.org/wiki/Baseball_field'>https://en.wikipedia.org/wiki/Baseball_field </a>
</p>

## Analyzing Pitch Data

Baseball is one of the most popular sports in North America, and over the past few decades its popularity has risen in countries around the world. As the sport has grown, so has the technology that it uses. Major League Baseball (MLB), the largest professional league in the world, has installed high-tech cameras and other tracking tools in all 30 of its stadiums to allow teams to analyze the game using more advanced methods than ever before. High-performance computers then run complicated programs to turn the video recordings from the cameras into usable data. Teams can use this data to improve their own players' performance, as well as to learn about what their opponents are likely to do.

MLB has made this data available for anyone who wants to use it, and we can go through some of it today to see if we can learn more about the game and how data can be used to gain an advantage on our opponents!

Email contact@callysto.ca if you experience issues.

### Import Libraries

Run the cell below to import the necessary Python libraries that we'll use in this notebook.

In [None]:
%pip install -q pyodide_http plotly
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
import plotly.express as px
print('libraries imported')

## Downloading Data

For today's challenge, we'll be looking at data about the pitches thrown in baseball. Each play starts with a pitch, so there's potentially a lot we can learn from this data. In the 2022 season, over **600,000 pitches were thrown**, which is a lot of data. We don't need quite that much data, so instead we'll focus on just one of the divisions within MLB: the American League East division. This division consists of the Toronto Blue Jays, the New York Yankees, the Boston Red Sox, the Baltimore Orioles, and the Tampa Bay Rays. In the 2022 season, the AL East had the most wins of any division in all of MLB, despite almost half of each team's games being against other teams in the division! To narrow the data even more, we'll just look at data for the month of June 2022.

There are many sources of baseball data on the internet, such as [FanGraphs](https://www.fangraphs.com/), [Baseball-Reference](https://www.baseball-reference.com/), and many more. For this challenge, though all the sites have access to MLB's data, we've downloaded the data from a website called [Baseball Savant](https://baseballsavant.mlb.com/) and stored it in a CSV file so it's easy to access. We can load it here and look at five random observations:

In [None]:
pitch_data = pd.read_csv('data/ale_pitch_data_june.csv')
pitch_data.sample(5) # Because this process is random, not everyone will get the same results below

Each row represents a single pitch, with 92 columns of data (as we can see from the information below the dataframe above). Because there are more columns than can be displayed in the notebook, Jupyter hides many of the columns in the middle of the dataframe. To see all the columns, we can use the code below:

In [None]:
pitch_data.info()

### Looking at Data

<img src=https://upload.wikimedia.org/wikipedia/commons/2/25/Baseball_pitching_motion_2004.jpg width=800>
<p>
<a href='https://en.wikipedia.org/wiki/Pitch_(baseball)'>https://en.wikipedia.org/wiki/Pitch_(baseball) </a>
</p>

The above function tells us:
1. There are 18,264 pitches (rows) in the data
1. The name of each column, in order
1. How many rows in each column contain data (are 'non-null')
1. The type of data each column contains
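
The same four pieces of information can also be pulled out one at a time. The sketch below uses a small made-up dataframe (the name `toy` and its values are assumptions for illustration, not part of the pitch data) to show where each item comes from:

```python
import pandas as pd

# Hypothetical three-pitch table; the values are made up for illustration
toy = pd.DataFrame({
    'pitch_type': ['FF', 'SL', None],
    'release_speed': [95.1, 84.3, 88.0],
})

n_rows, n_cols = toy.shape            # 1. number of rows (and columns)
column_names = toy.columns.tolist()   # 2. column names, in order
non_null = toy.notna().sum()          # 3. non-null count for each column
dtypes = toy.dtypes                   # 4. data type of each column

print(n_rows, n_cols)
print(column_names)
print(non_null)
print(dtypes)
```

Replacing `toy` with `pitch_data` should give the same details that `pitch_data.info()` reports, just split into separate objects that can be used in later calculations.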

Some of the column names provide a good description of the data they contain, but ideally we don't have to guess, and we can know exactly what the data represent. Thankfully, the [Baseball Savant](https://baseballsavant.mlb.com/csv-docs) website provides descriptions for each column. There are quite a few, so hold on tight:

0.   **pitch_type** - The type of pitch derived from Statcast.
1.   **game_date** - Date of the Game.
2.   **release_speed** - Pitch velocities from 2008-16 are via Pitch F/X, and adjusted to roughly out-of-hand release point. All velocities from 2017 and beyond are Statcast, which are reported out-of-hand.
3.   **release_pos_x** - Horizontal Release Position of the ball measured in feet from the catcher's perspective.
4.   **release_pos_z** - Vertical Release Position of the ball measured in feet from the catcher's perspective.
5.   **player_name** - Player's name tied to the event.
6.   **batter** - MLB Player Id tied to the play event.
7.   **pitcher** - MLB Player Id tied to the play event.
8.   **events** - Event of the resulting Plate Appearance.
9.   **description** - Description of the resulting pitch.
10.  **spin_dir** - * Deprecated field from the old tracking system.
11.  **spin_rate_deprecated** - * Deprecated field from the old tracking system. Replaced by release_spin
12.  **break_angle_deprecated** - * Deprecated field from the old tracking system.
13.  **break_length_deprecated** - * Deprecated field from the old tracking system.
14.  **zone** - Zone location of the ball when it crosses the plate from the catcher's perspective.
15.  **des** - Plate appearance description from game day.
16.  **game_type** - Type of Game. E = Exhibition, S = Spring Training, R = Regular Season, F = Wild Card, D = Divisional Series, L = League Championship Series, W = World Series
17.  **stand** - Side of the plate batter is standing.
18.  **p_throws** - Hand pitcher throws with.
19.  **home_team** - Abbreviation of home team.
20.  **away_team** - Abbreviation of away team.
21.  **type** - Short hand of pitch result. B = ball, S = strike, X = in play.
22.  **hit_location** - Position of first fielder to touch the ball.
23.  **bb_type** - Batted ball type, ground_ball, line_drive, fly_ball, popup.
24.  **balls** - Pre-pitch number of balls in count.
25.  **strikes** - Pre-pitch number of strikes in count.
26.  **game_year** - Year game took place.
27.  **pfx_x** - Horizontal movement in feet from the catcher's perspective.
28.  **pfx_z** - Vertical movement in feet from the catcher's perspective.
29.  **plate_x** - Horizontal position of the ball when it crosses home plate from the catcher's perspective.
30.  **plate_z** - Vertical position of the ball when it crosses home plate from the catcher's perspective.
31.  **on_3b** - Pre-pitch MLB Player Id of Runner on 3B.
32.  **on_2b** - Pre-pitch MLB Player Id of Runner on 2B.
33.  **on_1b** - Pre-pitch MLB Player Id of Runner on 1B.
34.  **outs_when_up** - Pre-pitch number of outs.
35.  **inning** - Pre-pitch inning number.
36.  **inning_topbot** - Pre-pitch top or bottom of inning.
37.  **hc_x** - Hit coordinate X of batted ball.
38.  **hc_y** - Hit coordinate Y of batted ball.
39.  **tfs_deprecated** - * Deprecated field from old tracking system.
40.  **tfs_zulu_deprecated** - * Deprecated field from old tracking system.
41.  **fielder_2** - Pre-pitch MLB Player Id of Catcher.
42.  **umpire** - * Deprecated field from old tracking system.
43.  **sv_id** - Non-unique Id of play event per game.
44.  **vx0** - The velocity of the pitch, in feet per second, in x-dimension, determined at y=50 feet.
45.  **vy0** - The velocity of the pitch, in feet per second, in y-dimension, determined at y=50 feet.
46.  **vz0** - The velocity of the pitch, in feet per second, in z-dimension, determined at y=50 feet.
47.  **ax** - The acceleration of the pitch, in feet per second per second, in x-dimension, determined at y=50 feet.
48.  **ay** - The acceleration of the pitch, in feet per second per second, in y-dimension, determined at y=50 feet.
49.  **az** - The acceleration of the pitch, in feet per second per second, in z-dimension, determined at y=50 feet.
50.  **sz_top** - Top of the batter's strike zone set by the operator when the ball is halfway to the plate.
51.  **sz_bot** - Bottom of the batter's strike zone set by the operator when the ball is halfway to the plate.
52.  **hit_distance_sc** - Projected hit distance of the batted ball.
53.  **launch_speed** - Exit velocity of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here.
54.  **launch_angle** - Launch angle of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here.
55.  **effective_speed** - Derived speed based on the extension of the pitcher's release.
56.  **release_spin_rate** - Spin rate of pitch tracked by Statcast.
57.  **release_extension** - Release extension of pitch in feet as tracked by Statcast.
58.  **game_pk** - Unique Id for Game.
59.  **pitcher.1** - MLB Player Id tied to the play event.
60.  **fielder_2.1** - MLB Player Id for catcher.
61.  **fielder_3** - MLB Player Id for 1B.
62.  **fielder_4** - MLB Player Id for 2B.
63.  **fielder_5** - MLB Player Id for 3B.
64.  **fielder_6** - MLB Player Id for SS.
65.  **fielder_7** - MLB Player Id for LF.
66.  **fielder_8** - MLB Player Id for CF.
67.  **fielder_9** - MLB Player Id for RF.
68.  **release_pos_y** - Release position of pitch measured in feet from the catcher's perspective.
69.  **estimated_ba_using_speedangle** - Estimated Batting Avg based on launch angle and exit velocity.
70.  **estimated_woba_using_speedangle** - Estimated wOBA based on launch angle and exit velocity.
71.  **woba_value** - wOBA value based on result of play.
72.  **woba_denom** - wOBA denominator based on result of play.
73.  **babip_value** - BABIP value based on result of play.
74.  **iso_value** - ISO value based on result of play.
75.  **launch_speed_angle** - Launch speed/angle zone based on launch angle and exit velocity: \
    1\.  Weak  
    2\.  Topped  
    3\.  Under  
    4\.  Flare/Burner  
    5\.  Solid Contact  
    6\.  Barrel  
76.  **at_bat_number** - Plate appearance number of the game.
77.  **pitch_number** - Total pitch number of the plate appearance.
78.  **pitch_name** - The name of the pitch derived from the Statcast Data.
79.  **home_score** - Pre-pitch home score
80.  **away_score** - Pre-pitch away score
81.  **bat_score** - Pre-pitch bat team score
82.  **fld_score** - Pre-pitch field team score
83.  **post_away_score** - Post-pitch away score
84.  **post_home_score** - Post-pitch home score
85.  **post_bat_score** - Post-pitch bat team score
86.  **post_fld_score** - Post-pitch field team score
87.  **if_fielding_alignment** - Infield fielding alignment at the time of the pitch.
88.  **of_fielding_alignment** - Outfield fielding alignment at the time of the pitch.
89.  **spin_axis** - The Spin Axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball
90.  **delta_home_win_exp** - The change in Win Expectancy before the Plate Appearance and after the Plate Appearance
91.  **delta_run_exp** - The change in Run Expectancy before the Pitch and after the Pitch
    
That's a lot of columns of data! We probably won't use them all, but later on feel free to use whatever you want to get creative with your analysis.
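
As a small example of combining columns, the descriptions above say that `vx0`, `vy0`, and `vz0` are velocity components in feet per second. A single speed in miles per hour can be worked out from them with the Pythagorean theorem and a unit conversion. This is a minimal sketch: the function name `speed_mph` and the input values are made up for illustration. Because these components are determined at y=50 feet rather than at the pitcher's hand, the result would likely be close to, but not identical to, `release_speed`.

```python
import math

def speed_mph(vx0, vy0, vz0):
    """Combine three velocity components (ft/s) into one speed in mph."""
    feet_per_second = math.sqrt(vx0**2 + vy0**2 + vz0**2)
    return feet_per_second * 3600 / 5280  # 3600 seconds per hour, 5280 feet per mile

# Made-up component values, not taken from the dataset
example = speed_mph(3.0, -132.0, -4.0)
print(round(example, 1))
```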

### Preparing Data

Sure, those numbers are great, but they'd be a lot more meaningful if we could visualize them. To start, let's look at the speed at which each of the pitchers in the dataset throws. As a first step, let's see how many pitchers are in the dataset:

In [None]:
display(sorted(pitch_data['player_name'].unique().tolist()))
print(f'\nThere are {len(pitch_data["player_name"].unique())} pitchers appearing in this dataset') # Make list of unique player names, and print the length of that list

90 is a lot of pitchers, and it's tough to clearly visualize, so we'll first filter down the list to only include a smaller sample.

How do we determine which pitchers to include? Well, a good next step is to plot the data:

In [None]:
pitchers = pitch_data['player_name'].value_counts() # Counting instances of each pitcher's name in the dataset
px.bar(pitchers, labels={'value':'Pitches', 'index':'Pitcher name'}, 
       title='Number of pitches thrown by AL East pitchers in June 2022').update_layout(showlegend=False).update_xaxes(tickfont=dict(size=8))

There are a few "bumps" in the plot above where there's a drop-off in numbers that we could probably use as a cutoff point for determining how many pitchers we want to focus on. Our decision of where to cut the data is arbitrary, but perhaps using 400 pitches as a cutoff (between Corey Kluber and José Berríos) allows us to use a manageable number of pitchers without getting too overwhelming.
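
A cutoff like this can also be written directly as a condition on the counts, instead of counting how many names sit above it. The sketch below uses invented pitcher names and numbers (the variables `counts`, `cutoff`, and `above_cutoff` are assumptions for illustration) to show the idea:

```python
import pandas as pd

# Hypothetical pitch counts per pitcher, sorted from most to fewest
# (names and numbers are invented for illustration)
counts = pd.Series({'Pitcher A': 520, 'Pitcher B': 455, 'Pitcher C': 410,
                    'Pitcher D': 380, 'Pitcher E': 120})

cutoff = 400
above_cutoff = counts[counts >= cutoff]  # boolean mask keeps only counts at or above the cutoff

print(above_cutoff.index.tolist())
```

Filtering by value has the advantage that changing the cutoff only means changing one number, whereas selecting by position means recounting the names each time.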

In [None]:
top_pitchers = pitchers[0:13] # 'Slice' the counts (a pandas Series) by taking only the entries starting at the 0th, and up to (but not including) the 13th, for 13 total names
top_pitchers

### Visualizing Data

Now we can plot the pitches each pitcher throws and the velocity associated with them.

In [None]:
px.scatter(pitch_data[pitch_data['player_name'].isin(top_pitchers.index)].sort_values('player_name'), # Filter original dataset to only include pitchers whose name is in our list of 13
           x='player_name', y='release_speed', # Choose the variables to plot
           labels={'release_speed':'Pitch velocity (mph)', 'player_name':'Pitcher name'}, # Rename the axes labels to be a little prettier
           title='Pitch velocity by pitcher', height=500) # Title the plot and make sure it's tall enough to be easily viewable

### Using Expert Knowledge

If you're a baseball fan, you likely know that pitchers will use different types of pitches to fool the batter (and if you're not a baseball fan, the next few steps will help you learn that knowledge). Though a pitch with more speed can definitely be difficult to hit, not every pitch relies on its velocity. We can further break down the same plot by pitch type:

In [None]:
px.scatter(pitch_data[pitch_data['player_name'].isin(top_pitchers.index)].sort_values('player_name'), 
           x='player_name', y='release_speed', color='pitch_name', opacity=0.8, # Now colour the pitches by type, and make the symbols slightly more transparent
           labels={'release_speed':'Pitch velocity (mph)', 'player_name':'Pitcher name', 'pitch_name':'Pitch type'},
           title='Pitch velocity by pitcher and pitch', height=500)

Fast pitches are certainly hard to hit, but as we see above there's clearly some kind of relationship between the type of pitch and its speed. Pitches that don't rely on speed to fool the batter usually rely on *movement*. How a ball moves through the air, and how its movement is changed by the amount of *spin* on the ball, is a fairly complicated topic, but for further learning you can check out [this YouTube video](https://www.youtube.com/watch?v=0lbQwFmwBNs) for a good explanation. We can build on this later, but for now we can consider *spin rate* to be a good measurement for pitch movement.

Ignoring individual pitchers for a second, we can plot the relationship between pitch velocity and spin rate, and see how that relates to pitch type:

In [None]:
px.scatter(pitch_data[pitch_data['player_name'].isin(top_pitchers.index)], 
           x='release_speed', y='release_spin_rate', color='pitch_name', # Compare velocity and spin rate
           labels={'release_speed':'Pitch velocity (mph)', 'release_spin_rate':'Spin rate (rpm)', 'pitch_name':'Pitch type'},
           title='Pitch velocity and movement, by pitch type', height=500)

Building on our knowledge that pitchers use either velocity **or** spin rate to fool the batter, let's replicate one of our earlier plots, but this time looking at spin rate:

In [None]:
px.scatter(pitch_data[pitch_data['player_name'].isin(top_pitchers.index)].sort_values('player_name'), 
           x='player_name', y='release_spin_rate', color='pitch_name', opacity=0.8,
           labels={'release_spin_rate':'Spin rate (rpm)', 'player_name':'Pitcher name', 'pitch_name':'Pitch type'},
           title='Pitch spin rate by pitcher and pitch', height=500)

When we looked at each pitcher's velocity, and coloured it by pitch type, there was a lot of overlap between different types of pitches. Knowing now that spin rate is another important factor in how effective certain pitches are, and plotting that instead of velocity, we can see that the difference between pitches is even clearer.
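
A plot is one way to compare pitch types; another is to summarize each type with an average. The sketch below groups a tiny made-up table (the name `toy` and its values are assumptions for illustration) by `pitch_name` and takes the mean velocity and spin rate of each group:

```python
import pandas as pd

# Hypothetical pitches; the values are invented for illustration
toy = pd.DataFrame({
    'pitch_name': ['4-Seam Fastball', '4-Seam Fastball', 'Curveball', 'Curveball'],
    'release_speed': [95.0, 97.0, 79.0, 81.0],
    'release_spin_rate': [2300, 2400, 2700, 2900],
})

# One row per pitch type, with the average of each numeric column
summary = toy.groupby('pitch_name')[['release_speed', 'release_spin_rate']].mean()
print(summary)
```

Running the same `groupby` on `pitch_data` would give one row per pitch type in the real dataset, which can be a useful check on what the scatter plots appear to show.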

### Diving into the Data

In baseball jargon, pitches that rely on *speed* to fool the batter are generally known as **fastballs**, whereas those that rely on *movement* (through increased spin rate) are collectively referred to as **breaking balls**. You can find a good breakdown of the pitch types [here](https://www.rookieroad.com/baseball/pitch-types/), as well as the below image that shows the movement of the most common pitches:

<img src=https://appliedvisionbaseball.com/wp-content/uploads/2019/10/nipponfoodie-31-1024x576.png width=800>
<p>
<a href='https://appliedvisionbaseball.com/how-to-identify-pitch-types-spin-speed-location/'>https://appliedvisionbaseball.com/how-to-identify-pitch-types-spin-speed-location/ </a>
</p>

Let's see exactly which pitch types exist in this dataset (going back to all 90 pitchers):

In [None]:
pitch_types = pitch_data['pitch_name'].unique() # List unique values in 'pitch_name' variable
pitch_types

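We can then create two lists: one that contains all the fastballs, and one that contains the breaking balls. We do this by using the `[#]` notation to refer to a particular entry's position in a list, starting (as always in Python) with 0. Keep in mind that `unique()` lists the values in the order they first appear in the data, so these positions are specific to this dataset: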

In [None]:
fastballs = [pitch_types[0], pitch_types[2], pitch_types[3]]
breaking_balls = [pitch_types[1], pitch_types[4], pitch_types[6], pitch_types[7]]
display(fastballs)
display(breaking_balls)

A keen eye will notice that fastballs and breaking balls don't account for all of our pitches. There are two remaining pitches: 'Changeup', and 'nan'.

A changeup is a pitch that is neither a fastball nor a breaking ball. The goal of a changeup is to *appear* as a fastball when it's leaving the pitcher's hand, but to have a significantly lower velocity, causing the batter to swing much earlier and miss the ball.

<div class="alert alert-block alert-success">
<b>Note:</b>
On the other hand, a <b>nan</b> is not a pitch at all. Often in programming languages, you'll encounter a data type that represents non-existent data. This is different from zero. Consider temperature: if a dataset of temperatures contains a value of <b>0</b>, you would assume that you've measured a real temperature right at freezing. However, a value of <b>nan</b> indicates that a value <i>should</i> exist, but for whatever reason doesn't. Depending on the language and/or the source of the data, this could be written as <b>nan, NaN, NA, or null.</b>
</div>
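
To make the difference between **0** and **nan** concrete, here is a minimal sketch using the temperature example from the note. The series `temps` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings: a real 0, a missing value, and a 10
temps = pd.Series([0.0, np.nan, 10.0])

is_missing = temps.isna().tolist()          # flags only the missing entry
mean_skipping = temps.mean()                # pandas skips nan by default
mean_as_zero = temps.fillna(0).mean()       # treating nan as 0 changes the answer
nan_equals_itself = (np.nan == np.nan)      # nan is not equal to anything, even itself

print(is_missing)
print(mean_skipping)
print(mean_as_zero)
print(nan_equals_itself)
```

Because **nan** never compares as equal to anything, checks such as `isna()` or `isnull()` are used to find missing values rather than `==`.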

### Cleaning Data

One of the most time-consuming parts of data science is 'cleaning' your data. Cleaning entails looking at your data and ensuring it represents what you think it does, and ensuring that any anomalies are identified and can be either explained or removed. Knowing that we have entries in our data where there isn't a particular data point for the pitch name in the dataset, we can look at just those rows and it can help guide us as to what we do with the data.

We'll filter our original dataset to look for just those rows that have a `pitch_name` of **nan**, which the pandas library treats as 'null':

In [None]:
pitch_data[pitch_data['pitch_name'].isnull()] # Return all rows where the value for 'pitch_name' is equal to nan/NaN/null

In our dataframe of 18,264 entries, 16 of them contain that null value for `pitch_name`. Though the `pitch_name` column isn't shown above, the remaining columns give us some insight into how valuable the data is, to help us decide whether to keep or remove the data.

Taking a look at the `description` column, every single one of these pitches resulted in a valid play, so in this instance it may be worth keeping the data in the original dataset. Though we can't be 100% sure, what likely happened is that whatever system MLB uses to classify pitches struggled to classify these ones, but they were still valid pitches.
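
If we keep these rows, one option is to count them and give them an explicit label so they are easy to spot later. This is only a sketch on made-up rows: the dataframe `toy` and the label `'Unclassified'` are assumptions for illustration, not something the dataset or Baseball Savant defines.

```python
import pandas as pd

# Hypothetical rows; two of them have no pitch name
toy = pd.DataFrame({
    'pitch_name': ['Slider', None, 'Sinker', None],
    'description': ['ball', 'foul', 'called_strike', 'hit_into_play'],
})

n_missing = toy['pitch_name'].isna().sum()  # how many rows lack a pitch name

# Make a labelled copy; the original dataframe is left unchanged
labelled = toy.assign(pitch_name=toy['pitch_name'].fillna('Unclassified'))

print(n_missing)
print(labelled)
```

Labelling keeps every pitch in the data while making the unclassified ones visible as their own group in a plot or summary.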

### Filtering Data

Now that we are satisfied with the types of pitches existing in the dataset, we can look at one group of pitches at a time and compare it across pitchers:

In [None]:
# Change the value here to one of `fastballs` or `breaking_balls` to switch the types of pitches shown
grouping = breaking_balls


px.scatter(pitch_data[(pitch_data['player_name'].isin(top_pitchers.index)) & (pitch_data['pitch_name'].isin(grouping))], # Filter first by the top 13 pitchers, then by the pitches contained in the 'grouping' variable
           x='release_speed', y='release_spin_rate', color='pitch_name',
           labels={'release_speed':'Pitch velocity (mph)', 'release_spin_rate':'Spin rate (rpm)', 'pitch_name':'Pitch type'},
           title='Pitch velocity and movement, by pitch type (grouped)', height=500)

Or, if we want to look at the same relationship for all the pitches from a single pitcher, we can do so using the code below:

In [None]:
# Change the pitcher name here to look at any individual pitcher. Use the list generated in 'Preparing Data' to ensure proper spelling and formatting
pitcher = 'Manoah, Alek'


px.scatter(pitch_data[pitch_data['player_name']==pitcher], # Find rows only where the 'player_name' matches the value assigned to 'pitcher'
           x='release_speed', y='release_spin_rate', color='pitch_name',
           labels={'release_speed':'Pitch velocity (mph)', 'release_spin_rate':'Spin rate (rpm)', 'pitch_name':'Pitch type'},
           title=f'Pitch velocity and movement, by pitch type ({pitcher})', height=500)

## Next Steps

This has been a quick intro into baseball data, and how we can visualize relationships between variables to learn more about how they interact with each other. We've only looked at a small selection of the available data, and there are probably some questions you might have that we didn't even attempt to answer. We've walked through a few aspects you can use to help guide your analysis, but we've barely scratched the surface of what we can do.

For further analysis of this data, and to learn the skills to allow you to answer any other questions you can come up with, continue on to the [next notebook](baseball-challenge.ipynb).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)