![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Baseball - Analyzing pitch data

**Submitted by: A, B, C, D**

<img src=https://upload.wikimedia.org/wikipedia/commons/5/59/Baseball_diamond_marines.jpg width=800>
<p>
<a href='https://upload.wikimedia.org/wikipedia/commons/5/59/Baseball_diamond_marines.jpg'>https://upload.wikimedia.org/wikipedia/commons/5/59/Baseball_diamond_marines.jpg</a>
</p>

Baseball is a popular sport in North America that has, over the past few decades, grown in popularity in countries across the world. As the sport has grown, so has the technology that it uses. Major League Baseball (MLB), the largest professional league in the world, has installed high-tech cameras and other tracking tools in all 30 of its stadiums so that teams can analyze the game using more advanced methods than ever before. High-performance computers then run complicated programs to turn the video recordings from the cameras into usable data. Teams can use this data to improve their own players' performance, as well as to learn what their opponents are likely to do.

MLB has made this data available to anyone who wants to use it, so today we can go through some of it to learn more about the game and see if we can gain an advantage over our opponents!

## Getting ready

Before we get into the questions we want to ask, we need to do a little bit of work behind the scenes. All the cells in this notebook can be run without being modified, and you don't need to be too concerned if you don't understand the exact function of every line of code. If there's anything you're curious about, please ask one of your mentors to help you learn more!

#### 1. Install/Import libraries

Run the cell below to download and install the required Python libraries. It may take a few minutes for the cell to finish running.

In [None]:
!pip install pandas plotly

Now that we have installed the libraries we need to run the notebook, we have to import them:

In [1]:
# Load libraries
import pandas as pd
import plotly.express as px

#### 2. Import data and create a dataframe

For today's challenge, we'll be looking at data about the pitches thrown in baseball. Each play starts with a pitch, so there's potentially a lot we can learn from this data. In the 2022 season, over **600,000 pitches were thrown**, which is a lot of data. We don't need quite that much, so instead we'll focus on just one of the divisions within MLB: the American League East division. This division consists of the Toronto Blue Jays, the New York Yankees, the Boston Red Sox, the Baltimore Orioles, and the Tampa Bay Rays. In the 2022 season, the AL East had the most wins of any division in all of MLB, despite almost half of each team's games being against other teams in the division! To narrow the data even more, we'll just look at data for the month of June 2022.

There are many sources of baseball data on the internet, such as [FanGraphs](https://www.fangraphs.com/), [Baseball-Reference](https://www.baseball-reference.com/), and many more. Although all of these sites have access to MLB's data, for this challenge we've downloaded the data from a website called [Baseball Savant](https://baseballsavant.mlb.com/) and stored it in a CSV file so it's easy to access. We can load it here and look at five random observations:

In [8]:
pitch_data = pd.read_csv('data/ale_pitch_data_june.csv')
pitch_data.sample(5) # Because this process is random, not everyone will get the same results below

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
2078,SL,2022-06-16,85.3,-2.42,6.0,"Gausman, Kevin",669720,592332,,called_strike,...,0,0,0,0,0,Standard,Standard,216.0,0.0,-0.074
7605,FF,2022-06-29,92.1,-1.68,6.19,"Taillon, Jameson",643393,592791,,ball,...,0,0,0,0,0,Infield shift,Standard,208.0,0.0,0.033
14468,FF,2022-06-04,93.5,2.31,6.1,"Akin, Keegan",680911,669211,,foul,...,5,4,5,4,5,Standard,Standard,140.0,0.0,-0.031
4013,FF,2022-06-29,93.7,-1.42,6.69,"Pivetta, Nick",666971,601713,,ball,...,0,0,0,0,0,Standard,Standard,199.0,0.0,0.059
13265,SL,2022-06-14,80.5,-2.37,5.01,"Tate, Dillon",665489,622253,,called_strike,...,6,6,4,4,6,Standard,Standard,65.0,0.0,-0.017
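
If you would like everyone to see the same "random" rows, one option is to give `sample` a fixed `random_state`. The sketch below uses a tiny made-up CSV (three rows copied from the output above, with a hypothetical saved index) instead of the real file. It also assumes that the `Unnamed: 0` column is a row index that was saved along with the data, which `index_col=0` can absorb, and it uses `parse_dates` to read `game_date` as dates.

```python
import io
import pandas as pd

# Hypothetical three-pitch CSV standing in for the real file; the first,
# unnamed column is assumed to be a row index saved along with the data.
csv_text = """,pitch_type,game_date,release_speed,player_name
0,SL,2022-06-16,85.3,"Gausman, Kevin"
1,FF,2022-06-29,92.1,"Taillon, Jameson"
2,FF,2022-06-04,93.5,"Akin, Keegan"
"""

# index_col=0 uses the saved index instead of creating an 'Unnamed: 0' column,
# and parse_dates reads game_date as dates rather than plain text.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["game_date"])

# A fixed random_state makes the "random" sample the same on every run.
first_draw = demo.sample(2, random_state=42)
second_draw = demo.sample(2, random_state=42)
print(first_draw.equals(second_draw))  # True
```

The value `42` is arbitrary; any whole number works, as long as everyone uses the same one.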


Each row represents a single pitch, with 92 columns of data (as shown in the summary beneath the dataframe above). Because there are more columns than space in the notebook, Jupyter hides many of the columns in the middle of the dataframe. To see all the columns, we can use the code below:

In [7]:
pitch_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18264 entries, 0 to 18263
Data columns (total 92 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   pitch_type                       18248 non-null  object        
 1   game_date                        18264 non-null  datetime64[ns]
 2   release_speed                    18224 non-null  float64       
 3   release_pos_x                    18224 non-null  float64       
 4   release_pos_z                    18224 non-null  float64       
 5   player_name                      18264 non-null  object        
 6   batter                           18264 non-null  int64         
 7   pitcher                          18264 non-null  int64         
 8   events                           4726 non-null   object        
 9   description                      18264 non-null  object        
 10  spin_dir                         0 non-null      float64  

The `info()` function tells us:
1. There are 18,264 pitches (rows) in the data
1. The name of each column, in order
1. How many rows in each column contain data (are 'non-null')
1. The type of data each column contains
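
Each of these four pieces of information can also be retrieved on its own. The sketch below uses a small made-up dataframe (the values are hypothetical, not taken from the real file) so that the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical mini-dataframe; None marks a missing value.
demo = pd.DataFrame({
    "pitch_type": ["SL", "FF", None],
    "release_speed": [85.3, 92.1, 93.5],
    "events": [None, None, "strikeout"],
})

n_rows, n_cols = demo.shape          # 1. number of rows (and columns)
column_names = list(demo.columns)    # 2. column names, in order
non_null = demo.notna().sum()        # 3. non-null count for each column
dtypes = demo.dtypes                 # 4. data type of each column

print(n_rows, n_cols)  # 3 3
print(column_names)
print(non_null)
print(dtypes)
```

The same four lines should work on `pitch_data` if you swap it in for `demo`.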


[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)