![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Callysto’s Weekly Data Visualization

## Baseball Pitching Statistics

### Recommended Grade levels: 6-9
<br>

### Instructions
#### “Run” the cells to see the graphs
Click “Cell” and select “Run All”.<br> This will import the data and run all the code, so you can see this week's data visualization. Scroll to the top after you’ve run the cells.<br> 

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don’t need to do any coding to view the visualizations**.
The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer? 
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

# Question

Have you ever wondered about baseball pitching?

Baseball is a very popular sport, played by children and adults across Central and North America (and elsewhere). In professional baseball, extensive data is collected on games and players, to help players, coaches, and spectators understand who is playing well and what strategies work well to propel a team to the championship.

Baseball data is available from many sources. Let's see how we can use this data to explore some aspects of the game of baseball.

### Goal

The pitcher is a key player in baseball games, in charge of throwing the ball from the pitcher's mound across to home plate, where he or she attempts to strike out the batter. The pitcher can adjust the way the ball is thrown, using various speeds and delivery methods to try to fool the batter into swinging and missing the ball (a strike) or not swinging at a good ball (a called strike). 

Over the years, certain pitches have gained popularity for their success in games. You may have heard of some of them: fast balls, curve balls, even change ups and more.

Here is a diagram from the website https://rocklandpeakperformance.com/ showing more types of pitches:
<img src="images/Pitch-Shapes.png" width= 400>

Our goal is to use major league baseball data to learn if there are numerical differences that we can observe between different types of pitches.

We will use scatter plots to characterize these types of pitches by things like speed and spin on the ball. We will use pie charts (or bar charts) to look at the success rate for earning strikes, as determined by pitch type. 

This will give us a firm understanding of how these types of pitches differ. 



# Gather

### Code:
The code below will import the Python programming libraries we need to gather and organize the data to answer our question.

In [None]:
## import libraries
import pandas as pd
import plotly.express as px

### Data:

There are many great sources for baseball data. Some of our favourites are https://www.mlb.com/stats/, https://www.baseball-reference.com/, and https://www.fangraphs.com/.

To keep things simple, we downloaded two sets of data from the GitHub repository https://github.com/palewire/baseball-notebooks as Comma Separated Values (csv) files containing complete pitch records of the major league pitchers Jon Lester (born in Tacoma, Washington) and Yu Darvish (born in Japan). We can then load them directly into this Jupyter notebook.

### Import the data

In [None]:
## import data, for the two pitchers Jon Lester and Yu Darvish
pitcher_jl = pd.read_csv("./data/lester-pitches.csv", low_memory=False)
pitcher_yd = pd.read_csv("./data/darvish-pitches.csv", low_memory=False)


### Comment on the data

Once the data is loaded, it is a good idea to check that the information has been properly stored. A quick check is to see the shape of each data frame. The shape command will tell us how many rows and columns there are.

In [None]:
pitcher_jl.shape, pitcher_yd.shape

We see the first data frame has 36,288 rows while the second has 14,448. Each row is one pitch, so the first file has more pitches recorded than the second.
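To see what shape reports, here is a small sketch on a made-up table. The table name `toy` and its values are invented for illustration and are not taken from the pitch files. The first number is the row count and the second is the column count.

```python
import pandas as pd

# Made-up miniature table, for illustration only (not real pitch data)
toy = pd.DataFrame({
    "pitch_type": ["FF", "CU", "FF"],
    "release_speed": [93.1, 76.4, 92.5],
})

# shape is a (rows, columns) pair, so it can be unpacked into two names
n_rows, n_cols = toy.shape
print(n_rows, "rows and", n_cols, "columns")
```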

We can also get information about the names of columns, using the info command.

In [None]:
pitcher_jl.info()

From the above, we see there is a lot of information about each pitch. We will be interested in the pitch_type, release_speed, and release_spin_rate columns.

# Organize

The code below will arrange the data cleanly so that we can do analysis on it. This is a quality control step for our data and involves examining the data to detect anything odd with the data (e.g. structure, missing values), fixing the oddities, and checking if the fixes worked. 

The first thing we do is remove any rows where the pitch type has not been entered (these are the "null" entries). We do this by selecting the rows that are "not null" and updating the data frame appropriately.


In [None]:
# data cleaning
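# .copy() gives us independent tables, so adding new columns later does not trigger a pandas warning
pitcher_jl = pitcher_jl[pitcher_jl.pitch_type.notnull()].copy()
pitcher_yd = pitcher_yd[pitcher_yd.pitch_type.notnull()].copy()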
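The filtering step can be tried on a tiny example first. This is a minimal sketch that assumes a made-up table called `toy` with one missing pitch type; the values are invented for illustration.

```python
import pandas as pd

# Made-up miniature table with one missing pitch type (illustration only)
toy = pd.DataFrame({
    "pitch_type": ["FF", None, "CU", "SL"],
    "release_speed": [93.1, 88.0, 76.4, 84.2],
})

# notnull() gives True or False for each row
has_type = toy.pitch_type.notnull()

# Using the True/False values as a filter keeps only the True rows
cleaned = toy[has_type]

print(len(toy), "rows before,", len(cleaned), "rows after")
```

The row with the missing pitch type is dropped, while its neighbours are kept unchanged.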


The next thing we do is check to see what types of pitches are included in the data files. We can list the unique entries in the pitch type as follows:

In [None]:
pitcher_jl.pitch_type.unique(), pitcher_yd.pitch_type.unique()

We notice that the first pitcher, Jon Lester, has only five types of pitches, while the second pitcher, Yu Darvish, has many more. 
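The unique command lists which pitch types appear. If we also want to know how often each one appears, the pandas method `value_counts` can help. Here is a sketch on a made-up column (the values are invented for illustration).

```python
import pandas as pd

# Made-up list of pitch codes, for illustration only
toy = pd.DataFrame({"pitch_type": ["FF", "CU", "FF", "SL", "FF", "CU"]})

# unique() lists each code once
kinds = toy.pitch_type.unique()

# value_counts() counts how many times each code appears, most common first
counts = toy.pitch_type.value_counts()

print(len(kinds), "different pitch types")
print(counts)
```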

Let's take these labels and write them out in full, and map them into the data record.


In [None]:
labels = {
    'CH': 'Change Up',
    'CU': "Curve",
    'FC': 'Cutter',
    'FF': "Four seamer",
    'FT': "Two seamer",
    'FS': "Splitter",
    'IN': "Intentional Ball",
    'PO': "Pitch out",
    'SI': "Sinker",
    'SL': "Slider",
    'EP': "Eephus"
}

pitcher_jl['pitch_name'] = pitcher_jl.pitch_type.map(labels)

pitcher_yd['pitch_name'] = pitcher_yd.pitch_type.map(labels)
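It is worth knowing what `map` does when a code is not in the dictionary: it fills in a missing value rather than stopping with an error. The sketch below uses a shortened, hypothetical dictionary called `toy_labels` and a made-up code "ZZ" to show this.

```python
import pandas as pd

# Shortened dictionary, for illustration only
toy_labels = {"FF": "Four seamer", "CU": "Curve"}

# "ZZ" is a made-up code that is deliberately left out of the dictionary
toy = pd.DataFrame({"pitch_type": ["FF", "CU", "ZZ"]})

# map() looks up each code; codes it cannot find become missing values (NaN)
toy["pitch_name"] = toy.pitch_type.map(toy_labels)

# List any codes that did not get a full name
unlabeled = toy[toy.pitch_name.isnull()].pitch_type.unique()
print("Codes without a label:", list(unlabeled))
```

Running a check like this after mapping is one way to spot pitch codes that still need an entry in the dictionary.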



### Comment on the data

We notice now that the data frames are smaller, as we have removed the rows with no pitch type. We can check this by looking again at the shapes.

In [None]:
pitcher_jl.shape, pitcher_yd.shape

# Explore (1)

The code below will be used to help us look for evidence to answer our question, on what makes pitches different. This can involve looking at data in table format, applying math and statistics, and creating different types of visualizations to represent our data.

Two important pieces of data are the speed at which a ball is thrown (a fast ball is often harder to hit) and the rate of spin put on the ball (a quickly rotating ball will cause an aerodynamic force on the ball that makes it move in a curved path that is hard to hit). 

We can plot these two values in a scatter plot, and group by the type of pitch. This may give us an easy way to characterize these pitches.

In [None]:
# data exploration
px.scatter(data_frame=pitcher_jl,x='release_speed',y='release_spin_rate',color='pitch_name',
        title='Jon Lester, Pitch speed and spin',
        labels={'release_speed':"Speed (MPH)", "release_spin_rate":"Spin rate (RPM)", "pitch_name":"Pitch name"})


In [None]:
# data exploration
px.scatter(data_frame=pitcher_yd,x='release_speed',y='release_spin_rate',color='pitch_name',
        title='Yu Darvish, Pitch speed and spin',
        labels={'release_speed':"Speed (MPH)", "release_spin_rate":"Spin rate (RPM)", "pitch_name":"Pitch name"})


# Interpret

In both scatter plots, we see that the data points cluster nicely according to the color indicating pitch type. For both Lester and Darvish, the Four Seamer and Two Seamer fast balls are their faster pitches at 90 to 95 miles per hour. Their curve balls are much slower, between 70 and 80 miles per hour.

Lester has some odd pitches with a slow spin, less than 1500 rotations per minute. Are these just bad throws? Usually a pitcher wants a fast rotation on the ball, to help make it curve or drop as it travels through the air.

Darvish has some very slow pitches, at just 60 miles per hour. By hovering over the data points, we see these are labeled "Intentional Ball", which is a strategy used by pitchers to send a pitch that is far outside the strike zone, so the batter cannot hit it even though it is slow and easy. 

We also see Darvish has a few slow pitches (65 MPH) labeled as "Eephus" which means a "nothing pitch", from the Hebrew word "efes" meaning "nothing."
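The speeds and spin rates we read off the plots can also be checked with numbers. One possible approach is to group the rows by pitch name and take the average of each group. The sketch below uses a made-up table (`toy`) with invented values, not the real pitch records.

```python
import pandas as pd

# Made-up miniature table, for illustration only
toy = pd.DataFrame({
    "pitch_name": ["Four seamer", "Four seamer", "Curve", "Curve"],
    "release_speed": [93.0, 95.0, 74.0, 78.0],
    "release_spin_rate": [2200.0, 2300.0, 2500.0, 2700.0],
})

# Group the rows by pitch name, then average the speed and spin in each group
summary = toy.groupby("pitch_name")[["release_speed", "release_spin_rate"]].mean()
print(summary)
```

Replacing `toy` with `pitcher_jl` or `pitcher_yd` would give one row of averages per pitch type for that pitcher.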

# Explore (2)

Since we have the data, let's explore a bit more. The data we have covers several years of pitching. Can we see if the style of the pitcher changes as the years go by?

Let's try plotting pitching speed as a function of time. We again can use a scatter plot, with coordinates of game date and release speed. 

In [None]:
px.scatter(data_frame=pitcher_jl,x='game_date',y='release_speed',color='pitch_name',
        title='Jon Lester, Pitch speed over time',
        labels={'game_date':"Date",'release_speed':"Speed (MPH)", "pitch_name":"Pitch name"})


In [None]:
px.scatter(data_frame=pitcher_yd,x='game_date',y='release_speed',color='pitch_name',
        title='Yu Darvish, Pitch speed over time',
        labels={'game_date':"Date",'release_speed':"Speed (MPH)", "pitch_name":"Pitch name"})


# Interpret

We notice that Jon Lester's fastest pitches (Four seamer fast ball) slowed down over the years. We can see this as the top of the data in each year tends to go down. Hovering over the data points, we see the fastest pitch in 2009 was 99.1 MPH, while in the last year of 2019, the fastest was 91.7 MPH.

For Yu Darvish, the top speed stayed fairly constant, except for his last year in 2019. Interestingly, his curve ball seems to have sped up over time.
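Instead of hovering over points to find the fastest pitch in each year, we could ask pandas to compute it. This is a sketch on made-up data; it assumes the dates are stored as text such as "2009-05-01", and the `year` column is a helper added only for this example.

```python
import pandas as pd

# Made-up miniature table, for illustration only
toy = pd.DataFrame({
    "game_date": ["2009-05-01", "2009-08-15", "2019-06-10", "2019-09-02"],
    "release_speed": [96.0, 98.0, 91.0, 90.0],
})

# Assumption: dates are text; convert them to real dates and pull out the year
toy["year"] = pd.to_datetime(toy.game_date).dt.year

# The fastest pitch in each year
top_speed = toy.groupby("year").release_speed.max()
print(top_speed)
```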

# Explore (3) Strike zones

A pitcher must attempt to throw the ball into the strike zone of the batter, which is a rectangular region above home plate extending from the batter's knees to the mid-point of the batter's back. Here is a diagram of the strike zone, from the website https://www.umpirebible.com/
<img src="images/StrikeZone.png" width= 400>

We can make a scatter plot of the position of the ball as it passes across home plate. We do this by selecting the plate_x and plate_z coordinates from the data frame. Here, the x position is the horizontal position along the plate, and z position is the height (in feet). 

In [None]:
px.scatter(data_frame=pitcher_jl,x='plate_x',y='plate_z',color='pitch_name',
        title='Jon Lester, Position at home plate (x,z), All pitches',
       labels={'plate_x':"X position", "plate_z":"Z position", "pitch_name":"Pitch name"})

We notice that some of the pitches have a negative z value -- this means they hit the dirt before going over the plate!
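We can also test with code whether each pitch crossed the plate inside a rectangle. The sketch below is only a rough illustration: it assumes plate_x is measured in feet from the centre of the plate, and the bottom and top heights (1.5 and 3.5 feet) are assumed values, since the real zone depends on each batter. The data are made up.

```python
import pandas as pd

# Assumed zone edges in feet, for illustration only
HALF_PLATE_WIDTH = 17 / 12 / 2   # home plate is 17 inches wide
ZONE_BOTTOM = 1.5                # assumed knee height
ZONE_TOP = 3.5                   # assumed top of the zone

# Made-up pitch positions, for illustration only
toy = pd.DataFrame({
    "plate_x": [0.0, 1.5, -0.3, 0.2],
    "plate_z": [2.5, 2.5, -0.4, 4.2],
})

# A pitch is inside the rectangle if it is close enough to the centre
# sideways AND between the bottom and top heights
in_zone = (toy.plate_x.abs() <= HALF_PLATE_WIDTH) & toy.plate_z.between(ZONE_BOTTOM, ZONE_TOP)

# A negative height means the ball bounced before reaching the plate
hit_dirt = toy.plate_z < 0

print(in_zone.sum(), "pitch(es) in the zone,", hit_dirt.sum(), "in the dirt")
```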

Some players will hit just about anything, even balls in the dirt. Here is a video of Vladimir Guerrero hitting a one bounce dirt ball, and more.

In [None]:
from IPython.display import IFrame
IFrame('https://www.youtube.com/embed/DoAdY3uP9c0', width=500, height=500)

## More exploring

It is interesting to plot only the pitches that were balls. These are the pitches that were judged by the umpire to not be in the strike zone. We plot this by selecting the rows where the type column equals "B" for Balls. 

In [None]:
px.scatter(data_frame=pitcher_jl[pitcher_jl.type=='B'],x='plate_x',y='plate_z',color='pitch_name',
        title='Jon Lester, Position at home plate (x,z), Called Balls',
       labels={'plate_x':"X position", "plate_z":"Z position", "pitch_name":"Pitch name"})

# Interpret

By selecting just one pitch type, we see a few interesting things. In the balls-only display, the curve balls tend to fall low and to the left, while the sinkers tend to be low and to the right. The four seamer fastball, on the other hand, seems to fall randomly all around the strike zone. 
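To compare pitch types with numbers, we could work out what share of each pitch type ended up as a ball. This is a minimal sketch on made-up data. The code "B" for balls comes from the notebook above, while "S" for strikes is an assumption used here only to fill out the example.

```python
import pandas as pd

# Made-up miniature table, for illustration only
# "B" marks a ball; "S" is an assumed code for a strike
toy = pd.DataFrame({
    "pitch_name": ["Curve", "Curve", "Curve", "Curve", "Four seamer", "Four seamer"],
    "type": ["B", "S", "B", "B", "S", "B"],
})

# True where the pitch was a ball, False otherwise
is_ball = toy["type"] == "B"

# The average of True/False values is the share that are True
ball_share = is_ball.groupby(toy.pitch_name).mean()
print(ball_share)
```

A table like `ball_share` could then be passed to a bar chart to compare the pitch types side by side.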

# Communicate

Below are some writing prompts to help you reflect on the new information that is presented from the data. When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

- I used to think ____________________but now I know____________________. 
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

## Add your comments here

✏️

## Saving your work

You can download this notebook, with your comments, by using the **File** menu item in the toolbar above. You might like to download this in .html format so that the graphics remain active and can be viewed in your web browser.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)