![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Baseball Analytics

Welcome to another Jupyter notebook on baseball analytics. This notebook is a free resource and is part of the Callysto project, which brings data science skills to grades 5 to 12 classrooms. 

In this notebook, we’ll start by looking at some baseball statistical data, specifically on batters and what pitches they are good at hitting.

In Major League Baseball, computing statistics is key to understanding how players are valued by their teams. The movie Moneyball, with Brad Pitt and Jonah Hill, is all about baseball analytics.

Visualizations are coded using Python, a computer programming language. Programming languages are how people communicate with computers; Python contains words from English and is widely used by data scientists. Our graphics are done in Plotly, which makes it easy to create line charts, scatter plots and even heat maps. This is all great for understanding baseball statistics.

# Strike Zone Heatmaps

It can be of strategic interest to know which areas of the strike zone a hitter is most likely to make good contact in. Hitters want to wait for pitches in those zones, and pitchers want to avoid them. We'll explore two visualizations that can quickly show that: barreled balls, and *all* balls hit into play.

This visualization is a heatmap of a player's ['barreled balls'](https://www.mlb.com/glossary/statcast/barrel). A barreled ball has a specific definition, but in simple terms it refers to a hit where the ball leaves the bat with high speed and a favourable launch angle, increasing the batter's chances of reaching base safely:

<div>
<!-- <img src = 'img/r8ilsaxwpjhtbszghayu.jpg' width = 600/> -->
<img src = 'https://img.mlbstatic.com/mlb-images/image/private/mlb/r8ilsaxwpjhtbszghayu' width=750 />

</div>

<p>
    <font size = 2>
    https://img.mlbstatic.com/mlb-images/image/private/mlb/r8ilsaxwpjhtbszghayu
    </font>
</p>

## Barreled ball heatmaps

In a barreled ball heatmap, greater density equates to a higher rate of barreled balls and thus higher likelihood of a base hit (or even HR). Hitters want to focus on swinging at balls in the denser regions of their heatmaps, and pitchers want to avoid those regions.

## Coding

To start our coding, we import two useful libraries: **Plotly for graphing,** and **pybaseball** for accessing Major League Baseball data. 

To learn more about **pybaseball** you can access its documentation on GitHub: https://github.com/jldbc/pybaseball/tree/master/docs

Run the following code cells to complete this notebook.

In [None]:
!pip install pybaseball

In [None]:
from pybaseball import playerid_lookup, statcast_batter
import plotly.graph_objects as go

## Finding a player

We use the **playerid_lookup** function to find the ID of any player, which we will need to access the player database.

Here we look up the 2021 American League Most Valuable Player, Shohei Ohtani.

In [None]:
# Load data for 2021 AL MVP Shohei Ohtani
playerID = playerid_lookup('Ohtani', 'Shohei')
playerID

We see his player ID is **660271.** We use this number in the following command to look up his data for the baseball season of 2021. 

In [None]:
data = statcast_batter('2021-04-01', '2021-10-03', player_id = 660271)
data

## Examining the data

We see there is a lot of information here, contained in over 2500 rows and almost 100 columns. 

Let's first look at the names of all the columns.

In [None]:
data.columns

## Important columns

There is a lot of information in this data frame. Each row represents an individual pitch during some game, thrown to this particular player. Each column has information about the pitch: the type of pitch, the date of the game, the speed at which the pitch was thrown, and so on. 

In this notebook, we are interested in the **description** of the pitch, the position of the ball when it reaches the player (**plate_x** and **plate_z**, which are the horizontal and vertical coordinates of the position), the **launch_angle** and the **launch_speed.**
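
To keep the table manageable, the columns named above can be pulled into a smaller data frame. The sketch below runs on a tiny stand-in table with made-up values (not real Statcast rows), so it works without downloading anything. `KEY_COLUMNS` and `key_columns` are names introduced here for illustration only.

```python
import pandas as pd

# Columns this notebook focuses on (names as they appear in the data frame).
KEY_COLUMNS = ['description', 'plate_x', 'plate_z', 'launch_angle', 'launch_speed']

def key_columns(df):
    """Return a copy of df restricted to the columns used in this notebook."""
    return df[KEY_COLUMNS].copy()

# Stand-in rows with made-up values, only to show the shape of the result.
sample = pd.DataFrame({
    'pitch_type': ['FF', 'SL'],
    'description': ['ball', 'hit_into_play'],
    'plate_x': [1.2, 0.1],
    'plate_z': [3.9, 2.5],
    'launch_angle': [float('nan'), 25.0],
    'launch_speed': [float('nan'), 105.0],
})

small = key_columns(sample)
print(small)
```

With the real data, `key_columns(data)` would return the same five columns for every pitch. Pitches that were not hit have no launch angle or speed, which is why missing values appear in the stand-in table.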

The strike zone is also important: it is the rectangular box near the player where the pitcher is supposed to throw the ball. The strike zone is always 17 inches wide, so we can easily compute its left and right borders. The top and bottom borders depend on the player (they are based on the player's physical size), and we can gather them from the database.

In [None]:
sz_left = -17/2/12        # 17 inches, halved, in feet
sz_right = 17/2/12
sz_top = data['sz_top'].mean()  # average of the strike zone top
sz_bot = data['sz_bot'].mean()  # average of the strike zone bottom

## Plotting the Pitches

Let's use this data to plot the positions of the more than 2500 pitches thrown to this player. We use Plotly to create the chart, and draw the strike zone in the chart for reference.

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=data['plate_x'],
        y=data['plate_z'],
        mode='markers',
        name='Ball position'
    ))
fig.add_trace(
    go.Scatter(   # go.Line is deprecated; a Scatter trace in 'lines' mode draws the box
        x=[sz_left,sz_right,sz_right,sz_left,sz_left],
        y=[sz_bot,sz_bot,sz_top,sz_top,sz_bot],
        mode='lines',
        name='Strike zone'
    ))
fig.update_layout(
    title = "Shohei Ohtani, 2021 - Placement of Pitched Ball"
)
fig.update_xaxes(
    title="Horizontal position (feet)"
)
fig.update_yaxes(
    title="Vertical position (feet)"
)

fig.show()

## Observations

There are a lot of pitches that land outside the strike zone. These are called "balls" in baseball, and the player is not obliged to swing at them.
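
One way to put a number on this is to test each pitch against the edges of the strike zone. The sketch below is only an approximation: it treats the ball as a single point, uses made-up pitch positions, and assumes values for the top and bottom of the zone. The function name `in_strike_zone` is introduced here for illustration.

```python
import pandas as pd

def in_strike_zone(df, left, right, bottom, top):
    """Boolean Series: True where the pitch position lies inside the zone."""
    return df['plate_x'].between(left, right) & df['plate_z'].between(bottom, top)

# Stand-in pitches with made-up positions (feet); not real Statcast rows.
pitches = pd.DataFrame({
    'plate_x': [0.0, 1.5, -0.5, 0.3],
    'plate_z': [2.5, 2.5, 4.2, 1.0],
})

# The zone is 17 inches wide, as in the notebook; 1.6 and 3.4 feet are
# assumed values for the bottom and top, chosen only for this example.
half_width = 17 / 2 / 12
inside = in_strike_zone(pitches, -half_width, half_width, 1.6, 3.4)

# The mean of a boolean Series is the share of True values.
share_outside = 1 - inside.mean()
print(f"{share_outside:.0%} of the sample pitches missed the zone")
```

With the notebook's own variables, the call would be `in_strike_zone(data, sz_left, sz_right, sz_bot, sz_top)`.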

The strike zone looks unusual in the above chart because the scales of the axes are not the same. We will fix this in the following charts, using the **scaleratio** setting in Plotly.

## Types of pitches and where they go

We can use the description of the pitches to narrow the dataset to something more relevant to the player's performance. 

First, let's examine the entries in the **description** column and find what types of pitches occur.

In [None]:
data['description'].unique()

## Hit into play

The pitches marked **hit_into_play** are the ones where the batter actually hit the ball into the field of play, so the runners may start running around the bases. This is generally a good thing: the batter made contact and put the ball in play, although the play can still end in an out.

Let's plot just those pitches. We use a selector in the data frame to grab just those pitches with the description **hit_into_play.**

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=data[data['description'] =='hit_into_play']['plate_x'],
        y=data[data['description'] =='hit_into_play']['plate_z'],
        mode='markers',
        name='Ball position'
    ))
fig.add_trace(
    go.Scatter(   # go.Line is deprecated; a Scatter trace in 'lines' mode draws the box
        x=[sz_left,sz_right,sz_right,sz_left,sz_left],
        y=[sz_bot,sz_bot,sz_top,sz_top,sz_bot],
        mode='lines',
        name='Strike zone'
    ))
fig.update_layout(
    title = "Shohei Ohtani, 2021 - Ball Position: Hit Into Play",
    height = 600
)
fig.update_xaxes(
    ##range=(-4,4),
    constrain = "domain",
    title="Horizontal position (feet)"
)
fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
    title="Vertical position (feet)"
)

fig.show()

## Observations

In the chart above, we notice that most of the pitches that were hit into play were inside or very close to the strike zone. This is expected: these are the pitches the player has a good chance to hit well.
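
The selector used for the chart above can also be stored once and reused, which avoids filtering the data frame separately for the x and y values. This is a minimal sketch on made-up stand-in rows; the variable names `is_in_play` and `in_play` are introduced here for illustration.

```python
import pandas as pd

# Stand-in rows with made-up values; the real frame has one row per pitch.
pitches = pd.DataFrame({
    'description': ['ball', 'hit_into_play', 'foul', 'hit_into_play'],
    'plate_x': [1.4, 0.2, -0.6, -0.1],
    'plate_z': [3.8, 2.4, 2.0, 2.9],
})

# Build the boolean mask once, then select the matching rows with .loc.
is_in_play = pitches['description'] == 'hit_into_play'
in_play = pitches.loc[is_in_play]

# Count how often each description occurs.
counts = pitches['description'].value_counts()

print(in_play[['plate_x', 'plate_z']])
print(counts)
```

The filtered frame can then be passed to the chart as `x=in_play['plate_x']` and `y=in_play['plate_z']`. The counts give a quick sense of how many pitches fall under each description.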

## Hit by pitch

Sometimes the pitcher's throw goes astray and the ball actually hits the batter. Whether this happened on purpose or by accident is not recorded! In any case, it is interesting to see a chart of where the hit-by-pitch balls were thrown.

Again, we use a selector to only show the data whose description says **hit_by_pitch.**

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=data[data['description'] =='hit_by_pitch']['plate_x'],
        y=data[data['description'] =='hit_by_pitch']['plate_z'],
        mode='markers',
        name='Ball position'
    ))
fig.add_trace(
    go.Scatter(   # go.Line is deprecated; a Scatter trace in 'lines' mode draws the box
        x=[sz_left,sz_right,sz_right,sz_left,sz_left],
        y=[sz_bot,sz_bot,sz_top,sz_top,sz_bot],
        mode='lines',
        name='Strike zone'
    ))
fig.update_layout(
    title = "Shohei Ohtani, 2021 - Ball Position: Hit By Pitch",
    height = 600
)
fig.update_xaxes(
    range=(-4,4),
    constrain = "domain",
    title="Horizontal position (feet)"
)
fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
    title="Vertical position (feet)"
)

fig.show()

## Barreled balls

Finally, we want to take a look at those pitches that the batter hit really, really well. These are called "barreled balls" and they typically occur in a spot where the batter is usually very successful at hitting.

There is a formula to determine which hits count as **barreled balls.** It depends on the angle and speed at which the ball is launched off the bat as the batter hits it. We found the formula in the code for a library called **baseballr** which you can look at here:
https://github.com/BillPetti/baseballr/blob/master/R/sch_code_barrel.R

The code is written in the R language and looks like this:
```
code_barrel <- function(df) {
  df$barrel <- with(df, ifelse(launch_angle <= 50 & launch_speed >= 97 & launch_speed * 1.5 - 
                                 launch_angle >= 117 & launch_speed + launch_angle >= 123, 1, 0))
  return(df)
}
```

We translate this to Python as follows, creating a new column in our data frame called **barreled.**

In [None]:
data['barreled']=(data['launch_angle'] <= 50)&(data['launch_speed'] >= 97)& \
    (data['launch_speed']*1.5 - data['launch_angle'] >= 117) & (data['launch_speed'] + data['launch_angle'] >= 123)

## Plotting, heat map

Let's plot the barreled balls and their positions. We include a heat map, to focus attention on where most of these balls arrived. 

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=data[data['barreled']]['plate_x'],
        y=data[data['barreled']]['plate_z'],
        mode='markers',
        name='Ball position'
    ))
fig.add_trace(
    go.Scatter(   # go.Line is deprecated; a Scatter trace in 'lines' mode draws the box
        x=[sz_left,sz_right,sz_right,sz_left,sz_left],
        y=[sz_bot,sz_bot,sz_top,sz_top,sz_bot],
        mode='lines',
        name='Strike zone'
    ))
fig.add_trace(
    go.Histogram2d(
        x=data[data['barreled']]['plate_x'],
        y=data[data['barreled']]['plate_z'],
        colorscale='YlGnBu',
        zmax=15,
        nbinsx=10,
        nbinsy=10,
        zauto=False,
))
fig.update_layout(
    title = "Shohei Ohtani, 2021 - Ball Position: Barreled Ball",
    height = 600,showlegend=False
)
fig.update_xaxes(
    range=(-4,4),
    constrain = "domain",
    title="Horizontal position (feet)"
)
fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
    title="Vertical position (feet)"
)

fig.show()

## Observations

The heat map shows the barreled balls are concentrated just about the center of the strike zone. For this player, this is the ideal place to hit the ball. 
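
To repeat this analysis for another hitter, the barrel rule from the earlier cell can be wrapped in a small function. The sketch below uses the same four thresholds as this notebook and checks them on made-up batted balls. The Python function `code_barrel` is named after the R original for illustration and is not part of pybaseball.

```python
import pandas as pd

def code_barrel(df):
    """Return a boolean Series marking barreled balls (thresholds from this notebook)."""
    angle = df['launch_angle']
    speed = df['launch_speed']
    return (
        (angle <= 50)
        & (speed >= 97)
        & (speed * 1.5 - angle >= 117)
        & (speed + angle >= 123)
    )

# Stand-in batted balls with made-up values (degrees, miles per hour).
# The last row mimics a pitch that was not hit, so both values are missing.
hits = pd.DataFrame({
    'launch_angle': [26.0, 60.0, 10.0, float('nan')],
    'launch_speed': [105.0, 105.0, 90.0, float('nan')],
})
hits['barreled'] = code_barrel(hits)
print(hits)
```

Comparisons with missing values evaluate to False in pandas, so pitches without a launch angle or speed are never marked as barreled.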

## Going further

Try exploring the data for other players and other seasons. Who are your favourite hitters? What can you learn about them from this data exercise?
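
When switching players, the ID can be read from the lookup table instead of being typed by hand. The sketch below assumes the table returned by `playerid_lookup` stores the ID in a column named `key_mlbam`. It uses a hand-built stand-in table so it runs offline, and `first_mlbam_id` is a hypothetical helper introduced here.

```python
import pandas as pd

def first_mlbam_id(lookup_table):
    """Return the MLB player ID from the first row of a lookup table."""
    if lookup_table.empty:
        raise ValueError("no player found")
    return int(lookup_table['key_mlbam'].iloc[0])

# Stand-in for the table returned by playerid_lookup('Ohtani', 'Shohei');
# only the columns needed for this example are included.
lookup = pd.DataFrame({
    'name_last': ['ohtani'],
    'name_first': ['shohei'],
    'key_mlbam': [660271],
})

player_id = first_mlbam_id(lookup)
print(player_id)
```

In the notebook, the result could be passed on as `statcast_batter('2021-04-01', '2021-10-03', player_id=player_id)`. If several players share a name, the lookup returns several rows, so it is worth checking the table before relying on the first one.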

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)