![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Baseball - Challenges

Now that you've gone through the introduction notebook and learned how to navigate Jupyter Notebooks, Python, and some useful libraries in pandas and Plotly, we can get a bit more creative with our questions. This notebook will expand on what you've learned and allow you to modify the code as you need. Don't be afraid to refer back to the previous notebook if you have any questions.

### Prep work

In [None]:
# Import/install libraries
import pandas as pd
import plotly.express as px
try:
    import pybaseball as pbb
except ImportError:  # Only fall back to installing when the package is missing
    !pip install pybaseball --user
    import pybaseball as pbb

In [None]:
# Import data
pitch_data = pd.read_csv('data/ale_pitch_data_june.csv')
pitch_data.head()


In our baseball data here, we have 92 columns of data describing each pitch. In the previous notebook, we looked at only a few of those columns. Now we'll introduce more data, more techniques, and then let you get creative answering some more questions.

## Infield alignment

As data has become so prevalent in the sport of baseball, many aspects of the game have noticeably changed. One that's been quite visible when watching the game in the past few years is the [infield shift](https://en.wikipedia.org/wiki/Infield_shift). The gist of the strategy is to position the infielders in areas of the diamond that will increase their chances of making an out. This is done for each batter, and starts with looking at the batter's hitting tendencies over a certain timeframe (e.g. career, season, month) and placing the infielders where the ball is most likely to get hit. MLB uses specific criteria to define each type of shift; details can be found [here](https://www.mlb.com/glossary/statcast/shifts).

We can start by looking at how frequently infield shifting occurs in this dataset:

In [None]:
if_shift_perc = pitch_data['if_fielding_alignment'].value_counts(normalize=True) # Look at each option in `if_fielding_alignment` and calculate its percentage of the whole

# In pie chart form
px.pie(values=if_shift_perc, names=if_shift_perc.index, title='Percentage of infield shifting in the AL East in June 2022')
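
To see what `value_counts(normalize=True)` is doing before the result reaches the pie chart, here is a minimal sketch on a made-up Series. The labels and counts below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one alignment label per pitch (not real Statcast rows)
toy_alignment = pd.Series(
    ['Standard', 'Infield shift', 'Standard', 'Strategic', 'Standard', None]
)

counts = toy_alignment.value_counts()                # Raw count of each label
shares = toy_alignment.value_counts(normalize=True)  # Same counts as fractions of the total

print(counts)
print(shares)
```

Note that missing values are dropped by default, so the fractions are calculated over the five labelled rows only.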

MLB and the MLB Players Association have agreed to new rules, starting in the 2023 season, that effectively ban infield shifting. As we can see from the pie chart above, a shift was in place for a little less than half of the pitches in this dataset. Is it an effective strategy, or is the change in rules going to have a minimal impact on the game?

We can investigate that by comparing how many outs were made with shifting to how many were made without it. First, we have to look only at plays made by infielders. Thankfully, the data has a field that allows us to do that: `hit_location`. Values in this field are numbers that represent the *position* of the first defensive player to touch the ball, according to a numbering convention in baseball:

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Baseball_positions.svg/1920px-Baseball_positions.svg.png width=400>
<p>
<a href='https://en.wikipedia.org/wiki/Baseball_positions'>https://en.wikipedia.org/wiki/Baseball_positions </a>
</p>

Therefore, we'll filter our data to only look at rows where the hit was to either the first baseman (3), second baseman (4), shortstop (6), or third baseman (5):

In [None]:
if_hits = pitch_data[pitch_data['hit_location'].isin([3, 4, 6, 5])] # Hits to the infielders only
if_hits
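
If `.isin()` is new to you, the sketch below shows on a small made-up dataframe that it builds a True/False mask, and that it's shorthand for chaining several `==` checks with `|` ("or"). The values here are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; None stands in for pitches with no hit location recorded
toy = pd.DataFrame({'hit_location': [3, 8, 6, None, 4, 1]})

# True where the value is one of the infield positions, False otherwise
mask = toy['hit_location'].isin([3, 4, 5, 6])

# The long way of writing the same check
same_mask = (
    (toy['hit_location'] == 3) | (toy['hit_location'] == 4)
    | (toy['hit_location'] == 5) | (toy['hit_location'] == 6)
)

infield_only = toy[mask]  # Keep only the rows where the mask is True
print(infield_only)
```

Rows with a missing `hit_location` come out as False in the mask, so they're filtered out along with the non-infield positions.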

The purpose of shifting the infielders is to stop the ball before it leaves the infield, and make an out. With the exception of some rare circumstances where hitters can get a hit *without* the ball leaving the infield, infield shifting generally fails when the ball is hit *through* the infield and is picked up by an outfielder.

To account for that, we'll also look at plays where the ball was first touched by an outfielder. However, we need to make sure we only look at balls that *could* have been fielded by an infielder, assuming they were perfectly positioned. That means we need to look only at **ground balls** when counting balls fielded by outfielders ([see here for types of batted balls](https://en.wikipedia.org/wiki/Batted_ball#Characterization)):

In [None]:
of_hits = pitch_data[(pitch_data['hit_location'].isin([7, 8, 9])) & (pitch_data['bb_type']=='ground_ball')] # Filter by both hits to the outfielders, as well as ground balls
of_hits
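
The filter above combines two conditions with `&` ("and"). As a minimal sketch on made-up rows, it can help to name each condition first and then combine them. Each condition needs its own parentheses when written inline, because `&` is evaluated before `==` in Python.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    'hit_location': [7, 8, 9, 6, 7],
    'bb_type': ['ground_ball', 'fly_ball', 'line_drive', 'ground_ball', 'ground_ball'],
})

to_outfield = toy['hit_location'].isin([7, 8, 9])  # First touched by an outfielder
is_grounder = toy['bb_type'] == 'ground_ball'      # Batted ball was a ground ball

# A row is kept only if BOTH conditions are True
both = toy[to_outfield & is_grounder]
print(both)
```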

Similarly, some balls fielded by infielders are less affected by where they've been positioned. Specifically, a 'pop-up' is a type of hit where the fielder has time to position themselves underneath the ball. Whether an infielder is shifted or not, they would have ample time to catch a pop-up, so we can remove those hits from our subsetted data.

This is also an opportunity to compare the **equality** operators in Python. The operator `==` is a comparison where you're checking if one variable is **equal to** another. Conversely, the operator `!=` is the opposite, only returning matches where the two variables being compared are **not equal**.

Below we're returning all results *except* those that equal 'popup':

In [None]:
if_hits = if_hits[if_hits['bb_type']!='popup']  # Note how we're also overwriting the dataframe by using the same name
if_hits
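
Here is a small sketch of `==` and `!=` side by side, using made-up values. One thing worth knowing: a missing value is never equal to `'popup'`, so `!=` keeps rows where `bb_type` is empty. Whether that matters for the real data depends on how many rows have a missing `bb_type`, which we haven't checked here.

```python
import pandas as pd

# Hypothetical toy data; None stands in for a missing batted-ball type
toy = pd.DataFrame({'bb_type': ['popup', 'ground_ball', None, 'line_drive']})

is_popup = toy['bb_type'] == 'popup'   # True only for the popup row
not_popup = toy['bb_type'] != 'popup'  # True for everything else, including the missing value

kept = toy[not_popup]
print(kept)
```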

In [None]:
if_hits_perc = if_hits['if_fielding_alignment'].value_counts(normalize=True) # Calculate percentage of total values for each shifting strategy for infield hits
of_hits_perc = of_hits['if_fielding_alignment'].value_counts(normalize=True) # Do the same as above for outfield hits
all_hits_perc = pd.concat([if_hits_perc, of_hits_perc], axis=1, keys=['Infield', 'Outfield']) # Combine the two Series side by side into one dataframe
all_hits_perc
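
As a minimal sketch of what `pd.concat(..., axis=1, keys=...)` does, the example below joins two made-up Series. The numbers are invented and are not results from the dataset.

```python
import pandas as pd

# Hypothetical percentages, invented for illustration
infield = pd.Series({'Standard': 0.5, 'Infield shift': 0.4, 'Strategic': 0.1})
outfield = pd.Series({'Standard': 0.7, 'Infield shift': 0.3})

# axis=1 places the Series side by side as columns; keys supplies the column names
combined = pd.concat([infield, outfield], axis=1, keys=['Infield', 'Outfield'])
print(combined)
```

The rows are matched up by their index labels, so a label that appears in only one Series (like 'Strategic' here) gets a missing value in the other column.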

Now let's compare them side by side:

In [None]:
px.bar(all_hits_perc, y=all_hits_perc.columns, x=all_hits_perc.index, barmode='group', 
       title='Comparison of hits to different fielders by infield positioning', 
       labels={'value':'Percentage of hits fielded',
               'index':'Infield positioning',
               'variable':'Fielded by'})

The above plot may look confusing at first, but we can draw some conclusions from it. First, teams shift *less* often than they use standard positioning: more than half of the balls fielded by either infielders or outfielders were hit against a standard infield alignment.

Second, and this is what we're trying to demonstrate, **when an infield shift is in place, a higher percentage of balls are hit to the infielders (versus outfielders) than when it isn't**.
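
Another way to look at the same question, offered here as an alternative view rather than the approach used above, is to calculate the infield/outfield split *within* each alignment. The sketch below uses `pd.crosstab` with `normalize='index'` on made-up rows. The `fielded_by` column is hypothetical (it isn't in the dataset) and the numbers are invented purely to show the technique.

```python
import pandas as pd

# Hypothetical toy data: one row per batted ball, invented for illustration
toy = pd.DataFrame({
    'if_fielding_alignment': ['Standard', 'Standard', 'Standard', 'Standard',
                              'Infield shift', 'Infield shift',
                              'Infield shift', 'Infield shift'],
    'fielded_by': ['Infield', 'Infield', 'Outfield', 'Outfield',
                   'Infield', 'Infield', 'Infield', 'Outfield'],
})

# normalize='index' makes each row (each alignment) sum to 1
share_by_alignment = pd.crosstab(
    toy['if_fielding_alignment'], toy['fielded_by'], normalize='index'
)
print(share_by_alignment)
```

Each row of the result answers the question "with this alignment, what fraction of balls were fielded by infielders versus outfielders?"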

### Follow-Up Questions

1. We looked at infield alignment, but there's also a column for outfield alignment (`of_fielding_alignment`). Can you draw similar conclusions from that data?
1. Instead of lumping many types of hits together (`bb_type`), what can you find by separating out the different hits?
1. `home_team` and `away_team` contain information on the teams in each game. Do some teams shift more than others?

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)