![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Baseball Challenge

Submitted by: 

From school: 

## Introduction

Now that you've gone through the introduction notebook and learned how to navigate Jupyter Notebooks, Python, and two useful libraries (pandas and Plotly), we can get a bit more creative with our questions. This notebook expands on what you've learned and lets you modify the code as you need. Don't be afraid to refer back to the previous notebook if you have any questions or need a reminder of what type of data each column holds.

## Prep work

In [None]:
%pip install -q pyodide_http plotly pybaseball nbformat
import pyodide_http
pyodide_http.patch_all()
import pybaseball as pbb
import pandas as pd
import plotly.express as px
pitch_data = pd.read_csv('data/ale_pitch_data_june.csv')
pitch_data.head()

## Grouping

Returning to our original dataset, let's do some statistics on the data we have.

The pandas `groupby` method below groups the rows by a column (here, the pitcher's name) and then calculates the mean pitch speed for each group.

In [None]:
pitcher_grp_mean = pitch_data.groupby(by='player_name')['release_speed'].mean() # Returning only the `release_speed` column
pitcher_grp_mean

We can repeat this looking at `max` as well:

In [None]:
pitcher_grp_max = pitch_data.groupby(by='player_name')['release_speed'].max()
pitcher_grp_max
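
If you want both statistics side by side, pandas can calculate them in a single step with `agg`. The sketch below uses a small made-up table (the names and speeds are invented, not taken from `pitch_data`) so you can see the shape of the result:

```python
import pandas as pd

# Toy data standing in for pitch_data (names and speeds are made up)
toy = pd.DataFrame({
    'player_name': ['Adams, Al', 'Adams, Al', 'Baker, Bo', 'Baker, Bo'],
    'release_speed': [90.0, 94.0, 97.0, 99.0],
})

# One groupby, two statistics: the result has one column per statistic
speed_summary = toy.groupby(by='player_name')['release_speed'].agg(['mean', 'max'])
print(speed_summary)
```

The same pattern should work on `pitch_data` by swapping `toy` for `pitch_data`.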

What if we're curious about which pitchers can throw sinkers over 95 mph? We can filter the data by combining conditional statements with `&`. The result has one row for each *pitch* that meets both conditions:

In [None]:
pitcher_95 = pitch_data[(pitch_data['release_speed']>95) & (pitch_data['pitch_name']=='Sinker')]
pitcher_95
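
A filtered dataframe like this counts pitches, not pitchers, because the same pitcher can appear on many rows. As a minimal sketch on made-up data (the names and speeds are invented), `nunique` counts each pitcher only once:

```python
import pandas as pd

# Toy data standing in for pitch_data (made-up values)
toy = pd.DataFrame({
    'player_name': ['Adams, Al', 'Adams, Al', 'Baker, Bo', 'Clark, Cy'],
    'pitch_name': ['Sinker', 'Sinker', 'Sinker', 'Slider'],
    'release_speed': [96.1, 97.3, 93.0, 98.2],
})

# Same filter as above: sinkers faster than 95 mph
fast_sinkers = toy[(toy['release_speed'] > 95) & (toy['pitch_name'] == 'Sinker')]

n_pitches = len(fast_sinkers)                      # number of rows (pitches)
n_pitchers = fast_sinkers['player_name'].nunique() # number of different pitchers
print(n_pitches, n_pitchers)
```

Here two pitches pass the filter, but both were thrown by the same pitcher.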

### Challenges - 1:

See if you can use the methods here, and what you've learned in the previous notebook, to tackle these challenges ([hint](https://www.geeksforgeeks.org/pandas-groupby-one-column-and-get-mean-min-and-max-values/)):
1. Which pitcher throws, on average, the fastest?
1. Which pitcher threw the hardest pitch in the dataset?
1. What is the highest average velocity for each pitch type?  

We can also return multiple columns when grouping:

In [None]:
pitch_data.groupby(by=['player_name', 'pitch_name'])[['release_speed', 'release_spin_rate']].mean()
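
Grouping by two columns gives a result that is indexed by *pairs* of values (pitcher and pitch type). The sketch below, on made-up data, shows two ways you might work with that result: looking up a single pair with `.loc`, or turning the index back into ordinary columns with `reset_index`:

```python
import pandas as pd

# Toy data standing in for pitch_data (made-up values)
toy = pd.DataFrame({
    'player_name': ['Adams, Al', 'Adams, Al', 'Adams, Al', 'Baker, Bo'],
    'pitch_name': ['Sinker', 'Sinker', 'Slider', 'Sinker'],
    'release_speed': [94.0, 96.0, 85.0, 92.0],
    'release_spin_rate': [2100.0, 2200.0, 2500.0, 2000.0],
})

grouped = toy.groupby(by=['player_name', 'pitch_name'])[['release_speed', 'release_spin_rate']].mean()

# Look up one (pitcher, pitch type) pair by passing both values as a tuple
al_sinker = grouped.loc[('Adams, Al', 'Sinker')]

# Turn the two index levels back into regular columns for sorting or plotting
flat = grouped.reset_index()
print(flat)
```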

## Batting

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/e/ee/Mookie_Betts_hitting_the_ball_%2836478781664%29.jpg/2560px-Mookie_Betts_hitting_the_ball_%2836478781664%29.jpg width=400>
<p>
<a href='https://en.wikipedia.org/wiki/Baseball'>https://en.wikipedia.org/wiki/Baseball </a>
</p>

Up until now we've only looked at data that focuses on the pitches, which is (unsurprisingly) mostly related to the pitcher. Though we can also access data explicitly on the hitters (see the end of this notebook for details on how), there's some hitter data available in what we already have. But first, we have to do some data cleaning.

In our original dataset, the column `batter` contains a number that uniquely identifies each batter. That's helpful for keeping them apart, but not very helpful for identifying *who* each batter is. For that, we use the code below from the `pybaseball` library.

First, we're going to take the entire `batter` column, pass it to the `playerid_reverse_lookup` function, and extract just the names that are returned:

In [None]:
batter_names = pbb.playerid_reverse_lookup(pitch_data['batter'])[['name_last', 'name_first']]
batter_names

For consistency, we can take these two columns, merge them into one, and format the names so they match the style of the pitcher names ('Lastname, Firstname').

Let's capitalize the names in each column individually:

In [None]:
batter_names['name_last'] = batter_names['name_last'].str.title()
batter_names['name_first'] = batter_names['name_first'].str.title()
batter_names

Now we can join ('con**cat**enate') the two names with a comma (and space), and create a new, single column:

In [None]:
batter_names_comb = batter_names['name_last'].str.cat(batter_names['name_first'], sep=', ')
batter_names_comb

Because the function to retrieve player names from IDs ignores duplicates (but retains order), we need to do the same with our IDs:

In [None]:
ids = pitch_data['batter'].drop_duplicates().to_list()
ids[:5] # Only showing the first five entries

Then we can combine them to create a mapping dictionary:

In [None]:
mapper = {ids[i]: batter_names_comb[i] for i in range(len(ids))} # Create a dictionary with key:value pairs of IDs and player names

Finally, use this mapping dictionary to replace the batter IDs with names. The `map` method looks up each ID among the dictionary's **keys** and replaces it with the corresponding **value**:

In [None]:
pitch_data['batter'] = pitch_data['batter'].map(mapper)
pitch_data
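
The dictionary above depends on the names coming back in the same order as the de-duplicated IDs. A sketch of an alternative that doesn't depend on row order is to pair each name with the ID returned alongside it. This assumes the lookup result includes a `key_mlbam` column holding the same IDs as `batter`; the small `lookup` table here is a made-up stand-in, not real output:

```python
import pandas as pd

# Hypothetical stand-in for the lookup result; IDs and names are made up.
# Note the rows are deliberately in a different order from the pitches below.
lookup = pd.DataFrame({
    'name_last': ['baker', 'adams'],
    'name_first': ['bo', 'al'],
    'key_mlbam': [222222, 111111],  # assumed ID column
})

# Batter IDs in the order the pitches were thrown (made up)
batter_ids = pd.Series([111111, 222222, 111111], name='batter')

# Build 'Lastname, Firstname' and pair each name with the ID from the same row
full_names = lookup['name_last'].str.title().str.cat(lookup['name_first'].str.title(), sep=', ')
id_to_name = dict(zip(lookup['key_mlbam'], full_names))

named = batter_ids.map(id_to_name)
print(named.tolist())
```

Because each name is matched to the ID from its own row, the mapping stays correct even if the lookup returns players in a different order.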

### Challenges - 2:

With the addition of batter names, see if you can tackle the next set of challenges:

1. Which batter has the highest average launch angle?
1. Only one batter has hit *two* balls over 425 feet. Who is it? (hint: use `hit_distance_sc`)
1. Plot a scatterplot with launch angle and hit distance. Approximately what launch angle (or range of angles) leads to the longest hits?  
    1. What can you say about negative launch angles?

## Visualizing

<img src=https://perceptionaction.com/wp-content/uploads/2016/03/BattingFig-basic-2.jpg width=400>
<p>
<a href='https://perceptionaction.com/battinghand/'>https://perceptionaction.com/battinghand/ </a>
</p>


One of the most distinctive (and polarizing) aspects of the game of baseball is the **strike zone**. [Under the current rules of baseball](https://www.mlb.com/glossary/rules/strike-zone), the pitch must cross the plate within a rectangle whose upper boundary is the midpoint between the batter's shoulders and the top of their pants, and whose lower boundary is a point just below the kneecap. The width of the strike zone is the same as the width of home plate, or 17 inches.

We can plot the pitches as they cross the plate below, and colour the points differently based on the handedness of the pitcher:

In [None]:
fig1 = px.scatter(data_frame=pitch_data, x='plate_x', y='plate_z', color='p_throws', opacity=0.4,
                  title='Pitched ball location (all balls)',
                  labels={'plate_x': 'Horizontal position (feet)',
                          'plate_z': 'Height (feet)',
                          'p_throws': 'Pitcher throws'},
                  height=800, width=800)
# Draw the strike zone: the width of the plate, by the average bottom and top of the zone
fig1.add_shape(type='rect', x0=-0.708, x1=0.708, y0=pitch_data['sz_bot'].mean(), y1=pitch_data['sz_top'].mean())
fig1.update_yaxes(scaleanchor='x', scaleratio=1) # Keep both axes on the same scale
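
The `x0` and `x1` values of the rectangle come from the 17-inch plate width mentioned earlier. As a quick check of the arithmetic, assuming the horizontal position is measured in feet from the centre of the plate (as the axis label suggests):

```python
PLATE_WIDTH_INCHES = 17
INCHES_PER_FOOT = 12

# Half the plate on each side of centre, converted from inches to feet
half_width_ft = PLATE_WIDTH_INCHES / 2 / INCHES_PER_FOOT
x0, x1 = -half_width_ft, half_width_ft

print(round(x1, 3))  # 0.708
```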

We can use the same filtering techniques as before to narrow our dataframe down to the specific information we want to look at. Use the code cell and resulting visualization below to answer the next set of questions:

### Challenges - 3:

1. For a left-handed pitcher and a right-handed batter, where in the strike zone does a slider most often cross the plate?  
    1. The dataset has a variable `zone` that breaks down the strike zone [into numbered areas](https://baseballsavant.mlb.com/site-core/images/attack-zone.png). Can you use this to find the result *without* visualization?
1. How many pitches over 100 mph are strikes (either called or whiffs)?  
    1. How many are in the strike zone?
1. (Difficult) In which direction outside the strike zone are Toronto Blue Jays batters most likely to either barrel the ball or make solid contact?  
    - Hint: the home team always bats in the bottom of the inning

In [None]:
# Use this variable to filter the dataframe
df = pitch_data

# Use this variable to show a colour difference in the column values
color='type'

fig1 = px.scatter(data_frame=df, x='plate_x', y='plate_z', color=color, opacity=0.3, # change the opacity (between 0 and 1) if the points overlap too much
                  title='Pitched ball location',
                  labels={'plate_x': 'Horizontal position (feet)',
                          'plate_z': 'Height (feet)',
                          'p_throws': 'Pitcher throws',
                          'stand': 'Batter handedness',
                          'type': 'Result',
                          'launch_speed': 'Launch speed',
                          'launch_speed_angle': 'Launch angle code'},
                  height=800, width=800)
fig1.add_shape(type='rect', x0=-0.708, x1=0.708, y0=pitch_data['sz_bot'].mean(), y1=pitch_data['sz_top'].mean())
fig1.update_yaxes(scaleanchor='x', scaleratio=1)
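
When a filter has several parts, it can be easier to read if each condition gets its own name before you combine them. The sketch below uses made-up data and a combination that is *not* one of the challenges; `isin` keeps rows whose value matches any item in a list:

```python
import pandas as pd

# Toy data standing in for pitch_data (made-up values)
toy = pd.DataFrame({
    'pitch_name': ['Curveball', 'Changeup', 'Sinker', 'Curveball'],
    'stand': ['L', 'R', 'L', 'L'],
    'plate_x': [0.2, -0.5, 1.1, -0.1],
    'plate_z': [2.4, 1.9, 3.3, 2.8],
})

# Name each condition, then combine them with & (and) or | (or)
off_speed = toy['pitch_name'].isin(['Curveball', 'Changeup'])
lefty_batter = toy['stand'] == 'L'

df = toy[off_speed & lefty_batter]
print(df)
```

In the cell above, you could build `df` from `pitch_data` in the same way before plotting.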

## Exploring

For most of this notebook, we've used data that we supplied ourselves, downloadable from the Callysto GitHub page. If you want to do your own analysis of baseball data, one of the best resources is the *[pybaseball](https://github.com/jldbc/pybaseball/tree/master/docs)* library. If you're not used to GitHub, the webpage linked there can be difficult to explore, so here are a few functions you might want to use:

- [playerid_lookup(last, first=None, fuzzy=False)](https://github.com/jldbc/pybaseball/blob/master/docs/playerid_lookup.md): Look up player ID using the player name by passing their last name, first name (optional), or return inexact matches (by setting `fuzzy=True`)
- [batting_stats(start_season, end_season=None, league='all', qual=1, ind=1)](https://github.com/jldbc/pybaseball/blob/master/docs/batting_stats.md): Season-level batting stats for all players, allowing you to set a minimum number of plate appearances (`qual`)
- [pitching_stats(start_season, end_season=None, league='all', qual=1, ind=1)](https://github.com/jldbc/pybaseball/blob/master/docs/pitching_stats.md): Season-level pitching stats (same arguments as above)
- [statcast(start_dt=[yesterday's date], end_dt=None, team=None, verbose=True, parallel=True)](https://github.com/jldbc/pybaseball/blob/master/docs/statcast.md): Look up Statcast data for any date or range of dates. This is the function we used to initially pull the data for these notebooks




We've already imported the library into this notebook as *pbb* (i.e. `pbb.playerid_lookup`), so feel free to make new cells and play around with the data and your newfound skills!

Have fun!

## Hackathon Reflections

Write about some or all of the following questions:

- What is something you learned through this process?
- How well did your group work together? Why do you think that is?
- What were some of the hardest parts?
- What are you proud of? What would you like to show others?
- Are you curious about anything else related to this? Did anything surprise you?
- How can you apply your learning to future activities?

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)