# MSAS Python Tutorial Five

**In this tutorial, we will go over two more important pandas methods for data analysis: merge and groupby**

First, take a minute to remind yourself what a method is

**Pandas Merge**

The merge method is an essential component of the pandas library because it lets the user combine two related datasets very easily

What are some reasons that combining CSV files would be useful?

One example would be that you receive new columns that correspond to the rows in a dataset you already have:

* We received new EPA data from StatsBomb to support more insightful analysis

In [None]:
import pandas as pd
import numpy as np

In [None]:
epa = pd.read_csv('Michigan 2022 EPA Data.csv')
pbp = pd.read_csv('StatsBomb Michigan 2022 Plays.csv')

In [None]:
pbp = pbp.merge(epa, on='play_uuid')

In [None]:
pbp.columns

What's going on here?

* Both dataframes have the same play IDs, so we can use the "on" parameter to tell pandas to match rows by play_uuid and attach the columns of each epa row to its corresponding row in the pbp dataframe
* The merge method also has the "how" parameter, which tells pandas which keys (and therefore which rows) to keep in the merged dataframe
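To see that the match is made by key rather than by row position, here is a minimal sketch on made-up data. The frames `plays` and `epa_values` and all of their values are hypothetical stand-ins for the real CSV files; only the column names `play_uuid` and `play_scrimmage_epa` come from this tutorial.

```python
import pandas as pd

# Hypothetical toy data standing in for the play-by-play and EPA files
plays = pd.DataFrame({
    'play_uuid': ['a1', 'b2', 'c3'],
    'play_type': ['Pass', 'Run', 'Pass'],
})
epa_values = pd.DataFrame({
    'play_uuid': ['c3', 'a1', 'b2'],          # same IDs, different row order
    'play_scrimmage_epa': [0.9, 0.4, -0.2],
})

# Rows are matched on play_uuid, not on their position in each frame
merged = plays.merge(epa_values, on='play_uuid')
print(merged)
```

Each play keeps its own EPA value even though the two frames list the plays in a different order.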

**The "how" parameter can be set to inner, outer, right, or left join. Try to think about what each would do**

* The parameter is inner by default, which keeps only the keys that appear in both dataframes (the **intersection**)
* Conversely, an outer join keeps all the keys from both dataframes, regardless of whether they intersect, and fills in missing values where a key has no match
* The right join keeps every key from the right dataframe and leaves out keys that appear only in the left
* The left join does the opposite: it keeps every key from the left dataframe and leaves out keys that appear only in the right
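The four join types can be compared side by side with a small sketch. The frames `left` and `right` and their values are invented for illustration; the loop just records which keys survive each kind of join.

```python
import pandas as pd

# Hypothetical frames: 'A' is only in left, 'D' is only in right
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'right_val': [20, 30, 40]})

keys_by_how = {}
for how in ['inner', 'outer', 'left', 'right']:
    result = left.merge(right, on='key', how=how)
    keys_by_how[how] = sorted(result['key'])
    print(how, keys_by_how[how])
```

In the outer, left, and right results, any key without a match gets NaN in the columns that came from the other dataframe.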

In [None]:
wrs = pd.DataFrame([{'Name':'Justin Jefferson', 'Receiving Yards': 1726},
                   {'Name':'Christian McCaffrey','Receiving Yards': 864},
                   {'Name':'Alvin Kamara','Receiving Yards':689}])

rbs = pd.DataFrame([{'Name':'Christian McCaffrey','Rushing Yards':1028},
                   {'Name':'Derrick Henry','Rushing Yards':1563},
                   {'Name':'Alvin Kamara','Rushing Yards':945}])

In [None]:
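# Use each player's name as the index so the merge below matches on it
# (without this, left_index/right_index would match on row position 0, 1, 2)
wrs.set_index('Name', inplace=True)
rbs.set_index('Name', inplace=True)
rbs.head()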

In [None]:
combined = pd.merge(wrs, rbs, how='right', right_index = True, left_index = True)

In [None]:
combined.head()

**Pandas Groupby**

* The groupby method allows the user to subset the data into smaller groups to easily perform desired computations
* Iterating over the result of groupby yields a tuple for each group. The first entry in the tuple is the name of the group and the second is a dataframe containing only the rows of the original dataframe that belong to that group
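A quick way to see these tuples is to loop over a small grouped dataframe. This is only a sketch: the frame `df`, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: four plays against three personnel groupings
df = pd.DataFrame({
    'personnel': ['Nickel', 'Base', 'Nickel', 'Dime'],
    'epa': [0.5, -0.1, 0.3, 1.2],
})

groups = {}
for name, frame in df.groupby('personnel'):
    # name is the group label, frame holds only that group's rows
    groups[name] = frame
    print(name, len(frame))
```

By default the groups come back sorted by their label, and each `frame` is an ordinary dataframe that supports the usual column selection and methods.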

Let's say I want to determine the efficiency of Michigan's offense against different defensive personnel, and I want to know how many plays they faced each personnel package:

In [None]:
pbp = pbp[pbp['offense_team_name'] == 'Michigan Wolverines']
pbp = pbp[pbp['play_scrimmage_epa'].notna()]
pbp = pbp[(pbp['play_pass_yards_air'].notna()) | (pbp['play_yards_run'].notna())]

First, we subset our data to plays where Michigan is on offense, a scrimmage EPA value is recorded, and the play is a pass or a run, which leaves out punts
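The same filtering can be written with named boolean masks, which some readers find easier to follow. This is a sketch on made-up data: the frame `toy`, its values, and the second team name are hypothetical, while the column names are the ones used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical plays; NaN marks a value that was not recorded
toy = pd.DataFrame({
    'offense_team_name': ['Michigan Wolverines', 'Michigan Wolverines',
                          'Ohio State Buckeyes', 'Michigan Wolverines'],
    'play_scrimmage_epa': [0.3, np.nan, 0.1, -0.4],
    'play_pass_yards_air': [12.0, np.nan, 5.0, np.nan],
    'play_yards_run': [np.nan, np.nan, np.nan, 3.0],
})

# Each mask is a True/False Series with one entry per row
is_michigan = toy['offense_team_name'] == 'Michigan Wolverines'
has_epa = toy['play_scrimmage_epa'].notna()
is_pass_or_run = toy['play_pass_yards_air'].notna() | toy['play_yards_run'].notna()

# & means "and", | means "or"; keep only rows where every condition holds
filtered = toy[is_michigan & has_epa & is_pass_or_run]
print(filtered)
```

Applying all three masks at once gives the same rows as filtering three times in a row, because a row has to pass every condition either way.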

Now we can use a for loop to iterate through the groups we create with the groupby method:

In [None]:
form_all = {}

for group, frame in pbp.groupby('play_defensive_personnel'):
    avg = round(np.average(frame['play_scrimmage_epa']),4)
    num_plays = len(frame)
    form_all[group] = [avg, num_plays]

In [None]:
form_all

* We created an empty dictionary to store the average EPA and number of plays against each defensive personnel grouping
* We use the round and numpy average functions to compute the average EPA against that personnel grouping, rounded to four decimal places
* We use the len function to count the rows in each group's dataframe, which is the number of plays against that personnel grouping
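As an alternative to the for loop, the same summary can be computed in a single step with the agg method. This is a sketch rather than a replacement for the approach above: the frame `toy` and its personnel labels are hypothetical, and the output column names `avg_epa` and `num_plays` are ones we chose for illustration.

```python
import pandas as pd

# Hypothetical plays with made-up personnel labels and EPA values
toy = pd.DataFrame({
    'play_defensive_personnel': ['4-2-5', '4-3-4', '4-2-5', '4-2-5'],
    'play_scrimmage_epa': [0.5, -0.25, 0.25, 0.0],
})

summary = (
    toy.groupby('play_defensive_personnel')['play_scrimmage_epa']
       .agg(avg_epa='mean', num_plays='count')   # one output column per keyword
       .round(4)
)
print(summary)
```

Note that 'count' skips missing values while len counts every row, so the two only agree when rows with missing EPA have already been filtered out, as they were earlier in this tutorial.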

Now, see if you can further subset the data to investigate all rushing plays against different personnel groups and all passing plays

**Thanks for coming!**