# Python Pandas III: Aggregates

Still starring our favorite globetrotter. 

We'll start by merging the same two datasets we used in the previous lesson.

In [None]:
# import pandas
import pandas

# configure pandas
pandas.options.display.max_rows = None
pandas.options.display.max_columns = None

# load our first data set and give it a quick head check
celtics_roster = pandas.read_csv('boston_celtics_2023_2024.csv')
celtics_roster.head(4)

In [None]:
# Go ahead and load our next dataset and give it a quick head check
celtics_totals = pandas.read_csv('boston_celtics_2023_2024_totals.csv')
celtics_totals.head(4)

In [None]:
# Now let's do the merge and do our last head check
celtics = pandas.merge(celtics_roster, celtics_totals, on='player', how='outer')
celtics.head(5)
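
An outer merge keeps rows that only exist in one of the two data sets, and those rows end up with NaN in the columns that came from the other side. Here is a minimal sketch of how to spot them, assuming a tiny made-up roster and totals table (the player names and numbers are hypothetical, not from the CSV files). The `indicator=True` option of `pandas.merge` adds a `_merge` column that says where each row came from.

```python
import pandas

# Hypothetical miniature versions of the two data sets (names and numbers are made up).
roster = pandas.DataFrame({
    'player': ['Player A', 'Player B', 'Player C (TW)'],
    'position': ['PG', 'C', 'SF'],
})
totals = pandas.DataFrame({
    'player': ['Player A', 'Player B', 'Player C'],
    'PTS': [1200, 800, 50],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
merged = pandas.merge(roster, totals, on='player', how='outer', indicator=True)

# Rows that did not find a partner in the other data set.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['player', 'position', 'PTS', '_merge']])
```

The two unmatched rows are really the same player under two spellings, which is exactly the kind of thing that produces NaNs after a merge.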

### Let's look at Groups and Pivots. 

We're going to group our players by position group and on-off percentile. 

Before we do this, we're going to create a new column (Remember how to do that??) We want to aggregate our forwards and guards. 

In [None]:
celtics['position_group'] = "Unknown"
celtics.loc[(celtics['position'] == 'C'), 'position_group'] = 'center'
celtics.loc[(celtics['position'] == 'PF') | (celtics['position'] == 'SF'), 'position_group'] = 'forward'
celtics.loc[(celtics['position'] == 'PG') | (celtics['position'] == 'SG'), 'position_group'] = 'guard'
celtics[['player','position_group']]
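
The three `.loc` assignments above can also be written as a single lookup. This is a sketch of one alternative, assuming a small made-up `players` table and a hypothetical `position_groups` dictionary; it uses `Series.map` and then `fillna` to restore the "Unknown" default.

```python
import pandas

# Hypothetical sample; the lesson itself works on the merged celtics DataFrame.
players = pandas.DataFrame({
    'player': ['Player A', 'Player B', 'Player C', 'Player D'],
    'position': ['C', 'PF', 'SG', None],
})

# One dictionary replaces the three .loc assignments.
position_groups = {
    'C': 'center',
    'PF': 'forward', 'SF': 'forward',
    'PG': 'guard', 'SG': 'guard',
}

# map() gives NaN for positions missing from the dictionary;
# fillna() puts the "Unknown" default back.
players['position_group'] = players['position'].map(position_groups).fillna('Unknown')
print(players)
```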

Now let's put together the group

In [None]:
celtics.groupby(['position_group','on-off']).player.count().reset_index()

Neat! This is one way we can look at who the starters on the team are, or what playing time looks like. 

Let's pivot the grouped result and look at it in a different way. (The pivoted table is easier on human eyes, because each on-off value appears once as a row and each position group once as a column.)

In [None]:
celtics.groupby(['position_group','on-off']).player.count().reset_index().pivot(columns='position_group',index='on-off', values='player')
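
The group, count, reset, pivot chain can usually be collapsed into one `pivot_table` call. The sketch below assumes a small made-up table with the same three column names (the values are hypothetical) and builds the table both ways.

```python
import pandas

# Hypothetical sample with the same column names as the lesson's DataFrame.
players = pandas.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'E'],
    'position_group': ['guard', 'guard', 'forward', 'center', 'guard'],
    'on-off': [5.0, 5.0, -2.0, 5.0, -2.0],
})

# The long way: group, count, reset the index, then pivot.
long_way = (
    players.groupby(['position_group', 'on-off']).player.count()
    .reset_index()
    .pivot(columns='position_group', index='on-off', values='player')
)

# The short way: pivot_table does the grouping and counting itself.
short_way = players.pivot_table(
    index='on-off', columns='position_group', values='player', aggfunc='count'
)
print(short_way)
```

Combinations that never occur (for example a center with an on-off of -2.0 in this sample) show up as NaN in both versions.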

### Column Stats

Points are everything, right?? So let's get some interesting stats. 

In [None]:
print(f'Max Points: {celtics.PTS.max()}')
print(f'Min Points: {celtics.PTS.min()}')
print(f'Mean Points: {celtics.PTS.mean()}')
print(f'Median Points: {celtics.PTS.median()}')
print(f'Count Points (number of players with a non-NaN PTS value): {celtics.PTS.count()}')
print(f'Standard Deviation of Points: {celtics.PTS.std()}')
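
If you want all of those numbers in one go, `Series.agg` accepts a list of aggregation names. A minimal sketch, assuming a made-up `points` Series rather than the real PTS column:

```python
import pandas

# Hypothetical point totals for four players.
points = pandas.Series([1200, 800, 50, 430], name='PTS')

# One call returns a Series indexed by the name of each statistic.
summary = points.agg(['max', 'min', 'mean', 'median', 'count', 'std'])
print(summary)
```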



Here let's look at getting unique values (or a set) and counts. 

In [None]:
# This returns all positions (duplicates!) 
celtics.position

# this returns a NumPy array of the unique values. It's more or less a set.
celtics.position.unique()

In [None]:
# this gets the count of unique values
celtics.position.nunique()
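
A close cousin of `unique()` and `nunique()` is `value_counts()`, which tells you how many times each value appears. A small sketch on a made-up `positions` Series (the values are hypothetical):

```python
import pandas

# Hypothetical positions, including one missing value.
positions = pandas.Series(['PG', 'SG', 'C', 'PG', 'SF', None], name='position')

# How many players at each position; missing values are left out by default.
counts = positions.value_counts()
print(counts)

# unique() keeps the missing value, nunique() does not count it.
print(len(positions.unique()), positions.nunique())
```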

### Aggregates and Groups

Remember groups??? We can create aggregates per group. 

In [None]:
# Top point scorers at each position
celtics.groupby('position').PTS.max().reset_index()

---

This is useful, but how do I get the player associated w/ that max score? 

First let's look at what we get just by viewing the groupby('position')

In [None]:
# basic groupby command for position
celtics.groupby('position')



That wasn't too useful, right? It just told us that we have a DataFrameGroupBy object. 

So, what happens if we try to perform a selection? (Any Guesses??)

In [None]:
# Let's select PTS...
celtics.groupby('position')['PTS']

Remember Python Pandas 1?

That's right, when you perform a single selection on a DataFrame, you get a Series object. So, if you perform a single selection on a DataFrameGroupBy... you get a SeriesGroupBy. 

So let's use idxmax() to get the index of the max value and then try to use that index to get the information we want. 

In [None]:
# First we'll get the index and print it out to check that we've actually got something other than an object.
index = celtics.groupby('position')['PTS'].idxmax()
index

In [None]:
# Rad! So we've got the index of the max scorer for each position. Now let's use it. 
celtics.loc[index]



Ok. That's helpful, but we don't need all of that information right? We just need the player's name and the points. 

In [None]:
celtics.loc[index][['player','PTS']]

Now that we have the pieces, let's combine them using a Python lambda with groupby, so that the rows are labeled by position instead of by the original index numbers.

In [None]:
get_max_score = lambda group: group.loc[group.PTS.idxmax()]

celtics.groupby('position').apply(get_max_score)[['player', 'PTS']]
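
The same position-labeled table can be built without `apply`, by reusing the `idxmax()` result and then setting the index. This is a sketch on a made-up `players` table (names and points are hypothetical). Depending on your pandas version, `groupby(...).apply(...)` on a whole DataFrame may also print a warning about the grouping columns, which this version sidesteps.

```python
import pandas

# Hypothetical sample data.
players = pandas.DataFrame({
    'player': ['A', 'B', 'C', 'D'],
    'position': ['PG', 'PG', 'C', 'C'],
    'PTS': [900, 1200, 300, 700],
})

# Row label of the top scorer within each position.
idx = players.groupby('position')['PTS'].idxmax()

# Pull those rows, keep the columns we care about, and label rows by position.
top = players.loc[idx, ['position', 'player', 'PTS']].set_index('position')
print(top)
```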

Let's do this again w/ Total Rebounds. Unfortunately "TRB" is a vague abbreviation that many users might not recognize, so we'll want to rename the column. Let's do that. 

In [None]:
get_max_trb = lambda group: group.loc[group.TRB.idxmax()]

celtics_trb_by_pos = celtics.groupby('position').apply(get_max_trb)[['player','TRB']]
celtics_trb_by_pos

In [None]:
# Now let's rename the columns. 
celtics_trb_by_pos = celtics_trb_by_pos.rename(columns={"TRB": "Rebounds"})
celtics_trb_by_pos

---

Getting the high scorers is helpful, but what if we want to determine the percentiles so we can select the players above/below those percentiles? 

Before we can do this... we're going to need to deal with some of the NaNs.

In [None]:
celtics[['PTS', 'player', 'position']]

Remember that silly (TW) suffix in the player name? This caused some issues when we merged the two data sets. It would have gone much smoother if we had fixed the names to match up (or created an id of some kind.) 

We're going to solve this w/ a simple hack. We're going to fill in the NaN values in PTS, because they'll prevent us from calculating percentiles. 

```
Rule of Thumb: Non Numeric Values Break Numeric Calculations.
```
1. The easiest way to spot non-numerics is to run DataFrame.info() and look for fields where we expect a numeric data type but see object instead. This means that there is likely a string somewhere.
2. The second easiest way is to hunt down "NaN"s. NaN shows up in numeric fields. 
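
Both checks can be sketched in a few lines. The example assumes a made-up `raw` table where one PTS entry is the string 'n/a' (hypothetical data); `pandas.to_numeric` with `errors='coerce'` turns anything it cannot parse into NaN, and `isna().sum()` counts the NaNs.

```python
import pandas

# Hypothetical data where a stray string has snuck into a numeric column.
raw = pandas.DataFrame({
    'player': ['A', 'B', 'C'],
    'PTS': ['1200', 'n/a', '50'],
})

# PTS is not numeric yet (object or string dtype, depending on the pandas version).
print(raw.dtypes)

# Convert to numbers; anything unparseable becomes NaN.
raw['PTS'] = pandas.to_numeric(raw['PTS'], errors='coerce')

# Count the NaNs that need attention.
missing = raw['PTS'].isna().sum()
print(raw.dtypes)
print(f'Missing PTS values: {missing}')
```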

In [None]:
# Setting the values (I'm just taking the values from the player w/o the TW and copying them to their duplicate entry.)
celtics.at[15,'PTS'] = celtics.iloc[19]['PTS']
celtics.at[16, 'PTS'] = celtics.iloc[20]['PTS']
celtics.loc[[15, 16, 19, 20]][['player','PTS']]

The output should show you the correct outcomes!

NOTE: Older tutorials mention a method called set_value(). It was deprecated in pandas 0.21 and removed in pandas 1.0, so use the .at[] or .iat[] accessors instead (square brackets, not parentheses).

Now that we've done a little hacky-wacky wrangling, let me show you an easy way. 

In [None]:
# import numpy. We need that to set NaN and get to percentiles later
import numpy as np


# Setting the values back (np.nan is the spelling that works on every NumPy version; np.NaN was removed in NumPy 2.0)
celtics.at[15,'PTS'] = np.nan
celtics.at[16, 'PTS'] = np.nan
celtics.loc[[15, 16, 19, 20]][['player','PTS']]

In [None]:
# Let's calculate the top scorers... and show the bad values. 
top_scorers = celtics.groupby('position').PTS.apply(lambda x: np.percentile(x, 75)).reset_index()
top_scorers

In [None]:
# Lame! So let's fix it. [replace numpy.percentile w/ numpy.nanpercentile]
top_scorers = celtics.groupby('position').PTS.apply(lambda x: np.nanpercentile(x, 75)).reset_index()
top_scorers
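
To see the difference in isolation, here is a tiny sketch on a made-up array (the numbers are hypothetical): a single NaN is enough to make `np.percentile` return NaN, while `np.nanpercentile` ignores it.

```python
import numpy as np

# Hypothetical point totals with one missing value.
points = np.array([100.0, 200.0, 300.0, 400.0, np.nan])

# The NaN propagates: the result is NaN.
with_nan = np.percentile(points, 75)

# NaN is ignored: the 75th percentile of 100, 200, 300, 400.
without_nan = np.nanpercentile(points, 75)

print(with_nan, without_nan)
```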

In [None]:
# now let's calculate the low scorers
low_scorers = celtics.groupby('position').PTS.apply(lambda x: np.nanpercentile(x, 25)).reset_index()
low_scorers

In [None]:
# what about the middle scorers? The 50th percentile...

In [None]:
mid = celtics.groupby('position').PTS.median().reset_index()
mid

In [None]:
# Let's prove it by computing the 50th percentile with nanpercentile and comparing it to the median
mid_quantile = celtics.groupby('position').PTS.apply(lambda x: np.nanpercentile(x, 50)).reset_index()
mid_quantile

In [None]:
# Here is another short cut to get all of the information we just calculated separately. 
# .. I added a second field (TRB) to demonstrate how easy it is to get there...
celtics.groupby(by='position').describe()[['PTS', 'TRB']]

### DataFrameGroupBy.quantile() vs. numpy.percentile() !!!

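NumPy's percentile functions can be faster on large arrays, but quantile() is often easier to use: it works directly on the grouped object, takes a fraction between 0 and 1 instead of a percentage, and skips NaN values by default.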

#### quantile's interpolation method
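This is how quantile picks a value when the requested quantile falls between two data points i and j. (In other words, the result is an estimate, and different methods can give slightly different answers.)
- linear: i + (j - i) * fraction, where fraction is the fractional part of the index position that lies between i and j.  **<-- Default**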
- lower: i.
- higher: j.
- nearest: i or j whichever is nearest.
- midpoint: (i + j) / 2.
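
Here is a small worked sketch of those five options, assuming a made-up Series of four values. With q=0.25 the target position is 0.25 * (4 - 1) = 0.75, which lies between i = 10 and j = 20 with a fraction of 0.75.

```python
import pandas

# Hypothetical, already-sorted values.
values = pandas.Series([10, 20, 30, 40])

# Position 0.75 sits between i = 10 and j = 20.
linear = values.quantile(0.25)                              # 10 + (20 - 10) * 0.75
lower = values.quantile(0.25, interpolation='lower')        # i
higher = values.quantile(0.25, interpolation='higher')      # j
nearest = values.quantile(0.25, interpolation='nearest')    # j is closer to position 0.75
midpoint = values.quantile(0.25, interpolation='midpoint')  # (i + j) / 2

print(linear, lower, higher, nearest, midpoint)
```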

#### numpy percentile. 

https://numpy.org/doc/stable/reference/generated/numpy.percentile.html

Similar, but more involved: 

From the NumPy docs for the `method` parameter: This parameter specifies the method to use for estimating the percentile. There are many different methods, some unique to NumPy. See the notes for explanation. The options sorted by their R type as summarized in the H&F paper [1] are:

- ‘inverted_cdf’
- ‘averaged_inverted_cdf’
- ‘closest_observation’
- ‘interpolated_inverted_cdf’
- ‘hazen’
- ‘weibull’
- ‘linear’ **<--default**
- ‘median_unbiased’
- ‘normal_unbiased’



In [None]:
# Here is an example of using quantile(). Same results as nanpercentile with 25, because both default to linear interpolation.
celtics.groupby('position').PTS.quantile(q=.25) 
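
A quick way to convince yourself that the two approaches agree is to run both on the same data. This sketch assumes a small made-up `players` table with one NaN (hypothetical values); the grouped `quantile()` skips the NaN on its own, so it lines up with `np.nanpercentile`.

```python
import numpy as np
import pandas

# Hypothetical sample with one missing PTS value.
players = pandas.DataFrame({
    'position': ['PG', 'PG', 'PG', 'C', 'C', 'C'],
    'PTS': [100.0, 200.0, np.nan, 50.0, 150.0, 250.0],
})

# pandas: fraction between 0 and 1, NaN skipped automatically.
by_quantile = players.groupby('position').PTS.quantile(q=0.25)

# NumPy: percentage between 0 and 100, NaN skipped only by the nan* variant.
by_numpy = players.groupby('position').PTS.apply(lambda x: np.nanpercentile(x, 25))

print(by_quantile)
print(by_numpy)
```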

### Other helpful aggregations w/ Groups. 

A comprehensive list is here https://pandas.pydata.org/docs/reference/groupby.html

In [None]:
# this returns the first two rows for each position, in their original index order. (not terribly useful unless you sort or filter first..)
celtics.groupby('position').head(2)

In [None]:
# first entry
celtics.groupby('position').first()

In [None]:
# last entry
celtics.groupby('position').last()
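
One thing to watch for: `first()` and `last()` return the first (or last) non-null value in each column, which is not always the same as the first (or last) row. A small sketch on a made-up table (hypothetical values) shows the difference from `head(1)`.

```python
import numpy as np
import pandas

# Hypothetical sample where the first PG row is missing its PTS value.
players = pandas.DataFrame({
    'position': ['PG', 'PG'],
    'player': ['A', 'B'],
    'PTS': [np.nan, 200.0],
})

# first(): first non-null value per column, so player comes from row A and PTS from row B.
first_values = players.groupby('position').first()

# head(1): the actual first row of each group, NaN and all.
first_rows = players.groupby('position').head(1)

print(first_values)
print(first_rows)
```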