# What can we do with pandas? Part Deux

We'll be working with our favorite folks on the court, the Boston Celtics. 

Let's set up our data.

In [4]:
import pandas

celtics = pandas.read_csv('boston_celtics_2023_2024.csv')
celtics.head(10)

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
0,11,Payton Pritchard,PG,6-1,195,January 28 1998,us,3,Oregon
1,30,Sam Hauser,SF,6-8,215,December 8 1997,us,2,Marquette Virginia
2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
6,42,Al Horford,C,6-9,240,June 3 1986,do,16,Florida
7,40,Luke Kornet,C,7-2,250,July 15 1995,us,6,Vanderbilt
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,
9,12,Oshae Brissett,SF,6-7,210,June 20 1998,ca,4,Syracuse


We're going to go through each of the fields, but we're going to start small. 

Let's look at country_code. This is in lower case. It's more common to see this as an upper case field, so let's change that. 

In [6]:
### Simple String Operations

# str.upper is a built-in method of Python strings, so no extra import is needed
celtics.country_code = celtics.country_code.apply(str.upper)
celtics.head(10)

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
0,11,Payton Pritchard,PG,6-1,195,January 28 1998,US,3,Oregon
1,30,Sam Hauser,SF,6-8,215,December 8 1997,US,2,Marquette Virginia
2,0,Jayson Tatum,PF,6-8,210,March 3 1998,US,6,Duke
3,9,Derrick White,SG,6-4,190,July 2 1994,US,6,Colorado-Colorado Springs Colorado
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,US,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,US,14,UCLA
6,42,Al Horford,C,6-9,240,June 3 1986,DO,16,Florida
7,40,Luke Kornet,C,7-2,250,July 15 1995,US,6,Vanderbilt
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,LV,7,
9,12,Oshae Brissett,SF,6-7,210,June 20 1998,CA,4,Syracuse
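
As a side note, pandas also exposes vectorized string methods through the `.str` accessor. Here is a minimal sketch using a small made-up Series (not the Celtics data) that includes a missing value, to show how the two approaches differ when a value is absent:

```python
import pandas

# Hypothetical sample data with one missing value
codes = pandas.Series(['us', 'lv', None])

# The .str accessor upper-cases each string and leaves missing values missing
upper_codes = codes.str.upper()

# apply(str.upper) calls str.upper on every element, including the missing one,
# which is not a string, so it raises a TypeError
try:
    codes.apply(str.upper)
    apply_failed = False
except TypeError:
    apply_failed = True

print(upper_codes)
print(apply_failed)
```

Our country_code column has no missing values, so either approach works here.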


Much better. Let's also inspect the data frame. Pay close attention to the Non-Null Count. 
(Do you remember how to find that NaN value from the last lesson? Who didn't go to college?!?!)

In [8]:
celtics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   number        17 non-null     int64 
 1   player        17 non-null     object
 2   position      17 non-null     object
 3   height        17 non-null     object
 4   weight        17 non-null     int64 
 5   birth_date    17 non-null     object
 6   country_code  17 non-null     object
 7   experience    17 non-null     object
 8   college       16 non-null     object
dtypes: int64(2), object(7)
memory usage: 1.3+ KB
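
If you want a refresher on finding that missing value, here is one possible way to do it. This is a sketch on a three-row stand-in frame (the name `roster` is made up for illustration), not the full data set:

```python
import pandas

# Hypothetical three-row stand-in for the roster; the real frame comes from the CSV
roster = pandas.DataFrame({
    'player': ['Al Horford', 'Luke Kornet', 'Kristaps Porziņģis'],
    'college': ['Florida', 'Vanderbilt', None],
})

# isna() returns a boolean Series that is True where the value is missing
no_college = roster[roster.college.isna()]
print(no_college.player.tolist())
```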


Let's look at Drew Peterson for a second... (Remember how to select a single Series?)

In [10]:
celtics[celtics.player.str.contains("Drew")]

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
16,13,Drew Peterson (TW),PF,6-9,205,November 9 1999,US,R,Rice University USC


Drew has 2 universities listed. Let's change the column name to colleges. 

In [12]:
celtics.rename(columns={"college": "colleges"})

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,colleges
0,11,Payton Pritchard,PG,6-1,195,January 28 1998,US,3,Oregon
1,30,Sam Hauser,SF,6-8,215,December 8 1997,US,2,Marquette Virginia
2,0,Jayson Tatum,PF,6-8,210,March 3 1998,US,6,Duke
3,9,Derrick White,SG,6-4,190,July 2 1994,US,6,Colorado-Colorado Springs Colorado
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,US,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,US,14,UCLA
6,42,Al Horford,C,6-9,240,June 3 1986,DO,16,Florida
7,40,Luke Kornet,C,7-2,250,July 15 1995,US,6,Vanderbilt
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,LV,7,
9,12,Oshae Brissett,SF,6-7,210,June 20 1998,CA,4,Syracuse
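
One thing to watch: `DataFrame.rename()` returns a new DataFrame rather than changing the original. The cell above doesn't assign the result, so `celtics` itself keeps the `college` column, which is why later cells in this lesson still show `college`. A small sketch on a made-up one-row frame:

```python
import pandas

# Hypothetical one-row frame used only to illustrate the behavior
roster = pandas.DataFrame({'player': ['Jayson Tatum'], 'college': ['Duke']})

# rename() returns a new DataFrame; the original is left untouched
renamed = roster.rename(columns={'college': 'colleges'})

print(list(roster.columns))   # the original frame
print(list(renamed.columns))  # the renamed copy
```

To keep the new name, assign the result back to a variable, as `renamed` does here.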


```
Note: we can change all of the column names as well:

celtics.columns = [<list of new column names>]

This syntax requires you to specify all of the column names. The previous example using DataFrame.rename() allows
selective renaming and is more common.
```

There are a number of reasons for doing this. Sometimes source data has column names w/ spaces or that start w/ numbers or other limitations that prevent us from using dot notation. We might rename columns to fix this, because dot notation tends to be less error prone than bracket notation. 

Another reason is presentation. After all of your data wrangling is finished, you might want to output new column names that reflect some polish. This is the most common use of DataFrame.columns, because it's more likely that all of the columns will be renamed to fit presentation requirements. 

### Adding Columns by merging data frames

We could add a single column, but we will be doing that a bit later, and I want to add some data to make this more involved, so let's bring in a second data frame. Before we can merge the two, let's load the new data frame and figure out a good field to merge them on.

In [15]:
totals = pandas.read_csv('boston_celtics_2023_2024_totals.csv')
totals.head(10)

Unnamed: 0,player,age,G,GS,MP,FG,FGA,FG%,3P,3PA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Jayson Tatum,25,74,74,2645,672,1426,0.471,229,609,...,0.833,67,534,601,364,75,43,188,145,1987
1,Derrick White,29,73,73,2381,387,839,0.461,196,495,...,0.901,51,259,310,377,74,87,112,152,1107
2,Jaylen Brown,27,70,70,2343,627,1256,0.499,145,410,...,0.703,84,303,387,249,83,37,166,185,1610
3,Jrue Holiday,33,69,69,2263,331,689,0.48,138,322,...,0.833,84,289,373,333,61,53,124,108,860
4,Payton Pritchard,26,82,5,1825,297,635,0.468,147,382,...,0.821,70,195,265,281,39,6,61,106,787
5,Sam Hauser,26,79,13,1741,249,558,0.446,197,465,...,0.895,45,231,276,82,40,25,32,99,712
6,Al Horford,37,65,33,1740,214,419,0.511,108,258,...,0.867,82,331,413,168,38,62,48,93,562
7,Kristaps Porziņģis,28,57,57,1690,388,752,0.516,110,293,...,0.858,97,312,409,115,42,111,89,156,1145
8,Luke Kornet,28,63,7,983,142,203,0.7,1,1,...,0.907,118,143,261,67,23,61,21,77,334
9,Oshae Brissett,25,55,1,630,68,153,0.444,15,55,...,0.602,61,99,160,44,19,8,20,54,201


Yikes. Not much to play with. We only have 2 options. 
1. We can use the player name, which is usually going to be unique on a sports team (at least within a given season).
2. We could calculate the age of the player from the first data frame based on their birthdate and match it to their age in this data frame. The downside is that we don't really know how it's been calculated (what if they had a birthday mid-season? Is this current? etc.)

For our purposes, the player column is the best field to merge on right now.
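
To see why option 2 is shaky, here is a rough sketch that computes ages from two of the birth dates in our roster. The helper `age_on` and both reference dates are assumptions made up for this illustration. We don't know which date the totals data actually used.

```python
import pandas

births = pandas.Series(
    ['January 28 1998', 'March 3 1998'],
    index=['Payton Pritchard', 'Jayson Tatum'],
)
birth_dates = pandas.to_datetime(births, format='%B %d %Y')

def age_on(reference, birth_dates):
    """Whole years between each birth date and the reference date."""
    # True when the birthday has already occurred in the reference year
    had_birthday = (
        (birth_dates.dt.month < reference.month)
        | ((birth_dates.dt.month == reference.month)
           & (birth_dates.dt.day <= reference.day))
    )
    # Subtract one year for anyone whose birthday is still to come
    return reference.year - birth_dates.dt.year - (~had_birthday).astype(int)

# Two hypothetical reference dates, chosen only to show that the answer changes
early = age_on(pandas.Timestamp('2024-02-01'), birth_dates)
late = age_on(pandas.Timestamp('2024-06-01'), birth_dates)
print(early)
print(late)
```

The two reference dates give different ages for the same player, so age alone is a fragile merge key.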

In [17]:
# Using an "outer" method, which is similar to a SQL full outer join so we get all of the data. 
celtics = pandas.merge(celtics, totals, on='player', how='outer')
celtics.head(10)

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college,age,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,42.0,Al Horford,C,6-9,240.0,June 3 1986,DO,16,Florida,37.0,...,0.867,82.0,331.0,413.0,168.0,38.0,62.0,48.0,93.0,562.0
1,,Dalano Banton,,,,,,,,24.0,...,0.8,12.0,23.0,35.0,19.0,5.0,3.0,10.0,20.0,56.0
2,9.0,Derrick White,SG,6-4,190.0,July 2 1994,US,6,Colorado-Colorado Springs Colorado,29.0,...,0.901,51.0,259.0,310.0,377.0,74.0,87.0,112.0,152.0,1107.0
3,,Drew Peterson,,,,,,,,24.0,...,,0.0,1.0,1.0,1.0,2.0,0.0,1.0,1.0,11.0
4,13.0,Drew Peterson (TW),PF,6-9,205.0,November 9 1999,US,R,Rice University USC,,...,,,,,,,,,,
5,,JD Davison,,,,,,,,21.0,...,0.75,2.0,8.0,10.0,10.0,1.0,1.0,2.0,4.0,16.0
6,20.0,JD Davison (TW),SG,6-1,195.0,October 3 2002,US,1,Alabama,,...,,,,,,,,,,
7,44.0,Jaden Springer,PG,6-4,204.0,September 25 2002,US,2,Tennessee,21.0,...,0.875,8.0,12.0,20.0,9.0,11.0,4.0,8.0,17.0,35.0
8,7.0,Jaylen Brown,SF,6-6,223.0,October 24 1996,US,7,California,27.0,...,0.703,84.0,303.0,387.0,249.0,83.0,37.0,166.0,185.0,1610.0
9,0.0,Jayson Tatum,PF,6-8,210.0,March 3 1998,US,6,Duke,25.0,...,0.833,67.0,534.0,601.0,364.0,75.0,43.0,188.0,145.0,1987.0
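
One way to see which rows failed to match is the `indicator` parameter of `pandas.merge`, which adds a `_merge` column saying where each row came from. A minimal sketch on two tiny made-up frames (the names `roster` and `stats` are just for illustration):

```python
import pandas

# Tiny stand-ins for the two data frames
roster = pandas.DataFrame({
    'player': ['Al Horford', 'Drew Peterson (TW)'],
    'number': [42, 13],
})
stats = pandas.DataFrame({
    'player': ['Al Horford', 'Drew Peterson', 'Dalano Banton'],
    'PTS': [562, 11, 56],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pandas.merge(roster, stats, on='player', how='outer', indicator=True)

# Keep only the rows that did not find a partner in the other frame
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['player', '_merge']])
```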


Whoa, nellie! That certainly added a ton of data, and I imagine you can start to pick out some problems. 
Let's look at the DataFrame's structure and the data types. 

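Pay attention to the non-null counts. Rows that exist in only one of the two data frames are filled with nulls in the columns that came from the other one. This is one of the common results of merging data frames or tables.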

In [19]:
celtics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 35 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   number        17 non-null     float64
 1   player        21 non-null     object 
 2   position      17 non-null     object 
 3   height        17 non-null     object 
 4   weight        17 non-null     float64
 5   birth_date    17 non-null     object 
 6   country_code  17 non-null     object 
 7   experience    17 non-null     object 
 8   college       16 non-null     object 
 9   age           19 non-null     float64
 10  G             19 non-null     float64
 11  GS            19 non-null     float64
 12  MP            19 non-null     float64
 13  FG            19 non-null     float64
 14  FGA           19 non-null     float64
 15  FG%           19 non-null     float64
 16  3P            19 non-null     float64
 17  3PA           19 non-null     float64
 18  3P%           18 non-null     fl

### Brief Note On Display options. 
Hmm... we can't see all of our columns. 

Let's change that...

In [21]:
pandas.options.display.max_columns = None
pandas.options.display.max_rows = None

celtics.head(10)

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college,age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,42.0,Al Horford,C,6-9,240.0,June 3 1986,DO,16,Florida,37.0,65.0,33.0,1740.0,214.0,419.0,0.511,108.0,258.0,0.419,106.0,161.0,0.658,0.64,26.0,30.0,0.867,82.0,331.0,413.0,168.0,38.0,62.0,48.0,93.0,562.0
1,,Dalano Banton,,,,,,,,24.0,24.0,1.0,171.0,19.0,51.0,0.373,2.0,16.0,0.125,17.0,35.0,0.486,0.392,16.0,20.0,0.8,12.0,23.0,35.0,19.0,5.0,3.0,10.0,20.0,56.0
2,9.0,Derrick White,SG,6-4,190.0,July 2 1994,US,6,Colorado-Colorado Springs Colorado,29.0,73.0,73.0,2381.0,387.0,839.0,0.461,196.0,495.0,0.396,191.0,344.0,0.555,0.578,137.0,152.0,0.901,51.0,259.0,310.0,377.0,74.0,87.0,112.0,152.0,1107.0
3,,Drew Peterson,,,,,,,,24.0,3.0,0.0,23.0,4.0,6.0,0.667,3.0,5.0,0.6,1.0,1.0,1.0,0.917,0.0,0.0,,0.0,1.0,1.0,1.0,2.0,0.0,1.0,1.0,11.0
4,13.0,Drew Peterson (TW),PF,6-9,205.0,November 9 1999,US,R,Rice University USC,,,,,,,,,,,,,,,,,,,,,,,,,,
5,,JD Davison,,,,,,,,21.0,8.0,0.0,39.0,5.0,12.0,0.417,3.0,7.0,0.429,2.0,5.0,0.4,0.542,3.0,4.0,0.75,2.0,8.0,10.0,10.0,1.0,1.0,2.0,4.0,16.0
6,20.0,JD Davison (TW),SG,6-1,195.0,October 3 2002,US,1,Alabama,,,,,,,,,,,,,,,,,,,,,,,,,,
7,44.0,Jaden Springer,PG,6-4,204.0,September 25 2002,US,2,Tennessee,21.0,17.0,1.0,130.0,13.0,30.0,0.433,2.0,11.0,0.182,11.0,19.0,0.579,0.467,7.0,8.0,0.875,8.0,12.0,20.0,9.0,11.0,4.0,8.0,17.0,35.0
8,7.0,Jaylen Brown,SF,6-6,223.0,October 24 1996,US,7,California,27.0,70.0,70.0,2343.0,627.0,1256.0,0.499,145.0,410.0,0.354,482.0,846.0,0.57,0.557,211.0,300.0,0.703,84.0,303.0,387.0,249.0,83.0,37.0,166.0,185.0,1610.0
9,0.0,Jayson Tatum,PF,6-8,210.0,March 3 1998,US,6,Duke,25.0,74.0,74.0,2645.0,672.0,1426.0,0.471,229.0,609.0,0.376,443.0,817.0,0.542,0.552,414.0,497.0,0.833,67.0,534.0,601.0,364.0,75.0,43.0,188.0,145.0,1987.0
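
If you'd rather not change the setting for the whole session, `pandas.option_context` applies an option only inside a `with` block. A small sketch, using a made-up one-row frame called `wide`:

```python
import pandas

# Hypothetical frame with one row and 40 columns
wide = pandas.DataFrame([range(40)])

default_limit = pandas.get_option('display.max_columns')

# Inside the with-block every column is shown; the old setting returns afterwards
with pandas.option_context('display.max_columns', None):
    inside_limit = pandas.get_option('display.max_columns')
    print(wide)

after_limit = pandas.get_option('display.max_columns')
```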


### Creating a new column from derived data. 

As you can see, we have a considerable number of pre-calculated fields. (Thanks to the wonderful folks at [basketball-reference.com](https://basketball-reference.com).) 

However, they left a few for us to play with. We only have total points, but points-per-game is a common statistic. Let's start by looking at the columns. 

In [23]:
celtics[['player', 'G', 'PTS']]

Unnamed: 0,player,G,PTS
0,Al Horford,65.0,562.0
1,Dalano Banton,24.0,56.0
2,Derrick White,73.0,1107.0
3,Drew Peterson,3.0,11.0
4,Drew Peterson (TW),,
5,JD Davison,8.0,16.0
6,JD Davison (TW),,
7,Jaden Springer,17.0,35.0
8,Jaylen Brown,70.0,1610.0
9,Jayson Tatum,74.0,1987.0


We're going to create a new column called 'PPG' for Points Per Game. We're going to compute the data based on the information we just queried.

Afterwards, let's sort the results in descending order so the highest per-game scorers come first.

In [25]:
celtics['PPG'] = celtics.PTS / celtics.G
celtics[['player', 'G', 'PTS', 'PPG']].sort_values(by=['PPG'], ascending=False)

Unnamed: 0,player,G,PTS,PPG
9,Jayson Tatum,74.0,1987.0,26.851351
8,Jaylen Brown,70.0,1610.0,23.0
12,Kristaps Porziņģis,57.0,1145.0,20.087719
2,Derrick White,73.0,1107.0,15.164384
11,Jrue Holiday,69.0,860.0,12.463768
17,Payton Pritchard,82.0,787.0,9.597561
18,Sam Hauser,79.0,712.0,9.012658
0,Al Horford,65.0,562.0,8.646154
15,Neemias Queta,28.0,154.0,5.5
14,Luke Kornet,63.0,334.0,5.301587
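
Notice that the rows with missing stats don't break the calculation. Here is a small sketch of what happens to them, using three rows with values copied from the table above (the frame name `stats` is made up for illustration):

```python
import pandas

# Values copied from the table above, including one row with missing stats
stats = pandas.DataFrame({
    'player': ['Drew Peterson (TW)', 'Al Horford', 'Jayson Tatum'],
    'G': [float('nan'), 65.0, 74.0],
    'PTS': [float('nan'), 562.0, 1987.0],
})

# Element-wise division: NaN in either operand produces NaN in the result
stats['PPG'] = stats.PTS / stats.G

# sort_values places NaN rows last by default (na_position='last')
ranked = stats.sort_values(by='PPG', ascending=False)
print(ranked)
```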


### Lambdas (Python) to create columns

That player column is bugging me. I'd like to split that up.

In [27]:
# Create some lambdas to do work for us
get_first_name = lambda word: word.split(' ')[0]
get_last_name = lambda word: word.split(' ')[1]

# This is a little bit more challenging. The length check avoids a 'list index out of range'
# for two-part names, and the negative index grabs the last piece when there are more than two
get_suffix = lambda suffix: suffix.split(' ')[-1] if len(suffix.split(' ')) > 2 else 'null'

# Create the new columns
celtics['first_name'] = celtics.player.apply(get_first_name)
celtics['last_name'] = celtics.player.apply(get_last_name)
celtics['suffix'] = celtics.player.apply(get_suffix)

celtics[['player','first_name','last_name','suffix']]

Unnamed: 0,player,first_name,last_name,suffix
0,Al Horford,Al,Horford,
1,Dalano Banton,Dalano,Banton,
2,Derrick White,Derrick,White,
3,Drew Peterson,Drew,Peterson,
4,Drew Peterson (TW),Drew,Peterson,(TW)
5,JD Davison,JD,Davison,
6,JD Davison (TW),JD,Davison,(TW)
7,Jaden Springer,Jaden,Springer,
8,Jaylen Brown,Jaylen,Brown,
9,Jayson Tatum,Jayson,Tatum,
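
For comparison, the `.str` accessor can do the same split without lambdas. This is a sketch on three of the names above. It assumes every name has at least two parts and that at least one name has a third part, otherwise the `suffix` column would not be created.

```python
import pandas

players = pandas.Series(['Al Horford', 'Drew Peterson (TW)', 'JD Davison'])

# n=2 limits the split to at most three pieces; expand=True returns a DataFrame
parts = players.str.split(' ', n=2, expand=True)
parts.columns = ['first_name', 'last_name', 'suffix']
print(parts)
```

Names without a third piece get a missing value in `suffix` rather than a placeholder string.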


### Working with Rows. 

We can work with rows in a similar manner, performing calculations on each row in order to create a new column or value.

One of the most common sources of confusion in basketball statistics is FG (Field Goals or shots), FT (Free Throws), 3P (Three Pointers) and PTS.

1. A Free Throw is worth a single point. It is not a Field Goal.
2. Field Goals are 2 or 3 PT shots.
3. 2 Point shots aren't reflected as a separate statistic. You have to pull them out... so let's do that.
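
As a quick sanity check on these rules, total points should equal 2 points for each two-pointer, 3 for each three-pointer and 1 for each free throw. Here is a small sketch using the Al Horford and Jayson Tatum numbers shown earlier (the helper `points_from_shots` is made up for this illustration):

```python
# Values taken from the Al Horford and Jayson Tatum rows shown earlier
players = {
    'Al Horford':   {'FG': 214, '3P': 108, 'FT': 26,  'PTS': 562},
    'Jayson Tatum': {'FG': 672, '3P': 229, 'FT': 414, 'PTS': 1987},
}

def points_from_shots(fg, three_p, ft):
    """Total points: 2 per two-pointer, 3 per three-pointer, 1 per free throw."""
    two_p = fg - three_p  # field goals that were not three-pointers
    return 2 * two_p + 3 * three_p + ft

for name, s in players.items():
    computed = points_from_shots(s['FG'], s['3P'], s['FT'])
    # Print the computed total and whether it matches the PTS column
    print(name, computed, computed == s['PTS'])
```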


We're going to need 3 new columns: 
1. 2P - this will be the 2 pointers made, which = FG - 3P
2. 2PA - this is 2 pointers attempted, which is FGA - 3PA
3. 2P% - this is 2-Pt percentage, which is 2P / 2PA

In [29]:
# Create our python lambdas for 2P and 2PA (NOTE. We can't use dot notation w/ column names that start
# with a number)
get_2P = lambda row: row.FG - row['3P']
get_2PA = lambda row: row.FGA - row['3PA']

# create the new columns (make sure to set axis=1 so the lambda receives one ROW at a time!!!)
celtics['2P'] = celtics.apply(get_2P, axis=1)
celtics['2PA'] = celtics.apply(get_2PA, axis=1)

celtics[['last_name', '2P','2PA']]

Unnamed: 0,last_name,2P,2PA
0,Horford,106.0,161.0
1,Banton,17.0,35.0
2,White,191.0,344.0
3,Peterson,1.0,1.0
4,Peterson,,
5,Davison,2.0,5.0
6,Davison,,
7,Springer,11.0,19.0
8,Brown,482.0,846.0
9,Tatum,443.0,817.0


Cool. That's done, so let's perform the calculation to get the percentage. 

In [31]:
get_2PPct = lambda row: row['2P'] / row['2PA']
celtics['2P%'] = celtics.apply(get_2PPct, axis=1)
celtics[['last_name', '2P','2PA', '2P%']]

Unnamed: 0,last_name,2P,2PA,2P%
0,Horford,106.0,161.0,0.658385
1,Banton,17.0,35.0,0.485714
2,White,191.0,344.0,0.555233
3,Peterson,1.0,1.0,1.0
4,Peterson,,,
5,Davison,2.0,5.0,0.4
6,Davison,,,
7,Springer,11.0,19.0,0.578947
8,Brown,482.0,846.0,0.56974
9,Tatum,443.0,817.0,0.542228
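
The row-wise lambdas are good practice for `apply`, but the same three columns can also be built with plain column arithmetic. A minimal sketch on two rows copied from the data above (the frame name `shots` is made up for illustration):

```python
import pandas

# Values taken from the Horford and Tatum rows shown earlier
shots = pandas.DataFrame({
    'last_name': ['Horford', 'Tatum'],
    'FG': [214.0, 672.0], 'FGA': [419.0, 1426.0],
    '3P': [108.0, 229.0], '3PA': [258.0, 609.0],
})

# Column arithmetic is applied element-wise, so no apply() or axis argument is needed
shots['2P'] = shots['FG'] - shots['3P']
shots['2PA'] = shots['FGA'] - shots['3PA']
shots['2P%'] = shots['2P'] / shots['2PA']
print(shots[['last_name', '2P', '2PA', '2P%']])
```

Bracket notation is still required here because the column names start with a number.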


Let's output the entire DataFrame to finish up.

In [33]:
celtics

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college,age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,PPG,first_name,last_name,suffix
0,42.0,Al Horford,C,6-9,240.0,June 3 1986,DO,16,Florida,37.0,65.0,33.0,1740.0,214.0,419.0,0.511,108.0,258.0,0.419,106.0,161.0,0.658385,0.64,26.0,30.0,0.867,82.0,331.0,413.0,168.0,38.0,62.0,48.0,93.0,562.0,8.646154,Al,Horford,
1,,Dalano Banton,,,,,,,,24.0,24.0,1.0,171.0,19.0,51.0,0.373,2.0,16.0,0.125,17.0,35.0,0.485714,0.392,16.0,20.0,0.8,12.0,23.0,35.0,19.0,5.0,3.0,10.0,20.0,56.0,2.333333,Dalano,Banton,
2,9.0,Derrick White,SG,6-4,190.0,July 2 1994,US,6,Colorado-Colorado Springs Colorado,29.0,73.0,73.0,2381.0,387.0,839.0,0.461,196.0,495.0,0.396,191.0,344.0,0.555233,0.578,137.0,152.0,0.901,51.0,259.0,310.0,377.0,74.0,87.0,112.0,152.0,1107.0,15.164384,Derrick,White,
3,,Drew Peterson,,,,,,,,24.0,3.0,0.0,23.0,4.0,6.0,0.667,3.0,5.0,0.6,1.0,1.0,1.0,0.917,0.0,0.0,,0.0,1.0,1.0,1.0,2.0,0.0,1.0,1.0,11.0,3.666667,Drew,Peterson,
4,13.0,Drew Peterson (TW),PF,6-9,205.0,November 9 1999,US,R,Rice University USC,,,,,,,,,,,,,,,,,,,,,,,,,,,,Drew,Peterson,(TW)
5,,JD Davison,,,,,,,,21.0,8.0,0.0,39.0,5.0,12.0,0.417,3.0,7.0,0.429,2.0,5.0,0.4,0.542,3.0,4.0,0.75,2.0,8.0,10.0,10.0,1.0,1.0,2.0,4.0,16.0,2.0,JD,Davison,
6,20.0,JD Davison (TW),SG,6-1,195.0,October 3 2002,US,1,Alabama,,,,,,,,,,,,,,,,,,,,,,,,,,,,JD,Davison,(TW)
7,44.0,Jaden Springer,PG,6-4,204.0,September 25 2002,US,2,Tennessee,21.0,17.0,1.0,130.0,13.0,30.0,0.433,2.0,11.0,0.182,11.0,19.0,0.578947,0.467,7.0,8.0,0.875,8.0,12.0,20.0,9.0,11.0,4.0,8.0,17.0,35.0,2.058824,Jaden,Springer,
8,7.0,Jaylen Brown,SF,6-6,223.0,October 24 1996,US,7,California,27.0,70.0,70.0,2343.0,627.0,1256.0,0.499,145.0,410.0,0.354,482.0,846.0,0.56974,0.557,211.0,300.0,0.703,84.0,303.0,387.0,249.0,83.0,37.0,166.0,185.0,1610.0,23.0,Jaylen,Brown,
9,0.0,Jayson Tatum,PF,6-8,210.0,March 3 1998,US,6,Duke,25.0,74.0,74.0,2645.0,672.0,1426.0,0.471,229.0,609.0,0.376,443.0,817.0,0.542228,0.552,414.0,497.0,0.833,67.0,534.0,601.0,364.0,75.0,43.0,188.0,145.0,1987.0,26.851351,Jayson,Tatum,
