# What can we do with pandas? Part Deux

We'll be working with our favorite folks on the court, the Boston Celtics. 

Let's set up our data.

In [None]:
import pandas

celtics = pandas.read_csv('boston_celtics_2023_2024.csv')
celtics.head(10)

We're going to go through each of the fields, but we're going to start small. 

Let's look at country_code. This is in lower case. It's more common to see this as an upper case field, so let's change that. 

In [None]:
### Simple String Operations

# The .str accessor upper-cases every value and skips missing ones
celtics.country_code = celtics.country_code.str.upper()
celtics.head(10)
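
Here's a minimal sketch, on made-up data, of how upper-casing behaves when a column has a missing value. The `demo` frame and its values are assumptions for illustration, not rows from the Celtics file.

```python
import pandas

# Toy data standing in for the real CSV (values are made up)
demo = pandas.DataFrame({"country_code": ["us", "do", None, "lv"]})

# Passing the plain str.upper function through apply() fails on the missing value
try:
    demo["country_code"].apply(str.upper)
    apply_failed = False
except TypeError:
    apply_failed = True

# The vectorized .str accessor upper-cases the strings and leaves the gap alone
demo["country_code"] = demo["country_code"].str.upper()
print(apply_failed)
print(demo)
```

If the column has no gaps, both approaches give the same result; the `.str` version is just safer when you aren't sure.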

Much better. Let's also inspect the data frame. Pay close attention to the Non-Null Count. 
(Do you remember how to find that NaN value from the last lesson? Who didn't go to college?!?!)

In [None]:
celtics.info()
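
As a refresher on finding that NaN, here's a small sketch on made-up data. It assumes a `college` column like the one in our data frame; the player names and schools are placeholders.

```python
import pandas

# Placeholder rows, not real roster data
demo = pandas.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "college": ["State U", None, "Tech"],
})

# isna() builds a True/False mask; indexing with it keeps only the rows where college is missing
no_college = demo[demo["college"].isna()]
print(no_college)
```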

Let's look at Drew for a second... (Remember how to filter rows with a string match?)

In [None]:
celtics[celtics.player.str.contains("Drew")]

Drew has 2 universities listed. Let's change the column name to colleges. 

In [None]:
# rename() returns a new DataFrame, so assign the result back to keep the change
celtics = celtics.rename(columns={"college": "colleges"})

```
Note: we can change all of the column names as well:

celtics.columns = [<list of new column names>]

This syntax requires you to specify all of the column names. The previous example using DataFrame.rename() allows
selective renaming and is more common.
```

There are a number of reasons for doing this. Sometimes source data has column names w/ spaces or that start w/ numbers or other limitations that prevent us from using dot notation. We might rename columns to fix this, because dot notation tends to be less error prone than bracket notation. 

Another reason is presentation. After all of your data wrangling is finished, you might want to output new column names that reflect some polish. This is the most common use of DataFrame.columns, because it's more likely that all of the columns will be renamed to fit presentation requirements. 
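
To make both reasons concrete, here's a rough sketch on a made-up frame. The column names (`player name`, `three_pointers`, and the presentation labels) are hypothetical and only there for illustration.

```python
import pandas

# A made-up frame with one name containing a space and one starting with a number
demo = pandas.DataFrame({"player name": ["Player A"], "3P": [2], "college": ["State U"]})

# rename() returns a new DataFrame and only touches the columns listed
renamed = demo.rename(columns={"player name": "player", "3P": "three_pointers"})

# Dot notation works now that the names are valid Python identifiers
print(renamed.player.tolist(), renamed.three_pointers.tolist())

# Assigning to .columns replaces every name at once; the list length must match the column count
presentation = renamed.copy()
presentation.columns = ["Player", "Three Pointers", "College"]
print(presentation)
```

Note that `demo` itself keeps its original names, because `rename()` didn't modify it in place.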

### Adding Columns by merging data frames

We could add a single column, but we will be doing that a bit later, and I want to add some data to make this more involved, so let's bring in a second data frame. Before we can merge the two, let's load it and figure out a good field to merge on.

In [None]:
totals = pandas.read_csv('boston_celtics_2023_2024_totals.csv')
totals.head(10)

Yikes. Not much to play with. We only have 2 options. 
1. We can use the player name, which is usually going to be unique on a sports team (at least within a given season).
2. We could calculate the age of the player from the first data frame based on their birthdate and match it to their age in this data frame. The downside is that we don't really know how it's been calculated (what if they had a birthday mid-season? Is this current? etc.)

For our purposes, the player column is the best key to merge on right now.
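
Since we're counting on player names being unique, it can be worth checking that before merging. This is a sketch on made-up frames (`left` and `right` are placeholders, not our real data); `is_unique` and the `validate` argument are standard pandas features.

```python
import pandas

# Placeholder frames that share a "player" key
left = pandas.DataFrame({"player": ["Player A", "Player B"], "country_code": ["US", "LV"]})
right = pandas.DataFrame({"player": ["Player A", "Player B"], "PTS": [100, 50]})

# is_unique is True when no value repeats in the Series
print(left["player"].is_unique, right["player"].is_unique)

# validate="one_to_one" makes merge raise an error if the key repeats on either side
merged = pandas.merge(left, right, on="player", how="outer", validate="one_to_one")
print(merged)
```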

In [None]:
# Using an "outer" method, which is similar to a SQL full outer join so we get all of the data. 
celtics = pandas.merge(celtics, totals, on='player', how='outer')
celtics.head(10)

Whoa, nellie! That certainly added a ton of data, and I imagine you can start to pick out some problems. 
Let's look at the DataFrame's structure and the data types. 

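Pay attention to the non-null counts. An outer merge keeps rows that only exist in one of the two data frames, and the columns from the other side are filled with NaN. Gaps like these are a common result of merging data frames or tables.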

In [None]:
celtics.info()
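
If you want to see exactly which rows caused the gaps, `merge` accepts `indicator=True`, which adds a `_merge` column saying where each row came from. A minimal sketch on made-up frames (the players and values are placeholders):

```python
import pandas

# Placeholder frames: Player A is only on the left, Player C only on the right
left = pandas.DataFrame({"player": ["Player A", "Player B"], "country_code": ["US", "LV"]})
right = pandas.DataFrame({"player": ["Player B", "Player C"], "PTS": [50, 75]})

# indicator=True adds a _merge column with the values left_only, right_only, or both
merged = pandas.merge(left, right, on="player", how="outer", indicator=True)
print(merged)

# Rows that did not find a partner in the other frame
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```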

### Brief Note On Display options. 
Hmm... we can't see all of our columns. 

Let's change that...

In [None]:
pandas.options.display.max_columns = None
pandas.options.display.max_rows = None

celtics.head(10)
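
Setting the options above changes them for the rest of the session. If you only want the wide display for one output, a context manager is one alternative. This is a sketch on a made-up 30-column frame:

```python
import pandas

# A made-up wide frame, wider than the default display
demo = pandas.DataFrame({f"col_{i}": [i] for i in range(30)})

# Inside the with-block every column is shown; the old setting comes back afterwards
with pandas.option_context("display.max_columns", None):
    inside = pandas.get_option("display.max_columns")
    print(demo)

# reset_option() puts an option back to the library default
pandas.reset_option("display.max_columns")
```

Be a little careful with `max_rows = None` on big data frames, since it will try to print every row.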

### Creating a new column from derived data. 

As you can see, we have a considerable number of pre-calculated fields. (Thanks to the wonderful folks at [basketball-reference.com](https://basketball-reference.com).) 

However, they left a few for us to play with. We only have total points, but points-per-game is a common statistic. Let's start by looking at the columns. 

In [None]:
celtics[['player', 'G', 'PTS']]

We're going to create a new column called 'PPG' for Points Per Game. We'll compute it from the columns we just queried: total points divided by games played.

Afterwards, let's sort the results in descending order so the highest per-game scorers come first.

In [None]:
celtics['PPG'] = celtics.PTS / celtics.G
celtics[['player', 'G', 'PTS', 'PPG']].sort_values(by=['PPG'], ascending=False)
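
One thing to watch for with a derived ratio is a player with zero games, where the division has no sensible answer. Here's a sketch on made-up numbers; treating zero games as missing and rounding to one decimal are assumptions on my part, not something the source data requires.

```python
import pandas

# Placeholder stats; Player C has not played a game
demo = pandas.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "G": [10, 4, 0],
    "PTS": [250, 30, 0],
})

# where() keeps G when the condition holds and substitutes NaN otherwise
games = demo["G"].where(demo["G"] > 0)

# Column arithmetic works element by element; NaN games produce a NaN PPG
demo["PPG"] = (demo["PTS"] / games).round(1)

# sort_values() places NaN at the end by default
ranked = demo.sort_values(by="PPG", ascending=False)
print(ranked)
```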

### Lambdas (Python) to create columns

That player column is bugging me. I'd like to split it up into first name, last name, and suffix.

In [None]:
# Create some lambdas to do work for us
get_first_name = lambda word: word.split(' ')[0]
get_last_name = lambda word: word.split(' ')[1]

# This is a little bit more challenging. The length check is what avoids a 'list index out of range';
# the negative index just grabs the last word. Names without a suffix get the string 'null'.
get_suffix = lambda suffix: suffix.split(' ')[-1] if len(suffix.split(' ')) > 2 else 'null'

# Create the new columns
celtics['first_name'] = celtics.player.apply(get_first_name)
celtics['last_name'] = celtics.player.apply(get_last_name)
celtics['suffix'] = celtics.player.apply(get_suffix)

celtics[['player','first_name','last_name','suffix']]
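
The lambdas assume every name is a string with at least two words. Here's a sketch of a more defensive version as a regular function. `split_name` is a hypothetical helper, the names are made up, and it returns None rather than the string 'null' when a piece is missing, which is a different choice from the cell above.

```python
import pandas

def split_name(full_name):
    """Return (first, last, suffix) using the same word-splitting rule as the lambdas."""
    # Missing values are not strings, so there is nothing to split
    if not isinstance(full_name, str):
        return (None, None, None)
    parts = full_name.split()
    first = parts[0] if parts else None
    last = parts[1] if len(parts) > 1 else None
    # Only treat the final word as a suffix when there are more than two words
    suffix = parts[-1] if len(parts) > 2 else None
    return (first, last, suffix)

# Made-up names covering the awkward cases
demo = pandas.DataFrame({"player": ["Alex Example", "Sam Sample Jr.", "Mononym", None]})

# Turn the tuples into three named columns and attach them by index
pieces = demo["player"].apply(split_name)
names = pandas.DataFrame(
    pieces.tolist(), index=demo.index, columns=["first_name", "last_name", "suffix"]
)
demo = demo.join(names)
print(demo)
```

Like the lambdas, this still treats the third word of any three-word name as a suffix, so a two-word last name would be split incorrectly.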

### Working with Rows. 

We can work with rows in a similar manner, performing a calculation on each row to create a new column or value. 

One of the most common sources of confusion in basketball statistics is FG (Field Goals or shots), FT (Free Throws), 3P (Three Pointers) and PTS.

1. A Free Throw is worth a single point. It is not a Field Goal.
2. Field Goals are 2 or 3 PT shots.
3. 2 Point shots aren't reflected as a separate statistic. You have to pull them out... so let's do that.


We're going to need 3 new columns: 
1. 2P - this will be the 2 pointers made, which = FG - 3P
2. 2PA - this is 2 pointers attempted, which is FGA - 3PA
3. 2P% - this is 2-Pt percentage, which is 2P / 2PA

In [None]:
# Create our python lambdas for 2P and 2PA (NOTE. We can't use dot notation w/ column names that start
# with a number)
get_2P = lambda row: row.FG - row['3P']
get_2PA = lambda row: row.FGA - row['3PA']

# create the new columns (make sure to set the axis to ROWS, axis=1!!!)
celtics['2P'] = celtics.apply(get_2P, axis=1)
celtics['2PA'] = celtics.apply(get_2PA, axis=1)

celtics[['last_name', '2P','2PA']]

Cool. That's done, so let's perform the calculation to get the percentage. 

In [None]:
get_2PPct = lambda row: row['2P'] / row['2PA']
celtics['2P%'] = celtics.apply(get_2PPct, axis=1)
celtics[['last_name', '2P','2PA', '2P%']]
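
For simple arithmetic like this, plain column operations give the same result as a row-wise `apply` and are usually faster. Here's a sketch on made-up stat lines, with a sanity check that the pieces add back up to PTS. The numbers and the `PTS_check` column are assumptions for illustration.

```python
import pandas

# Made-up stat lines for two players
demo = pandas.DataFrame({
    "FG": [8, 5], "FGA": [16, 12],
    "3P": [3, 1], "3PA": [8, 4],
    "FT": [4, 2], "PTS": [23, 13],
})

# Bracket notation is required because these names start with a number
demo["2P"] = demo["FG"] - demo["3P"]
demo["2PA"] = demo["FGA"] - demo["3PA"]
demo["2P%"] = demo["2P"] / demo["2PA"]

# Points should equal 2 per two-pointer, 3 per three-pointer, and 1 per free throw
demo["PTS_check"] = 2 * demo["2P"] + 3 * demo["3P"] + demo["FT"]
print(demo[["2P", "2PA", "2P%", "PTS", "PTS_check"]])
```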

Let's output the entire DataFrame to finish up.

In [None]:
celtics