# What can we do with Pandas?

In this example, we'll be working with data from the 2023-24 Boston Celtics.

In [2]:
# Before we analyze anything, we need to import pandas
import pandas as pd

### Loading data from a csv file

We can load data into Pandas from a CSV (comma-separated values) file. This data represents the Celtics roster.

In [4]:
celtics = pd.read_csv('boston_celtics_2023_2024.csv')
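
The CSV file itself isn't included here, so the following is only a minimal sketch of the same call, using an in-memory string in place of the file. The three rows are copied from the roster in this example; `csv_text` and `sample` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# A three-row stand-in for boston_celtics_2023_2024.csv (rows copied from the roster).
csv_text = """number,player,position,height,weight
11,Payton Pritchard,PG,6-1,195
30,Sam Hauser,SF,6-8,215
0,Jayson Tatum,PF,6-8,210
"""

# read_csv accepts a path or any file-like object, so StringIO works for a quick test.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 5): three rows, five columns
print(sample.dtypes)  # number and weight are parsed as integers; the rest stay text
```

Note that `height` stays text because a value like `6-1` isn't a number as far as pandas is concerned.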

### Selecting Data (Previewing)

Let's examine the first 10 rows of our data.

In [6]:
celtics.head(10)

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
0,11,Payton Pritchard,PG,6-1,195,January 28 1998,us,3,Oregon
1,30,Sam Hauser,SF,6-8,215,December 8 1997,us,2,Marquette Virginia
2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
6,42,Al Horford,C,6-9,240,June 3 1986,do,16,Florida
7,40,Luke Kornet,C,7-2,250,July 15 1995,us,6,Vanderbilt
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,
9,12,Oshae Brissett,SF,6-7,210,June 20 1998,ca,4,Syracuse


### Inspecting the structure of the data frame

Let's see what the data looks like.

In [8]:
celtics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   number        17 non-null     int64 
 1   player        17 non-null     object
 2   position      17 non-null     object
 3   height        17 non-null     object
 4   weight        17 non-null     int64 
 5   birth_date    17 non-null     object
 6   country_code  17 non-null     object
 7   experience    17 non-null     object
 8   college       16 non-null     object
dtypes: int64(2), object(7)
memory usage: 1.3+ KB
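
`info()` reports `experience` as `object` even though the values shown so far look numeric. One possible explanation, and it is only an assumption because the full column isn't shown here, is that at least one entry isn't a number. The sketch below uses a hypothetical `"R"` entry to show how `pd.to_numeric` can coerce such a column.

```python
import pandas as pd

# Hypothetical values: "R" stands in for whatever non-numeric entry the real column may hold.
experience = pd.Series(["3", "2", "6", "R"], name="experience")

# errors="coerce" turns anything that can't be parsed into NaN instead of raising.
as_numbers = pd.to_numeric(experience, errors="coerce")

print(as_numbers.dtype)               # float64, because NaN forces a float column
print(int(as_numbers.isna().sum()))   # 1 entry could not be parsed
```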


### Selecting Data by column

What colleges did the team go to? 

In [10]:
celtics.college

0                                 Oregon
1                     Marquette Virginia
2                                   Duke
3     Colorado-Colorado Springs Colorado
4                             California
5                                   UCLA
6                                Florida
7                             Vanderbilt
8                                    NaN
9                               Syracuse
10                                Kansas
11                 Utah State University
12                        Michigan State
13                             Tennessee
14                              Arkansas
15                               Alabama
16                   Rice University USC
Name: college, dtype: object


Let's inspect the data types

In [12]:
print(type(celtics))
print(type(celtics.college))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


### Selecting Multiple Columns

Well, a list of colleges without names isn't very useful. Let's add the player names too.

In [14]:
celtics[['player','college']]

Unnamed: 0,player,college
0,Payton Pritchard,Oregon
1,Sam Hauser,Marquette Virginia
2,Jayson Tatum,Duke
3,Derrick White,Colorado-Colorado Springs Colorado
4,Jaylen Brown,California
5,Jrue Holiday,UCLA
6,Al Horford,Florida
7,Luke Kornet,Vanderbilt
8,Kristaps Porziņģis,
9,Oshae Brissett,Syracuse


Let's check the data type again. (HINT: It's different when selecting multiple columns!) 

In [16]:
type(celtics[['player','college']])

pandas.core.frame.DataFrame
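
To make the difference concrete, here is a minimal sketch on a three-row stand-in frame (rows copied from the roster; `df` and the other variable names are illustrative). A single column label gives back a Series, while a list of labels gives back a DataFrame, even if the list has only one entry.

```python
import pandas as pd

# Small stand-in frame; rows copied from the roster.
df = pd.DataFrame({
    "player": ["Payton Pritchard", "Sam Hauser", "Jayson Tatum"],
    "college": ["Oregon", "Marquette Virginia", "Duke"],
})

one_col_series = df["college"]             # single label -> Series
one_col_frame = df[["college"]]            # list with one label -> DataFrame
two_col_frame = df[["player", "college"]]  # list with two labels -> DataFrame

print(type(one_col_series).__name__)  # Series
print(type(one_col_frame).__name__)   # DataFrame
print(two_col_frame.shape)            # (3, 2)
```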

### Selecting Rows

iloc is a way to select rows based on integer location. Let's select Jaylen Brown.

In [18]:
celtics.iloc[4]

number                        7
player             Jaylen Brown
position                     SF
height                      6-6
weight                      223
birth_date      October 24 1996
country_code                 us
experience                    7
college              California
Name: 4, dtype: object

We can also select rows using Python's **slice** notation. The second number is *non-inclusive*.

In [20]:
celtics.iloc[2:7]

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
6,42,Al Horford,C,6-9,240,June 3 1986,do,16,Florida
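
Here is a short sketch of the same position-based selections on a stand-in frame holding just the first seven player names from the roster (`df`, `row`, `chunk`, and `last_two` are illustrative names). It also shows that negative positions count from the end, as in ordinary Python slicing.

```python
import pandas as pd

# The first seven players, in roster order (positions 0 through 6).
players = ["Payton Pritchard", "Sam Hauser", "Jayson Tatum", "Derrick White",
           "Jaylen Brown", "Jrue Holiday", "Al Horford"]
df = pd.DataFrame({"player": players})

row = df.iloc[4]         # a single position returns a Series
chunk = df.iloc[2:7]     # positions 2, 3, 4, 5, 6 (position 7 is excluded)
last_two = df.iloc[-2:]  # negative positions count from the end

print(row["player"])                 # Jaylen Brown
print(len(chunk))                    # 5
print(last_two["player"].tolist())   # ['Jrue Holiday', 'Al Horford']
```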


### Selecting Rows by Logic

Who are our fives? (Centers)

In [22]:
celtics[celtics.position == 'C']

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
6,42,Al Horford,C,6-9,240,June 3 1986,do,16,Florida
7,40,Luke Kornet,C,7-2,250,July 15 1995,us,6,Vanderbilt
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,
11,88,Neemias Queta,C,7-0,245,July 13 1999,pt,2,Utah State University


Who has a birthday coming up? 

In [24]:
celtics[celtics.birth_date.str.contains('June')]

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
6,42,Al Horford,C,6-9,240,June 3 1986,do,16,Florida
9,12,Oshae Brissett,SF,6-7,210,June 20 1998,ca,4,Syracuse
10,50,Svi Mykhailiuk,SF,6-7,205,June 10 1997,ua,5,Kansas
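
Matching on the text `'June'` works because the month is spelled out in every date. As an alternative sketch, assuming every `birth_date` follows the "Month day year" pattern seen here, the strings can be parsed into real dates with `pd.to_datetime` and filtered by month number. The four rows below are copied from the roster; `born` and `june_birthdays` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Jrue Holiday", "Luke Kornet", "Oshae Brissett", "Jayson Tatum"],
    "birth_date": ["June 12 1990", "July 15 1995", "June 20 1998", "March 3 1998"],
})

# Parse "Month day year" strings into timestamps (English month names assumed).
born = pd.to_datetime(df["birth_date"], format="%B %d %Y")

# .dt.month is an integer from 1 to 12, so June is 6.
june_birthdays = df[born.dt.month == 6]

print(june_birthdays["player"].tolist())  # ['Jrue Holiday', 'Oshae Brissett']
```

Once the column is a real date, other questions (oldest player, birthdays in a date range) become simple comparisons instead of string matching.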


Who plays guard?

In [26]:
celtics[(celtics.position == 'PG') | (celtics.position =='SG')]

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
0,11,Payton Pritchard,PG,6-1,195,January 28 1998,us,3,Oregon
3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
13,44,Jaden Springer,PG,6-4,204,September 25 2002,us,2,Tennessee
15,20,JD Davison (TW),SG,6-1,195,October 3 2002,us,1,Alabama
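
The parentheses around each comparison matter: `|` binds more tightly than `==` in Python, so leaving them out changes what gets compared. When the test is "one of several values", `isin` expresses the same filter without chaining `|`. A minimal sketch on a four-row stand-in frame (rows copied from the roster, variable names illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["Payton Pritchard", "Sam Hauser", "Derrick White", "Al Horford"],
    "position": ["PG", "SF", "SG", "C"],
})

# Two equivalent ways to keep the guards.
with_or = df[(df.position == "PG") | (df.position == "SG")]
with_isin = df[df.position.isin(["PG", "SG"])]

print(with_isin["player"].tolist())  # ['Payton Pritchard', 'Derrick White']
print(with_or.equals(with_isin))     # True
```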


Which players weren't born in the US, but attended college? 

In [28]:
celtics[(celtics.country_code != 'us') & (celtics.college.notna())]

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
6,42,Al Horford,C,6-9,240,June 3 1986,do,16,Florida
9,12,Oshae Brissett,SF,6-7,210,June 20 1998,ca,4,Syracuse
10,50,Svi Mykhailiuk,SF,6-7,205,June 10 1997,ua,5,Kansas
11,88,Neemias Queta,C,7-0,245,July 13 1999,pt,2,Utah State University
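
A minimal sketch of the same question with each condition built as its own named boolean mask, which tends to be easier to read and debug. The "attended college" condition is spelled out with `notna()`. The four rows are copied from the roster; `born_abroad`, `has_college`, and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "player": ["Jrue Holiday", "Al Horford", "Kristaps Porziņģis", "Oshae Brissett"],
    "country_code": ["us", "do", "lv", "ca"],
    "college": ["UCLA", "Florida", np.nan, "Syracuse"],
})

born_abroad = df.country_code != "us"  # True where the country code isn't 'us'
has_college = df.college.notna()       # True where a college is recorded

# & keeps only the rows where both masks are True.
result = df[born_abroad & has_college]

print(result["player"].tolist())  # ['Al Horford', 'Oshae Brissett']
```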


(HINT: The query above is looking for a defined cell. Let's see what happens when we look for an undefined cell.)

In [30]:
# This is a crappy way to hunt down NaNs. 
celtics[celtics.isnull().any(axis=1)]

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,
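
If the goal is to see where the gaps are rather than which rows have them, counting missing values per column is a handy companion. A small sketch on stand-in rows copied from the roster (`missing_per_column` and `rows_with_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "player": ["Al Horford", "Luke Kornet", "Kristaps Porziņģis"],
    "position": ["C", "C", "C"],
    "college": ["Florida", "Vanderbilt", np.nan],
})

# isna() gives a True/False frame; summing counts the True values in each column.
missing_per_column = df.isna().sum()

# any(axis=1) flags each row that has at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing["player"].tolist())  # ['Kristaps Porziņģis']
```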


Who went to college in California? 

In [32]:
celtics[celtics.college.isin(['California','UCLA','USC'])]

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
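
`isin` only matches whole values, so the entry at index 16, `Rice University USC`, is not picked up even though it mentions USC. Whether a multi-school entry like that should count is a judgment call. If it should, one option is a substring match with `str.contains`. The sketch below uses only the `college` values and index labels shown earlier; `pattern`, `exact`, and `partial` are illustrative names.

```python
import numpy as np
import pandas as pd

# College values copied from the roster, keeping their original index labels.
colleges = pd.Series(
    ["Duke", "California", "UCLA", np.nan, "Rice University USC"],
    index=[2, 4, 5, 8, 16],
    name="college",
)

# Exact match: the whole cell must equal one of the listed values.
exact = colleges[colleges.isin(["California", "UCLA", "USC"])]

# Substring match: \b keeps the match on whole words; na=False treats NaN as "no match".
pattern = r"\b(?:California|UCLA|USC)\b"
partial = colleges[colleges.str.contains(pattern, na=False)]

print(exact.index.tolist())    # [4, 5]
print(partial.index.tolist())  # [4, 5, 16]
```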


### Setting Indices

(This is using pandas' loc, not iloc)

Let's set the starting lineup!  

In [34]:
starting_lineup = celtics.loc[[2,3,4,5,8]]
starting_lineup

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,
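
Notice that the lineup keeps its original labels (2, 3, 4, 5, 8). That is where `loc` and `iloc` start to disagree: `loc` looks up labels, `iloc` counts positions. A minimal sketch on a stand-in frame with the same five labels (player names copied from the roster, other names illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"player": ["Jayson Tatum", "Derrick White", "Jaylen Brown",
                "Jrue Holiday", "Kristaps Porziņģis"]},
    index=[2, 3, 4, 5, 8],  # original row labels, not 0..4
)

by_label = df.loc[8, "player"]      # the row whose label is 8
by_position = df.iloc[4]["player"]  # the fifth row, whatever its label

print(by_label == by_position)  # True: both are the last row

# Label 0 doesn't exist in this frame, so loc raises KeyError.
try:
    df.loc[0]
except KeyError:
    print("no row labelled 0")
```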


We're going to use this starting lineup a lot. Wouldn't it be nice if we could update the indexes?

In [36]:
new_starting_lineup = starting_lineup.reset_index()
new_starting_lineup

Unnamed: 0,index,number,player,position,height,weight,birth_date,country_code,experience,college
0,2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
1,3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
2,4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
3,5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
4,8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,


Hmm. So that's cool, but now I have a wasted data frame. 'starting_lineup' didn't update...

In [38]:
starting_lineup

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,


Let's fix that. 

In [40]:
starting_lineup.reset_index(inplace=True)
starting_lineup

Unnamed: 0,index,number,player,position,height,weight,birth_date,country_code,experience,college
0,2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
1,3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
2,4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
3,5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
4,8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,


That's much better. I don't really need that second index column. Get outta here. 

In [42]:
starting_lineup.reset_index(drop=True, inplace=True)
starting_lineup

Unnamed: 0,index,number,player,position,height,weight,birth_date,country_code,experience,college
0,2,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
1,3,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
2,4,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
3,5,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
4,8,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,


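HA! Tricked you. That doesn't work. The index column is still there, because the earlier reset_index() call had already turned the old index into a regular column. Adding drop=True now only throws away the fresh 0-4 index. It's a bit of an either/or: drop the old index on the first reset, or keep it as a column.

So when you don't need the index column, the simplest solution is to rebuild the lineup and reassign the result to the same variable, which makes **inplace=True** unnecessary...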

```NOTE: It isn't unusual for in-place changes like this to behave unexpectedly. If you plan to reset an index, I recommend doing it right after you create the data frame.```

In [73]:
starting_lineup = celtics.loc[[2,3,4,5,8]]
starting_lineup = starting_lineup.reset_index(drop=True)
starting_lineup

Unnamed: 0,number,player,position,height,weight,birth_date,country_code,experience,college
0,0,Jayson Tatum,PF,6-8,210,March 3 1998,us,6,Duke
1,9,Derrick White,SG,6-4,190,July 2 1994,us,6,Colorado-Colorado Springs Colorado
2,7,Jaylen Brown,SF,6-6,223,October 24 1996,us,7,California
3,4,Jrue Holiday,PG,6-4,205,June 12 1990,us,14,UCLA
4,8,Kristaps Porziņģis,C,7-2,240,August 2 1995,lv,7,
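
As a side-by-side sketch of the two behaviours, here is a three-player stand-in frame with the original labels 2, 3, 4 (names copied from the roster; `lineup`, `clean`, `kept`, and `repaired` are illustrative). It also shows one way to repair a frame where the old index has already become a column: drop that column by name.

```python
import pandas as pd

lineup = pd.DataFrame(
    {"player": ["Jayson Tatum", "Derrick White", "Jaylen Brown"]},
    index=[2, 3, 4],
)

# Option 1: discard the old labels at reset time.
clean = lineup.reset_index(drop=True)

# Option 2: the old labels were already saved as an "index" column,
# so remove that column afterwards.
kept = lineup.reset_index()
repaired = kept.drop(columns="index")

print(kept.columns.tolist())   # ['index', 'player']
print(clean.columns.tolist())  # ['player']
print(clean.equals(repaired))  # True
```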
