# Getting Started with pandas

The first step in any data science project is getting our data into a format we can work with.

Let's bring in the NBA data from last time.

Note that this dataset comes from a GitLab link. We could also bring in the data by uploading it from our local machine or reading it straight from Google Drive. Many packages also ship with datasets we can import for testing. We will go over how to use these approaches.

In [1]:
import numpy as np
import pandas as pd
#Bring in the data 
pd.read_csv('https://gitlab.com/CEADS/DrKerby/python/raw/master/NBA_season1718_salary.csv')

Unnamed: 0.1,Unnamed: 0,Player,Tm,season17_18
0,1,Stephen Curry,GSW,34682550.0
1,2,LeBron James,CLE,33285709.0
2,3,Paul Millsap,DEN,31269231.0
3,4,Gordon Hayward,BOS,29727900.0
4,5,Blake Griffin,DET,29512900.0
...,...,...,...,...
568,569,Quinn Cook,NOP,25000.0
569,570,Chris Johnson,HOU,25000.0
570,571,Beno Udrih,DET,25000.0
571,572,Joel Bolomboy,MIL,22248.0


In [2]:
# The file already contains a saved index column ("Unnamed: 0"), and pandas adds its own index by default, so we can drop that column from the set.
salaries = pd.read_csv('https://gitlab.com/CEADS/DrKerby/python/raw/master/NBA_season1718_salary.csv').drop("Unnamed: 0", axis=1)

Note that to drop a column with .drop(), you need to pass axis=1. The default, axis=0, drops rows by their index label instead.
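
As a rough sketch of the point above, the cell below drops a saved index column in two equivalent ways and also shows `index_col=0` as an alternative at read time. The three-row CSV text is a stand-in that assumes the real file has the same layout (a first column with a blank header); it is not the full dataset.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the salary file (assumed layout:
# a leading column with a blank header, then Player, Tm, season17_18).
csv_text = """,Player,Tm,season17_18
1,Stephen Curry,GSW,34682550.0
2,LeBron James,CLE,33285709.0
3,Paul Millsap,DEN,31269231.0
"""

# pandas names a column with a blank header "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# Two equivalent ways to drop that column.
by_axis = raw.drop("Unnamed: 0", axis=1)
by_keyword = raw.drop(columns="Unnamed: 0")

# Alternatively, use the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(by_axis.columns.tolist())
print(indexed.index.tolist())
```

The `columns=` keyword can be easier to read than `axis=1`, since it says directly what is being dropped.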

# Exploring Data with pandas 

The most common tools for starting data exploration in pandas are the .head(), .describe(), and .info() methods and the .shape attribute.

In [4]:
#The .head() method gives us the first 5 entries in our DataFrame by default; we can also specify the number of entries we want to display
salaries.head(10)

Unnamed: 0,Player,Tm,season17_18
0,Stephen Curry,GSW,34682550.0
1,LeBron James,CLE,33285709.0
2,Paul Millsap,DEN,31269231.0
3,Gordon Hayward,BOS,29727900.0
4,Blake Griffin,DET,29512900.0
5,Kyle Lowry,TOR,28703704.0
6,Russell Westbrook,OKC,28530608.0
7,Mike Conley,MEM,28530608.0
8,James Harden,HOU,28299399.0
9,DeMar DeRozan,TOR,27739975.0


In [5]:
#.tail() works the same way, but with the bottom of the dataset.
salaries.tail()

Unnamed: 0,Player,Tm,season17_18
568,Quinn Cook,NOP,25000.0
569,Chris Johnson,HOU,25000.0
570,Beno Udrih,DET,25000.0
571,Joel Bolomboy,MIL,22248.0
572,Jarell Eddie,CHI,17224.0


In [6]:
#The .describe() method gives us summary statistics for the numeric columns of our data
salaries.describe() 

Unnamed: 0,season17_18
count,573.0
mean,5858946.0
std,7162373.0
min,17224.0
25%,1312611.0
50%,2386864.0
75%,7936509.0
max,34682550.0
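
The output above only covers the numeric column. As a small sketch (using three rows copied from the table earlier, not the full dataset), calling `.describe()` on a text column reports counts and the most frequent value instead of a mean and quartiles.

```python
import pandas as pd

# Three rows taken from the salary table shown earlier.
sample = pd.DataFrame({
    "Player": ["Kyle Lowry", "DeMar DeRozan", "James Harden"],
    "Tm": ["TOR", "TOR", "HOU"],
    "season17_18": [28703704.0, 27739975.0, 28299399.0],
})

# On a mixed DataFrame, describe() summarizes only the numeric columns.
numeric_summary = sample.describe()

# On a text column, describe() reports count, unique, top, and freq.
team_summary = sample["Tm"].describe()

print(numeric_summary)
print(team_summary)
```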


In [7]:
#.info() lists each column's name, non-null count, and dtype, along with the memory usage
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573 entries, 0 to 572
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Player       573 non-null    object 
 1   Tm           573 non-null    object 
 2   season17_18  573 non-null    float64
dtypes: float64(1), object(2)
memory usage: 13.6+ KB


In [10]:
#.shape is an attribute (no parentheses) that gives the dimensions of the DataFrame as (rows, columns)
salaries.shape

(573, 3)
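
Because `.shape` is a plain tuple, it can be unpacked into separate variables. A minimal sketch on a three-row sample (rows copied from the table above):

```python
import pandas as pd

# Three rows taken from the salary table shown earlier.
sample = pd.DataFrame({
    "Player": ["Stephen Curry", "LeBron James", "Paul Millsap"],
    "Tm": ["GSW", "CLE", "DEN"],
    "season17_18": [34682550.0, 33285709.0, 31269231.0],
})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = sample.shape

# len() on a DataFrame returns the number of rows.
row_count = len(sample)

print(n_rows, n_cols, row_count)
```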

In [11]:
#We can specify the columns we want to apply the methods to by using a list 
print(salaries[["Player", "Tm"]].head())

           Player   Tm
0   Stephen Curry  GSW
1    LeBron James  CLE
2    Paul Millsap  DEN
3  Gordon Hayward  BOS
4   Blake Griffin  DET
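
The number of brackets matters when selecting columns. As a quick sketch on a small sample (rows copied from the output above), a single column name returns a Series, while a list of names returns a DataFrame, even if the list holds only one name.

```python
import pandas as pd

# Three rows taken from the salary table shown earlier.
sample = pd.DataFrame({
    "Player": ["Stephen Curry", "LeBron James", "Paul Millsap"],
    "Tm": ["GSW", "CLE", "DEN"],
    "season17_18": [34682550.0, 33285709.0, 31269231.0],
})

one_col_series = sample["Player"]      # single name -> Series
one_col_frame = sample[["Player"]]     # list with one name -> DataFrame
two_cols = sample[["Player", "Tm"]]    # list with two names -> DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```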


# Sorting and Grouping with pandas

In [13]:
#Sort by team abbreviation (ascending by default) and show the last 10 rows
salaries.sort_values('Tm').tail(10)

Unnamed: 0,Player,Tm,season17_18
140,Markieff Morris,WAS,8000000.0
458,Martell Webster,WAS,830000.0
48,John Wall,WAS,18063850.0
19,Bradley Beal,WAS,23775506.0
193,Jason Smith,WAS,5225000.0
16,Otto Porter,WAS,24773250.0
497,Ramon Sessions,WAS,263124.0
508,Carrick Felix,WAS,140902.0
330,Tim Frazier,WAS,2000000.0
93,Marcin Gortat,WAS,12782609.0


In [14]:
#Sort in descending order, so the alphabetically first teams end up at the bottom
salaries.sort_values('Tm', ascending=False).tail(10)

Unnamed: 0,Player,Tm,season17_18
166,Marco Belinelli,ATL,6306060.0
278,Richard Jefferson,ATL,2500000.0
202,Mike Muscala,ATL,5000000.0
483,Tyler Cavanaugh,ATL,679919.0
334,John Collins,ATL,1936920.0
363,DeAndre' Bembry,ATL,1567200.0
111,Jamal Crawford,ATL,10942762.0
95,Miles Plumlee,ATL,12500000.0
351,Mike Dunleavy,ATL,1662500.0
180,Ersan Ilyasova,ATL,6000000.0
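
In the outputs above, the rows within a team come out in no particular salary order. One way to control that, sketched here on five rows copied from those outputs, is to pass a list of columns to `sort_values` along with a matching list for `ascending`.

```python
import pandas as pd

# Five rows taken from the sorted outputs shown above.
sample = pd.DataFrame({
    "Player": ["John Wall", "Bradley Beal", "Otto Porter",
               "John Collins", "Miles Plumlee"],
    "Tm": ["WAS", "WAS", "WAS", "ATL", "ATL"],
    "season17_18": [18063850.0, 23775506.0, 24773250.0,
                    1936920.0, 12500000.0],
})

# Sort by team A-Z, then by salary from highest to lowest within each team.
by_team_then_salary = sample.sort_values(
    ["Tm", "season17_18"], ascending=[True, False]
)

print(by_team_then_salary)
```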


In [16]:
salaries.sort_values("Tm", inplace=True)  #inplace=True modifies salaries itself instead of returning a sorted copy
salaries 

Unnamed: 0,Player,Tm,season17_18
334,John Collins,ATL,1936920.0
435,Nicolas Brussino,ATL,1312611.0
434,Okaro White,ATL,1312611.0
433,Sheldon Mac,ATL,1312611.0
363,DeAndre' Bembry,ATL,1567200.0
...,...,...,...
16,Otto Porter,WAS,24773250.0
497,Ramon Sessions,WAS,263124.0
508,Carrick Felix,WAS,140902.0
330,Tim Frazier,WAS,2000000.0
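
As a small sketch of what "permanent" means here: without `inplace=True`, `sort_values` returns a sorted copy and leaves the original untouched. In both cases each row keeps its original index label, which is why the index above is no longer in order. The sample below uses three rows copied from the first table.

```python
import pandas as pd

# Three rows taken from the salary table shown earlier.
sample = pd.DataFrame({
    "Player": ["Stephen Curry", "LeBron James", "Paul Millsap"],
    "Tm": ["GSW", "CLE", "DEN"],
    "season17_18": [34682550.0, 33285709.0, 31269231.0],
})

# Without inplace=True we get a new, sorted DataFrame back.
sorted_copy = sample.sort_values("Tm")

print(sample["Tm"].tolist())        # original order is unchanged
print(sorted_copy["Tm"].tolist())   # sorted by team
print(sorted_copy.index.tolist())   # original labels travel with the rows
```

Assigning the result to a name (for example `sample = sample.sort_values("Tm")`) is a common alternative to `inplace=True`.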


In [17]:
#Let's reset the index values to 0, 1, 2, ...; drop=True discards the old index instead of keeping it as a new column
salaries.reset_index(drop=True, inplace=True)
salaries.head()

Unnamed: 0,Player,Tm,season17_18
0,John Collins,ATL,1936920.0
1,Nicolas Brussino,ATL,1312611.0
2,Okaro White,ATL,1312611.0
3,Sheldon Mac,ATL,1312611.0
4,DeAndre' Bembry,ATL,1567200.0
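
The cells above cover sorting only. For the grouping half of this section, here is a minimal sketch of what `groupby` could look like on this kind of data. The four rows are copied from tables in this notebook; the choice of `sum`, `count`, and `mean` is just an illustration.

```python
import pandas as pd

# Four rows taken from tables in this notebook.
sample = pd.DataFrame({
    "Player": ["Al Horford", "Gordon Hayward", "LeBron James", "Kevin Love"],
    "Tm": ["BOS", "BOS", "CLE", "CLE"],
    "season17_18": [27734405.0, 29727900.0, 33285709.0, 22642350.0],
})

# Total salary per team: split by Tm, then sum the salary column.
total_by_team = sample.groupby("Tm")["season17_18"].sum()

# Several statistics at once with agg().
summary_by_team = sample.groupby("Tm")["season17_18"].agg(["count", "mean"])

print(total_by_team)
print(summary_by_team)
```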


# Boolean Mask

Let's find out how many players from the Utah Jazz are in this set. 

In [18]:
#Comparing the Tm column with the == operator returns a new Series of Booleans; the column itself is not changed
salaries["Tm"] == 'UTA'

0      False
1      False
2      False
3      False
4      False
       ...  
568    False
569    False
570    False
571    False
572    False
Name: Tm, Length: 573, dtype: bool

This output isn't very helpful on its own. Instead, we should store the mask in a variable and use it to filter the DataFrame.

In [20]:
mask = salaries["Tm"] == 'UTA'
salaries[mask]

Unnamed: 0,Player,Tm,season17_18
535,Raul Neto,UTA,1471382.0
536,Jae Crowder,UTA,6796117.0
537,Donovan Mitchell,UTA,2621280.0
538,Ekpe Udoh,UTA,3200000.0
539,Rudy Gobert,UTA,21974719.0
540,Joel Bolomboy,UTA,1312611.0
541,Derrick Favors,UTA,12000000.0
542,Ricky Rubio,UTA,14275000.0
543,Joe Ingles,UTA,14136364.0
544,Eric Griffin,UTA,50000.0
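
To answer the "how many" question directly, we can count the `True` values in the mask. A short sketch on four rows copied from the tables above (so the count here is 3, not the count for the full dataset):

```python
import pandas as pd

# Four rows taken from tables in this notebook.
sample = pd.DataFrame({
    "Player": ["Rudy Gobert", "Ricky Rubio", "Joe Ingles", "Stephen Curry"],
    "Tm": ["UTA", "UTA", "UTA", "GSW"],
    "season17_18": [21974719.0, 14275000.0, 14136364.0, 34682550.0],
})

mask = sample["Tm"] == "UTA"

# True counts as 1 and False as 0, so sum() counts the matching rows.
n_utah = int(mask.sum())

# Equivalent: filter first, then take the length.
n_utah_filtered = len(sample[mask])

print(n_utah, n_utah_filtered)
```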


In [None]:
#Now use a Boolean Mask to see which players made more than 20 million 

In [21]:
mask = salaries['season17_18'] > 20000000
salaries[mask]

Unnamed: 0,Player,Tm,season17_18
36,Al Horford,BOS,27734405.0
37,Gordon Hayward,BOS,29727900.0
96,Nicolas Batum,CHO,22434783.0
98,Dwight Howard,CHO,23500000.0
109,LeBron James,CLE,33285709.0
112,Kevin Love,CLE,22642350.0
115,Harrison Barnes,DAL,23112004.0
148,Paul Millsap,DEN,31269231.0
163,Blake Griffin,DET,29512900.0
165,Andre Drummond,DET,23775506.0
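
Masks can also be combined. The sketch below, on four rows copied from the tables above, uses `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `==` and `>`.

```python
import pandas as pd

# Four rows taken from tables in this notebook.
sample = pd.DataFrame({
    "Player": ["Rudy Gobert", "Ricky Rubio", "Stephen Curry", "Seth Curry"],
    "Tm": ["UTA", "UTA", "GSW", "DAL"],
    "season17_18": [21974719.0, 14275000.0, 34682550.0, 3028410.0],
})

is_utah = sample["Tm"] == "UTA"
over_20m = sample["season17_18"] > 20_000_000

# Both conditions must hold.
utah_and_over_20m = sample[is_utah & over_20m]

# At least one condition must hold.
utah_or_over_20m = sample[(sample["Tm"] == "UTA") | (sample["season17_18"] > 20_000_000)]

# Invert a mask with ~.
not_utah = sample[~is_utah]

print(utah_and_over_20m["Player"].tolist())
print(utah_or_over_20m["Player"].tolist())
print(not_utah["Player"].tolist())
```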


In [22]:
#How about a specific player? An exact match on the Player column works the same way
salaries[salaries["Player"] == "Stephen Curry"]

Unnamed: 0,Player,Tm,season17_18
175,Stephen Curry,GSW,34682550.0


In [23]:
#.str.contains() matches a substring, so this returns every player whose name contains "Curr"
salaries[salaries["Player"].str.contains("Curr")]

Unnamed: 0,Player,Tm,season17_18
131,Seth Curry,DAL,3028410.0
175,Stephen Curry,GSW,34682550.0
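
`.str.contains()` is case-sensitive by default. As a hedged sketch, the `case` and `na` parameters make the search more forgiving. The names come from tables above, while the missing entry (`None`) is a hypothetical addition to show what `na=False` does; the real Player column has no missing values according to `.info()`.

```python
import pandas as pd

# Names from the tables above, plus a hypothetical missing entry.
names = pd.Series(["Seth Curry", "Stephen Curry", "Kyle Lowry", None])

# Lowercase "curr" does not match with the default case-sensitive search.
strict_mask = names.str.contains("curr", na=False)

# case=False ignores capitalization; na=False treats missing names as no match.
relaxed_mask = names.str.contains("curr", case=False, na=False)

print(strict_mask.tolist())
print(names[relaxed_mask].tolist())
```

Without `na=False`, a missing name may come back as a missing value in the mask rather than `False`, depending on the pandas version, and such a mask may not be usable for filtering.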


In [26]:
#If we want to change the names of the columns, we can assign a list to the .columns attribute. The new headers are applied in the order of the entries we put in, so the list must name every column
salaries.columns = ["Player", "Team", "Salary_17_18"]
salaries.head()

Unnamed: 0,Player,Team,Salary_17_18
0,John Collins,ATL,1936920.0
1,Nicolas Brussino,ATL,1312611.0
2,Okaro White,ATL,1312611.0
3,Sheldon Mac,ATL,1312611.0
4,DeAndre' Bembry,ATL,1567200.0
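
Because the names assigned to `.columns` are matched by position, the list must contain exactly one entry per column. A small sketch on two rows copied from the output above shows the error raised on a length mismatch and a quick way to check which old name maps to which new name.

```python
import pandas as pd

# Two rows taken from the output above, with the original headers.
sample = pd.DataFrame({
    "Player": ["John Collins", "Nicolas Brussino"],
    "Tm": ["ATL", "ATL"],
    "season17_18": [1936920.0, 1312611.0],
})

# Too few names: pandas raises a ValueError and leaves the headers unchanged.
mismatch_error = None
try:
    sample.columns = ["Player", "Team"]  # only 2 names for 3 columns
except ValueError as err:
    mismatch_error = err

# Record the current order, assign the new names, then pair old with new.
old_names = list(sample.columns)
sample.columns = ["Player", "Team", "Salary_17_18"]
name_mapping = dict(zip(old_names, sample.columns))

print(mismatch_error)
print(name_mapping)
```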


In [None]:
#We can rename specific columns by passing a Python dictionary that maps old names to new names
salaries.rename(columns={"Team": "Tm"}, inplace=True)
s