<a href="https://colab.research.google.com/github/dnmalan/advanced-data-journalism-23/blob/main/Day_1_Introduction_to_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Day 1: Introduction to Pandas

In this lesson, we will explore how to manipulate and analyze data using Pandas. Pandas excels in data cleaning, preparation, exploration, and transformation, making it an indispensable library for data professionals.

We'll use a dataset of baseball player salaries, which includes columns for Name, Team, League, Position, and Salary. You will learn how to perform essential operations on DataFrames, such as importing data, displaying data, sorting, filtering, running functions, and using group by.

### Helpful links

- [10 minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Pandas tutorials, quizzes and exercises on W3 schools](https://www.w3schools.com/python/pandas/default.asp)

## 1. Importing data from a CSV and creating a dataframe

First we need to import our dataset. You can import from various places, including your computer, cloud storage, or GitHub (via a URL). We will import a CSV file that is stored on GitHub at this URL:

https://raw.githubusercontent.com/dnmalan/advanced-data-journalism-23/main/data/baseball_players.csv

Pandas stores data in objects called **dataframes**: two-dimensional tabular data structures that organize data into rows and columns.
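
To make the rows-and-columns idea concrete, here is a minimal sketch that builds a tiny dataframe by hand. The three rows are copied from the first rows of the dataset; the variable name `sample` is only for illustration and is not part of the lesson's notebook.

```python
import pandas as pd

# A hand-built sample: each dictionary key becomes a column,
# and each list holds that column's values, one per row.
sample = pd.DataFrame({
    'Name': ['Ildemaro Vargas', 'Richie Martin', 'Drew Jackson'],
    'Team': ['Arizona Diamondbacks', 'Baltimore Orioles', 'Baltimore Orioles'],
    'Salary': [555000, 555000, 555000],
})

print(sample)                 # the full table
print(list(sample.columns))   # the column names
print(len(sample))            # the number of rows
```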


In [1]:
# Import the libraries needed into your notebook
import pandas as pd



In [None]:
# Create a dataframe called "df", and import a csv file into that dataframe
df = pd.read_csv('https://raw.githubusercontent.com/dnmalan/advanced-data-journalism-23/main/data/baseball_players.csv')

# Display the first few rows of the DataFrame
df.head()
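
`pd.read_csv` accepts file-like objects as well as URLs and file paths, so you can try it without a network connection. This sketch assumes a few lines of CSV text typed in by hand (copied from rows shown in this lesson) rather than the real file; `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: a header line plus three rows.
csv_text = """Name,Team,League,Position,Salary
Ildemaro Vargas,Arizona Diamondbacks,NL,SS,555000
Mike Trout,Los Angeles Angels,AL,CF,34083333
Max Scherzer,Washington Nationals,NL,SP,42142857
"""

# io.StringIO wraps the text so read_csv can treat it like a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())
print(sample.dtypes)  # Salary is read as a number, the rest as text
```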

Pandas generally follows this format to run functions:

- **dataframe.function()**

This is what's called a Python **method**, or a function that is associated with an object (the dataframe) and can perform actions on it.
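
The parentheses matter. A method such as `head()` is called with parentheses, while an attribute such as `shape` is read without them. A small sketch on a hand-built sample (the name `sample` is hypothetical; the rows are copied from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Clayton Kershaw', 'Zack Greinke', 'Mike Trout'],
    'Salary': [31000000, 32421884, 34083333],
})

# Method: parentheses run the function; arguments go inside them.
first_two = sample.head(2)

# Attribute: no parentheses, it simply reports a stored property.
dimensions = sample.shape

print(first_two)
print(dimensions)
```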

## 2. Exploring the dataset
Before doing any analysis, we'll need to explore and get to know the dataset. Let's inspect the first and last few rows, as well as get some initial statistics about our dataset.

In [3]:
# Display the first 5 rows
df.head()

Unnamed: 0,Name,Team,League,Position,Salary
0,Ildemaro Vargas,Arizona Diamondbacks,NL,SS,555000
1,Richie Martin,Baltimore Orioles,AL,SS,555000
2,Drew Jackson,Baltimore Orioles,AL,SS,555000
3,Eric Stamets,Cleveland Indians,AL,SS,555000
4,Michael Kopech,Chicago White Sox,AL,SP,555000


In [4]:
# Display the last 5 rows
df.tail()

Unnamed: 0,Name,Team,League,Position,Salary
872,Clayton Kershaw,Los Angeles Dodgers,NL,SP,31000000
873,Zack Greinke,Arizona Diamondbacks,NL,SP,32421884
874,Mike Trout,Los Angeles Angels,AL,CF,34083333
875,Stephen Strasburg,Washington Nationals,NL,SP,36428571
876,Max Scherzer,Washington Nationals,NL,SP,42142857



You can also look at any number of rows by adding a number within the parentheses.

In [5]:
#display first 10 rows
df.head(10)

Unnamed: 0,Name,Team,League,Position,Salary
0,Ildemaro Vargas,Arizona Diamondbacks,NL,SS,555000
1,Richie Martin,Baltimore Orioles,AL,SS,555000
2,Drew Jackson,Baltimore Orioles,AL,SS,555000
3,Eric Stamets,Cleveland Indians,AL,SS,555000
4,Michael Kopech,Chicago White Sox,AL,SP,555000
5,Caleb Frare,Chicago White Sox,AL,RP,555000
6,Ryan Cordell,Chicago White Sox,AL,OF,555000
7,Reed Garrett,Detroit Tigers,AL,RP,555000
8,Gregory Soto,Detroit Tigers,AL,RP,555000
9,Dustin Peterson,Detroit Tigers,AL,OF,555000


If you want to see just one column, use this syntax:

- **df['column']**

In [6]:
# Display only the player name column
df['Name']

0        Ildemaro Vargas
1          Richie Martin
2           Drew Jackson
3           Eric Stamets
4         Michael Kopech
             ...        
872      Clayton Kershaw
873         Zack Greinke
874           Mike Trout
875    Stephen Strasburg
876         Max Scherzer
Name: Name, Length: 877, dtype: object
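
Selecting one column returns a pandas Series (a single labeled column) rather than a dataframe, which is why the output above looks different from the tables. A minimal sketch, using a hand-built sample rather than the full dataset:

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Mike Trout', 'Max Scherzer'],
    'Team': ['Los Angeles Angels', 'Washington Nationals'],
})

# Single brackets with one column name give back a Series.
names = sample['Name']

print(type(names).__name__)  # Series
print(names.tolist())        # the values as a plain Python list
```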

You can also print the entire dataset to see the beginning and end.

In [7]:
#print out the entire dataframe
print(df)

                  Name                  Team League Position    Salary
0      Ildemaro Vargas  Arizona Diamondbacks     NL       SS    555000
1        Richie Martin     Baltimore Orioles     AL       SS    555000
2         Drew Jackson     Baltimore Orioles     AL       SS    555000
3         Eric Stamets     Cleveland Indians     AL       SS    555000
4       Michael Kopech     Chicago White Sox     AL       SP    555000
..                 ...                   ...    ...      ...       ...
872    Clayton Kershaw   Los Angeles Dodgers     NL       SP  31000000
873       Zack Greinke  Arizona Diamondbacks     NL       SP  32421884
874         Mike Trout    Los Angeles Angels     AL       CF  34083333
875  Stephen Strasburg  Washington Nationals     NL       SP  36428571
876       Max Scherzer  Washington Nationals     NL       SP  42142857

[877 rows x 5 columns]


You can also get some basic summaries very quickly with the `shape` attribute (no parentheses) and the `describe` method.

In [12]:
# Find how many rows and columns in the dataset
df.shape

(877, 5)

In [13]:
# Summary statistics for numeric columns
df.describe()

Unnamed: 0,Salary
count,877.0
mean,4509878.0
std,6334236.0
min,555000.0
25%,567500.0
50%,1400000.0
75%,6000000.0
max,42142860.0


In [16]:
#add this formatting snippet to turn off scientific notation

df.describe().apply(lambda s: s.apply('{0:.2f}'.format))

Unnamed: 0,Salary
count,877.0
mean,4509877.57
std,6334235.74
min,555000.0
25%,567500.0
50%,1400000.0
75%,6000000.0
max,42142857.0
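
To see what that snippet does: `describe()` returns a dataframe of numbers, and the nested `apply` calls turn every value into a string with two decimal places. A small sketch on three salaries taken from the dataset (the names `sample`, `summary` and `formatted` are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({'Salary': [555000, 1400000, 42142857]})

# describe() gives numeric summary statistics.
summary = sample.describe()

# The outer apply visits each column; the inner apply formats each value.
formatted = summary.apply(lambda s: s.apply('{0:.2f}'.format))

print(formatted)
print(formatted.loc['max', 'Salary'])  # 42142857.00
```

Because the formatted values are strings, this is best used for display only, not for further calculations.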


## 3. Analyzing the dataset

Now that we've got a feel for what's in the dataset, let's do some analysis, including sorting, filtering and functions.

### Sorting

Let's find who makes the highest salary (and the lowest) by sorting this dataset. We will use the **sort_values** method with two parameters: **by** (to tell Pandas which column to sort by) and **ascending** (to tell Pandas which order to sort in).

In [19]:
#sort by salaries in descending order (largest salary at the top)
df.sort_values(by='Salary', ascending=False)


Unnamed: 0,Name,Team,League,Position,Salary
876,Max Scherzer,Washington Nationals,NL,SP,42142857
875,Stephen Strasburg,Washington Nationals,NL,SP,36428571
874,Mike Trout,Los Angeles Angels,AL,CF,34083333
873,Zack Greinke,Arizona Diamondbacks,NL,SP,32421884
872,Clayton Kershaw,Los Angeles Dodgers,NL,SP,31000000
...,...,...,...,...,...
27,Chris Paddack,San Diego Padres,NL,SP,555000
26,Ben Heller,New York Yankees,AL,RP,555000
25,Troy Tulowitzki,New York Yankees,AL,SS,555000
24,Tim Peterson,New York Mets,NL,RP,555000


In [20]:
#sort by salaries in ascending order (smallest salary at the top)
df.sort_values(by='Salary', ascending=True)

Unnamed: 0,Name,Team,League,Position,Salary
0,Ildemaro Vargas,Arizona Diamondbacks,NL,SS,555000
23,Pete Alonso,New York Mets,NL,1B,555000
24,Tim Peterson,New York Mets,NL,RP,555000
25,Troy Tulowitzki,New York Yankees,AL,SS,555000
27,Chris Paddack,San Diego Padres,NL,SP,555000
...,...,...,...,...,...
872,Clayton Kershaw,Los Angeles Dodgers,NL,SP,31000000
873,Zack Greinke,Arizona Diamondbacks,NL,SP,32421884
874,Mike Trout,Los Angeles Angels,AL,CF,34083333
875,Stephen Strasburg,Washington Nationals,NL,SP,36428571
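
Note that `sort_values` returns a new, sorted dataframe and leaves the original unchanged. A minimal sketch on a hand-built sample (rows copied from the dataset), combining the sort with `head()` to pull out the top earners:

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Ildemaro Vargas', 'Mike Trout', 'Max Scherzer', 'Zack Greinke'],
    'Salary': [555000, 34083333, 42142857, 32421884],
})

# Sort from largest to smallest salary, then keep the top two rows.
top_two = sample.sort_values(by='Salary', ascending=False).head(2)
print(top_two)

# The original sample keeps its original row order.
print(sample['Name'].tolist())
```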


### Filtering

Filtering is the way to find a subset of the dataframe that matches a specific criterion or set of criteria. Think of it as searching your data: the filter returns the rows that match. You can search text or numerical columns, and you can search one column or several at once.

To filter in Pandas, you'll need several pieces:

- Column: The column you want to search in.
- Comparison operator: equal to (==), not equal to (!=), greater than (>), less than (<), greater than or equal to (>=), and less than or equal to (<=)
- Criteria: The information you want to search by.

The filter generally follows this pattern:

- **df[df['column'] == 'value']**
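
It helps to read that pattern from the inside out. The inner expression produces a column of True/False values, one per row, and the outer `df[...]` keeps only the rows marked True. A minimal sketch with a hand-built sample (the name `is_royal` is just for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Kyle Zimmer', 'Mike Trout', 'Salvador Perez'],
    'Team': ['Kansas City Royals', 'Los Angeles Angels', 'Kansas City Royals'],
})

# Step 1: the comparison yields True/False for each row.
is_royal = sample['Team'] == 'Kansas City Royals'
print(is_royal.tolist())  # [True, False, True]

# Step 2: passing that to the dataframe keeps only the True rows.
print(sample[is_royal])
```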

#### Filtering by numbers

Do not put quotes around the numbers you're filtering by.

In [22]:
#Find players who make the minimum wage

df[df['Salary'] == 555000]

Unnamed: 0,Name,Team,League,Position,Salary
0,Ildemaro Vargas,Arizona Diamondbacks,NL,SS,555000
1,Richie Martin,Baltimore Orioles,AL,SS,555000
2,Drew Jackson,Baltimore Orioles,AL,SS,555000
3,Eric Stamets,Cleveland Indians,AL,SS,555000
4,Michael Kopech,Chicago White Sox,AL,SP,555000
5,Caleb Frare,Chicago White Sox,AL,RP,555000
6,Ryan Cordell,Chicago White Sox,AL,OF,555000
7,Reed Garrett,Detroit Tigers,AL,RP,555000
8,Gregory Soto,Detroit Tigers,AL,RP,555000
9,Dustin Peterson,Detroit Tigers,AL,OF,555000


In [23]:
#Find players who make more than $10 million
df[df['Salary'] > 10000000]

Unnamed: 0,Name,Team,League,Position,Salary
754,Jose Quintana,Chicago Cubs,NL,SP,10500000
755,Brandon Morrow,Chicago Cubs,NL,RP,10500000
756,Carlos Carrasco,Cleveland Indians,AL,SP,10500000
757,Francisco Lindor,Cleveland Indians,AL,SS,10550000
758,Starling Marte,Pittsburgh Pirates,NL,CF,10666666
...,...,...,...,...,...
872,Clayton Kershaw,Los Angeles Dodgers,NL,SP,31000000
873,Zack Greinke,Arizona Diamondbacks,NL,SP,32421884
874,Mike Trout,Los Angeles Angels,AL,CF,34083333
875,Stephen Strasburg,Washington Nationals,NL,SP,36428571


#### Filtering by text

Put quotes around the text value you're filtering by.

In [24]:
#find all the Kansas City Royals players
df[df['Team'] == 'Kansas City Royals']


Unnamed: 0,Name,Team,League,Position,Salary
10,Kyle Zimmer,Kansas City Royals,AL,RP,555000
11,Chris Ellis,Kansas City Royals,AL,RP,555000
12,Frank Schwindel,Kansas City Royals,AL,1B,555000
43,Trevor Oaks,Kansas City Royals,AL,RP,555350
71,Cam Gallagher,Kansas City Royals,AL,C,557125
87,Ryan O'Hearn,Kansas City Royals,AL,1B,557650
159,Jorge Lopez,Kansas City Royals,AL,SP,562250
214,Hunter Dozier,Kansas City Royals,AL,3B,567225
225,Eric Skoglund,Kansas City Royals,AL,SP,568175
264,Tim Hill,Kansas City Royals,AL,RP,573175


#### Filtering by multiple columns

Now let's put these together and search for Kansas City Royals players who make more than $1 million.

To filter by multiple columns, wrap each condition in its own parentheses and combine the conditions with a logical operator. AND and OR go between two conditions; NOT goes in front of a condition to reverse it.

- AND: &
- OR: |
- NOT: ~
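
When conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than comparisons such as `>` and `==`. Naming each condition first, as in this sketch, sidesteps that. The sample rows are copied from the dataset; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Kyle Zimmer', 'Salvador Perez', 'Mike Trout', 'Richie Martin'],
    'Team': ['Kansas City Royals', 'Kansas City Royals',
             'Los Angeles Angels', 'Baltimore Orioles'],
    'Salary': [555000, 11200000, 34083333, 555000],
})

# Each condition is a column of True/False values.
royals = sample['Team'] == 'Kansas City Royals'
over_10m = sample['Salary'] > 10000000

both = sample[royals & over_10m]    # AND: Royals who make over $10 million
either = sample[royals | over_10m]  # OR: Royals, or anyone over $10 million
not_royals = sample[~royals]        # NOT: everyone except the Royals

print(both['Name'].tolist())
print(either['Name'].tolist())
print(not_royals['Name'].tolist())
```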


In [27]:
#Filter by Kansas City Royals and salary of more than 1 million

df[(df['Salary'] > 1000000) & (df['Team'] == 'Kansas City Royals')]

Unnamed: 0,Name,Team,League,Position,Salary
410,Whit Merrifield,Kansas City Royals,AL,2B,1187500
415,Lucas Duda,Kansas City Royals,AL,1B,1250000
501,Brad Boxberger,Kansas City Royals,AL,RP,2200000
518,Martin Maldonado,Kansas City Royals,AL,C,2500000
534,Jake Diekman,Kansas City Royals,AL,RP,2750000
548,Chris Owings,Kansas City Royals,AL,SS,3000000
561,Wily Peralta,Kansas City Royals,AL,RP,3250000
617,Jorge Soler,Kansas City Royals,AL,RF,4666666
640,Billy Hamilton,Kansas City Royals,AL,CF,5250000
764,Salvador Perez,Kansas City Royals,AL,C,11200000


#### Creating a new dataframe from results

Often, we'll want to save the results of a filter or other operation as a new dataframe so we can do more analysis on it. To do this, choose a new variable name and assign the result to it.

In [28]:
#new dataframe called "royals" and assign it to the filtered original dataframe
royals = df[df['Team'] == 'Kansas City Royals']


In [29]:
royals.head()

Unnamed: 0,Name,Team,League,Position,Salary
10,Kyle Zimmer,Kansas City Royals,AL,RP,555000
11,Chris Ellis,Kansas City Royals,AL,RP,555000
12,Frank Schwindel,Kansas City Royals,AL,1B,555000
43,Trevor Oaks,Kansas City Royals,AL,RP,555350
71,Cam Gallagher,Kansas City Royals,AL,C,557125


In [30]:
royals.shape

(31, 5)
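
Once the filtered rows have a name, every method from earlier in the lesson works on them. A minimal sketch on a hand-built sample (rows copied from the dataset; `royals_sample` is an illustrative name, separate from the `royals` dataframe above):

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Kyle Zimmer', 'Salvador Perez', 'Mike Trout', 'Whit Merrifield'],
    'Team': ['Kansas City Royals', 'Kansas City Royals',
             'Los Angeles Angels', 'Kansas City Royals'],
    'Salary': [555000, 11200000, 34083333, 1187500],
})

# Save the filtered rows under a new name.
royals_sample = sample[sample['Team'] == 'Kansas City Royals']

# The new dataframe can be explored and sorted like the original.
print(royals_sample.shape)
highest_paid = royals_sample.sort_values(by='Salary', ascending=False).head(1)
print(highest_paid)
```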

#### On your own:

Find all of the players who make more than $20 million.

In [None]:
#Filter by salary of more than 20000000

