This repo contains material for an introductory workshop on the Python package pandas.
- Data Scientist with the Data Science Institute
- I like to be interrupted with questions! Please jump right in.
Mothership: https://pandas.pydata.org/
- conda install pandas
- python3 -m pip install --upgrade pandas
To confirm installation:
- open python and type
import pandas as pd
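If the import runs without an error, pandas is installed. A slightly fuller check, as a small sketch, also prints which version you have (the variable name `installed_version` is just for illustration):

```python
import pandas as pd

# __version__ is a plain string; the exact value depends on your installation
installed_version = pd.__version__
print("pandas version:", installed_version)
```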
- Python for Data Analysis by McKinney - get from library
- R for Data Science by Wickham & Grolemund - free online
- Get pandas working on your machine
- Get comfortable with pandas (know what it's all about)
- Learn how to look up help
- Read data into pandas
- Manipulate data with pandas
- Plot/Aggregate/Summarize data with pandas (... and beyond)
- Designed by Wes McKinney - launched in 2008
- version 0.24.1
In pandas the main tool is the data frame. We use it to hold data, and we perform operations on it to prepare the data for our end goal. Depending on where you want to go, some end functions are also built into the data frame (e.g., making a histogram). However ...
Often pandas data frames are confusing to new users. Almost always that is not the fault of the new user.
- "Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure." official documentation
The catch is that pandas data frames have a lot of capability and a lot of features, and you don't need all of that when you are learning. So today we'll take a different approach and focus on the basics.
- Working definition of Data Frame:
An allocation of computer memory that holds data. The format works like a spreadsheet.
- What a data frame "looks like"
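To make the spreadsheet picture concrete, here is a minimal sketch that builds a tiny data frame by hand and prints it. The values and the team codes "AAA" and "BBB" are made up for illustration; they are not real statistics.

```python
import pandas as pd

# Made-up values purely for illustration (not real statistics)
df = pd.DataFrame(
    {
        "Year": [1976, 1977, 1978],
        "Tm": ["AAA", "AAA", "BBB"],
        "G": [20, 130, 150],
        "H": [15, 140, 160],
    }
)

print(df)        # rows are labeled 0, 1, 2; columns are labeled by name
print(df.shape)  # (number of rows, number of columns)
```

Each dictionary key becomes a column and each list position becomes a row, which is the spreadsheet-like layout described above.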
Our first step is to read data into a pandas data frame. The data may come from many different sources (e.g., a file in storage, an object in memory, the internet). Here we will look at the example of reading from a .csv file.
- read from csv files
- docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html (don't worry if this looks intimidating, we'll break it down)
- Let's write a program
# load pandas package
import pandas as pd
# read data into data frame
df = pd.read_csv("andre.csv")
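If you don't have the csv file handy, the same call can be tried on text held in memory. This is only a sketch: the csv contents below are made-up stand-in values, and `io.StringIO` is used here so the text behaves like an open file.

```python
import io

import pandas as pd

# Stand-in for the contents of a .csv file (made-up values, not real statistics)
csv_text = """Year,Tm,G,AB,H
1976,AAA,20,80,20
1977,AAA,130,500,140
1978,BBB,150,600,150
"""

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # column types are inferred from the text
```

The first line of the text becomes the column names, and each following line becomes one row.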
Your data in a data frame is stored in memory. But as humans we like to see a snippet on the screen. Let's write a program to do just that.
# look at start of file
df.head()
# look at end of file
df.tail()
# look at random sample
n = 5
df.sample(n)
- Notice how the whole frame doesn't display. If you want to see the column names you can do this:
list(df)
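Here is a self-contained sketch of those peeking commands on a made-up ten-row frame. The `random_state=0` argument is an addition for this example so the random sample is repeatable.

```python
import pandas as pd

# Made-up frame with 10 rows
df = pd.DataFrame({"Year": range(1976, 1986), "G": range(10, 110, 10)})

first_three = df.head(3)                     # first 3 rows (the default is 5)
last_three = df.tail(3)                      # last 3 rows
random_three = df.sample(3, random_state=0)  # 3 random rows, repeatable
column_names = list(df.columns)              # same names as list(df)

print(first_three)
print(last_three)
print(random_three)
print(column_names)
```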
In this world we call sorting the data set on a column "arranging". To write a program to do that we use the function 'sort_values'.
df.sort_values(by='G', ascending=False)
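One thing that trips people up: sort_values hands back a new, sorted data frame and leaves the original alone. A small sketch with made-up values (the name `df_sorted` is just for illustration):

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"Year": [1976, 1977, 1978], "G": [20, 150, 130]})

# sort_values returns a new data frame; df itself is unchanged
df_sorted = df.sort_values(by="G", ascending=False)

print(df_sorted)  # largest G first; row labels travel with their rows
print(df)         # still in the original order
```

If you want to keep the sorted version, assign it to a name as done here.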
Sometimes we want to remove some of the data. We call that filtering. Let's try writing code to remove the years up to and including 1976 from this data frame, keeping only the later ones.
df = df[ df.Year > 1976 ]
Code Breakdown:
- df is a data frame object
- the " [ ... ] " lets us reference or "index" the data frame
- df.Year > 1976 is a boolean condition that is evaluated for every row; only the rows where it is True are kept
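The breakdown above can be seen step by step in this sketch, which pulls the condition out into its own variable. The values are made up, and the names `mask` and `df_late` are just for illustration.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"Year": [1975, 1976, 1977, 1978], "H": [10, 20, 140, 160]})

mask = df.Year > 1976  # a Series of True/False, one value per row
print(mask.tolist())   # [False, False, True, True]

df_late = df[mask]     # keeps only the rows where mask is True
print(df_late)
```

Note that 1976 itself is dropped, because the comparison is strictly greater than.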
Once you have worked the data set you sometimes want to select a subset of the columns going forward. Again we use the reference tool and provide a list of columns to keep.
# keep the subset under a new name so the full df stays available for the later steps
df_hg = df[ ['H','G'] ]
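The double brackets matter here. As a sketch with made-up values: a list of names gives back a data frame, while a single name gives back one column (a Series).

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"Year": [1976, 1977], "G": [20, 130], "H": [15, 140]})

two_columns = df[["H", "G"]]  # list of names -> data frame, in the order listed
one_column = df["H"]          # single name -> Series

print(type(two_columns).__name__)  # DataFrame
print(type(one_column).__name__)   # Series
```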
Creating new columns is sometimes called "mutate". We can even get a little fancier and stick calculations right into those new columns. Let's compute our own batting average (hits divided by at-bats) and compare it with the BA column.
df['myBA'] = df.H / df.AB
df[['myBA','BA']]
To create the new column we just acted like it existed and pandas created it for us on the fly.
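A self-contained sketch of the same idea on made-up numbers shows that the division happens row by row:

```python
import pandas as pd

# Made-up values for illustration (not real statistics)
df = pd.DataFrame({"AB": [100, 500, 600], "H": [25, 150, 150]})

# Element-wise division: each row's H is divided by the same row's AB
df["myBA"] = df.H / df.AB

print(df)
```

The new column has to be created with the bracket form, df["myBA"] = ...; the dot form only works for reading columns that already exist.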
To quickly summarize the common statistics of the data set, use describe.
df.describe()
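The result of describe is itself a data frame, so you can pull single numbers out of it. A sketch with made-up values (the name `summary` is just for illustration):

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"G": [20, 130, 150], "H": [10, 140, 150]})

# One row per statistic (count, mean, std, min, quartiles, max), one column per numeric column
summary = df.describe()

print(summary)
print(summary.loc["mean", "H"])  # pick one statistic for one column
```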
pandas also has many built-in functions for common statistics (mean, median, sum, and so on).
- df.mean(numeric_only=True) (numeric_only=True skips text columns such as Tm)
- << google practice time >>
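As a starting point for that practice, here is a sketch of a few of these functions on a made-up frame. The team codes and numbers are invented for illustration.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame(
    {"Tm": ["AAA", "AAA", "BBB"], "G": [20, 130, 150], "H": [10, 140, 150]}
)

col_means = df.mean(numeric_only=True)  # one mean per numeric column; Tm is skipped
h_median = df["H"].median()             # statistics also work on a single column
g_total = df["G"].sum()
h_max = df["H"].max()

print(col_means)
print(h_median, g_total, h_max)
```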
pandas also facilitates group operations. To explain, let's look at a picture. (Presenter note: show the picture from the pandas book on your desktop; it is not included here because it is not open source.)
- Let's group Andre's data frame by the team he played for, then take the mean of H
dfg = df.groupby(df.Tm)
dfg['H'].mean()
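In place of the picture, here is a sketch of what a group operation does: split the rows by team, compute something per group, and combine the results. The team codes and hit counts are made up, and the loop is only there to show the "by hand" version of the one-liner.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"Tm": ["AAA", "AAA", "BBB", "BBB"], "H": [100, 140, 150, 170]})

# One-liner: split by Tm, take the mean of H in each group, combine into a Series
mean_hits = df.groupby("Tm")["H"].mean()
print(mean_hits)

# The same thing by hand: iterate over the groups and average each one
manual = {}
for team, rows in df.groupby("Tm"):
    manual[team] = rows["H"].mean()
print(manual)
```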
Let's make a histogram from our sample dataset. Since we are using a data frame this feature is already built in
# make a histogram
df.hist("H",bins=100)
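A histogram is just a count of how many values fall into each bin. To see those numbers without opening a plot window, here is a sketch that uses numpy's histogram function as a stand-in for the binning. That choice is an assumption for illustration, not a claim about what df.hist calls internally, and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"H": [10, 20, 25, 140, 150, 160, 170]})

# Split the range of H into 4 equal-width bins and count the values in each
counts, edges = np.histogram(df["H"], bins=4)

print(edges)   # bin boundaries from the smallest to the largest value
print(counts)  # number of rows that fall in each bin
```

With more bins (such as bins=100 above) each bar covers a narrower range, so the plot shows finer detail but noisier counts.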
- Write some code
- Ask a friend to review it
- Beginning: Load a csv file into a data frame. Clean it up for statistical analysis.
- Intermediate: Load an Excel spreadsheet into a data frame.
- Expert: Go nuts. Read the McKinney book and try out some cool stuff.