pandas-intro

This repo contains material for an introductory workshop on the Python package pandas.

Who am I?

Data Scientist with the Data Science Institute
I like to be interrupted with questions! Please jump right in.

Welcome to the UVA Library

Getting Pandas

Mothership: https://pandas.pydata.org/

conda install pandas
python3 -m pip install --upgrade pandas

To confirm installation:

open Python and type
```
import pandas as pd
```
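
If the import runs without an error, pandas is installed. As an optional extra check (a suggestion, not part of the original workshop steps), you can also print the installed version:

```python
import pandas as pd

# __version__ is a string such as "0.24.1"; the exact value depends on your install
installed_version = pd.__version__
print(installed_version)
```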

Other Resources

Python for Data Analysis by McKinney - get from library
R for Data Science by Wickham & Grolemund - free online

Goals for Today

Get pandas working on your machine
Get comfortable with pandas (know what it's all about)
Learn how to look up help

Outline

Read data into pandas
Manipulate data with pandas
Plot/Aggregate/Summarize data with pandas (... and beyond)

A brief history

Designed by Wes McKinney - launched in 2008
version 0.24.1

The Data Frame

The main tool in pandas is the data frame. We use it to hold data, and we perform operations on it to prepare the data for our end goal. Depending on where you want to go, some end functions are also built into the data frame (e.g., making a histogram). However ...

Often pandas data frames are confusing to new users. Almost always that is not the fault of the new user.

Quick survey: Who knows R?

"Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure." official documentation

The catch is that pandas data frames have a lot of capability and a lot of features, but you don't need all of that while learning. So today we'll take a different approach and focus on the basics.

Working definition of Data Frame:

 An allocation of computer memory that holds data. The format works like a spreadsheet.

What a data frame "looks like"
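
As a rough illustration, here is a tiny data frame built by hand. The column names and values are made up for this sketch and are not taken from the workshop data.

```python
import pandas as pd

# hypothetical data: each dictionary key becomes a column,
# and each list holds the values for that column
toy = pd.DataFrame({
    "Year": [2001, 2002, 2003],
    "Team": ["AAA", "BBB", "AAA"],
    "Hits": [10, 25, 17],
})

print(toy)        # rows and columns, like a spreadsheet
print(toy.shape)  # (number of rows, number of columns)
```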

Reading in Data

Our first step is to read data into a pandas data frame. The data may come from many different sources (e.g., a file in storage, an object in memory, the internet). Here we will look at the example of reading from a file (.csv).

read from csv files
- docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
  
  don't worry if this looks intimidating, we'll break it down

Let's write a program

# load pandas package
import pandas as pd

# read data into data frame
df = pd.read_csv("andre.csv")
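
To try read_csv without a file on disk, the sketch below feeds it a small block of CSV text through io.StringIO. The text is invented for illustration. Its column names only mimic the ones used later in this workshop, and the real andre.csv has more columns and rows.

```python
import io
import pandas as pd

# made-up CSV text standing in for a file; the first line is the header row
csv_text = """Year,Tm,G,AB,H
1990,AAA,100,400,100
1991,AAA,120,450,135
1992,BBB,90,300,60
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))
print(demo)
```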

Manipulating Data

Taking a look at the data

Your data in a data frame is stored in memory. But as humans we like to see a snippet on the screen. Let's write a program to do just that.

# look at beginning of file
df.head()

# look at end of file
df.tail()

# look at random sample
n = 5
df.sample(n)

Notice how the whole frame doesn't display. If you want to see the column names you can do this:

list(df)
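
A few other attributes are handy for a first look. The sketch below uses a small made-up data frame (not the workshop data) so that it can be run on its own.

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "Tm": ["AAA", "AAA", "BBB"],
    "H": [100, 135, 60],
})

column_names = list(demo.columns)  # same result as list(demo)
n_rows, n_cols = demo.shape        # size of the data frame
print(column_names)
print(n_rows, n_cols)
print(demo.dtypes)                 # the type pandas inferred for each column
```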

Sort

In this world we call sorting the data set on a column "arranging". To write a program to do that we use the function 'sort_values'.

documentation

df.sort_values(by='G', ascending=False)
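
Note that sort_values returns a new, sorted data frame and leaves the original untouched unless you assign the result back. A minimal sketch with made-up values:

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"Year": [1990, 1991, 1992], "G": [100, 120, 90]})

# sort from most games to fewest; the result is a new data frame
by_games = demo.sort_values(by="G", ascending=False)

print(by_games)
print(demo)  # still in the original order
```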

Remove rows (with condition)

Sometimes we want to remove some of the data. We call that filtering. Let's try writing code that keeps only the years after 1976 in this data frame.

df = df[ df.Year > 1976 ]

Code Breakdown:

df is a data frame object
the " [ ... ] " lets us reference or "index" the data frame
df.Year > 1976 is a boolean condition that is evaluated for every row; only rows where it is True are kept
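
It can help to build the filter in two steps so the intermediate True/False values are visible. A sketch with made-up values:

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"Year": [1975, 1976, 1977, 1978], "H": [10, 20, 30, 40]})

# step 1: one True/False value per row
keep = demo.Year > 1976
print(keep)

# step 2: keep only the rows where the value is True
later_years = demo[keep]
print(later_years)
```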

Create or Remove Columns

Once you have worked the data set you sometimes want to select a subset of the columns going forward. Again we use the reference tool and provide a list of columns to keep.
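```
  # df_small is a new name chosen here so that df keeps the other
  # columns (AB, BA, Tm) that the later steps still use
  df_small = df[ ['H','G'] ]
```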
To create columns we can get a little fancier and put calculations right into the new columns. This is sometimes called "mutate" (aka create new columns). Let's compute our own batting average.
```
  df['myBA'] = df.H / df.AB
  df[['myBA','BA']]
```
To create the new column we just acted like it existed and pandas created it for us on the fly.
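
The section title also mentions removing columns. One way to do that, shown here as a sketch on made-up data, is the drop method with the columns argument.

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"AB": [400, 450], "H": [100, 135]})

# create a column from a calculation on existing columns
demo["myBA"] = demo.H / demo.AB

# remove a column; drop returns a new data frame and leaves demo unchanged
without_ab = demo.drop(columns=["AB"])
print(without_ab)
```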

Aggregation / summarization

To quickly summarize the common statistics of the dataset use describe
```
df.describe()
```
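
For numeric columns, describe reports the count, mean, standard deviation, minimum, quartiles, and maximum. A sketch on made-up data:

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"H": [100, 135, 60], "AB": [400, 450, 300]})

summary = demo.describe()        # one column of statistics per numeric column
print(summary)
print(summary.loc["mean", "H"])  # pull out a single statistic
```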
There are also a number of pandas functions for common statistics
- df.mean(numeric_only=True) (numeric_only skips text columns such as Tm, which recent pandas versions otherwise reject)
- << google practice time >>
pandas also facilitates group operations. To explain let's look at a picture. -- show picture on your desktop from the pandas book because it is not open source.
- let's group Andre's data frame by the team he played for, then take the mean

# group the rows by team
dfg = df.groupby('Tm')

# mean of H within each team
dfg['H'].mean()
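
The idea behind a group operation is often described as split, apply, combine: split the rows into groups, apply a function to each group, and combine the results. A sketch on made-up data, with column names that mimic the workshop file:

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({
    "Tm": ["AAA", "AAA", "BBB"],
    "H": [100, 140, 60],
})

# split rows by team, take the mean of H within each team,
# and combine the per-team results into one Series indexed by team
mean_hits = demo.groupby("Tm")["H"].mean()
print(mean_hits)
```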

Making plots

Let's make a histogram from our sample dataset. Since we are using a data frame, this feature is already built in.

# make a histogram
df.hist(column="H", bins=100)

Ways to Practice

Write some code
Ask a friend to review it

Beginning: Load a csv file into a data frame. Clean it up for statistical analysis.
Intermediate: Load an Excel spreadsheet into a data frame.
Expert: Go nuts. Read the McKinney book and try out some cool stuff.
