alonzi/pandas-intro

pandas-intro

This repo contains material for an introductory workshop to the python package pandas.

Who am I?

Welcome to the UVA Library

Getting Pandas

Mothership: https://pandas.pydata.org/

  • conda install pandas
  • python3 -m pip install --upgrade pandas

To confirm installation:

  • open python and type
    import pandas as pd
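If the import succeeds without an error, pandas is installed. As a small extra check (a sketch, not part of the original workshop steps), you can also print the installed version:

```python
import pandas as pd

# Printing the version is a quick sanity check that the import worked
# and shows which pandas release you are running.
version = pd.__version__
print(version)
```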

Other Resources

Goals for Today

  1. Get pandas working on your machine
  2. Get comfortable with pandas (know what it's all about)
  3. Learn how to look up help

Outline

  1. Read data into pandas
  2. Manipulate data with pandas
  3. Plot/Aggregate/Summarize data with pandas (... and beyond)

A brief history

The Data Frame

The main tool in pandas is the data frame. We use it to hold data, and we perform operations on it to prepare the data for our end goal. Depending on where you want to go, some end functions are also built into the data frame (e.g. making a histogram). However ...

Often pandas data frames are confusing to new users. Almost always that is not the fault of the new user.

Quick survey: Who knows R?

  • "Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure." official documentation

The trick is pandas data frames have a lot of capability and a lot of features. But when learning you don't need that. So today we'll take a different approach and focus on the basics.

  • Working definition of Data Frame:
     An allocation of computer memory that holds data. The format works like a spreadsheet.
  • What a data frame "looks like" visualized
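To make the working definition concrete, here is a minimal sketch that builds a tiny data frame by hand. The column names and numbers are made up for illustration and are not the workshop dataset.

```python
import pandas as pd

# Made-up values, for illustration only (not the workshop data).
df = pd.DataFrame({
    "Year": [2001, 2002, 2003],
    "Tm": ["AAA", "AAA", "BBB"],
    "H": [100, 120, 140],
})

print(df)         # rows and columns, like a spreadsheet
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # each column has its own data type
```

Each column is a pandas Series, which is why the official definition calls the data frame a "dict-like container for Series objects".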

Reading in Data

Our first step is to read data into a pandas data frame. It may be coming from many different sources (e.g. a file in storage, an object in memory, the internet, etc.). Here we will look at the example of reading in from a file (.csv).
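A minimal sketch of reading CSV data with pd.read_csv. So that the example runs anywhere, it reads from an in-memory string instead of a file on disk; the column names and values are assumptions made up for illustration. With a real file you would pass its path instead, e.g. pd.read_csv("your_file.csv") (a hypothetical file name).

```python
import io

import pandas as pd

# Stand-in for a .csv file on disk; columns and values are made up.
csv_text = """Year,Tm,G,AB,H
2001,AAA,24,80,20
2002,AAA,140,500,150
2003,BBB,150,600,180
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```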

Manipulating Data

Taking a look at the data

Your data in a data frame is stored in memory. But as humans we like to see a snippet on the screen. Let's write a program to do just that.

# look at start of file
df.head()

# look at end of file
df.tail()

# look at random sample
n = 5
df.sample(n)
  • Notice how the whole frame doesn't display. If you want to see the column names you can do this:
list(df)
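
Two related ways to inspect the frame without printing all of it, sketched on made-up data (the values are for illustration only):

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"Year": [2001, 2002], "G": [24, 140], "H": [20, 150]})

column_names = df.columns.tolist()  # same names that list(df) returns
n_rows, n_cols = df.shape           # size of the frame
print(column_names)
print(n_rows, n_cols)
```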

Sort

In this world we call sorting the data set on a column "arranging". To write a program to do that we use the function 'sort_values'.

df.sort_values(by='G', ascending=False)
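
One detail worth seeing in action: sort_values returns a new, sorted data frame and leaves the original untouched unless you assign the result. A small sketch on made-up data:

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"Year": [2001, 2002, 2003], "G": [24, 140, 150]})

# The sorted result is a new data frame; df keeps its original order.
sorted_df = df.sort_values(by="G", ascending=False)
print(sorted_df)
print(df)
```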

Remove rows (with condition)

Sometimes we want to remove some of the data. We call that filtering. Let's try writing code to remove the years before 1976 from this data frame.

df = df[ df.Year >= 1976 ]

Code Breakdown:

  • df is a data frame object
  • the " [ ... ] " lets us reference or "index" the data frame
  • df.Year >= 1976 is a comparison evaluated for every row, giving one True/False value per row; only the True rows are kept
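
The boundary year is easy to get wrong, so here is a small sketch on made-up data showing the boolean mask itself and the difference between > and >=:

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"Year": [1975, 1976, 1977], "H": [10, 20, 30]})

# The comparison gives one True/False value per row.
mask = df["Year"] > 1976
print(mask)

after_1976 = df[df["Year"] > 1976]   # drops 1976 as well
from_1976 = df[df["Year"] >= 1976]   # keeps 1976
print(after_1976)
print(from_1976)
```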

Create or Remove Columns

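  • Once you have worked with the data set you sometimes want to keep only a subset of the columns going forward. Again we use the reference tool and provide a list of columns to keep. Note that this overwrites df, so reload the data before running later examples that use other columns (AB, BA, Tm).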

      df = df[ ['H','G'] ]
      
  • To create columns we can get a little fancier and put a calculation right into the new column (R users know this as mutate). Let's compute batting average.

      df['myBA']=df.H/df.AB
      df[['myBA','BA']]
      

    To create the new column we just acted like it existed and pandas created it for us on the fly.
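
A self-contained sketch of creating and then removing a column, on made-up data. The name without_ba is hypothetical, and drop is one common way to remove a column, not necessarily the approach used in the workshop.

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"AB": [100, 200, 400], "H": [25, 60, 100]})

# Use bracket notation when creating a column; the division is
# done row by row.
df["myBA"] = df["H"] / df["AB"]
print(df)

# drop returns a new data frame without the named column.
without_ba = df.drop(columns=["myBA"])
print(without_ba)
```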

Aggregation / summarization

  • To quickly summarize the common statistics of the dataset use describe

     df.describe() 
  • pandas also has built-in functions for common statistics such as mean, median, and std

    • df.mean(numeric_only=True) (numeric_only=True skips text columns such as Tm)
    • << google practice time >>
  • pandas also facilitates group operations. To explain let's look at a picture. -- show picture on your desktop from the pandas book because it is not open source.

    • let's group Andre's data frame by the team he played for, then take the mean of hits
dfg = df.groupby(df.Tm)
dfg['H'].mean()
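
The group operation can be sketched end to end on made-up data: split the rows by team, then compute statistics within each group. The team codes and hit counts below are assumptions for illustration only.

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({
    "Tm": ["AAA", "AAA", "BBB", "BBB"],
    "H": [100, 120, 150, 170],
})

# Select the column first, then aggregate within each group.
mean_hits = df.groupby("Tm")["H"].mean()
print(mean_hits)

# agg computes several statistics per group in one call.
summary = df.groupby("Tm")["H"].agg(["mean", "max", "count"])
print(summary)
```

Selecting the column before calling mean avoids trying to average text columns.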

Making plots

Let's make a histogram from our sample dataset. Since we are using a data frame, this feature is already built in.

# make a histogram
df.hist("H",bins=100)
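
df.hist draws the plot with matplotlib, which needs to be installed. To see what a histogram actually counts without drawing anything, here is a sketch that uses numpy.histogram as a stand-in on made-up data; it illustrates the binning idea and is not a description of what pandas runs internally.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"H": [10, 15, 20, 25, 100]})

# Split the range of H into 3 equal-width bins and count
# how many values fall into each one.
counts, edges = np.histogram(df["H"], bins=3)
print(edges)   # bin boundaries
print(counts)  # number of values per bin
```

More bins (such as the 100 used above) give a finer picture, but each bar then holds fewer values.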

Ways to Practice

  1. Write some code
  2. Ask a friend to review it
  • Beginning: Load a CSV file into a data frame. Clean it up for statistical analysis.
  • Intermediate: Load an Excel spreadsheet into a data frame.
  • Expert: Go nuts. Read the McKinney book and try out some cool stuff.
