# A Brief Introduction to Pandas

## Module Overview

[`pandas`](https://pandas.pydata.org) is an essential package for anyone doing data analysis in Python.  It provides fundamental data structures ([Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) and [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)) that make working with data much easier and more intuitive.  Pandas helps you to focus on the data analysis problems you're trying to solve, rather than the coding mechanics of how to solve them.

The `pandas` package is deep and there's lots to learn (more than we can cover in this workshop!), but some of the main features it provides are:

* functions for reading data into Python
* data structures for storing, transforming and manipulating data
* functions for subsetting data tables
* functions for reshaping data
* functions for performing grouped data operations
* functions for making basic plots (using [`matplotlib`](https://matplotlib.org) under the covers)

In this module of the workshop, we'll cover 4 main aspects of `pandas` which represent some of the most frequently encountered data analysis tasks:

1. Introduction to working with tabular data (`DataFrames`) using `pandas`
2. How to subset `DataFrames` to focus in on the data you want
3. The basics of reshaping and combining data
4. Basic data visualization using `pandas`

## In this section, we'll cover

* What is tabular data and why is it important for doing data analysis?
* How `pandas` stores tabular data using `Series` and `DataFrames`
* Basic operations on `Series` and `DataFrames`
* How to read and write data using `pandas`

### What is Tabular Data and Why is it Important?

Tabular data is simply a rectangular, 2-dimensional table of data.  You've likely seen or worked with tabular data in many different contexts:

* Excel spreadsheets
* Data you've collected or entered yourself (e.g. in a notebook, or spreadsheet)
* In textbooks, presentations, journal articles
* As comma-separated values (.csv) files
* As a database table

Tabular data is one of the most widely used ways of organizing data.  While not all data can be represented as a 2-dimensional table, lots (most?) of the data you'll typically encounter can be, or at least can be transformed into this structure.

In `pandas`, tabular data is stored in a format (a data structure) known as a `DataFrame`.  A `DataFrame` looks just like a typical data table (image from pandas.pydata.org):

![](https://pandas.pydata.org/docs/_images/01_table_dataframe1.svg)

As the image shows, a `DataFrame` is a rectangular table of data with rows and columns that intersect to give individual cells that store data values.  The rows and columns provide important structure: data along a given row or a given column are (typically) related in some way.

Often, columns represent *variables* or specific attributes that are measured, while rows represent *observations* or the things that the measurements are being made on.  Consider the following data table:

![]()

In this example, a hypothetical teacher is collecting information about their students: name, major, graduation year, and test scores.  Each of these types of data is represented as a **column** of the data table, and each **row** represents an individual student.  Imagine if you were to randomly scatter the data values across the cells of the table -- you'd still have the exact same data, but in a format that's completely useless to work with.  A well-structured data table is an essential part of the data analysis process.

### DataFrame Basics

A `DataFrame` is a specific data structure provided by the `pandas` package that's used to store tabular data.  One way to create a `DataFrame` is to make one yourself (in code):

In [1]:
# This is how you import the pandas package for
# use in your own scripts; pd is the community accepted
# alias for the pandas package -- you should use it as well
import pandas as pd

# We'll create a DataFrame from a dictionary:
# each key is a column name, each list holds that column's values
dataDict = {"Name": ["John", "Mary", "Sue"],
            "Age": [23, 26, 19],
            "Major": ["Chemistry", "Economics", "Mathematics"]}

myDF = pd.DataFrame(dataDict)

myDF

Unnamed: 0,Name,Age,Major
0,John,23,Chemistry
1,Mary,26,Economics
2,Sue,19,Mathematics
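
A dictionary of columns is not the only way to build a table by hand. As a minimal sketch (the `records` and `df_from_rows` names are purely illustrative), the same data can also be supplied row by row as a list of dictionaries:

```python
import pandas as pd

# Each dictionary is one row (an observation); the keys become column names
records = [
    {"Name": "John", "Age": 23, "Major": "Chemistry"},
    {"Name": "Mary", "Age": 26, "Major": "Economics"},
    {"Name": "Sue", "Age": 19, "Major": "Mathematics"},
]

df_from_rows = pd.DataFrame(records)

# Same shape and column names as the dictionary-of-lists version
print(df_from_rows.shape)
print(list(df_from_rows.columns))
```

Which form is more convenient usually depends on how the data arrives: column by column, or one record at a time.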


More typically though, you'll want to read data in from another source, for example an Excel spreadsheet or a delimited text file (e.g. a .csv file).  `pandas` provides functions for reading these types of files, and lots more.

In [2]:
# Read a .csv file with information about penguins using read_csv
# https://github.com/allisonhorst/palmerpenguins

# you need to supply the path to the csv file
# this path might be a path to a file on your computer
# or can even be a web URL (that links directly to a raw .csv file)
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

# in a Jupyter notebook, just type the name of the DataFrame variable
# as the last command in the cell to see a nice display of the table
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
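
Reading is only half of the story; a `DataFrame` can also be written back out. The sketch below uses a tiny made-up CSV held in memory (the values are illustrative, not the real penguins file) so that it runs without a network connection or a file on disk. It assumes only `pd.read_csv` and `DataFrame.to_csv`, both of which accept file-like objects as well as paths:

```python
import io
import pandas as pd

# A tiny made-up CSV; the empty field in row 2 is a missing value
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Torgersen,
Gentoo,Biscoe,5000
"""

# read_csv accepts file paths, URLs, and file-like objects such as StringIO
small = pd.read_csv(io.StringIO(csv_text))

# Write the table back out as CSV text; index=False leaves out the row labels
buffer = io.StringIO()
small.to_csv(buffer, index=False)

# Reading the written text back in should reproduce the same table
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(small)
print(round_trip.equals(small))
```

To write to an actual file, pass a path (for example a hypothetical `"small.csv"`) to `to_csv` instead of the buffer.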


Once you have a DataFrame, there are several basic operations you can perform on it to better understand what's there:

In [3]:
# Check that penguins is in fact a data frame
type(penguins)

pandas.core.frame.DataFrame

In [4]:
# Show the first 5 rows (how could you show the first 10?)
penguins.head(5)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [5]:
# Show the last 5 rows
penguins.tail(5)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009
343,Chinstrap,Dream,50.2,18.7,198.0,3775.0,female,2009


In [6]:
# Get the number of rows and columns (you can index this to get one or the other)
penguins.shape

(344, 8)
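
Because `shape` is an ordinary Python tuple, the row and column counts can be pulled out by position. A short sketch on a small made-up table (the `df` here is hypothetical, standing in for a larger data set):

```python
import pandas as pd

# Small made-up table standing in for a larger data set
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})

# shape is a plain tuple: (number of rows, number of columns)
n_rows = df.shape[0]
n_cols = df.shape[1]

# Tuple unpacking gets both at once; len() also gives the row count
rows, cols = df.shape
print(n_rows, n_cols, len(df))
```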

In [22]:
# Get the column names
penguins.columns

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')

In [23]:
# Get the column names as a list
list(penguins.columns)

['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex',
 'year']

In [7]:
# What are the data types of the columns
# Important -- these are pandas/NumPy dtypes, not built-in Python types
# (object usually means the column holds strings)
penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
year                   int64
dtype: object
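
Data types matter because they determine which operations make sense on a column. As a hedged sketch on a made-up table (column names borrowed from the penguins data, values invented), here is how you might inspect one column's type, keep only the numeric columns, and convert a column to another type:

```python
import pandas as pd

# Made-up rows using a few of the penguins column names
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "body_mass_g": [3750.0, 5000.0],
    "year": [2007, 2008],
})

# Look up the dtype of a single column
print(df["year"].dtype)

# Keep only the numeric columns (drops the text column)
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))

# Convert a column to a different dtype; the original column is unchanged
year_as_float = df["year"].astype("float64")
print(year_as_float.dtype)
```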

In [8]:
# General info about the DataFrame
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
species              344 non-null object
island               344 non-null object
bill_length_mm       342 non-null float64
bill_depth_mm        342 non-null float64
flipper_length_mm    342 non-null float64
body_mass_g          342 non-null float64
sex                  333 non-null object
year                 344 non-null int64
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
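
The non-null counts reported by `info()` are a quick way to spot missing values: here only 342 of 344 rows have bill measurements, and only 333 have a recorded sex. A minimal sketch of counting missing values directly, using a small made-up table rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up table with a few deliberately missing entries
df = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["male", None, None],
    "year": [2007, 2007, 2007],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# count() gives the non-missing entries per column, like the info() listing
non_missing = df.count()
print(non_missing)
```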


One important concept is that a `DataFrame` is made up of individual `Series` objects.  Whereas a `DataFrame` is a 2-dimensional data structure, a `Series` is a 1-dimensional data structure.  Each column in a `DataFrame` is a `Series`. `pandas` provides lots of different functions for working with both `DataFrames` and `Series`.

In [9]:
# You can get any column of data by indexing with its column name
species = penguins["species"]

species

0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
5         Adelie
6         Adelie
7         Adelie
8         Adelie
9         Adelie
10        Adelie
11        Adelie
12        Adelie
13        Adelie
14        Adelie
15        Adelie
16        Adelie
17        Adelie
18        Adelie
19        Adelie
20        Adelie
21        Adelie
22        Adelie
23        Adelie
24        Adelie
25        Adelie
26        Adelie
27        Adelie
28        Adelie
29        Adelie
         ...    
314    Chinstrap
315    Chinstrap
316    Chinstrap
317    Chinstrap
318    Chinstrap
319    Chinstrap
320    Chinstrap
321    Chinstrap
322    Chinstrap
323    Chinstrap
324    Chinstrap
325    Chinstrap
326    Chinstrap
327    Chinstrap
328    Chinstrap
329    Chinstrap
330    Chinstrap
331    Chinstrap
332    Chinstrap
333    Chinstrap
334    Chinstrap
335    Chinstrap
336    Chinstrap
337    Chinstrap
338    Chinstrap
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object

In [10]:
# This should be a series
type(species)

pandas.core.series.Series
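
The difference between a 1-dimensional `Series` and a 2-dimensional `DataFrame` shows up in how you index. A short sketch on a small hypothetical table: single brackets with one column name give a `Series`, while a list of names (double brackets) gives a `DataFrame`, even when the list has only one entry.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Mary", "Sue"], "Age": [23, 26, 19]})

# Single brackets with one name return a 1-dimensional Series
ages = df["Age"]
print(type(ages).__name__, ages.shape)

# Double brackets (a list of names) return a 2-dimensional DataFrame
ages_table = df[["Age"]]
print(type(ages_table).__name__, ages_table.shape)

# The Series carries the same row labels (index) as the DataFrame
print(ages.index.equals(df.index))
```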

You can operate on a `Series` using a variety of methods:

In [11]:
penguins["body_mass_g"].max()

6300.0

In [12]:
penguins["body_mass_g"].min()

2700.0

In [13]:
penguins["body_mass_g"].mean()

4201.754385964912

In [14]:
penguins["body_mass_g"].describe()

count     342.000000
mean     4201.754386
std       801.954536
min      2700.000000
25%      3550.000000
50%      4050.000000
75%      4750.000000
max      6300.000000
Name: body_mass_g, dtype: float64
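
Note that `describe()` reports a count of 342 even though the table has 344 rows: summary methods skip missing values by default. The sketch below illustrates this on a small made-up `Series` (values invented for the example), along with element-wise arithmetic and `value_counts()` for text data:

```python
import numpy as np
import pandas as pd

# Made-up body masses with one missing value
mass = pd.Series([3750.0, 3800.0, np.nan, 3450.0], name="body_mass_g")

# Summary methods ignore missing values by default
print(mass.count())
print(mass.mean())

# Arithmetic applies to every element; missing values stay missing
mass_kg = mass / 1000
print(mass_kg)

# value_counts() tallies how often each distinct value occurs
species = pd.Series(["Adelie", "Adelie", "Chinstrap"])
species_counts = species.value_counts()
print(species_counts)
```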

## Main Points

* Tabular data is data arranged in a 2-dimensional rectangular table
* Lots of data sets can be naturally represented in a tabular format
* A tabular data set has rows and columns that intersect at cells that hold the underlying data
* The `pandas` Python package provides a `DataFrame` data structure for storing tabular data
* You can create `DataFrames` explicitly using code, or more likely, by reading data into Python using one of the `pandas` `read_*` functions.
* A `pandas` `DataFrame` is represented as a collection of 1-dimensional `Series`, one for each column in the table.
* One of the easiest ways to access and operate on the columns of a `DataFrame` is by using the column names