# Pandas cheat sheet

This notebook walks through some common data manipulations you might do while working in the popular Python data analysis library [`pandas`](https://pandas.pydata.org/). It assumes you're already set up to analyze data in pandas using Python 3.

(If you're _not_ set up, [here's IRE's guide](https://docs.google.com/document/d/1cYmpfZEZ8r-09Q6Go917cKVcQk_d0P61gm0q8DAdIdg/edit#) to setting up Python. [Hit me up](mailto:cody@ire.org) if you get stuck.)

### Topics
- [Importing pandas](#Importing-pandas)
- [Creating a dataframe from a CSV](#Creating-a-dataframe-from-a-CSV)
- [Checking out the data](#Checking-out-the-data)
- [Selecting columns of data](#Selecting-columns-of-data)
- [Getting unique values in a column](#Getting-unique-values-in-a-column)
- [Running basic summary stats](#Running-basic-summary-stats)
- [Sorting your data](#Sorting-your-data)
- [Filtering rows of data](#Filtering-rows-of-data)
- [Filtering against multiple values](#Filtering-against-multiple-values)
- [Exclusion filtering](#Exclusion-filtering)
- [Filtering text columns with string methods](#Filtering-text-columns-with-string-methods)
- [Multiple filters](#Multiple-filters)
- [Adding a calculated column](#Adding-a-calculated-column)
- [Filtering for nulls](#Filtering-for-nulls)
- [Grouping and aggregating data](#Grouping-and-aggregating-data)
- [Pivot tables](#Pivot-tables)
- [Applying a function across rows](#Applying-a-function-across-rows)
- [Joining data](#Joining-data)

### Importing pandas

Before we can use pandas, we need to import it. The most common way to do this is:

In [None]:
import pandas as pd

### Creating a dataframe from a CSV

To begin with, let's import a CSV of Major League Baseball player salaries on opening day. The file, which is in the same directory as this notebook, is called `mlb.csv`.

Pandas has a `read_csv()` method that we can use to get this data into a [dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) (it has methods to read other file types, too). At minimum, you need to tell this method where the file lives:

In [None]:
mlb = pd.read_csv('mlb.csv')

### Checking out the data

When you first load up your data, you'll want to get a sense of what's in there. A pandas dataframe has several useful methods and attributes to help you get a quick read of your data:

- `.head()`: Shows you the first 5 records in the dataframe (to see a different number of records, pass in a number)
- `.tail()`: Same as `head()`, but it pulls records from the end of the dataframe
- `.sample(n)` will give you a sample of *n* rows of the data -- just pass in a number
- `.info()` will give you a count of non-null values in each column -- useful for seeing if any columns have null values
- `.describe()` will compute summary stats for numeric columns
- `.columns` will list the column names
- `.dtypes` will list the data types of each column
- `.shape` will give you a pair of numbers: _(number of rows, number of columns)_

In [None]:
mlb.head()

In [None]:
mlb.tail()

In [None]:
mlb.sample(5)

In [None]:
mlb.info()

In [None]:
mlb.describe()

In [None]:
mlb.columns

In [None]:
mlb.dtypes

In [None]:
mlb.shape

To get the number of records in a dataframe, you can access the first item in the `shape` pair, or you can just use the Python function `len()`:

In [None]:
len(mlb)

### Selecting columns of data

If you need to select just one column of data, you can use "dot notation" (`mlb.SALARY`) as long as your column name doesn't have spaces and it isn't the name of a dataframe method (e.g., `product`). Otherwise, you can use "bracket notation" (`mlb['SALARY']`).

Selecting one column will return a [`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html).

If you want to select multiple columns of data, use bracket notation and pass in a _list_ of columns that you want to select. In Python, a list is a collection of items enclosed in square brackets, separated by commas: `['SALARY', 'NAME']`.

Selecting multiple columns will return a [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
# select one column of data
teams = mlb.TEAM

# bracket notation would do the same thing -- note the quotes around the column name
# teams = mlb['TEAM']

teams.head()

In [None]:
type(teams)

In [None]:
# select multiple columns of data
salaries_and_names = mlb[['SALARY', 'NAME']]

In [None]:
salaries_and_names.head()

In [None]:
type(salaries_and_names)
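
To see why bracket notation is the safer default, here's a minimal sketch using a small made-up dataframe. The `toy` dataframe and its column names are assumptions for illustration, not part of `mlb.csv`. One column name has a space in it and another shares its name with a dataframe method:

```python
import pandas as pd

# hypothetical toy data -- not the real mlb.csv
toy = pd.DataFrame({
    'NAME': ['Ann', 'Bo'],
    'BASE SALARY': [500000, 750000],   # column name contains a space
    'product': ['bats', 'gloves'],     # column name collides with a dataframe method
})

# bracket notation works for any column name
base_salaries = toy['BASE SALARY']
products = toy['product']

# dot notation hands back the dataframe's product() method, not the column
dot_product = toy.product

print(type(products))
print(callable(dot_product))
```

In this sketch, `toy.product` is something you can call, not a Series, which is why a name collision like this one can be confusing to debug.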

### Getting unique values in a column

As you evaluate your data, you'll often want to get a list of unique values in a column (for cleaning, filtering, grouping, etc.).

To do this, you can use the Series method `unique()`. If you wanted to get a list of baseball positions, you could do:

In [None]:
mlb.POS.unique()

If useful, you could also sort the results alphabetically with the Python [`sorted()`](https://docs.python.org/3/library/functions.html#sorted) function:

In [None]:
sorted(mlb.POS.unique())

Sometimes you just need the _number_ of unique values in a column. To do this, you can use the pandas method `nunique()`:

In [None]:
mlb.POS.nunique()

(You can also run `nunique()` on an entire dataframe:)

In [None]:
mlb.nunique()

If you want to count up the number of times a value appears in a column of data -- the equivalent of doing a pivot table in Excel and aggregating by count -- you can use the Series method [`value_counts()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.value_counts.html).

To get a list of MLB teams and the number of times each one appears in our salary data -- in other words, the roster count for each team -- we could do:

In [None]:
mlb.TEAM.value_counts()

### Running basic summary stats

Some of this already surfaced with `describe()`, but in some cases you'll want to compute these stats manually:
- `sum()`
- `mean()`
- `median()`
- `max()`
- `min()`

You can run these on a Series (e.g., a column of data), or on an entire DataFrame.

In [None]:
mlb.SALARY.sum()

In [None]:
mlb.SALARY.mean()

In [None]:
mlb.SALARY.median()

In [None]:
mlb.SALARY.max()

In [None]:
mlb.SALARY.min()

In [None]:
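# entire dataframe -- numeric_only=True skips text columns like NAME and TEAM,
# which would otherwise raise an error in recent versions of pandas
mlb.mean(numeric_only=True)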

### Sorting your data

You can use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method to sort a dataframe by one or more columns. The default is to sort the values ascending; if you want your results sorted descending, specify `ascending=False`.

Let's sort our dataframe by `SALARY` descending:

In [None]:
mlb.sort_values('SALARY', ascending=False).head()

To sort by multiple columns, pass a list of columns to the `sort_values()` method -- the sorting will happen in the order you specify in the list. You'll also need to pass a list to the `ascending` keyword argument, otherwise both will sort ascending.

Let's sort our dataframe first by `TEAM` ascending, then by `SALARY` descending:

In [None]:
mlb.sort_values(['TEAM', 'SALARY'], ascending=[True, False]).head()

### Filtering rows of data

To filter your data by some criteria, you'd pass your filtering condition(s) to a dataframe using bracket notation.

You can use Python's [comparison operators](https://docs.python.org/3/reference/expressions.html#comparisons) in your filters, which include:
- `>` greater than
- `<` less than
- `>=` greater than or equal to
- `<=` less than or equal to
- `==` equal to
- `!=` not equal to

Example: You want to filter your data to keep records where the `TEAM` value is 'ARI':

In [None]:
diamondbacks = mlb[mlb.TEAM == 'ARI']

In [None]:
diamondbacks.head()

We could filter to get all records where the `TEAM` value is _not_ 'ARI':

In [None]:
non_diamondbacks = mlb[mlb.TEAM != 'ARI']

In [None]:
non_diamondbacks.head()

We could filter our data to just grab the players that make at least $1 million:

In [None]:
million_a_year = mlb[mlb.SALARY >= 1000000]

In [None]:
million_a_year.head()

### Filtering against multiple values

You can use the `isin()` method to test a value against multiple matches -- just hand it a _list_ of values to check against.

Example: Let's say we wanted to filter to get just players in Texas (in other words, just the Texas Rangers and the Houston Astros):

In [None]:
tx = mlb[mlb.TEAM.isin(['TEX', 'HOU'])]

In [None]:
tx.head()

### Exclusion filtering

Sometimes it's easier to specify what records you _don't_ want returned. To flip the meaning of a filter condition, prepend a tilde `~`.

For instance, if we wanted to get all players who are _not_ from Texas, we'd use the same filter condition we just used to get the TX players but add a tilde at the beginning:

In [None]:
not_tx = mlb[~mlb.TEAM.isin(['TEX', 'HOU'])]

In [None]:
not_tx.head()

### Filtering text columns with string methods

You can access the text values in a column with `.str`, which gives you pandas versions of many of Python's native string methods that run on every value in the column.

For our purposes, though, the pandas [`str.contains()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.str.contains.html) method is useful for filtering data by matching text patterns.

If we wanted to get every player with 'John' in their name, we could do something like this:

In [None]:
johns = mlb[mlb.NAME.str.contains('John', case=False)]

In [None]:
johns.head()

Note the `case=False` keyword argument -- we're telling pandas to ignore case when matching. And if the pattern you're trying to match is more complex, the method supports [regular expressions](https://docs.python.org/3/howto/regex.html) by default.
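
To illustrate the regular expression support, here's a minimal sketch using a made-up Series of names (the names and the pattern are assumptions for illustration, not taken from `mlb.csv`). It also uses two other `str.contains()` keyword arguments: `na=False`, which treats missing values as non-matches, and `regex=False`, which turns off regular expression matching:

```python
import pandas as pd

# hypothetical names -- not taken from mlb.csv
names = pd.Series(['John Smith', 'Jon Jones', 'Johnny Cueto', 'Al Johnson', None])

# regex pattern: 'Jo', an optional 'h', then 'n', anchored to the start of the text
# na=False counts missing names as non-matches so the result can be used as a filter
starts_with_john = names.str.contains(r'^Joh?n', case=False, na=False)

# regex=False matches the text literally, which helps when your search text
# includes characters like '.' or '(' that mean something special in a regex
literal_match = names.str.contains('John', regex=False, na=False)

print(names[starts_with_john].tolist())
print(names[literal_match].tolist())
```

The first filter keeps the three names that start with "John" or "Jon"; the second keeps every name with "John" anywhere in it, including "Al Johnson".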

### Multiple filters

Sometimes you have multiple filters to apply to your data. Lots of the time, it makes sense to break the filters out into separate statements.

For instance, if I wanted to get all Texas players who make at least $1 million, I might do this:

In [None]:
tx = mlb[mlb.TEAM.isin(['TEX', 'HOU'])]

# note that I'm filtering the dataframe I just created, not the original `mlb` dataframe
tx_million_a_year = tx[tx.SALARY >= 1000000]

In [None]:
tx_million_a_year.head()

But sometimes you want to chain your filters together into one statement. Use `|` for "or" and `&` for "and" rather than Python's built-in `or` and `and` operators, and wrap each condition in its own set of parentheses.

The same filter in one statement:

In [None]:
tx_million_a_year = mlb[(mlb.TEAM.isin(['TEX', 'HOU'])) & (mlb.SALARY >= 1000000)]

In [None]:
tx_million_a_year.head()

Do what works for you and makes sense in context, but I find the first version a little easier to read.
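
The chained filter in this section uses `&`. For comparison, here's a minimal sketch of `|` next to `&` on a small made-up dataframe (the `toy` data is an assumption for illustration; only the column names match the salary data):

```python
import pandas as pd

# hypothetical toy data using the same column names as the salary data
toy = pd.DataFrame({
    'NAME': ['A', 'B', 'C', 'D'],
    'TEAM': ['TEX', 'HOU', 'ARI', 'ARI'],
    'SALARY': [500000, 2000000, 3000000, 600000],
})

# "or": keep players who are on a Texas team OR who make at least $1 million
tx_or_million = toy[(toy.TEAM.isin(['TEX', 'HOU'])) | (toy.SALARY >= 1000000)]

# "and": keep players who are on a Texas team AND who make at least $1 million
tx_and_million = toy[(toy.TEAM.isin(['TEX', 'HOU'])) & (toy.SALARY >= 1000000)]

print(tx_or_million.NAME.tolist())
print(tx_and_million.NAME.tolist())
```

The parentheses matter because Python evaluates `&` and `|` before comparisons like `>=`, so leaving them off a comparison changes what gets compared.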

### Adding a calculated column

To add a new column to a dataframe, use bracket notation to supply the name of the new column (in single or double quotes, as long as they match), then set it equal to a value -- maybe a calculation derived from other data in your dataframe.

For example, let's create a new column, `contract_total`, that multiplies the annual salary by the number of contract years:

In [None]:
mlb['contract_total'] = mlb['SALARY'] * mlb['YEARS']

In [None]:
mlb.head()

### Filtering for nulls

You can use the `isnull()` method to get records that are null, or `notnull()` to get records that aren't. The most common use I've seen for these methods is during filtering to see how many records you're missing (and, therefore, how that affects your analysis).

The MLB data is complete, so to demonstrate this, let's load up a new data set: A cut of the [National Inventory of Dams](https://ire.org/nicar/database-library/databases/national-inventory-of-dams/) database, courtesy of the NICAR data library. (We'll need to specify the `encoding` on this CSV because it's not UTF-8.)

In [None]:
dams = pd.read_csv('dams.csv',
                   encoding='latin-1')

In [None]:
dams.head()

Maybe we're interested in looking at the year the dam was completed (the `Year_Comp` column). Running `.info()` on the dataframe shows that we're missing some values:

In [None]:
dams.info()

We can filter for `isnull()` to take a closer look:

In [None]:
no_year_comp = dams[dams.Year_Comp.isnull()]

In [None]:
no_year_comp.head()

How many are we missing? That will help us determine whether the analysis would be valid:

In [None]:
# calculate the percentage of records with no Year_Comp value
# (part / whole) * 100

(len(no_year_comp) / len(dams)) * 100

So this piece of our analysis would exclude one-third of our records -- something you'd need to explain to your audience, if indeed your reporting showed that the results of your analysis would still be meaningful.
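
If you only need the count or the percentage, you may not need to build the filtered dataframe first. A minimal sketch, using a made-up stand-in for the dams data (the `toy_dams` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the dams data: 2 of the 6 completion years are missing
toy_dams = pd.DataFrame({'Year_Comp': [1950, np.nan, 1972, 1988, np.nan, 2001]})

# isnull() returns True or False for each record;
# sum() counts the True values and mean() gives the share that are True
missing_count = toy_dams.Year_Comp.isnull().sum()
missing_pct = toy_dams.Year_Comp.isnull().mean() * 100

print(missing_count)
print(round(missing_pct, 1))
```

This works because pandas treats `True` as 1 and `False` as 0 when it does math on them.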

To get records where the `Year_Comp` is not null, we'd use `notnull()`:

In [None]:
has_year_comp = dams[dams.Year_Comp.notnull()]

In [None]:
has_year_comp.head()

What years remain? Let's use `value_counts()` to find out:

In [None]:
has_year_comp.Year_Comp.value_counts()

To sort by year, not count, we could tack on a `sort_index()`:

In [None]:
has_year_comp.Year_Comp.value_counts().sort_index()

### Grouping and aggregating data

You can use the `groupby()` method to group and aggregate data in pandas, similar to what you'd get by running a pivot table in Excel or a `GROUP BY` query in SQL. We'll also provide the aggregate function to use.

Let's group our baseball salary data by team to see which teams have the biggest payrolls -- in other words, we want to use `sum()` as our aggregate function:

In [None]:
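# numeric_only=True keeps sum() from also gluing together text columns in recent versions of pandas
grouped_mlb = mlb.groupby('TEAM').sum(numeric_only=True)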

In [None]:
grouped_mlb.head()

If you don't specify what columns you want, it will run `sum()` on every numeric column. Typically I select just the grouping column and the column I'm running the aggregation on:

In [None]:
grouped_mlb = mlb[['TEAM', 'SALARY']].groupby('TEAM').sum()

In [None]:
grouped_mlb.head()

... and we can sort descending, with `head()` to get the top payrolls:

In [None]:
grouped_mlb.sort_values('SALARY', ascending=False).head(10)

You can use different aggregate functions, too. Let's say we wanted to get the top median salaries by team:

In [None]:
mlb[['TEAM', 'SALARY']].groupby('TEAM').median().sort_values('SALARY', ascending=False).head(10)

You can group by multiple columns by passing a list. Here, we'll select our columns of interest and group by `TEAM`, then by `POS`, using `sum()` as our aggregate function:

In [None]:
mlb[['TEAM', 'POS', 'SALARY']].groupby(['TEAM', 'POS']).sum()
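
If you want several aggregate functions at once, one option is the `agg()` method, which accepts a list of function names. A minimal sketch on a small made-up dataframe (the `toy` data is an assumption for illustration, not the real salary data):

```python
import pandas as pd

# hypothetical toy data using the same column names as the salary data
toy = pd.DataFrame({
    'TEAM': ['ARI', 'ARI', 'TEX', 'TEX', 'TEX'],
    'SALARY': [100, 300, 200, 400, 600],
})

# group by team, select the SALARY column, then run three aggregations at once
summary = toy.groupby('TEAM')['SALARY'].agg(['sum', 'median', 'count'])

# reset_index() turns the TEAM index back into a regular column
summary = summary.reset_index()

print(summary)
```

Each function name in the list becomes its own column in the result, so here you'd get one row per team with `sum`, `median` and `count` columns.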

### Pivot tables

Sometimes you need a full-blown pivot table, and [pandas has a function to make one](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html).

For this example, we'll look at some foreign trade data -- specifically, eel product imports from 2010 to mid-2017:

In [None]:
eels = pd.read_csv('eels.csv')

In [None]:
eels.head()

Let's run a pivot table where the grouping column is `country`, the values are the sum of `kilos`, and the columns are the year:

In [None]:
pivoted_sums = pd.pivot_table(eels,
                              index='country',
                              columns='year',
                              values='kilos',
                              aggfunc='sum')

In [None]:
pivoted_sums.head()

Let's sort by the `2017` value. While we're at it, let's fill in null values (`NaN`) with zeroes using the [`fillna()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.fillna.html) method.

In [None]:
pivoted_sums.sort_values(2017, ascending=False).fillna(0)
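
To make the reshaping easier to see, here's a minimal sketch of the same steps on a tiny made-up table. The `toy_eels` values are assumptions for illustration, not real trade data; only the column names match the eels CSV:

```python
import pandas as pd

# hypothetical toy data: one row per shipment
toy_eels = pd.DataFrame({
    'country': ['China', 'China', 'China', 'Japan'],
    'year': [2016, 2016, 2017, 2016],
    'kilos': [10, 5, 7, 3],
})

# one row per country, one column per year, each cell the sum of kilos
toy_pivot = pd.pivot_table(toy_eels,
                           index='country',
                           columns='year',
                           values='kilos',
                           aggfunc='sum')

# Japan has no 2017 shipments, so that cell is NaN until we fill it with zero
filled = toy_pivot.fillna(0).sort_values(2017, ascending=False)

print(filled)
```

Note that the year columns are numbers, not text, which is why `sort_values(2017)` has no quotes around the year.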

### Applying a function across rows

Often, you'll want to calculate a value for every row, but the calculation is too involved for a one-liner. In that case, you can write a separate function that accepts one row of data, does some calculations and returns a value. We'll use the [`apply()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.apply.html) method to accomplish this.

For this example, we're going to load up a CSV of gators killed by hunters in Florida:

In [None]:
gators = pd.read_csv('gators.csv')

In [None]:
gators.head()

We want to find the longest gator in our data, of course, but there's a problem: right now, the carcass size value is being stored as text: `{} ft. {} in.`. The pattern is predictable, though, and we can use some Python to turn those values into consistent numbers -- inches -- that we can then sort on. Here's our function:

In [None]:
def get_inches(row):
    '''Accepts a row from our dataframe, calculates carcass length in inches and returns that value'''

    # get the value in the 'Carcass Size' column
    carcass_size = row['Carcass Size']
    
    # split the text on 'ft.'
    # the result is a list
    size_split = carcass_size.split('ft.')
    
    # strip whitespace from the first item ([0]) in the resulting list -- the feet --
    # and coerce it to an integer with the Python `int()` function
    feet = int(size_split[0].strip())
    
    # in the second item ([1]) in the resulting list -- the inches -- replace 'in.' with nothing,
    # strip whitespace and coerce to an integer
    inches = int(size_split[1].replace('in.', '').strip())
    
    # add the feet times 12 plus the inches and return that value
    return inches + (feet * 12)

Now we're going to create a new column, `length_in`, and use the [`apply()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.apply.html) method to apply our function to every row. The `axis=1` keyword argument means that we're applying our function row-wise, not column-wise.

In [None]:
gators['length_in'] = gators.apply(get_inches, axis=1)

In [None]:
gators.sort_values('length_in', ascending=False).head()
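
When the text pattern is this regular, an alternative to `apply()` is to pull the numbers out with the `.str` methods. This is a minimal sketch of that different approach, not a replacement for the function above, and the `toy_gators` values are made up for illustration:

```python
import pandas as pd

# hypothetical toy data in the same '{} ft. {} in.' format
toy_gators = pd.DataFrame({
    'Carcass Size': ['11 ft. 5 in.', '9 ft. 0 in.', '12 ft. 10 in.']
})

# each (\d+) group in the regular expression becomes its own column: 0 for feet, 1 for inches
parts = toy_gators['Carcass Size'].str.extract(r'(\d+)\s*ft\.\s*(\d+)\s*in\.')

# extract() returns text, so convert to integers before doing math
feet = parts[0].astype(int)
inches = parts[1].astype(int)

toy_gators['length_in'] = feet * 12 + inches

print(toy_gators.sort_values('length_in', ascending=False))
```

Working on whole columns like this tends to be faster on large files, but a row-wise function is often easier to read and to extend with extra logic.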

### Joining data

You can use [`merge()`](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.merge.html) to join data in pandas.

In this simple example, we're going to take a CSV of country population data in which each country is represented by an [ISO 3166-1 numeric country code](https://en.wikipedia.org/wiki/ISO_3166-1_numeric) and join it to a CSV that's basically a lookup table with the ISO codes and the names of the countries to which they refer.

Some of the country codes have leading zeroes, so we're going to use the `dtype` keyword when we import each CSV to specify that the `'code'` column in each dataset should be treated as a string (text), not a number.

In [None]:
pop_csv = pd.read_csv('country-population.csv', dtype={'code': str})

In [None]:
pop_csv.head()

In [None]:
code_csv = pd.read_csv('country-codes.csv', dtype={'code': str})

In [None]:
code_csv.head()

Now we'll use `merge()` to join them.

The `on` keyword argument tells the method what column to join on. If the names of the columns were different, you'd use `left_on` and `right_on`, with the "left" dataframe being the first one you hand to the `merge()` function.

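The `how` keyword argument tells the method what type of join to use -- the default is `'inner'`. Here we'll use `'left'`, which keeps every record from the left dataframe whether or not it has a match in the right one.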

In [None]:
joined_data = pd.merge(pop_csv,
                       code_csv,
                       on='code',
                       how='left')

In [None]:
joined_data.head()
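
For the case where the join columns have different names, here's a minimal sketch using `left_on` and `right_on` on two small made-up dataframes. The `toy_pop` and `toy_codes` data, the `iso_code` column name and the population figures are all assumptions for illustration. It also passes `indicator=True`, which adds a `_merge` column showing whether each record found a match:

```python
import pandas as pd

# hypothetical toy data -- the population figures are made up
toy_pop = pd.DataFrame({
    'code': ['004', '008', '999'],
    'population': [100, 200, 300],
})

# lookup table whose code column has a different name
toy_codes = pd.DataFrame({
    'iso_code': ['004', '008'],
    'country': ['Afghanistan', 'Albania'],
})

toy_joined = pd.merge(toy_pop,
                      toy_codes,
                      left_on='code',
                      right_on='iso_code',
                      how='left',
                      indicator=True)

# records from the left dataframe that found no match in the lookup table
unmatched = toy_joined[toy_joined['_merge'] == 'left_only']

print(toy_joined)
print(unmatched.code.tolist())
```

Checking for unmatched records after a left join is a quick way to spot codes that are missing from your lookup table.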