# Introduction to Pandas

*Author: [Douglas Strodtman](http://linkedin.com/in/dstrodtman/)*

In this notebook, we'll cover the basic functionality of Pandas, which is one of the most widely used Python packages for data exploration, cleaning, and manipulation. The official docs have a [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) lesson; we'll take a bit longer and hopefully generalize concepts to have a deeper understanding.

If you have robust knowledge working with R, SQL, SAS, or Stata, I recommend that you read through [the corresponding comparison notes provided by the maintainers of Pandas](http://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/).

## A Note on the Data

The city of Los Angeles has adopted one of the most progressive approaches to open data in the nation under the leadership of Mayor Garcetti.

The [Los Angeles Open Data portal](https://data.lacity.org/) is an amazing resource for accessing city-specific data. It has a number of built-in tools to let you explore data in the browser prior to downloading, and will let you export data in many popular formats.

Many online tutorials will have you work with canonical datasets that are well-documented. While these are fine, I encourage you to always work with real data when possible, and to apply your skills to questions that you'd like to answer.

Today, we'll be working with [City Budget and Expenditures](https://controllerdata.lacity.org/Budget/City-Budget-and-Expenditures/uyzw-yi8n) data. The version that has been included with this lesson was updated on March 8, 2019.

The data is presented as a line item budget, showing aggregates for accounts by department and fund. We can consider each row in our data to be a line item.

## Lesson Overview

1. Module import
1. Data import
1. Preview Data
    - head
    - tail
    - sample
1. DataFrames and Series
    - What are they?
    - Data selection
        - One column
        - Multiple columns
        - One row
        - Multiple rows
        - Singular values
1. Metadata exploration
    - Data size
    - Data types
    - Columns
    - Index
1. Methods and Attributes
1. Categorical data
    - Unique values
    - Value counts
    - String operations
1. Numeric data
    - Summary statistics
    - Basic math operations
1. Sorting


## Module Import

`pandas` is always imported as `pd`. It's good to import all modules that you might need in a notebook in the first code block.

Pandas is in active open source development. While the maintainers do a good job of making sure that newer versions are backward compatible, occasionally there will be breaking changes.

Checking the `__version__` of an imported module will let us know if we're all in the same development environment. The newest version is 0.24.2 (at the time of this writing, April 2019). It is unlikely that you will see significant changes in performance as long as your version is > 0.20.0.
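A minimal sketch of the import cell described above. The version printed depends entirely on your own environment.

```python
import pandas as pd

# Print the installed version so everyone can compare environments.
print(pd.__version__)
```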

## Data Import

Pandas provides support for many different data file types. Read the docs [here](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) if you need help loading a file.

Provided with this lesson are a number of data files. We'll begin our exploration by loading the file `2018_budget.csv`.

When you read data into Pandas, it will be stored into a DataFrame (we'll explore what this means shortly).

In order to explore and manipulate our DataFrame, we'll assign it to a variable.

### A Note on Naming Conventions

Python has many naming conventions, and some of these vary slightly based on area of practice.

Generally speaking, it's best practice to use [snake case](https://en.wikipedia.org/wiki/Snake_case) for all variable names, and to provide your variables with informative human-readable names. Initial capital letters are usually reserved for `Class`es and `Type`s and written in [camel case](https://en.wikipedia.org/wiki/Camel_case). Variable names in all caps are usually reserved for global variables.

| casing | use case | example |
| --- | --- | --- |
| `snake_case` | most variables and functions | `some_variable`, `some_function` |
| `CamelCase` | classes | `DataFrame`, `Series` |
| `UPPERCASE` | global variables | `USER_CREDENTIALS`, `IP_ADDRESS` |

Until you become a more advanced Python user, you will probably want to exclusively use `snake_case`, as you are unlikely to be defining classes or global variables.

### A Specific Note on `df`

As you are searching for solutions to Pandas related issues and going through online tutorials, you will find a proliferation of the variable name `df`.

`df` is shorthand for DataFrame, and you can readily adapt code you find online to your use case by replacing `df` with your own variable name.

> "Should I use `df` to name _my_ DataFrame?" - Literally Everyone

_**Maybe**_.

My rule of thumb is that if I know I'm only going to be loading in a single DataFrame, I'll stick with `df`. If I'm going to have 2 or more DataFrames in a single notebook, then I always give each a descriptive, human-readable name.

**Note**: Please don't ever just append numbers to differentiate DataFrames. This creates a huge mental burden on you as a programmer and your audience to keep track of what data is in each DataFrame.

### Load Data as `df` with `read_csv`

Here, we'll use the standard `df` to read our CSV with default options.

Pandas is great about having generally good default options that do what you probably want, but if your data ever loads in incorrectly, you can probably set an option or two to fix this by [consulting the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv).
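Here is a hedged sketch of that load step. So that it runs anywhere, it reads three invented rows from an in-memory string rather than the lesson file; the values are placeholders, and only the column names are borrowed from the budget data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the lesson file. With the real data you would
# call pd.read_csv('2018_budget.csv') instead, adjusting the path as needed.
csv_text = """budget_fiscal_year,fund_name,total_budget
2018,General Fund,1000.0
2018,Sewer Fund,250.5
2018,Library Fund,0.0
"""

# Default options: the first line becomes the header, commas separate fields.
df = pd.read_csv(io.StringIO(csv_text))
print(type(df))
```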

## Preview Data

Pandas is meant for computational evaluation and manipulation of your data. This means that you'll never be manually poring over every value in your data. Our data is going to be big, and visual inspection will be a waste of our time.

Similarly, because our data is big, we don't want to ever print out all rows. Instead, we'll check that our manipulations are working correctly by printing out small previews.

### `.head`

By default, this returns the top 5 rows. Pass a scalar value as an argument to get that number of rows returned.

### `.tail`

The same as `.head` but for the end of our DataFrame.

### `.sample`

Provides a random sample of the rows in our data.
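The three preview methods can be sketched on a small invented frame (ten made-up rows standing in for the budget data). The `random_state` argument is optional and is only there to make the draw repeatable.

```python
import pandas as pd

# Invented ten-row frame standing in for the budget data.
df = pd.DataFrame({
    'fund_name': ['Fund ' + str(i) for i in range(10)],
    'total_budget': [i * 100.0 for i in range(10)],
})

first_five = df.head()    # top 5 rows by default
first_two = df.head(2)    # pass a number to change how many rows come back
last_five = df.tail()     # same idea, from the end of the DataFrame
three_random = df.sample(3, random_state=42)  # 3 randomly chosen rows
```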

### Note on Data Previews

While we'll use `head` extensively throughout this demo, it's generally good practice to validate your operations by sampling from throughout your data, especially as you're learning (as you're especially prone to make mistakes that might not be apparent in just the first 5 rows).

In particular, manual inspection of the `tail` of your data will often quickly reveal if there was an error while parsing/loading your data, as generally these errors will cause misalignment that propagates throughout your data and will be easily visible in your last few rows.

## `DataFrames` and `Series`

The two main data structures you'll need to understand when working with Pandas are `DataFrames` and `Series`.

In brief, Pandas provides tabular (table-based) access to data, porting much of the functionality of Excel and SQL into the Python environment. As such, we'll be talking a lot about rows, columns, and cells. 

At the same time, it is implemented to be optimized for vectorized operations, meaning that we can also think of a DataFrame as a matrix, with our Series representing vectors, and our cells being scalars. **If you don't speak linear algebra, don't worry.** If you just use Pandas built-in functions, you should see the performance benefits without ever thinking about what a vector is.

Here's a table to help you organize these concepts:

| Pandas structure | Common name | Linalg complement | Dimensions |
| --- | --- | --- | --- |
| DataFrame | table, sheet | matrix | 2-d (rows by columns) |
| Series | column, row | vector, array | 1-d (rows and columns have length only) |
| values | cell | scalar | 0-d (these are single values) |


If you're inclined, you can get a deep understanding of these by [reading the docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html).

For our purposes, we'll make a few simple (but important) distinctions.

### `Series`

We can think of Series as the rows and columns of a DataFrame. When we access a Series, the returned preview won't be pretty printed.

#### Columns

We can access Series in a number of ways. Most commonly, we'll use a single square bracket notation.

**NOTE**: We'll use `head` on all of our column operations to only show the first 5 rows.

As long as our column names contain only letters, numbers, and underscores (and don't start with a number or collide with an existing DataFrame attribute), we can also use dot notation to access a Series.

We can also use `.loc` notation to access a Series.

The `:` here indicates all rows. 

We can also select a specific range of rows.

We can also provide a list of specific rows we want to include.
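The column selections described above might look like the following sketch. The frame is a tiny invented stand-in; `fund_name` and `total_budget` are column names from the budget data, but the values are made up.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund', 'Harbor Fund'],
    'total_budget': [1000.0, 250.5, 0.0, 75.0],
})

by_bracket = df['fund_name']             # single square brackets
by_dot = df.fund_name                    # dot notation
by_loc = df.loc[:, 'fund_name']          # ':' means all rows
row_range = df.loc[0:2, 'fund_name']     # index labels 0 through 2
row_list = df.loc[[0, 3], 'fund_name']   # only index labels 0 and 3
```

All five results are Series; the first three hold exactly the same data.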

#### Rows

To return an entire row, just use `.loc` with the index label for that row.

Similar to how we specified rows for our column selections, we can specify columns for our row selections.

We can use the same `:` notation to select a range of our columns (note that the columns returned are dependent upon the order of columns in your DataFrame).
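A sketch of the row selections above, again on an invented three-row frame that borrows a few column names from the budget data.

```python
import pandas as pd

# Invented values; only the column names come from the budget data.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund'],
    'adopted_budget_amount': [900.0, 250.0, 10.0],
    'total_expenditures': [450.0, 100.0, 0.0],
    'total_budget': [1000.0, 250.0, 0.0],
})

first_row = df.loc[0]                                   # whole row as a Series
two_fields = df.loc[0, ['fund_name', 'total_budget']]   # chosen columns only
# A label slice includes both endpoints and follows the column order.
col_range = df.loc[0, 'adopted_budget_amount':'total_budget']
```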

#### Quick Note on `.iloc`

Pandas also provides an `.iloc` selection method. This uses the numeric index of rows and columns to handle selection. While this can be useful in some applications, most times in Pandas you will have informative column and index labels that you would rather select on.

When our index is a serial integer range (as here), `.loc` and `.iloc` have similar behavior when selecting rows. `.iloc` will not allow you to pass column names, however.

As we proceed forward and insert, sort, and remove rows and columns, it will become apparent why using specific labels makes for clearer, more robust code for selection and manipulation.
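One difference worth seeing early: a `.loc` slice includes its end label, while an `.iloc` slice stops before its end position, like ordinary Python slicing. A small sketch on invented data:

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund', 'Harbor Fund'],
    'total_budget': [1000.0, 250.5, 0.0, 75.0],
})

loc_rows = df.loc[0:2]       # labels 0, 1, and 2
iloc_rows = df.iloc[0:2]     # positions 0 and 1 only
first_cell = df.iloc[0, 0]   # row position 0, column position 0
```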

### `DataFrames`

DataFrames are tables, essentially, or a collection of rows and columns. When we access a DataFrame, the returned preview will be pretty printed.

**NOTE**: We'll use `head` on all of our DataFrame operations to only show the first 5 rows.

To show an entire DataFrame, just type in the name and evaluate the code.

We can use double square bracket notation to select a subset of columns.

We could have also selected a single column in this way, still returning a DataFrame.

**NOTE**: While here we look at a single column, we still have a DataFrame rather than a Series. Certain Pandas operations require Series and will error out if we try to pass a DataFrame instead.

We can nest brackets within our `.loc` notation to return a DataFrame instead of a Series.

This is especially useful when we're interested in looking at multiple columns for a range of rows.

Or when we're interested in selecting specific rows with multiple columns.

If we nest brackets on just our rows, we'll again get a DataFrame rather than a Series.

This can be used to select one or many rows, with one or many columns.
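The DataFrame selections above can be sketched as follows, using an invented three-row frame. Each result is a DataFrame except the one explicitly marked as a Series.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund'],
    'adopted_budget_amount': [900.0, 250.0, 10.0],
    'total_budget': [1000.0, 250.0, 0.0],
})

two_cols = df[['fund_name', 'total_budget']]    # subset of columns
one_col_frame = df[['fund_name']]               # one column, still a DataFrame
one_col_series = df['fund_name']                # single brackets: a Series
row_range = df.loc[0:1, ['fund_name', 'total_budget']]      # range of rows
picked_rows = df.loc[[0, 2], ['fund_name', 'total_budget']] # specific rows
one_row_frame = df.loc[[0]]                     # nested brackets on rows only
```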

#### A Note on Transpose

DataFrames contain a transpose attribute, `.T`. This doesn't change the underlying data, but just changes the orientation of the returned preview, switching the location of rows with columns.

We can apply this to any of our above DataFrame access methods if we prefer to view our data in this transposed configuration.
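A quick sketch of transposing a preview, on invented data. The original `df` keeps its shape.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund'],
    'adopted_budget_amount': [900.0, 250.0, 10.0],
    'total_budget': [1000.0, 250.0, 0.0],
})

# Column names become the row labels; row labels become the columns.
transposed = df.head(2).T
```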

### Values

Generally, we'll be approaching our data in aggregate or filtering by some condition. Sometimes you will want to access an individual value.

My preferred method is to again use `.loc`.

There is a more specific method of access, `.at`, but this _only_ allows you to access a single value, and so is less generalizable than `.loc`.

You may also see the following format in example code online. We won't go into the details of it now, but essentially we can think of this as selecting a single column AFTER selecting a single row.

We can also apply this in the opposite order, selecting a single row AFTER selecting a single column.
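The four ways of reaching a single value can be compared side by side. This sketch uses invented data; all four expressions return the same cell.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund'],
    'total_budget': [1000.0, 250.0, 0.0],
})

via_loc = df.loc[1, 'fund_name']        # row label, then column label
via_at = df.at[1, 'fund_name']          # single-value access only
row_then_col = df.loc[1]['fund_name']   # select the row, then the column
col_then_row = df['fund_name'][1]       # select the column, then the row
```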

#### Pro-tip

Use standard `.loc` notation regardless of whether you are trying to access a DataFrame, Series, or value. It is the most explicit and robust method for selection, and will make your code easy to read and understand.

## Meta-Data Exploration

Now that we understand some of the basics of navigating our data in Pandas, we'll learn some best practices for learning about _what_ our data is.

### Shape

One of the most important things to always check is the `.shape` of our data. This will return the size of our data as a tuple of `(rows, columns)`.

Here we have 3653 rows and 17 columns.

### Data Types

We can see the data type of each column with the `.dtypes` attribute.

### Info

Using the `.info` method will give us more detail. This includes:

- A description of our index (here a `RangeIndex` with 3653 entries)
- The total number of columns
- The name, data type, and count of non-null values for each column (this will be truncated in DataFrames with MANY columns).
- The data types present in our DataFrame and the count of columns of each type
- The memory usage of our current DataFrame

Looking at the `.info` of your data is always a great place to start your data exploration.
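As a sketch, here are the three checks from this section run on an invented three-row frame. The real budget data would report 3653 rows and 17 columns instead.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'budget_fiscal_year': [2018, 2018, 2018],
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund'],
    'total_budget': [1000.0, 250.0, 0.0],
})

print(df.shape)    # attribute: a (rows, columns) tuple
print(df.dtypes)   # attribute: one data type per column
df.info()          # method: prints its summary directly
```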

### Index

Both our rows and columns have an associated index. For our rows, this is the `.index`.

Here we have a serial integer range as our index, starting at 0, stopping at 3653 (exclusive), and incrementing by 1 at each row.

### Columns

We access our column index with `.columns`.

These are the names of each of the columns in our DataFrame. We can generally treat this object as a list (although sometimes we'll want to explicitly cast it as such for some operations).

To force an index to a list, we can use `.tolist` or just call `list` on it.
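A short sketch of inspecting both indexes and casting the column index to a plain list, using invented data.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund'],
    'total_budget': [1000.0, 250.0, 0.0],
})

print(df.index)     # RangeIndex(start=0, stop=3, step=1)
print(df.columns)   # an Index object holding the column names

column_list = df.columns.tolist()   # explicit cast with the Pandas method
same_list = list(df.columns)        # equivalent cast with the built-in
```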

## Methods and Attributes

By now, you're probably wondering:

> "Why do some of these things have `()`, some `[]`, and some no punctuation after the words?"

Great question!

If you've tried to explore Pandas independently, you may have given up in frustration, unable to troubleshoot the errors that you were getting trying to implement the solutions you found on StackOverflow. 

**You're not alone.**

It will take time to familiarize yourself with the syntax for interacting with the Pandas application programming interface (API). In my experience, the biggest difficulty is learning that Pandas is _always trying to do the right thing_. Or rather, it's trying to do the thing it _thinks_ you might want to do.

Pandas was developed to bring the ease of data exploration and manipulation found in Excel, SQL, and R DataFrames into the Python environment. As such, there is a _ton_ of desired functionality that has been implemented by the open source community. Hopefully the following rules can help you begin to organize Pandas syntax in your mind. [(The official Pandas docs are always a good place to inform yourself)](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html).

### 1. When Accessing a Static Attribute, No Punctuation Is Needed

A DataFrame is a `Class` object. You may or may not have worked with classes before; that's okay! An important thing to understand about a class is that it can store already computed static values (or attributes) that don't actually transform or operate on the data in any way at time of access.

Here are some examples we've seen so far:

| Syntax | Attribute accessed |
| --- | --- |
| `.shape` | A tuple of the shape of the data, rows by columns |
| `.index` | The index of the data, as an Index object |
| `.columns` | The columns of the data, as an Index object |
| `.T` | The transpose of the data, with rows as columns and columns as rows |
| `.some_specific_column_name` e.g., `.fund_name` | A column series |

In each of these examples, the returned attribute is just some aspect of the defined data that you're accessing. This will make a bit more sense after we consider...

### 2. When Modifying Data or Performing Calculations, Use `()`

Classes also have associated methods. These are the same as functions, but you can think of them as generally having pre-defined arguments that act upon the data in the class. Thus far, most of the methods that we've invoked have _modified a preview of our data_.

Here are some examples we've seen so far:

| Syntax | Method applied |
| --- | --- |
| `.head()` | Limit returned view to first 5 rows |
| `.sample()` | Return a random sample row from our data |
| `.tolist()` | Return data as a list object |
| `.info()` | Return a summary of indices, columns, dtypes, nulls, and size |

Some other methods we'll see shortly extend common operations you may have encountered in SQL or numpy:

| Syntax | Method applied |
| --- | --- |
| `.groupby()` | Group by one or more columns |
| `.count()` | Provide an aggregate count of rows |
| `.mean()` | Calculate the mean of one or more columns |
| `.join()` | Join a DataFrame with another DataFrame on some condition |

Pandas methods should always be used, when possible, as they're designed to be highly optimized for calculations, sorting, and aggregation. We'll learn many of these today, but there are far too many to cover in any one lesson (I still find new, useful methods regularly as I complete projects).

**But what about those pesky square brackets?**

### 3. When Filtering or Accessing Data, Use `[]`

This can be a conceptual hurdle for some, but understanding square bracket notation really boils down to two main points:

1. DataFrames are built on top of `numpy` arrays and extend much of the `numpy` notation.
2. DataFrames can be thought of as a dictionary of `Series` and use dictionary indexing notation.

Of course, neither of these concepts is helpful if you aren't familiar with keying into dictionaries and indexing in numpy.

The most important things to keep in mind:

1. Data is accessed by row(s) and then column(s)
    - `df.loc[row]`
    - `df.loc[row, column]`
    - `df.loc[[row1, row2], [column1, column2]]`
2. If no rows are passed, Pandas assumes you are trying to index into your columns
    - `df[column]`
    - `df[[column1, column2]]`
    - `df.some_column_name` (not bracket notation, but the same concept)
3. `:` can be used to select a range
    - `df.loc[:, column]`
    - `df.loc[0:5, column]`
    - `df.loc[0:5, column1:column5]`
4. Double brackets always return a DataFrame.
    - `df[[column]]`
    - `df[[column1, column2]]`
    - `df.loc[[row1, row2], [column1, column2]]`

**This is a lot of information, and you're not expected to remember all of this right now**. One of the nice things about coding in Jupyter is that many of our error messages are informative, and we can often quickly add the punctuation needed to revise our code and get it to run. **The only way to learn is to try and fail**.


## Categorical Data

As we've already seen, Pandas allows you to add a number of different kinds of data into the same DataFrame. It has a suite of methods specifically built out for dealing with categorical data.

By default, Pandas will load any column that contains letters as an `object` type.

You can use the `select_dtypes` method to return a view of your DataFrame with only the specified type present.

I like to save the columns object out so that I can use this to easily operate on all my categorical columns.

Note that when we use a list of column names, the list itself contains a set of square brackets, so we can pass it back to our DataFrame to return a DataFrame.
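A hedged sketch of selecting the categorical columns and reusing their names. The data is invented, and `department_name` is an assumed column name. The explicit cast to `object` is also my addition: some newer Pandas releases may infer a dedicated string type for text, and the cast keeps this example aligned with the `object` behavior described here.

```python
import pandas as pd

# Invented values; 'department_name' is an assumed column name.
df = pd.DataFrame({
    'department_name': ['POLICE', 'FIRE', 'POLICE'],
    'fund_name': ['General Fund', 'General Fund', 'Sewer Fund'],
    'total_budget': [1000.0, 250.0, 0.0],
})
# Assumption: force text columns to object so the sketch matches the lesson.
df = df.astype({'department_name': object, 'fund_name': object})

categorical = df.select_dtypes(include='object')   # only object columns
cat_cols = categorical.columns                     # save the names for reuse
cat_frame = df[cat_cols]                           # pass them back to get a DataFrame
```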

### Summary Statistics

There are limited summary statistics to perform on categorical values. Namely we look at total unique values, total count for each value, and the distribution of our total counts.

The method `.nunique` will give us the number of unique values in each column.

We can then call `.unique` on a specific column to get an array of the distinct values it contains.

Note that by default, `.nunique` will ignore null values, while `.unique` will return these.

More often, we will use `.value_counts` to return a count of each unique value in a column.

By default, these will be returned in descending order.

Note that `.value_counts` also ignores missing values. We can override this behavior (in many Pandas methods) by setting `dropna=False`.

The high number of missing values suggests that this is an optional field.
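These categorical summaries can be sketched on a small invented frame. `optional_field` is a hypothetical, mostly empty column made up for this example; it is not a column from the budget data.

```python
import pandas as pd

# Invented values; 'optional_field' is a hypothetical, mostly empty column.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'General Fund', 'Sewer Fund', 'Library Fund'],
    'optional_field': ['A', None, None, None],
}, dtype=object)

unique_counts = df.nunique()                  # nulls are not counted
distinct = df['optional_field'].unique()      # the null shows up here
fund_counts = df['fund_name'].value_counts()  # most frequent value first
without_nulls = df['optional_field'].value_counts()
with_nulls = df['optional_field'].value_counts(dropna=False)
```

Comparing `without_nulls` to `with_nulls` makes the missing rows visible: the first tallies one row, the second tallies all four.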

### String Operations

Pandas has great functionality for applying string methods to text data.

Let's look at our categorical data again.

I don't love that the department names are in all caps. Changing this is as simple as accessing the Series and then invoking `.str.capitalize`. You'll find easy application of your favorite string methods here, alongside a number of Pandas-specific string methods.

This will return a copy of the Series with the changes applied, but nothing will have changed in the original data.

You can chain methods easily in Pandas. Because the returned object is a Series, we can use `.value_counts` to see unique values and counts with this new formatting.

Let's go ahead and save a similar formatting change for the `account_group_name` field.

Just do this by assigning right back into the Series in the DataFrame.
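A sketch of these string operations on invented data. `department_name` is an assumed column name, and the all-caps values are made up to mirror the situation described above.

```python
import pandas as pd

# Invented values; 'department_name' is an assumed column name.
df = pd.DataFrame({
    'department_name': ['POLICE', 'FIRE', 'POLICE'],
    'account_group_name': ['SALARIES', 'EXPENSES', 'SALARIES'],
})

# Returns a new Series; df itself is untouched.
capitalized = df['department_name'].str.capitalize()

# Chain another method onto the returned Series.
counts = df['department_name'].str.capitalize().value_counts()

# Assign back into the DataFrame to keep the change.
df['account_group_name'] = df['account_group_name'].str.capitalize()
```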

## Numeric Data

There's even greater functionality for numeric data. A great place to start is to look at summary statistics.

### Summary Statistics

The `.describe` method will automatically return most of the desired summary stats for all numeric fields.

I find it's often easier to interpret these values once they're transposed.

Note that two of our numeric fields actually represent categorical data, `budget_fiscal_year` and `department`.

There are a number of ways to deal with this. I prefer to just go ahead and cast them as `object` type. This will omit them from our numeric operations (and also allow us to easily group them with our other categorical columns).

Use `.astype` to change each column to an object and overwrite the original column.

Now when we look at our `.describe`, these values won't appear.
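The describe-then-cast steps above might look like this sketch. The department codes and amounts are invented placeholders.

```python
import pandas as pd

# Invented values; the department codes are placeholders.
df = pd.DataFrame({
    'budget_fiscal_year': [2018, 2018, 2018],
    'department': [70, 38, 70],
    'total_budget': [1000.0, 250.0, 0.0],
})

before = df.describe().T   # transposed: one row per numeric column

# Cast the two categorical-in-disguise columns and overwrite the originals.
df['budget_fiscal_year'] = df['budget_fiscal_year'].astype(object)
df['department'] = df['department'].astype(object)

after = df.describe()      # only truly numeric columns remain
```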

If you're not familiar, the `e+06` is scientific notation, and just means that the decimal should be moved 6 places to the right.

Here's a table for what each of our calculated values represents:

| statistic | meaning |
| --- | --- |
| count | Number of non-null elements |
| mean | Average of non-null elements |
| std | Standard deviation |
| min | The smallest value in the column |
| 25% | Value greater than 25% of data (lower quartile, Q1) |
| 50% | The middle value (50th percentile, Q2) |
| 75% | Value greater than 75% of data (upper quartile, Q3) |
| max | The largest value in the column |

We note that many of our columns are full of mostly zeros (5 of our columns have a median of 0, and 4 have the value 0 at the 75th percentile).

All of our columns have min values of 0 or less.

The only column with missing values is `total_expenditures`. Because we're looking at 2018 data, it's possible that these values just haven't been calculated and finalized yet, but we should keep this in mind moving forward as we look at other years.

#### Select Numeric Columns

To be explicit moving forward, let's get a quick list of numeric columns. We'll again use `.select_dtypes`, this time with the keyword `exclude='object'`, and directly access our `.columns` from the result.
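A minimal sketch of collecting the numeric column names, on invented data. The cast to `object` is my assumption, added so the text column matches the `object` type described in this lesson regardless of Pandas version.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'fund_name': ['General Fund', 'Sewer Fund', 'Library Fund'],
    'adopted_budget_amount': [900.0, 250.0, 10.0],
    'total_budget': [1000.0, 250.0, 0.0],
})
# Assumption: force the text column to object so the sketch matches the lesson.
df = df.astype({'fund_name': object})

# Everything that is not an object column counts as numeric here.
num_cols = df.select_dtypes(exclude='object').columns
```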

Here's a convenient table of the description provided alongside the data for each of these fields:

| Column name | Description |
| --- | --- |
| adopted_budget_amount | Original budget amount adopted by Mayor and Council |
| total_expenditures | Total Budget Fiscal Year amount expended from account to date |
| budget_change_amount | Amendment to the adopted budget amount |
| budget_transfer_in_amount | Increase in appropriation to account by transfer in of funds |
| budget_transfer_out_amount | Decrease in appropriation to account by transfer out of funds |
| total_budget | Appropriation account amount net of changes and transfers to/from the original budgeted amount |
| encumbrance_amount | Obligation or commitment to pay for a good or service |
| pre_encumbrance_amount | Anticipated obligation or commitment to pay for a good or service |
| budget_uncommitted_amount | Total unused appropriation after expenditures and encumbrances |

### Aggregate Statistics

We can also calculate these same statistics on each column, or the entire DataFrame. We'll run through these operations quickly as a demonstration.

#### `.count`

`.count` will work on both numeric and categorical values. Note that nulls are ignored.

#### `.mean`

`.mean` will only evaluate numeric values, ignoring nulls.

By default, means are calculated for each column. However, we can change this to be calculated over rows by passing an `axis=1` keyword argument.

In this investigation, these numbers aren't extremely informative. As a reminder, let's look at our numeric columns again.

So the row-wise means we are returning are just being calculated on these various numbers. Again, this is for demonstration, and not really informative.

#### `.std`

Standard deviation will only be calculated on numeric columns.

#### Quantiles

The `.quantile` method will allow you to specify any value between 0 and 1.

We also have the built-in `.median` function for the 50th percentile.

#### `.min` and `.max`

These will work on both numeric and categorical columns, returning minimum and maximum values for each column. For categorical columns, these are alphabetically sorted.

Note that all of these methods can be applied on a Series as well.
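Here is a hedged sketch of the aggregates above on invented data. I select the numeric columns explicitly before calling `.mean`, `.std`, and the quantile methods, since recent Pandas versions may raise an error on text columns rather than skipping them. `department_name` is an assumed column name.

```python
import pandas as pd

# Invented values; 'department_name' is an assumed column name.
df = pd.DataFrame({
    'department_name': ['POLICE', 'FIRE', 'LIBRARY', 'POLICE'],
    'adopted_budget_amount': [100.0, 200.0, 300.0, 400.0],
    'total_expenditures': [50.0, None, 150.0, 200.0],
})
num_cols = ['adopted_budget_amount', 'total_expenditures']

counts = df.count()                         # non-null values per column
column_means = df[num_cols].mean()          # one mean per column
row_means = df[num_cols].mean(axis=1)       # one mean per row
spread = df[num_cols].std()                 # standard deviation per column
lower_quartile = df[num_cols].quantile(0.25)
medians = df[num_cols].median()

# The same methods work on a single Series, text included for min and max.
first_department = df['department_name'].min()
largest_budget = df['adopted_budget_amount'].max()
```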

### Basic Math Operations

Pandas is set up to do vectorized math operations by default.

We'll go through a few of these here.
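As a sketch of what "vectorized" means here, each operation below acts on whole columns at once, pairing values row by row. The numbers are invented, and some of the combinations exist only to show the operator, not because they are meaningful budget figures.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    'adopted_budget_amount': [100.0, 200.0, 50.0],
    'total_expenditures': [50.0, 100.0, 50.0],
})

added = df['adopted_budget_amount'] + df['total_expenditures']
subtracted = df['adopted_budget_amount'] - df['total_expenditures']
multiplied = df['adopted_budget_amount'] * df['total_expenditures']
divided = df['total_expenditures'] / df['adopted_budget_amount']
```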

#### Addition

#### Subtraction

#### Multiplication

#### Division

We can also do scalar operations:
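For instance, a single number is applied to every element of a column. This is a sketch on invented values.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({'total_budget': [1000.0, 2500.0, 0.0]})

in_thousands = df['total_budget'] / 1000   # divide every value by 1000
doubled = df['total_budget'] * 2           # multiply every value by 2
```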

## Sorting

Sorting in Pandas is accomplished easily by using the `sort_values` method. By default, all columns in the DataFrame will be returned with rows sorted in ascending order by the specified column.

Here, let's sort on `total_budget`.

When we sort on a text column, this will be sorted alphabetically.

Let's try this with department name.

To reverse the sort order, we pass the argument `ascending=False`.

Since this returns a DataFrame, we could choose to preview only the column that we're interested in, even though we're sorting on a different column.

Here, let's return the `fund_name` for the 5 rows with the highest `adopted_budget_amount`.

Of course, it might be more useful to also know what the `adopted_budget_amount` is here. Let's return just these two columns, again sorted to show the 5 largest amounts.
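The sorting steps in this section can be sketched on six invented rows. `department_name` is an assumed column name and every value is made up.

```python
import pandas as pd

# Invented values; 'department_name' is an assumed column name.
df = pd.DataFrame({
    'department_name': ['POLICE', 'FIRE', 'LIBRARY', 'AIRPORTS', 'ZOO', 'HARBOR'],
    'fund_name': ['Fund A', 'Fund B', 'Fund C', 'Fund D', 'Fund E', 'Fund F'],
    'adopted_budget_amount': [300.0, 100.0, 600.0, 200.0, 500.0, 400.0],
    'total_budget': [310.0, 90.0, 650.0, 200.0, 480.0, 400.0],
})

by_budget = df.sort_values('total_budget')             # ascending by default
by_department = df.sort_values('department_name')      # alphabetical for text
largest_first = df.sort_values('total_budget', ascending=False)

# Sort on one column, then preview a different one.
top_funds = df.sort_values('adopted_budget_amount', ascending=False)['fund_name'].head(5)

# Keep both columns so the amounts are visible next to the names.
top_with_amounts = df.sort_values(
    'adopted_budget_amount', ascending=False
)[['fund_name', 'adopted_budget_amount']].head(5)
```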

## Lesson Summary

After today's lesson you should feel comfortable with the following basics of Pandas:

1. Module import
1. Data import
1. Previewing Data
1. Differences between DataFrames and Series
1. Selecting rows and columns
1. Basic metadata exploration
1. Differences between methods and attributes
1. Basic methods for categorical data
1. Basic methods for numeric data
1. Sorting

Please proceed to complete the included lab to test these skills.