# Introduction to Pandas Lab

*Author: [Douglas Strodtman](http://linkedin.com/in/dstrodtman/)*

## Lab Overview

This lab builds upon the previous lesson. Some of the skills that you'll be demonstrating include:

1. Module import
1. Data import
1. Previewing Data
1. Differences between DataFrames and Series
1. Selecting rows and columns
1. Basic metadata exploration
1. Differences between methods and attributes
1. Basic methods for categorical data
1. Basic methods for numeric data
1. Sorting

Our primary objective in this lab will be to explore the 2017 budget and expenditures data and confirm whether or not the fields align with the expected values based on the data dictionary (available [here](https://controllerdata.lacity.org/Budget/City-Budget-and-Expenditures/uyzw-yi8n) under "Columns in this Dataset").

## Module Import

Use the below cell to import pandas.

(Other modules you might choose to import for data exploration and cleaning include [re](https://docs.python.org/3/library/re.html) and [numpy](https://www.numpy.org/), but these aren't essential to complete this lab.)
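A minimal sketch of the import cell, using the `pd` alias that the pandas documentation uses by convention:

```python
import pandas as pd  # conventional alias; the rest of the lab can refer to pandas as `pd`

# Printing the version is a quick way to confirm the import worked.
print(pd.__version__)
```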

## Data Import

We'll be working with just the 2017 data here. You can run command line operations using the `!` in Jupyter. The following cell will list the names of all files in our `data` directory:

In [None]:
!ls ../data

As this will be the main DataFrame in this notebook, it should be safe to use the variable name `df`.
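The sketch below shows the general shape of loading a CSV into `df` with `pd.read_csv`. The actual file name in `../data` is not shown here, so an in-memory string stands in for the file, and the rows are made-up values used only for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (made-up rows; only the column names come from the lab).
csv_text = """department_name,fund_name,total_budget
AGING,General Fund,1000.0
ZOO,Zoo Enterprise Fund,250.5
"""

# pd.read_csv accepts either a file path or a file-like object.
# With the real data, you would pass the path to the 2017 file listed by `!ls ../data`.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```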

## Preview Data

Let's start by looking at the first 3 rows of our data to confirm it loaded correctly.

It's always a good idea to also check the last 3 rows in case there were any parsing errors.
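As a rough illustration on a toy DataFrame (the values are invented), `.head(3)` and `.tail(3)` return the first and last three rows:

```python
import pandas as pd

# Toy data standing in for the real budget file.
df = pd.DataFrame({
    "department_name": ["AGING", "AIRPORTS", "POLICE", "ZOO"],
    "total_budget": [100.0, 250.0, 900.0, 75.0],
})

first_rows = df.head(3)  # first 3 rows
last_rows = df.tail(3)   # last 3 rows
print(first_rows)
print(last_rows)
```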

The data looks as expected. It appears to have been saved sorted alphabetically by department name.

## Meta-Data Exploration

Let's get an idea of the contents of the 2017 data by exploring some of the meta-data. First, we should get an idea of the number of rows and columns.

It's also a good idea to check that our data loaded in the expected format by looking at the data types of each column. If you recall, there's a method that provides the names of each column, the number of non-null values, and the type.
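A small sketch of both checks on toy data (invented values). Note that `.shape` is an attribute (no parentheses), while `.info()` is a method that prints column names, non-null counts, and dtypes:

```python
import pandas as pd

# Toy data with one missing account name.
df = pd.DataFrame({
    "account": [1010, 1020, 1030],
    "account_name": ["Salaries", None, "Travel"],
})

n_rows, n_cols = df.shape  # attribute: a (rows, columns) tuple
print(n_rows, n_cols)

df.info()  # method: prints a summary of columns, non-null counts, and dtypes
```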

It's always a good idea to record observations along the way. Is there anything unexpected in the above data?

Both `total_expenditures` and `account_group_name` have a high number of nulls. The 2 nulls in `account_name` are puzzling, especially as there are no nulls in `account`.

#### Optional 
Because we're only working with the 2017 data, let's go ahead and drop the `budget_fiscal_year` column. We can also drop the `department` column, as this numeric code is redundant with `department_name`.

The docs for `.drop` are [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). 

**Note**: Leaving these columns in won't negatively impact your ability to complete this lab, but will just retain data you don't really need.
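One possible way to drop both columns, shown on toy data (invented values), is to pass a list to the `columns` keyword of `.drop` and reassign the result:

```python
import pandas as pd

# Toy data with the two columns we want to remove.
df = pd.DataFrame({
    "budget_fiscal_year": [2017, 2017],
    "department": [2, 4],
    "department_name": ["AGING", "AIRPORTS"],
})

# .drop returns a new DataFrame by default, so we reassign to df.
df = df.drop(columns=["budget_fiscal_year", "department"])
print(df.columns)
```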

## Categorical Data

Save out a list of all the columns that contain strings.

Now use that list to return the number of unique values in each column.
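A sketch of these two steps on toy data (invented values), assuming the string columns are stored with the `object` dtype. The dtype is set explicitly here because newer pandas versions may infer a dedicated string dtype instead.

```python
import pandas as pd

# Toy data; string columns are forced to object dtype for consistent behavior.
df = pd.DataFrame({
    "department_name": pd.Series(["AGING", "AGING", "ZOO"], dtype="object"),
    "fund_name": pd.Series(["General Fund", "Grant Fund", "General Fund"], dtype="object"),
    "total_budget": [100.0, 50.0, 75.0],
})

# Names of all columns holding strings (object dtype).
string_cols = list(df.select_dtypes(include="object").columns)

# Number of unique values in each of those columns.
unique_counts = df[string_cols].nunique()
print(unique_counts)
```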

Return the names of the 10 funds that have the most line items in 2017, along with the count.

Let's save out a list of the names of the funds that appear more than 100 times in our 2017 data. (In this case, you should be able to just use `.index[:6]` to select the first 6 rows from the previous output.)
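As a hedged illustration with made-up fund names, `.value_counts()` returns counts sorted from most to least frequent, so slicing its index gives the most common labels:

```python
import pandas as pd

# Toy fund names: 3 x General Fund, 2 x Grant Fund, 1 x Zoo Fund.
funds = pd.Series(
    ["General Fund"] * 3 + ["Grant Fund"] * 2 + ["Zoo Fund"],
    name="fund_name",
)

counts = funds.value_counts()         # sorted in descending order of frequency
top_two = counts.head(2)              # the lab asks for the top 10; 2 is enough here
common_funds = list(counts.index[:2])  # labels only, without the counts
print(top_two)
print(common_funds)
```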

To simplify the rest of our operations, let's limit ourselves to only these 6 funds.

We can use `.isin` to create a boolean Series that we can pass back to `.loc` to return only those rows where it is `True`. This is often called **masking**.

Create a mask by completing the following code:

In [None]:
mask = df['fund_name'].isin(#your_most_common_funds_list_here)

We'll save this as a new DataFrame called `top_funds`.

**Note**: We should use `.copy()` here at the end of our assignment to make sure that we've saved out a _new_ DataFrame. While discussions of mutability in Pandas are beyond the scope of this lesson, if you've ever run into warnings like the following:

```
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
```

you can often avoid these by making sure that you're saving out new DataFrame objects with the `.copy()` method.

In [None]:
top_funds = df.loc[mask, :].copy()

And then look at the shape:
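The masking, `.copy()`, and shape check can be seen together on a toy DataFrame. The fund names and the `common_funds` list below are invented for illustration.

```python
import pandas as pd

# Toy data and a hypothetical list of "most common" funds.
df = pd.DataFrame({
    "fund_name": ["A", "B", "C", "A"],
    "total_budget": [10.0, 20.0, 30.0, 40.0],
})
common_funds = ["A", "C"]

mask = df["fund_name"].isin(common_funds)  # boolean Series, one value per row
top_funds = df.loc[mask, :].copy()         # independent copy of the matching rows

# Modifying the copy leaves the original DataFrame untouched.
top_funds["total_budget"] = top_funds["total_budget"] * 2

print(top_funds.shape)
```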

## Numeric Data

Let's look at the summary statistics for the numeric columns of our newly created DataFrame.
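A brief sketch using toy values: on a DataFrame with mixed column types, `.describe()` summarizes only the numeric columns by default.

```python
import pandas as pd

# Toy stand-in for top_funds.
top_funds = pd.DataFrame({
    "fund_name": ["A", "A", "C"],
    "total_budget": [0.0, 100.0, 200.0],
    "pre_encumbrance_amount": [0.0, 0.0, 0.0],
})

summary = top_funds.describe()  # count, mean, std, min, quartiles, max
print(summary)
```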

Do you notice anything particularly noteworthy at this time? Do these values seem reasonable? **There aren't necessarily correct answers to these questions, but you should always record any observations you make as you explore your data.**

Many of our fields have a large number of 0 values, but `pre_encumbrance_amount` is actually all zeroes for this portion of our data.

#### Select Numeric Columns

To be explicit moving forward, let's get a quick list of numeric columns. We'll again use `.select_dtypes`, this time with the keyword `exclude='object'`, and directly access `.columns` on the result.
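A sketch on toy data, again assuming the string columns use the `object` dtype (set explicitly here so the example does not depend on how a given pandas version infers string columns):

```python
import pandas as pd

# Toy stand-in for top_funds with one string column and two numeric columns.
top_funds = pd.DataFrame({
    "fund_name": pd.Series(["A", "C"], dtype="object"),
    "adopted_budget_amount": [100.0, 200.0],
    "total_budget": [110.0, 150.0],
})

# Everything that is not object dtype is treated as numeric here.
numeric_cols = top_funds.select_dtypes(exclude="object").columns
print(list(numeric_cols))
```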

Here's a convenient table of the description provided alongside the data for each of these fields:

| Column name | Description |
| --- | --- |
| adopted_budget_amount | Original budget amount adopted by Mayor and Council |
| total_expenditures | Total Budget Fiscal Year amount expended from account to date |
| budget_change_amount | Amendment to the adopted budget amount |
| budget_transfer_in_amount | Increase in appropriation to account by transfer in of funds |
| budget_transfer_out_amount | Decrease in appropriation to account by transfer out of funds |
| total_budget | Appropriation account amount net of changes and transfers to/from the original budgeted amount |
| encumbrance_amount | Obligation or commitment to pay for a good or service |
| pre_encumbrance_amount | Anticipated obligation or commitment to pay for a good or service |
| budget_uncommitted_amount | Total unused appropriation after expenditures and encumbrances |

Let's see if we can identify any seeming discrepancies in the budget.

#### Total Budget

The total budget should be our approved budget plus the changes and transfers in minus the transfers out.

Create a new column called `calc_total_budget` with the results of this math.

You can use `==` to check equality between two Series. If we enclose this in parentheses, we can chain pandas methods. Calling `.mean()` here will give us the fraction of items that have the same reported total budget as the one we calculated.
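The calculation and the equality check can be sketched on toy data as follows. The values are invented, and one row is chosen so that floating point arithmetic produces a tiny mismatch.

```python
import pandas as pd

# Toy stand-in for top_funds; row 2 uses 0.1 + 0.2, which is not exactly 0.3 in floating point.
top_funds = pd.DataFrame({
    "adopted_budget_amount": [100.0, 200.0, 0.1, 50.0],
    "budget_change_amount": [10.0, 0.0, 0.2, 0.0],
    "budget_transfer_in_amount": [5.0, 0.0, 0.0, 0.0],
    "budget_transfer_out_amount": [15.0, 50.0, 0.0, 0.0],
    "total_budget": [100.0, 150.0, 0.3, 50.0],
})

# adopted + changes + transfers in - transfers out
top_funds["calc_total_budget"] = (
    top_funds["adopted_budget_amount"]
    + top_funds["budget_change_amount"]
    + top_funds["budget_transfer_in_amount"]
    - top_funds["budget_transfer_out_amount"]
)

# True/False per row; the mean of booleans is the fraction of True values.
match_rate = (top_funds["calc_total_budget"] == top_funds["total_budget"]).mean()
print(match_rate)
```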

This operation suggests that almost 3% of our line items are incorrectly balanced. Let's look into this further.

First, we'll set up a new mask (here called `wrong`) to find those rows where these two values aren't identical.

Because the result of math operations between Series will be a new Series (with the same index), we can pass this mask to the difference between these two columns.

Look at how _small_ those numbers are. This is one of the imperfect aspects of doing math with computers. **These numbers are all actually zero**; these tiny numbers just represent precision errors that somehow propagated or compounded during our operations.
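The mask-and-difference steps can be illustrated on toy values, where the only mismatch comes from `0.1 + 0.2` not being exactly `0.3`:

```python
import pandas as pd

# Toy data: the middle row differs only by floating point error.
top_funds = pd.DataFrame({
    "total_budget": [100.0, 0.3, 150.0],
    "calc_total_budget": [100.0, 0.1 + 0.2, 150.0],
})

# Rows where the reported and calculated totals are not identical.
wrong = top_funds["total_budget"] != top_funds["calc_total_budget"]

# The difference is a Series sharing the same index, so the mask applies to it directly.
diffs = (top_funds["calc_total_budget"] - top_funds["total_budget"])[wrong]
print(diffs)
```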

#### Budget Uncommitted Amount

Now that we've confirmed that our `total_budget` is correct, let's see if we can also confirm our `budget_uncommitted_amount`.

In this case, it looks like the calculation will be:

`total_budget` - `total_expenditures` - `encumbrance_amount`

Let's save this as `calc_uncommitted`.

You may notice that a number of these values are `NaN`s. Because we had many missing values for `total_expenditures`, these rows cannot be correctly calculated. For now, we'll ignore these. Note that Pandas will (by default) ignore nulls when calculating any aggregate statistics.
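A short sketch with invented values shows both behaviors: arithmetic with a missing value yields `NaN`, and aggregates such as `.mean()` skip `NaN` by default.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for top_funds with one missing expenditure.
top_funds = pd.DataFrame({
    "total_budget": [100.0, 200.0, 50.0],
    "total_expenditures": [60.0, nan, 50.0],
    "encumbrance_amount": [10.0, 0.0, 0.0],
})

top_funds["calc_uncommitted"] = (
    top_funds["total_budget"]
    - top_funds["total_expenditures"]
    - top_funds["encumbrance_amount"]
)

# The row with a missing expenditure becomes NaN and is skipped by the mean.
mean_uncommitted = top_funds["calc_uncommitted"].mean()
print(top_funds["calc_uncommitted"])
print(mean_uncommitted)
```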

Given the floating point error that we saw in our last series of calculations, let's check for mismatches differently. We'll build up this check one step at a time.

First, let's get the difference of our calculation and the provided value for the `budget_uncommitted_amount`.

From this preview, we'll see both `NaN`s and some very small non-zero values. 

Let's get rid of the sign of these non-zero values by enclosing our previous command in parentheses and calling `.abs()` to return the absolute value.

Because we've eliminated the sign of these differences, we can now use a simple predicate to check if they are meaningful; in this case, a true difference would be greater than 1 cent.

We can do this by just adding `> .01` to the end of our previous call.

This returns a boolean Series. We can use this to mask our DataFrame, but we can also simply enclose it in parentheses and call `.sum()` to get the total number of meaningful differences.

We see that we have two differences of more than a cent.

While we could have saved our mask out to an intermediate variable, here we'll show passing it directly back to our DataFrame to manually review the contents of these two rows.
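The step-by-step build-up, followed by passing the mask directly to the DataFrame, might look like this on toy data. The values are invented, so the counts will differ from those in the real dataset.

```python
import pandas as pd

nan = float("nan")

# Toy data: an exact match, a NaN, a floating point artifact, and one real mismatch.
top_funds = pd.DataFrame({
    "calc_uncommitted": [30.0, nan, 0.1 + 0.2, 500.0],
    "budget_uncommitted_amount": [30.0, 10.0, 0.3, 250.0],
})

diff = top_funds["calc_uncommitted"] - top_funds["budget_uncommitted_amount"]
abs_diff = diff.abs()              # drop the sign
meaningful = abs_diff > .01        # NaN compares as False
n_meaningful = meaningful.sum()    # True counts as 1
print(n_meaningful)

# The same expression, passed straight back to the DataFrame as a mask.
flagged = top_funds[
    (top_funds["calc_uncommitted"] - top_funds["budget_uncommitted_amount"]).abs() > .01
]
print(flagged)
```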

While one of these reported values is only off by a couple hundred dollars, the other is almost \$50k off. We'll keep this in mind as we complete our final calculation.

#### Expected Expenditures

The fact that so many of our line items have not reported their expenditures more than a year later is concerning. While we cannot know these values for sure, assuming that the rest of our reported values are correct, we can calculate the expected amounts.

Rework our most recent calculation to create our new column `expected_expenditures`.

Let's check if any of these are less than 0 to make sure we don't have any negative amounts.

And then we'll also check if any of these are less than 1 (again, this is in part to avoid reporting on floating point errors).

We see a large number of entries here. Let's compare this to the number of nulls in our provided expenditures column.
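The sketch below assumes that "reworking" the uncommitted calculation means solving it for expenditures, that is, `total_budget` - `budget_uncommitted_amount` - `encumbrance_amount`. That reading, and the toy values, are assumptions made for illustration.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for top_funds.
top_funds = pd.DataFrame({
    "total_budget": [100.0, 200.0, 50.0],
    "budget_uncommitted_amount": [30.0, 200.0, 0.0],
    "encumbrance_amount": [10.0, 0.0, 0.0],
    "total_expenditures": [60.0, nan, 50.0],
})

# Assumed rearrangement of: uncommitted = total_budget - expenditures - encumbrance
top_funds["expected_expenditures"] = (
    top_funds["total_budget"]
    - top_funds["budget_uncommitted_amount"]
    - top_funds["encumbrance_amount"]
)

n_negative = (top_funds["expected_expenditures"] < 0).sum()  # negative amounts
n_under_one = (top_funds["expected_expenditures"] < 1).sum()  # effectively zero
n_null = top_funds["total_expenditures"].isnull().sum()       # missing reported values
print(n_negative, n_under_one, n_null)
```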

We see that these counts are almost identical.

Do you feel it's safe to conclude that these accounts had no expenditures in 2017? What further investigations might you be able to conduct on the data to bolster your claims or conclusions?

## Conclusion

Solution code has been provided for all the cells above. We will continue our investigations of this data during the following 2 lessons and labs.