# Introduction to Pandas Lab

*Author: [Douglas Strodtman](http://linkedin.com/in/dstrodtman/)*

## Lab Overview

This lab builds upon the previous lesson. Some of the skills that you'll be demonstrating include:

1. Module import
1. Data import
1. Previewing Data
1. Differences between DataFrames and Series
1. Selecting rows and columns
1. Basic metadata exploration
1. Differences between methods and attributes
1. Basic methods for categorical data
1. Basic methods for numeric data
1. Sorting

Our primary objective in this lab will be to explore the 2017 budget and expenditures data and confirm whether or not the fields align with the expected values based on the data dictionary (available [here](https://controllerdata.lacity.org/Budget/City-Budget-and-Expenditures/uyzw-yi8n) under "Columns in this Dataset").

## Module Import

Use the below cell to import pandas.

(Other modules you might choose to import for data exploration and cleaning include [re](https://docs.python.org/3/library/re.html) and [numpy](https://www.numpy.org/), but these aren't essential to complete this lab.)

In [1]:
import pandas as pd

## Data Import

We'll be working with just the 2017 data here. You can run command line operations in Jupyter by prefixing them with `!`. The following cell will list the names of all files in our `data` directory:

In [2]:
!ls ../data

2017_budget.csv                  City_Budget_and_Expenditures.csv
2018_budget.csv


As this will be the main DataFrame in this notebook, it should be safe to use the variable name `df`.

In [3]:
df = pd.read_csv('../data/2017_budget.csv')

## Preview Data

Let's start by looking at the first 3 rows of our data to confirm it loaded correctly.

In [4]:
df.head(3)

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
0,2017,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,1811882.0,1248695.94,93339.0,0.0,536334.0,1368887.0,0.0,0.0,120191.06,EXPENSES,100,003040,2
1,2017,AGING,HEALTH INS COUNS ADV (HICAP),HICAP 9 MONTH,0.0,204159.0,229429.0,0.0,0.0,229429.0,0.0,0.0,25270.0,,47Y,02ND01,2
2,2017,AGING,AREA PLAN FOR THE AGING TIT 7,SOCIAL SERVICES FOR SENIORS,0.0,54285.0,71102.0,0.0,0.0,71102.0,16817.0,0.0,0.0,,395,02NQ01,2


It's always a good idea to also check the last 3 rows in case there were any parsing errors.

In [5]:
df.tail(3)

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
3590,2017,ZOO,ZOO ENTERPRISE TRUST FUND,GENERAL SERVICES,0.0,3343.67,3343.67,0.0,0.0,3343.67,0.0,0.0,0.0,,40E,87N140,87
3591,2017,ZOO,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,424400.0,476001.59,146077.66,0.0,0.0,570477.66,5146.63,0.0,89329.44,EXPENSES,100,003040,87
3592,2017,ZOO,GENERAL FUND (GENERAL BUDGET),FIELD EQUIPMENT EXPENSE,20000.0,19417.73,0.0,0.0,0.0,20000.0,0.0,0.0,582.27,EXPENSES,100,003090,87


The data looks as expected. It also appears that the data was saved sorted alphabetically by department name.
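
The sort-order observation can be checked rather than eyeballed. The following is a minimal sketch on a hypothetical toy Series (the values are made up, not loaded from the lab's file); it uses the `.is_monotonic_increasing` attribute and the `.sort_values()` method.

```python
import pandas as pd

# Hypothetical toy data standing in for df['department_name'].
names = pd.Series(["AGING", "AGING", "CITY PLANNING", "ZOO"], name="department_name")

# An attribute (no parentheses): True when values never decrease.
already_sorted = names.is_monotonic_increasing

# Shuffle the rows, then restore alphabetical order with a method call.
shuffled = names.iloc[[3, 0, 2, 1]]
resorted = shuffled.sort_values().reset_index(drop=True)

print(already_sorted)
print(resorted.tolist())
```

On the real data, the same check would presumably be `df['department_name'].is_monotonic_increasing`.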

## Meta-Data Exploration

Let's get an idea of the contents of the 2017 data by exploring some of the meta-data. First, we should get an idea of the number of rows and columns.

In [6]:
df.shape

(3593, 17)
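
Note that `.shape` is an attribute while `.head()` is a method. A minimal sketch of that distinction, and of the difference between a Series and a DataFrame, on a hypothetical toy frame (the column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data, not the lab's budget file.
toy = pd.DataFrame({"dept": ["AGING", "ZOO", "ZOO"], "amount": [10.0, 20.0, 30.0]})

# Attributes are stored properties: no parentheses.
n_rows, n_cols = toy.shape

# Methods are functions bound to the object: parentheses required.
first_two = toy.head(2)

# Selecting one column by name gives a Series;
# selecting with a list of names gives a DataFrame.
one_col = toy["amount"]
sub_frame = toy[["amount"]]

print(n_rows, n_cols)
print(type(one_col).__name__, type(sub_frame).__name__)
```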

It's also a good idea to check that our data loaded in the expected format by looking at the data types of each column. If you recall, there's a method that provides the names of each column, the number of non-null values, and the type.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3593 entries, 0 to 3592
Data columns (total 17 columns):
budget_fiscal_year            3593 non-null int64
department_name               3593 non-null object
fund_name                     3593 non-null object
account_name                  3591 non-null object
adopted_budget_amount         3593 non-null float64
total_expenditures            2568 non-null float64
budget_change_amount          3593 non-null float64
budget_transfer_in_amount     3593 non-null float64
budget_transfer_out_amount    3593 non-null float64
total_budget                  3593 non-null float64
encumbrance_amount            3593 non-null float64
pre_encumbrance_amount        3593 non-null float64
budget_uncommitted_amount     3593 non-null float64
account_group_name            1083 non-null object
fund                          3593 non-null object
account                       3593 non-null object
department                    3593 non-null int64
dtypes: float64(9),

It's always a good idea to record observations along the way. Is there anything unexpected in the above data?

Both `total_expenditures` and `account_group_name` have a high number of nulls. The 2 nulls in `account_name` are puzzling, especially as there are no nulls in `account`.
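
Null counts can also be pulled out directly instead of being read off the `.info()` output. This is a sketch on hypothetical toy data (the values are invented for illustration); it uses `.isna()` to count nulls per column and to inspect the rows with a missing `account_name`.

```python
import pandas as pd

# Hypothetical toy rows; None becomes a missing value.
toy = pd.DataFrame({
    "account": ["A1", "B2", "C3"],
    "account_name": ["CONTRACTUAL SERVICES", None, "GENERAL SERVICES"],
    "total_expenditures": [100.0, None, None],
})

# Number of nulls in each column.
null_counts = toy.isna().sum()

# The rows where account_name is missing, for manual review.
rows_missing_name = toy[toy["account_name"].isna()]

print(null_counts)
print(rows_missing_name)
```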

#### Optional 
Because we're only working with the 2017 data, let's go ahead and drop the `budget_fiscal_year` column. We can also drop the `department` column, as this numeric is redundant with the `department_name`.

The docs for `.drop` are [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). 

**Note**: Leaving these columns in won't negatively impact your ability to complete this lab, but will just retain data you don't really need.

In [8]:
# Use the columns= keyword; passing the axis positionally no longer works in pandas 2.0+.
df.drop(columns=['budget_fiscal_year', 'department'], inplace=True)
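
As an alternative to `inplace=True`, `.drop` can return a new DataFrame and leave the original untouched. A minimal sketch on a hypothetical toy frame (values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame with the two columns we want to remove.
toy = pd.DataFrame({
    "budget_fiscal_year": [2017, 2017],
    "department": [2, 87],
    "department_name": ["AGING", "ZOO"],
})

# Without inplace=True, drop returns a new DataFrame.
trimmed = toy.drop(columns=["budget_fiscal_year", "department"])

print(list(trimmed.columns))
print(toy.shape)  # the original still has all three columns
```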

## Categorical Data

Save out a list of all the columns that contain strings.

In [9]:
cat_cols = df.select_dtypes('object').columns

Now use that list to return the number of unique values in each column.

In [10]:
df[cat_cols].nunique()

department_name         52
fund_name              474
account_name          2075
account_group_name       4
fund                   473
account               2352
dtype: int64

Return the names of the 10 funds that have the most line items in 2017, along with the count.

In [11]:
df['fund_name'].value_counts().head(10)

GENERAL FUND (GENERAL BUDGET)     695
RECREATION & PARKS GRANT          218
ARTS DEVELOPMENT FEE TRUST FND    134
SEWER CAPITAL FUND                128
RECREATION AND PARKS              125
DEPT OF NEIGHBORHOOD EMPOWERE     108
COMMUNITY DEVELOPMENT TRUST        73
PROPOSITION K MAINTENANCE FUND     52
PROPOSITION A LOCAL TRANSIT        44
WASTEWATER SYS REV BD CONS/10A     42
Name: fund_name, dtype: int64

Let's save out a list of the names of the funds that appear more than 100 times in our 2017 data. (In this case, you should be able to just use `.index[:6]` to select the first 6 rows from the previous output.)

In [12]:
common_funds = list(df['fund_name'].value_counts().index[:6])

In [13]:
common_funds

['GENERAL FUND (GENERAL BUDGET)',
 'RECREATION & PARKS GRANT',
 'ARTS DEVELOPMENT FEE TRUST FND',
 'SEWER CAPITAL FUND',
 'RECREATION AND PARKS',
 'DEPT OF NEIGHBORHOOD EMPOWERE']
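
Hard-coding `[:6]` depends on reading the counts by eye. A sketch of a threshold-based alternative, assuming a hypothetical toy Series and a made-up threshold of 1 (the lab's threshold would be 100):

```python
import pandas as pd

# Hypothetical toy data standing in for df['fund_name'].
funds = pd.Series(["A", "A", "A", "B", "B", "C"], name="fund_name")

# value_counts returns counts sorted from most to least frequent,
# with the labels stored in the index.
counts = funds.value_counts()

# Keep only labels that appear more often than the threshold.
threshold = 1  # illustrative; not the lab's value
common = counts[counts > threshold].index.tolist()

print(common)
```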

To simplify the rest of our operations, let's limit ourselves to only these 6 funds.

We can use `.isin` to create a boolean Series that we can pass back to `.loc` to return only those rows where the value is `True`. This is often called **masking**.

Create a mask by completing the following code:

In [14]:
# mask = df['fund_name'].isin(#your_most_common_funds_list_here)
mask = df['fund_name'].isin(common_funds)
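
To see what a mask looks like on a small scale, here is a sketch with hypothetical toy data (the fund names and amounts are invented):

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "fund_name": ["GENERAL", "ZOO TRUST", "SEWER"],
    "amount": [1.0, 2.0, 3.0],
})

# One True/False per row, aligned on the same index as the frame.
toy_mask = toy["fund_name"].isin(["GENERAL", "SEWER"])

# Passing the mask to .loc keeps only the True rows.
kept = toy.loc[toy_mask, :]

print(toy_mask.tolist())
print(kept)
```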

We'll save this as a new DataFrame called `top_funds`.

**Note**: We should use `.copy()` here at the end of our assignment to make sure that we've saved out a _new_ DataFrame. While discussions of mutability in Pandas are beyond the scope of this lesson, if you've ever run into warnings like the following:

```
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
```

you can often avoid these by making sure that you're saving out new DataFrame objects with the `.copy()` method.

In [15]:
top_funds = df.loc[mask, :].copy()
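
A small sketch of what `.copy()` buys us, on a hypothetical toy frame: changes to the copied subset do not touch the parent DataFrame, and the subset keeps the parent's index labels.

```python
import pandas as pd

# Hypothetical toy frame.
parent = pd.DataFrame({"fund": ["A", "B", "A"], "amount": [1.0, 2.0, 3.0]})

# Select rows with a mask and make an independent copy.
subset = parent.loc[parent["fund"] == "A", :].copy()

# Modify the copy only.
subset["amount"] = subset["amount"] * 2

print(parent["amount"].tolist())  # unchanged
print(subset["amount"].tolist())
print(subset.index.tolist())      # original row labels are retained
```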

And then look at the shape:

In [16]:
top_funds.shape

(1408, 15)

## Numeric Data

Let's look at the summary statistics for the numeric columns of our newly created DataFrame.

In [17]:
top_funds.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
adopted_budget_amount,1408.0,5000590.0,45037060.0,0.0,0.0,4970.0,500000.0,1095629000.0
total_expenditures,906.0,7881091.0,55859470.0,0.0,42000.0,211396.3,1530037.0,1085729000.0
budget_change_amount,1408.0,321714.0,2363442.0,-6154501.0,0.0,479.825,45258.34,61161460.0
budget_transfer_in_amount,1408.0,321773.6,3190065.0,0.0,0.0,0.0,0.0,85940630.0
budget_transfer_out_amount,1408.0,321773.6,3025259.0,0.0,0.0,0.0,0.0,72990420.0
total_budget,1408.0,5322304.0,45159160.0,0.0,6500.0,67471.505,617742.3,1085729000.0
encumbrance_amount,1408.0,28743.11,203996.5,0.0,0.0,0.0,0.0,4559853.0
pre_encumbrance_amount,1408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
budget_uncommitted_amount,1408.0,222314.1,953267.4,0.0,0.0,2210.0,50430.61,12477140.0


Do you notice anything particularly noteworthy at this time? Do these values seem reasonable? **There aren't necessarily correct answers to these questions, but you should always record any observations you make as you explore your data.**

Many of our fields have a large number of 0 values, but `pre_encumbrance_amount` is actually all zeroes for this portion of our data.

#### Select Numeric Columns

To be explicit moving forward, let's get a quick list of the numeric columns. We'll again use `.select_dtypes`, this time with the keyword `exclude='object'`, and access `.columns` directly from the result.

In [18]:
num_cols = top_funds.select_dtypes(exclude='object').columns
num_cols

Index(['adopted_budget_amount', 'total_expenditures', 'budget_change_amount',
       'budget_transfer_in_amount', 'budget_transfer_out_amount',
       'total_budget', 'encumbrance_amount', 'pre_encumbrance_amount',
       'budget_uncommitted_amount'],
      dtype='object')

Here's a convenient table of the description provided alongside the data for each of these fields:

| Column name | Description |
| --- | --- |
| adopted_budget_amount | Original budget amount adopted by Mayor and Council |
| total_expenditures | Total Budget Fiscal Year amount expended from account to date |
| budget_change_amount | Amendment to the adopted budget amount |
| budget_transfer_in_amount | Increase in appropriation to account by transfer in of funds |
| budget_transfer_out_amount | Decrease in appropriation to account by transfer out of funds |
| total_budget | Appropriation account amount net of changes and transfers to/from the original budgeted amount |
| encumbrance_amount | Obligation or commitment to pay for a good or service |
| pre_encumbrance_amount | Anticipated obligation or commitment to pay for a good or service |
| budget_uncommitted_amount | Total unused appropriation after expenditures and encumbrances |

Let's see if we can identify any seeming discrepancies in the budget.

#### Total Budget

The total budget should be our adopted budget plus the changes and transfers in, minus the transfers out.

Create a new column called `calc_total_budget` with the results of this math.

In [19]:
top_funds['calc_total_budget'] = top_funds['adopted_budget_amount'] + top_funds['budget_change_amount'] - top_funds['budget_transfer_out_amount'] + top_funds['budget_transfer_in_amount']

You can use `==` to check equality between two Series. If we enclose this in parentheses, we can chain pandas methods. Calling `mean` here will give us the proportion of items that have the same reported total budget as the one we calculated.

In [20]:
(top_funds['total_budget'] == top_funds['calc_total_budget']).mean()

0.9737215909090909
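
The `.mean()` trick works because `True` counts as 1 and `False` as 0. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy values: three of the four pairs match exactly.
reported = pd.Series([10.0, 20.0, 30.0, 40.0])
calculated = pd.Series([10.0, 20.0, 30.0, 41.0])

matches = reported == calculated

share_matching = matches.mean()  # proportion of True values
n_matching = matches.sum()       # number of True values

print(share_matching, n_matching)
```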

This operation suggests that almost 3% of our line items are incorrectly balanced. Let's look into this further.

First, we'll set up a new mask (here called `wrong`) to find those rows where these two values aren't identical.

In [21]:
wrong = top_funds['total_budget'] != top_funds['calc_total_budget']

Because the result of math operations between Series will be a new Series (with the same index), we can pass this mask to the difference between these two columns.

In [22]:
(top_funds['total_budget'] - top_funds['calc_total_budget'])[wrong]

75     -5.820766e-11
261     5.820766e-11
449     1.862645e-09
557    -1.862645e-09
574    -7.275958e-12
963     7.275958e-12
1037    7.450581e-09
1041   -3.725290e-09
1047    1.862645e-09
1056   -1.164153e-10
1399    2.328306e-10
1458   -2.910383e-11
1499    3.725290e-09
1996    2.328306e-10
2192    8.731149e-11
2357   -1.164153e-10
2451   -1.455192e-11
2452   -2.910383e-11
2458    1.018634e-10
2464   -5.820766e-11
2469   -5.820766e-11
2475   -1.455192e-11
2476    1.455192e-11
2484   -1.818989e-12
2490   -2.910383e-11
2496    2.910383e-11
2500   -2.728484e-12
2514   -5.820766e-11
2550   -4.656613e-10
2557   -1.490116e-08
2565   -3.129344e-09
2592   -3.725290e-09
2759    9.313226e-10
2768    1.862645e-09
2774    1.455192e-11
2830    1.862645e-09
3267    1.164153e-10
dtype: float64

Look at how _small_ those numbers are. This is one of the imperfect aspects of doing math with computers. **These numbers are all actually zero**; these tiny numbers just represent precision errors that somehow propagated or compounded during our operations.
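
The same effect can be reproduced with a classic example. The sketch below uses hypothetical toy values and `numpy.isclose` with an absolute tolerance of half a cent (the tolerance is an assumption chosen for illustration, not something specified by the lab):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values. 0.1 + 0.2 is not exactly 0.3 in floating point.
reported = pd.Series([0.3, 100.0, 250.5])
calculated = pd.Series([0.1 + 0.2, 100.0, 250.0])

# Exact comparison flags the first pair as different.
exact_match = reported == calculated

# Tolerance-based comparison treats tiny differences as equal.
close_match = np.isclose(
    reported.to_numpy(), calculated.to_numpy(), rtol=0, atol=0.005
)

print(exact_match.tolist())
print(close_match.tolist())
```

Only the last pair, which really differs by 50 cents, is still flagged once a tolerance is applied.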

#### Budget Uncommitted Amount

Now that we've confirmed that our `total_budget` is correct, let's see if we can also confirm our `budget_uncommitted_amount`.

In this case, it looks like the calculation will be:

`total_budget` - `total_expenditures` - `encumbrance_amount`

Let's save this as `calc_uncommitted`.

In [23]:
top_funds['calc_uncommitted'] = top_funds['total_budget'] - top_funds['total_expenditures'] - top_funds['encumbrance_amount']

You may notice that a number of these values are `NaN`s. Because we had many missing values for `total_expenditures`, these rows cannot be correctly calculated. For now, we'll ignore these. Note that Pandas will (by default) ignore nulls when calculating any aggregate statistics.
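
A minimal sketch of both behaviors, on hypothetical toy values: arithmetic with a missing value produces `NaN`, while aggregates such as `.sum()` and `.mean()` skip it unless `skipna=False` is passed.

```python
import pandas as pd

# Hypothetical toy values; None becomes NaN in a float Series.
budget = pd.Series([100.0, 50.0, 80.0])
spent = pd.Series([60.0, None, 80.0])

# Row-wise arithmetic propagates the missing value.
remaining = budget - spent

total_default = remaining.sum()             # NaN is skipped
mean_default = remaining.mean()             # averaged over 2 values, not 3
total_strict = remaining.sum(skipna=False)  # NaN is kept

print(remaining.tolist())
print(total_default, mean_default, total_strict)
```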

Given the floating point error that we saw in our last series of calculations, let's check for mismatches differently. We'll build up this check one step at a time.

First, let's get the difference of our calculation and the provided value for the `budget_uncommitted_amount`.

In [24]:
top_funds['budget_uncommitted_amount'] - top_funds['calc_uncommitted']

0      -5.820766e-11
7       0.000000e+00
10      0.000000e+00
14      0.000000e+00
26     -5.820766e-11
31      4.547474e-13
32      0.000000e+00
33      0.000000e+00
54      9.094947e-12
56      0.000000e+00
57      1.629815e-09
58     -1.409717e-11
61      0.000000e+00
62      0.000000e+00
63      4.547474e-13
64     -5.798029e-12
66     -2.728484e-11
69     -3.637979e-12
74     -1.454836e-12
75      4.423839e-11
84               NaN
85     -1.164153e-10
92      4.656613e-10
100     0.000000e+00
104     0.000000e+00
107    -5.587935e-09
115     7.275958e-12
116    -1.818989e-12
126     1.164153e-10
144     1.164153e-09
            ...     
3267   -4.365575e-11
3289   -7.275958e-12
3302   -1.164153e-10
3314    4.656613e-09
3329   -7.275958e-12
3333    0.000000e+00
3360    0.000000e+00
3361    0.000000e+00
3373   -1.091394e-11
3395    0.000000e+00
3402    0.000000e+00
3410    9.313226e-10
3430    0.000000e+00
3443    9.458745e-11
3454    0.000000e+00
3459   -2.328306e-09
3570   -4.729

From this preview, we'll see both `NaN`s and some very small non-zero values. 

Let's get rid of the sign of these non-zero values by enclosing our previous command in parentheses and calling `.abs()` to return the absolute value.

In [25]:
(top_funds['budget_uncommitted_amount'] - top_funds['calc_uncommitted']).abs()

0       5.820766e-11
7       0.000000e+00
10      0.000000e+00
14      0.000000e+00
26      5.820766e-11
31      4.547474e-13
32      0.000000e+00
33      0.000000e+00
54      9.094947e-12
56      0.000000e+00
57      1.629815e-09
58      1.409717e-11
61      0.000000e+00
62      0.000000e+00
63      4.547474e-13
64      5.798029e-12
66      2.728484e-11
69      3.637979e-12
74      1.454836e-12
75      4.423839e-11
84               NaN
85      1.164153e-10
92      4.656613e-10
100     0.000000e+00
104     0.000000e+00
107     5.587935e-09
115     7.275958e-12
116     1.818989e-12
126     1.164153e-10
144     1.164153e-09
            ...     
3267    4.365575e-11
3289    7.275958e-12
3302    1.164153e-10
3314    4.656613e-09
3329    7.275958e-12
3333    0.000000e+00
3360    0.000000e+00
3361    0.000000e+00
3373    1.091394e-11
3395    0.000000e+00
3402    0.000000e+00
3410    9.313226e-10
3430    0.000000e+00
3443    9.458745e-11
3454    0.000000e+00
3459    2.328306e-09
3570    4.729

Because we've eliminated the sign of these differences, we can now use a simple predicate to check if they are meaningful; in this case, a true difference would be greater than 1 cent.

We can do this by just adding `> .01` to the end of our previous call.

In [26]:
(top_funds['budget_uncommitted_amount'] - top_funds['calc_uncommitted']).abs() > .01

0       False
7       False
10      False
14      False
26      False
31      False
32      False
33      False
54      False
56      False
57      False
58      False
61      False
62      False
63      False
64      False
66      False
69      False
74      False
75      False
84      False
85      False
92      False
100     False
104     False
107     False
115     False
116     False
126     False
144     False
        ...  
3267    False
3289    False
3302    False
3314    False
3329    False
3333    False
3360    False
3361    False
3373    False
3395    False
3402    False
3410    False
3430    False
3443    False
3454    False
3459    False
3570    False
3572    False
3574    False
3575    False
3578    False
3582    False
3583    False
3585    False
3586    False
3587    False
3588    False
3589    False
3591    False
3592    False
Length: 1408, dtype: bool

This returns a boolean Series. We can use this to mask our DataFrame, but we can also simply enclose it in parentheses and call `.sum()` to get the total number of differences greater than one cent.

In [27]:
((top_funds['budget_uncommitted_amount'] - top_funds['calc_uncommitted']).abs() > .01).sum()

2

We see that we have two differences of more than a cent.

While we could have saved our mask out to an intermediate variable, here we'll show passing it directly back to our DataFrame to manually review the contents of these two rows.

In [28]:
top_funds[(top_funds['budget_uncommitted_amount'] - top_funds['calc_uncommitted']).abs() > .01]

Unnamed: 0,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,calc_total_budget,calc_uncommitted
498,CITY PLANNING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,8439371.0,6392197.17,795868.0,150000.0,70000.0,9315239.0,2272266.92,0.0,603917.76,EXPENSES,100,003040,9315239.0,650774.91
2983,RECREATION AND PARKS,RECREATION & PARKS GRANT,SLAUSON RECREATION CENTER,0.0,514707.63,1500000.0,0.0,0.0,1500000.0,164702.67,0.0,820331.49,,205,88NMAT,1500000.0,820589.7


While one of these reported values is only off by a couple hundred dollars, the other is almost \$50k off. We'll keep this in mind as we complete our final calculation.
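
For comparison, here is a sketch of the intermediate-variable version of this check on hypothetical toy values (the numbers and index labels are invented). It also shows why rows with `NaN` never get flagged: any comparison against `NaN` evaluates to `False`.

```python
import pandas as pd

# Hypothetical toy values with arbitrary index labels.
idx = [0, 7, 84, 498]
reported = pd.Series([100.0, 250.0, None, 75.0], index=idx)
calculated = pd.Series([100.0 + 1e-11, 250.0, 10.0, 80.0], index=idx)

# Step 1: absolute difference (NaN where either side is missing).
abs_diff = (reported - calculated).abs()

# Step 2: mask of differences larger than one cent; NaN compares as False.
over_a_cent = abs_diff > 0.01

# Step 3: count the flagged rows and pull them out for review.
n_over = over_a_cent.sum()
flagged = abs_diff[over_a_cent]

print(over_a_cent.tolist())
print(n_over)
print(flagged)
```

Naming each step makes the mask reusable, at the cost of a few extra variables in the notebook's namespace.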

#### Expected Expenditures

The fact that so many of our line items have not reported their expenditures more than a year later is concerning. While we cannot know these values for sure, assuming that the rest of our reported values are correct, we can calculate the expected amounts.

Rework our most recent calculation to create our new column `expected_expenditures`.

In [29]:
top_funds['expected_expenditures'] = top_funds['total_budget'] - top_funds['encumbrance_amount'] - top_funds['budget_uncommitted_amount']

Let's check if any of these are less than 0 to make sure we don't have any negative amounts.

In [30]:
(top_funds['expected_expenditures'] < 0).sum()

0

And then we'll also check if any of these are less than 1 (again, this is in part to avoid reporting on floating point errors).

In [31]:
(top_funds['expected_expenditures'] < 1).sum()

507

We see a large number of entries here. Let's compare this to the number of nulls in our provided expenditures column.

In [32]:
top_funds['total_expenditures'].isna().sum()

502

We see that these counts are almost identical.

Do you feel it's safe to conclude that these accounts had no expenditures in 2017? What further investigations might you be able to conduct on the data to bolster your claims or conclusions?
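
One possible follow-up, sketched here on hypothetical toy data rather than the lab's results: similar counts do not guarantee that the same rows are involved, so it may help to check whether the rows with null `total_expenditures` are the same rows whose `expected_expenditures` is below 1. The sketch combines two masks with `&` and tabulates them with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "total_expenditures": [None, 500.0, None, 0.0],
    "expected_expenditures": [0.0, 500.0, 250.0, 0.0],
})

is_null = toy["total_expenditures"].isna()
near_zero = toy["expected_expenditures"] < 1

# Rows satisfying both conditions at once.
both = (is_null & near_zero).sum()

# 2x2 table of how the two conditions overlap.
overlap = pd.crosstab(is_null, near_zero)

print(both)
print(overlap)
```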

## Conclusion

Solution code has been provided for all the cells above. We will continue our investigations of this data during the following 2 lessons and labs.