# Data Cleaning & EDA
*Author: [Douglas Strodtman](http://linkedin.com/in/dstrodtman/)*



Data cleaning and exploratory data analysis often go hand in hand. 
- Without examining our data, it's difficult to know whether or not there are errors in it. 
- Without cleaning our data, our aggregate statistics may be skewed by errant data.

The interplay of these processes is often cyclical. For a data science workflow, these steps are essential to help us understand the nature of our data and ensure that we haven't injected or propagated unnecessary noise into our modeling algorithm. Oftentimes we will find ourselves circling back to data cleaning and EDA after modeling when we are dissatisfied with results.

**No matter your goals working with data, becoming proficient with cleaning and EDA is amongst the most important skills you can learn.**

## Skills Covered
1. Module import
1. Data import
1. Previewing Data
1. Renaming Columns
1. Masking
1. Reindexing
1. Summary Statistics
1. `groupby` and Aggregation
1. Pivot Tables
1. Using `.diff`
1. Missing data
    - Finding missing values
    - Imputing missing data

## Key Objectives

Our walkthrough will focus on data from the years 2017 and 2018. By the end of this lab, you'll be able to answer the following questions:

- Which department had the most line item entries each year?
- Which department had the highest total expenditures each year?
- Which fund had the highest budget allocation each year?
- What percentage of money from the general fund was allocated to different departments each year?
- Which departments saw the largest budget increase and decrease from 2017 to 2018?

## Module Import
Start off by importing pandas.

In [3]:
import pandas as pd

## Data Import
Load the full data. Use a relative path so that your code will be robust.

To see all the data files that were included with this lesson, run the following cell:

In [1]:
!ls ../data

2018_budget.csv                  City_Budget_and_Expenditures.csv


Import this to the variable `all_data`.

In [4]:
all_data = pd.read_csv('../data/City_Budget_and_Expenditures.csv')

## Preview Data
Look at the first 5 rows of your data to see how it loaded.

In [6]:
all_data.head()

Unnamed: 0,BUDGET FISCAL YEAR,DEPARTMENT NAME,FUND NAME,ACCOUNT NAME,ADOPTED BUDGET AMOUNT,TOTAL EXPENDITURES,BUDGET CHANGE AMOUNT,BUDGET TRANSFER IN AMOUNT,BUDGET TRANSFER OUT AMOUNT,TOTAL BUDGET,ENCUMBRANCE AMOUNT,PRE-ENCUMBRANCE AMOUNT,BUDGET UNCOMMITTED AMOUNT,ACCOUNT GROUP NAME,FUND,ACCOUNT,DEPARTMENT
0,2019,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,2185782.0,750988.85,2000.0,0.0,413400.0,1774382.0,522198.0,0.0,467277.15,EXPENSES,100,003040,2
1,2019,AGING,HEALTH INS COUNS ADV (HICAP),FINANCIAL ALIGNMENT - NEW,0.0,,66184.0,0.0,0.0,66184.0,66184.0,0.0,0.0,,47Y,02RDD3,2
2,2019,AGING,OTHER PROGRAMS FOR THE AGING,ENROLLEE WAGES,0.0,1073105.46,1601346.0,0.0,0.0,1601346.0,0.0,0.0,528240.54,,410,021021,2
3,2019,AGING,SENIOR CITYRIDE PROGRAM FUND,CITYRIDE PROGRAM,0.0,1709925.0,3708000.0,0.0,0.0,3708000.0,1961240.0,0.0,0.0,,599,02R220,2
4,2019,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL,3900.0,319.28,0.0,0.0,0.0,3900.0,0.0,0.0,3580.72,SALARIES AND BENEFITS,100,001090,2


While our default options appear to have successfully loaded the data, we have column names that are all caps and contain spaces. Let's fix this before moving forward.

## Renaming Columns

As long as our column names contain only letters, numbers, and underscores, we can also use dot notation to access each column as a Series. In addition, this format will work across almost all parts of your data workflow, and is especially friendly to SQL.

Let's start by looking at all of our columns.

In [7]:
all_data.columns

Index(['BUDGET FISCAL YEAR', 'DEPARTMENT NAME', 'FUND NAME', 'ACCOUNT NAME',
       'ADOPTED BUDGET AMOUNT', 'TOTAL EXPENDITURES', 'BUDGET CHANGE AMOUNT',
       'BUDGET TRANSFER IN AMOUNT', 'BUDGET TRANSFER OUT AMOUNT',
       'TOTAL BUDGET', 'ENCUMBRANCE AMOUNT', 'PRE-ENCUMBRANCE AMOUNT',
       'BUDGET UNCOMMITTED AMOUNT', 'ACCOUNT GROUP NAME', 'FUND', 'ACCOUNT',
       'DEPARTMENT'],
      dtype='object')

We're aiming for `snake_case` here, which means we'll want only lowercase letters and underscores.

Let's start by just saving our lowercase strings to a new variable, `columns_lower`.

In [9]:
columns_lower = all_data.columns.str.lower()

As a next step, let's just replace the hyphens, overwriting our variable.

In [11]:
columns_lower = columns_lower.str.replace('-', '_')

Finally, we can replace our spaces with underscores as well.

In [13]:
columns_lower = columns_lower.str.replace(' ', '_')

Because we've maintained the order of our columns, we can safely overwrite the original columns in our DataFrame.

In [14]:
all_data.columns = columns_lower

Preview the first 3 rows to see that this worked.

In [16]:
all_data.head(3)

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
0,2019,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,2185782.0,750988.85,2000.0,0.0,413400.0,1774382.0,522198.0,0.0,467277.15,EXPENSES,100,003040,2
1,2019,AGING,HEALTH INS COUNS ADV (HICAP),FINANCIAL ALIGNMENT - NEW,0.0,,66184.0,0.0,0.0,66184.0,66184.0,0.0,0.0,,47Y,02RDD3,2
2,2019,AGING,OTHER PROGRAMS FOR THE AGING,ENROLLEE WAGES,0.0,1073105.46,1601346.0,0.0,0.0,1601346.0,0.0,0.0,528240.54,,410,021021,2
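
The three renaming steps can also be chained into a single expression. Here is a minimal sketch of that idea on a small made-up DataFrame (`toy` and its values are invented for illustration; only the column names are borrowed from our data). Passing `regex=False` is an assumption on our part that we want each pattern treated as a literal string.

```python
import pandas as pd

# Hypothetical two-column frame; the values are made up.
toy = pd.DataFrame({'BUDGET FISCAL YEAR': [2017, 2018],
                    'PRE-ENCUMBRANCE AMOUNT': [0.0, 12.5]})

# Lowercase first, then swap hyphens and spaces for underscores.
toy.columns = (toy.columns
               .str.lower()
               .str.replace('-', '_', regex=False)
               .str.replace(' ', '_', regex=False))

print(toy.columns.tolist())
```

Once the names are in `snake_case`, dot access such as `toy.budget_fiscal_year` works as well.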


## Masking

We're only interested in data from 2017 and 2018. Let's set up a unique mask for each of these years.

To do this, we'll just do a check for equality on our `budget_fiscal_year`.

In [19]:
mask_2017 = all_data.budget_fiscal_year == 2017
mask_2018 = all_data.budget_fiscal_year == 2018

We can now put these masks back into our DataFrame to look at only those rows for each year. Let's do this for each year and check the `shape` attribute so we can see how many rows we're selecting.

In [20]:
all_data[mask_2017].shape

(3593, 17)

In [21]:
all_data[mask_2018].shape

(3653, 17)

In [22]:
all_data[mask_2017].shape[0] + all_data[mask_2018].shape[0]

7246

We can also use the bitwise `or` operator `|` to select all those rows where either of these conditions is true. The number of rows here should equal 7246.

In [23]:
all_data[mask_2017 | mask_2018].shape

(7246, 17)
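
To see how the masks combine without loading the full file, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical). It also shows `isin`, an alternative we have not used above, which selects the same rows in one step.

```python
import pandas as pd

# Hypothetical data: one row each for 2016, 2017, 2019 and two rows for 2018.
toy = pd.DataFrame({'budget_fiscal_year': [2016, 2017, 2018, 2018, 2019],
                    'total_budget': [10.0, 20.0, 30.0, 40.0, 50.0]})

mask_2017 = toy.budget_fiscal_year == 2017
mask_2018 = toy.budget_fiscal_year == 2018

# Rows where either mask is True.
either_year = toy[mask_2017 | mask_2018]

# Alternative: test membership in a list of years.
same_rows = toy[toy.budget_fiscal_year.isin([2017, 2018])]

print(either_year.shape)
print(either_year.equals(same_rows))
```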

Because we know that this is the data we wish to work with for the remainder of our exploration, let's save this out to a new DataFrame `df`.

In [24]:
df = all_data[mask_2017 | mask_2018]

And let's take a sample of 10 rows to do a quick check that we haven't included any data from other years.

In [26]:
df.sample(10)

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
6436,2018,TRANSPORTATION,WEST LA TRANSP IMPROV & MITIGA,REIMBURSEMENT OF GENERAL FUND COSTS,210413.0,159865.3,0.0,0.0,0.0,210413.0,0.0,0.0,50547.7,,681,94P299,94
5694,2018,NON-DEPARTMENTAL - GENERAL CITY PURPOSES,GENERAL FUND (GENERAL BUDGET),COMMUNITY SERVICES DISTRICT 14,94533.0,321873.14,989688.21,0.0,5000.0,1079221.21,62094.83,0.0,695253.24,SPECIAL,100,000714,56
7233,2017,CITY PLANNING,CITY PLANNING SYSTEM DEVE FUND,GENERAL SERVICES DEPT,0.0,6720.55,6720.55,0.0,0.0,6720.55,0.0,0.0,0.0,,588,68N140,68
7586,2017,ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT,WORKFORCE INVEST ACT TRS,PERSONNEL,0.0,2600.0,2600.0,0.0,0.0,2600.0,0.0,0.0,0.0,,44A,22N166,22
6612,2018,WATER AND POWER,PWR SYST RB 2001 SERI C BD SER,INTEREST EXPENSE,0.0,17353.0,0.0,0.0,0.0,0.0,0.0,0.0,-17353.0,,J78,988210,98
9028,2017,NON-DEPARTMENTAL - CAPITAL FINANCE ADMINISTRATION,GENERAL FUND (GENERAL BUDGET),MICLA 2009-A (CAPITAL EQUIPMENT),7329813.0,7314945.13,0.0,0.0,0.0,7329813.0,0.0,0.0,14867.87,SPECIAL,100,000326,53
5045,2018,NON-DEPARTMENTAL - APPROPRIATIONS TO SPECIAL P...,"WSSRB CONSTRUCTION FUND, SERIES 2017-A (GREEN ...",HWRP SERVICE WATER FACILITY IMPROVEMENTS,440000.0,,-440000.0,0.0,0.0,0.0,0.0,0.0,0.0,SPECIAL,75N,50PDG6,50
8837,2017,NON-DEPARTMENTAL - APPROPRIATIONS TO SPECIAL P...,"WSSRB CONSTRUCTION FUND, SERIES 2017-A (GREEN ...",LAG WW CONTROL SYSTEM REPL,0.0,347830.04,347830.04,0.0,0.0,347830.04,0.0,0.0,0.0,,75N,50NE53,50
5075,2018,NON-DEPARTMENTAL - APPROPRIATIONS TO SPECIAL P...,ENGINEERING SPECIAL SERVICE FD,PW-STREET SERVICES,0.0,60870.0,60870.0,0.0,0.0,60870.0,0.0,0.0,0.0,,682,50P186,50
8294,2017,NEIGHBORHOOD EMPOWERMENT,DEPT OF NEIGHBORHOOD EMPOWERE,BOYLE HEIGHTS NC,0.0,42000.0,42000.0,0.0,0.0,42000.0,0.0,0.0,0.0,,44B,471038,47


## Reset Index

You'll note in the preview above that our indices are quite high. This index is not especially informative (it was generated by Pandas automatically upon import).

Personally, when my index doesn't correspond to a primary key, I prefer to work with a serial index starting at 0.

This method is also helpful for returning columns that you've used in a `groupby` statement back into your main DataFrame (more on this later).

Make sure to set the argument `drop=True` if you want to discard your old index (here, we desire this functionality).

In addition, once you've checked that your code is working, you should set `inplace=True` to persist these changes in your `df`.

In [31]:
df.reset_index(drop=True, inplace=True)
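
The effect of `drop=True` is easiest to see side by side. This is a minimal sketch on a made-up frame (`toy`, `subset`, `kept`, and `dropped` are hypothetical names), not our budget data.

```python
import pandas as pd

toy = pd.DataFrame({'year': [2016, 2017, 2018, 2019]})
subset = toy[toy.year >= 2018]            # filtering keeps the old labels 2 and 3

kept = subset.reset_index()               # old index is saved in a new 'index' column
dropped = subset.reset_index(drop=True)   # old index is discarded

print(subset.index.tolist())
print(kept.columns.tolist())
print(dropped.index.tolist())
```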

## Summary Stats

We've already looked at the shape of our data, but let's check out our `info` to see the types and make note of any missing values.

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7246 entries, 0 to 7245
Data columns (total 17 columns):
budget_fiscal_year            7246 non-null int64
department_name               7246 non-null object
fund_name                     7246 non-null object
account_name                  7242 non-null object
adopted_budget_amount         7246 non-null float64
total_expenditures            5106 non-null float64
budget_change_amount          7246 non-null float64
budget_transfer_in_amount     7246 non-null float64
budget_transfer_out_amount    7246 non-null float64
total_budget                  7246 non-null float64
encumbrance_amount            7246 non-null float64
pre_encumbrance_amount        7246 non-null float64
budget_uncommitted_amount     7246 non-null float64
account_group_name            2458 non-null object
fund                          7246 non-null object
account                       7246 non-null object
department                    7246 non-null int64
dtypes: float64(9),
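
Because `info` reports non-null counts, the number of missing values has to be worked out by subtraction. As a sketch of a more direct check, we can count them with `isna` followed by `sum`. The frame below is made up for illustration and only borrows two column names from our data.

```python
import pandas as pd

# Hypothetical frame with one missing name and two missing expenditures.
toy = pd.DataFrame({'account_name': ['TRAVEL', None, 'OVERTIME GENERAL'],
                    'total_expenditures': [100.0, None, None]})

# isna marks each missing cell as True; summing counts the Trues per column.
missing_per_column = toy.isna().sum()

print(missing_per_column)
```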

And we can look at our overall numeric summary statistics. (Don't forget to transpose to make these easy to read).

In [33]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
budget_fiscal_year,7246.0,2017.504,0.5000174,2017.0,2017.0,2018.0,2018.0,2018.0
adopted_budget_amount,7246.0,3807644.0,37207150.0,0.0,0.0,0.0,261208.0,1114645000.0
total_expenditures,5106.0,11899910.0,134873400.0,0.0,36292.92,200215.785,1478339.0,5256445000.0
budget_change_amount,7246.0,1085642.0,26430660.0,-60641220.0,0.0,2454.855,107702.2,1449055000.0
budget_transfer_in_amount,7246.0,119959.8,1696095.0,0.0,0.0,0.0,0.0,85940630.0
budget_transfer_out_amount,7246.0,119959.8,1888885.0,0.0,0.0,0.0,0.0,81720300.0
total_budget,7246.0,4893286.0,45468000.0,0.0,5756.0,98919.32,716220.5,1449055000.0
encumbrance_amount,7246.0,74174.92,621173.8,0.0,0.0,0.0,0.0,18554310.0
pre_encumbrance_amount,7246.0,4686.861,83939.23,0.0,0.0,0.0,0.0,4069569.0
budget_uncommitted_amount,7246.0,-3571651.0,105073100.0,-5256445000.0,0.0,0.0,52133.89,428909900.0


Is there anything of value you note here? Do these numbers provide insight into any of the questions we originally sought to answer?

## `groupby` and Aggregation

We're not actually interested in aggregate statistics calculated over the entire column. Rather, we want to identify groups.

When using `groupby`, you'll need to also apply an aggregation method. Some useful aggregation methods include:

| method | function |
| --- | --- |
| `.count` | Returns the count of non-null values in each group. |
| `.sum` | Returns the sum of all the rows in each group. |
| `.mean` | Returns the average of all the rows in each group. |

Let's start by just grouping by our `budget_fiscal_year` and calculating the mean. Transpose the result for easier interpretation.

In [37]:
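# numeric_only=True skips the text columns, which recent versions of Pandas will not average
df.groupby('budget_fiscal_year').mean(numeric_only=True).T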

budget_fiscal_year,2017,2018
adopted_budget_amount,3726947.0,3887015.0
total_expenditures,11667700.0,12134860.0
budget_change_amount,1095589.0,1075858.0
budget_transfer_in_amount,126353.2,113671.4
budget_transfer_out_amount,126353.2,113671.4
total_budget,4822536.0,4962873.0
encumbrance_amount,50326.87,97631.27
pre_encumbrance_amount,3210.761,6138.717
budget_uncommitted_amount,-3570947.0,-3572344.0
department,51.34929,48.45497


We can also use `value_counts` and `describe` with `groupby`, but I'd recommend you limit these to a single column.

Let's use `describe` on our `total_budget` grouped by `budget_fiscal_year`.

In [38]:
df.groupby('budget_fiscal_year')['total_budget'].describe().T

budget_fiscal_year,2017,2018
count,3593.0,3653.0
mean,4822536.0,4962873.0
std,45101110.0,45832060.0
min,0.0,0.0
25%,5250.0,6000.0
50%,88936.16,102055.1
75%,614408.8,800000.0
max,1447680000.0,1449055000.0


Now let's look at the `value_counts` of our `department_name` when grouped by year.

In [42]:
df.groupby('budget_fiscal_year')['department_name'].value_counts()

budget_fiscal_year  department_name                                             
2017                NON-DEPARTMENTAL - APPROPRIATIONS TO SPECIAL PURPOSE FUND       634
                    HOUSING AND COMMUNITY INVESTMENT DEPARTMENT                     279
                    RECREATION AND PARKS                                            269
                    TRANSPORTATION                                                  225
                    CULTURAL AFFAIRS                                                200
                    CITY ADMINISTRATIVE OFFICER                                     156
                    ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT                   148
                    CITY CLERK                                                      128
                    RECREATION AND PARKS - SPECIAL ACCOUNTS                         124
                    NEIGHBORHOOD EMPOWERMENT                                        118
                    MAYOR              

It's difficult to garner any insights from this preview.

Instead, let's use a pivot table to get the `count` of line items for each `department_name` in each `budget_fiscal_year`.

In [58]:
df.pivot_table(values='fund_name', index=['department_name'], columns=['budget_fiscal_year'], aggfunc='count')

budget_fiscal_year,2017,2018
department_name,Unnamed: 1_level_1,Unnamed: 2_level_1
AGING,37.0,41.0
AIRPORTS,17.0,16.0
ANIMAL SERVICES,29.0,30.0
BUILDING AND SAFETY,45.0,59.0
CANNABIS REGULATION,,9.0
CITY ADMINISTRATIVE OFFICER,156.0,156.0
CITY ATTORNEY,51.0,56.0
CITY CLERK,128.0,212.0
CITY EMPLOYEES RETIREMENT SYSTEM,12.0,12.0
CITY ETHICS COMMISSION,23.0,10.0
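
A pivot table like this one is a reshaped `groupby`. The sketch below, on made-up data (the `toy` frame and its fund names are hypothetical), builds the same counts both ways and then uses `idxmax` to pick out the department with the most entries in each year.

```python
import pandas as pd

toy = pd.DataFrame({
    'budget_fiscal_year': [2017, 2017, 2017, 2018, 2018, 2018],
    'department_name': ['AGING', 'ZOO', 'ZOO', 'AGING', 'AGING', 'ZOO'],
    'fund_name': ['A', 'B', 'C', 'D', 'E', 'F'],
})

# One row per department, one column per year, counting line items.
counts = toy.pivot_table(values='fund_name', index='department_name',
                         columns='budget_fiscal_year', aggfunc='count')

# The same table via groupby: count per (department, year), then move year to the columns.
via_groupby = (toy.groupby(['department_name', 'budget_fiscal_year'])['fund_name']
                  .count()
                  .unstack('budget_fiscal_year'))

# For each year column, the department with the largest count.
print(counts.idxmax())
```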


In [52]:
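# After `count`, the grouping keys live in the index, so pivoting on them directly raises a KeyError.
# Resetting the index first turns them back into ordinary columns.
df.groupby(['department_name', 'budget_fiscal_year'])[['fund_name']].count().reset_index().pivot(index='department_name', columns='budget_fiscal_year', values='fund_name')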


In [11]:
df.loc[10:15, 'account_name']

10                                TRAVEL
11                                 AGING
12          CONGREGATE MEALS FOR SENIORS
13      OTO NSIP CONGREGATE MEALS III C1
14           FINANCIAL ALIGNMENT PROGRAM
15    ELDER ABUSE PREV PROGR FOR SENIORS
Name: account_name, dtype: object

We can also provide a list of specific rows we want to include.

In [12]:
df.loc[[10,12,27,45,500], 'account_name']

10                             TRAVEL
12       CONGREGATE MEALS FOR SENIORS
27                      HICAP 3 MONTH
45                   SALARIES GENERAL
500    WESTWOOD  NEIGHBORHOOD COUNCIL
Name: account_name, dtype: object

#### Rows

To return an entire row, just use `.loc` with the index label for that row.

In [13]:
df.loc[0]

budget_fiscal_year                                     2018
department_name                                       AGING
fund_name                     GENERAL FUND (GENERAL BUDGET)
account_name                           CONTRACTUAL SERVICES
adopted_budget_amount                           2.22238e+06
total_expenditures                              1.60816e+06
budget_change_amount                                   9000
budget_transfer_in_amount                                 0
budget_transfer_out_amount                           453500
total_budget                                    1.77788e+06
encumbrance_amount                                    93331
pre_encumbrance_amount                                    0
budget_uncommitted_amount                             76394
account_group_name                                 EXPENSES
fund                                                    100
account                                              003040
department                              

Similar to how we specified rows for our column selections, we can specify columns for our row selections.

In [14]:
df.loc[0, ['fund_name', 'account_name', 'total_budget']]

fund_name       GENERAL FUND (GENERAL BUDGET)
account_name             CONTRACTUAL SERVICES
total_budget                      1.77788e+06
Name: 0, dtype: object

We can use the same `:` notation to select a range of our columns (note that the columns returned are dependent upon the order of columns in your DataFrame).

In [15]:
df.loc[0, 'fund_name':'total_budget']

fund_name                     GENERAL FUND (GENERAL BUDGET)
account_name                           CONTRACTUAL SERVICES
adopted_budget_amount                           2.22238e+06
total_expenditures                              1.60816e+06
budget_change_amount                                   9000
budget_transfer_in_amount                                 0
budget_transfer_out_amount                           453500
total_budget                                    1.77788e+06
Name: 0, dtype: object
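
Two details of label slicing are worth seeing on a small example: both endpoints are included, and the result depends on column order. This is a sketch on a made-up frame (`toy` and `reordered` are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({'fund_name': ['GENERAL', 'TRUST'],
                    'account_name': ['TRAVEL', 'ZOO'],
                    'total_budget': [10.0, 20.0],
                    'fund': ['100', '40E']})

# Label slices include both the start and the stop label.
by_label = toy.loc[0, 'fund_name':'total_budget']
print(by_label.index.tolist())

# The same slice on a reordered frame returns different columns.
reordered = toy[['fund_name', 'total_budget', 'account_name', 'fund']]
print(reordered.loc[0, 'fund_name':'total_budget'].index.tolist())
```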

#### Quick Note on `.iloc`

Pandas also provides an `.iloc` selection method. This uses the numeric index of rows and columns to handle selection. While this can be useful in some applications, most times in Pandas you will have informative column and index labels that you would rather select on.

When our index is a serial integer range (as here), `.loc` and `.iloc` have similar behavior when selecting rows. `.iloc` will not allow you to pass column names, however.

In [16]:
df.iloc[0]

budget_fiscal_year                                     2018
department_name                                       AGING
fund_name                     GENERAL FUND (GENERAL BUDGET)
account_name                           CONTRACTUAL SERVICES
adopted_budget_amount                           2.22238e+06
total_expenditures                              1.60816e+06
budget_change_amount                                   9000
budget_transfer_in_amount                                 0
budget_transfer_out_amount                           453500
total_budget                                    1.77788e+06
encumbrance_amount                                    93331
pre_encumbrance_amount                                    0
budget_uncommitted_amount                             76394
account_group_name                                 EXPENSES
fund                                                    100
account                                              003040
department                              

In [17]:
df.iloc[0:5, 5]

0    1608157.04
1      87876.00
2     292338.00
3    2419162.00
4      15943.36
Name: total_expenditures, dtype: float64

In [18]:
df.iloc[0, 2:7]

fund_name                GENERAL FUND (GENERAL BUDGET)
account_name                      CONTRACTUAL SERVICES
adopted_budget_amount                      2.22238e+06
total_expenditures                         1.60816e+06
budget_change_amount                              9000
Name: 0, dtype: object
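
The difference between `.loc` and `.iloc` is clearer when the index is not a serial range. Here is a minimal sketch with a made-up frame whose index labels are 10, 20, 30, and 40 (all names and values are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({'account_name': ['TRAVEL', 'AGING', 'ZOO', 'PRINTING']},
                   index=[10, 20, 30, 40])

# .loc slices by label and includes the stop label.
by_label = toy.loc[10:30, 'account_name'].tolist()

# .iloc slices by position and excludes the stop position.
by_position = toy.iloc[0:2, 0].tolist()

print(by_label)
print(by_position)
```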

As we proceed forward and insert, sort, and remove rows and columns, it will become apparent why using specific labels makes for clearer, more robust code for selection and manipulation.

### `DataFrames`

DataFrames are tables, essentially, or a collection of rows and columns. When we access a DataFrame, the returned preview will be pretty printed.

**NOTE**: We'll use `head` on all of our DataFrame operations to only show the first 5 rows.

To show an entire DataFrame, just type in the name and evaluate the code.

In [19]:
df.head()

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
0,2018,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,2222382.0,1608157.04,9000.0,0.0,453500.0,1777882.0,93331.0,0.0,76393.96,EXPENSES,100,003040,2
1,2018,AGING,TITLE VII OLDER AMERICANS ACT,OMBUDSMAN VII A PROGRAM,0.0,87876.0,87876.0,0.0,0.0,87876.0,0.0,0.0,0.0,,564,02PB01,2
2,2018,AGING,SENIOR HUMAN SERVICES PROGRAM,EVIDENCE BASED PROGRAMS,0.0,292338.0,303447.0,0.0,0.0,303447.0,11109.0,0.0,0.0,,42J,02R340,2
3,2018,AGING,AREA PLAN FOR THE AGING TIT 7,HOME DELIVERED MEALS FOR SENIORS,0.0,2419162.0,2543845.0,0.0,0.0,2543845.0,36122.0,0.0,88561.0,,395,02PQ04,2
4,2018,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL,3900.0,15943.36,0.0,13300.0,0.0,17200.0,0.0,0.0,1256.64,SALARIES AND BENEFITS,100,001090,2


We can use double square bracket notation to select a subset of columns.

In [20]:
df[['department_name', 'fund_name', 'account_name']].head()

Unnamed: 0,department_name,fund_name,account_name
0,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES
1,AGING,TITLE VII OLDER AMERICANS ACT,OMBUDSMAN VII A PROGRAM
2,AGING,SENIOR HUMAN SERVICES PROGRAM,EVIDENCE BASED PROGRAMS
3,AGING,AREA PLAN FOR THE AGING TIT 7,HOME DELIVERED MEALS FOR SENIORS
4,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL


We could have also selected a single column in this way, still returning a DataFrame.

**NOTE**: While here we look at a single column, we still have a DataFrame rather than a Series. Certain Pandas operations require Series and will error out if we try to pass a DataFrame instead.

In [21]:
df[['fund_name']].head()

Unnamed: 0,fund_name
0,GENERAL FUND (GENERAL BUDGET)
1,TITLE VII OLDER AMERICANS ACT
2,SENIOR HUMAN SERVICES PROGRAM
3,AREA PLAN FOR THE AGING TIT 7
4,GENERAL FUND (GENERAL BUDGET)
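
To check which kind of object a selection returns, we can inspect its type and shape. A short sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'fund_name': ['GENERAL', 'TRUST'],
                    'total_budget': [10.0, 20.0]})

as_series = toy['fund_name']     # single brackets: a Series
as_frame = toy[['fund_name']]    # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```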


We can nest brackets within our `.loc` notation to return a DataFrame instead of a Series.

In [22]:
df.loc[:, ['department_name','total_expenditures']].head()

Unnamed: 0,department_name,total_expenditures
0,AGING,1608157.04
1,AGING,87876.0
2,AGING,292338.0
3,AGING,2419162.0
4,AGING,15943.36


This is especially useful when we're interested in looking at multiple columns for a range of rows.

In [23]:
df.loc[0:10, ['department_name','total_expenditures']]

Unnamed: 0,department_name,total_expenditures
0,AGING,1608157.04
1,AGING,87876.0
2,AGING,292338.0
3,AGING,2419162.0
4,AGING,15943.36
5,AGING,256193.0
6,AGING,218106.0
7,AGING,2740986.0
8,AGING,4393789.0
9,AGING,1444876.4


Or when we're interested in selecting specific rows with multiple columns.

In [24]:
df.loc[[0,10,20,30,40], ['fund_name', 'adopted_budget_amount']]

Unnamed: 0,fund_name,adopted_budget_amount
0,GENERAL FUND (GENERAL BUDGET),2222382.0
10,GENERAL FUND (GENERAL BUDGET),8650.0
20,AREA PLAN FOR THE AGING TIT 7,0.0
30,AREA PLAN FOR THE AGING TIT 7,0.0
40,GENERAL FUND (GENERAL BUDGET),222431.0


If we nest brackets on just our rows, we'll again get a DataFrame rather than a Series.

In [25]:
df.loc[[0]]

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
0,2018,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,2222382.0,1608157.04,9000.0,0.0,453500.0,1777882.0,93331.0,0.0,76393.96,EXPENSES,100,3040,2


This can be used to select one or many rows, with one or many columns.

In [26]:
df.loc[[50,51,52,53,54,55]]

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
50,2018,AIRPORTS,AIRPORT REVENUE FUND-ONTARIO,SALARIES GENERAL,0.0,5909209.0,0.0,0.0,0.0,0.0,0.0,0.0,-5909209.0,,723,41010,4
51,2018,AIRPORTS,AIRPORT REVENUE,HIRING HALL-OVERTIME,0.0,88645.08,0.0,0.0,0.0,0.0,0.0,0.0,-88645.08,,700,41190,4
52,2018,AIRPORTS,AIRPORT REVENUE,OTHER EXPENDITURES,0.0,1943900000.0,0.0,0.0,0.0,0.0,0.0,0.0,-1943900000.0,,700,41000,4
53,2018,AIRPORTS,DEA FEDERAL FORFEIT PROP-LAWA/,OTHER EXPENDITURES,0.0,182140.8,0.0,0.0,0.0,0.0,0.0,0.0,-182140.8,,74L,41000,4
54,2018,AIRPORTS,AIRPORT REVENUE FUND-ONTARIO,OTHER EXPENDITURES,0.0,8205629.0,0.0,0.0,0.0,0.0,0.0,0.0,-8205629.0,,723,41000,4
55,2018,AIRPORTS,PASSENGER FACILITY CHARGE -LAX,OTHER EXPENDITURES,0.0,273795500.0,0.0,0.0,0.0,0.0,0.0,0.0,-273795500.0,,71R,41000,4


#### A Note on Transpose

DataFrames contain a transpose attribute, `.T`. This doesn't change the underlying data, but just changes the orientation of the returned preview, switching the location of rows with columns.

In [27]:
df.T.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3643,3644,3645,3646,3647,3648,3649,3650,3651,3652
budget_fiscal_year,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,...,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018
department_name,AGING,AGING,AGING,AGING,AGING,AGING,AGING,AGING,AGING,AGING,...,ZOO,ZOO,ZOO,ZOO,ZOO,ZOO,ZOO,ZOO,ZOO,ZOO
fund_name,GENERAL FUND (GENERAL BUDGET),TITLE VII OLDER AMERICANS ACT,SENIOR HUMAN SERVICES PROGRAM,AREA PLAN FOR THE AGING TIT 7,GENERAL FUND (GENERAL BUDGET),OMBUDSMAN INITIATIVE PROGRAM F,AREA PLAN FOR THE AGING TIT 7,AREA PLAN FOR THE AGING TIT 7,AREA PLAN FOR THE AGING TIT 7,OTHER PROGRAMS FOR THE AGING,...,GENERAL FUND (GENERAL BUDGET),GENERAL FUND (GENERAL BUDGET),ZOO ENTERPRISE TRUST FUND,GENERAL FUND (GENERAL BUDGET),GENERAL FUND (GENERAL BUDGET),ZOO ENTERPRISE TRUST FUND,ZOO ENTERPRISE TRUST FUND,ZOO ENTERPRISE TRUST FUND,GENERAL FUND (GENERAL BUDGET),GENERAL FUND (GENERAL BUDGET)
account_name,CONTRACTUAL SERVICES,OMBUDSMAN VII A PROGRAM,EVIDENCE BASED PROGRAMS,HOME DELIVERED MEALS FOR SENIORS,OVERTIME GENERAL,STATE HEALTH FACILITIES CITATION PENALTIES,PREVENTIVE HEALTH III D,HOME DELIVERED MEALS III C2,CONGREGATE MEALS III C1,ENROLLEE WAGES,...,SALARIES GENERAL,FEED AND GRAIN,PUBLIC WORKS - ENGINEERING,PRINTING AND BINDING,OPERATING SUPPLIES,ZOO,ZOO PROGRAMS & OPERATIONS,ZOO WASTEWATER FACILITY,OVERTIME GENERAL,VETERINARY SUPPLIES & EXPENSE
adopted_budget_amount,2.22238e+06,0,0,0,3900,0,0,0,0,0,...,1.60662e+07,914648,0,70000,130000,2.20124e+07,0,0,135164,400000


We can apply this to any of our above DataFrame access methods if we prefer to view our data in this transposed configuration.

In [28]:
df.loc[[0,10,20,30,40], ['fund_name', 'adopted_budget_amount']].T

Unnamed: 0,0,10,20,30,40
fund_name,GENERAL FUND (GENERAL BUDGET),GENERAL FUND (GENERAL BUDGET),AREA PLAN FOR THE AGING TIT 7,AREA PLAN FOR THE AGING TIT 7,GENERAL FUND (GENERAL BUDGET)
adopted_budget_amount,2.22238e+06,8650,0,0,222431
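
A quick way to confirm that `.T` leaves the original untouched is to compare shapes before and after. This sketch uses a made-up three-row frame (`toy` and `flipped` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame({'fund_name': ['GENERAL', 'TRUST', 'GENERAL'],
                    'total_budget': [10.0, 20.0, 30.0]})

flipped = toy.T   # a new object; toy itself is not modified

print(toy.shape)
print(flipped.shape)
print(flipped.index.tolist())
```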


### Values

Generally, we'll be approaching our data in aggregate or filtering by some condition. Sometimes you will want to access an individual value.

My preferred method is to again use `.loc`.

In [29]:
df.loc[0, 'fund_name']

'GENERAL FUND (GENERAL BUDGET)'

There is a more specific method of access, `.at`, but this _only_ allows you to access a single value, and so is less generalizable than `.loc`.

In [30]:
df.at[0, 'fund_name']

'GENERAL FUND (GENERAL BUDGET)'

You may also see the following format in example code online. We won't go into the details of it now, but essentially we can think of this as selecting a single column AFTER selecting a single row.

In [31]:
df.loc[0]['fund_name']

'GENERAL FUND (GENERAL BUDGET)'

We can also apply this in the opposite order, selecting a single row AFTER selecting a single column.

In [32]:
df['fund_name'].loc[0]

'GENERAL FUND (GENERAL BUDGET)'
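
As a small sketch of how these access styles relate, the made-up frame below (`toy` is hypothetical) reads one value three ways and then writes a value with a single `.loc` call.

```python
import pandas as pd

toy = pd.DataFrame({'fund_name': ['GENERAL', 'TRUST'],
                    'total_budget': [10.0, 20.0]})

via_loc = toy.loc[0, 'fund_name']
via_at = toy.at[0, 'fund_name']
via_chain = toy.loc[0]['fund_name']   # two steps: row first, then column

print(via_loc == via_at == via_chain)

# A single .loc call also works for assignment.
toy.loc[1, 'total_budget'] = 25.0
print(toy.loc[1, 'total_budget'])
```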

#### Pro-tip

Use standard `.loc` notation regardless of whether you are trying to access a DataFrame, Series, or value. It is the most explicit and robust method for selection, and will make your code easy to read and understand.

## Meta-Data Exploration

Now that we understand some of the basics of navigating our data in Pandas, we'll learn some best practices for learning about _what_ our data is.

### Shape

One of the most important things to always check is the `.shape` of our data. This will return the size of our data as a tuple of `(rows, columns)`.

In [33]:
df.shape

(3653, 17)

Here we have 3653 rows and 17 columns.

### Data Types

We can see the data type of each column with the `.dtypes` attribute.

In [34]:
df.dtypes

budget_fiscal_year              int64
department_name                object
fund_name                      object
account_name                   object
adopted_budget_amount         float64
total_expenditures            float64
budget_change_amount          float64
budget_transfer_in_amount     float64
budget_transfer_out_amount    float64
total_budget                  float64
encumbrance_amount            float64
pre_encumbrance_amount        float64
budget_uncommitted_amount     float64
account_group_name             object
fund                           object
account                        object
department                      int64
dtype: object

### Info

Using the `.info` method will give us more detail. This includes:

- A description of our index (here a `RangeIndex` with 3653 entries)
- The total number of columns
- The name, data type, and count of non-null values for each column (this will be truncated in DataFrames with MANY columns).
- The data types present in our DataFrame and the count of columns of each type
- The memory usage of our current DataFrame

Looking at the `.info` of your data is always a great place to start your data exploration.

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3653 entries, 0 to 3652
Data columns (total 17 columns):
budget_fiscal_year            3653 non-null int64
department_name               3653 non-null object
fund_name                     3653 non-null object
account_name                  3651 non-null object
adopted_budget_amount         3653 non-null float64
total_expenditures            2538 non-null float64
budget_change_amount          3653 non-null float64
budget_transfer_in_amount     3653 non-null float64
budget_transfer_out_amount    3653 non-null float64
total_budget                  3653 non-null float64
encumbrance_amount            3653 non-null float64
pre_encumbrance_amount        3653 non-null float64
budget_uncommitted_amount     3653 non-null float64
account_group_name            1375 non-null object
fund                          3653 non-null object
account                       3653 non-null object
department                    3653 non-null int64
dtypes: float64(9),

### Index

Both our rows and columns have an associated index. For our rows, this is the `.index`.

In [36]:
df.index

RangeIndex(start=0, stop=3653, step=1)

Here we have a serial integer range as our index, starting at 0, stopping at 3653 (exclusive), and incrementing by 1 at each row.

### Columns

We access our column index with `.columns`.

In [37]:
df.columns

Index(['budget_fiscal_year', 'department_name', 'fund_name', 'account_name',
       'adopted_budget_amount', 'total_expenditures', 'budget_change_amount',
       'budget_transfer_in_amount', 'budget_transfer_out_amount',
       'total_budget', 'encumbrance_amount', 'pre_encumbrance_amount',
       'budget_uncommitted_amount', 'account_group_name', 'fund', 'account',
       'department'],
      dtype='object')

These are the names of each of the columns in our DataFrame. We can generally treat this object as a list (although sometimes we'll want to explicitly cast it as such for some operations).

To force an index to a list, we can use `.tolist` or just call `list` on it.

In [38]:
df.columns.tolist()

['budget_fiscal_year',
 'department_name',
 'fund_name',
 'account_name',
 'adopted_budget_amount',
 'total_expenditures',
 'budget_change_amount',
 'budget_transfer_in_amount',
 'budget_transfer_out_amount',
 'total_budget',
 'encumbrance_amount',
 'pre_encumbrance_amount',
 'budget_uncommitted_amount',
 'account_group_name',
 'fund',
 'account',
 'department']

In [39]:
list(df.columns)

['budget_fiscal_year',
 'department_name',
 'fund_name',
 'account_name',
 'adopted_budget_amount',
 'total_expenditures',
 'budget_change_amount',
 'budget_transfer_in_amount',
 'budget_transfer_out_amount',
 'total_budget',
 'encumbrance_amount',
 'pre_encumbrance_amount',
 'budget_uncommitted_amount',
 'account_group_name',
 'fund',
 'account',
 'department']

## Methods and Attributes

By now, you're probably wondering:

> "Why do some of these things have `()`, some `[]`, and some no punctuation after the words?"

Great question!

If you've tried to explore Pandas independently, you may have given up in frustration, unable to troubleshoot the errors that you were getting trying to implement the solutions you found on StackOverflow. 

**You're not alone.**

It will take time to familiarize yourself with the syntax for interacting with the Pandas application programming interface (API). In my experience, the biggest difficulty is learning that Pandas is _always trying to do the right thing_. Or rather, it's trying to do the thing it _thinks_ you might want to do.

Pandas was developed to bring the ease of data exploration and manipulation found in Excel, SQL, and R DataFrames into the Python environment. As such, there is a _ton_ of desired functionality that has been implemented by the open source community. Hopefully the following rules can help you begin to organize Pandas syntax in your mind. [(The official Pandas docs are always a good place to inform yourself)](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html).

### 1. When Accessing a Static Attribute, No Punctuation Is Needed

A DataFrame is a `Class` object. You may or may not have worked with classes before; that's okay! An important thing to understand about a class is that it can store already computed static values (or attributes) that don't actually transform or operate on the data in any way at time of access.

Here are some examples we've seen so far:

| Syntax | Attribute accessed |
| --- | --- |
| `.shape` | A tuple of the shape of the data, rows by columns |
| `.index` | The index of the data, as an Index object |
| `.columns` | The columns of the data, as an Index object |
| `.T` | The transpose of the data, with rows as columns and columns as rows |
| `.some_specific_column_name` e.g., `.fund_name` | A column series |

In each of these examples, the returned attribute is just some aspect of the defined data that you're accessing. This will make a bit more sense after we consider...

### 2. When Modifying Data or Performing Calculations, Use `()`

Classes also have associated methods. These are the same as functions, but can generally be thought of as having pre-defined arguments that act upon the data in the class. Thus far, most of the methods that we've invoked have _modified a preview of our data_.

Here are some examples we've seen so far:

| Syntax | Method applied |
| --- | --- |
| `.head()` | Limit returned view to first 5 rows |
| `.sample()` | Return a random sample row from our data |
| `.tolist()` | Return data as a list object |
| `.info()` | Return a summary of indices, columns, dtypes, nulls, and size |

Some other methods we'll see shortly extend common operations you may have encountered in SQL or numpy:

| Syntax | Method applied |
| --- | --- |
| `.groupby()` | Group by one or more columns |
| `.count()` | Provide an aggregate count of rows |
| `.mean()` | Calculate the mean of one or more columns |
| `.join()` | Join a DataFrame with another DataFrame on some condition |

Pandas methods should always be used, when possible, as they're designed to be highly optimized for calculations, sorting, and aggregation. We'll learn many of these today, but there are far too many to cover in any one lesson (I still find new, useful methods regularly as I complete projects).
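To make the difference between rules 1 and 2 concrete, here's a small sketch on a made-up two-row DataFrame (the `toy` frame is an assumption for illustration only). It also shows what happens when the parentheses are left off a method.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"fund": ["100", "564"], "total_budget": [1500.0, 250.0]})

shape = toy.shape          # attribute: no parentheses
first_row = toy.head(1)    # method: parentheses, with optional arguments inside

# Forgetting the parentheses does not raise an error; it hands back
# the method object itself rather than the result of calling it.
not_called = toy.head
print(callable(not_called))
```

If a cell prints something like `<bound method ...>` instead of data, missing parentheses are the usual culprit.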

**But what about those pesky square brackets?**

### 3. When Filtering or Accessing Data, Use `[]`

This can be a conceptual hurdle for some, but understanding square bracket notation really boils down to two main points:

1. DataFrames are built on top of `numpy` arrays and extend much of the `numpy` notation.
2. DataFrames can be thought of as a dictionary of `Series` and use dictionary indexing notation.

Of course, neither of these concepts is helpful if you aren't familiar with keying into dictionaries and indexing in numpy.

The most important things to keep in mind:

1. Data is accessed by row(s) and then column(s)
    - `df.loc[row]`
    - `df.loc[row, column]`
    - `df.loc[[row1, row2], [column1, column2]]`
2. If no rows are passed, Pandas assumes you are trying to index into your columns
    - `df[column]`
    - `df[[column1, column2]]`
    - `df.some_column_name` (not bracket notation, but the same concept)
3. `:` can be used to select a range
    - `df.loc[:, column]`
    - `df.loc[0:5, column]`
    - `df.loc[0:5, column1:column5]`
4. Double brackets always return a DataFrame.
    - `df[[column]]`
    - `df[[column1, column2]]`
    - `df.loc[[row1, row2], [column1, column2]]`

**This is a lot of information, and you're not expected to remember all of this right now**. One of the nice things about coding in Jupyter is that many of our error messages are informative, and we can often quickly add the punctuation needed to revise our code and get it to run. **The only way to learn is to try and fail**.
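The rules above can be sketched on a tiny made-up DataFrame. The column names mirror ours, but the rows and values here are assumptions for illustration only.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "department_name": ["AGING", "ZOO", "POLICE"],
        "fund": ["100", "40E", "100"],
        "total_budget": [1000.0, 2500.0, 4000.0],
    }
)

one_col = toy["fund"]               # single brackets -> Series
one_col_df = toy[["fund"]]          # double brackets -> DataFrame
cell = toy.loc[1, "total_budget"]   # row label, then column label -> single value

# Label slices with .loc include BOTH endpoints, unlike Python list slices.
block = toy.loc[0:1, "department_name":"fund"]
print(block)
```

Note that `toy.loc[0:1]` returns two rows, which is why `df.loc[0:5, ...]` returns six.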


In [40]:
df.loc[0:5, 'department_name':'account_name']

Unnamed: 0,department_name,fund_name,account_name
0,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES
1,AGING,TITLE VII OLDER AMERICANS ACT,OMBUDSMAN VII A PROGRAM
2,AGING,SENIOR HUMAN SERVICES PROGRAM,EVIDENCE BASED PROGRAMS
3,AGING,AREA PLAN FOR THE AGING TIT 7,HOME DELIVERED MEALS FOR SENIORS
4,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL
5,AGING,OMBUDSMAN INITIATIVE PROGRAM F,STATE HEALTH FACILITIES CITATION PENALTIES


## Categorical Data

As we've already seen, Pandas allows you to add a number of different kinds of data into the same DataFrame. It has a suite of methods specifically built out for dealing with categorical data.

By default, Pandas will load any column that contains letters as an `object` type.

You can use the `select_dtypes` method to return a view of your DataFrame with only the specified type present.

In [41]:
df.select_dtypes('object').head()

Unnamed: 0,department_name,fund_name,account_name,account_group_name,fund,account
0,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,EXPENSES,100,003040
1,AGING,TITLE VII OLDER AMERICANS ACT,OMBUDSMAN VII A PROGRAM,,564,02PB01
2,AGING,SENIOR HUMAN SERVICES PROGRAM,EVIDENCE BASED PROGRAMS,,42J,02R340
3,AGING,AREA PLAN FOR THE AGING TIT 7,HOME DELIVERED MEALS FOR SENIORS,,395,02PQ04
4,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL,SALARIES AND BENEFITS,100,001090


I like to save the columns object out so that I can use this to easily operate on all my categorical columns.

Note that when we use a list of column names, the list itself contains a set of square brackets, so we can pass it back to our DataFrame to return a DataFrame. An Index of column names, like the `cat_cols` object saved here, behaves the same way.

In [42]:
cat_cols = df.select_dtypes('object').columns
df[cat_cols].head()

Unnamed: 0,department_name,fund_name,account_name,account_group_name,fund,account
0,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,EXPENSES,100,003040
1,AGING,TITLE VII OLDER AMERICANS ACT,OMBUDSMAN VII A PROGRAM,,564,02PB01
2,AGING,SENIOR HUMAN SERVICES PROGRAM,EVIDENCE BASED PROGRAMS,,42J,02R340
3,AGING,AREA PLAN FOR THE AGING TIT 7,HOME DELIVERED MEALS FOR SENIORS,,395,02PQ04
4,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL,SALARIES AND BENEFITS,100,001090


### Summary Statistics

There are limited summary statistics to perform on categorical values. Namely we look at total unique values, total count for each value, and the distribution of our total counts.

The method `.nunique` will give us the number of unique values in each column.

In [43]:
df[cat_cols].nunique()

department_name         54
fund_name              481
account_name          2057
account_group_name       4
fund                   481
account               2347
dtype: int64

We can then call `.unique` on a specific column to get an array of the unique values it contains.

In [44]:
df['account_group_name'].unique()

array(['EXPENSES', nan, 'SALARIES AND BENEFITS', 'SPECIAL', 'EQUIPMENT'],
      dtype=object)

Note that by default, `.nunique` will ignore null values, while `.unique` will return these.
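Here's a minimal sketch of that difference on a made-up Series (the values are assumptions for illustration). It also shows the `dropna=False` keyword, which asks `.nunique` to count the null as one more value.

```python
import numpy as np
import pandas as pd

# Toy Series, made up for illustration: two distinct labels plus two nulls.
groups = pd.Series(
    ["EXPENSES", np.nan, "SPECIAL", "EXPENSES", np.nan], dtype=object
)

n_default = groups.nunique()                 # nulls ignored
n_with_null = groups.nunique(dropna=False)   # null counted as one more value
values = groups.unique()                     # array that includes the null

print(n_default, n_with_null, values)
```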

More often, we will use `.value_counts` to return a count of each unique value in a column.

In [45]:
df['department_name'].value_counts()

NON-DEPARTMENTAL - APPROPRIATIONS TO SPECIAL PURPOSE FUND       680
HOUSING AND COMMUNITY INVESTMENT DEPARTMENT                     288
TRANSPORTATION                                                  267
CULTURAL AFFAIRS                                                229
ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT                   215
CITY CLERK                                                      212
CITY ADMINISTRATIVE OFFICER                                     156
RECREATION AND PARKS - SPECIAL ACCOUNTS                         124
WATER AND POWER                                                 120
NON-DEPARTMENTAL - GENERAL CITY PURPOSES                         99
RECREATION AND PARKS                                             92
MAYOR                                                            88
POLICE                                                           82
NON-DEPARTMENTAL - GENERAL                                       75
NON-DEPARTMENTAL - CAPITAL IMPROVEMENT EXPENSE P

By default, these will be returned in descending order.

Note that `.value_counts` also ignores missing values. We can override this behavior (in many Pandas methods) by setting `dropna=False`.

In [46]:
df['account_group_name'].value_counts(dropna=False)

NaN                      2278
SPECIAL                   931
EXPENSES                  287
SALARIES AND BENEFITS     143
EQUIPMENT                  14
Name: account_group_name, dtype: int64

The high number of missing values suggests that this is an optional field.

### String Operations

Pandas has great functionality for applying string methods to text data.

Let's look at our categorical data again.

In [47]:
df[cat_cols].head()

Unnamed: 0,department_name,fund_name,account_name,account_group_name,fund,account
0,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,EXPENSES,100,003040
1,AGING,TITLE VII OLDER AMERICANS ACT,OMBUDSMAN VII A PROGRAM,,564,02PB01
2,AGING,SENIOR HUMAN SERVICES PROGRAM,EVIDENCE BASED PROGRAMS,,42J,02R340
3,AGING,AREA PLAN FOR THE AGING TIT 7,HOME DELIVERED MEALS FOR SENIORS,,395,02PQ04
4,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL,SALARIES AND BENEFITS,100,001090


I don't love that the department names are in all caps. Changing this is as simple as accessing the Series and then invoking `.str.capitalize`. You'll find easy application of your favorite string methods here, alongside a number of Pandas-specific string methods.

This will return a copy of the Series with the changes applied, but nothing will have changed in the original data.

You can chain methods easily in Pandas. Because the returned object is a Series, we can use `.value_counts` to see unique values and counts with this new formatting.

In [48]:
df['department_name'].str.capitalize().value_counts()

Non-departmental - appropriations to special purpose fund       680
Housing and community investment department                     288
Transportation                                                  267
Cultural affairs                                                229
Economic and workforce development department                   215
City clerk                                                      212
City administrative officer                                     156
Recreation and parks - special accounts                         124
Water and power                                                 120
Non-departmental - general city purposes                         99
Recreation and parks                                             92
Mayor                                                            88
Police                                                           82
Non-departmental - general                                       75
Non-departmental - capital improvement expense p

Let's go ahead and save a similar formatting change for the `account_group_name` field.

Just do this by assigning right back into the Series in the DataFrame.

In [49]:
df['account_group_name'] = df['account_group_name'].str.capitalize()
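To see why the assignment is needed, here's a small sketch on a made-up column (the `toy` frame is an assumption for illustration). The `.str` call builds a new Series and leaves missing values as missing; the original only changes once we assign back.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "account_group_name": pd.Series(
            ["EXPENSES", np.nan, "SALARIES AND BENEFITS"], dtype=object
        )
    }
)

preview = toy["account_group_name"].str.capitalize()   # new Series
unchanged = toy["account_group_name"].tolist()          # original is still upper case

toy["account_group_name"] = preview                     # assignment makes it stick
print(toy)
```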

## Numeric Data

There's even greater functionality for numeric data. A great place to start is to look at summary statistics.

### Summary Statistics

The Pandas `.describe` method will automatically return most of the desired summary stats for all numeric fields.

In [50]:
df.describe()

Unnamed: 0,budget_fiscal_year,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,department
count,3653.0,3653.0,2538.0,3653.0,3653.0,3653.0,3653.0,3653.0,3653.0,3653.0,3653.0
mean,2018.0,3887015.0,12134860.0,1075858.0,113671.4,113671.4,4962873.0,97631.27,6138.717,-3572344.0,48.454969
std,0.0,37813520.0,138386400.0,26271730.0,1327327.0,1878178.0,45832060.0,703960.9,100076.6,107288700.0,26.521788
min,2018.0,0.0,0.0,-57180380.0,0.0,0.0,0.0,0.0,0.0,-5256445000.0,2.0
25%,2018.0,0.0,34238.84,0.0,0.0,0.0,6000.0,0.0,0.0,0.0,30.0
50%,2018.0,0.0,200096.2,1065.0,0.0,0.0,102055.1,0.0,0.0,435.54,50.0
75%,2018.0,300000.0,1564424.0,105643.0,0.0,0.0,800000.0,0.0,0.0,80000.0,62.0
max,2018.0,1114645000.0,5256445000.0,1449055000.0,35907000.0,81720300.0,1449055000.0,16026540.0,4069569.0,428909900.0,98.0


I find it's often easier to interpret these values once they're transposed.

In [51]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
budget_fiscal_year,3653.0,2018.0,0.0,2018.0,2018.0,2018.0,2018.0,2018.0
adopted_budget_amount,3653.0,3887015.0,37813520.0,0.0,0.0,0.0,300000.0,1114645000.0
total_expenditures,2538.0,12134860.0,138386400.0,0.0,34238.84,200096.24,1564423.615,5256445000.0
budget_change_amount,3653.0,1075858.0,26271730.0,-57180380.0,0.0,1065.0,105643.0,1449055000.0
budget_transfer_in_amount,3653.0,113671.4,1327327.0,0.0,0.0,0.0,0.0,35907000.0
budget_transfer_out_amount,3653.0,113671.4,1878178.0,0.0,0.0,0.0,0.0,81720300.0
total_budget,3653.0,4962873.0,45832060.0,0.0,6000.0,102055.13,800000.0,1449055000.0
encumbrance_amount,3653.0,97631.27,703960.9,0.0,0.0,0.0,0.0,16026540.0
pre_encumbrance_amount,3653.0,6138.717,100076.6,0.0,0.0,0.0,0.0,4069569.0
budget_uncommitted_amount,3653.0,-3572344.0,107288700.0,-5256445000.0,0.0,435.54,80000.0,428909900.0


Note that two of our numeric fields actually represent categorical data, `budget_fiscal_year` and `department`.

There are a number of ways to deal with this. I prefer to just go ahead and cast them as `object` type. This will omit them from our numeric operations (and also allow us to easily group them with our other categorical columns).

Use `.astype` to change each column to an object and overwrite the original column.

In [52]:
df['department'] = df['department'].astype('object')

In [53]:
df['budget_fiscal_year'] = df['budget_fiscal_year'].astype('object')

Now when we look at our `.describe`, these values won't appear.

In [54]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
adopted_budget_amount,3653.0,3887015.0,37813520.0,0.0,0.0,0.0,300000.0,1114645000.0
total_expenditures,2538.0,12134860.0,138386400.0,0.0,34238.84,200096.24,1564423.615,5256445000.0
budget_change_amount,3653.0,1075858.0,26271730.0,-57180380.0,0.0,1065.0,105643.0,1449055000.0
budget_transfer_in_amount,3653.0,113671.4,1327327.0,0.0,0.0,0.0,0.0,35907000.0
budget_transfer_out_amount,3653.0,113671.4,1878178.0,0.0,0.0,0.0,0.0,81720300.0
total_budget,3653.0,4962873.0,45832060.0,0.0,6000.0,102055.13,800000.0,1449055000.0
encumbrance_amount,3653.0,97631.27,703960.9,0.0,0.0,0.0,0.0,16026540.0
pre_encumbrance_amount,3653.0,6138.717,100076.6,0.0,0.0,0.0,0.0,4069569.0
budget_uncommitted_amount,3653.0,-3572344.0,107288700.0,-5256445000.0,0.0,435.54,80000.0,428909900.0


If you're not familiar, the `e+06` is scientific notation, and just means that the decimal should be moved 6 places to the right.
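As a quick check, plain Python reads scientific notation directly. This sketch takes the mean adopted budget as it appears in this lesson's output and formats it with thousands separators (the variable names are just for illustration).

```python
# Mean of `adopted_budget_amount` as printed in this lesson's output.
mean_adopted = 3.887015e+06

# Scientific notation is just another way to write the same float.
same_number = mean_adopted == 3887015.0

# Format with thousands separators and two decimals for easier reading.
readable = f"{mean_adopted:,.2f}"
print(readable)
```

If you'd rather have Pandas display floats this way everywhere, the `display.float_format` option is one place to look.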

Here's a table for what each of our calculated values represents:

| statistic | meaning |
| --- | --- |
| count | Number of non-null elements |
| mean | Average of non-null elements |
| std | Standard deviation |
| min | The smallest value in the column |
| 25% | Value greater than 25% of data (lower quartile, Q1) |
| 50% | The middle value (50th percentile, Q2) |
| 75% | Value greater than 75% of data (upper quartile, Q3) |
| max | The largest value in the column |
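To check the table against actual numbers, here's a minimal sketch on a made-up Series with one missing value (an assumption for illustration). Each entry of `.describe` lines up with the single-purpose method of the same name.

```python
import numpy as np
import pandas as pd

# Toy Series, made up for illustration: five values and one null.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

summary = s.describe()
print(summary)

# The null is left out of the count and of every other statistic.
count_matches = summary["count"] == s.count()
q1_matches = summary["25%"] == s.quantile(0.25)
median_matches = summary["50%"] == s.median()
```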

We note that many of our columns are full of mostly zeros (5 of our columns have a median of 0, and 4 have the value 0 at the 75th percentile).

All of our columns have min values of 0 or less.

The only column with missing values is `total_expenditures`. Because we're looking at 2018 data, it's possible that these values just haven't been calculated and finalized yet, but we should keep this in mind moving forward as we look at other years.

#### Select Numeric Columns

To be explicit moving forward, let's get a quick list of numeric columns. We'll again use `.select_dtypes`, this time with the keyword `exclude='object'`, and directly access `.columns` from the result.

In [55]:
num_cols = df.select_dtypes(exclude='object').columns
num_cols

Index(['adopted_budget_amount', 'total_expenditures', 'budget_change_amount',
       'budget_transfer_in_amount', 'budget_transfer_out_amount',
       'total_budget', 'encumbrance_amount', 'pre_encumbrance_amount',
       'budget_uncommitted_amount'],
      dtype='object')
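One caveat worth sketching: `exclude='object'` keeps everything that isn't an `object` column, which is only the same as "numeric" when no other types (such as booleans) are present. On this made-up frame (an assumption for illustration), `include='number'` is the stricter choice.

```python
import pandas as pd

# Toy data, made up for illustration: one text, two numeric, one boolean column.
toy = pd.DataFrame(
    {
        "fund": pd.Series(["100", "564"], dtype=object),
        "total_budget": [1000.0, 250.0],
        "line_items": [3, 1],
        "is_general_fund": [True, False],
    }
)

not_object = toy.select_dtypes(exclude="object").columns.tolist()
numeric_only = toy.select_dtypes(include="number").columns.tolist()

print(not_object)     # the boolean column is still here
print(numeric_only)   # floats and ints only
```

For our budget data both approaches should give the same nine columns, since every non-object column is a float.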

Here's a convenient table of the description provided alongside the data for each of these fields:

| Column name | Description |
| --- | --- |
| adopted_budget_amount | Original budget amount adopted by Mayor and Council |
| total_expenditures | Total Budget Fiscal Year amount expended from account to date |
| budget_change_amount | Amendment to the adopted budget amount |
| budget_transfer_in_amount | Increase in appropriation to account by transfer in of funds |
| budget_transfer_out_amount | Decrease in appropriation to account by transfer out of funds |
| total_budget | Appropriation account amount net of changes and transfers to/from the original budgeted amount |
| encumbrance_amount | Obligation or commitment to pay for a good or service |
| pre_encumbrance_amount | Anticipated obligation or commitment to pay for a good or service |
| budget_uncommitted_amount | Total unused appropriation after expenditures and encumbrances |

### Aggregate Statistics

We can also calculate these same statistics on each column, or the entire DataFrame. We'll run through these operations quickly as a demonstration.

#### `.count`

`.count` will work on both numeric and categorical values. Note that nulls are ignored.

In [56]:
df.count()

budget_fiscal_year            3653
department_name               3653
fund_name                     3653
account_name                  3651
adopted_budget_amount         3653
total_expenditures            2538
budget_change_amount          3653
budget_transfer_in_amount     3653
budget_transfer_out_amount    3653
total_budget                  3653
encumbrance_amount            3653
pre_encumbrance_amount        3653
budget_uncommitted_amount     3653
account_group_name            1375
fund                          3653
account                       3653
department                    3653
dtype: int64

#### `.mean`

`.mean` will only evaluate numeric values, ignoring nulls.

In [57]:
df[num_cols].mean()

adopted_budget_amount         3.887015e+06
total_expenditures            1.213486e+07
budget_change_amount          1.075858e+06
budget_transfer_in_amount     1.136714e+05
budget_transfer_out_amount    1.136714e+05
total_budget                  4.962873e+06
encumbrance_amount            9.763127e+04
pre_encumbrance_amount        6.138717e+03
budget_uncommitted_amount    -3.572344e+06
dtype: float64

By default, means are calculated for each column. However, we can change this to be calculated over rows by passing an `axis=1` keyword argument.

In [58]:
df[num_cols].mean(axis=1).head()

0    693405.111111
1     29292.000000
2    101149.000000
3    847948.333333
4      5733.333333
dtype: float64

In this investigation, these numbers aren't extremely informative. As a reminder, let's look at our numeric columns again.

In [59]:
num_cols

Index(['adopted_budget_amount', 'total_expenditures', 'budget_change_amount',
       'budget_transfer_in_amount', 'budget_transfer_out_amount',
       'total_budget', 'encumbrance_amount', 'pre_encumbrance_amount',
       'budget_uncommitted_amount'],
      dtype='object')

So the row-wise means we are returning are just being calculated on these various numbers. Again, this is for demonstration, and not really informative.
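The direction of `axis` is easier to see on a two-by-two example. This is a minimal sketch with made-up numbers (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]})

col_means = toy.mean()         # default: one mean per column
row_means = toy.mean(axis=1)   # axis=1: one mean per row

print(col_means)
print(row_means)
```

The column-wise result is indexed by column name, while the row-wise result is indexed by row label, which is a handy way to tell which one you got.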

#### `.std`

Standard deviation will only be calculated on numeric columns.

In [60]:
df[num_cols].std()

adopted_budget_amount         3.781352e+07
total_expenditures            1.383864e+08
budget_change_amount          2.627173e+07
budget_transfer_in_amount     1.327327e+06
budget_transfer_out_amount    1.878178e+06
total_budget                  4.583206e+07
encumbrance_amount            7.039609e+05
pre_encumbrance_amount        1.000766e+05
budget_uncommitted_amount     1.072887e+08
dtype: float64

#### Quantiles

The `.quantile` method will allow you to specify any value between 0 and 1.

In [61]:
df[num_cols].quantile(.3)

adopted_budget_amount             0.0
total_expenditures            42000.0
budget_change_amount              0.0
budget_transfer_in_amount         0.0
budget_transfer_out_amount        0.0
total_budget                  13409.6
encumbrance_amount                0.0
pre_encumbrance_amount            0.0
budget_uncommitted_amount         0.0
Name: 0.3, dtype: float64

We also have the built-in `.median` method for the 50th percentile.

In [62]:
df[num_cols].median()

adopted_budget_amount              0.00
total_expenditures            200096.24
budget_change_amount            1065.00
budget_transfer_in_amount          0.00
budget_transfer_out_amount         0.00
total_budget                  102055.13
encumbrance_amount                 0.00
pre_encumbrance_amount             0.00
budget_uncommitted_amount        435.54
dtype: float64
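`.quantile` also accepts a list of values, which returns several quantiles at once. Here's a small sketch on a made-up column (an assumption for illustration) that also confirms the median is the 0.5 quantile.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"total_budget": [0.0, 10.0, 20.0, 30.0, 40.0]})

# Passing a list returns a Series indexed by the requested quantiles.
quartiles = toy["total_budget"].quantile([0.25, 0.5, 0.75])
median = toy["total_budget"].median()

print(quartiles)
```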

#### `.min` and `.max`

These will work on both numeric and categorical columns, returning minimum and maximum values for each column. For categorical columns, these are alphabetically sorted.

In [63]:
df.min()

budget_fiscal_year                                                  2018
department_name                                                    AGING
fund_name                     100 RESILIENT CITIES INITIATIVE GRANT FUND
adopted_budget_amount                                                  0
total_expenditures                                                     0
budget_change_amount                                        -5.71804e+07
budget_transfer_in_amount                                              0
budget_transfer_out_amount                                             0
total_budget                                                           0
encumbrance_amount                                                     0
pre_encumbrance_amount                                                 0
budget_uncommitted_amount                                   -5.25645e+09
fund                                                                 100
account                                            

In [64]:
df.max()

budget_fiscal_year                                 2018
department_name                                     ZOO
fund_name                     ZOO ENTERPRISE TRUST FUND
adopted_budget_amount                       1.11464e+09
total_expenditures                          5.25645e+09
budget_change_amount                        1.44906e+09
budget_transfer_in_amount                    3.5907e+07
budget_transfer_out_amount                  8.17203e+07
total_budget                                1.44906e+09
encumbrance_amount                          1.60265e+07
pre_encumbrance_amount                      4.06957e+06
budget_uncommitted_amount                    4.2891e+08
fund                                                W88
account                                          988210
department                                           98
dtype: object

Note that all of these methods can be applied on a Series as well.

In [65]:
df.total_budget.max()

1449055000.0

# END OF LESSON 1

I think? This already seems like a ton of content for just an hour. **BUT** we haven't gotten to masking or math operations. A lot of the above is just me being overly verbose so that attendees have things to refer back to later.

This is definitely more bottom-up than top-down. Maybe there's an additional notebook example that we run through just before this that provides some top-down high level analytics with all the code provided? This could be useful to show people the power of Pandas, and then zoom back and explain the basics? And then the lab could be focused on duplicating some of the things completed in the top-down demo?

A few cells above I realize don't really have any comments directly above them, but these will make sense in the nature of the lecture, and I think they read well here in the solutions notebook.

### Basic Math Operations

Pandas is set up to do vectorized math operations by default.

I'm no financial expert, but let's see how balanced this budget is.

Based on our provided data dictionary, we can see that the `total_budget` should result from the following:

`adopted_budget_amount` + `budget_change_amount` + `budget_transfer_in_amount` - `budget_transfer_out_amount`

Let's break this down into a few intermediate operations.

First, we'll add the `budget_change_amount` to our `adopted_budget_amount`. The cell below only previews the sum; to keep it, assign the result to a new column such as `changed_budget`.

In [193]:
df['budget_change_amount'] + df['adopted_budget_amount']

0        2231382.00
1          87876.00
2         303447.00
3        2543845.00
4           3900.00
5         256193.00
6         220546.00
7        2758622.00
8        4428880.00
9        1450864.00
10          8450.00
11        291679.00
12        544000.00
13          6652.00
14        164614.00
15          2571.00
16       2096835.00
17         10218.00
18        527631.00
19         32049.00
20       2538774.00
21         17788.00
22         31473.00
23        402384.00
24         15720.00
25        129399.00
26        537683.00
27         52522.00
28       3713657.00
29         57079.00
           ...     
3623           0.00
3624           0.00
3625           0.00
3626           0.00
3627           0.00
3628           0.00
3629      821999.00
3630     1281682.00
3631      226339.00
3632      399794.00
3633      500000.00
3634       40000.00
3635       20158.71
3636       12500.00
3637       60000.00
3638      150000.00
3639     1598896.00
3640       21311.00
3641        5001.00


| Column name | Description |
| --- | --- |
| adopted_budget_amount | Original budget amount adopted by Mayor and Council |
| total_expenditures | Total Budget Fiscal Year amount expended from account to date |
| budget_change_amount | Amendment to the adopted budget amount |
| budget_transfer_in_amount | Increase in appropriation to account by transfer in of funds |
| budget_transfer_out_amount | Decrease in appropriation to account by transfer out of funds |
| total_budget | Appropriation account amount net of changes and transfers to/from the original budgeted amount |
| encumbrance_amount | Obligation or commitment to pay for a good or service |
| pre_encumbrance_amount | Anticipated obligation or commitment to pay for a good or service |
| budget_uncommitted_amount | Total unused appropriation after expenditures and encumbrances |

In [192]:
((df['adopted_budget_amount'] 
  + df['budget_change_amount']
  - df['budget_transfer_out_amount'] 
  + df['budget_transfer_in_amount']
 ) == df['total_budget']).mean()

0.989871338625787
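Taking the `.mean` of a boolean Series gives the share of `True` values, so the number above is the fraction of rows where the formula matches exactly. Because these are floats, an exact `==` can flag rows that differ only by rounding. Here's a hedged sketch using `numpy.isclose` with a half-cent tolerance; the first two rows use values from the first rows of the 2018 data as displayed in this notebook, while the third row and the tolerance are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Rows 0 and 1 come from the notebook's displayed data;
# row 2 is made up to demonstrate a mismatch.
toy = pd.DataFrame(
    {
        "adopted_budget_amount": [2222382.0, 3900.0, 100.0],
        "budget_change_amount": [9000.0, 0.0, 50.0],
        "budget_transfer_in_amount": [0.0, 13300.0, 0.0],
        "budget_transfer_out_amount": [453500.0, 0.0, 0.0],
        "total_budget": [1777882.0, 17200.0, 175.0],
    }
)

expected = (
    toy["adopted_budget_amount"]
    + toy["budget_change_amount"]
    + toy["budget_transfer_in_amount"]
    - toy["budget_transfer_out_amount"]
)

# Treat amounts within half a cent of each other as equal.
matches = np.isclose(expected, toy["total_budget"], rtol=0, atol=0.005)
share_matching = matches.mean()
print(share_matching)
```

Whether a tolerance changes the result on the real data is something to test rather than assume; some mismatches may be genuine discrepancies.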

In [187]:
df_no_nulls = df.dropna()
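Keep in mind that `.dropna` with no arguments drops a row if *any* column is null. Since `account_group_name` is null in 2278 of our 3653 rows, `df_no_nulls` can have at most 1375 rows. If the goal is only to skip rows without expenditures, the `subset` keyword is one option, sketched here on made-up data (the `toy` frame is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "total_expenditures": [100.0, np.nan, 300.0],
        "total_budget": [150.0, 200.0, 300.0],
        "account_group_name": pd.Series(["Expenses", np.nan, np.nan], dtype=object),
    }
)

any_null_dropped = toy.dropna()                        # keeps only fully populated rows
targeted = toy.dropna(subset=["total_expenditures"])   # ignores nulls in other columns

print(len(any_null_dropped), len(targeted))
```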

In [188]:
((df_no_nulls['total_expenditures']
  + df_no_nulls['encumbrance_amount']
  + df_no_nulls['budget_uncommitted_amount']
  + df_no_nulls['pre_encumbrance_amount']
 ) == df_no_nulls['total_budget']).mean()

0.9175257731958762

In [189]:
((df_no_nulls['adopted_budget_amount']
  - df_no_nulls['total_budget']
  - df_no_nulls['budget_transfer_out_amount']
  + df_no_nulls['budget_transfer_in_amount']
 ).abs() == df_no_nulls['budget_change_amount']).mean()


0.7824742268041237

In [173]:
df.budget_change_amount

0          9000.00
1         87876.00
2        303447.00
3       2543845.00
4             0.00
5        256193.00
6        220546.00
7       2758622.00
8       4428880.00
9       1450864.00
10         -200.00
11      -171434.00
12       544000.00
13         6652.00
14       164614.00
15         2571.00
16      -264700.00
17         1093.00
18       527631.00
19        32049.00
20      2538774.00
21        11987.00
22        31473.00
23       402384.00
24        15720.00
25       129399.00
26       537683.00
27        52522.00
28      -142554.00
29        57079.00
           ...    
3623          0.00
3624          0.00
3625          0.00
3626          0.00
3627          0.00
3628          0.00
3629          0.00
3630          0.00
3631      -9521.00
3632    -591606.00
3633     500000.00
3634          0.00
3635      20158.71
3636      12500.00
3637          0.00
3638          0.00
3639          0.00
3640      21311.00
3641          0.00
3642          0.00
3643     -34463.71
3644        

In [174]:
df.head()

| | budget_fiscal_year | department_name | fund_name | account_name | adopted_budget_amount | total_expenditures | budget_change_amount | budget_transfer_in_amount | budget_transfer_out_amount | total_budget | encumbrance_amount | pre_encumbrance_amount | budget_uncommitted_amount | account_group_name | fund | account | department |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018 | AGING | GENERAL FUND (GENERAL BUDGET) | CONTRACTUAL SERVICES | 2222382.0 | 1608157.04 | 9000.0 | 0.0 | 453500.0 | 1777882.0 | 93331.0 | 0.0 | 76393.96 | Expenses | 100 | 003040 | 2 |
| 1 | 2018 | AGING | TITLE VII OLDER AMERICANS ACT | OMBUDSMAN VII A PROGRAM | 0.0 | 87876.0 | 87876.0 | 0.0 | 0.0 | 87876.0 | 0.0 | 0.0 | 0.0 | | 564 | 02PB01 | 2 |
| 2 | 2018 | AGING | SENIOR HUMAN SERVICES PROGRAM | EVIDENCE BASED PROGRAMS | 0.0 | 292338.0 | 303447.0 | 0.0 | 0.0 | 303447.0 | 11109.0 | 0.0 | 0.0 | | 42J | 02R340 | 2 |
| 3 | 2018 | AGING | AREA PLAN FOR THE AGING TIT 7 | HOME DELIVERED MEALS FOR SENIORS | 0.0 | 2419162.0 | 2543845.0 | 0.0 | 0.0 | 2543845.0 | 36122.0 | 0.0 | 88561.0 | | 395 | 02PQ04 | 2 |
| 4 | 2018 | AGING | GENERAL FUND (GENERAL BUDGET) | OVERTIME GENERAL | 3900.0 | 15943.36 | 0.0 | 13300.0 | 0.0 | 17200.0 | 0.0 | 0.0 | 1256.64 | Salaries and benefits | 100 | 001090 | 2 |


In [160]:
(df['total_budget'] 
 - df['total_expenditures'] 
 + df['budget_transfer_in_amount'] 
 - df['budget_transfer_out_amount'] 
 + df['encumbrance_amount'] 
 + df['budget_uncommitted_amount']
#  + df['adopted_budget_amount']
)

0        -114050.08
1              0.00
2          22218.00
3         249366.00
4          15813.28
5              0.00
6           4880.00
7          35272.00
8          70182.00
9          11975.20
10          6204.42
11         32223.36
12        186718.00
13           746.00
14        271790.00
15             0.00
16        928787.68
17          8436.00
18          5298.00
19         22444.00
20        237484.00
21         39496.74
22             0.00
23         79544.00
24             0.00
25        112848.32
26         10526.00
27           178.00
28        881136.20
29          7857.60
           ...     
3623   -73121120.00
3624   -44724700.00
3625            NaN
3626   -73130000.00
3627     -460000.00
3628            NaN
3629      127867.54
3630           0.00
3631       61382.12
3632      464628.56
3633           0.00
3634       44148.24
3635           6.72
3636           0.00
3637       30000.00
3638        3501.36
3639            NaN
3640           0.00
3641          10.00
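
The `NaN` entries in the result above are what pandas produces when any operand in an arithmetic expression is missing. The sketch below uses made-up values (not taken from the budget data) to show that behavior and one way to locate the affected rows. The frame `toy` and its numbers are hypothetical; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; column names mirror the notebook.
toy = pd.DataFrame({
    'total_budget': [100.0, 250.0, 80.0],
    'total_expenditures': [40.0, np.nan, 80.0],
    'encumbrance_amount': [10.0, 5.0, np.nan],
})

# A missing operand makes the whole row's result NaN.
balance = toy['total_budget'] - toy['total_expenditures'] + toy['encumbrance_amount']

# Rows where at least one input column is missing.
rows_with_gaps = toy[toy.isnull().any(axis=1)]

print(balance)
print(rows_with_gaps.index.tolist())
```

Filtering with `isnull().any(axis=1)` before doing the arithmetic makes it easier to decide whether those rows should be dropped, imputed, or investigated.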


In [161]:
# Pull the first row to check the budget arithmetic on a single record
foo = df.loc[0]

In [165]:
foo['total_budget'] - foo['total_expenditures'] - foo['budget_transfer_out_amount']

-283775.04000000004

In [170]:
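# For row 0, the gap between adopted and total budget, net of transfers out, equals budget_change_amount
abs(foo['adopted_budget_amount'] - foo['total_budget'] - foo['budget_transfer_out_amount'])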

9000.0

In [167]:
foo

budget_fiscal_year                                     2018
department_name                                       AGING
fund_name                     GENERAL FUND (GENERAL BUDGET)
account_name                           CONTRACTUAL SERVICES
adopted_budget_amount                           2.22238e+06
total_expenditures                              1.60816e+06
budget_change_amount                                   9000
budget_transfer_in_amount                                 0
budget_transfer_out_amount                           453500
total_budget                                    1.77788e+06
encumbrance_amount                                    93331
pre_encumbrance_amount                                    0
budget_uncommitted_amount                             76394
account_group_name                                 Expenses
fund                                                    100
account                                              003040
department                              

## Missing data
- Finding missing values
- Imputing missing data
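
The two steps named above, finding and then imputing missing values, can be sketched on a small slice of the data. The rows below come from the `df.head()` output earlier in the notebook, on the assumption that the blank `account_group_name` cells are missing values (`NaN`). The placeholder label `'Unknown'` and the names `sample`, `null_counts`, and `imputed` are choices made for this sketch, not necessarily what the lab uses.

```python
import numpy as np
import pandas as pd

# Two columns from the first five rows shown by df.head(); blank
# account_group_name cells are assumed to be missing values (NaN).
sample = pd.DataFrame({
    'account_name': [
        'CONTRACTUAL SERVICES',
        'OMBUDSMAN VII A PROGRAM',
        'EVIDENCE BASED PROGRAMS',
        'HOME DELIVERED MEALS FOR SENIORS',
        'OVERTIME GENERAL',
    ],
    'account_group_name': [
        'Expenses', np.nan, np.nan, np.nan, 'Salaries and benefits',
    ],
})

# Finding: count missing values per column.
null_counts = sample.isnull().sum()

# Imputing: fill missing categories with a placeholder label.
# 'Unknown' is an arbitrary choice for this sketch.
imputed = sample.fillna({'account_group_name': 'Unknown'})

print(null_counts)
print(imputed['account_group_name'].tolist())
```

A placeholder category keeps these rows visible in later `groupby` calls, which drop `NaN` keys by default. Whether a placeholder, a domain-informed value, or dropping the rows is appropriate depends on why the values are missing.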

In [14]:
df.groupby(['department_name', 'department'])[['account_name']].count()

| department_name | department | account_name |
|---|---|---|
| AGING | 2 | 41 |
| AIRPORTS | 4 | 16 |
| ANIMAL SERVICES | 6 | 30 |
| BUILDING AND SAFETY | 8 | 58 |
| CANNABIS REGULATION | 13 | 9 |
| CITY ADMINISTRATIVE OFFICER | 10 | 155 |
| CITY ATTORNEY | 12 | 56 |
| CITY CLERK | 14 | 212 |
| CITY EMPLOYEES RETIREMENT SYSTEM | 16 | 12 |
| CITY ETHICS COMMISSION | 17 | 10 |
