# Data Cleaning & EDA
*Author: [Douglas Strodtman](http://linkedin.com/in/dstrodtman/)*



Data cleaning and exploratory data analysis (EDA) often go hand in hand.

- Without examining our data, it's difficult to know whether or not there are errors in it.
- Without cleaning our data, our aggregate statistics may be skewed by errant data.

The interplay of these processes is often cyclical. For a data science workflow, these steps are essential to help us understand the nature of our data and ensure that we haven't injected or propagated unnecessary noise into our modeling algorithm. Oftentimes we will find ourselves circling back to data cleaning and EDA after modeling when we are dissatisfied with the results.

**No matter your goals working with data, becoming proficient with cleaning and EDA is amongst the most important skills you can learn.**

## Skills Covered
1. Module import
1. Data import
1. Previewing Data
1. Renaming Columns
1. Masking
1. Reindexing
1. Summary Statistics
1. `groupby` and Aggregation
1. Pivot Tables
1. Missing data
    - Finding missing values
    - Imputing missing data
1. Data export

## Key Objectives

Our walkthrough will focus on data from the years 2017 and 2018. By the end of this lesson, you'll be able to answer the following questions (which will be the focus of the accompanying lab):

- Which department had the most line item entries each year?
- Which department had the highest total expenditures each year?
- Which fund had the highest budget allocation each year?
- What percentage of money from the general fund was allocated to different departments each year?
- Which departments saw the largest budget increase and decrease from 2017 to 2018?

## Module Import
Start off by importing pandas.

In [1]:
import pandas as pd

## Data Import
Load the full data. Use a relative path so that your code will be robust.

To see all the data files that were included with this lesson, run the following cell:

In [2]:
!ls ../data

2017_budget.csv                  City_Budget_and_Expenditures.csv
2018_budget.csv


Import `City_Budget_and_Expenditures.csv` to the variable `all_data`.

In [3]:
all_data = pd.read_csv('../data/City_Budget_and_Expenditures.csv')

## Preview Data
Look at the first 5 rows of your data to see how it loaded.

In [4]:
all_data.head()

Unnamed: 0,BUDGET FISCAL YEAR,DEPARTMENT NAME,FUND NAME,ACCOUNT NAME,ADOPTED BUDGET AMOUNT,TOTAL EXPENDITURES,BUDGET CHANGE AMOUNT,BUDGET TRANSFER IN AMOUNT,BUDGET TRANSFER OUT AMOUNT,TOTAL BUDGET,ENCUMBRANCE AMOUNT,PRE-ENCUMBRANCE AMOUNT,BUDGET UNCOMMITTED AMOUNT,ACCOUNT GROUP NAME,FUND,ACCOUNT,DEPARTMENT
0,2019,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,2185782.0,750988.85,2000.0,0.0,413400.0,1774382.0,522198.0,0.0,467277.15,EXPENSES,100,003040,2
1,2019,AGING,HEALTH INS COUNS ADV (HICAP),FINANCIAL ALIGNMENT - NEW,0.0,,66184.0,0.0,0.0,66184.0,66184.0,0.0,0.0,,47Y,02RDD3,2
2,2019,AGING,OTHER PROGRAMS FOR THE AGING,ENROLLEE WAGES,0.0,1073105.46,1601346.0,0.0,0.0,1601346.0,0.0,0.0,528240.54,,410,021021,2
3,2019,AGING,SENIOR CITYRIDE PROGRAM FUND,CITYRIDE PROGRAM,0.0,1709925.0,3708000.0,0.0,0.0,3708000.0,1961240.0,0.0,0.0,,599,02R220,2
4,2019,AGING,GENERAL FUND (GENERAL BUDGET),OVERTIME GENERAL,3900.0,319.28,0.0,0.0,0.0,3900.0,0.0,0.0,3580.72,SALARIES AND BENEFITS,100,001090,2


While our default options appear to have successfully loaded the data, we have column names that are all caps and contain spaces. Let's fix this before moving forward.

## Renaming Columns

As long as our column names contain only letters, numbers, and underscores (and don't start with a number), we can also use dot notation to access each column as a Series. In addition, this format will work across almost all parts of your data workflow, and is especially friendly to SQL.

Let's start by looking at all of our columns.

In [5]:
all_data.columns

Index(['BUDGET FISCAL YEAR', 'DEPARTMENT NAME', 'FUND NAME', 'ACCOUNT NAME',
       'ADOPTED BUDGET AMOUNT', 'TOTAL EXPENDITURES', 'BUDGET CHANGE AMOUNT',
       'BUDGET TRANSFER IN AMOUNT', 'BUDGET TRANSFER OUT AMOUNT',
       'TOTAL BUDGET', 'ENCUMBRANCE AMOUNT', 'PRE-ENCUMBRANCE AMOUNT',
       'BUDGET UNCOMMITTED AMOUNT', 'ACCOUNT GROUP NAME', 'FUND', 'ACCOUNT',
       'DEPARTMENT'],
      dtype='object')

We're aiming for `snake_case` here, which means we'll want only lowercase letters and underscores.

Let's start by just saving our lowercase strings to a new variable, `columns_lower`.

In [6]:
columns_lower = all_data.columns.str.lower()

As a next step, let's just replace the hyphens, overwriting our variable.

In [7]:
columns_lower = columns_lower.str.replace('-', '_')

Finally, we can replace our spaces with underscores as well.

In [8]:
columns_lower = columns_lower.str.replace(' ', '_')

Because we've maintained the order of our columns, we can safely overwrite the original columns in our DataFrame.

In [9]:
all_data.columns = columns_lower
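The three renaming steps can also be chained into a single expression. The following is a minimal sketch on a toy DataFrame; the name `toy` and its values are made up for illustration, and only the header style mirrors the budget data. Passing `regex=False` is an optional choice that makes it explicit we're replacing literal characters.

```python
import pandas as pd

# Toy frame with headers in the same style as the budget data (values are made up).
toy = pd.DataFrame(
    [[2017, 10.0, 0.0]],
    columns=['BUDGET FISCAL YEAR', 'TOTAL BUDGET', 'PRE-ENCUMBRANCE AMOUNT'],
)

# Lowercase, then swap hyphens and spaces for underscores in one chain.
# regex=False treats each pattern as a literal string.
toy.columns = (
    toy.columns
    .str.lower()
    .str.replace('-', '_', regex=False)
    .str.replace(' ', '_', regex=False)
)

print(list(toy.columns))
```

Because the result is valid `snake_case`, a column such as `toy.total_budget` is now reachable with dot notation.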

Preview the first 3 rows to see that this worked.

In [10]:
all_data.head(3)

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
0,2019,AGING,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,2185782.0,750988.85,2000.0,0.0,413400.0,1774382.0,522198.0,0.0,467277.15,EXPENSES,100,003040,2
1,2019,AGING,HEALTH INS COUNS ADV (HICAP),FINANCIAL ALIGNMENT - NEW,0.0,,66184.0,0.0,0.0,66184.0,66184.0,0.0,0.0,,47Y,02RDD3,2
2,2019,AGING,OTHER PROGRAMS FOR THE AGING,ENROLLEE WAGES,0.0,1073105.46,1601346.0,0.0,0.0,1601346.0,0.0,0.0,528240.54,,410,021021,2


## Masking

We're only interested in data from 2017 and 2018. Let's set up a unique mask for each of these years.

To do this, we'll just do a check for equality on our `budget_fiscal_year`.

In [11]:
mask_2017 = all_data.budget_fiscal_year == 2017
mask_2018 = all_data.budget_fiscal_year == 2018

We can now put these masks back into our DataFrame to look at only those rows for each year. Let's do this for each year and check the `shape` attribute so we can see how many rows we're selecting.

In [12]:
all_data[mask_2017].shape

(3593, 17)

In [13]:
all_data[mask_2018].shape

(3653, 17)

In [14]:
all_data[mask_2017].shape[0] + all_data[mask_2018].shape[0]

7246

We can also use the bitwise `or` operator `|` to select all those rows where either of these conditions is true. The number of rows here should equal 7246.

In [15]:
all_data[mask_2017 | mask_2018].shape

(7246, 17)
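For comparison, here is a small sketch of two other ways to write the same selection, using a made-up `toy` DataFrame rather than the budget data. When the comparisons are written inline, each one needs its own parentheses. The `isin` method is an alternative that avoids combining masks at all.

```python
import pandas as pd

# Toy data: five rows spanning four fiscal years (made up for illustration).
toy = pd.DataFrame({'budget_fiscal_year': [2016, 2017, 2017, 2018, 2019]})

mask_2017 = toy['budget_fiscal_year'] == 2017
mask_2018 = toy['budget_fiscal_year'] == 2018

# Parentheses are required when comparisons are written inline, because
# | binds more tightly than ==.
either_inline = (toy['budget_fiscal_year'] == 2017) | (toy['budget_fiscal_year'] == 2018)

# isin expresses the same "one of these values" test without combining masks.
either_isin = toy['budget_fiscal_year'].isin([2017, 2018])

print((mask_2017 | mask_2018).sum())
```

All three approaches select the same rows; which one reads best is largely a matter of taste and of how many values you need to match.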

Because we know that this is the data we wish to work with for the remainder of our exploration, let's save this out to a new DataFrame `df`.

In [16]:
df = all_data[mask_2017 | mask_2018].copy()

And let's take a sample of 10 rows to do a quick check that we haven't included any data from other years.

In [17]:
df.sample(10)

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
9491,2017,POLICE,GENERAL FUND (GENERAL BUDGET),CONTRACTUAL SERVICES,32860764.0,46004523.86,8460105.0,7532363.0,640108.0,48213124.0,637085.05,0.0,1571515.09,EXPENSES,100,003040,70
7715,2017,FIRE,GENERAL FUND (GENERAL BUDGET),CONSTRUCTION EXPENSE,313755.0,108854.92,0.0,0.0,165809.0,147946.0,36939.12,0.0,2151.96,EXPENSES,100,003030,38
4033,2018,ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT,LA COUNTY DEPARTMENT OF PROBATION GRANTS,TRAUMA-INFORMED YOUTH DEVELOPMENT PROGRAM,0.0,109331.0,180000.0,0.0,0.0,180000.0,70669.0,0.0,0.0,,60A,22P871,22
8205,2017,MAYOR,FY15 UASI HOMELAND SECURITY GRANT FUND,MAYOR,0.0,207262.8,207262.8,0.0,0.0,207262.8,0.0,0.0,0.0,,58H,46N146,46
3995,2018,ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT,LACCD CA CAREER PATHWAY TRUST FUND,RELATED COSTS - PERSONNEL,0.0,,3224.0,0.0,0.0,3224.0,0.0,0.0,3224.0,,59A,22P297,22
3126,2018,AIRPORTS,AIRPORT REVENUE,HIRING HALL-OVERTIME,0.0,88645.08,0.0,0.0,0.0,0.0,0.0,0.0,-88645.08,,700,041190,4
3954,2018,CULTURAL AFFAIRS,ARTS DEVELOPMENT FEE TRUST FND,5414 S CRENSHAW PMT 03695,0.0,,2779.12,0.0,0.0,2779.12,0.0,0.0,2779.12,,516,30PB17,30
6412,2018,TRANSPORTATION,WARNER CTR TRANS IMPROVE TRUST,TRANSPORTATION MANAGEMENT ORGANIZATION,0.0,117066.84,785000.0,0.0,0.0,785000.0,227933.16,440000.0,0.0,,573,94P695,94
7478,2017,CULTURAL AFFAIRS,ARTS DEVELOPMENT FEE TRUST FND,888 S HOPE ST 90017 PMT 01727 NA35,0.0,,7531.8,0.0,0.0,7531.8,0.0,0.0,7531.8,,516,30NA35,30
6168,2018,RECREATION AND PARKS,MUNICIPAL SPORTS ACCOUNT,TRAINING & CONFERENCE,0.0,6067.56,11614.53,0.0,0.0,11614.53,0.0,0.0,5546.97,,301,88006M,88


## Reset Index

You'll note in the preview above that our indices are quite high. This index is not especially informative (it was generated by Pandas automatically upon import).

Personally, when my index doesn't correspond to a primary key, I prefer to work with a serial index starting at 0.

This method is also helpful for returning columns that you've used in a `groupby` statement back into your main DataFrame (more on this later).

Make sure to set the argument `drop=True` if you want to discard your old index (here, we desire this functionality).

In addition, once you've checked that your code is working, you should set `inplace=True` to persist these changes in your `df`.

In [18]:
df.reset_index(drop=True, inplace=True)
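To see what `drop` controls, here is a short sketch on a made-up `toy` DataFrame with a non-sequential index. It assigns the result to a new variable instead of using `inplace=True`, which is an equally valid way to keep the change.

```python
import pandas as pd

# Toy frame whose index is left over from a larger table (values are made up).
toy = pd.DataFrame({'total_budget': [5.0, 7.0, 9.0]}, index=[9491, 7715, 4033])

# drop=False (the default) keeps the old index as a new column named 'index'.
kept = toy.reset_index()

# drop=True discards the old index entirely.
dropped = toy.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.index))
```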

## Summary Stats

We've already looked at the shape of our data, but let's check out our `info` to see the types and make note of any missing values.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7246 entries, 0 to 7245
Data columns (total 17 columns):
budget_fiscal_year            7246 non-null int64
department_name               7246 non-null object
fund_name                     7246 non-null object
account_name                  7242 non-null object
adopted_budget_amount         7246 non-null float64
total_expenditures            5106 non-null float64
budget_change_amount          7246 non-null float64
budget_transfer_in_amount     7246 non-null float64
budget_transfer_out_amount    7246 non-null float64
total_budget                  7246 non-null float64
encumbrance_amount            7246 non-null float64
pre_encumbrance_amount        7246 non-null float64
budget_uncommitted_amount     7246 non-null float64
account_group_name            2458 non-null object
fund                          7246 non-null object
account                       7246 non-null object
department                    7246 non-null int64
dtypes: float64(9),

And we can look at our overall numeric summary statistics. (Don't forget to transpose to make these easy to read).

In [20]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
budget_fiscal_year,7246.0,2017.504,0.5000174,2017.0,2017.0,2018.0,2018.0,2018.0
adopted_budget_amount,7246.0,3807644.0,37207150.0,0.0,0.0,0.0,261208.0,1114645000.0
total_expenditures,5106.0,11899910.0,134873400.0,0.0,36292.92,200215.785,1478339.0,5256445000.0
budget_change_amount,7246.0,1085642.0,26430660.0,-60641220.0,0.0,2454.855,107702.2,1449055000.0
budget_transfer_in_amount,7246.0,119959.8,1696095.0,0.0,0.0,0.0,0.0,85940630.0
budget_transfer_out_amount,7246.0,119959.8,1888885.0,0.0,0.0,0.0,0.0,81720300.0
total_budget,7246.0,4893286.0,45468000.0,0.0,5756.0,98919.32,716220.5,1449055000.0
encumbrance_amount,7246.0,74174.92,621173.8,0.0,0.0,0.0,0.0,18554310.0
pre_encumbrance_amount,7246.0,4686.861,83939.23,0.0,0.0,0.0,0.0,4069569.0
budget_uncommitted_amount,7246.0,-3571651.0,105073100.0,-5256445000.0,0.0,0.0,52133.89,428909900.0


Is there anything of value you note here? Do these numbers provide insight into any of the questions we originally sought to answer?

## `groupby` and Aggregation

We're not actually interested in aggregate statistics calculated over the entire column. Rather, we want to identify groups.

When using `groupby`, you'll need to also apply an aggregation method. Some useful aggregation methods include:

| method | function |
| --- | --- |
| `.count` | Returns the count of non-null values in each group. |
| `.sum` | Returns the sum of all the rows in each group. |
| `.mean` | Returns the average of all the rows in each group. |

Let's start by just grouping by our `budget_fiscal_year` and calculating the mean. Transpose the result for easier interpretation.

In [21]:
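# numeric_only=True restricts the mean to numeric columns; pandas 2.x raises a TypeError on text columns otherwise
df.groupby('budget_fiscal_year').mean(numeric_only=True).T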

budget_fiscal_year,2017,2018
adopted_budget_amount,3726947.0,3887015.0
total_expenditures,11667700.0,12134860.0
budget_change_amount,1095589.0,1075858.0
budget_transfer_in_amount,126353.2,113671.4
budget_transfer_out_amount,126353.2,113671.4
total_budget,4822536.0,4962873.0
encumbrance_amount,50326.87,97631.27
pre_encumbrance_amount,3210.761,6138.717
budget_uncommitted_amount,-3570947.0,-3572344.0
department,51.34929,48.45497
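Note that the mean of `department` above is the mean of an ID code, which isn't meaningful. One way to avoid this is to select the column of interest before aggregating. The sketch below does that on a made-up `toy` DataFrame and uses `agg` to apply several aggregations at once. It also shows how `count` and `size` differ when nulls are present.

```python
import numpy as np
import pandas as pd

# Toy data with one missing expenditure in 2017 (values are made up).
toy = pd.DataFrame({
    'budget_fiscal_year': [2017, 2017, 2018, 2018],
    'total_expenditures': [100.0, np.nan, 50.0, 150.0],
})

# Select the column first so only that column is aggregated.
grouped = toy.groupby('budget_fiscal_year')['total_expenditures']

# count skips nulls, size counts every row, and sum/mean ignore nulls.
summary = grouped.agg(['count', 'size', 'sum', 'mean'])
print(summary)
```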


We can also use `value_counts` and `describe` with `groupby`, but I'd recommend you limit these to a single column.

Let's use `describe` on our `total_budget` grouped by `budget_fiscal_year`.

In [22]:
df.groupby('budget_fiscal_year')['total_budget'].describe().T

budget_fiscal_year,2017,2018
count,3593.0,3653.0
mean,4822536.0,4962873.0
std,45101110.0,45832060.0
min,0.0,0.0
25%,5250.0,6000.0
50%,88936.16,102055.1
75%,614408.8,800000.0
max,1447680000.0,1449055000.0


Now let's look at the `value_counts` of our `department_name` when grouped by year.

In [23]:
df.groupby('budget_fiscal_year')['department_name'].value_counts()

budget_fiscal_year  department_name                                             
2017                NON-DEPARTMENTAL - APPROPRIATIONS TO SPECIAL PURPOSE FUND       634
                    HOUSING AND COMMUNITY INVESTMENT DEPARTMENT                     279
                    RECREATION AND PARKS                                            269
                    TRANSPORTATION                                                  225
                    CULTURAL AFFAIRS                                                200
                    CITY ADMINISTRATIVE OFFICER                                     156
                    ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT                   148
                    CITY CLERK                                                      128
                    RECREATION AND PARKS - SPECIAL ACCOUNTS                         124
                    NEIGHBORHOOD EMPOWERMENT                                        118
                    MAYOR              

It's difficult to garner any insights from this preview.

## `.pivot_table`

Instead, we'll create a [pivot table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html).

In my experience, power users of Excel think in pivot tables, whereas folks coming to data from a more programmatic background can struggle with the concept.

I think of it as allowing you to find the unique intersections of two different GROUP BY statements, and then identify an additional column to aggregate over.

Here's a breakdown of the arguments:

| arg | function |
| --- | --- |
| `values` | A column that will be aggregated |
| `index` | A column or list of columns; unique values will become the index of the resultant DataFrame |
| `columns` | A column or list of columns; unique values will become the columns of the resultant DataFrame |
| `aggfunc` | An aggregate function or list of aggregate functions that will be applied to the specified `values` column |

**Note**: There are additional layers of complex functionality available in this method, which is extremely powerful for data exploration.

Here, we'll return the `count` and `sum` of the `total_budget` with `department_name` as our index and `budget_fiscal_year` as our columns.

In [24]:
df.pivot_table(values='total_budget', index='department_name', columns='budget_fiscal_year', aggfunc=['count', 'sum'])

Unnamed: 0_level_0,count,count,sum,sum
budget_fiscal_year,2017,2018,2017,2018
department_name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
AGING,37.0,41.0,31263560.0,30459580.0
AIRPORTS,17.0,16.0,0.0,0.0
ANIMAL SERVICES,29.0,30.0,25617520.0,27520020.0
BUILDING AND SAFETY,45.0,59.0,370090800.0,457657600.0
CANNABIS REGULATION,,9.0,,1696200.0
CITY ADMINISTRATIVE OFFICER,156.0,156.0,1705370000.0,1658883000.0
CITY ATTORNEY,51.0,56.0,147896000.0,157313200.0
CITY CLERK,128.0,212.0,40031190.0,35475700.0
CITY EMPLOYEES RETIREMENT SYSTEM,12.0,12.0,1036152000.0,1106476000.0
CITY ETHICS COMMISSION,23.0,10.0,7960872.0,6430336.0


We're getting very close to having all the tools we need to answer the questions we posed at the beginning of the lesson. However, there's still an elephant in the room...

## Missing Data

What _should_ we do about the missing values in our data?

This is a difficult question to answer. Unless you are reasonably confident that you can find the true value for a missing data point, you should always be careful when imputing a value. Without going too far into this, a few concerns with data imputation include:

1. Changes to distributions
1. Reduction in variance
1. Obfuscation of meaningful nulls
1. Data is "made up"

Many modeling techniques will require that you deal with all null values before moving forward, so there may be times that you have to impute missing values. A few common approaches include using:

1. The mean, median, or mode
1. A random value selected from the distribution of values in the sample
1. A placeholder to indicate missingness (e.g. -1, 999999, '?')

Each of these is imperfect, and in all cases, it's **imperative** to clearly indicate that you've edited missing values if you're going to store this data for later analysis. (Imagine coming across a dataset with a mostly normal distribution but a huge spike of values right at the median. How would you implicitly know whether this data were real or the result of data cleaning?)
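One possible way to keep a record of what was edited is to store an indicator column before filling anything in. This is only a sketch on a made-up `toy` DataFrame; the column name `total_expenditures_was_null` is a hypothetical convention, and median imputation is used purely as an example of the approaches listed above.

```python
import numpy as np
import pandas as pd

# Toy data with two missing values (values are made up).
toy = pd.DataFrame({'total_expenditures': [100.0, np.nan, 300.0, np.nan]})

# Record which rows were missing *before* filling anything in.
toy['total_expenditures_was_null'] = toy['total_expenditures'].isna()

# Impute the median of the observed values.
median = toy['total_expenditures'].median()
toy['total_expenditures'] = toy['total_expenditures'].fillna(median)

print(toy)
```

Anyone who later sees a spike at the median can check the indicator column to tell real values from imputed ones.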

That being said, let's go ahead and look at the total number of missing values in each column to see if we can derive a plan of attack.

The `isna` method returns a DataFrame of booleans that we can `sum` column by column to get these counts.

In [25]:
df.isna().sum()

budget_fiscal_year               0
department_name                  0
fund_name                        0
account_name                     4
adopted_budget_amount            0
total_expenditures            2140
budget_change_amount             0
budget_transfer_in_amount        0
budget_transfer_out_amount       0
total_budget                     0
encumbrance_amount               0
pre_encumbrance_amount           0
budget_uncommitted_amount        0
account_group_name            4788
fund                             0
account                          0
department                       0
dtype: int64

I propose that the missingness in these 3 columns is unlikely to be random. Let's examine each column independently before making any decisions.

### Account Name

We'll start with `account_name`, which has the fewest nulls.

Let's begin by creating a mask of each row that has a missing value here.

In [26]:
acc_name_null = df.account_name.isna()

We can then use this to review the values present in these rows.

In [27]:
df[acc_name_null]

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
138,2018,BUILDING AND SAFETY,BLDG & SAFETY PERMIT ENTERPRIS,,44991842.0,48426284.84,7000000.0,0.0,0.0,51991842.0,0.0,0.0,3565557.16,SPECIAL,48R,08P299,8
260,2018,CITY ADMINISTRATIVE OFFICER,DISASTER ASSISTANCE TRUST FUND,,20581791.0,,0.0,0.0,0.0,20581791.0,0.0,0.0,20581791.0,SPECIAL,872,10P210,10
4530,2017,ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT,TEMPORARY ASSISTANCE FOR NEEDY FAMILIES FUND,,238001.0,,0.0,0.0,0.0,238001.0,0.0,0.0,238001.0,,56E,22N122,22
4551,2017,ECONOMIC AND WORKFORCE DEVELOPMENT DEPARTMENT,TEMPORARY ASSISTANCE FOR NEEDY FAMILIES FUND,,71994.0,,0.0,0.0,0.0,71994.0,0.0,0.0,71994.0,,56E,22N299,22


My thought would be to see whether or not we can identify with certainty the `account_name` by looking at other rows with the same `account` code.

We can use `isin` to find the rows that match here.

In [28]:
missing_accounts = df['account'].isin(df[acc_name_null]['account'].values)

By selecting only those columns we're interested in and sorting them, we can quickly see that in these 4 cases, it's probably safe to impute the account names used in other instances of the account code.

In [29]:
df[missing_accounts][['account', 'account_name']].sort_values(['account', 'account_name'])

Unnamed: 0,account,account_name
123,08P299,REIMBURSEMENT OF GENERAL FUND COSTS
125,08P299,REIMBURSEMENT OF GENERAL FUND COSTS
138,08P299,
223,10P210,DISASTER COSTS REIMBURSEMENTS TO OTHER DEPARTM...
260,10P210,
4452,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT
4462,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT
4464,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT
4473,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT
4475,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT


While in this case our data were small enough that we could visually review this, let's work out a way to do this programmatically.

The `dropna` method will, by default, drop rows that contain any nulls. **These changes will only persist if you use the `inplace=True` keyword argument (or assign the result back to a variable).** Let's start building up our expression by again applying our mask, selecting our columns of interest, and dropping those rows containing nulls.

In [30]:
df[missing_accounts][['account', 'account_name']].dropna()

Unnamed: 0,account,account_name
123,08P299,REIMBURSEMENT OF GENERAL FUND COSTS
125,08P299,REIMBURSEMENT OF GENERAL FUND COSTS
223,10P210,DISASTER COSTS REIMBURSEMENTS TO OTHER DEPARTM...
4447,22N299,REIMBURSEMENT OF GENERAL FUND COSTS
4451,22N299,REIMBURSEMENT OF GENERAL FUND COSTS
4452,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT
4453,22N299,REIMBURSEMENT OF GENERAL FUND COSTS
4457,22N299,REIMBURSEMENT OF GENERAL FUND COSTS
4459,22N299,REIMBURSEMENT OF GENERAL FUND COSTS
4462,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT


To get only those rows that are distinct, we can use `drop_duplicates`. Again, these changes won't persist unless we use the `inplace=True` argument.

In [31]:
df[missing_accounts][['account', 'account_name']].dropna().drop_duplicates()

Unnamed: 0,account,account_name
123,08P299,REIMBURSEMENT OF GENERAL FUND COSTS
223,10P210,DISASTER COSTS REIMBURSEMENTS TO OTHER DEPARTM...
4447,22N299,REIMBURSEMENT OF GENERAL FUND COSTS
4452,22N122,ECONOMIC AND WORKFORCE DEVELOPMENT


Now we can clearly see that we have only one `account_name` for each `account`.

Calling `values` on the previous command will return an array.

In [32]:
df[missing_accounts][['account', 'account_name']].dropna().drop_duplicates().values

array([['08P299', 'REIMBURSEMENT OF GENERAL FUND COSTS'],
       ['10P210', 'DISASTER COSTS REIMBURSEMENTS TO OTHER DEPARTMENTS'],
       ['22N299', 'REIMBURSEMENT OF GENERAL FUND COSTS'],
       ['22N122', 'ECONOMIC AND WORKFORCE DEVELOPMENT']], dtype=object)

Which we can cast as a dictionary to make our `account` the keys and the `account_name` the values.

In [33]:
acc_dict = dict(df[missing_accounts][['account', 'account_name']].dropna().drop_duplicates().values)

Now when we `map` this back onto our `account` column for those missing rows, we'll get back our `account_name`.

In [34]:
df.loc[acc_name_null, 'account'].map(acc_dict)

138                   REIMBURSEMENT OF GENERAL FUND COSTS
260     DISASTER COSTS REIMBURSEMENTS TO OTHER DEPARTM...
4530                   ECONOMIC AND WORKFORCE DEVELOPMENT
4551                  REIMBURSEMENT OF GENERAL FUND COSTS
Name: account, dtype: object

Which we can assign back to our DataFrame with `.loc`.

In [35]:
df.loc[acc_name_null, 'account_name'] = df.loc[acc_name_null, 'account'].map(acc_dict)
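The walkthrough above relied on visually confirming that each `account` code had a single `account_name`. Here is a sketch of how that check could be made explicit in code before filling. It uses a made-up `toy` DataFrame with shortened, hypothetical account names, and the variable names `known`, `lookup`, and `name_null` are our own.

```python
import numpy as np
import pandas as pd

# Toy data: two account codes, each with one row missing its name.
toy = pd.DataFrame({
    'account': ['08P299', '08P299', '10P210', '10P210', '08P299'],
    'account_name': ['REIMBURSEMENT', 'REIMBURSEMENT', 'DISASTER COSTS', np.nan, np.nan],
})

# Rows where the name is present.
known = toy.dropna(subset=['account_name'])

# Guard: every account code must map to exactly one observed name.
names_per_account = known.groupby('account')['account_name'].nunique()
assert (names_per_account == 1).all()

# Build the lookup and fill only the rows where the name is missing.
lookup = known.drop_duplicates('account').set_index('account')['account_name']
name_null = toy['account_name'].isna()
toy.loc[name_null, 'account_name'] = toy.loc[name_null, 'account'].map(lookup)

print(toy)
```

If a code ever mapped to two different names, the `assert` would fail and we'd know the imputation needs a closer look.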

### Total Expenditures

If you recall from our first lab, the missingness in our total expenditures can't be easily calculated.

Let's create an `exp_null` mask so we can explore this feature.

In [36]:
exp_null = df['total_expenditures'].isna()

Since we know we have a lot of rows here, let's just look at the first 10.

In [37]:
df[exp_null].head(10)

Unnamed: 0,budget_fiscal_year,department_name,fund_name,account_name,adopted_budget_amount,total_expenditures,budget_change_amount,budget_transfer_in_amount,budget_transfer_out_amount,total_budget,encumbrance_amount,pre_encumbrance_amount,budget_uncommitted_amount,account_group_name,fund,account,department
41,2018,AIRPORTS,AIRPORT INSURANCE TRUST FD ONT,OTHER EXPENDITURES,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,735,041000,4
58,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,DONATION-STAR PROGRAM,0.0,,23619.3,0.0,0.0,23619.3,0.0,0.0,23619.3,,859,060024,6
65,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,EAST VALLEY SHELTER,0.0,,13397.63,0.0,0.0,13397.63,0.0,0.0,13397.63,,859,060006,6
67,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,VENDING SALES,0.0,,6812.74,0.0,0.0,6812.74,0.0,0.0,6812.74,,859,060045,6
68,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,SOUTH LA SHELTER,0.0,,5125.1,0.0,0.0,5125.1,0.0,0.0,5125.1,,859,060005,6
69,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,DONATION-SMART,0.0,,150.0,0.0,0.0,150.0,0.0,0.0,150.0,,859,06023K,6
71,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,WEST LA SHELTER,0.0,,30615.38,0.0,0.0,30615.38,0.0,0.0,30615.38,,859,060007,6
73,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,DONATION-SMART,0.0,,54.5,0.0,0.0,54.5,0.0,0.0,54.5,,859,060023,6
77,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,HARBOR SHELTER,0.0,,14143.76,0.0,0.0,14143.76,0.0,0.0,14143.76,,859,060003,6
78,2018,ANIMAL SERVICES,ANIMAL WELFARE TRUST,W.VALLEY SHELTER,0.0,,17480.22,0.0,0.0,17480.22,0.0,0.0,17480.22,,859,060002,6


These first 10 rows have many zero values for the numeric columns, and also have identical values for the `budget_change_amount`, `total_budget`, and `budget_uncommitted_amount`. Let's explore how commonly these observations are true in the rest of the data.

In [38]:
(df[exp_null]['adopted_budget_amount']==0).mean()

0.7327102803738318

Just for comparison, let's look at how commonly this amount is zero when expenditures aren't null.

In [39]:
(df[~exp_null]['adopted_budget_amount']==0).mean()

0.5473952213082648

Here we see that visual inspection of a sample led us to a spurious hypothesis. Indeed, we can remember from our earlier investigation of our summary statistics that many of our numeric fields have many 0 values. Let's abandon further investigation of zeroes for now.

Instead, let's build logic to investigate `budget_change_amount`, `total_budget`, and `budget_uncommitted_amount`.

In [40]:
(df[exp_null]['budget_change_amount'] == df[exp_null]['total_budget']).mean()

0.7691588785046729

Given that not all of our observations have 0 for the `adopted_budget_amount`, it follows that `budget_change_amount` and `total_budget` won't be equal in many of our rows.

In [41]:
(df[exp_null]['total_budget'] == df[exp_null]['budget_uncommitted_amount']).mean()

0.9654205607476636

We do see, however, that our uncommitted budget is equal to our total budget in most of our data. If you recall, we should actually also be including the `encumbrance_amount` here.

In [42]:
(df[exp_null]['total_budget'] == df[exp_null]['budget_uncommitted_amount'] + df[exp_null]['encumbrance_amount']).mean()

0.9920560747663552

This is very nearly every row. (We could push further and check for floating point errors in our calculations, but we'll skip this for now).
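If we did want to check for floating point errors, one option is to compare within a tolerance using NumPy's `isclose` instead of `==`. The sketch below uses a made-up `toy` DataFrame. It doesn't tell us whether the remaining mismatches in the budget data are due to rounding, only how we might test that.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 differs only by floating point error, row 2 genuinely differs.
toy = pd.DataFrame({
    'total_budget': [0.3, 100.0, 50.0],
    'budget_uncommitted_amount': [0.1, 60.0, 50.0],
    'encumbrance_amount': [0.2, 40.0, 5.0],
})

recombined = toy['budget_uncommitted_amount'] + toy['encumbrance_amount']

# Exact comparison: 0.1 + 0.2 is not exactly 0.3 in floating point.
exact = toy['total_budget'] == recombined

# np.isclose compares within a small tolerance instead.
close = np.isclose(toy['total_budget'], recombined)

print(exact.mean(), close.mean())
```

The default tolerances of `isclose` are a starting point, not a rule. For currency amounts you may prefer to set `atol` to something like half a cent.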

Based on our observations, do you think it's safe to impute `0` into our `total_expenditures` column for missing values?

Remember, most of our aggregate calculations in Pandas will ignore null values by default. If we impute something (whether it's zero, the mean, a numeric placeholder, or a random value), these values will factor into any future summary statistics. **I would recommend, whenever possible, that you avoid imputation until you have completed all of your EDA.**
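A quick sketch of that effect, using a made-up Series named `expenditures`: the mean changes once zeros are imputed, because the filled rows now count toward it.

```python
import numpy as np
import pandas as pd

# Toy data: two observed values and two nulls (values are made up).
expenditures = pd.Series([100.0, np.nan, 300.0, np.nan])

# Nulls are skipped: the mean is taken over the two observed values.
mean_with_nulls = expenditures.mean()

# After filling with 0, all four rows count toward the mean.
mean_after_zero_fill = expenditures.fillna(0).mean()

print(mean_with_nulls, mean_after_zero_fill)
```

The sum is the same either way, but the mean, median, and count all shift, which is why the timing of imputation matters.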

### Account Group Name

This column had the highest number of nulls in our entire dataset. Let's look at the percentage.

In [43]:
df['account_group_name'].isna().mean()

0.6607783604747447

With that many nulls, let's see what our `value_counts` are for this field.

In [44]:
df.account_group_name.value_counts()

SPECIAL                  1589
EXPENSES                  561
SALARIES AND BENEFITS     278
EQUIPMENT                  30
Name: account_group_name, dtype: int64

The nature and distribution of these labels suggest that they are optional tags that provide additional context to line items. As such, the best we can likely do to eliminate the null values is to use a filler string like `"UNSPECIFIED"`.

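We can use `fillna` and assign the result back to the column to persist these changes to our data. (Calling `fillna` with `inplace=True` on a column selected from the DataFrame is not reliable in recent versions of Pandas, because the selection may be a copy.)

In [45]:
df['account_group_name'] = df['account_group_name'].fillna('UNSPECIFIED')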

## Data Export

Saving data in Pandas is just as easy as loading data. Here, we'll save our data back to our data directory with the name `clean1718.csv` using the `to_csv` method. Because our numeric index is not meaningful here, we can pass the keyword argument `index=False`.

In [46]:
df.to_csv('../data/clean1718.csv', index=False)
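One thing worth checking after export is that code-like columns survive a round trip. The sketch below is an illustration under assumptions: it writes a made-up `toy` DataFrame to a temporary directory under the hypothetical name `toy_clean.csv`. In this toy case the `account` codes are all digits, so type inference on reload turns them into integers and drops the leading zeros. Whether this affects your data depends on what the column contains.

```python
import os
import tempfile

import pandas as pd

# Toy data: account codes that consist only of digits, with leading zeros.
toy = pd.DataFrame({
    'account': ['003040', '001090'],
    'total_budget': [1774382.0, 3900.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_clean.csv')  # hypothetical file name
    toy.to_csv(path, index=False)

    # Default type inference reads these all-digit codes as integers.
    inferred = pd.read_csv(path)
    # Passing dtype keeps them as strings, leading zeros included.
    as_text = pd.read_csv(path, dtype={'account': str})

print(inferred['account'].tolist())
print(as_text['account'].tolist())
```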