In [115]:
import pandas as pd
import matplotlib.pyplot as plt

In [116]:
# read in data from csv
df = pd.read_csv("allemployeescy2019_feb19_20final-all.csv")

In [117]:
df.head()

,NAME,DEPARTMENT_NAME,TITLE,REGULAR,RETRO,OTHER,OVERTIME,INJURED,DETAIL,QUINN/EDUCATION INCENTIVE,TOTAL EARNINGS,POSTAL
0,"Bottomley,Torii A",BPS Business Service,BPS Worker's Comp Job Class,-,-,-,-,772034.24,-,-,772034.24,1938
1,"Smith,Lincoln",Workers Compensation Service,Workers Comp Job Classificatn,-,-,-,-,401182.80,-,-,401182.8,2125
2,"Kervin,Timothy M.",Boston Police Department,Police Lieutenant/Hdq Dispatch,142061.86,-,21262.85,115361.12,-,41360.00,35492.87,355538.7,2135
3,"Danilecki,John H",Boston Police Department,Police Captain,161608.85,-,24040.29,68964.13,-,53040.00,40402.20,348055.47,2081
4,"Maguire,Joseph M",Boston Police Department,Police Sergeant/Hdq Dispatcher,128912.77,-,7128.30,121616.21,-,55544.00,31310.86,344512.14,2038


## Cleaning ToDo List
##### The goal: a dataframe with two columns, department_name and frequency. The frequency column holds each department's share of all rows belonging to the target departments, expressed as a decimal fraction between 0 and 1 (for example, 0.375934 means about 37.6%).

#### Define

Create a list of target departments and keep only the rows with those department names. Reassign df to this filtered dataframe.

In [118]:
df.DEPARTMENT_NAME.value_counts().index

Index(['Boston Police Department', 'Boston Fire Department',
       'BPS Substitute Teachers/Nurs', 'BPS Special Education',
       'BPS Transportation', 'BPS Facility Management',
       'Boston Public Library', 'Boston Cntr - Youth & Families',
       'Public Works Department', 'Traffic Division',
       ...
       'Legal Advisor', 'Dorchester Academy', 'Boston Cntr-Youth & Families',
       'BPS Mattahunt Elementary', 'DND Neighborhood Development',
       'Fenway High School', 'BPS Facilitites Management',
       'BPS MPH\Commerce Academy', 'BPS MPH\Crafts Academy',
       'BPS Withthrop Elementary'],
      dtype='object', length=230)

In [119]:
# Get rows with desired departments under DEPARTMENT_NAME
target_departments = ['Boston Police Department', 'BPS Substitute Teachers/Nurs', 
                      'BPS Transportation', 'Boston Fire Department', 
                      'BPS Special Education', 'BPS Business Service', 
                      'Boston Public Library', 'BPS Facility Management']

df = df[df['DEPARTMENT_NAME'].isin(target_departments)]
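
`isin` silently skips any target name that matches no rows, and the department index above contains spelling variants (for example 'BPS Facilitites Management' next to 'BPS Facility Management'). A minimal sketch of a sanity check, assuming a small toy frame; the `toy` data and the `targets`, `filtered`, and `missing` names are hypothetical and not part of the original notebook:

```python
import pandas as pd

# Hypothetical toy data standing in for the earnings CSV
toy = pd.DataFrame({
    "DEPARTMENT_NAME": [
        "Boston Police Department",
        "Boston Fire Department",
        "BPS Facilitites Management",  # misspelled variant, as seen in the index above
        "Traffic Division",
    ]
})

targets = ["Boston Police Department", "BPS Facility Management"]

# Keep only rows whose department is in the target list
filtered = toy[toy["DEPARTMENT_NAME"].isin(targets)]

# Target names that matched no row at all (isin raises no error for these)
missing = sorted(set(targets) - set(toy["DEPARTMENT_NAME"]))

print(len(filtered))
print(missing)
```

An empty `missing` list would suggest every target name appears at least once; it does not catch rows filed under a misspelled variant, which would need a separate look at the full list of department names.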

#### Define

Use df.drop to drop all columns except DEPARTMENT_NAME.

In [120]:
# drop all columns except for DEPARTMENT_NAME (several labels include surrounding spaces)
df = df.drop(columns=['NAME', 'TITLE', ' REGULAR ', ' RETRO ', ' OTHER ', ' OVERTIME ', ' INJURED ',
                      ' DETAIL ', ' QUINN/EDUCATION INCENTIVE ', 'TOTAL EARNINGS', 'POSTAL'])
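
Several labels in the drop list carry leading and trailing spaces (' REGULAR ', ' RETRO '), which suggests the CSV header itself contains them. A sketch, on hypothetical toy data, of stripping that whitespace first and then keeping a single column by selection instead of listing every column to drop:

```python
import pandas as pd

# Hypothetical toy data with padded column labels
toy = pd.DataFrame({
    "DEPARTMENT_NAME": ["Boston Police Department", "Boston Fire Department"],
    " REGULAR ": ["142061.86", "-"],
    " OVERTIME ": ["115361.12", "-"],
})

# Strip surrounding whitespace from every column label
toy.columns = toy.columns.str.strip()

# Select the one column to keep; double brackets return a DataFrame
dept_only = toy[["DEPARTMENT_NAME"]]

print(list(toy.columns))
print(dept_only.shape)
```

Selecting by name keeps working if the source file gains or loses other columns, whereas an explicit drop list raises a KeyError as soon as one listed label is absent.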

#### Define

Use value_counts() to count the rows for each department name and turn the result into a dataframe with frequency as the column name. Then convert the counts to proportions of the total (decimal fractions that sum to 1). Finally, rename the index axis from DEPARTMENT_NAME to lowercase department_name and use reset_index() to make department_name a regular column with a default numbered index.

In [121]:
# create dataframe with department names as the index and their row counts in a frequency column
viz_df = df['DEPARTMENT_NAME'].value_counts().to_frame(name='frequency')

In [122]:
# convert counts to each department's share of the total (a decimal fraction between 0 and 1)
viz_df['frequency'] = viz_df.frequency / viz_df.frequency.sum()
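
The count-then-divide step can also be done in one call: `Series.value_counts(normalize=True)` returns each value's share of the total directly. A minimal sketch on hypothetical toy data (the labels and counts are made up for illustration):

```python
import pandas as pd

# Hypothetical toy column: three rows for one department, one for another
toy = pd.Series(
    ["Police", "Police", "Police", "Fire"],
    name="DEPARTMENT_NAME",
)

# Two-step version: count, then divide by the total
counts = toy.value_counts()
manual = counts / counts.sum()

# One-step version: ask value_counts for proportions
shares = toy.value_counts(normalize=True)

print(shares["Police"])  # 0.75
print(shares["Fire"])    # 0.25
```

Both versions give the same proportions; the two-step form keeps the raw counts available if they are needed later.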

In [123]:
# verifying operation worked
viz_df

DEPARTMENT_NAME,frequency
Boston Police Department,0.375934
Boston Fire Department,0.202965
BPS Substitute Teachers/Nurs,0.106654
BPS Special Education,0.098494
BPS Transportation,0.075049
BPS Facility Management,0.067349
Boston Public Library,0.06459
BPS Business Service,0.008964


In [124]:
# rename DEPARTMENT_NAME to department_name
viz_df.rename_axis('department_name', inplace=True)

# rename some index names to shorter strings so there won't be overlap in visualization
viz_df.rename(index={'Boston Police Department': 'Police Department', 
                     'Boston Fire Department': 'Fire Department',
                     'BPS Substitute Teachers/Nurs': 'BPS Substitute Teachers'}, inplace=True)

In [125]:
# move department_name from the index into a column and add a default integer index
viz_df.reset_index(inplace=True)
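
The renaming and reset steps above can also be chained without `inplace=True`. A sketch on a hypothetical two-row frame; the two values are copied from the output above purely for illustration, and `toy` and `tidy` are made-up names:

```python
import pandas as pd

# Hypothetical frame shaped like viz_df before renaming
toy = pd.DataFrame(
    {"frequency": [0.375934, 0.202965]},
    index=pd.Index(
        ["Boston Police Department", "Boston Fire Department"],
        name="DEPARTMENT_NAME",
    ),
)

tidy = (
    toy.rename_axis("department_name")  # rename the index axis label
       .rename(index={"Boston Police Department": "Police Department",
                      "Boston Fire Department": "Fire Department"})  # shorten labels
       .reset_index()  # move department_name into a regular column
)

print(list(tidy.columns))
print(tidy.loc[0, "department_name"])
```

Chaining returns a new dataframe at each step and leaves `toy` unchanged, which makes the cell safe to re-run.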

#### Define

Export the dataframe to CSV without the index; otherwise the Observable notebook would read the index as an extra column.

In [126]:
# export dataframe to csv
viz_df.to_csv('final_earnings_data.csv', index=False)
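
To see what `index=False` changes, compare the header line written with and without it. A sketch that writes to an in-memory buffer rather than to disk; the toy frame and its values are hypothetical:

```python
import io
import pandas as pd

# Hypothetical frame shaped like the final viz_df
toy = pd.DataFrame({
    "department_name": ["Police Department", "Fire Department"],
    "frequency": [0.65, 0.35],
})

with_index = io.StringIO()
toy.to_csv(with_index)  # default: the index is written as an unnamed first column

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # only the two data columns are written

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(header_with)     # ,department_name,frequency
print(header_without)  # department_name,frequency
```

The leading comma in the first header is the unnamed index column that a downstream CSV reader would pick up as an extra field.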

In [127]:
# final dataframe
viz_df

,department_name,frequency
0,Police Department,0.375934
1,Fire Department,0.202965
2,BPS Substitute Teachers,0.106654
3,BPS Special Education,0.098494
4,BPS Transportation,0.075049
5,BPS Facility Management,0.067349
6,Boston Public Library,0.06459
7,BPS Business Service,0.008964
