# Analyzing New York City's employee payroll database

## Data Source: [NYC Open Data](https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e/data)

In [1]:
import pandas as pd
df = pd.read_csv('Citywide_Payroll_Data__Fiscal_Year_.csv')
df.columns = df.columns.str.replace(" ", "_")
df.columns = df.columns.str.replace("-", "_")
df.columns = df.columns.str.lower()
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:,.2f}'.format
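
The column clean-up above can also be wrapped in a small helper so it is easy to reuse and test. This is a minimal sketch, not the notebook's own code: `normalize_columns` is a hypothetical name, the toy headers and values are made up, and unlike the cell above it collapses runs of spaces or hyphens into a single underscore.

```python
import pandas as pd

def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    out = frame.copy()
    out.columns = (
        out.columns.str.strip()
        .str.replace(r"[ \-]+", "_", regex=True)  # spaces and hyphens -> "_"
        .str.lower()
    )
    return out

# Toy frame with headers shaped like a raw CSV export (values are made up)
raw = pd.DataFrame(
    {"Fiscal Year": [2021], "Work Location Borough": ["BRONX"], "OT Hours": [12.5]}
)
clean = normalize_columns(raw)
print(list(clean.columns))
```

Working on a copy leaves the raw frame untouched, which helps when the import step has to be re-run.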



In [2]:
df.shape

(2864545, 17)

Previous versions of the dataset didn't import all the years correctly, so I'll repeat this sanity check a couple of times.

In [3]:
df.fiscal_year.value_counts()

2019    592431
2020    590210
2021    573477
2017    562266
2018    546161
Name: fiscal_year, dtype: int64
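
Since this check is meant to be repeated, it could be turned into a function that fails loudly when a year is missing. A rough sketch on toy data follows; `check_fiscal_years` and the expected range 2017 to 2021 are assumptions based on the counts shown above.

```python
import pandas as pd

def check_fiscal_years(frame, expected_years):
    """Raise if any expected fiscal year is absent; return row counts per year."""
    counts = frame["fiscal_year"].value_counts().sort_index()
    missing = sorted(set(expected_years) - set(counts.index))
    if missing:
        raise ValueError(f"Missing fiscal years: {missing}")
    return counts

# Toy data standing in for the payroll frame
toy = pd.DataFrame({"fiscal_year": [2017, 2017, 2018, 2019, 2020, 2021]})
counts = check_fiscal_years(toy, range(2017, 2022))
print(counts)
```

Calling it again after each filtering step would flag a year that silently dropped out.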

#### Cleaning the data

The database includes people whose work locations are outside of NYC. For this analysis, I'm only including employees with work locations in NYC boroughs. Since Staten Island wasn't listed (unless it is included in the "other" location), this is filtered down to Queens, Manhattan, the Bronx and Brooklyn.

In [4]:
boroughs = ['QUEENS', 'MANHATTAN', 'BROOKLYN', 'BRONX']

In [5]:
df = df[df.work_location_borough.isin(boroughs)].copy()  # copy so later column assignments don't act on a view

In [6]:
df.shape

(2760682, 17)

Note: This reduced the dataset by 103,863 rows.
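
Before trusting the filter, it may help to look at what it excludes. The sketch below uses made-up rows; the labels `OTHER` and `RICHMOND` are assumptions for illustration (Staten Island might appear under a county name, which is worth checking against the real column).

```python
import pandas as pd

boroughs = ["QUEENS", "MANHATTAN", "BROOKLYN", "BRONX"]

# Toy rows; the real column may hold different labels
toy = pd.DataFrame(
    {"work_location_borough": ["QUEENS", "BRONX", "OTHER", "RICHMOND", None, "MANHATTAN"]}
)

keep = toy["work_location_borough"].isin(boroughs)  # missing values count as not kept

# Tally the excluded locations, including missing ones
excluded_counts = toy.loc[~keep, "work_location_borough"].value_counts(dropna=False)

filtered = toy[keep].copy()
rows_dropped = len(toy) - len(filtered)
print(excluded_counts)
print(rows_dropped)
```

On the real data, `rows_dropped` should match the 103,863 rows noted above.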

### A note on the assumptions I'm making from here on:

From our reporting, I found that some employees tend to work more overtime as they get closer to retirement. Let's take a closer look at that.

In [12]:
df['agency_start_date'] = pd.to_datetime(df.agency_start_date, errors='coerce')
df['today'] = pd.to_datetime('today')
# Whole years of tenure; astype('timedelta64[Y]') is no longer supported in pandas 2.x
df['tenure_years'] = ((df.today - df.agency_start_date).dt.days / 365.2425) // 1
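
Because `today` changes on every run, tenure values drift over time. One option is to measure tenure against a fixed reference date. The sketch below is an assumption-laden illustration: `tenure_in_years` is a hypothetical helper, and the `as_of` date is simply the run date visible in the output table.

```python
import pandas as pd

def tenure_in_years(start_dates, as_of):
    """Whole years between each start date and `as_of` (NaN if the date is unparseable)."""
    starts = pd.to_datetime(start_dates, errors="coerce")
    elapsed_days = (pd.Timestamp(as_of) - starts).dt.days
    return (elapsed_days / 365.2425) // 1  # 365.2425 = mean Gregorian year length

# Toy start dates, including one that cannot be parsed
toy_starts = pd.Series(["2016-03-21", "2008-02-11", "not a date"])
tenure = tenure_in_years(toy_starts, as_of="2022-05-07")
print(tenure.tolist())
```

A fixed date such as the end of each fiscal year might suit a multi-year dataset better; that choice is left open here.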

In [13]:
df

| | fiscal_year | payroll_number | agency_name | last_name | first_name | mid_init | agency_start_date | work_location_borough | title_description | leave_status_as_of_june_30 | base_salary | pay_basis | regular_hours | regular_gross_paid | ot_hours | total_ot_paid | total_other_pay | today | tenure | tenure_years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | | ADMIN FOR CHILDREN'S SVCS | AARON | TERESA | | 2016-03-21 | BRONX | CHILD PROTECTIVE SPECIALIST | ACTIVE | 51315.00 | per Annum | 1825.00 | 51709.59 | 588.00 | 22374.31 | 639.66 | 2022-05-07 19:31:01.812351 | 6.00 | 6.00 |
| 1 | 2017 | | ADMIN FOR CHILDREN'S SVCS | AARONS | CAMELIA | M | 2016-08-08 | BROOKLYN | CHILD PROTECTIVE SPECIALIST | ACTIVE | 51315.00 | per Annum | 1595.55 | 41960.18 | 121.75 | 3892.19 | 108.25 | 2022-05-07 19:31:01.812351 | 5.00 | 5.00 |
| 2 | 2017 | | ADMIN FOR CHILDREN'S SVCS | ABDUL | MODUPE | | 2008-02-11 | BROOKLYN | CHILD PROTECTIVE SPECIALIST | ACTIVE | 54720.00 | per Annum | 1825.00 | 56298.93 | 54.75 | 2455.88 | 3938.75 | 2022-05-07 19:31:01.812351 | 14.00 | 14.00 |
| 3 | 2017 | | ADMIN FOR CHILDREN'S SVCS | ABDUL RAHMAN | ABDUL AZIZ | I | 2014-10-20 | MANHATTAN | CHILD PROTECTIVE SPECIALIST | ACTIVE | 54720.00 | per Annum | 1825.00 | 55346.09 | 273.00 | 11069.41 | 1124.51 | 2022-05-07 19:31:01.812351 | 7.00 | 7.00 |
| 4 | 2017 | | ADMIN FOR CHILDREN'S SVCS | ABDULGANIYU | MONSURAT | A | 2013-02-04 | BRONX | JUVENILE COUNSELOR | ACTIVE | 44409.00 | per Annum | 1762.00 | 44157.49 | 815.50 | 27878.15 | 2019.34 | 2022-05-07 19:31:01.812351 | 9.00 | 9.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2864536 | 2021 | 745.00 | DEPT OF ED HRLY SUPPORT STAFF | FRATIANNI | MARYANN | C | 1999-04-27 | MANHATTAN | F/T SCHOOL AIDE | CEASED | 17.00 | per Hour | 0.00 | -57256.00 | 0.00 | 0.00 | 5814.70 | 2022-05-07 19:31:01.812351 | 23.00 | 23.00 |
| 2864537 | 2021 | 742.00 | DEPT OF ED PEDAGOGICAL | LAMBERT | MARISA | M | 2005-09-06 | MANHATTAN | ASSISTANT PRINCIPAL | ON LEAVE | 130351.00 | per Annum | 0.00 | -36364.44 | 0.00 | 0.00 | -15369.52 | 2022-05-07 19:31:01.812351 | 16.00 | 16.00 |
| 2864538 | 2021 | 745.00 | DEPT OF ED HRLY SUPPORT STAFF | RIVERA | SARAH | M | 1997-09-02 | MANHATTAN | F/T SCHOOL AIDE | CEASED | 17.04 | per Hour | 0.00 | -58284.17 | 0.00 | 0.00 | 4347.24 | 2022-05-07 19:31:01.812351 | 24.00 | 24.00 |
| 2864541 | 2021 | 902.00 | BRONX DISTRICT ATTORNEY | SIMMONS | NATHANIEL | | 1990-07-02 | BRONX | SPECIAL ASSISTANT TO THE DISTRICT ATTORNEY | CEASED | 110000.00 | per Annum | -70.00 | -4207.65 | 0.00 | 0.00 | -75440.00 | 2022-05-07 19:31:01.812351 | 31.00 | 31.00 |


In [None]:
import statsmodels.formula.api as smf

# YOU CAN ADD FILTERS HERE IF YOU WANT TO LOOK INTO A PARTICULAR AGENCY
# Let's start with no filters
to_model = df # .query("agency_name=='DEPARTMENT OF CORRECTION'")

# title_description
# MODEL y=F(X) - which factors do you want to control for? 
# What do we think should explain the variance in overtime pay
model = smf.ols('ot_hours ~ regular_hours + tenure', data=to_model) 
# note: the formula above is linear; a squared term, e.g. I(tenure**2), can be added if it fits better
# https://stackoverflow.com/questions/31978948/python-stats-models-quadratic-term-in-regression

results = model.fit()
display(results.summary())

# FINDING OUTLIERS
# + E (what is still unaccounted for once you have controlled for those factors)
# fittedvalues and resid keep the modeled rows' index, so assign() aligns them to the filtered rows
outliers = (
    to_model.query("agency_name=='POLICE DEPARTMENT'")
    .assign(
        predicted=results.fittedvalues,
        residuals=results.resid,
        residuals_z=results.resid / results.resid.std(),
    )
    .sort_values(by='residuals_z', ascending=False)
)
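
The outlier step relies on one idea: fit on all rows, attach residuals to those same rows, and only then filter to one agency. The sketch below shows this on made-up numbers. It swaps in NumPy's least-squares solver for statsmodels to keep dependencies light, so it illustrates the idea rather than the cell above; the toy agency labels and values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the payroll frame (all values are made up)
toy = pd.DataFrame({
    "agency_name": ["A", "A", "B", "B", "B", "A"],
    "regular_hours": [1800, 1700, 1825, 1600, 1825, 1750],
    "tenure": [5, 10, 3, 20, 8, 15],
    "ot_hours": [100, 150, 90, 260, 900, 200],
})

# Design matrix: intercept, regular_hours, tenure
X = np.column_stack([np.ones(len(toy)), toy["regular_hours"], toy["tenure"]])
y = toy["ot_hours"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Score every modeled row first so predictions stay attached to the right rows
scored = toy.assign(predicted=X @ coef)
scored["residual"] = scored["ot_hours"] - scored["predicted"]
scored["residual_z"] = scored["residual"] / scored["residual"].std()

# Only now narrow to one agency and rank by standardized residual
outliers = scored[scored["agency_name"] == "B"].sort_values("residual_z", ascending=False)
print(outliers[["agency_name", "ot_hours", "predicted", "residual_z"]])
```

Scoring before filtering avoids a length mismatch between the model's fitted values and the filtered rows.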