# What can city employees' payroll data tell us?
## A quick data dive!

### Data source: [Payroll data for NYC employees](https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e/data)

In [1]:
from plotnine import *
import pandas as pd
df = pd.read_csv('all-employees.csv')
df.columns = df.columns.str.replace(" ", "_")
df.columns = df.columns.str.replace("-", "_")
df.columns = df.columns.str.lower()
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:,.2f}'.format



In [2]:
df.year.value_counts()

2020    590210
2017    458365
Name: year, dtype: int64

#### Cleaning the data
I'm choosing to keep only the NYC boroughs that appear in the data: Queens, Manhattan, Brooklyn and the Bronx. Staten Island doesn't appear as its own borough here, unless those employees were included in the "other" location.

In [3]:
boroughs = ['QUEENS', 'MANHATTAN', 'BROOKLYN', 'BRONX']

In [4]:
df = df[df.borough.isin(boroughs)]
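Before dropping rows, it can help to look at which borough labels the file actually contains and how many rows the filter discards, since Staten Island employees may sit under a different label. This is a minimal sketch on a made-up toy frame; the labels and counts below are illustrative assumptions, not values from the payroll file.

```python
import pandas as pd

# Toy stand-in for the payroll frame; these labels are made up for illustration.
toy = pd.DataFrame({
    "borough": ["BROOKLYN", "QUEENS", "OTHER", None, "BROOKLYN", "MANHATTAN"],
})

boroughs = ["QUEENS", "MANHATTAN", "BROOKLYN", "BRONX"]

# Count every label, including missing values, to see what is in the column.
label_counts = toy.borough.value_counts(dropna=False)

# Rows that the isin() filter would discard (unlisted labels and missing values).
dropped = toy[~toy.borough.isin(boroughs)]

print(label_counts)
print(len(dropped), "rows would be dropped")
```

On the real data, the same two lines would run against `df` before the filter in the cell above.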

In [5]:
df.head(5)

| Unnamed: 0 | year | payroll_no | agency_name | last_name | first_name | mid_int | start_date | borough | title_desc | leave_status | base_salary | pay_basis | reg_hrs | reg_gross_paid | ot_hrs | ot_paid | other_pay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 17.0 | OFFICE OF EMERGENCY MANAGEMENT | BEREZIN | MIKHAIL | | 8/10/15 | BROOKLYN | EMERGENCY PREPAREDNESS MANAGER | ACTIVE | 86005.0 | per Annum | 1820.0 | 84698.21 | 0.0 | 0.0 | 0.0 |
| 1 | 2020 | 17.0 | OFFICE OF EMERGENCY MANAGEMENT | GEAGER | VERONICA | M | 9/12/16 | BROOKLYN | EMERGENCY PREPAREDNESS MANAGER | ACTIVE | 86005.0 | per Annum | 1820.0 | 84698.21 | 0.0 | 0.0 | 0.0 |
| 2 | 2020 | 17.0 | OFFICE OF EMERGENCY MANAGEMENT | RAMANI | SHRADDHA | | 2/22/16 | BROOKLYN | EMERGENCY PREPAREDNESS MANAGER | ACTIVE | 86005.0 | per Annum | 1820.0 | 84698.21 | 0.0 | 0.0 | 0.0 |
| 3 | 2020 | 17.0 | OFFICE OF EMERGENCY MANAGEMENT | ROTTA | JONATHAN | D | 9/16/13 | BROOKLYN | EMERGENCY PREPAREDNESS MANAGER | ACTIVE | 86005.0 | per Annum | 1820.0 | 84698.21 | 0.0 | 0.0 | 0.0 |
| 4 | 2020 | 17.0 | OFFICE OF EMERGENCY MANAGEMENT | WILSON II | ROBERT | P | 4/30/18 | BROOKLYN | EMERGENCY PREPAREDNESS MANAGER | ACTIVE | 86005.0 | per Annum | 1820.0 | 84698.21 | 0.0 | 0.0 | 0.0 |


🚨 `Editorial choice`


Let's narrow the dataset down to employees who logged more overtime hours than regular hours.

In [None]:
ot_extra = df.query('ot_hrs > reg_hrs')

In [None]:
ot_extra

In [None]:
ot_extra.last_name.nunique()

In [None]:
df.last_name.nunique()

In [None]:
df.agency_name.nunique()

In [None]:
ot_extra.agency_name.nunique()

`Let's plot these 939 rows (remember: 805 unique last names, so the remaining 134 are repeats!)`
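Counting unique last names is only a rough proxy for counting people: two employees can share a surname, and one employee can appear in more than one fiscal year. A minimal sketch of a stricter count follows, on made-up toy rows. It assumes that last name, first name and start date together identify one person, which is an assumption on my part and not something the dataset documents.

```python
import pandas as pd

# Toy rows (made up): two different people share a last name,
# and one person appears in two fiscal years.
toy = pd.DataFrame({
    "last_name":  ["SMITH", "SMITH", "LEE", "LEE"],
    "first_name": ["ANNA", "JOHN", "KIM", "KIM"],
    "start_date": ["1/2/10", "3/4/12", "5/6/15", "5/6/15"],
    "year":       [2020, 2020, 2017, 2020],
})

# Counting last names alone merges unrelated people who share a surname.
by_last_name = toy.last_name.nunique()

# Assumption: these three columns together identify one person.
id_cols = ["last_name", "first_name", "start_date"]
by_person = len(toy.drop_duplicates(subset=id_cols))

print(by_last_name, by_person)
```

On `ot_extra`, the same `drop_duplicates` call would give a count that could be higher or lower than 805, depending on how many surnames are shared.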

#### Preliminary questions to answer with charts:
1. What agencies do these 939 employees work for?
2. What boroughs do they work for?
3. Are most of these employees still working?

In [None]:
ot_extra.to_csv('ot_extra.csv')

In [None]:
(
    ggplot(ot_extra,
        aes('ot_hrs', 'reg_hrs'))
        + geom_point(aes(color='borough'))
        + facet_wrap('agency_name')
        + theme(figure_size=(20, 18))
)

#### 👉🏻  So ... let's narrow to the three agencies with the most employees who worked more overtime than regular hours

In [None]:
agencies_df = ot_extra.agency_name.value_counts().head(3)
agencies_df

To plot this further, I'm re-reading the filtered dataset from a new Excel file, because I couldn't figure out how to do the filtering while plotting.

In [None]:
agencies_df = pd.read_excel('agencies.xlsx')
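The Excel round trip can probably be skipped by filtering in pandas. The sketch below takes the names of the three most common agencies from `value_counts()` and keeps only matching rows with `isin()`. It runs on a made-up toy frame; `top_agencies` and `top_df` are hypothetical names, and I can't confirm the result matches `agencies.xlsx` row for row.

```python
import pandas as pd

# Toy stand-in for ot_extra; agency names and hours are made up.
toy = pd.DataFrame({
    "agency_name": ["A", "A", "A", "B", "B", "C", "C", "D"],
    "ot_hrs":      [10, 20, 30, 40, 50, 60, 70, 80],
})

# Names of the three agencies with the most rows.
top_agencies = toy.agency_name.value_counts().head(3).index

# Keep only the rows that belong to those agencies.
top_df = toy[toy.agency_name.isin(top_agencies)]

print(top_df.agency_name.value_counts())
```

The filtered frame can be passed straight to `ggplot(...)`, so the filtering and plotting can live in the same cell.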

In [None]:
agencies_df.query('agency_name =="DEPT OF PARKS & RECREATION"').reg_hrs.value_counts()

In [None]:
chart = (
    ggplot(agencies_df.query('reg_hrs > 0'),
        aes('ot_hrs', 'reg_hrs'))
        + geom_point(aes(color='borough'))
        + facet_wrap('agency_name')
        + theme(figure_size=(16, 5))
        + labs(
            title = "City employees who worked more overtime than regular hours for top three agencies, by borough",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)
chart.save("three_agencies.svg")
chart

#### Combining the plot into one chart

In [None]:
(
    ggplot(agencies_df.query('reg_hrs > 0'),
        aes('ot_hrs', 'reg_hrs'))
        + geom_point(aes(color='borough', shape='agency_name'))
        + theme_bw()
        + theme(figure_size=(8, 5))  # after theme_bw(), a complete theme that would replace earlier theme() settings
        + labs(
            title = "City employees who worked more overtime than regular hours by top three agencies",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)

In [None]:
(
    ggplot(agencies_df.query('reg_hrs > 0'),
        aes('ot_hrs', 'reg_hrs'))
        + geom_point(aes(color='agency_name'))
        + theme_bw()
        + theme(figure_size=(8, 5))  # after theme_bw(), a complete theme that would replace earlier theme() settings
        + labs(
            title = "City employees who worked more overtime than regular hours by top three agencies",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)

In [None]:
(
    ggplot(agencies_df.query('reg_hrs > 0'),
        aes('ot_hrs', 'reg_hrs'))
        + geom_point(aes(color='borough'))
        + theme_bw()
        + theme(figure_size=(8, 5))  # after theme_bw(), a complete theme that would replace earlier theme() settings
        + labs(
            title = "City employees who worked more overtime than regular hours by work boroughs",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)

### 📓 Observation

`Most of these employees work in Brooklyn. Let's see how much the city paid in overtime to employees who worked more overtime than regular hours, broken down by borough`

But first, a quick refresher: 

The new dataframe (agencies_df) holds the employees who worked more overtime hours than regular hours, limited to the top three agencies.

In [None]:
agencies_df

In [None]:
agencies_df.query('reg_hrs > 0').groupby(by='borough').ot_paid.sum()

In [None]:
agencies_df.query('reg_hrs > 0').borough.value_counts()

In [None]:
agencies_df.query('reg_hrs > 0').nunique()

In [None]:
agencies_df.query('reg_hrs > 0').groupby(by='borough').ot_paid.sum().plot(kind='barh')

In [None]:
agencies_df.query('reg_hrs > 0').query('borough=="BROOKLYN"').agency_name.value_counts()

- 85 Brooklyn employees who worked more overtime than regular hours made over $366K in overtime pay combined.
- As a group, they also made more in overtime than the 149 employees in Manhattan.
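The borough comparison above mixes totals and headcounts, so it may help to compute both, plus a per-employee average, in one table. This is a minimal sketch using pandas named aggregation on made-up numbers; the figures are illustrative and are not the notebook's results.

```python
import pandas as pd

# Toy overtime payments (made up) for two boroughs.
toy = pd.DataFrame({
    "borough": ["BROOKLYN", "BROOKLYN", "MANHATTAN", "MANHATTAN", "MANHATTAN"],
    "ot_paid": [5000.0, 3000.0, 1000.0, 2000.0, 3000.0],
})

# One row per borough: total overtime pay, number of rows, and average per row.
summary = (
    toy.groupby("borough")
       .agg(total_ot_paid=("ot_paid", "sum"),
            employees=("ot_paid", "size"),
            mean_ot_paid=("ot_paid", "mean"))
       .sort_values("total_ot_paid", ascending=False)
)

print(summary)
```

Applied to `agencies_df.query('reg_hrs > 0')`, this would put the total, headcount and average for each borough side by side.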

### 📝 More observations📝

Most of these employees appear to have "CEASED" as their leave status. Let's take a closer look at that. Also, let's only include employees with positive regular hours.

In [None]:
ceased_df = agencies_df.query('reg_hrs > 0').query('leave_status == "CEASED"')
ceased_df

In [None]:
ceased_df.shape

#### Observation: 40 employees who worked more overtime than regular hours are no longer working. Let's take a closer look!

In [None]:
ceased_df.ot_paid.sum().round()

In [None]:
ceased_df.ot_paid.mean().round()

`Compare this to average overtime earned across all employees`

In [None]:
df.ot_paid.mean().round()

# AHHA!

#### The 40 employees who worked more overtime than regular hours earned more overtime pay, on average, than the average across all employees. These 40 employees are no longer working.
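Because most employees log little or no overtime, the overall mean is pulled toward zero, so comparing medians as well as means gives a fuller picture. The sketch below shows that comparison on made-up numbers; the `comparison` table and the figures in it are illustrative assumptions.

```python
import pandas as pd

# Toy overtime payments (made up): two former employees, four active ones.
toy = pd.DataFrame({
    "leave_status": ["CEASED", "CEASED", "ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE"],
    "ot_paid":      [9000.0, 7000.0, 0.0, 0.0, 1000.0, 1000.0],
})

ceased = toy.query('leave_status == "CEASED"')

# Mean and median overtime pay for the subgroup and for everyone.
comparison = pd.DataFrame(
    {
        "mean":   [ceased.ot_paid.mean(), toy.ot_paid.mean()],
        "median": [ceased.ot_paid.median(), toy.ot_paid.median()],
    },
    index=["ceased", "all"],
)

print(comparison)
```

With the real frames, `ceased` would be `ceased_df` and `toy` would be `df`.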

`Where did these 40 employees work and for what agencies?`

In [None]:
ceased_df.borough.value_counts()

In [None]:
ceased_df.agency_name.value_counts()

`Taking a closer look at Brooklyn employees`

In [None]:
chart = (
    ggplot(ceased_df.query('borough == "BROOKLYN"'),
        aes('ot_hrs', 'reg_hrs'))
        + geom_point(aes(color='agency_name'))
        + theme_bw()
        + theme(figure_size=(8, 5))  # after theme_bw(), a complete theme that would replace earlier theme() settings
        + labs(
            title = "Brooklyn employees who worked more overtime than regular hours & are no longer employed",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)

chart.save("brooklyn.svg")
chart

In [None]:
chart = (
    ggplot(df.sort_values(by='ot_paid', ascending = False).head(10),
        aes('ot_hrs', 'ot_paid'))
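        # size is a fixed value, so it goes outside aes() and adds no legend entry
        + geom_point(aes(shape='agency_name', color='borough'), size=4)
        + theme_bw()
        + theme(figure_size=(8, 5))  # after theme_bw(), a complete theme that would replace earlier theme() settings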
        + labs(
            title = "Two NYC Housing Authority employees made over $210,000 in overtime pay in 2020",
            x = "Overtime hours worked",
            y = "Overtime paid"
        )
)

chart.save("overallNYCHA.svg")
chart

In [None]:
df.sort_values(by='ot_paid', ascending = False).head(10)

In [None]:
(
    ggplot(df.sort_values(by='ot_paid', ascending = False).head(100),
        aes('year', 'ot_paid'))
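        # size is a fixed value, so it goes outside aes() and adds no legend entry
        + geom_point(aes(color='borough'), size=3)
        + theme_bw()
        + theme(figure_size=(8, 5))  # after theme_bw(), a complete theme that would replace earlier theme() settings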
        + labs(
            title = "2020 and 2017 saw the most in ",
            x = "Year",
            y = "Overtime paid"
        )
)

# chart.save("overallNYCHA.svg")
# chart

In [None]:
df.year.value_counts()

## Lots more possibilities, but here are some initial observations /// a quick recap!

### 1. About 805 city employees (counted by unique last name) worked more overtime hours than regular hours in the fiscal years in this dataset (2017 and 2020)
### 2. Most of these employees were from the Police, Fire and Parks & Rec Departments
### 3. While there's one NYPD outlier, most of these employees were from the Fire Department, worked in Brooklyn and are no longer employed
### 4. 85 Brooklyn employees made over $366K in overtime combined
### 5. The 40 employees who worked more overtime than regular hours and are no longer working earned more overtime pay, on average, than the average across all employees.