# What can city employees' payroll data tell us?
## -A quick data dive!

### Data source: [Payroll data for NYC employees](https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e/data)

In [1]:
from plotnine import *
import pandas as pd

df = pd.read_csv('Citywide_Payroll_Data__Fiscal_Year_.csv')
# Normalize column names to snake_case so they work as attributes and inside query()
df.columns = (
    df.columns
    .str.replace(" ", "_")
    .str.replace("-", "_")
    .str.lower()
)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# pd.options.display.float_format = '{:,.2f}'.format



In [2]:
df.fiscal_year.value_counts()

2019    592431
2020    590210
2021    573477
2017    562266
2018    546161
Name: fiscal_year, dtype: int64

#### Cleaning the data
I'm choosing to keep only four NYC boroughs: Queens, Manhattan, Brooklyn and the Bronx. Staten Island wasn't in the data here, unless those employees were included in the "other" location.
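
One way to check how (or whether) Staten Island shows up is to list every distinct value of `work_location_borough`, missing values included, before filtering. This is a minimal sketch on made-up rows; the values below are placeholders, not taken from the payroll file.

```python
import pandas as pd

# Toy rows standing in for the payroll data; the values are invented.
toy = pd.DataFrame({
    "work_location_borough": ["QUEENS", "MANHATTAN", "OTHER", None, "MANHATTAN"],
})

# dropna=False keeps missing boroughs visible in the tally.
borough_counts = toy.work_location_borough.value_counts(dropna=False)
print(borough_counts)

boroughs = ["QUEENS", "MANHATTAN", "BROOKLYN", "BRONX"]
# Rows the borough filter would drop (anything not in the list, including missing).
dropped = toy[~toy.work_location_borough.isin(boroughs)]
print(len(dropped), "rows would be dropped")
```

Running the same two steps on `df` before the filter would show which labels are being left out.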

In [None]:
boroughs = ['QUEENS', 'MANHATTAN', 'BROOKLYN', 'BRONX']

In [None]:
df = df[df.work_location_borough.isin(boroughs)]

In [None]:
df.shape

In [None]:
df.last_name.nunique()

In [None]:
df.fiscal_year.value_counts()

🚨 `Editorial choice`


Let's narrow down the dataset to focus only on the employees who logged more overtime hours than regular hours.

In [None]:
ot_extra = df.query('ot_hours > regular_hours')

In [None]:
ot_extra

In [None]:
ot_extra.last_name.nunique()

In [None]:
df.last_name.nunique()

In [None]:
df.agency_name.nunique()

In [None]:
ot_extra.agency_name.nunique()

`Let's plot these 2110 rows (remember: 1556 unique last names, and the remaining 554 are repeats!)`
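
A caveat on counting: unique last names are only a rough proxy for unique people, since two employees can share a last name and one employee can appear in several fiscal years. The sketch below uses invented rows and assumes the file also has a `first_name` column (it is not used elsewhere in this notebook).

```python
import pandas as pd

# Invented rows: two different people share a last name,
# and one person appears in two fiscal years.
toy = pd.DataFrame({
    "fiscal_year": [2020, 2021, 2021, 2021],
    "first_name": ["ANA", "ANA", "LUIS", "MEI"],
    "last_name": ["DIAZ", "DIAZ", "DIAZ", "CHEN"],
})

rows = len(toy)                              # payroll rows
unique_last_names = toy.last_name.nunique()  # what the cells above count
# Counting (first_name, last_name) pairs separates ANA DIAZ from LUIS DIAZ.
unique_name_pairs = len(toy.drop_duplicates(["first_name", "last_name"]))

print(rows, unique_last_names, unique_name_pairs)
```

Name pairs are still not a true employee ID, so this only narrows the gap.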

#### Preliminary questions to answer with charts:
1. What agencies do these 2110 employees work for?
2. What boroughs do they work in?
3. Are these people still employed?

In [None]:
ot_extra.to_csv('ot_extra.csv')

In [None]:
(
    ggplot(ot_extra,
        aes('ot_hours', 'regular_hours'))
        + geom_point(aes(color='work_location_borough'))
        + facet_wrap('agency_name')
        + theme(figure_size=(20, 18))
)

#### 👉🏻  So ... let's narrow to the three agencies with the most employees who worked more overtime than regular hours

In [None]:
three = ot_extra.agency_name.value_counts().head(3)
three

In [None]:
three.index

In [None]:
ot_extra[ot_extra.agency_name.isin(three.index)]

To plot this further, I'm re-reading the filtered dataset from a new CSV, because I couldn't figure out how to do the filtering while plotting.
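
For what it's worth, the filtered rows can also be kept in memory by assigning the `isin` result to a variable, which avoids the round trip through a file. A minimal sketch with an invented stand-in for `ot_extra` (the agency names are placeholders):

```python
import pandas as pd

# Invented stand-in for ot_extra; agency names are placeholders.
ot_extra = pd.DataFrame({
    "agency_name": ["A", "A", "A", "B", "B", "C", "C", "D"],
    "ot_hours": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Names of the three agencies with the most rows.
top_three = ot_extra.agency_name.value_counts().head(3).index

# Keep only rows from those agencies; .copy() makes an independent frame.
agencies_df = ot_extra[ot_extra.agency_name.isin(top_three)].copy()
print(agencies_df.agency_name.value_counts())
```

The resulting `agencies_df` can be passed straight to `ggplot(...)`.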

In [None]:
agencies_df = pd.read_csv('three-agencies.csv')

In [None]:
agencies_df

In [None]:
agencies_df.query('regular_hours < 0')

Strangely, a bunch of values for regular hours are listed as negative. Moving forward, I'm focusing on rows with positive regular hours. Why the negatives exist is a reporting question that the dataset doesn't answer.
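
To see how much the `regular_hours > 0` filter removes, it can help to count negative, zero and positive rows separately, since the filter drops zeros as well as negatives. A small sketch on invented values:

```python
import pandas as pd

# Invented hours, only to illustrate the three buckets.
toy = pd.DataFrame({"regular_hours": [-35.0, 0.0, 12.5, 1820.0, -7.0]})

# Summing a boolean mask counts the True values.
negative = int((toy.regular_hours < 0).sum())
zero = int((toy.regular_hours == 0).sum())
positive = int((toy.regular_hours > 0).sum())

print(negative, zero, positive)
```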

In [None]:
chart = (
    ggplot(agencies_df.query('regular_hours > 0'),
        aes('ot_hours', 'regular_hours'))
        + geom_point(aes(color='work_location_borough'))
        + facet_wrap('agency_name')
        + theme(figure_size=(16, 5))
        + labs(
            title = "City employees who worked more overtime than regular hours for top three agencies, by borough",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)
chart.save("top_three_agencies.svg")
chart

#### Combining the plot into one chart

In [None]:
(
    ggplot(agencies_df.query('regular_hours > 0'),
        aes('ot_hours', 'regular_hours'))
        + geom_point(aes(color='work_location_borough', shape='agency_name'))
        # theme_bw() is a complete theme, so it goes first; otherwise it resets figure_size
        + theme_bw()
        + theme(figure_size=(8, 5))
        + labs(
            title = "City employees who worked more overtime than regular hours by top three agencies",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)

In [None]:
(
    ggplot(agencies_df.query('regular_hours > 0'),
        aes('ot_hours', 'regular_hours'))
        + geom_point(aes(color='agency_name'))
        + theme_bw()
        + theme(figure_size=(8, 5))
        + labs(
            title = "City employees who worked more overtime than regular hours by top three agencies",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)

In [None]:
(
    ggplot(agencies_df.query('regular_hours > 0'),
        aes('ot_hours', 'regular_hours'))
        + geom_point(aes(color='work_location_borough'))
        + theme_bw()
        + theme(figure_size=(8, 5))
        + labs(
            title = "City employees who worked more overtime than regular hours by work boroughs",
            y = "Regular hours worked",
            x = "Overtime hours"
        )
)

### 📓 Observation

Most of these employees work in Manhattan, and the Police Department and Children's Services have fairly similar numbers of employees. Let's see how much these employees' overtime cost the city.

But first, a quick refresher: 

The new dataframe `agencies_df` has employees who worked more overtime hours than regular hours, limited to the top three agencies.

In [None]:
agencies_df.head(10)

In [None]:
agencies_df.query('regular_hours > 0').groupby(by='work_location_borough').total_ot_paid.sum()

In [None]:
agencies_df.query('regular_hours > 0').groupby(by='work_location_borough').fiscal_year.value_counts()

In [None]:
agencies_df.query('regular_hours > 0').groupby(by='work_location_borough').last_name.nunique()

In [None]:
agencies_df.query('regular_hours > 0').work_location_borough.value_counts()

In [None]:
agencies_df.query('regular_hours > 0').nunique()

In [None]:
agencies_df.query('regular_hours > 0').groupby(by='work_location_borough').total_ot_paid.sum().plot(kind='barh')

In [None]:
agencies_df.query('regular_hours > 0').query('work_location_borough=="MANHATTAN"').agency_name.value_counts()
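
The cells above repeat the same `regular_hours > 0` filter each time. As a sketch, the filtered frame can be stored once and several per-borough numbers computed in a single `groupby(...).agg(...)`. The rows here are invented, and the output column names (`ot_paid`, `rows`, `unique_last_names`) are just illustrative labels.

```python
import pandas as pd

# Invented stand-in for agencies_df.
agencies_df = pd.DataFrame({
    "work_location_borough": ["MANHATTAN", "MANHATTAN", "QUEENS", "BRONX"],
    "last_name": ["DIAZ", "CHEN", "DIAZ", "OKAFOR"],
    "regular_hours": [100.0, 40.0, -5.0, 80.0],
    "total_ot_paid": [5000.0, 2500.0, 900.0, 1200.0],
})

# Apply the filter once and reuse it.
valid = agencies_df.query("regular_hours > 0")

# Named aggregation: one row per borough, one column per statistic.
summary = valid.groupby("work_location_borough").agg(
    ot_paid=("total_ot_paid", "sum"),
    rows=("last_name", "size"),
    unique_last_names=("last_name", "nunique"),
)
print(summary)
```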

In [None]:
# Sanity check -- ignore
agencies_df.fiscal_year.value_counts()

# 32 unique Manhattan employees who worked more overtime than regular hours made over $2.6 million in overtime, across just three agencies

### 📝 More observations📝

Let's see which of these employees have a leave status of "CEASED", presumably meaning they've retired or otherwise left. As before, let's only include employees with positive regular hours.

In [None]:
ceased_df = agencies_df.query('regular_hours > 0').query('leave_status_as_of_june_30 == "CEASED"')
ceased_df

#### Observation: 70 employees who worked more overtime than regular hours are no longer working. Let's take a closer look!

In [None]:
ceased_df.total_ot_paid.mean().round()

> Compare this to average overtime earned across all employees

In [None]:
df.total_ot_paid.mean().round()

# AHHA!

#### These 70 employees, who worked more overtime than regular hours, earned more in overtime on average than the average across all employees. And these 70 employees are no longer working.

#### Where did they work and for what agencies?

In [None]:
ceased_df.work_location_borough.value_counts()

In [None]:
ceased_df.agency_name.value_counts()

`Taking a closer look at Manhattan employees`

In [None]:
chart = (
    ggplot(ceased_df.query('work_location_borough == "MANHATTAN"'),
        aes('total_ot_paid', 'ot_hours'))
        + geom_point(aes(color='agency_name'))
        + theme_bw()
        + theme(figure_size=(8, 5))
        + labs(
            title = "Manhattan employees who worked more overtime than regular hours & are no longer employed",
            y = "Overtime hours worked",
            x = "Overtime paid"
        )
)

chart.save("manhattan.svg")
chart

In [None]:
ceased_df.query('work_location_borough == "MANHATTAN"').sort_values(by="ot_hours", ascending = False).head(3)

In [None]:
df.sort_values(by='total_ot_paid', ascending = False).head(5)

In [None]:
chart = (
    ggplot(df.sort_values(by='total_ot_paid', ascending = False).head(10),
        aes('total_ot_paid', 'ot_hours'))
        + geom_point(aes(shape='agency_name', color='work_location_borough'), size=4)
        + theme_bw()
        + theme(figure_size=(8, 5))
        + labs(
            title = "NYC Housing Authority and Health & Mental Hygiene employees made the most in overtime pay in 2021",
            y = "Overtime hours worked",
            x = "Overtime paid"
        )
)

chart.save("overall-high-paid.svg")
chart

In [None]:
chart = (
    ggplot(df,
        aes('fiscal_year', 'ot_hours'))
        + geom_point(aes(color='work_location_borough'), size=3)
        + theme_bw()
        + theme(figure_size=(8, 5))
        + labs(
            title = "Overtime hours for each year",
            x = "Year",
            y = "Overtime hours"
        )
)

# chart.save("year.svg")
chart
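
The chart above draws one point per payroll row. If the goal is a per-year picture, one option is to aggregate first and plot the totals instead. A hedged sketch of that aggregation step on invented rows:

```python
import pandas as pd

# Invented rows standing in for df.
toy = pd.DataFrame({
    "fiscal_year": [2020, 2020, 2021, 2021, 2021],
    "work_location_borough": ["BRONX", "QUEENS", "BRONX", "BRONX", "QUEENS"],
    "ot_hours": [10.0, 5.0, 20.0, 30.0, 0.0],
})

# Total overtime hours per fiscal year and borough, as a flat table.
yearly = (
    toy.groupby(["fiscal_year", "work_location_borough"])["ot_hours"]
       .sum()
       .reset_index()
)
print(yearly)
```

A table like `yearly` has one row per year and borough, which is easier to read as bars or lines than as a scatter.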

In [None]:
df.sort_values(by='total_ot_paid', ascending = False).fiscal_year.value_counts()

## Lots more possibilities, but here are some initial observations: a quick recap!

1. 1556 city employees (counted by unique last name) worked more overtime hours than regular hours between 2017 and 2021
2. Of these 1556 employees, 766 were from three agencies: Police, Children's Services, and Parks & Rec
3. 70 of these 766 employees no longer work for the city. They made nearly \\$12,500 in overtime on average, compared with an average of \\$3200 for all employees
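
To put the third point in proportion, here is the arithmetic using the two rounded figures quoted in the recap (the variable names are just labels for this sketch):

```python
# Rounded averages quoted in the recap above.
ceased_avg_ot = 12_500   # "nearly $12,500" for the 70 employees
overall_avg_ot = 3_200   # "$3200" for all employees

# How many times larger the group's average is.
ratio = ceased_avg_ot / overall_avg_ot
print(f"{ratio:.1f}x")
```

So, on these rounded numbers, the group's average overtime pay is close to four times the overall average.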

## -------

In [None]:
df.sort_values(by='total_ot_paid', ascending = False).head(10)

In [None]:
df.query('total_ot_paid > 200000')

In [None]:
df.query('total_ot_paid > 200000').agency_name.value_counts()

In [None]:
df.query('total_ot_paid > 200000').fiscal_year.value_counts()

In [None]:
import re
df.query('total_ot_paid > 100000').title_description.str.extractall(r'(.PLUMBER)',re.IGNORECASE).count()
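
A note on the pattern: in a regular expression the leading `.` matches any single character, so `.PLUMBER` only matches when something comes before the word. The sketch below compares it with the plain pattern on a few hypothetical titles (these are made up, not read from the file).

```python
import re
import pandas as pd

# Hypothetical titles, including a missing value.
titles = pd.Series(["PLUMBER", "PLUMBER'S HELPER", "SUPERVISOR PLUMBER", "ELECTRICIAN", None])

# With the leading dot: needs at least one character before PLUMBER.
with_dot = titles.str.contains(r".PLUMBER", flags=re.IGNORECASE, na=False)
# Without the dot: matches PLUMBER anywhere in the title.
without_dot = titles.str.contains(r"PLUMBER", flags=re.IGNORECASE, na=False)

print(with_dot.tolist())
print(without_dot.tolist())
```

Whether titles that start with PLUMBER should count is an editorial call; checking `df.title_description.value_counts()` for such titles would show how much it matters here.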

In [None]:
# Boolean mask for plumber titles; na=False treats missing titles as non-matches
is_plumber = df.title_description.str.contains('.PLUMBER', na=False)
df[is_plumber].agency_name.value_counts()

In [None]:
df.title_description.value_counts()

In [None]:
df[is_plumber]

In [None]:
df[is_plumber].total_ot_paid.mean()

In [None]:
is_electrician = df.title_description.str.contains('.ELECTRICIAN', na=False)
df[is_electrician].total_ot_paid.mean()

In [None]:
df[is_plumber].total_ot_paid.max()

In [None]:
df[is_plumber].total_ot_paid.min()

In [None]:
df[is_plumber].groupby(by='agency_name').total_ot_paid.sum()

In [None]:
df[is_plumber].groupby(by='agency_name').base_salary.sum()

In [None]:
df[is_plumber].groupby(['agency_name', 'fiscal_year']).base_salary.sum()

### NYCHA plumbers made \\$31,865 in base salary in total over five years. In overtime, they made over $7.6 million.
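
Before leaning on the base-salary total, it may be worth checking how `base_salary` is expressed for these rows. The sketch below assumes the file has a `pay_basis` column (not used anywhere else in this notebook) that says whether the figure is annual, daily or hourly; if so, summing `base_salary` across rows mixes units. All values are invented.

```python
import pandas as pd

# Invented rows; pay_basis labels and amounts are placeholders.
toy = pd.DataFrame({
    "pay_basis": ["per Annum", "per Day", "per Day", "per Hour"],
    "base_salary": [85000.0, 370.0, 370.0, 33.5],
    "total_ot_paid": [1000.0, 52000.0, 48000.0, 300.0],
})

# Summarize separately for each pay basis instead of summing everything together.
by_basis = toy.groupby("pay_basis").agg(
    rows=("base_salary", "size"),
    base_salary_sum=("base_salary", "sum"),
    ot_paid_sum=("total_ot_paid", "sum"),
)
print(by_basis)
```

In this toy example the "per Day" rows have a tiny base-salary sum next to a large overtime sum, simply because the base figure is a daily rate.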

In [None]:
df.total_ot_paid.sum().round()

In [None]:
df.groupby(by='fiscal_year').total_ot_paid.sum().mean().round()

In [None]:
df.groupby(by='fiscal_year').total_ot_paid.mean()

In [None]:
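# Sort first so head(10) picks the ten largest totals rather than the first ten agencies alphabetically
df.groupby(by='agency_name').total_ot_paid.sum().sort_values(ascending=False).head(10).plot(kind='barh')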

In [None]:
df.groupby(by='agency_name').total_ot_paid.sum().reset_index().sort_values(by='total_ot_paid', ascending = False).head(5)

In [None]:
# pd.set_option('display.max_rows', None)
df.groupby(by='agency_name').total_ot_paid.sum().reset_index().sort_values(by='total_ot_paid', ascending = False)

In [None]:
df.query('agency_name == "NYC HOUSING AUTHORITY"')