# Sunshine List gender analysis

*March 26, 2022*

I had an idea a few months ago to use the sunshine list to look at the gender pay gap. In Canada, there really isn't a whole lot of info on this, but the sunshine list might give us a good opportunity to see what it looks like in the public sector in Canada. The big problem: we don't have genders of the people on the sunshine list. Here's one way we could solve that problem.

First, we'll import pandas and set a global option to display floats with thousands separators and two decimal places (just to make things more readable).

In [68]:
import pandas as pd

pd.options.display.float_format = '{:,.2f}'.format
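
To see what that format string does on its own, here's a quick check in plain Python. The sample values are just salaries that show up later in this post; nothing here touches pandas.

```python
# The same format spec that pandas is told to use for floats.
fmt = '{:,.2f}'.format

# Commas group the thousands; two digits are kept after the decimal point.
formatted = [fmt(103022.13), fmt(249546.4), fmt(0.0)]
print(formatted)  # ['103,022.13', '249,546.40', '0.00']
```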

### Preparing the data

There's a wealth of sunshine list data going back all the way to 2012, so you might wonder why I don't include it all in this analysis. There are two reasons for that:

1. It's hard to compare money over the course of many years, as inflation is a thing.
2. There are likely many duplicate names that are very hard to remove from our dataset if we combine them all.

Therefore it's probably best to just use the latest data available. It's still quite robust!

In [36]:
raw = pd.read_csv("./raw/2021.csv")

Now let's do some cleaning. We'll make our column names all lower case for consistency, and we'll make all the strings in our dataset uppercase. This makes it easier to match the names up on joins later in the analysis.

In [37]:
data = raw.copy()

data.columns = data.columns.str.lower()

for label, content in data[["sector", "last name", "first name", "employer", "job title"]].items():
    data[label] = (data[label]
                        .str.upper()
                        .str.replace(r"\s(AND){1}\s", " & ", regex=True)  # " AND " -> " & "
                        .str.replace(r"\-", "–", regex=True)               # hyphen -> en dash
                        .str.replace("*", "", regex=False)                 # strip literal asterisks
                )
    
data.sample(5)

index,sector,last name,first name,salary,benefits,employer,job title,year,_docid
158007,SCHOOL BOARDS,DAVIS,GORICA,103022.13,0.0,KAWARTHA PINE RIDGE DISTRICT SCHOOL BOARD,ELEMENTARY TEACHER,2021,158007
238723,UNIVERSITIES,ROBERTSON,JEAN,214339.02,143.16,UNIVERSITY OF TORONTO,"DIRECTOR OF HUMAN RESOURCES, FACULTY OF MEDICINE",2021,238723
127873,ONTARIO POWER GENERATION,LESSARD,KEITH,108517.69,1367.76,ONTARIO POWER GENERATION,ELECTRICAL & CONTROL TECHNICIAN/TECHNOLOGIST,2021,127873
32400,GOVERNMENT OF ONTARIO – MINISTRIES,ORLANDO,SUSAN,249546.4,309.88,ATTORNEY GENERAL,DIRECTOR,2021,32400
119309,MUNICIPALITIES & SERVICES,VAN DER KRABBEN,STEVEN,130057.32,1037.3,CITY OF TORONTO – POLICE SERVICE,POLICE CONSTABLE,2021,119309
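
If you want to convince yourself what each replacement does, here's a minimal sketch that runs the same chain on three made-up strings (they're illustrative, not rows from the real file).

```python
import pandas as pd

# Toy values invented for illustration.
sample = pd.Series([
    "Government of Ontario - Ministries",
    "Municipalities and Services",
    "Hospitals*",
])

cleaned = (sample
           .str.upper()
           .str.replace(r"\s(AND){1}\s", " & ", regex=True)  # " AND " -> " & "
           .str.replace(r"\-", "–", regex=True)               # hyphen -> en dash
           .str.replace("*", "", regex=False)                 # strip literal asterisks
           )

print(cleaned.tolist())
```

Because the `AND` pattern needs whitespace on both sides, a name like ANDREW is left alone.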


Now we're going to read in a dataset of names, downloaded [here](https://data.world/howarder/gender-by-name).

In [38]:
names = pd.read_csv("name_gender.csv")

This dataset has two columns that are important to us: one with a first name, and another with an M or an F, signifying whether it's a male or female name. Now of course this approach is fraught with complications:

1. Some names are common for both men and women. The dataset actually includes a third "probability" column that lists the probability a name is male or female. We don't use it here, but this analysis could be refined by only using names that have a high enough probability.
2. Not everyone identifies as male or female, and a name is a somewhat poor way to identify someone's gender. Given that few other approaches exist, we're trying it this way anyway.
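
The refinement mentioned in point 1 might look something like the sketch below. It's not used in this analysis. The `probability` column name, the toy rows and the 0.95 cut-off are all assumptions made for illustration, so check them against the actual file before borrowing this.

```python
import pandas as pd

# Hypothetical rows shaped like the names file; the "probability" column
# name and every value here are assumptions, not real data.
names_demo = pd.DataFrame({
    "name": ["JASON", "DIANA", "JEAN", "LINDSAY"],
    "gender": ["M", "F", "F", "F"],
    "probability": [0.99, 0.99, 0.60, 0.90],
})

MIN_PROBABILITY = 0.95  # arbitrary cut-off, tune to taste

# Keep only names whose gender label is reasonably unambiguous.
confident_names = names_demo[names_demo["probability"].ge(MIN_PROBABILITY)]

print(confident_names["name"].tolist())
```

Any name dropped this way would simply fail to match in the merge and end up with no gender assigned.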

Now we're going to make the data in the name field uppercase so we don't have to worry about cases messing up our matching.

In [39]:

names["name"] = names["name"].str.upper()

We're also going to add another column to our dataset for comparing to the names database. We do this because several first names in the sunshine list data are followed by initials, which would stop an otherwise valid name from matching. We also have a line of code here that breaks double names (Mary Jane) into just the first part (Mary) so that we don't have to throw those names out.

In [40]:
data["first name_cleaned"] = (data["first name"]
                              .str.replace(r"\s+[A-Z]+\.+", "", regex=True)  # drop initials such as " A."
                              .str.upper()
                              .str.split(" ", n=1)                            # "MARY JANE" -> ["MARY", "JANE"]
                              .str[0]                                         # keep only the first part
                              )
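
For spot-checking individual names without loading the data, the same two steps can be written as a small plain-Python function. `clean_first_name` is a hypothetical helper, not part of the notebook; it's a sketch of the logic, not a replacement for the pandas version.

```python
import re

def clean_first_name(first_name: str) -> str:
    """Hypothetical helper mirroring the two cleaning steps on one string."""
    # Drop initials that follow the name, such as " A."
    without_initials = re.sub(r"\s+[A-Z]+\.+", "", first_name.upper())
    # Keep only the first of several space-separated names.
    return without_initials.split(" ", 1)[0]

examples = ["Mary Jane", "John A.", "Lindsay"]
cleaned_examples = [clean_first_name(n) for n in examples]
print(cleaned_examples)  # ['MARY', 'JOHN', 'LINDSAY']
```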

Now we merge the names dataset with the sunshine list data. Note we're joining on our newly cleaned first name column. I'm also calling `.loc` to ensure we only include the columns we want, and in the right order.

In [41]:
data = (data
        .merge(names, left_on='first name_cleaned', right_on="name", how="left")
        .drop_duplicates()
        .loc[:, ["first name", "last name", "gender", "sector", "job title", "employer", "salary", "benefits"]]
        )

We're also going to fill anything that doesn't match with "UNKNOWN".

In [42]:
data["gender"] = data["gender"].fillna("UNKNOWN")
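
One thing worth guarding against in a merge like this is a name appearing more than once in the lookup table, which would silently duplicate people. Here's a minimal sketch on toy frames (all values made up) using two standard `merge` options: `validate` to enforce one row per name, and `indicator` to flag which rows found a match.

```python
import pandas as pd

# Toy frames invented for illustration.
people = pd.DataFrame({"first name_cleaned": ["JASON", "DIANA", "QWERTY"]})
names_demo = pd.DataFrame({"name": ["JASON", "DIANA"], "gender": ["M", "F"]})

merged = people.merge(
    names_demo,
    left_on="first name_cleaned",
    right_on="name",
    how="left",
    validate="m:1",   # raises MergeError if a name appears twice in names_demo
    indicator=True,   # adds a "_merge" column: "both" or "left_only"
)

# Rows with no match in the lookup table get the placeholder label.
merged["gender"] = merged["gender"].fillna("UNKNOWN")
print(merged[["first name_cleaned", "gender", "_merge"]])
```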

Now let's see what it looks like.

In [43]:
data.sample(5)

index,first name,last name,gender,sector,job title,employer,salary,benefits
154667,JASON,CODE,M,SCHOOL BOARDS,SECONDARY TEACHER,DISTRICT SCHOOL BOARD OF NIAGARA,103595.85,0.0
114377,CHRISTOPHER,SIMOVIC,M,MUNICIPALITIES & SERVICES,OPERATIONS TECHNICIAN,CITY OF BRAMPTON,108991.85,434.34
11327,DIANA,HALLETT,F,CROWN AGENCIES,"DIRECTOR, TRANSPLANT SERVICES/DIRECTRICE, SERV...",ONTARIO HEALTH,112567.42,336.48
60097,RICHARD,PARADIS,M,HOSPITALS & BOARDS OF PUBLIC HEALTH,ANESTHESIA ASSISTANT,SINAI HEALTH SYSTEM,115413.61,314.4
191721,LINDSAY,MONTADOR,F,SCHOOL BOARDS,TEACHER,PEEL DISTRICT SCHOOL BOARD,103200.99,0.0


Now that our dataset is prepared, we can dive into the good stuff.

### Mean salary by sector

Because the sunshine list includes everyone in Ontario's public sector who makes $100,000 or more a year, we can find out a lot just by counting the men and women on the list. Let's start with that.

In [54]:
gender_counts = data[["gender", "first name"]].groupby("gender").count()

gender_counts

gender,first name
F,121163
M,105807
UNKNOWN,17420


There are actually more women than men on our list!

It's also useful to see how many names on our list were assigned a gender (as opposed to "UNKNOWN", the label we gave to null values). Of course, we should still do some manual spot-checking to judge how accurate the name-based gendering is, but this will give us a sense of how many names matched something in the names dataset.

In [57]:
(gender_counts.loc["F", :] + gender_counts.loc["M", :]) / gender_counts.sum() * 100

first name   92.87204877449977
dtype: float64
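
The same arithmetic spelled out in plain Python, using the three counts from the table above:

```python
# Counts copied from the gender_counts table.
counts = {"F": 121163, "M": 105807, "UNKNOWN": 17420}

total = sum(counts.values())
matched = total - counts["UNKNOWN"]
matched_share = matched / total * 100

print(f"{matched:,} of {total:,} rows matched ({matched_share:.2f}%)")
# 226,970 of 244,390 rows matched (92.87%)
```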

Roughly 93% of names in the sunshine list were assigned a gender from the names list! Not bad. Now, back to our analysis. Let's see the counts by sector.

In [50]:
data.pivot_table(index="sector", values="salary", columns="gender", aggfunc="count").sort_values("F", ascending=False)

sector,F,M,UNKNOWN
SCHOOL BOARDS,53392.0,22770.0,4272.0
HOSPITALS & BOARDS OF PUBLIC HEALTH,21351.0,6314.0,2906.0
MUNICIPALITIES & SERVICES,14000.0,36528.0,3307.0
UNIVERSITIES,9190.0,10919.0,2987.0
GOVERNMENT OF ONTARIO – MINISTRIES,7991.0,10991.0,1337.0
OTHER PUBLIC SECTOR EMPLOYERS,5080.0,3020.0,631.0
CROWN AGENCIES,4450.0,4490.0,884.0
COLLEGES,3637.0,3790.0,437.0
ONTARIO POWER GENERATION,1490.0,6440.0,590.0
GOVERNMENT OF ONTARIO – JUDICIARY,298.0,299.0,33.0


There are far more women than men on the list in the school board and hospital sectors, while men far outnumber women in the municipalities sector and at Ontario Power Generation.
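
Raw counts are easier to compare once they're turned into shares. Here's a minimal sketch using four rows copied from the counts table; `pct_women` is a column name chosen for this example, and the UNKNOWN column is left out of the denominator.

```python
import pandas as pd

# A few rows copied from the counts-by-sector table.
sector_counts = pd.DataFrame(
    {"F": [53392, 21351, 14000, 1490], "M": [22770, 6314, 36528, 6440]},
    index=["SCHOOL BOARDS", "HOSPITALS & BOARDS OF PUBLIC HEALTH",
           "MUNICIPALITIES & SERVICES", "ONTARIO POWER GENERATION"],
)

# Share of women among rows with an assigned gender, in percent.
sector_counts["pct_women"] = (
    sector_counts["F"] / (sector_counts["F"] + sector_counts["M"]) * 100
).round(1)

print(sector_counts.sort_values("pct_women", ascending=False))
```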

Before we continue, let's remove the seconded sectors, as they don't seem that interesting and we want to keep our tables readable.

In [None]:
data = data[~data["sector"].str.contains("SECONDED")]
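
Here's how that kind of filter behaves on a toy frame. The sector labels below are made up for illustration, and passing `na=False` is an optional extra (not in the cell above) that counts missing sector values as non-matches, so the mask is purely boolean.

```python
import pandas as pd

# Toy rows invented for illustration, including one missing sector.
demo = pd.DataFrame({"sector": ["SCHOOL BOARDS", "SECONDED (EDUCATION)", None]})

# True where the sector name contains "SECONDED"; missing values become False.
is_seconded = demo["sector"].str.contains("SECONDED", na=False)

# "~" flips the mask, keeping every row that is not seconded.
kept = demo[~is_seconded]
print(kept)
```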

Now, we'll take a look at the mean salaries for each sector.

In [69]:
(data
 .pivot_table(index="sector", values="salary", columns="gender", aggfunc="mean")
 .dropna()
 .rename(columns={"salary": "gendered"})
 )

sector,F,M,UNKNOWN
COLLEGES,119965.02,119909.56,116069.26
CROWN AGENCIES,128613.06,135230.03,126394.56
GOVERNMENT OF ONTARIO – JUDICIARY,239394.48,243610.83,209683.97
GOVERNMENT OF ONTARIO – LEGISLATIVE ASSEMBLY & OFFICES,138690.75,138861.28,132769.4
GOVERNMENT OF ONTARIO – MINISTRIES,130026.95,130673.86,123632.89
HOSPITALS & BOARDS OF PUBLIC HEALTH,119190.0,134527.42,125276.92
MUNICIPALITIES & SERVICES,122733.62,125364.4,121554.09
ONTARIO POWER GENERATION,145029.68,155543.36,149466.83
OTHER PUBLIC SECTOR EMPLOYERS,129350.55,140225.4,131233.5
SCHOOL BOARDS,106120.84,108193.1,106849.94


And the median salary too, which gives a sense of typical values without the big outliers skewing things.

In [70]:
(data
 .pivot_table(index="sector", values="salary", columns="gender", aggfunc="median")
 .dropna()
 .rename(columns={"salary": "gendered"})
 )

sector,F,M,UNKNOWN
COLLEGES,115378.12,115378.12,115354.01
CROWN AGENCIES,115918.25,120310.04,116419.38
GOVERNMENT OF ONTARIO – JUDICIARY,159266.54,267013.37,154173.92
GOVERNMENT OF ONTARIO – LEGISLATIVE ASSEMBLY & OFFICES,128859.79,129517.58,126242.05
GOVERNMENT OF ONTARIO – MINISTRIES,115786.98,119338.34,114129.41
HOSPITALS & BOARDS OF PUBLIC HEALTH,109873.87,114076.31,110821.33
MUNICIPALITIES & SERVICES,115780.48,119933.02,115544.7
ONTARIO POWER GENERATION,133665.06,144675.01,139950.98
OTHER PUBLIC SECTOR EMPLOYERS,117454.21,122479.24,115951.2
SCHOOL BOARDS,102766.24,103998.38,102766.24
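
To see why the median is less sensitive to outliers than the mean, here's a tiny example with made-up salaries: three typical values and one very large one.

```python
from statistics import mean, median

# Invented salaries for illustration only.
salaries = [101_000, 104_000, 107_000, 480_000]

mean_salary = mean(salaries)      # pulled upward by the single large salary
median_salary = median(salaries)  # midpoint of the two middle values

print(mean_salary, median_salary)  # 198000 105500.0
```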


### Mean salary by employer

Now let's break things down by employer instead. The analysis is similar to above.

In [71]:
employers = (data
 .pivot_table(index="employer", values="salary", columns="gender", aggfunc="count")
 .dropna()
 .rename(columns={"salary": "gendered"})
 .sort_values("M", ascending=False)
 )

employers

employer,F,M,UNKNOWN
ONTARIO POWER GENERATION,1477.00,6341.00,583.00
CITY OF TORONTO,2059.00,4716.00,570.00
ONTARIO PROVINCIAL POLICE,1331.00,3924.00,120.00
CITY OF TORONTO – POLICE SERVICE,1004.00,3661.00,360.00
TORONTO DISTRICT SCHOOL BOARD,6595.00,3111.00,843.00
...,...,...,...
CHILDREN’S AID SOCIETY OF THE UNITED COUNTIES OF STORMONT DUNDAS & GLENGARRY,18.00,1.00,2.00
SUDBURY DISTRICT NURSE PRACTITIONER CLINICS,5.00,1.00,1.00
ONTARIO PAROLE BOARD,2.00,1.00,1.00
ONTARIO SOCIETY OF PROFESSIONAL ENGINEERS,1.00,1.00,1.00


In [72]:
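# Names that were matched to a gender, per employer
employers["total_gendered"] = employers["M"] + employers["F"]
# Share of each employer's names left as UNKNOWN, in percent
employers["%_unknown"] = round(employers["UNKNOWN"] / (employers["UNKNOWN"] + employers["total_gendered"]) * 100, 2)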

In [74]:
# Only keep employers with more than 300 gendered names
included_orgs = employers[employers["total_gendered"].gt(300)].index

In [79]:
employer_avg = (data
    .loc[data["employer"].isin(included_orgs), :]
    .pivot_table(index="employer", values="salary", columns="gender", aggfunc="mean")
)

employer_avg

employer,F,M,UNKNOWN
ALGOMA DISTRICT SCHOOL BOARD,107740.80,110466.52,107627.53
ALGONQUIN & LAKESHORE CATHOLIC DISTRICT SCHOOL BOARD,107644.35,110302.79,109060.36
ALGONQUIN COLLEGE OF APPLIED ARTS & TECHNOLOGY,119323.42,117602.71,111351.31
ATTORNEY GENERAL,173546.32,185128.43,161273.08
AVON MAITLAND DISTRICT SCHOOL BOARD,106496.67,108493.00,107134.05
...,...,...,...
WINDSOR–ESSEX CATHOLIC DISTRICT SCHOOL BOARD,106972.66,108672.03,106841.77
WORKPLACE SAFETY & INSURANCE BOARD,113427.52,121038.77,118187.98
YORK CATHOLIC DISTRICT SCHOOL BOARD,105773.73,108995.56,105728.09
YORK REGION DISTRICT SCHOOL BOARD,106007.04,107414.67,106386.60


This time, let's also add a column with the difference between the mean salary for men and for women, then sort by that column.

In [81]:
employer_avg["diff"] = employer_avg["M"] - employer_avg["F"]

employer_avg.sort_values("diff", ascending=False).head()

employer,F,M,UNKNOWN,diff
THE HOSPITAL FOR SICK CHILDREN,134284.0,180429.0,164794.0,46145.0
SINAI HEALTH SYSTEM,125670.0,165116.0,133841.0,39446.0
GRAND RIVER HOSPITAL CORPORATION,117412.0,145796.0,131944.0,28384.0
HAMILTON HEALTH SCIENCES,118083.0,139945.0,129009.0,21862.0
MCMASTER UNIVERSITY,155117.0,176454.0,169482.0,21337.0
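
A dollar gap is harder to compare across employers with very different pay levels, so one possible follow-up is to express it as a percentage of the men's mean. This is a sketch, not part of the original analysis: it uses the top three rows from the table above, and `diff_pct` is a column name chosen here.

```python
import pandas as pd

# Mean salaries copied from the top three rows of the table.
gap = pd.DataFrame(
    {"F": [134284.0, 125670.0, 117412.0], "M": [180429.0, 165116.0, 145796.0]},
    index=["THE HOSPITAL FOR SICK CHILDREN", "SINAI HEALTH SYSTEM",
           "GRAND RIVER HOSPITAL CORPORATION"],
)

gap["diff"] = gap["M"] - gap["F"]
# Gap as a percentage of the men's mean salary.
gap["diff_pct"] = (gap["diff"] / gap["M"] * 100).round(1)

print(gap)
```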
