Special thanks to Derek Duin without whom these notebooks would not have been possible.
These notebooks are interactive and can be launched in a live environment.
In order to achieve the stated goal of a diverse work environment, we need to be able to produce quantifiable measures of diversity. The challenge is that indicators of diversity such as national origin, veteran status, gender, etc. are sensitive and not reported in available datasets.
However, for any population that we wish to analyze, we will always have, at a minimum, a first and last name.
In most cultures there exist 'masculine' and 'feminine' names, although no universal rule requires this. As a result, some names are strong predictors of sex, such as:
- Elizabeth
- Sarah
- John
- James
while others, such as:
- Casey
- Jessie
- Jordan
- Pat
are not strong predictors.
Based on our own experience, we are likely to agree with how the names above are grouped. But if our goal is to provide a quantifiable measure, intuition is not enough: we need a method for determining how strongly each name predicts sex.
Let's examine two popular approaches.
The categorical approach assigns each name to a category based on how strongly it tends to predict a sex. For instance, with two categories we may see:
- Male: John, James, Jordan
- Female: Sarah, Elizabeth, Casey, Jessie

With finer-grained categories we may see:
- Strongly Male: John, James
- Weakly Male: Jordan
- Ambiguous: Pat
- Strongly Female: Elizabeth, Sarah
- Weakly Female: Jessie
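A minimal sketch of how a categorical lookup could work, assuming a small hand-built table that copies the finer-grained example categories above. The names `FIVE_LEVEL`, `categorize`, and `tally` are hypothetical and are not part of Percy; a real table would be derived from a much larger name dataset.

```python
from collections import Counter

# Hypothetical lookup table copied from the example categories above.
FIVE_LEVEL = {
    "John": "Strongly Male",
    "James": "Strongly Male",
    "Jordan": "Weakly Male",
    "Pat": "Ambiguous",
    "Elizabeth": "Strongly Female",
    "Sarah": "Strongly Female",
    "Jessie": "Weakly Female",
}


def categorize(first_name):
    """Return the category for a first name, or 'Unknown' if it is not in the table."""
    # Normalize whitespace and capitalization before the lookup.
    return FIVE_LEVEL.get(first_name.strip().title(), "Unknown")


def tally(first_names):
    """Count how many names in a population fall into each category."""
    return Counter(categorize(name) for name in first_names)


# Illustrative population; "Alex" is not in the table and is counted as Unknown.
population = ["John", "sarah", "Jordan", "Pat", "Jessie", "Alex"]
counts = tally(population)
print(counts)
```

Note that every name ends up in exactly one bucket, so a name such as Jordan is counted entirely under "Weakly Male" regardless of how often it is actually given to women.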
The probabilistic approach assigns each name a numeric probability of belonging to a given sex. We may see:
- John: 0.05% Female
- Sarah: 99.5% Female
- Jordan: 26.0% Female
- Jessie: 60.2% Female
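As a rough illustration of what per-name probabilities make possible, the sketch below sums them to estimate the expected number of women in a group. It assumes the example percentages above, converted to fractions. The names `P_FEMALE` and `expected_female_count` are hypothetical, and skipping names that are missing from the table is an assumption made here for simplicity, not necessarily how Percy handles them.

```python
# Probability that a bearer of each name is female, taken from the
# example percentages above and expressed as fractions.
P_FEMALE = {
    "John": 0.0005,
    "Sarah": 0.995,
    "Jordan": 0.26,
    "Jessie": 0.602,
}


def expected_female_count(first_names, p_female=P_FEMALE):
    """Return (expected number of women, number of names found in the table).

    Names missing from the table are skipped (an assumption for this sketch).
    """
    known = [p_female[name] for name in first_names if name in p_female]
    return sum(known), len(known)


# Illustrative population; "Alex" is not in the table and is skipped.
population = ["John", "Sarah", "Jordan", "Jessie", "Alex"]
expected, n_known = expected_female_count(population)
share = expected / n_known  # estimated fraction of women among known names

print(f"Expected women: {expected:.4f} of {n_known} known names ({share:.1%})")
```

Unlike a categorical assignment, each name here contributes fractionally to the total, so ambiguous names such as Jordan are neither forced into one group nor discarded.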
Percy's diversity analysis is based on probabilistic data. The reasoning will become apparent later.