# $\color{purple}{\text{Classification of Names by Gender}}$

**By Alexander Ng & Philip Tanofsky**

*April 3, 2022*



## $\color{blue}{\text{Overview}}$

We classify over 6000 given names by gender using machine learning, guided by linguistic theory.  The goal of this study is to predict the gender of a name $X$ given nothing more than a training set of other names.   To construct useful predictors, we analyze the names corpus both orthographically and phonetically.  The resulting predictors extract additional information that we expect to be necessary for accurate gender classification.  

We review the background literature on the plausibility of a phonological approach to gender classification, then describe our data set including the derived predictors and an exploratory analysis.  Lastly, the model results are presented.

## $\color{blue}{\text{Theoretical Background}}$

A body of research addresses the prediction of the gender of first names from phonological considerations.   [Slepian and Galinsky (2016)](http://www.columbia.edu/~ms4992/Pubs/2016_Slepian-Galinsky_JPSP.pdf) show that voiced initial phonemes in first names are more associated with male names, while unvoiced initial phonemes are more associated with female names.   [Sullivan and Kang (2019)](http://www.assta.org/proceedings/ICPhS2019/papers/ICPhS_2173.pdf) conducted field experiments with nonsense names that follow known phonological patterns to confirm a gender preference in names.   [Whissell (2013)](https://doi.org/10.1179/nam.2001.49.2.97) notes that the predictability of gender from names is supported by the linguistic theory of *sound symbolism*.

**Sound symbolism** argues that the sound of a word is associated with the word's referent.  "Sound patterns in words are both non-random and informative." [Whissell (2013)](https://doi.org/10.1179/nam.2001.49.2.97)   These studies challenge the prevailing theory that the relationship between a word and its meaning is arbitrary.  Under that prevailing view, this arbitrariness is a fundamental aspect of language. 

The key patterns observed in names and their gender associations are:

*  **Diversity of names**  Female names are more diverse than male ones.

*  **Number of Syllables**  Female names have more syllables than male ones.

*  **Final Syllable** Female names are more likely to end in a *schwa* or open vowel.

*  **Stress Placement**  Female names are more likely to have stress on a syllable after the first one.

*  **Fewer Consonants**  Male names have fewer consonants.

*  **Voiced Consonant**  Male names have more initial voiced consonants.

These findings enable us to construct additional predictors to guide the classification models.  An open question is whether these effects work in concert to increase prediction accuracy.


## $\color{blue}{\text{Data and Methods}}$

To obtain the phonemes, syllables, and stress placement of each name, we need an automated way to look up and encode the pronunciation of the entire names corpus.   We used the CMU Pronouncing Dictionary as the backend source of pronunciations, accessed through the `pronouncing` Python package as the front end.   Together they produce a string representation of the phonemes of each name and can also give the number of syllables and the stress placement.   The phonemes are encoded in a convention called `Arpabet`, which was developed under ARPA (the agency now known as DARPA) for speech recognition research.   For example, the `schwa` is encoded as the two-letter string `AH`, which appears as `AH0` once the CMU stress digit is attached.   The article [Arpabet](https://en.wikipedia.org/wiki/ARPABET) gives more details of the encoding.  Arpabet represents a subset of IPA (the International Phonetic Alphabet) and is geared toward American English.
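
As a rough illustration of how phonetic predictors can be read off an Arpabet string, the sketch below parses a space-separated phoneme string in the CMU style, where each vowel phoneme carries a trailing stress digit (0 for unstressed, 1 for primary, 2 for secondary). The function name `phoneme_features`, the output keys, and the example string are assumptions made for this illustration: the string is hand-written in the CMU format rather than looked up, and the code is not the study's actual feature pipeline.

```python
import re


def phoneme_features(phones: str) -> dict:
    """Derive phonetic predictors from a non-empty, space-separated Arpabet string."""
    tokens = phones.split()
    # Vowel phonemes end in a stress digit, so counting them counts syllables.
    vowels = [t for t in tokens if t[-1].isdigit()]
    # 1-based index of the syllable carrying primary stress (digit 1), if any.
    primary = next(
        (i for i, v in enumerate(vowels, start=1) if v.endswith("1")), None
    )

    def strip_stress(token: str) -> str:
        # Remove the stress digit so that AH0 and AH1 both map to AH.
        return re.sub(r"\d", "", token)

    return {
        "first_phoneme": strip_stress(tokens[0]),
        "last_phoneme": strip_stress(tokens[-1]),
        "n_phonemes": len(tokens),
        "n_syllables": len(vowels),
        "primary_stress_syllable": primary,
        "ends_in_schwa": tokens[-1] == "AH0",  # unstressed AH is the schwa
    }


# Illustrative (hypothetical) pronunciation string in CMU format.
features = phoneme_features("R AH0 B ER1 T AH0")
```

Whether stress digits should be stripped from the individual phoneme predictors is a design choice; this sketch strips them and records stress separately.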



The biggest challenge was that the `pronouncing` library covers only 66% of the nltk names corpus.  The CMU dictionary is a fixed lookup table: it makes no attempt to guess the pronunciation of a word it does not list, so even many relatively simple names have no entry.

Thus, we also use the `Double Metaphone` sound encoding algorithm (based on its ancestor `Soundex`) to construct an alphanumeric representation of each name in a simple string format.  This `dmeta` encoding gives us an additional way to obtain pronunciation data.  Because the encoding discards vowels that are not at the start of the name, we do not expect Double Metaphone to be useful in all cases.   For example, ROBERT and ROBERTA map to the same `dmeta` result.

In total, we construct the following predictors:
    
*  first letter
*  second letter
*  last letter
*  word length
*  number of consonants (non-phonemically computed)
*  number of vowels (aeiou only - not phonemically derived)
*  Double metaphone encoding of the name
*  phonemes 1 through 12 (each name can have up to 12 phonemes)
*  phoneme string (the raw string from CMU)
*  first phoneme
*  number of phonemes
*  number of syllables
*  the syllable at which the primary stress is pronounced
*  the last phoneme of the name
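
The orthographic predictors in this list need nothing beyond the spelling of the name. A minimal sketch follows; the function name `orthographic_features` and the output keys are hypothetical, and the choice to count every non-vowel letter as a consonant (so `y` counts as a consonant) is an assumption based on the "aeiou only" note above.

```python
VOWELS = set("aeiou")  # "aeiou only", as stated in the predictor list


def orthographic_features(name: str) -> dict:
    """Compute spelling-based predictors for a non-empty name."""
    # Keep alphabetic characters only, so a hyphen is never counted as a consonant.
    letters = [c for c in name.lower() if c.isalpha()]
    n_vowels = sum(c in VOWELS for c in letters)
    return {
        "first_letter": letters[0],
        "second_letter": letters[1] if len(letters) > 1 else "",
        "last_letter": letters[-1],
        "length": len(name),  # raw word length, including any hyphen
        "n_vowels": n_vowels,
        "n_consonants": len(letters) - n_vowels,
    }


example = orthographic_features("Roberta")
```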


To ensure reproducibility of the study, we seed the random number generator before performing the train-test split.
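
The idea can be sketched with the standard library alone. The helper `seeded_split`, the 20% test fraction, and the seed value 42 are all assumptions for illustration; the study's actual splitting routine and parameters are not stated here.

```python
import random


def seeded_split(items, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then cut into (train, test) lists."""
    rng = random.Random(seed)  # private generator: global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


names = ["Ada", "Bob", "Cleo", "Dan", "Eve", "Finn", "Gia", "Hal", "Ivy", "Jon"]
train, test = seeded_split(names)
# Calling again with the same seed reproduces the same partition.
train_again, test_again = seeded_split(names)
```

Using a dedicated `random.Random` instance rather than the module-level generator keeps the split reproducible even if other code draws random numbers in between.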

To deal with the discrepancy between the complete set of names and the partial set that has full phonemic data, we run two separate analyses.

The first analysis uses only the basic features, which do not require the CMU pronouncing library, and covers the complete names corpus.   We call this the basic feature set.

The second analysis uses the entire set of columns (including the phonological features) but is limited to the 66% of the original names corpus that has pronunciation data.   We call this the advanced feature set.
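
One way to picture the two data sets is as two views of the same table: every row restricted to the spelling-based columns, versus only the rows that have a pronunciation, with all columns kept. The sketch below uses plain dictionaries; the column names, the `build_feature_sets` helper, and the two toy rows are made up for illustration and are not drawn from the study's data.

```python
# Hypothetical column names for the spelling-based subset.
BASIC_COLUMNS = ["name", "gender", "first_letter", "last_letter", "length"]


def build_feature_sets(rows, basic_columns=BASIC_COLUMNS, phoneme_key="phoneme_string"):
    """Return (basic, advanced) views of a list of feature dictionaries."""
    # Basic: all names, but only the columns that need no pronunciation data.
    basic = [{col: row[col] for col in basic_columns} for row in rows]
    # Advanced: all columns, but only names for which a pronunciation was found.
    advanced = [row for row in rows if row.get(phoneme_key)]
    return basic, advanced


# Toy rows: the second name stands in for one missing from the dictionary.
rows = [
    {"name": "Roberta", "gender": "female", "first_letter": "r",
     "last_letter": "a", "length": 7, "phoneme_string": "R AH0 B ER1 T AH0"},
    {"name": "Zsazsa", "gender": "female", "first_letter": "z",
     "last_letter": "a", "length": 6, "phoneme_string": None},
]
basic, advanced = build_feature_sets(rows)
coverage = len(advanced) / len(rows)  # fraction of names with phonemic data
```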


