Experiments about gender detection from names





damegender is a gender detection tool coded by David Arroyo Menéndez (DAME).


  • If you want to determine the gender gap in free software projects or mailing lists.
  • If you don’t know the gender of a name in Spanish or English (current status).
  • If you want to research, with statistics, why a name is associated with males or females.
  • If you want to use the main gender detection solutions (genderize, genderapi, namsor, nameapi and gender guesser) from a command.

damegender is for you!


Installing Software

$ sudo apt-get install python3-nose-exclude dict dict-freedict-eng-spa dict-freedict-spa-eng dictd
$ git clone https://github.com/davidam/damegender
$ cd damegender
$ pip3 install -r requirements.txt

Obtaining an api key

Some of the supported API services currently require an API key.

You can execute:

$ python3 apikeyadd.py

to configure your API key.

Configuring nltk

$ python3
>>> import nltk
>>> nltk.download('names')

Running tests

All tests

$ nosetests3 test

Single test

$ nosetests3 test/test_dame_sexmachine.py:TddInPythonExample.test_string2array_method_returns_correct_result

Execute program

# Detect gender from a name
$ python3 main.py David
$ python3 main.py David --ine="yes"
david gender is male
363.559  males for david from INE.es
0 females for david from INE.es

# Detect gender from a name only using machine learning (experimental way)
$ python3 main.py David --ml=multinomialNB
# Count gender from a csv example file
$ python3 csv2gender.py files/partial.csv
The number of males in files/partial.csv is 16
The number of females in files/partial.csv is 3
The number of gender not recognised in files/partial.csv is 2
# Count gender from a git repository
$ python3 git2gender.py https://github.com/chaoss/grimoirelab-perceval.git --directory="/tmp/clonedir"
The number of males sending commits is 15
The number of females sending commits is 7
# Count gender from a mailing list
$ cd files
$ wget -c http://mail-archives.apache.org/mod_mbox/httpd-announce/201706.mbox
$ cd ..
$ python3 mail2gender.py http://mail-archives.apache.org/mod_mbox/httpd-announce/
# Use an api to detect the gender
$ python3 api2gender.py Leticia --api=namsor
scale: 0.99
# Give me informative features
$ python3 infofeatures.py
Females with last letter a: 0.4705246078961601
Males with last letter a: 0.048672566371681415
Females with last letter consonant: 0.2735841767750908
Males with last letter consonant: 0.6355328972681801
Females with last letter vocal: 0.7262612995441552
Males with last letter vocal: 0.3640823393612928
# To measure success
$ python3 accuracy.py --csv=files/min.csv
################### Namsor!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 1, 1, 0, 0]
Namsor accuracy: 0.875
################### Genderize!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
Genderize accuracy: 1
################### GenderGuesser!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
GenderGuesser accuracy: 0.875
################### Sexmachine!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
Sexmachine accuracy: 0.875
$ python3 confusion.py
A confusion matrix C is such that Ci,j is equal to the number of observations known to be in group i but predicted to be in group j.
If the classifier is nice, the diagonal is high because there are true positives
Namsor confusion matrix:
 [[ 3  0  0]
 [ 0 16  0]
 [ 0  2  0]]
Sexmachine confusion matrix:
 [[ 2  1  0]
 [ 2 14  0]
 [ 1  1  0]]

# To analyze errors guessing names from a csv
$ python3 errors.py --csv="files/all.csv" --api="genderguesser"
Gender Guesser with files/all.csv has:
+ The error code: 0.22564457518601835
+ The error code without na: 0.026539047204698716
+ The na coded: 0.20453365634192766
+ The error gender bias: 0.0026103980857080703

# To display a graph of the correlation between variables
$ python3 corr.py
$ python3 corr.py --csv="categorical"
$ python3 corr.py --csv="nocategorical"
# To create the pickle models in files directory
$ python3 damemodels.py

Statistics for damegender

Some theory can help to understand some of the commands.

Errors and Confusion Matrix

When guessing the sex, we have a true value (example: female) and a guessed value, the result we obtain (example: female). We have written count_true2guess to build statistical variables from these pairs. In the variable names used below, the first word is the true gender and the second is the guessed one, so femalemale counts the true females guessed as male.

In the confusion matrix literature, we can find this vocabulary for true and guess:

True positive    False positive
False negative   True negative

Precision is the ratio of true positives to the sum of true positives and false positives:

(self.femalefemale + self.malemale ) / (self.femalefemale + self.malemale + self.femalemale)

Recall is the ratio of true positives to the sum of true positives and false negatives:

(self.femalefemale + self.malemale ) / (self.femalefemale + self.malemale + self.malefemale)

The F1 score is the harmonic mean of precision and recall taking both metrics into account in the following equation:

2 * ((precision * recall) / (precision + recall))
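
As a rough illustration, the three formulas above can be evaluated in a few lines of Python. This is a minimal sketch, not damegender's own code: the function names are hypothetical, and the counts are read from the Sexmachine confusion matrix shown earlier (2 females guessed female, 1 female guessed male, 2 males guessed female, 14 males guessed male), assuming the first word of each variable is the true gender.

```python
def precision(femalefemale, malemale, femalemale):
    # Correct guesses over correct guesses plus females guessed as male
    return (femalefemale + malemale) / (femalefemale + malemale + femalemale)


def recall(femalefemale, malemale, malefemale):
    # Correct guesses over correct guesses plus males guessed as female
    return (femalefemale + malemale) / (femalefemale + malemale + malefemale)


def f1_score(precision_value, recall_value):
    # Harmonic mean of precision and recall
    return 2 * ((precision_value * recall_value) / (precision_value + recall_value))


# Counts assumed from the Sexmachine confusion matrix in this README
p = precision(femalefemale=2, malemale=14, femalemale=1)
r = recall(femalefemale=2, malemale=14, malefemale=2)
f1 = f1_score(p, r)

print("precision:", round(p, 4))
print("recall:", round(r, 4))
print("f1:", round(f1, 4))
```

Keeping each formula in its own function makes it easy to compare the output against the expressions written above.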

Error coded is the proportion of cases where the true value differs from the guessed one, counting the undefined guesses as errors:

(self.femalemale + self.malefemale + self.maleundefined + self.femaleundefined) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + self.maleundefined + self.femaleundefined)

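Error coded without na is the proportion of cases where the true value differs from the guessed one, leaving out the undefined results:

(self.femalemale + self.malefemale) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale)

Na coded is the proportion of results guessed as undefined:

(self.maleundefined + self.femaleundefined) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + self.maleundefined + self.femaleundefined)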

Error gender bias shows whether the error is larger when guessing males than when guessing females, or vice versa:

(self.malefemale - self.femalemale) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale)

The weighted error also measures how often the true value differs from the guessed one, but gives a weight w to the results guessed as undefined:

(self.femalemale + self.malefemale + w * (self.maleundefined + self.femaleundefined)) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + w * (self.maleundefined + self.femaleundefined))
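
The error measures can be sketched together as below. This is only an illustration under stated assumptions: the `Counts` class and its method names are hypothetical, the counts are made up, and "error coded without na" is assumed to drop the undefined guesses from both the numerator and the denominator. The na coded measure (the share of undefined guesses, as reported by errors.py) is included for completeness.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    # First word: true gender; second word: guessed gender
    femalefemale: int
    femalemale: int
    femaleundefined: int
    malefemale: int
    malemale: int
    maleundefined: int

    def defined(self):
        # Guesses that returned male or female
        return self.malemale + self.femalemale + self.malefemale + self.femalefemale

    def undefined(self):
        # Guesses that returned no gender
        return self.maleundefined + self.femaleundefined

    def error_coded(self):
        wrong = self.femalemale + self.malefemale + self.undefined()
        return wrong / (self.defined() + self.undefined())

    def error_coded_without_na(self):
        return (self.femalemale + self.malefemale) / self.defined()

    def na_coded(self):
        return self.undefined() / (self.defined() + self.undefined())

    def error_gender_bias(self):
        return (self.malefemale - self.femalemale) / self.defined()

    def weighted_error(self, w):
        wrong = self.femalemale + self.malefemale + w * self.undefined()
        return wrong / (self.defined() + w * self.undefined())


# Hypothetical counts, for illustration only
counts = Counts(femalefemale=40, femalemale=5, femaleundefined=5,
                malefemale=3, malemale=42, maleundefined=5)

print("error coded:", counts.error_coded())
print("error coded without na:", counts.error_coded_without_na())
print("na coded:", counts.na_coded())
print("error gender bias:", counts.error_gender_bias())
print("weighted error (w=0.2):", counts.weighted_error(0.2))
```

With w = 1 the weighted error equals the error coded, and with w = 0 it equals the error coded without na.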

The confusion matrix tabulates the true values (rows) against the guessed values (columns). If you have this confusion matrix:

[[ 2, 0, 0]
 [ 0, 5, 0]]

It means there are 2 true females, all guessed as female, and 5 true males, all guessed as male. The classifier makes no errors.

[[ 2  1  0]
 [ 2 14  0]]

It means 2 true females were guessed as female and 14 true males were guessed as male; 1 female was considered male and 2 males were considered female.
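
To make the reading of rows and columns concrete, here is a minimal pure-Python sketch that builds a confusion matrix and the accuracy from two lists. The lists are the Namsor ones printed by accuracy.py earlier in this README; the helper names are hypothetical, and the label coding (0 female, 1 male, 2 unknown) is an assumption inferred from the sample outputs.

```python
def confusion_matrix(true_labels, guessed_labels, n_classes=3):
    # matrix[i][j] counts observations known to be in group i
    # but guessed to be in group j
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, guess in zip(true_labels, guessed_labels):
        matrix[true][guess] += 1
    return matrix


def accuracy(true_labels, guessed_labels):
    # Share of observations where the guess equals the true label
    hits = sum(t == g for t, g in zip(true_labels, guessed_labels))
    return hits / len(true_labels)


# Assumed coding: 0 = female, 1 = male, 2 = unknown
gender_list = [1, 1, 1, 1, 2, 1, 0, 0]
guess_list = [1, 1, 1, 1, 1, 1, 0, 0]

matrix = confusion_matrix(gender_list, guess_list)
for row in matrix:
    print(row)
print("accuracy:", accuracy(gender_list, guess_list))
```

The correct guesses sit on the diagonal, so the accuracy is the sum of the diagonal divided by the number of observations.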



The dispersion measures for a single variable are: variance, standard deviation, …


If you have 2 variables, you can write a formula very similar to the variance: the covariance.


If you have 3 or more variables, you can write a covariance matrix.
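
The relation between these measures can be sketched in plain Python. This is an illustrative example with made-up data and hypothetical helper names; it uses the sample formulas (dividing by n - 1), which is an assumption, since the text does not say which variant is meant.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    # Sample covariance: average product of the deviations from the means
    mean_x, mean_y = mean(xs), mean(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def variance(xs):
    # The variance is the covariance of a variable with itself
    return covariance(xs, xs)


def covariance_matrix(variables):
    # Entry (i, j) holds the covariance between variable i and variable j
    return [[covariance(a, b) for b in variables] for a in variables]


# Made-up data: three variables with four observations each
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
z = [4.0, 3.0, 2.0, 1.0]

print("variance of x:", variance(x))
print("covariance of x and y:", covariance(x, y))
for row in covariance_matrix([x, y, z]):
    print(row)
```

The diagonal of the covariance matrix holds the variances, and the matrix is symmetric because the covariance of x and y equals the covariance of y and x.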


In essence, an eigenvector v of a linear transformation T is a non-zero vector that, when T is applied to it, does not change direction. Applying T to the eigenvector only scales the eigenvector by the scalar value λ, called an eigenvalue.


A feature vector is constructed taking the eigenvectors that you want to keep from the list of eigenvectors.

The new data set is obtained by taking the transpose of the feature vector and multiplying it on the left of the original data set, transposed:

FinalData = RowFeatureVector x RowDataAdjust
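
The steps above (covariance matrix, eigenvectors, feature vector, projection) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not damegender's implementation: the data set is illustrative, the variable names mirror the formula above, and `np.linalg.eigh` is chosen because a covariance matrix is symmetric.

```python
import numpy as np

# Illustrative data: 10 observations (rows) of 2 variables (columns)
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# Subtract the mean of each variable
data_adjust = data - data.mean(axis=0)

# Covariance matrix of the variables (columns)
cov = np.cov(data_adjust, rowvar=False)

# Eigen-decomposition, sorted from largest to smallest eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the k eigenvectors with the largest eigenvalues, one per row
k = 1
row_feature_vector = eigenvectors[:, :k].T   # shape (k, n_variables)
row_data_adjust = data_adjust.T              # shape (n_variables, n_observations)

# FinalData = RowFeatureVector x RowDataAdjust
final_data = row_feature_vector @ row_data_adjust

print("eigenvalues:", eigenvalues)
print("final data shape:", final_data.shape)
```

The variance of each row of `final_data` equals the eigenvalue of the eigenvector it was projected on, which is why the eigenvectors with the largest eigenvalues are the ones worth keeping.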

Choosing components

We can choose components with:

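import argparse

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# --csv selects the input file; the default path is an assumption
# and may need adjusting to your own data
parser = argparse.ArgumentParser()
parser.add_argument('--csv', default='files/features_list.csv',
                    help='CSV file with a header row and numeric features')
args = parser.parse_args()

# Load the CSV; the header row is read as NaN and skipped below
data = np.genfromtxt(args.csv, delimiter=',', dtype='float64')

# Rescale the first 8 feature columns to [0, 1], skipping the header row
scaler = MinMaxScaler(feature_range=(0, 1))
data_rescaled = scaler.fit_transform(data[1:, 0:8])

# Fitting the PCA algorithm with our data
pca = PCA().fit(data_rescaled)

# Plotting the cumulative summation of the explained variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # for each component
plt.title('Dataset Explained Variance')
plt.show()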


Looking at the plot of the cumulative explained variance, we can choose 6 components.
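
Instead of reading the number off a plot, the choice can also be made with a threshold on the cumulative explained variance. The sketch below is only an illustration: the function name is hypothetical, the ratios are made-up values (with scikit-learn they would come from `pca.explained_variance_ratio_`), and the 90% threshold is an arbitrary assumption.

```python
from itertools import accumulate


def components_for_threshold(ratios, threshold):
    # Smallest number of components whose cumulative explained
    # variance reaches the threshold
    for n_components, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return n_components
    return len(ratios)


# Hypothetical explained variance ratios for 8 components
ratios = [0.35, 0.20, 0.15, 0.10, 0.08, 0.07, 0.03, 0.02]

n_components = components_for_threshold(ratios, threshold=0.90)
print("components needed:", n_components)
```

A higher threshold keeps more of the variance at the cost of more components, so the value is a trade-off to tune for each data set.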