Experiments about gender detection from names

README.org

Logo

files/gender.png

Name

damegender is a gender detection tool written by David Arroyo Menéndez (DAME).

Why?

  • If you want to determine the gender gap in free software projects or mailing lists.
  • If you don’t know the gender of a name in Spanish or English (the languages currently supported).
  • If you want to research, with statistics, why a name is associated with males or females.
  • If you want to use the main gender detection solutions (genderize, genderapi, namsor, nameapi and gender guesser) from the command line.

DAMe Gender is for you!

Install

Installing Software

$ sudo apt-get install python3-nose-exclude dict dict-freedict-eng-spa dict-freedict-spa-eng dictd
$ git clone https://github.com/davidam/damegender
$ cd damegender
$ pip3 install -r requirements.txt

Obtaining an api key

Some of the supported services may currently require an API key.

To configure your API key, run:

$ python3 apikeyadd.py

Configuring nltk

$ python3
>>> import nltk
>>> nltk.download('names')

Check test

All tests

$ nosetests3 test

Single test

$ nosetests3 test/test_dame_sexmachine.py:TddInPythonExample.test_string2array_method_returns_correct_result

Execute program

# Detect gender from a name
$ python3 main.py David
male
$ python3 main.py David --ine="yes"
david gender is male
363.559  males for david from INE.es
0 females for david from INE.es

# Detect gender from a name only using machine learning (experimental way)
$ python3 main.py David --ml=multinomialNB
male
# Count gender from a csv example file
$ python3 csv2gender.py files/partial.csv
The number of males in files/partial.csv is 16
The number of females in files/partial.csv is 3
The number of gender not recognised in files/partial.csv is 2
# Count gender from a git repository
$ python3 git2gender.py https://github.com/chaoss/grimoirelab-perceval.git --directory="/tmp/clonedir"
The number of males sending commits is 15
The number of females sending commits is 7
# Count gender from a mailing list
$ cd files
$ wget -c http://mail-archives.apache.org/mod_mbox/httpd-announce/201706.mbox
$ cd ..
$ python3 mail2gender.py http://mail-archives.apache.org/mod_mbox/httpd-announce/
# Use an api to detect the gender
$ python3 api2gender.py Leticia --api=namsor
female
scale: 0.99
# Give me informative features
$ python3 infofeatures.py
Females with last letter a: 0.4705246078961601
Males with last letter a: 0.048672566371681415
Females with last letter consonant: 0.2735841767750908
Males with last letter consonant: 0.6355328972681801
Females with last letter vocal: 0.7262612995441552
Males with last letter vocal: 0.3640823393612928
# To measure success
$ python3 accuracy.py --csv=files/min.csv
files/min.csv
################### Namsor!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 1, 1, 0, 0]
0.875
Namsor accuracy: 0.875
################### Genderize!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
Genderize accuracy: 1
################### GenderGuesser!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
GenderGuesser accuracy: 0.875
################### Sexmachine!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list:  [1, 1, 1, 1, 2, 1, 0, 0]
Sexmachine accuracy: 0.875
$ python3 confusion.py
A confusion matrix C is such that Ci,j is equal to the number of observations known to be in group i but predicted to be in group j.
If the classifier is nice, the diagonal is high because there are true positives
Namsor confusion matrix:
 [[ 3  0  0]
 [ 0 16  0]
 [ 0  2  0]]
Sexmachine confusion matrix:
 [[ 2  1  0]
 [ 2 14  0]
 [ 1  1  0]]

# To analyze errors guessing names from a csv
$ python3 errors.py --csv="files/all.csv" --api="genderguesser"
Gender Guesser with files/all.csv has:
+ The error code: 0.22564457518601835
+ The error code without na: 0.026539047204698716
+ The na coded: 0.20453365634192766
+ The error gender bias: 0.0026103980857080703

# To display a graph of the correlation between variables
$ python3 corr.py
$ python3 corr.py --csv="categorical"
$ python3 corr.py --csv="nocategorical"
# To create the pickle models in files directory
$ python3 damemodels.py

Statistics for damegender

Some theory is useful for understanding some of the commands.

Errors and Confusion Matrix

When guessing gender, each name has a true value (for example, female) and a guessed value returned by the tool (for example, female). We have written count_true2guess to compute statistics from these pairs.

In the confusion matrix literature, the following vocabulary is used for the true and guessed values:

True positive     False positive
False negative    True negative

Precision is the number of true positives divided by the sum of true positives and false positives:

(self.femalefemale + self.malemale) / (self.femalefemale + self.malemale + self.femalemale)

Recall is the number of true positives divided by the sum of true positives and false negatives:

(self.femalefemale + self.malemale) / (self.femalefemale + self.malemale + self.malefemale)

The F1 score is the harmonic mean of precision and recall, taking both metrics into account in the following equation:

2 * ((precision * recall) / (precision + recall))
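
As a minimal sketch, the three formulas above can be applied to raw pair counts as follows. The function name and the sample counts are hypothetical; the argument names mirror the attributes used in the formulas, and reading femalemale as "true female, guessed male" is an assumption.

```python
def precision_recall_f1(femalefemale, malemale, femalemale, malefemale):
    """Apply the README formulas to (true, guess) pair counts."""
    hits = femalefemale + malemale
    precision = hits / (hits + femalemale)
    recall = hits / (hits + malefemale)
    f1 = 2 * ((precision * recall) / (precision + recall))
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(
    femalefemale=2, malemale=14, femalemale=1, malefemale=2
)
print(precision, recall, f1)
```

With these counts, precision is 16/17, recall is 16/18 and F1 is 32/35.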

Error coded is the proportion of cases where the true value differs from the guessed value, counting undefined guesses as errors:

(self.femalemale + self.malefemale + self.maleundefined + self.femaleundefined) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + self.maleundefined + self.femaleundefined)

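Error coded without na is the proportion of cases where the true value differs from the guessed value, leaving out the undefined results:

(self.femalemale + self.malefemale) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale)

Na coded is the proportion of results guessed as undefined:

(self.maleundefined + self.femaleundefined) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + self.maleundefined + self.femaleundefined)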

Error gender bias shows whether the error is larger when guessing males than when guessing females, or vice versa:

(self.malefemale - self.femalemale) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale)

The weighted error is the proportion of cases where the true value differs from the guessed value, with a weight w applied to the results guessed as undefined:

(self.femalemale + self.malefemale + w * (self.maleundefined + self.femaleundefined)) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + w * (self.maleundefined + self.femaleundefined))
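
The error measures can be sketched together in one function. This is not the damegender implementation: the function name, the default weight w=0.2 and the sample counts are hypothetical. "Error coded without na" is computed here over defined results only, and "na coded" is the share of undefined results; this reading is consistent with the three values printed by errors.py in the example output above (0.2256, 0.0265 and 0.2045).

```python
def error_measures(malemale, femalemale, malefemale, femalefemale,
                   maleundefined, femaleundefined, w=0.2):
    """Return the error statistics for (true, guess) pair counts."""
    wrong = femalemale + malefemale               # defined but incorrect guesses
    undefined = maleundefined + femaleundefined   # guesses returned as undefined
    defined = malemale + femalemale + malefemale + femalefemale
    return {
        "error_coded": (wrong + undefined) / (defined + undefined),
        "error_coded_without_na": wrong / defined,
        "na_coded": undefined / (defined + undefined),
        "error_gender_bias": (malefemale - femalemale) / defined,
        "weighted_error": (wrong + w * undefined) / (defined + w * undefined),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
measures = error_measures(
    malemale=14, femalemale=1, malefemale=2, femalefemale=2,
    maleundefined=1, femaleundefined=0,
)
for name, value in measures.items():
    print(name, value)
```

With w=1 the weighted error equals error coded, and with w=0 it equals error coded without na.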

The confusion matrix tabulates the true values (rows) against the guessed values (columns). If you have this confusion matrix:

[[ 2, 0, 0]
 [ 0, 5, 0]]

it means that there are 2 true females, all guessed as female, and 5 true males, all guessed as male. The classifier makes no errors.

With this confusion matrix:

[[ 2  1  0]
 [ 2 14  0]]

2 true females were guessed as female and 14 true males were guessed as male, but 1 female was classified as male and 2 males were classified as female.
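
A confusion matrix like the ones above can be built from a list of true values and a list of guesses. The sketch below uses the Namsor lists printed by accuracy.py in the example output. It assumes the encoding 0 = female, 1 = male, 2 = undefined, which is inferred from the example matrices, and the function names are hypothetical rather than part of damegender.

```python
def confusion_matrix(true_list, guess_list, n_labels=3):
    """Count pairs: cell [i][j] holds the cases with true label i guessed as j."""
    matrix = [[0] * n_labels for _ in range(n_labels)]
    for true, guess in zip(true_list, guess_list):
        matrix[true][guess] += 1
    return matrix


def accuracy(true_list, guess_list):
    """Proportion of positions where the guess equals the true value."""
    hits = sum(1 for t, g in zip(true_list, guess_list) if t == g)
    return hits / len(true_list)


gender_list = [1, 1, 1, 1, 2, 1, 0, 0]
guess_list = [1, 1, 1, 1, 1, 1, 0, 0]

matrix = confusion_matrix(gender_list, guess_list)
for row in matrix:
    print(row)
# [2, 0, 0]
# [0, 5, 0]
# [0, 1, 0]
print(accuracy(gender_list, guess_list))  # 0.875
```

The diagonal holds the correct guesses, so the accuracy is the sum of the diagonal divided by the number of names.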

PCA

Concepts

The dispersion measures for a single variable are the variance, the standard deviation, …

files/variance.png

If you have 2 variables, you can write a formula very similar to the variance: the covariance.

files/covariance.png

If you have 3 variables or more, you can write a covariance matrix.

files/matrix-covariance.png
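
The three concepts can be sketched in plain Python. The formulas in the images are not reproduced here, so the sample convention (dividing by n - 1) is an assumption, and the function names and data are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance (divides by n - 1); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def covariance_matrix(columns):
    """Cell [i][j] is the covariance between variable i and variable j."""
    return [[covariance(a, b) for b in columns] for a in columns]


# Hypothetical data: y grows with x, z shrinks as x grows.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
z = [4.0, 3.0, 2.0, 1.0]

for row in covariance_matrix([x, y, z]):
    print(row)
```

The diagonal of the matrix holds the variances, and the matrix is symmetric because covariance(x, y) equals covariance(y, x).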

In essence, an eigenvector v of a linear transformation T is a non-zero vector that, when T is applied to it, does not change direction. Applying T to the eigenvector only scales the eigenvector by the scalar value λ, called an eigenvalue.

files/eigenvector.png

A feature vector is constructed taking the eigenvectors that you want to keep from the list of eigenvectors.

The new dataset is obtained by taking the transpose of the feature vector and multiplying it on the left of the transposed original dataset.

FinalData = RowFeatureVector x RowDataAdjust
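
The steps above can be sketched with NumPy. The data is random and the number of kept components is arbitrary; the variable names follow the formula, and the sketch is an illustration of the technique rather than code from damegender.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))  # hypothetical data: 50 rows, 3 variables

# RowDataAdjust: mean-centred data, transposed so that each row is one variable
row_data_adjust = (data - data.mean(axis=0)).T

# Covariance matrix of the variables (np.cov treats rows as variables)
cov = np.cov(row_data_adjust)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Keep the eigenvectors with the largest eigenvalues
order = np.argsort(eigenvalues)[::-1]
keep = 2

# RowFeatureVector: the kept eigenvectors, one per row
row_feature_vector = eigenvectors[:, order[:keep]].T

final_data = row_feature_vector @ row_data_adjust
print(final_data.shape)  # (2, 50)
```

Each row of final_data is one principal component. The components are uncorrelated, and their variances are the kept eigenvalues.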

Choosing components

We can choose components with:

import argparse

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Read the path of the features file from the command line
parser = argparse.ArgumentParser()
parser.add_argument('--csv', required=True)  # for example, files/features_list.csv
args = parser.parse_args()

data = np.genfromtxt(args.csv, delimiter=',', dtype='float64')

# Skip the first row (assumed to be the header), keep the first 8 columns
# and rescale them to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
data_rescaled = scaler.fit_transform(data[1:, 0:8])

# Fit the PCA algorithm with our data, keeping all components
pca = PCA().fit(data_rescaled)

# Plot the cumulative sum of the explained variance ratio
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (ratio)')
plt.title('Dataset Explained Variance')
plt.show()

files/pca-number-components.png

Looking at the image, we can choose 6 components.
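
The README does not state the criterion used to read the number of components from the plot. One common heuristic, assumed here, is to keep the smallest number of components whose cumulative explained variance reaches a threshold such as 95%. The ratios below are made up for illustration and do not come from the damegender data.

```python
from itertools import accumulate


def components_for(explained_variance_ratio, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total >= threshold:
            return k
    return len(explained_variance_ratio)


# Hypothetical ratios, sorted in decreasing order as PCA returns them.
ratios = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01]
print(components_for(ratios))  # 6
```

With a fitted scikit-learn PCA, the same function could be called on pca.explained_variance_ratio_.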