damegender is a gender detection tool coded by David Arroyo Menéndez (DAME).
- If you want to determine the gender gap in free software projects or mailing lists.
- If you don’t know the gender of a name in Spanish or English (current status).
- If you want to research, with statistics, why a name is associated with males or females.
- If you want to use the main solutions in gender detection (genderize, genderapi, namsor, nameapi and gender guesser) from a command.
DAMe Gender is for you!
$ sudo apt-get install python3-nose-exclude dict dict-freedict-eng-spa dict-freedict-spa-eng dictd
$ git clone https://github.com/davidam/damegender
$ cd damegender
$ pip3 install -r requirements.txt
Obtaining an API key
Currently you may need an API key for some of the supported services.
To configure your API key, you can execute:
$ python3 apikeyadd.py
$ python3
>>> import nltk
>>> nltk.download('names')
$ nosetests3 test
$ nosetests3 test/test_dame_sexmachine.py:TddInPythonExample.test_string2array_method_returns_correct_result
# Detect gender from a name
$ python3 main.py David
male
$ python main.py David --ine="yes"
david gender is male
363.559 males for david from INE.es
0 females for david from INE.es

# Detect gender from a name only using machine learning (experimental way)
$ python3 main.py David --ml=multinomialNB
male

# Count gender from a csv example file
$ python3 csv2gender.py files/partial.csv
The number of males in files/partial.csv is 16
The number of females in files/partial.csv is 3
The number of gender not recognised in files/partial.csv is 2

# Count gender from a git repository
$ python3 git2gender.py https://github.com/chaoss/grimoirelab-perceval.git --directory="/tmp/clonedir"
The number of males sending commits is 15
The number of females sending commits is 7

# Count gender from a mailing list
$ cd files
$ wget -c http://mail-archives.apache.org/mod_mbox/httpd-announce/201706.mbox
$ cd ..
$ python3 mail2gender.py http://mail-archives.apache.org/mod_mbox/httpd-announce/

# Use an api to detect the gender
$ python3 api2gender.py Leticia --api=namsor
female
scale: 0.99

# Give me informative features
$ python3 infofeatures.py
Females with last letter a: 0.4705246078961601
Males with last letter a: 0.048672566371681415
Females with last letter consonant: 0.2735841767750908
Males with last letter consonant: 0.6355328972681801
Females with last letter vocal: 0.7262612995441552
Males with last letter vocal: 0.3640823393612928

# To measure success
$ python3 accuracy.py --csv=files/min.csv files/min.csv
################### Namsor!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list: [1, 1, 1, 1, 1, 1, 0, 0]
0.875
Namsor accuracy: 0.875
################### Genderize!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list: [1, 1, 1, 1, 2, 1, 0, 0]
Genderize accuracy: 1
################### GenderGuesser!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list: [1, 1, 1, 1, 2, 1, 0, 0]
GenderGuesser accuracy: 0.875
################### Sexmachine!!
Gender list: [1, 1, 1, 1, 2, 1, 0, 0]
Guess list: [1, 1, 1, 1, 2, 1, 0, 0]
Sexmachine accuracy: 0.875

$ python3 confusion.py
A confusion matrix C is such that Ci,j is equal to the number of observations known to be in group i but predicted to be in group j.
If the classifier is nice, the diagonal is high because there are true positives
Namsor confusion matrix:
[[ 3  0  0]
 [ 0 16  0]
 [ 0  2  0]]
Sexmachine confusion matrix:
[[ 2  1  0]
 [ 2 14  0]
 [ 1  1  0]]

# To analyze errors guessing names from a csv
$ python3 errors.py --csv="files/all.csv" --api="genderguesser"
Gender Guesser with files/all.csv has:
+ The error code: 0.22564457518601835
+ The error code without na: 0.026539047204698716
+ The na coded: 0.20453365634192766
+ The error gender bias: 0.0026103980857080703

# To deploy a graph about correlation between variables
$ python3 corr.py
$ python3 corr.py --csv="categorical"
$ python3 corr.py --csv="nocategorical"

# To create the pickle models in files directory
$ python3 damemodels.py
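The accuracy figures printed above can be read as the share of positions where the gender list and the guess list agree. The sketch below is only an illustration of that reading, not the code of accuracy.py; the function name `accuracy` is hypothetical, and the two lists are copied from the Namsor output.

```python
def accuracy(true_vector, guess_vector):
    """Return the fraction of positions where the guess equals the true label."""
    if len(true_vector) != len(guess_vector):
        raise ValueError("both lists must have the same length")
    if not true_vector:
        raise ValueError("the lists must not be empty")
    hits = sum(1 for t, g in zip(true_vector, guess_vector) if t == g)
    return hits / len(true_vector)


# Lists taken from the Namsor output shown above.
gender_list = [1, 1, 1, 1, 2, 1, 0, 0]
guess_list = [1, 1, 1, 1, 1, 1, 0, 0]

namsor_accuracy = accuracy(gender_list, guess_list)
print(namsor_accuracy)  # 7 of the 8 positions agree: 0.875
```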
Statistics for damegender
Some theory can be useful to understand some of the commands.
Errors and Confusion Matrix
When guessing the sex, we have a true value (for example, female) and we obtain a guessed result (for example, female). We have written count_true2guess to compute the statistical variables about these pairs.
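As a rough illustration of counting (true, guess) pairs, the sketch below tallies how often a given true label meets a given guess. The signature of count_true2guess is not shown here, so the helper `count_pairs` is hypothetical. The coding 0 = female, 1 = male, 2 = undefined is an assumption inferred from the example outputs in this document.

```python
# Assumed coding, for illustration only.
FEMALE, MALE, UNDEFINED = 0, 1, 2


def count_pairs(true_vector, guess_vector, true_value, guess_value):
    """Count positions where the true label is true_value and the guess is guess_value."""
    return sum(
        1
        for t, g in zip(true_vector, guess_vector)
        if t == true_value and g == guess_value
    )


# Hypothetical data: three true females followed by four true males.
true_vector = [0, 0, 0, 1, 1, 1, 1]
guess_vector = [0, 0, 1, 1, 1, 0, 2]

femalefemale = count_pairs(true_vector, guess_vector, FEMALE, FEMALE)
femalemale = count_pairs(true_vector, guess_vector, FEMALE, MALE)
malemale = count_pairs(true_vector, guess_vector, MALE, MALE)
malefemale = count_pairs(true_vector, guess_vector, MALE, FEMALE)
maleundefined = count_pairs(true_vector, guess_vector, MALE, UNDEFINED)

print(femalefemale, femalemale, malemale, malefemale, maleundefined)
```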
In the confusion matrix literature, we can find this vocabulary for true and guess:
| True positive  | False positive |
| False negative | True negative  |
Precision is the number of true positives divided by the sum of true positives and false positives:
(self.femalefemale + self.malemale) / (self.femalefemale + self.malemale + self.femalemale)
Recall is the number of true positives divided by the sum of true positives and false negatives:
(self.femalefemale + self.malemale) / (self.femalefemale + self.malemale + self.malefemale)
The F1 score is the harmonic mean of precision and recall taking both metrics into account in the following equation:
2 * ((precision * recall) / (precision + recall))
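The three formulas above can be tried on concrete counts. The sketch below is a minimal illustration that assumes `femalemale` counts names whose true gender is female and whose guess is male (and the reverse for `malefemale`). The function names and the counts are hypothetical and are not damegender's own API.

```python
def precision(femalefemale, malemale, femalemale):
    """Precision as written in this section."""
    return (femalefemale + malemale) / (femalefemale + malemale + femalemale)


def recall(femalefemale, malemale, malefemale):
    """Recall as written in this section."""
    return (femalefemale + malemale) / (femalefemale + malemale + malefemale)


def f1_score(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    return 2 * ((precision_value * recall_value) / (precision_value + recall_value))


# Hypothetical counts: 2 females and 14 males guessed correctly,
# 1 female guessed as male, 2 males guessed as female.
p = precision(femalefemale=2, malemale=14, femalemale=1)   # 16 / 17
r = recall(femalefemale=2, malemale=14, malefemale=2)      # 16 / 18
f1 = f1_score(p, r)                                        # 32 / 35

print(round(p, 4), round(r, 4), round(f1, 4))
```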
Error coded is the proportion of names whose true gender differs from the guessed one, counting undefined guesses as errors:
(self.femalemale + self.malefemale + self.maleundefined + self.femaleundefined) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + self.maleundefined + self.femaleundefined)
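Error coded without na is the proportion of names whose true gender differs from the guessed one, leaving out the undefined results:
(self.femalemale + self.malefemale) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale)
Na coded is the proportion of names guessed as undefined:
(self.maleundefined + self.femaleundefined) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + self.maleundefined + self.femaleundefined)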
Error gender bias shows whether the error is larger when guessing males than when guessing females, or vice versa:
(self.malefemale - self.femalemale) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale)
The weighted error is the proportion of names whose true gender differs from the guessed one, giving a weight w to the names guessed as undefined:
(self.femalemale + self.malefemale + w * (self.maleundefined + self.femaleundefined)) / (self.malemale + self.femalemale + self.malefemale + self.femalefemale + w * (self.maleundefined + self.femaleundefined))
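The error measures of this section can be computed side by side from six counts. This is a sketch under stated assumptions, not damegender's implementation. The function name and the counts are hypothetical. Error coded without na is taken here as the misclassified names over the names with a defined guess, and na coded as the share of undefined guesses.

```python
def error_measures(mm, fm, mf, ff, mu, fu, w=0.2):
    """Return the error measures for the given counts.

    The first letter is the true gender and the second the guess
    (m = male, f = female, u = undefined); w is the weight for undefined guesses.
    """
    defined = mm + fm + mf + ff
    total = defined + mu + fu
    return {
        "error_coded": (fm + mf + mu + fu) / total,
        "error_coded_without_na": (fm + mf) / defined,
        "na_coded": (mu + fu) / total,
        "error_gender_bias": (mf - fm) / defined,
        "weighted_error": (fm + mf + w * (mu + fu)) / (defined + w * (mu + fu)),
    }


# Hypothetical counts, for illustration only.
measures = error_measures(mm=14, fm=1, mf=2, ff=2, mu=3, fu=1)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

With `w=1` the weighted error equals the error coded, and with `w=0` it equals the error coded without na, which is a quick way to check the formulas against each other.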
The confusion matrix tabulates the true values (rows) against the guessed values (columns). If you have this confusion matrix:
[[ 2, 0, 0]
 [ 0, 5, 0]]
It means that there are 2 true females, all guessed as female, and 5 true males, all guessed as male. The classifier has made no errors.
If you have this other confusion matrix:
[[ 2  1  0]
 [ 2 14  0]]
It means that 2 females were correctly guessed as female and 14 males were correctly guessed as male, while 1 female was considered male and 2 males were considered female.
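A confusion matrix like the ones above can be built by counting (true, guess) pairs. The sketch below uses plain Python and is not the code of confusion.py. The helper name is hypothetical, and the coding 0 = female, 1 = male, 2 = undefined is an assumption inferred from the examples.

```python
def build_confusion_matrix(true_vector, guess_vector, row_labels=(0, 1), col_labels=(0, 1, 2)):
    """Rows are true labels, columns are guessed labels."""
    row_index = {label: i for i, label in enumerate(row_labels)}
    col_index = {label: j for j, label in enumerate(col_labels)}
    matrix = [[0] * len(col_labels) for _ in row_labels]
    for t, g in zip(true_vector, guess_vector):
        matrix[row_index[t]][col_index[g]] += 1
    return matrix


# Hypothetical data: 3 true females followed by 16 true males.
true_vector = [0, 0, 0] + [1] * 16
# 2 females guessed female, 1 guessed male; 2 males guessed female, 14 guessed male.
guess_vector = [0, 0, 1] + [0, 0] + [1] * 14

matrix = build_confusion_matrix(true_vector, guess_vector)
for row in matrix:
    print(row)
```

The diagonal holds the correct guesses, so a classifier without errors leaves every off-diagonal cell at zero.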
The dispersion measures for a single variable are: variance, standard deviation, …
If you have 2 variables, you can write a formula very similar to the variance: the covariance.
If you have 3 variables or more, you can write a covariance matrix.
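The three cases above (one, two, and three or more variables) can be compared with NumPy. This is a minimal sketch with made-up values; it uses the sample versions of the formulas (division by n - 1), which is also the default of `np.cov`.

```python
import numpy as np

# Three made-up variables with four observations each.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
z = np.array([8.0, 6.0, 4.0, 2.0])

# One variable: variance and standard deviation.
variance_x = np.var(x, ddof=1)
std_x = np.std(x, ddof=1)

# Two variables: covariance, written like the variance but mixing both deviations.
covariance_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Three or more variables: covariance matrix, one row per variable.
cov_matrix = np.cov(np.vstack([x, y, z]))

print(variance_x, std_x, covariance_xy)
print(cov_matrix)
```

The diagonal of the covariance matrix holds the variances, and the negative entries for `z` reflect that it decreases while `x` and `y` increase.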
In essence, an eigenvector v of a linear transformation T is a non-zero vector that, when T is applied to it, does not change direction. Applying T to the eigenvector only scales the eigenvector by the scalar value λ, called an eigenvalue.
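The defining property, T v = λ v, can be checked numerically. The sketch below uses a small symmetric matrix chosen only for illustration and `np.linalg.eigh`, which is intended for symmetric matrices such as covariance matrices.

```python
import numpy as np

# A small symmetric matrix standing in for the linear transformation T.
T = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns the eigenvalues in ascending order and the eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(T)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Applying T only rescales v by its eigenvalue; the direction is unchanged.
    print(lam, np.allclose(T @ v, lam * v))
```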
A feature vector is constructed taking the eigenvectors that you want to keep from the list of eigenvectors.
To obtain the new dataset, take the transpose of the feature vector and multiply it on the left of the original (mean-adjusted) data set, transposed.
FinalData = RowFeatureVector x RowDataAdjust
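The product above can be reproduced step by step with NumPy. This is a sketch on toy data, assuming that RowDataAdjust is the mean-centred data with one variable per row and that RowFeatureVector holds the kept eigenvectors as rows; the variable names are illustrative.

```python
import numpy as np

# Toy data: 10 observations (rows) of 2 variables (columns).
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# Subtract the mean of each variable.
data_adjust = data - data.mean(axis=0)

# Covariance matrix of the variables and its eigen-decomposition.
cov_matrix = np.cov(data_adjust, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort the eigenvectors by decreasing eigenvalue and keep the first ones.
order = np.argsort(eigenvalues)[::-1]
n_components = 1
feature_vector = eigenvectors[:, order[:n_components]]

# FinalData = RowFeatureVector x RowDataAdjust
row_feature_vector = feature_vector.T   # shape: (components, variables)
row_data_adjust = data_adjust.T         # shape: (variables, observations)
final_data = row_feature_vector @ row_data_adjust

print(final_data.shape)
```

The variance of the projected data equals the eigenvalue of the kept component, which is why the largest eigenvalues are the ones worth keeping.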
We can choose components with:
import argparse

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

parser = argparse.ArgumentParser()
parser.add_argument('--csv')  # for example: files/features_list.csv
args = parser.parse_args()

# Load the CSV file as floats
data = np.genfromtxt(args.csv, delimiter=',', dtype='float64')

# Rescale the first 8 columns to the [0, 1] range, skipping the first row
scaler = MinMaxScaler(feature_range=(0, 1))
data_rescaled = scaler.fit_transform(data[1:, 0:8])

# Fitting the PCA algorithm with our data
pca = PCA().fit(data_rescaled)

# Plotting the cumulative summation of the explained variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # for each component
plt.title('Dataset Explained Variance')
plt.show()
Looking at the resulting plot, we can choose 6 components.
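Reading the number of components off the plot can also be done numerically, by taking the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below is only an illustration: the helper name, the threshold, and the variance ratios are hypothetical and are not the values obtained from damegender's data.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of components whose cumulative variance reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value that is >= threshold.
    first_index = int(np.searchsorted(cumulative, threshold))
    # Cap the result in case rounding keeps the total just below the threshold.
    return min(first_index + 1, len(cumulative))


# Hypothetical ratios, standing in for pca.explained_variance_ratio_.
ratios = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.01, 0.01]

n_components = components_for_variance(ratios, threshold=0.97)
print(n_components)
```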