Gender Prediction Methods Based on Name #36

Hamedloghmani · 2023-01-14T00:00:11Z

Hello @gabrielrueda,
You have been added to this repo, so from now on you can log and discuss your progress here as well as on Trello.

gabrielrueda · 2023-01-14T15:44:26Z

Update

Here is an update: I researched how to detect gender based on name. First, I found that there are a few APIs that detect gender, but they have request limits that would not be practical for us. Then I looked into three articles, labeled as Methods 1-3 in the PDF. In the PDF I provide some rough notes from each of the articles. Methods 1 and 3 take a similar approach, while Method 2 is different.

For preprocessing, Methods 1 and 3 split each name into characters and assign a number to each possible character. Method 2 uses a count vectorizer: it takes substrings of a specified size (e.g. 2-4 chars) and collects all such substrings that occur in the dataset. Then, for each name, it counts the occurrences of each substring.

For the model, Methods 1 and 3 use a bidirectional LSTM. Method 2 uses logistic regression. From what I read, logistic regression would be more lightweight but less accurate. However, I am not too familiar with these and will probably do more research on this.

Also, there are possible differences in patterns for names from different countries. We could possibly train different models for a few different countries, but I am not sure how practical this would be.

We can also account for names that are used interchangeably between two genders (e.g. Alex) and ignore those.

Let me know if you have any suggestions.

Notes

Method 1:

https://towardsdatascience.com/boy-or-girl-a-machine-learning-web-app-to-detect-gender-from-name-16dc0331716c

Input Modifications:

lowercase
split each character
pad empty spaces to make all names same length
encode characters to numbers
- space = 0, a = 1, b = 2, ...
Encode gender (F to 0 and M to 1)
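
The input modifications listed above could be sketched roughly as follows. This is only an illustration of the idea, not the article's code: the helper name `encode_name`, the fixed `max_len`, padding on the right, and mapping any character outside a-z to 0 are all assumptions.

```python
import string
from typing import List

# Assumed vocabulary: space (padding) = 0, then a = 1, b = 2, ..., z = 26.
CHAR_TO_ID = {" ": 0}
CHAR_TO_ID.update({c: i for i, c in enumerate(string.ascii_lowercase, start=1)})

# Gender labels as described in the notes: F -> 0, M -> 1.
GENDER_TO_ID = {"F": 0, "M": 1}


def encode_name(name: str, max_len: int = 10) -> List[int]:
    """Lowercase, cut or pad with spaces to max_len, and map characters to numbers."""
    cleaned = name.lower()[:max_len].ljust(max_len)
    # Characters outside a-z are treated like padding (an assumption of this sketch).
    return [CHAR_TO_ID.get(ch, 0) for ch in cleaned]


print(encode_name("Chris", max_len=8))  # [3, 8, 18, 9, 19, 0, 0, 0]
print(GENDER_TO_ID["M"])                # 1
```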

NLP Model:

Embedding Layer

to embed each input character's encoded number into a dense 256 dimension vector.
embedding is a method used to represent discrete variables as continuous vectors

Bidirectional LSTM Layer

read the seq of character embeddings from the previous step and output a single vector representing that sequence
The values for units and dropouts are hyperparameters as well

Dense Layer

outputs single value close to 0 as 'F'
- close to 1 for 'M'
Not sure if we should also have threshold
- options for kind of male or kind of female

Method 2:

https://pub.towardsai.net/predicting-name-gender-from-notebook-to-production-99e51d2aabd7

Generate Features

Count Vectorizer: a way to build vocabulary and features from a corpus automatically
- frequency of substrings in a certain string
Example (2-4) char count vectorizer for "Chris" (lowercased)
- ch
- hr
- ri
- is
- chr
- hri
- ris
- chri
- hris
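
As a rough illustration of what the 2-4 character count vectorizer produces for a single name, here is a plain-Python sketch. The function name `char_ngram_counts` is made up for this example; in practice scikit-learn's `CountVectorizer` with `analyzer="char"` and `ngram_range=(2, 4)` would likely do the same job over a whole dataset.

```python
from collections import Counter


def char_ngram_counts(name: str, min_n: int = 2, max_n: int = 4) -> Counter:
    """Count every substring of length min_n..max_n in the lowercased name."""
    text = name.lower()
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


counts = char_ngram_counts("Chris")
# Sort by length, then alphabetically, for a readable listing.
print(sorted(counts, key=lambda s: (len(s), s)))
# ['ch', 'hr', 'is', 'ri', 'chr', 'hri', 'ris', 'chri', 'hris']
```

Each name then becomes a vector of these counts over the vocabulary of all substrings seen in the dataset.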

For These few names:

Logistic Regression

lightweight model
- if runtime is more important than accuracy -> this is a good option
other options: Decision Trees, Neural Networks, SVMs
input: the frequency of the repetition of the sub-strings defined above
- Example:

Improvements Suggested:

87% accuracy -> ways to improve this could be
- training one model per country -> feminine/masculine patterns may differ between countries
- vocabulary size could be too large for count vectorizer

Method 3:

https://maelfabien.github.io/machinelearning/NLP_7/#
working with dataset from France

Data Preprocess

removes accents from letters

LSTM

also uses an LSTM -> character-level LSTM
uses a bidirectional LSTM

Hamedloghmani · 2023-01-14T17:38:29Z

Hi Gabriel,

Thank you so much for the update.

Great explanation. I'll take a deeper look at the methods that you kindly suggested and send you an update then we'll choose between these methods.

Thanks a lot.

Hamedloghmani · 2023-01-15T05:09:57Z

Hello @hosseinfani , @gabrielrueda
I read and thought about the methods. Based on the pros and cons and their evaluation results, I think the third method is more accurate and practical.
However, there are some significant concerns about using machine learning to predict gender based on first names:

• We need a massive dataset of names from around the world with very little or no bias in the distribution of names. This is because our target dataset (for example, DBLP) covers authors from all around the world, and a model that is fitted on a specific region or country will perform considerably worse in the prediction phase.

• The best method offered around 89% accuracy with potential signs of overfitting. Even if we round that up to 90%, it is still not sufficient in our case. Even on a small scale, we would have about 10,000 misses out of 100,000. This is critical because we will conduct research on “fairness and bias” based on the results from this model and make assumptions from them. I think this will be a loose end if we continue based on noisy or faulty predictions.

To put it in a nutshell, I think paid query-based APIs will be our best option at the moment. As Gabriel kindly mentioned, there are a few, and I looked into 2 popular options:

The pricing seems reasonable, especially the second one.

I would be happy to hear your thoughts about my opinion.

hosseinfani · 2023-01-16T05:32:18Z

The second one worked pretty well on some samples. It also accepts a country.

Honestly, we can try both and make a vote.

How much will they cost us for dblp, imdb, gith?

for uspt, we have it already I believe.

Hamedloghmani · 2023-01-16T06:15:42Z

@hosseinfani

DBLP v12 has 4894081 records
IMDB title.basics.tsv.gz has 6321302 records
They sum up to 11,215,383 records. I think we have to get the 99$/month plan.

I don't think this method will be effective on GitHub, though, because I examined some data and too many people use nicknames or similar as their name. But if we count that, we might have around 1,000,000 there.

hosseinfani · 2023-01-16T06:29:07Z

@Hamedloghmani
How many months do we need?
Is there any educational discount?

Hamedloghmani · 2023-01-16T07:13:18Z

@hosseinfani
I'll check for educational discount asap.
At 10,000,000 requests/month, which is the 99$/month plan, we'll be done in about 1 month.

gabrielrueda · 2023-01-17T23:19:17Z

@hosseinfani @Hamedloghmani
It seems the second option (https://genderize.io/) is good based on it being more affordable than the other website, while having data from various countries.

I also had an idea to reduce number of requests we would need to make:

What if we kept our own record of name and gender every time we make a request? That way, if there are duplicate names in our datasets, then we wouldn't have to make the request again.
This could possibly reduce the 11,215,383 amount to a number below 10 000 000.
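
A minimal sketch of this idea, assuming the API call is wrapped in some `lookup` function: results are kept in a dictionary keyed by the lowercased first name, so each distinct name costs only one request. `fake_lookup` and its return values are placeholders, not real API output.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

GenderResult = Tuple[Optional[str], Optional[float]]


def label_names(
    first_names: Iterable[str],
    lookup: Callable[[str], GenderResult],
) -> Tuple[Dict[str, GenderResult], int]:
    """Look up each distinct name once and return (cache, number of requests made)."""
    cache: Dict[str, GenderResult] = {}
    requests_made = 0
    for name in first_names:
        key = name.strip().lower()
        if key not in cache:
            cache[key] = lookup(key)  # only unseen names reach the API
            requests_made += 1
    return cache, requests_made


def fake_lookup(name: str) -> GenderResult:
    """Placeholder for the real API call (hypothetical values)."""
    table = {"alex": ("male", 0.6), "maria": ("female", 0.98)}
    return table.get(name, (None, None))


results, n_requests = label_names(["Alex", "Maria", "alex", "Maria", "Xyzzy"], fake_lookup)
print(n_requests)  # 3
```

Five records needed only three requests here; how much this saves on the real datasets depends on how many first names repeat.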

Hamedloghmani · 2023-01-18T02:58:50Z

@gabrielrueda
I think it's a brilliant idea. It will add a bit of time complexity for us, but I think the number of requests will be significantly decreased.

gabrielrueda · 2023-01-20T17:49:42Z

@Hamedloghmani
Sounds good, I can begin to write a program that keeps a record of the unique names and the gender for each respective name. Should I add the gender parameter to each person's record or make a new record of just person and gender? Also, should I start with the toy dataset from DBLP?

Hamedloghmani · 2023-02-02T22:14:44Z

Hi Gabriel, hope you are doing well.
Last time we spoke in the issue page, I mentioned 2 APIs for retrieving gender from names.
Before entering the final phase and buying one, we would like to run an experiment. I broke down the steps as follows:
1) Create an Excel or csv file with 100 random full names (Firstname and Lastname). Please try to include some diversity in these samples regarding gender and country of origin.
2) Using the free version of each API, get the results for each of these samples. gender-api takes lastname too, but genderize.io does not.
3) We have to compare the results and decide between them based on the output that we get from this experiment. The final output will be 4 columns: the first two are name and last name, the third and fourth are the gender results from each of those APIs.
Please note that genderize.io also has a link to a python library for usage as well as their API details; it might be helpful.
You can log your process in the issue page and define tasks in Trello as well.
Let me know what you think about it.

hosseinfani · 2023-02-02T22:22:19Z

@gabrielrueda
@Hamedloghmani
I assume the dataset is dblp, right?
Also, keep track of which expert has which (firstname-lastname), because later we want to double check with the actual persons in the dataset. Something like this:

[40], Hossein, Fani, 0, 0
[42], Ali, Fani, 0, 1
...
0 being male, 1 being female,
e.g., Hossein Fani is the 40th author in dblp

make sense?

Hamedloghmani · 2023-02-03T03:22:47Z

@hosseinfani
Exactly, we are starting with dblp.

gabrielrueda · 2023-02-14T02:12:31Z

Hello @Hamedloghmani,
I have made some observations. I took 100 names from the DBLP trying to include diversity (although I ended up with a majority of male names (77 vs 23)).

Observations:

96/100 names had the same results for both APIs
Gender-API had no NULL results
Genderize had 3 entries with a NULL result
Thus, only 1 of the names had a conflicting result (male vs female)
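
A tally like the one above could be produced with a small helper along these lines. The function name `compare` and the four rows are purely illustrative placeholders, not the actual experiment data.

```python
from collections import Counter


def compare(rows):
    """Classify each (gender_api, genderize) pair as same, null, or conflict."""
    summary = Counter()
    for gender_api, genderize in rows:
        if gender_api is None or genderize is None:
            summary["null"] += 1       # at least one API returned no result
        elif gender_api == genderize:
            summary["same"] += 1       # both APIs agree
        else:
            summary["conflict"] += 1   # male vs female disagreement
    return summary


# Placeholder rows: (Gender-API result, Genderize result).
rows = [("male", "male"), ("female", "female"), ("male", None), ("male", "female")]
summary = compare(rows)
print(summary["same"], summary["null"], summary["conflict"])  # 2 1 1
```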

Probabilities:
Since both APIs return an accuracy/probability with their result, I graphed that value for each result in order to compare them. For the most part, both APIs report around the same value. However, at times one API would report a higher percentage than the other, especially for the names that were harder to predict.

Here is the graph for the first 10 entries:

I have the output.csv, the remaining bar graphs for the accuracies, and the python scripts to formulate this information. Where in the repository should I share this information, or should I share it privately?

Hamedloghmani · 2023-02-14T02:40:05Z

Hello @gabrielrueda
Thank you so much for your update and informative representation.
Based on the plot, it seems that Genderize usually outperforms Gender API.

Please push your code as a .py file in the fair_team_formation/src/util directory.
I'll examine the full results and we will shortly start labeling the whole DBLP dataset.

Thanks

gabrielrueda · 2023-02-14T03:00:33Z

@Hamedloghmani
I just pushed my code.

Hamedloghmani · 2023-02-14T03:08:41Z

@gabrielrueda
Thanks a lot. I'll review the code and the results asap.

gabrielrueda · 2023-04-03T02:19:31Z

@Hamedloghmani

Here are the observations that I mentioned earlier. They are some of the inconsistencies in how names are represented in the dataset.

Case 1) The dataset shows "DuarteCesar" but on dblp the actual name is "Cesar Duarte"
Case 2) The dataset shows "A AbramovSergei" but on dblp the actual name is "Sergei A. Abramov"
Case 3) The dataset shows "A A Aoude" but there are multiple results on dblp (no first name given, it seems)
Case 4) The dataset shows "M. Turunen" but M is just the first letter of the first name

I looked into a few names like case 1 and case 2 and there seems to be a pattern, that when there is no space between the names, the order for first name, last name is reversed. I was wondering if I could implement something in the code to detect that and assume case 1 and case 2. As for case 3 and 4, I was thinking I could just discard those.

My idea to deal with these cases:

Pass Through 1:

Create new json file(e.g. dblp_correctNames.json)
If the name is successful in finding first name using regular method (FIRSTNAME SPACE LASTNAME), that json row is copied to the dblp_correctNames.json file.
If the name follows case (1) or case (2), then the name will be modified and also copied to the dblp_correctNames.json file.
Otherwise, if the name follows case (3) or case (4), it will be copied to another json file (dblp_failed_to_parse_name.json), where it could be used in the future if needed.
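
The first pass could be sketched roughly like this. It is only an illustration under assumptions, not the code from the pull request: the regular expression, the helper names, and the rule "a fused LastnameFirstname token preceded only by initials" are guesses at how cases 1-4 might be told apart. It only handles ASCII letters, and a name such as "J McDonald" would be misread as a fused name.

```python
import re
from typing import Optional

# Two capitalized words fused together, e.g. "DuarteCesar" -> ("Duarte", "Cesar").
FUSED = re.compile(r"^([A-Z][a-z]+)([A-Z][a-z]+)$")


def is_initial(token: str) -> bool:
    """True for single letters such as 'A' or 'M.'."""
    return len(token.rstrip(".")) == 1


def extract_first_name(raw: str) -> Optional[str]:
    """Return the first name, or None when the name cannot be parsed (cases 3 and 4)."""
    tokens = raw.split()
    if not tokens:
        return None
    fused = FUSED.match(tokens[-1])
    if fused and all(is_initial(t) for t in tokens[:-1]):
        # Cases 1 and 2: the last token is LASTNAME+FIRSTNAME without a space.
        return fused.group(2)
    if len(tokens) >= 2 and not is_initial(tokens[0]):
        # Regular layout: FIRSTNAME SPACE LASTNAME.
        return tokens[0]
    return None  # would go to the failed-to-parse file


for raw in ["Cesar Duarte", "DuarteCesar", "A AbramovSergei", "A A Aoude", "M. Turunen"]:
    print(raw, "->", extract_first_name(raw))
```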

Pass Through 2:

Will go through dblp_correctNames.json and record all unique names as originally intended. They will be recorded in an indexed pandas dataframe with empty columns of "Gender" and "Probability"

I will let you know when I implement this idea in code and I'll share the results.

Hamedloghmani · 2023-04-03T05:13:01Z

@gabrielrueda
Thank you so much for your progress report.

As we discussed, your idea was great based on the observations that you had.
Thank you, looking forward to seeing the implementation and results 😄

gabrielrueda · 2023-04-04T22:04:43Z

Hello @Hamedloghmani,

After running the filter on the initial dataset and then finding all the unique first names, the program found 275 859 unique first names in the filtered dataset (removing cases 3 & 4 as well as modifying cases 1 & 2).

I have made a pull request (#59) to include the code. I have also included the results as uniquenames_filtered.pkl and uniquenames_filtered.csv files (uniquenames_filtered.pkl is there to preserve the index I set up on the pandas dataframe).

I also have the json files for dblp_correctNames.json and dblp_failed_to_parse.json. I can send these privately since the files are too large to be pushed to git.

Let me know if you have any questions.

Hamedloghmani · 2023-04-04T22:09:29Z

Thank you so much for the update @gabrielrueda
I'll go over your code and results today. We'll start labeling after I merge this request and refactor it.
I'll keep you posted.
You can upload the large results in Teams -> Adila -> DBLP Labeling Files
Thanks.

gabrielrueda · 2023-04-07T17:53:38Z

Hello @Hamedloghmani,
I filled in the table with the gender of each unique name. The table is stored as both a .pkl and a .csv, but use the .pkl to import data into the program, just like the uncommented lines in the main function.

I basically completed two steps in order to obtain data:

Made http requests to get the data from genderize and outputted the whole thing to a text file. (I have the text files on my computer if you're interested) -> makeParallelAPIReqs()
Read from those text files and updated the values in the dataframe -> addGenderResultsFromFile()
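
The two steps could look something like the sketch below. It is not the code behind makeParallelAPIReqs() or addGenderResultsFromFile(); the function names, the one-JSON-object-per-line file layout, and the response fields are assumptions, and `fetch_gender` is a stand-in that makes no real HTTP request.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch_gender(name):
    """Stand-in for one HTTP request; the response fields here are assumed."""
    known = {"peter": ("male", 0.99), "maria": ("female", 0.98)}
    gender, probability = known.get(name, (None, None))
    return {"name": name, "gender": gender, "probability": probability}


def dump_responses(names, path, workers=4):
    """Step 1: query names in parallel and write one JSON response per line."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        responses = list(pool.map(fetch_gender, names))  # map keeps input order
    with open(path, "w", encoding="utf-8") as fh:
        for response in responses:
            fh.write(json.dumps(response) + "\n")


def load_responses(path, table):
    """Step 2: read the saved responses and fill in the name table."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            table[record["name"]] = (record["gender"], record["probability"])
    return table


path = os.path.join(tempfile.mkdtemp(), "responses.txt")
dump_responses(["peter", "maria", "xyzzy"], path)
table = load_responses(path, {})
print(table["peter"])  # ('male', 0.99)
```

Keeping the raw responses on disk means the table can be rebuilt without spending API requests again.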

I left some things commented out at the bottom, since for a chunk of the data (between records 90k and 160k) I obtained the data in a different way. However, for future use, the functions in the class should be used.

Changes are in pull request #61

Hamedloghmani · 2023-04-07T18:05:14Z

Hi @gabrielrueda

Thank you so much for the implementation. I merged your pull request.
I believe we can proceed to the final phase and label the whole dataset since we have the gender for all the unique names.
What do you think ?

gabrielrueda · 2023-04-07T23:47:15Z

@Hamedloghmani
I think that we are ready to label the whole dataset. Also, some of the entries will result in NULL for gender/probability. Should we just filter out those entries when creating our new dataset?

Hamedloghmani · 2023-04-08T00:11:36Z

@gabrielrueda
Great !
Yes, I believe we can filter them out.

gabrielrueda · 2023-04-08T00:36:24Z

@Hamedloghmani
Also, would a structure like this: "gender": {"value": true, "probability": 0.97} for each author be good to represent the values in the dataset?

Example:
{"id":1,"authors":[{"name":"Hinton","gender": {"value": true, "probability": 0.97},"org":"Shinshu University","id":1},{"name":"LeCun","gender": {"value": false, "probability": 0.87},"org":"Shinshu University","id":3}],"fos":[{"name":"Machine Learning","w":0.45139},{"name":"Image Captioning", "w":0.3241}],"title":"Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map","year":2000,"n_citation":1,"page_start":"89","page_end":"93","doc_type":"Conference","publisher":"Springer, Berlin, Heidelberg","volume":"","issue":"","doi":"10.1007/978-3-642-39476-8_19","references":[2005687710,2018037215],"indexed_abstract":{"IndexLength":58,"InvertedIndex":{"tool.":[42],"study":[4],"aim":[37],"purpose":[1],"scientific":[17],"for":[11],"aspects":[18],"students":[14,46],"focus":[27],"hands-on":[47],"learning":[9,41],"experience":[48],"our":[40],"we":[26],"network":[33,56],"The":[0],"More":[24],"high":[12],"protocols.":[57],"school":[13],"and":[21],"of":[2,19,32,55],"communication":[22],"protocols":[34],"gives":[45],"on":[28],"a":[8],"studying":[15],"specifically,":[25],"this":[3],"understand":[51],"is":[5],"develop":[7,39],"Our":[43],"tool":[10,44],"the":[16,29,36,52],"help":[50],"as":[35],"principles":[31,54],"information":[20],"networks.":[23],"to":[6,38,49],"basic":[30,53]}},"venue":{"raw":"International Conference on Human-Computer Interaction","id":1127419992,"type":"C"}}
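
For illustration, attaching that structure to each author might look like the sketch below. The helper name `add_gender`, the lookup keyed by the author's name, and dropping authors without a result are assumptions of this sketch (the last one follows the decision above to filter out NULL entries).

```python
import json


def add_gender(paper, genders):
    """Return a copy of the paper whose authors carry a "gender" object."""
    labelled = []
    for author in paper["authors"]:
        result = genders.get(author["name"])
        if result is None:
            continue  # no gender/probability available: drop the author
        value, probability = result
        labelled.append({**author, "gender": {"value": value, "probability": probability}})
    return {**paper, "authors": labelled}


paper = {
    "id": 1,
    "authors": [
        {"name": "Hinton", "id": 1},
        {"name": "LeCun", "id": 3},
        {"name": "Unknown", "id": 4},
    ],
}
genders = {"Hinton": (True, 0.97), "LeCun": (False, 0.87)}
print(json.dumps(add_gender(paper, genders)))
```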

Hamedloghmani · 2023-04-08T00:41:55Z

@gabrielrueda
Yes. I think that's great since we might need to use inferred genders with specific levels of confidence in some cases.

Thanks.

gabrielrueda · 2023-04-09T01:01:04Z

Hi @Hamedloghmani,
I just wanted to let you know that I labelled the dataset and kept all the successful entries (entries whose name could successfully be assigned a gender). I will create a pull request for the code. Where should I upload the labelled dataset (since it's too big to upload to GitHub)?

Hamedloghmani · 2023-04-09T01:49:24Z

Hello @gabrielrueda
Thank you so much for the update. I'll go over your pull request tonight.
Please upload them in MS Teams -> Adila -> DBLP Labeling Files
You can create another folder inside this directory if you want. I can reformat it later, no worries.

Let me know if you run into any issues.
Thanks

gabrielrueda · 2023-04-09T03:04:22Z

@Hamedloghmani
The dataset finished uploading. It should be in MS Teams -> Adila -> DBLP Labeling Files -> FinalGenderLabelledDataset. Let me know if you are unable to see/access it.

Thanks

Hamedloghmani · 2023-04-09T03:21:56Z

@gabrielrueda
Thank you Gabriel. I just checked and I got access to it.

gabrielrueda · 2023-04-29T16:07:25Z

Hi @Hamedloghmani ,
I finished labelling the dataset, but I filtered some of the members from the name.basics.tsv file. Since I filtered out the names, should I update the title.basics.tsv and title.principals.tsv files to remove the entries that include the names that I filtered out in the labelling process?

Hamedloghmani · 2023-04-30T04:27:23Z

Hello @gabrielrueda
Thank you so much for the update. You can consider doing that if it is not too time consuming.
In the next steps we would discuss mapping the names from the OpeNTF outputs to the raw dataset and also a policy for missing names.
I will schedule a meeting to talk about that soon if you want.

gabrielrueda · 2023-04-30T14:52:54Z

Hello @Hamedloghmani,
It shouldn't be time consuming to filter it out in the other files, so I'll do that and let you know when I complete it. As for the next task, yes it would be great to schedule a meeting for that.

gabrielrueda · 2023-05-04T01:38:11Z

Hi @Hamedloghmani,
I finished filtering the entries in title.basics.tsv and title.principals.tsv in order to match what was filtered in the name.basics.tsv file. These 3 files are uploaded to the "IMDB Labelling Files" folder. I'll make a pull request tomorrow for the code; I just have to add comments and rename some of the functions.

Hamedloghmani · 2023-05-04T03:57:59Z

Hello @gabrielrueda
Thanks a lot for the update.
Please make the pull request on the main branch this time, including your previous implementations for dblp.

Thank you

gabrielrueda · 2023-05-04T19:56:33Z

Hi @Hamedloghmani,
I just created the pull request for the code.

Thanks, Gabriel

Hamedloghmani · 2023-05-04T21:24:03Z

Hello @gabrielrueda
I merged your pull request. Thanks a lot.

gabrielrueda · 2023-05-12T15:02:27Z

Hello @Hamedloghmani,

Here is the approach I will take to map the indexes from OpeNTF to the raw dataset:

Part 1:

In indexes.pkl there are the 'i2c' and 'c2i' dictionaries:

Example:

'i2c': {
    9 : '54977.0_Reginald_Barker'
}

'c2i': {
    '54977.0_Reginald_Barker' : 9
}

The string '54977.0_Reginald_Barker' includes the member id from the raw dataset before the '.'
In this case it would be 54977

Part 2:

However, in the name.basics.tsv file, the member id would be listed as 'nm0054977'.

The member id has the layout of "nm" + 7 digits.

e.g. 54977 would be 'nm0054977'

(add 0's in unassigned digits)
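
Parts 1 and 2 together amount to a small conversion, sketched here under the assumption that every key has the form '<id>.0_<name>'. The function name `to_imdb_member_id` is made up for this example.

```python
def to_imdb_member_id(opentf_key: str) -> str:
    """Turn an OpeNTF key like '54977.0_Reginald_Barker' into 'nm0054977'."""
    raw_id = opentf_key.split("_", 1)[0]   # '54977.0'
    number = int(raw_id.split(".", 1)[0])  # 54977
    # Zero-pad to 7 digits; numbers with more digits are left as they are.
    return f"nm{number:07d}"


print(to_imdb_member_id("54977.0_Reginald_Barker"))  # nm0054977
```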

Part 3: The mapping:

Loop through 'c2i' dictionary and create new dictionary with layout as described in part 2. {memberID: opeNTF_output_index}
Create a pandas dataframe: (make sure to set opeNTF_output_index as the index)

opeNTF_output_index	gender	probability
(integer)	(true/false/null)	(double from 0.0 to 1.0 or null)

true = male, false = female

Loop through name.basics_labelled.tsv. For each member id, find the index using the dictionary created in step 1. Add the opeNTF_output_index and the gender/probability from the file to the new dataframe.
Then, loop through the list of keys from 'i2c'. If the opeNTF_output_index is not in the pandas dataframe, then add that opeNTF_output_index with gender as null and probability as null.
Export the dataframe as .csv and .pkl for future use.
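
The mapping steps above (everything except the export) can be sketched with plain dictionaries standing in for the pandas dataframe. This is an assumption-laden illustration, not the implementation: `build_mapping` and `member_id` are made-up names, and the "gender" and "probability" field names for the labelled rows are assumed ("nconst" is the member id column of IMDb's name.basics.tsv).

```python
def member_id(key):
    """'54977.0_Reginald_Barker' -> 'nm0054977' (layout described in Part 2)."""
    return f"nm{int(key.split('_', 1)[0].split('.', 1)[0]):07d}"


def build_mapping(i2c, c2i, labelled_rows):
    # Step 1: {memberID: opeNTF_output_index}
    id_to_index = {member_id(key): index for key, index in c2i.items()}
    # Steps 2-3: copy gender/probability for every labelled member OpeNTF knows.
    mapping = {}
    for row in labelled_rows:
        index = id_to_index.get(row["nconst"])
        if index is not None:
            mapping[index] = (row["gender"], row["probability"])
    # Step 4: indexes without a label get null gender and probability.
    for index in i2c:
        mapping.setdefault(index, (None, None))
    return dict(sorted(mapping.items()))


i2c = {9: "54977.0_Reginald_Barker", 10: "123.0_Jane_Doe"}
c2i = {key: index for index, key in i2c.items()}
rows = [
    {"nconst": "nm0054977", "gender": True, "probability": 0.99},
    {"nconst": "nm9999999", "gender": False, "probability": 0.80},  # not in OpeNTF
]
print(build_mapping(i2c, c2i, rows))  # {9: (True, 0.99), 10: (None, None)}
```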

Hamedloghmani · 2023-05-12T19:58:06Z

Hi @gabrielrueda
Thank you so much for your report. The approach makes sense as we discussed. You can proceed with the implementation.

Thanks

gabrielrueda · 2023-05-15T23:29:45Z

Hi @Hamedloghmani,
I finished the implementation. Should I make a pull request for my implementation to the dev or main branch?

Thanks, Gabriel

Hamedloghmani · 2023-05-15T23:36:14Z

Hi @gabrielrueda
Thank you so much for the update. Make it to the main branch please.

Thanks

gabrielrueda · 2023-05-26T17:27:05Z

Hello @Hamedloghmani,

I just wanted to let you know that I changed the true/false in the dataset to M/F, as well as updated the results in mapping table for IMDB. I also created a mapping table for DBLP.

In the pull request I updated mappingGender.py and labelDataset.py with the changes needed. I have also included a file called changeDataset.py, which I only used as a temporary way to change the values from true/false to M/F.

Finally, the updated datasets and mapping tables will be uploaded to the Teams Adila channel in folders of DBLP Labelling Files/Gender Mappings and IMDB Labelling Files/Gender Mappings (I'll let you know when they finish uploading).

Hamedloghmani added the good first issue Good for newcomers label Jan 14, 2023

Hamedloghmani assigned gabrielrueda Jan 14, 2023

Hamedloghmani mentioned this issue Mar 11, 2023

The Epic Road to Fairness #47

Open

gabrielrueda mentioned this issue Apr 4, 2023

Added UniqueNames Dataframe #59

Merged

gabrielrueda mentioned this issue Apr 7, 2023

Filled in Genders of Names into Table #61

Merged

gabrielrueda mentioned this issue Apr 9, 2023

Added Functions for Labelling the Dataset #62

Merged

gabrielrueda mentioned this issue May 4, 2023

Labelling IMDB Files Python Script #69

Merged

gabrielrueda mentioned this issue May 16, 2023

Mapping OpeNTF Indexes to Gender Results #71

Merged

gabrielrueda mentioned this issue May 26, 2023

Labelled Dataset Fix + DBLP Table Added #75

Merged

gabrielrueda linked a pull request Sep 14, 2023 that will close this issue

Updated Output For mappingGender.py #82

Merged

Hamedloghmani closed this as completed in #82 Sep 18, 2023
