Gender Prediction Methods Based on Name #36
Comments
Update
Here is an update: I researched how to detect gender based on name. First, I found that there are a few APIs to detect gender, but those have limits that would not be practical for us. Then I looked into three articles, labeled as Methods 1-3 in the PDF. In the PDF I provide some rough notes that I took from each of the articles.
Methods 1 and 3 take a similar approach, while Method 2 is different. For preprocessing, Methods 1 and 3 split each name into characters and assign a number to each possible character. Method 2 uses a count vectorizer: it takes substrings of a specified size (e.g. 2-4 chars), collects all the substrings that occur in the dataset, and then, for each name, counts the occurrences of each substring.
For the model, Methods 1 and 3 use a bidirectional LSTM, while Method 2 uses logistic regression. From what I read, logistic regression would be more lightweight but less accurate. However, I am not too familiar with these and will probably do more research on this.
Also, there are possible differences in patterns for names from different countries. We could possibly train different models for a few different countries, but I am not sure how practical this would be. We can also account for names that are used interchangeably between two genders (e.g. Alex) and ignore those. Let me know if you have any suggestions.
Notes
Method 1:
Input Modifications:
NLP Model:
Embedding Layer
Bidirectional LSTM Layer
Dense Layer
Method 2:
Generate Features
Logistic Regression
Improvements Suggested:
Method 3:
Data Preprocess
LSTM
|
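The two preprocessing styles described above can be sketched in plain Python. This is only an illustration, not the code from any of the three articles: the function names, the lowercase a-z alphabet, and the choice of 0 for unknown characters are all assumptions made for the example.

```python
from collections import Counter


def char_indices(name, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Methods 1 and 3 style: map each character to an integer.

    The alphabet and the use of 0 for unknown characters are assumptions.
    """
    lookup = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return [lookup.get(ch, 0) for ch in name.lower()]


def char_ngram_counts(name, min_n=2, max_n=4):
    """Method 2 style: count every substring of length min_n..max_n."""
    text = name.lower()
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


encoded = char_indices("Abc")
bigrams = char_ngram_counts("anna", 2, 2)
```

In a real pipeline the integer sequences would feed an embedding layer, and the substring counts would become the feature vector for logistic regression.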
Hi Gabriel, Thank you so much for the update. Great explanation. I'll take a deeper look at the methods that you kindly suggested and send you an update then we'll choose between these methods. Thanks a lot. |
Hello @hosseinfani, @gabrielrueda
• We need a massive dataset of names from around the world with very little or no bias in the distribution of names. This is because our target dataset (for example, DBLP) is from all around the world, and a model fitted on a specific region or country will perform considerably worse in the prediction phase.
• The best method offered around 89% accuracy with potential signs of overfitting. Even if we consider that 90%, it is still not sufficient in our case. Even at a small scale, we would have almost 10,000 misses out of 100,000. This is critical because, based on the results from this model, we will conduct research on “fairness and bias” and make assumptions about that. I think this will be a loose end if we continue based on noisy or faulty predictions.
To put it in a nutshell, I think paid query-based APIs will be our best option at the moment. As Gabriel kindly mentioned, there are a few, and I looked into 2 popular options: The pricing seems reasonable, especially the second one. I would be happy to hear your thoughts about my opinion. |
The second one worked pretty well on some samples. It also accepts a country. Honestly, we can try both and take a vote. How much will they cost us for dblp, imdb, gith? For uspt, we have it already, I believe. |
I don't think this method will be effective on GitHub, though, because I examined some data and too many people use nicknames or something similar as their name. But if we count those, we might have around 1,000,000 there. |
@Hamedloghmani |
@hosseinfani |
@hosseinfani @Hamedloghmani I also had an idea to reduce the number of requests we would need to make: what if we kept our own record of name and gender every time we make a request? This would mean that if there were duplicate names in our datasets, then we wouldn't have to make the request again. |
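A minimal sketch of the record-keeping idea above, assuming the paid API is wrapped in some function that takes a first name and returns a label. `GenderCache` and the `lookup` callable are hypothetical names for illustration, not part of any real API client.

```python
class GenderCache:
    """Remember every name already queried so duplicates cost no request."""

    def __init__(self, lookup):
        # `lookup` stands in for the real (paid) API call; it is an assumption.
        self._lookup = lookup
        self._known = {}
        self.requests_made = 0

    def get(self, first_name):
        key = first_name.strip().lower()
        if key not in self._known:
            # Only unseen names trigger an actual request.
            self._known[key] = self._lookup(key)
            self.requests_made += 1
        return self._known[key]


def fake_lookup(name):
    # Placeholder response used only to exercise the cache.
    return "unknown"


cache = GenderCache(fake_lookup)
labels = [cache.get(n) for n in ["Alex", "alex", "Sam", "ALEX "]]
```

Here four names are looked up but only two requests are made, since three of them normalize to the same key. Persisting `_known` to disk between runs would extend the saving across datasets.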
@gabrielrueda |
@Hamedloghmani |
Hi Gabriel, hope you are doing well. |
@gabrielrueda [40], Hossein, Fani, 0, 0 make sense? |
@hosseinfani |
Hello @Hamedloghmani, Observations:
Probabilities: Here is the graph for the first 10 entries: I have the output.csv, the remaining bar graphs for the accuracies, and the Python scripts used to produce this information. Where in the repository should I share this information, or should I share it privately? |
Hello @gabrielrueda Please push your code as a .py file in fair_team_formation/src/util directory. Thanks |
@Hamedloghmani |
@gabrielrueda |
Here are the observations that I mentioned earlier. They are some of the inconsistencies in how names are represented in the dataset. Case 1) The dataset shows "DuarteCesar" but on dblp the actual name is "Cesar Duarte". I looked into a few names like case 1 and case 2, and there seems to be a pattern: when there is no space between the names, the order of first name and last name is reversed. I was wondering if I could implement something in the code to detect that and handle case 1 and case 2 accordingly. As for cases 3 and 4, I was thinking I could just discard those. My idea to deal with these cases: Pass Through 1:
Pass Through 2:
I will let you know when I implement this idea in code and I'll share the results. |
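The reversed-order pattern from case 1 could be detected roughly as follows. This is a hedged sketch, not the code later submitted in the pull request: `normalize_name` is a hypothetical helper, it only covers the one pattern spelled out above ("DuarteCesar" meaning "Cesar Duarte"), and it returns `None` for anything it cannot split into exactly two parts.

```python
def split_on_case_boundary(token):
    """Split 'DuarteCesar' into ['Duarte', 'Cesar'] at lower-to-upper changes."""
    parts, start = [], 0
    for i in range(1, len(token)):
        if token[i].isupper() and token[i - 1].islower():
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


def normalize_name(raw):
    """Return (first, last), or None when the name should be discarded."""
    raw = raw.strip()
    if " " in raw:
        parts = raw.split()
        # With a space, assume the usual "First Last" order.
        return (parts[0], parts[1]) if len(parts) == 2 else None
    parts = split_on_case_boundary(raw)
    if len(parts) != 2:
        return None
    # No space: assume the order is reversed (last name first).
    last, first = parts
    return (first, last)


fixed = normalize_name("DuarteCesar")
```

Names with more than two parts are dropped here for simplicity; a real filter would need a rule for middle names.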
@gabrielrueda As we discussed, your idea was great based on the observations that you had. |
Hello @Hamedloghmani, After running the filter on the initial dataset and then finding all the unique first names, the program found 275 859 unique first names in the filtered dataset (removing cases 3 & 4 as well as modifying cases 1 & 2). I have made a pull request (#59) to include the code. I have also included the results as uniquenames_filtered.pkl and uniquenames_filtered.csv files (uniquenames_filtered.pkl is there to preserve the index I set up on the pandas dataframe). I also have the JSON files dblp_correctNames.json and dblp_failed_to_parse.json. I can send these privately since the files are too large to be pushed to git. Let me know if you have any questions. |
Thank you so much for the update @gabrielrueda |
Hello @Hamedloghmani, I basically completed two steps in order to obtain data:
I left some things commented out at the bottom, since for a chunk of the data (between records 90k and 160k) I obtained the data in a different way. For future use, however, the functions in the class should be used. Changes are in pull request #61 |
Thank you so much for the implementation. I merged your pull request. |
@Hamedloghmani |
@gabrielrueda |
@Hamedloghmani Example: |
@gabrielrueda Thanks. |
Hi @Hamedloghmani, |
Hello @gabrielrueda Let me know if you run into any issues. |
@Hamedloghmani Thanks |
@gabrielrueda |
Hi @Hamedloghmani , |
Hello @gabrielrueda |
Hello @Hamedloghmani, |
Hi @Hamedloghmani, |
Hello @gabrielrueda Thank you |
Hi @Hamedloghmani, Thanks, Gabriel |
Hello @gabrielrueda |
Hello @Hamedloghmani, Here is the approach I will take to map the indexes from OpeNTF to the raw dataset: Part 1: In indexes.pkl there are the 'i2c' and 'c2i' dictionaries: Example:
The string '54977.0_Reginald_Barker' includes the member id from the raw dataset before the '.'. Part 2: However, in the name.basics.tsv file, the member id is listed as 'nm0054977'. The member id has the layout "nm" + 7 digits, e.g. 54977 becomes 'nm0054977' (unassigned digits are padded with leading 0s). Part 3: The mapping:
|
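The id conversion described in Parts 1 and 2 can be sketched as below. `imdb_member_id` is a hypothetical helper name; the key format and the "nm" + 7 digits layout are taken from the comment above.

```python
def imdb_member_id(index_key):
    """Turn an OpeNTF key like '54977.0_Reginald_Barker' into 'nm0054977'."""
    # The raw member id is everything before the first '.'.
    raw_id = index_key.split(".", 1)[0]
    # Pad with leading zeros up to 7 digits and add the 'nm' prefix.
    return "nm" + raw_id.zfill(7)


member_id = imdb_member_id("54977.0_Reginald_Barker")
```

The result can then be used to look up the matching row in name.basics.tsv.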
Hi @gabrielrueda Thanks |
Hi @Hamedloghmani, Thanks, Gabriel |
Hi @gabrielrueda Thanks |
Hello @Hamedloghmani, I just wanted to let you know that I changed the true/false values in the dataset to M/F, and updated the results in the mapping table for IMDB. I also created a mapping table for DBLP. In the pull request I updated mappingGender.py and labelDataset.py with the changes needed. I have also included a file called changeDataset.py, which I only used as a temporary way to change the values from true/false to M/F. Finally, the updated datasets and mapping tables will be uploaded to the Teams Adila channel in the folders DBLP Labelling Files/Gender Mappings and IMDB Labelling Files/Gender Mappings (I'll let you know when they finish uploading). |
Hello @gabrielrueda,
You have been added to this repo and you can log and discuss your process here from now on as well as Trello.