Some (minor) issues and suggestions to this awesome project #1

Open
phseiff opened this issue Dec 22, 2020 · 1 comment

phseiff commented Dec 22, 2020

Hello, and first of all, thank you for this amazing dataset and all the work you've put into it; it helped me a lot!

Whilst using your dataset for my own (not yet open-sourced) project, which required parsing your data into a different format, I found some minor issues with your dataset, which I would like to share with you. I fixed them in my own copy of your dataset, but since said copy uses a different data format, sharing it would not be of much use to you.

In the following, I will iterate over the issues I found and refer to the report generated by my program where necessary, and the automated suggestions it generated.

The full log of my analysis can be found here, but it may not be self-explanatory and may contain information you don't actually need.

First of all, you added (as stated by your README) several words to your JSON data that aren't hyponyms for "person". My program removes these and told me that there are 22 words like this:

"female" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"gentlemen" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"he" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"her" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"her" ignored because it is not part of wordnet and therefore not a hyponym for a person.
Found an "other"-word! It's "hermaphrodite".
"hers" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"herself" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"him" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"himself" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"his" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"ladies" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"ma'am" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"madam" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"male" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"mamma" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"men" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"miss" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"mr." ignored because it is not part of wordnet and therefore not a hyponym for a person.
"mrs." ignored because it is not part of wordnet and therefore not a hyponym for a person.
"ms." ignored because it is not part of wordnet and therefore not a hyponym for a person.
"she" ignored because it is not part of wordnet and therefore not a hyponym for a person.
"women" ignored because it is not part of wordnet and therefore not a hyponym for a person.

My (admittedly quite minor) issue with this is that you include "men" and "women" as the plural forms of "man" and "woman", even though this contradicts the convention that (and sorry if I am wrong here) no other plurals are part of the dataset.

My second issue is that there are 270 words that are referenced by the gender_map-attribute of other words, but are not part of the database. I find that somewhat counterintuitive, since these words appear to be gendered nouns that represent a person. For example, yardman lists yardwoman as its female equivalent, but yardwoman is not part of the dataset, so finding the male version of yardwoman is not possible with your dataset.
My log lists all of these words, in case you want to read it.
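
A check for such dangling references could be sketched like this. It is only a minimal sketch, not the program that produced the log: it assumes each top-level entry is an object with a `word` attribute and an optional `gender_map` that maps a gender code to a list of objects with their own `word` attribute. The gender codes `"m"`/`"f"`, the helper names, and the sample entries are assumptions for illustration.

```python
def normalize(word):
    # Names inside gender_map use spaces, top-level names use underscores,
    # so both are brought to the underscore form before comparing.
    return word.strip().replace(" ", "_")


def find_dangling_references(entries):
    """Map every referenced-but-missing word to the entries that reference it."""
    known = {normalize(entry["word"]) for entry in entries}
    dangling = {}
    for entry in entries:
        for targets in entry.get("gender_map", {}).values():
            for target in targets:
                name = normalize(target["word"])
                if name not in known:
                    dangling.setdefault(name, []).append(entry["word"])
    return dangling


# Hypothetical sample in the assumed format.
sample = [
    {"word": "yardman", "gender": "m",
     "gender_map": {"f": [{"word": "yardwoman"}]}},
    {"word": "archduke", "gender": "m",
     "gender_map": {"f": [{"word": "archduchess"}]}},
    {"word": "archduchess", "gender": "f"},
]

print(find_dangling_references(sample))  # {'yardwoman': ['yardman']}
```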

My third issue is that some connections are one-sided. For example, archduke lists archduchess as its female version, but archduchess does not list archduke as its male version. I found 50 instances of this behavior on my first run of the program. Some of these may also be cases where A (male) lists B as its female version and C as its neutral version, but B and C don't link to each other, so checking every link for two-sidedness might not necessarily be enough to solve this issue.
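
The two-sidedness check could look roughly like the sketch below. It assumes the same entry layout as described in this issue (a top-level `word` plus a `gender_map` of gender code to a list of objects with a `word` attribute); the function name and the sample entries are hypothetical. It only tests direct back-links, so it does not catch the A/B/C case where two targets of the same word fail to link to each other.

```python
def find_one_sided_links(entries):
    """Return (source, target) pairs where target does not link back to source."""
    def norm(word):
        # gender_map names use spaces, top-level names use underscores
        return word.replace(" ", "_")

    by_word = {norm(entry["word"]): entry for entry in entries}
    one_sided = []
    for entry in entries:
        source = norm(entry["word"])
        for targets in entry.get("gender_map", {}).values():
            for target in targets:
                name = norm(target["word"])
                other = by_word.get(name)
                if other is None:
                    continue  # target missing from the dataset: a separate issue
                back_links = {
                    norm(t["word"])
                    for ts in other.get("gender_map", {}).values()
                    for t in ts
                }
                if source not in back_links:
                    one_sided.append((source, name))
    return one_sided


# Hypothetical sample: king/queen link both ways, archduke/archduchess do not.
sample = [
    {"word": "archduke", "gender": "m",
     "gender_map": {"f": [{"word": "archduchess"}]}},
    {"word": "archduchess", "gender": "f"},
    {"word": "king", "gender": "m", "gender_map": {"f": [{"word": "queen"}]}},
    {"word": "queen", "gender": "f", "gender_map": {"m": [{"word": "king"}]}},
]

print(find_one_sided_links(sample))  # [('archduke', 'archduchess')]
```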

My fourth issue is that "actress" lists "actor" as its male version but has no neutral version in its gender_map, even though your dataset lists "actor" as a neutral word. This could easily be solved by adding "actor" to the gender_map of "actress" twice, once as its male and once as its neutral version. There are 52 occurrences of this in your dataset, all documented in the log I linked for you.
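
A rough sketch of how such cases could be detected: look for words that have no neutral entry in their gender_map but link to a word that the dataset itself marks as neutral. The top-level `gender` attribute, the code `"n"` for neutral, and the sample entries are assumptions made for illustration; whitespace/underscore differences are ignored here to keep the example short.

```python
def missing_neutral_links(entries):
    """Return (word, candidate) pairs where candidate is a neutral word that
    `word` already links to, but not under the neutral key."""
    gender_of = {entry["word"]: entry.get("gender") for entry in entries}
    suggestions = []
    for entry in entries:
        gender_map = entry.get("gender_map", {})
        if gender_map.get("n"):
            continue  # already has a neutral version
        for targets in gender_map.values():
            for target in targets:
                if gender_of.get(target["word"]) == "n":
                    suggestions.append((entry["word"], target["word"]))
    return suggestions


# Hypothetical sample: "actor" is marked neutral but is only linked as male.
sample = [
    {"word": "actress", "gender": "f",
     "gender_map": {"m": [{"word": "actor"}]}},
    {"word": "actor", "gender": "n",
     "gender_map": {"f": [{"word": "actress"}]}},
]

print(missing_neutral_links(sample))  # [('actress', 'actor')]
```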

My fifth issue is that many words of the dataset have no male/female/neutral (most often neutral) version they link to, even though they end with a clear gender indicator like "-man", "-woman" or "-girl" that could easily be replaced with its opposite-gender counterpart or "-person" to create a male/female/neutral version. The program I wrote automatically creates male/female/neutral versions for all words like this that do not already have a male/female/neutral version in the dataset. This required automatically generating 403 new words and 1626 new links (= references of words to differently gendered versions of the word), all of which are documented in the log I linked for you.
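
The suffix replacement can be illustrated with a small sketch. This is not the actual program, only a minimal version of the idea that covers the "-man"/"-woman"/"-person" family; the suffix table, the gender codes, and the function name are assumptions. Plain suffix matching also produces false positives (a word like "human" ends in "man"), so generated words would need a review.

```python
# Assumed suffix table: gender code -> suffix.
SUFFIXES = {"m": "man", "f": "woman", "n": "person"}


def suggest_counterparts(word):
    """Suggest differently gendered forms of `word` by swapping its suffix."""
    # Try longer suffixes first so that "-woman" is not mistaken for "-man".
    ordered = sorted(SUFFIXES.items(), key=lambda item: -len(item[1]))
    for gender, suffix in ordered:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if not stem:
                return {}  # bare "man"/"woman"/"person": nothing to swap
            return {g: stem + s for g, s in SUFFIXES.items() if g != gender}
    return {}


print(suggest_counterparts("yardwoman"))  # {'m': 'yardman', 'n': 'yardperson'}
print(suggest_counterparts("chairman"))   # {'f': 'chairwoman', 'n': 'chairperson'}
print(suggest_counterparts("town"))       # {}
```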

My sixth issue is that a bunch of words are listed as male even though they may be used to refer to people of any gender, so there is no way of knowing whether they simply have no neutral version or are neutral in addition to being male. I fixed this in my copy of your dataset by assigning every word its male version as its neutral version (if there is a male version), and its female version otherwise. There were 243 cases where I had to use a word's male version as its female version, and 132 where a female word had neither a male nor a neutral equivalent; all cases of this issue, as well as the solution I automatically chose for each of them, are listed in the log I linked for you.

Also, your dataset lists "town" as male.... wordnet surely is weird sometimes xD

I don't know if this will be of any use to you, since some of the "issues" I listed above may be design choices I misunderstood, and my suggestions are in the form of an obscure log and sometimes possibly wrong since they were automatically generated by an algorithm, but I still decided to share them with you to show you that there is interest in your project and just in case you find them helpful.

Keep up the good work!

phseiff commented Dec 30, 2020

I also just realized that the word-attribute of objects in gender_map contains white space rather than underscores like the word-attribute of top-level entries. I can see why you did this; it makes indexing the data to find a fully gendered version of a word admittedly easier, but this can also become a hassle for those who are not aware of it and innocently assume that the names in gender_map follow the same naming convention as the names in top-level entries. I feel like this should be mentioned in the "Contents and Format"-section of the README.
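
To illustrate the pitfall, here is a minimal sketch of a lookup that fails with the raw gender_map name and succeeds once the name is converted to the top-level convention. The helper names and the two sample entries are hypothetical and only mirror the layout described above.

```python
def to_top_level_name(gender_map_word):
    # Replace runs of whitespace with single underscores.
    return "_".join(gender_map_word.split())


def build_index(entries):
    # Index top-level entries by their (underscore-separated) word attribute.
    return {entry["word"]: entry for entry in entries}


# Hypothetical entries: top-level names use underscores, gender_map uses spaces.
sample = [
    {"word": "cleaning_lady", "gender": "f",
     "gender_map": {"m": [{"word": "cleaning man"}]}},
    {"word": "cleaning_man", "gender": "m"},
]

index = build_index(sample)
target = sample[0]["gender_map"]["m"][0]["word"]

print(target in index)                     # False: "cleaning man" is not a key
print(to_top_level_name(target) in index)  # True: "cleaning_man" is
```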

I can make a pull request for this if you want me to.
