## BiCIKL-Hackathon Topic 9: Hidden Women in Science

____
### Summary
____

[Topic overview](https://github.com/pensoft/BiCIKL/tree/main/Topic%209%20Hidden%20women%20in%20science)

I’ve been looking at how well collector names link records across different infrastructures in the current architecture, with the idea that seeing where/why things are getting lost might be useful:

Collector names in gbif occurrences (preserved) -> proportion that hold enough detail to resolve -> number of records with a matching profile on bionomia -> number of corresponding records on wikidata holding gender data -> number of women

Results from a subset of gbif records:

    1. Records:                         50,403
    2. Unique recordedBy values:        6,078 
    3. … which aren’t initials:         1,272
    4. Names which match bionomia*:     392
    5. Wikidata record has gender:      320
    6. Women:                           10

* confidence >= 51.
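
The drop-off is easier to see as step-to-step retention. A minimal sketch using only the counts listed above; the `funnel` and `retention` names are made up for illustration, and stage 1 is left out because it counts records rather than name strings:

```python
# Counts copied from the summary list above (stages 2-6).
funnel = [
    ("Unique recordedBy values", 6078),
    ("... which aren't initials", 1272),
    ("Names which match bionomia", 392),
    ("Wikidata record has gender", 320),
    ("Women", 10),
]


def retention(stages):
    """Return (label, count, percent kept from the previous stage) per stage."""
    rows = []
    previous = None
    for label, count in stages:
        kept = None if previous is None else 100 * count / previous
        rows.append((label, count, kept))
        previous = count
    return rows


for label, count, kept in retention(funnel):
    kept_text = "" if kept is None else f" ({kept:.1f}% of previous stage)"
    print(f"{label}: {count}{kept_text}")
```

Roughly a fifth of the unique strings survive the initials filter and under a third of those match Bionomia, whereas most Bionomia matches do carry gender data on Wikidata.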

Recommendations: people are looking for women collectors, they’re just hard to find because the data is homogenous - infra could look at supporting human-in-the-loop name disambiguation through platforms such as bionomia and wikidata - there may be useful side-products of their existing processes (clustering, for example) that could speed up independent researcher activities.


___
### Imports and params
___

In [1]:
import requests
import itertools
import re
import json
import urllib.parse
import time
import sys
from SPARQLWrapper import SPARQLWrapper, JSON

import collector_utils as co

In [2]:
# params to fiddle with
start_year_range = 1880
end_year_range = 1880
gbif_taxon_id = 7819616
confidence = 51
wikidata_username = 'Essssveeee'

____
### 1. Get GBIF botanist sample
____

Get a sample of unique values from dwc.recordedBy fields in occurrence records (preserved specimens only) in GBIF to work with. The sample is defined by year of collection (set to c. 1870 because that's when we started to see more botanistas appearing, but it isn't so recent they'll still be alive) + taxonomic groups within Plantae (mostly to keep the number of records/processing speed at a sensible level) 


#### Why tho?

* Anecdata but probably more historical women botanists around - flowers being ladylike n all that.
* Doesn't look like GBIF do much with the name strings they harvest, so should be pretty representative of source data quality?
* Bionomia records reference GBIF specimen occ records so there's already a link there. 

In [3]:
# Set daterange of interest and taxon-id (easily grabbable from occ search GUI url)
gbif_collectors = co.get_gbif_recordedBy(start_year_range, end_year_range, gbif_taxon_id)
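
`co.get_gbif_recordedBy` lives in `collector_utils` and isn't shown in this notebook. As a rough idea of what a helper like that could do, the sketch below asks the public GBIF occurrence search API for a `recordedBy` facet, restricted to preserved specimens, a taxon key and a year range. The function names, the `facet_limit` default and the response parsing are assumptions for illustration, not the helper's actual code, and the parameter names are worth double-checking against the GBIF API docs.

```python
import json
import urllib.parse
import urllib.request

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_gbif_facet_url(start_year, end_year, taxon_key, facet_limit=1000):
    """Build a facet query URL: no records returned, just recordedBy counts."""
    params = {
        "taxonKey": taxon_key,
        "year": f"{start_year},{end_year}",  # GBIF range syntax: "from,to"
        "basisOfRecord": "PRESERVED_SPECIMEN",
        "facet": "recordedBy",
        "facetLimit": facet_limit,  # assumed cap on distinct values returned
        "limit": 0,
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urllib.parse.urlencode(params)}"


def parse_recorded_by_facet(response):
    """Map each recordedBy string to its occurrence count."""
    counts = {}
    for facet in response.get("facets", []):
        for entry in facet.get("counts", []):
            counts[entry["name"]] = entry["count"]
    return counts


def fetch_recorded_by(start_year, end_year, taxon_key):
    """Hypothetical stand-in for co.get_gbif_recordedBy (needs network access)."""
    url = build_gbif_facet_url(start_year, end_year, taxon_key)
    with urllib.request.urlopen(url, timeout=30) as reply:
        return parse_recorded_by_facet(json.load(reply))
```

With the params above that would be `fetch_recorded_by(1880, 1880, 7819616)`. Keeping the per-string counts rather than only the unique strings would also give the frequency of each name variant for free.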

In [4]:
# Summary + counts
taxa_name = co.get_species_label(gbif_taxon_id)
print("Unique name strings")
print(f"Taxa of interest: {taxa_name} (preserved specimens only)")
print(f"Collection event date range: {start_year_range}-{end_year_range}")
print(f"Count of unique names: {len(gbif_collectors)}") 

Unique name strings
Taxa of interest: Charophyta (preserved specimens only)
Collection event date range: 1880-1880
Count of unique names: 241


##### Notes 

1. Could have retrieved `dwc.family` + `dwc.year` per record/unique collector name too. 
            {'original_name': 'Alfreda Collectoro', 
            'taxa': [t1, t2, t3],
            'year': [1870, 1870, 1871, 1864]}  
        
    * Might have been useful later on, but also would have been annoying to deal with 
    

2. Found a fair few recordedBy fields with 'Mrs/Miss' in them - just added them to the 'full names' list in the end, but could be worth returning them separately cos they're definitely women.

##### Interesting questions

1. How much is recordedByID being used? What kind of IDs are in there?
2. Distribution/frequency of each name variant within result set
3. Are names consistent within datasets/institutions?

_______

### 2. ID easier-to-resolve names
_______

Attempt to parse out 'full name' names from the collector list generated in the previous step, i.e. filter out names that are either a single-word string or have a leading or trailing initial.


#### Why tho?

* Easy wins! 
* Fuller names = easier/less risky to disambiguate 
* Need a decent name string to do any filtering based on demographics. 
* Interested to see the proportion of names that fall into full/thin camp and the different patterns used within this. 


In [5]:
# Separate name list from previous step into full names vs thin/initials
fuller_names, initials = co.get_rid_of_gunk(gbif_collectors)
fuller_names.sort()
initials.sort()
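
`co.get_rid_of_gunk` isn't shown here either. A minimal sketch of the rule described above (drop single-word strings and anything with a leading or trailing initial, after splitting on the obvious multi-collector delimiters). The regexes and function names are assumptions, and the real helper may well differ in the details:

```python
import re

# Delimiters for multi-collector strings: & ; | and the word 'and'.
SPLIT_PATTERN = re.compile(r"\s*(?:&|;|\||\band\b)\s*", flags=re.IGNORECASE)
# A single letter with an optional full stop, e.g. 'A' or 'A.'.
INITIAL_PATTERN = re.compile(r"^[^\W\d_]\.?$")


def split_collectors(value):
    """Split a recordedBy string that may hold several collectors."""
    return [part.strip() for part in SPLIT_PATTERN.split(value) if part.strip()]


def is_fuller_name(name):
    """True if the name has 2+ words and no leading or trailing initial."""
    tokens = name.replace(",", " ").split()
    if len(tokens) < 2:
        return False
    return not (INITIAL_PATTERN.match(tokens[0]) or INITIAL_PATTERN.match(tokens[-1]))


def partition_names(values):
    """Return (fuller_names, thin_names) as sorted lists of unique strings."""
    fuller, thin = set(), set()
    for value in values:
        # Empty strings survive as a 'thin' value so nothing is silently lost.
        for name in split_collectors(value) or [value.strip()]:
            (fuller if is_fuller_name(name) else thin).add(name)
    return sorted(fuller), sorted(thin)
```

Middle initials are allowed through on purpose, so `Ellsworth J. Hill` counts as a fuller name while `A. H. Curtiss` does not.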

In [6]:
# Peek at the first 10 names in each
print(f"Fuller names: {fuller_names[:10]}")
print()
print(f"Thinner names: {initials[:10]}")

Fuller names: ['Arthur Bennett', 'Barbey William', 'Beeby, Mr William Hadden FLS', 'Bennett, Mr Arthur F.L.S. - Croydon', 'Boissier Pierre Edmond', 'Bolton King', 'Braun Alexander Karl (Carl) Heinrich', 'Bulnheim Otto', 'Charles Bailey', 'Chenevard Paul']

Thinner names: ['', '....berger', 'A Bennett', 'A. Bennett', 'A. Dichtl', 'A. H. Curtiss', 'A. Kellogg', 'A. Loefgren', 'A. LÃ¶fgren', 'A. Löfgren']


In [7]:
# Summary + counts
f"Full name count: {len(fuller_names)}, thinner name count: {len(initials)}"

'Full name count: 62, thinner name count: 179'

##### Notes

1. Seems to work well enough, although full names are in the minority in all the samples I've tried. 
    * Could try clustering the names back around thin names once they've been resolved/if they can be? 
    * Outputs would need a bit of manual checking, but could be something citizen science folks would like to do + be good at?  
    

2. Still a few values that are clearly > 1 name though. 
    * Already splitting on these: & ; 'and' |, but the remainder look comma-delimited... 
    * Might be splittable using whitespace counts? 
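
The clustering idea in note 1 could start as simply as grouping thin names under a resolved name that shares a surname and first initial. This is only a sketch: `name_key` and `cluster_thin_names` are hypothetical names, it assumes `Forename(s) Surname` word order, and anything it suggests would still need the manual check mentioned above.

```python
from collections import defaultdict


def name_key(name):
    """Return (surname, first initial), assuming 'Forename(s) Surname' order."""
    tokens = name.replace(",", " ").replace(".", " ").split()
    if len(tokens) < 2:
        return None
    return tokens[-1].casefold(), tokens[0][0].casefold()


def cluster_thin_names(resolved_names, thin_names):
    """Group thin names under resolved names sharing surname + first initial."""
    by_key = defaultdict(list)
    for full in resolved_names:
        key = name_key(full)
        if key:
            by_key[key].append(full)

    clusters = defaultdict(list)
    for thin in thin_names:
        candidates = by_key.get(name_key(thin), [])
        # Only suggest a link when exactly one resolved name fits;
        # anything ambiguous is left for a human to decide.
        if len(candidates) == 1:
            clusters[candidates[0]].append(thin)
    return dict(clusters)
```

For example, `cluster_thin_names(["Arthur Bennett"], ["A Bennett", "A. Bennett", "A. Kellogg"])` groups the two Bennett variants under `Arthur Bennett` and leaves `A. Kellogg` alone.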

##### Interesting questions

1. Are patterns of errors characteristic to institutions?
2. Frequency of each name in terms of occurrence record count.


_______

### 3. Try to resolve against Bionomia
_______

We're trying to match the full collector names from previous steps to profiles on Bionomia, using a couple of API endpoints: the autocomplete widget and the JSON-LD search for people (the former gives a confidence match score, the latter allows additional search params to help narrow the search).

Docs: https://bionomia.net/developers

#### Why tho?

* Seems a good source of names + there was a nice API for searching them - seemed rude not to.
* Everything in Bionomia has to have either an ORCiD or all of [birth date, death date, wikidata QID] and for the date range we're looking at, ORCiDs seem unlikely. So. Everything we match in bionomia is also a match against wikidata (but not necessarily vice versa)
* Wikidata record means a person-id we can maybe trust, yay!
* We're trying to light up 'lost' people, so unmatching names are of interest because they aren't in bionomia, but the people were collectors... 
* ... or occurrence recordedBy strings are garbled beyond matchability, which is also useful - how much do they need to be cleaned up before they match well enough? 


In [8]:
# pass in list of names and confidence cutoff
bionomia_matches, bionomia_unmatches = co.search_bionomia_people_auto(fuller_names, confidence)
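
`co.search_bionomia_people_auto` is another `collector_utils` helper that isn't shown. A hedged sketch of the general shape, built on the `user.json` endpoint used elsewhere in this notebook: one request per name, keep the top candidate if its score clears the cutoff. The `score` field name and the function names are assumptions; the output dicts mirror the `original_name` / `bionomia_match` keys used in the cells below.

```python
import json
import urllib.parse
import urllib.request

BIONOMIA_USER_SEARCH = "https://api.bionomia.net/user.json"


def build_bionomia_url(name, limit=1):
    """Build the autocomplete search URL for one name string."""
    query = urllib.parse.urlencode({"q": name, "limit": limit})
    return f"{BIONOMIA_USER_SEARCH}?{query}"


def pick_match(candidates, cutoff):
    """Return the best candidate scoring at or above the cutoff, else None."""
    # Assumption: each candidate is a dict with a numeric 'score' field.
    scored = [c for c in candidates if c.get("score") is not None]
    if not scored:
        return None
    best = max(scored, key=lambda c: c["score"])
    return best if best["score"] >= cutoff else None


def search_bionomia(names, cutoff):
    """Hypothetical stand-in for co.search_bionomia_people_auto (needs network)."""
    matches, unmatches = [], []
    for name in names:
        with urllib.request.urlopen(build_bionomia_url(name), timeout=30) as reply:
            best = pick_match(json.load(reply), cutoff)
        if best is None:
            unmatches.append(name)
        else:
            matches.append({"original_name": name, "bionomia_match": best})
    return matches, unmatches
```

Keeping the unmatched names as a second return value matters here, since those are the 'lost' people the exercise is trying to surface.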

In [9]:
print(f"Confidence cutoff = {confidence}: {len(bionomia_matches)}/{len(fuller_names)} full names were matched against the basic Bionomia endpoint")

Confidence cutoff = 51: 38/62 full names were matched against the basic Bionomia endpoint


In [10]:
# Quick look
for match in bionomia_matches[:20]:
    print(f"{match['original_name']} -> {match['bionomia_match']['fullname']} ({match['bionomia_match']['wikidata']})")

Arthur Bennett -> Arthur Bennett (Q5706955)
Barbey William -> William Barbey (Q3568417)
Boissier Pierre Edmond -> Edmond Boissier (Q34430)
Bolton King -> Bolton King (Q18730032)
Braun Alexander Karl (Carl) Heinrich -> Alexander Braun (Q62855)
Bulnheim Otto -> Carl Otto Bulnheim (Q21506645)
Chenevard Paul -> Paul Chenevard (Q6067136)
Cyrus Guernsey Pringle -> Cyrus Pringle (Q3009492)
Edouard Rostan -> Edouard Rostan (Q21607448)
Ellsworth J. Hill -> Ellsworth Jerome Hill (Q19955677)
Fauconnet Charles Isaac -> Charles Isaac Fauconnet (Q21512667)
Favrat Louis -> Louis Favrat (Q3261879)
Frederick Arnold Lees -> Frederick Arnold Lees (Q21518563)
George Claridge Druce -> George Claridge Druce (Q601969)
George Nicholson -> George Nicholson (Q5542889)
George Nicholson, George Nicholson -> George Nicholson (Q5542889)
Gustaf Tiselius -> Gustaf Tiselius (Q18246582)
Henry Groves -> Henry Groves (Q5894527)
Henry Groves, James Groves -> James Groves (Q21512721)
Henry Groves, John Ralfs -> John Ralfs 

##### Notes

1. Does the order of words in a name matter? aka, any difference from `forename, surname` pattern vs `surname, forename`?  
    * Doesn't look like it. e.g., the two calls below bring back the same match with an identical score (43.65668):   
        `https://api.bionomia.net/user.json?q=Nilsson+Alb.&limit=1`  
        `https://api.bionomia.net/user.json?q=Alb.+Nilsson&limit=1`  
 
   
2. Hard to say from limited sample size, but the JSON-LD endpoint seemed to match less accurately than the basic one
    * Either collection dates are off (and so fall outside lifespan of collector)
    * Or collection dates are correct and what looks like a perfect match is someone with the same name at a different time (this would imply the true collector isn't in Bionomia yet, I suppose?)
    * .... could be both, of course. JSON-LD endpoint doesn't give the match score so hard to QC, either way. 
    * You can pass in families collected as well, which might help, but I reckon that comes from GBIF data links anyway & if it doesn't it's probably real mucky so would need resolving first.
    * confidence cutoff for the basic API seemed to hit a sweet spot in terms of accuracy around 51

##### Interesting questions

1. Does having a title/indicator of marital status make any difference to number of matches/scores?
2. How good is each endpoint at resolving non-full names? Would think it's risky unless you have a bunch of other match points, and tbf the responsibility for thin names is at data creation/collection so maybe infra should design services around the assumption of decent source data. The carrot can also be the stick ;) 

_______

### 4. Filter using wikidata
_______

Everything matched from the previous step has a wikidata identifier, and wikidata has the person's gender so could use it to tag bionomia records with gender, because that doesn't seem to be in their data.

Brings back all botanists on en wikidata with a birth and death date: https://w.wiki/47QX, so if any of the bionomia matches from the last step ain't in there, they are Secret and might even be Secret Women Botanists, oooh. 

#### Why tho?

* It'll give us a better list of 'dunno who these collectors are' and also an idea of how many might be women, based on distribution of gender in the wikidata results.


In [11]:
endpoint_url = "https://query.wikidata.org/sparql"

query = """#
SELECT DISTINCT ?botanist ?botanistLabel ?occLabel ?birth ?death ?gender WHERE {
  VALUES ?occ { wd:Q2374149 wd:Q2083925 } # occupations: botanist and botanical collector
  ?botanist wdt:P106 ?occ ;               # botanist has occupation ?occ
            wdt:P570 ?death ;             # botanist has deathdate
            wdt:P569 ?birth .             # botanist has birthdate
  optional {?botanist wdt:P21 ?gender . } # botanist has gender (optional)
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}

"""

In [12]:
# Get all the botanists (with birth and death dates) that are currently on wikidata
wikidata_result = co.get_wikidata_botanists(endpoint_url, query, wikidata_username)

In [13]:
# split based on value of gender tag
wiki_botanists = wikidata_result['results']['bindings']

wiki_unk = []    # gender field is null
wiki_men = []
wiki_women = []
wiki_nb = []     # any gender value other than female/male

for w_botanist in wiki_botanists:
    if 'gender' not in w_botanist:
        wiki_unk.append(co.get_qid(w_botanist))
    # Q6581072 = female
    elif w_botanist['gender']['value'] == 'http://www.wikidata.org/entity/Q6581072':
        wiki_women.append(co.get_qid(w_botanist))
    # Q6581097 = male
    elif w_botanist['gender']['value'] == 'http://www.wikidata.org/entity/Q6581097':
        wiki_men.append(co.get_qid(w_botanist))
    else:
        wiki_nb.append(co.get_qid(w_botanist))
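
`co.get_qid` isn't shown; presumably it pulls the bare QID out of the entity URI in each SPARQL binding. A sketch of that, plus a QID-to-gender lookup as an alternative to four separate lists. `qid_from_binding` and `gender_by_qid` are hypothetical names, and the labels are just for illustration:

```python
FEMALE = "http://www.wikidata.org/entity/Q6581072"
MALE = "http://www.wikidata.org/entity/Q6581097"


def qid_from_binding(binding, variable="botanist"):
    """Pull 'Q123' out of a SPARQL JSON binding holding an entity URI."""
    return binding[variable]["value"].rsplit("/", 1)[-1]


def gender_by_qid(bindings):
    """Map each QID to 'women', 'men', 'unknown' or 'other'."""
    lookup = {}
    for binding in bindings:
        if "gender" not in binding:
            label = "unknown"
        elif binding["gender"]["value"] == FEMALE:
            label = "women"
        elif binding["gender"]["value"] == MALE:
            label = "men"
        else:
            label = "other"
        # A QID that appears in several rows keeps the label of its last row.
        lookup[qid_from_binding(binding)] = label
    return lookup
```

Because the query returns `?occLabel` and the dates alongside each botanist, one person can come back as several rows; keying on QID collapses those, so `len(lookup)` counts people rather than rows.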
        

In [14]:
print(f"Total wikidata botanist records: {len(wiki_botanists)}")
print()
print(f"Women botanists: {len(wiki_women)}")
print(f"Null: {len(wiki_unk)}")
print(f"Nb botanists: {len(wiki_nb)}")
print(f"Men botanists: {len(wiki_men)}")


Total wikidata botanist records: 19636

Women botanists: 2789
Null: 403
Nb botanists: 0
Men botanists: 16444
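
To turn those counts into the "how many might be women" idea, a back-of-envelope sketch. It assumes unidentified collectors have the same gender split as botanists already on wikidata, which may well not hold; `share_of_women` and `expected_women` are made-up names:

```python
# Counts copied from the output above.
wikidata_counts = {"women": 2789, "men": 16444, "nb": 0, "null": 403}


def share_of_women(counts):
    """Share of women among records that have a gender value."""
    with_gender = counts["women"] + counts["men"] + counts["nb"]
    return counts["women"] / with_gender


def expected_women(n_collectors, counts):
    """Naive estimate: assumes collectors mirror wikidata's gender split."""
    return n_collectors * share_of_women(counts)


print(f"Share of women among gendered records: {share_of_women(wikidata_counts):.1%}")
print(f"Expected women per 100 unknown collectors: {expected_women(100, wikidata_counts):.1f}")
```

Null-gender records are excluded from the denominator here; including them would nudge the share down slightly.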


___
### 5. Known women!
___

In a specific and highly limited sense:
- is present in gbif occurrence records
- has a bionomia profile
- recordedBy value is clear enough to match against bionomia
- has gender data present in wikidata
- ... at the very minimum, in terms of data infra

In [15]:
women = []
men = []
nb = []
unk = []
uhoh = []

# Categorise collectors by gender
for x in bionomia_matches:
    
    b_qid = x['bionomia_match']['wikidata']
    
    if b_qid in wiki_women:
        women.append(b_qid)
    elif b_qid in wiki_men:
        men.append(b_qid)
    elif b_qid in wiki_nb:
        nb.append(b_qid)
    elif b_qid in wiki_unk:
        unk.append(b_qid)
    else:
        uhoh.append(b_qid)

In [16]:
print(f"Women: {len(women)}")
print(f"Men: {len(men)}")
print(f"Non-binary: {len(nb)}")
print(f"Unknown: {len(unk)}")
print(f"Uhoh: {len(uhoh)}") ## orcid, maybe? 

Women: 1
Men: 35
Non-binary: 0
Unknown: 0
Uhoh: 2
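
The 'Uhoh' bucket mixes two different situations: a Bionomia match with no wikidata QID at all (ORCID-only profiles, maybe), and a QID that the botanist query didn't return (no matching occupation, or missing birth/death date). A sketch for telling them apart; `triage_unclassified` is a hypothetical helper and `known_qids` is assumed to be the set of every QID returned by the wikidata query:

```python
def triage_unclassified(matches, known_qids):
    """Split matches into 'no QID' and 'QID not in the wikidata query results'."""
    no_qid, outside_query = [], []
    for match in matches:
        qid = match["bionomia_match"].get("wikidata")
        if not qid:
            no_qid.append(match["original_name"])
        elif qid not in known_qids:
            outside_query.append((match["original_name"], qid))
    return no_qid, outside_query
```

In this notebook, `known_qids` could be built as `set(wiki_women) | set(wiki_men) | set(wiki_nb) | set(wiki_unk)` and the function called on `bionomia_matches`. Names in the second list are the ones worth checking on wikidata for a missing occupation or date.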
