## BiCIKL-Hackathon Topic 9: Hidden Women in Science

____
### Workplan
____

1. ~Get a list of botanists off GBIF that are affiliated with records from around a test subset of records (taxon, event year)~
2. ~Try to pick out full names, both to make resolution easier and to identify likely not-men - what's the frequency of each in the data?~
3. ~Try to resolve names against Bionomia - what fails and why~
4. For names that aren't found in Bionomia, try to resolve against wikidata
5. Use any wikidata QIDs resulting from 3. and 4. to get gender label
6. For names that aren't found in either source: try a few broad-brush approaches to identifying which ones might be women (could look for titles eg 'Ms Miss Mrs' or use gender-spotting API etc)
7. For recs that might be women, run against GBIF occ ID again to get an idea of years of activity (slightly nonsensical in the test dataset, admittedly), major taxonomic groups of study, institutions of deposition? Useful output would be a set of unk women with enough hooks about their activities to encourage human research
8. Could also run names against BHL OCR text search to get a list of possible references together to augment 7.  
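
For step 5, a minimal sketch of how a gender label could be pulled for a QID. It assumes the standard Wikidata `wbgetentities` API action and that gender is recorded under property P21 (sex or gender). The function names, the `GENDER_LABELS` lookup and the User-Agent string are made up for illustration and aren't part of the notebook; it uses the standard library rather than `requests` just to keep the sketch dependency-free.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
# Two common P21 values; anything else is left as a raw QID for a human to look at
GENDER_LABELS = {"Q6581072": "female", "Q6581097": "male"}


def extract_gender_qid(entity):
    """Return the first P21 (sex or gender) value in a Wikidata entity dict, or None."""
    for claim in entity.get("claims", {}).get("P21", []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue:
            return datavalue["value"]["id"]
    return None


def fetch_entity(qid):
    """Hypothetical helper: fetch the claims for one QID from the Wikidata API."""
    query = urllib.parse.urlencode(
        {"action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_API}?{query}",
        headers={"User-Agent": "hidden-women-notebook-sketch/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["entities"][qid]


def gender_label(qid):
    """Map a person QID to 'female', 'male', another gender QID, or None if P21 is missing."""
    gender_qid = extract_gender_qid(fetch_entity(qid))
    return GENDER_LABELS.get(gender_qid, gender_qid)
```

Something like `gender_label('Q1940785')` would then look up the Anna Weber-van Bosse match from step 3. A missing P21 comes back as `None`, which is worth keeping separate from 'male' when counting.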

[Topic overview](https://github.com/pensoft/BiCIKL/tree/main/Topic%209%20Hidden%20women%20in%20science)



___
### Imports and params
___

In [10]:
import requests
import itertools
import re
import json
import urllib.parse
import time

In [11]:
start_year_range = 1870
end_year_range = 1870
gbif_taxon_id = 7819616
confidence = 51

___
### Functions
___

In [12]:
"""
Retrieve a unique list of values in GBIF occurrence dwc.recordedBy fields for a given date range and taxon  

:param start_year: start year of query date range (YYYY)
:param end_year: end year of query date range (YYYY)
:param taxon_key: GBIF taxon key
:return: Set of names (unique strings)
"""
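def get_gbif_recordedBy(start_year, end_year, taxon_key):
    # Get the value of recordedBy for each record, where it exists
    collectors = set()
    
    for offset in itertools.count(step=300):
        # Build the query from the function arguments (not the notebook globals) and pass it
        # as params so no stray whitespace/newlines end up in the URL
        query_params = {'year': f"{start_year},{end_year}",
                        'basisOfRecord': 'PRESERVED_SPECIMEN',
                        'taxonKey': taxon_key,
                        'limit': 300,
                        'offset': offset}
        r = requests.get("https://api.gbif.org/v1/occurrence/search", params=query_params).json()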
        
        # Comment this out if you don't want to keep an eye on the progress of the queries
        # print(f"offset: {offset}")
        
        # Get the info we're interested in
        # Could beef this up in the future to grab recordedBy ID, inst/dataset ID, taxa of interest, years of activity?
        for d in r['results']:
            if 'recordedBy' in d:
                recordedBy = d['recordedBy']
                # These delimiters seem to usually indicate > 1 name, so split and add back in
                collectors.update(split_multiple_names(recordedBy, r'&|\|| and |;'))
            else:
                continue

        if r['endOfRecords']:
            break

    return collectors

In [13]:
"""
Lil utility func to turn a single gbif species id into something human-readable
"""

def get_species_label(gbif_species_id):
    response = requests.get(f"https://api.gbif.org/v1/species/{gbif_species_id}")
    json_response = response.json()
    name_label = json_response['scientificName']
    
    return name_label
    

In [14]:
"""
Identify strings which are more likely to contain a full given name vs. those that probably don't. 
Filters out single-word strings, leading or trailing single-character initials, unless the string also includes
a title in (Mrs, Miss)

:param input_names: iterable of names
:return: List of full names, list of thinner/harder-to-resolve names
"""
def get_rid_of_gunk(input_names):
    good_names = []
    initials = []
    
    for p_name in input_names: # there's probably a better way of combining all these regex eh sarah
        front_match = re.search(r'^[a-zA-Z]{3}', p_name)  # starts with 3+ ASCII letters
        tail_match = re.search(r'[a-zA-Z]{3}$', p_name)   # ends with 3+ ASCII letters
        multi_words = re.search(r'\s', p_name)            # more than one word
        
        # Only keep if there's a miss/mrs title, or if it doesn't start or end with an initial
        if (front_match and tail_match and multi_words) or 'Mrs' in p_name or 'Miss' in p_name:
            good_names.append(p_name)
        else:
            initials.append(p_name)
        
    return good_names, initials

In [15]:
"""
Break up a string that includes common list delimiter characters 

:param input_names: single string that might be a stringified list of names
:param delimiters: Pipe-delimited regex alternation of split-on strings. e.g., '&|\|| and |;'
:return: List of whitespace-stripped name strings resulting from the split
"""
def split_multiple_names(input_names, delimiters):
    
    return [s.strip() for s in re.split(delimiters, input_names)]

In [38]:
"""
Broad-brush check for matches against Bionomia

"""
def search_bionomia_people_auto(names, cutoff_score=50):
    autocomplete_base_url =  'https://api.bionomia.net/user.json?q=' # Useful to get confidence scores of match
    
    matches = []
    unmatches = []
    
    # url encode each name string and get result + score from autocomplete_base_url
    for name in names:
        # If you want to keep an eye on the progress, uncomment 
#         print(f"{name}...")
        
        response = requests.get(f"{autocomplete_base_url}{urllib.parse.quote_plus(name)}&limit=1")
        response.raise_for_status()
                           
        # Un-matching queries return an empty list
        json_response = response.json()
        if not json_response:
            unmatches.append(name)
        else:
            # stash top result dict with original query/name string
            top_match = json_response[0]
            if top_match['score'] < cutoff_score:
                unmatches.append(name)
            else:
                matches.append({'original_name': name, 'bionomia_match': top_match})

    return matches, unmatches

In [17]:
"""
Stricter check for matches against Bionomia - won't return matches if the collection date is not within the 
lifespan of the person defined in wikidata/bionomia (e.g. year of birth)

"""
def search_bionomia_people_detail(names, year):
    details_base_url = 'https://api.bionomia.net/users/search?'
    
    for name in names:
        
        # throttle the connection - getting connection timeout errors so maybe this will help...
        time.sleep(1)
        
        query_params = {'q': name['original_name'], 
                        'date': year, 
                        'strict': 'true',
                        'limit': 1
                       }
        response = requests.get(details_base_url, params=query_params)
        response.raise_for_status()
        
        # print(response.url)
        
        json_response = response.json()
        
        # Check we've returned results - no score/confidence cutoff available here though
        result_count = json_response['opensearch:totalResults']
        
        # unpack the first ['item'] in dataElement and handle no results scenario
        if result_count == 0:
            continue
        else:
            # store in the original dict to help compare results from the different approaches
            name['bionomia_detail_match'] = json_response['dataFeedElement'][0]['item']
            
    return names

____
### 1. Get GBIF botanist sample
____

Get a sample of unique values from dwc.recordedBy fields in occurrence records (preserved specimens only) in GBIF to work with. The sample is defined by year of collection (set to c. 1870 because that's when we started to see more botanistas appearing, but it isn't so recent they'll still be alive) + taxonomic groups within Plantae (mostly to keep the number of records/processing speed at a sensible level) 


#### Why tho?

* Anecdata but probably more historical women botanists around - flowers being ladylike n all that.
* Doesn't look like GBIF do much with the name strings they harvest, so should be pretty representative of source data quality?
* Bionomia records reference GBIF specimen occ records so there's already a link there. 

In [18]:
# Set daterange of interest and taxon-id (easily grabbable from occ search GUI url)
gbif_collectors = get_gbif_recordedBy(start_year_range, end_year_range, gbif_taxon_id)

In [19]:
# Summary + counts
taxa_name = get_species_label(gbif_taxon_id)
print("Unique name strings")
print(f"Taxa of interest: {taxa_name} (preserved specimens only)")
print(f"Collection event date range: {start_year_range}-{end_year_range}")
print(f"Count of unique names: {len(gbif_collectors)}") 

Unique name strings
Taxa of interest: Charophyta (preserved specimens only)
Collection event date range: 1870-1870
Count of unique names: 136


##### Notes and other interesting stuff

1. Should have retrieved `dwc.family` + `dwc.year` per record/unique collector name too. 
    * e.g.
        `{'original_name': 'Alfreda Collectoro', 
            'taxa': [t1, t2, t3], 
            'year': [1870, 1870, 1871, 1864]}`
    * Would have been useful later on, but would have been annoying to deal with splitting delimited names...  
    

2. Found a fair few recordedBy fields with 'Mrs/Miss' in them - just added them to the 'full names' list in the end, but could be worth returning them separately cos they're definitely women.
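
A rough sketch of what note 1 might look like in practice: build one profile per name string, carrying the family and year from each record that name appears on. It assumes each GBIF result dict may carry `recordedBy`, `family` and `year` keys, and it reuses the same delimiter pattern as `get_gbif_recordedBy`. The function name `build_collector_profiles` is hypothetical.

```python
import re
from collections import defaultdict

# Same delimiters the notebook already splits recordedBy on
NAME_DELIMITERS = r'&|\|| and |;'


def build_collector_profiles(records):
    """Group family and year values under each (split) recordedBy name string."""
    profiles = defaultdict(lambda: {'taxa': set(), 'year': []})

    for record in records:
        recorded_by = record.get('recordedBy')
        if not recorded_by:
            continue
        # A multi-name string credits every name in it with the record's taxon and year
        for name in re.split(NAME_DELIMITERS, recorded_by):
            name = name.strip()
            if not name:
                continue
            if 'family' in record:
                profiles[name]['taxa'].add(record['family'])
            if 'year' in record:
                profiles[name]['year'].append(record['year'])

    return dict(profiles)
```

Splitting first and attaching the record values to each resulting name sidesteps most of the delimiter annoyance, at the cost of assuming everyone named on a label collected everything on it.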

##### Things I didn't get round to looking at

1. How much is recordedByID being used? What kind of IDs are in there?
2. Distribution/frequency of each name variant within result set
3. Are names consistent within datasets/institutions

_______

### 2. ID easier-to-resolve names
_______

Attempt to parse out 'full' names from the collector list generated in the previous step, i.e. filter out names that are either a single-word string or have a leading or trailing initial. 


#### Why tho?

* Easy wins! 
* Fuller names = easier/less risky to disambiguate 
* Need a decent name string to do any filtering based on demographics. 
* Interested to see the proportion of names that fall into full/thin camp and the different patterns used within this. 


In [20]:
# Separate name list from previous step into full names vs thin/initials
fuller_names, initials = get_rid_of_gunk(gbif_collectors)
fuller_names.sort()
initials.sort()

In [36]:
# Peek at the first 10 names in each
print(f"Fuller names: {fuller_names[:10]}")
print()
print(f"Thinner names: {initials[:10]}")

Fuller names: ['Alexander Karl (Carl) Heinrich Braun', 'Axel Blytt', 'Axel Blytt, Arnell', 'Axel Tullberg', 'Blytt, Arnell', 'Bror Tydell', 'Carl Fredrik Otto Nordstedt', 'Carl Fredrik Otto Norstedt', 'Collector unknown', 'Conr. Indebetou']

Thinner names: ['-. Porntin', '?', 'A Ley', 'A. Blytt, Arnell', 'A. Braun', 'A. C. H. Braun', 'A. G. Blytt', 'A. H. Curtiss', 'A. Tullberg', 'Al. Braun']


In [37]:
# Summary + counts
f"Full name count: {len(fuller_names)}, thinner name count: {len(initials)}"

NameError: name 'fuller_names' is not defined

##### Notes and other interesting stuff

1. Seems to work well enough, although full names are in the minority in all the samples I've tried. 
    * Could try clustering the names back around thin names once they've been resolved/if they can be? 
    * Outputs would need a bit of manual checking, but could be something citizen science folks would like to do + be good at?  
    

2. Still a few values that are clearly > 1 name though. 
    * Already splitting on these: & ; 'and' |, but the rest look comma-delimited... 
    * Might be splittable using whitespace counts? 
    * Had a go at splitting on ',' symbols where there were 4+ whitespaces in the string, but decided not to go with it because I don't know enough about name patterns worldwide.
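
For the record, a sketch of the comma-splitting heuristic described in the last bullet: only split on commas when the string contains at least four whitespace characters. This is a reconstruction from the description rather than the code that was actually tried, and the function name is made up.

```python
import re


def split_on_commas_if_long(name_string, min_whitespace=4):
    """Split on commas only when the string has at least `min_whitespace` whitespace chars."""
    if len(re.findall(r'\s', name_string)) < min_whitespace:
        return [name_string]
    return [part.strip() for part in name_string.split(',') if part.strip()]


# Short 'surname, forename' style strings are left alone
print(split_on_commas_if_long('Blytt, Arnell'))
# A long comma-delimited list of people gets split (made-up combination of names)
print(split_on_commas_if_long('Axel Blytt, Hampus Wilhelm Arnell, Otto Nordstedt'))
# ...but so does one person with a long inverted name, which is the risk
print(split_on_commas_if_long('Syme, John Thomas Irvine Boswell'))
```

The last call shows why this is risky: a single `surname, forenames` string with enough given names is indistinguishable from a list by whitespace count alone.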

##### Things I didn't get round to looking at

1. Wonder if particular patterns of errors are characteristic of institutions? It does feel that fields like recordedBy are usable/consistent enough within institutions, so no-one fixes them up much, but sufficiently different between them that linking them is a 'mare, which might mean they're distinctive enough to be useful... 
2. Frequency of each name in terms of occurrence record count.


_______

### 3. Try to resolve against Bionomia
_______

We're trying to match the full collector names from previous steps to profiles on Bionomia, using a couple of API endpoints: the autocomplete widget and the JSON-LD search for people (the former gives a match confidence score, the latter allows additional search params to help narrow the search)  

Docs: https://bionomia.net/developers

#### Why tho?

* Seems a good source of names + there was a nice API for searching them - seemed rude not to.
* Everything in Bionomia has to have either an ORCiD or all of [birth date, death date, wikidata QID] and for the date range we're looking at, ORCiDs seem unlikely. So. Everything we match in bionomia is also a match against wikidata (but not necessarily vice versa)
* Wikidata record means a person-id we can maybe trust, yay!
* We're trying to light up 'lost' people, so unmatching names are of interest because they aren't in bionomia, but the people were collectors... 
* ... or occurrence recordedBy strings are garbled beyond matchability, which is also useful - how much do they need to be cleaned up before they match well enough? Are there other (poss. easier to clean/infer) fields that could help increase match confidence?


In [23]:
# pass in the list of fuller names and the confidence cutoff
matches, unmatches = search_bionomia_people_auto(fuller_names, confidence)

Alexander Karl (Carl) Heinrich Braun...
Axel Blytt...
Axel Blytt, Arnell...
Axel Tullberg...
Blytt, Arnell...
Bror Tydell...
Carl Fredrik Otto Nordstedt...
Carl Fredrik Otto Norstedt...
Collector unknown...
Conr. Indebetou...
Conrad Indebetou...
Edouard Rostan...
Flodén, Alexis...
Frederic Stratton...
Frederick Arnold Lees...
Gust. Bernoulli...
Hampus Wilhelm Arnell...
Herb Suringar WFR...
Herb Weber-van Bosse...
Jair Törnblad...
James Groves...
Jean Armand Isidore Pancher...
Johan Elias Strömberg...
John Thomas Irvine Boswell Syme...
Lars Johan Wahlstedt...
Lundquist, Per Fredrik Alexander...
Mag. Östman...
Mag. Östmann...
Nordstedt CFO...
Nordstedt, Carl Fredrik Otto...
Otto Nordstedt...
Strömborg, Johan Elias...
Tullberg, Sven Axel Teodor...
Wahlstedt, Lars Johan...
Wilhelm Berndes...
William Curnow...
see Collection Note...


In [29]:
print(f"With a confidence cutoff of {confidence}, {len(matches)} out of {len(fuller_names)} full names were matched against the basic Bionomia endpoint")

With a confidence cutoff of 51, 19 out of 37 full names were matched against the basic Bionomia endpoint


In [26]:
# Quick eyeball to QC results
for match in matches:
    print(f"{match['original_name']} -> {match['bionomia_match']['fullname']} ({match['bionomia_match']['wikidata']})")

Alexander Karl (Carl) Heinrich Braun -> Alexander Braun (Q62855)
Axel Blytt -> Axel Gudbrand Blytt (Q610981)
Axel Tullberg -> Sven Axel Tullberg (Q6335227)
Carl Fredrik Otto Nordstedt -> Carl Fredrik Otto Nordstedt (Q6015981)
Edouard Rostan -> Edouard Rostan (Q21607448)
Frederic Stratton -> Frederic Stratton (Q21609937)
Frederick Arnold Lees -> Frederick Arnold Lees (Q21518563)
Gust. Bernoulli -> Carl Gustav Bernoulli (Q121991)
Hampus Wilhelm Arnell -> Hampus Wilhelm Arnell (Q5559534)
Herb Weber-van Bosse -> Anna Weber-van Bosse (Q1940785)
James Groves -> James Groves (Q21512721)
Jean Armand Isidore Pancher -> Jean Armand Isidore Pancher (Q3170415)
John Thomas Irvine Boswell Syme -> John Thomas Irvine Boswell Syme (Q5933649)
Lars Johan Wahlstedt -> Lars Johan Wahlstedt (Q16650555)
Mag. Östman -> Magnus Östman (Q21522469)
Nordstedt CFO -> Carl Fredrik Otto Nordstedt (Q6015981)
Otto Nordstedt -> Carl Fredrik Otto Nordstedt (Q6015981)
Wilhelm Berndes -> Wilhelm Eugene Berndes (Q21506016)


##### Notes and other interesting stuff

1. Does the order of words in a name matter? aka, any difference from `forename, surname` pattern vs `surname, forename`?  
    * Doesn't look like it. e.g., the two calls below bring back the same match with an identical score (43.65668):   
        `https://api.bionomia.net/user.json?q=Nilsson+Alb.&limit=1`  
        `https://api.bionomia.net/user.json?q=Alb.+Nilsson&limit=1`  
 
   
2. Hard to say from limited sample size, but the JSON-LD endpoint seemed to match less accurately than the basic one
    * Either collection dates are off (and so fall outside lifespan of collector)
    * Or collection dates are correct and what looks like a perfect match is someone with the same name at a different time (this would imply the true collector isn't in Bionomia yet, I suppose?)
    * .... could be both, of course. JSON-LD endpoint doesn't give the match score so hard to QC, either way. 
    * You can pass in families collected as well, which might help, but I reckon that comes from GBIF data links anyway & if it doesn't it's probably real mucky so would need resolving first.
    * confidence cutoff for the basic API seemed to hit a sweet spot in terms of accuracy around 51
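
One way to put the 'sweet spot around 51' observation on firmer ground would be to sweep the cutoff and count how many top matches survive at each value. A minimal sketch, assuming the top-match score for every name has been stashed before any cutoff is applied (`search_bionomia_people_auto` doesn't currently do that). The function name is hypothetical, and the scores in the example are made up apart from 43.65668, which is the Nilsson score mentioned above.

```python
def count_matches_by_cutoff(scores, cutoffs):
    """For each cutoff, count the top-match scores that would be accepted (score >= cutoff)."""
    return {cutoff: sum(score >= cutoff for score in scores) for cutoff in cutoffs}


# Illustrative scores only, not real results
example_scores = [43.65668, 50.0, 51.0, 75.2]
print(count_matches_by_cutoff(example_scores, [40, 51, 60]))
```

Counts alone don't show accuracy, so each cutoff would still need the same eyeball QC as the match list above, but it narrows down where to look.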

##### Things I didn't get round to looking at

1. Does having a title/indicator of marital status make any difference to number of matches/scores?
2. How good is each endpoint at resolving non-full names? Would think it's risky unless you have a bunch of other match points, and tbf the responsibility for thin names is at data creation/collection so maybe infra should design services around the assumption of decent source data. The carrot can also be the stick ;) 

Next steps:
- run queries that found a match + year of interest against the JSON-LD bionomia endpoint 
- see if abbreviated names work against the JSON-LD endpoint, providing there's a year -- skip for now
- ~Stash any good matches for now~ - yup, but need to write 'em out somewhere  
- for full names what don't match: ID women and then try to run them against wikidata
- If there is a match in wikidata: figure out the best info etc to dumpout to encourage folks to make profiles on bionomia
- No match on wikidata or bionomia: clustering names/families of research? Look in BHL?
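
For the 'ID women' step, the cheapest first pass is the title check from the workplan. A minimal sketch, assuming English-language titles only and using made-up example names; the pattern and function names are hypothetical. Word boundaries are used so that strings which merely start with 'Miss' don't count, which the plain `'Miss' in p_name` test in `get_rid_of_gunk` would let through.

```python
import re

# English titles only: anything like Frau/Fru/Mme would need adding for other languages
TITLE_PATTERN = re.compile(r'\b(?:Mrs|Miss|Ms)\b')


def split_by_title(names):
    """Separate name strings that carry a woman's title from those that don't."""
    titled, untitled = [], []
    for name in names:
        (titled if TITLE_PATTERN.search(name) else untitled).append(name)
    return titled, untitled


# Made-up examples, not values from the GBIF sample
titled, untitled = split_by_title(['Mrs E. Smith', 'Miss Jones', 'A. Braun', 'Missouri Bot. Garden'])
print(titled)
print(untitled)
```

This only ever finds women who were recorded with a title, so it's a lower bound rather than a classifier. Names without a title stay unknown, not 'men'.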

___
### 4. Some things that didn't work, but in an interesting way 
___
Don't run these unless you especially want to

In [None]:
# Will using year of collection + JSON-LD endpoint remove/improve these? Looking for improved accuracy, not precision
strict_matches = search_bionomia_people_detail(matches, start_year_range)

In [None]:
# ... okay this made it worse. Ha!
print(f"recordedBy value; bionomia best match (match score); bionomia + year of collection best match")
print("--------")

for i in strict_matches:
    if 'bionomia_detail_match' in i:
        strict_fullname = i['bionomia_detail_match']['name']
    else:
        strict_fullname = 'Not found'
        
    print(f"{i['original_name']} -> {i['bionomia_match']['fullname']} ({i['bionomia_match']['score']}) -> {strict_fullname}")

In [None]:
# These are the records where the first bionomia name suggestion differed from the result returned 
# when year of collection is included... Assume either collection date or collector lifespan is wrong? 

print(f"recordedBy value; bionomia best match (match score); bionomia + year of collection best match")
print("--------")

for p in strict_matches:
    if 'bionomia_detail_match' in p:
        if p['bionomia_match']['fullname'] != p['bionomia_detail_match']['name']:
            print(f"{p['original_name']}; {p['bionomia_match']['fullname']} ({p['bionomia_match']['score']}); {p['bionomia_detail_match']['name']}")