## BiCIKL-Hackathon Topic 9: Hidden Women in Science

____
### Workplan
____

1. ~Get a list of botanists off GBIF that are affiliated with a test subset of records (taxon, event year)~
2. ~Try to pick out full names, both to make resolution easier and to identify likely not-men - what's the frequency of each in the data?~
3. ~Try to resolve names against Bionomia - what fails and why~
4. For names that aren't found in Bionomia, try to resolve against Wikidata
5. Use any Wikidata QIDs resulting from 3. and 4. to get gender label
6. For names that aren't found in either source: try a few broad-brush approaches to identifying which ones might be women (could look for titles, e.g. 'Ms', 'Miss', 'Mrs', or use a gender-spotting API etc.)
7. For records that might be women, run against GBIF occurrence ID again to get an idea of years of activity (slightly nonsensical in the test dataset, admittedly), major taxonomic groups of study, institutions of deposition? A useful output would be a set of unknown women with enough hooks about their activities to encourage human research
8. Could also run names against BHL OCR text search to get a list of possible references together to augment 7.  
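
Steps 4 and 5 could be sketched roughly as follows, assuming a QID is already in hand (e.g. from a Bionomia match). The helper names `gender_qid_from_entity` and `fetch_wikidata_entity` are hypothetical and not part of this notebook. The sketch relies on Wikidata's `P21` ("sex or gender") property and the `Special:EntityData` JSON export, and it only labels the two most common values; anything else is reported as unknown and left for human review.

```python
import json
import urllib.request

# Wikidata items commonly used as values of P21 ("sex or gender")
GENDER_LABELS = {'Q6581072': 'female', 'Q6581097': 'male'}


def gender_qid_from_entity(entity):
    """Return the QID of the first P21 (sex or gender) claim, or None if absent."""
    for claim in entity.get('claims', {}).get('P21', []):
        datavalue = claim.get('mainsnak', {}).get('datavalue')
        if datavalue:  # 'novalue'/'somevalue' snaks carry no datavalue
            return datavalue['value']['id']
    return None


def fetch_wikidata_entity(qid):
    """Hypothetical helper: download one entity document from Wikidata.

    Assumes the QID is not a redirect (a redirected QID comes back under a different key).
    """
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    request = urllib.request.Request(url, headers={'User-Agent': 'hidden-women-notebook-sketch'})
    with urllib.request.urlopen(request) as response:
        return json.load(response)['entities'][qid]


# Offline example using a hand-made entity fragment (not real Wikidata output)
sample_entity = {'claims': {'P21': [{'mainsnak': {'datavalue': {'value': {'id': 'Q6581072'}}}}]}}
sample_gender = GENDER_LABELS.get(gender_qid_from_entity(sample_entity), 'unknown')
```

In the notebook this would be called as `gender_qid_from_entity(fetch_wikidata_entity(qid))` for each QID collected in steps 3 and 4.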

[Topic overview](https://github.com/pensoft/BiCIKL/tree/main/Topic%209%20Hidden%20women%20in%20science)



___
### Imports and params
___

In [1]:
import requests
import itertools
import re
import json
import urllib.parse
import time

In [2]:
start_year_range = 1870
end_year_range = 1870
gbif_taxon_id = 7819616

___
### Functions
___

In [3]:
"""
Retrieve a unique list of values in GBIF occurrence dwc.recordedBy fields for a given date range and taxon  

:param start_year: start year of query date range (YYYY)
:param end_year: end year of query date range (YYYY)
:param taxon_key: GBIF taxon key
:return: Set of names (unique strings)
"""
def get_gbif_recordedBy(start_year, end_year, taxon_key):
    # Get the value of recordedBy for each record, where it exists
    collectors = set()
    
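    for offset in itertools.count(step=300):
        # Pass the query as params so the URL carries no stray whitespace or newlines
        r = requests.get("https://api.gbif.org/v1/occurrence/search",
                         params={'year': f"{start_year},{end_year}",
                                 'basisOfRecord': 'PRESERVED_SPECIMEN',
                                 'taxonKey': taxon_key,  # the function argument, not the global
                                 'limit': 300,
                                 'offset': offset}).json()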
        
        # Comment this out if you don't want to keep an eye on the progress of the queries
        # print(f"offset: {offset}")
        
        # Get the info we're interested in
        # Could beef this up in the future to grab recordedBy ID, inst/dataset ID, taxa of interest, years of activity?
        for d in r['results']:
            if 'recordedBy' in d:
                recordedBy = d['recordedBy']
                # These delimiters seem to usually indicate > 1 name, so split and add back in
                collectors.update(split_multiple_names(recordedBy, r'&|\|| and |;'))
            else:
                continue

        if r['endOfRecords']:
            break

    return collectors

In [4]:
"""
Identify strings which are more likely to contain a full given name vs. those that probably don't. 
Filters out single-word strings, leading or trailing single-character initials, unless the string also includes
a title in (Mrs, Miss)

:param input_names: iterable of names
:return: List of full names, list of thinner/harder-to-resolve names
"""
def get_rid_of_gunk(input_names):
    good_names = []
    initials = []
    
    for p_name in input_names: # there's probably a better way of combining all these regex eh sarah
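        front_match = re.search(r'^[a-zA-Z]{3}', p_name)  # starts with 3 letters, i.e. not an initial
        tail_match = re.search(r'[a-zA-Z]{3}$', p_name)   # ends with 3 letters, i.e. not an initial
        multi_words = re.search(r'\s', p_name)            # more than one word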
        
        # Only keep if there's a miss/mrs title, or if it doesn't start or end with an initial
        if (front_match and tail_match and multi_words) or 'Mrs' in p_name or 'Miss' in p_name:
            good_names.append(p_name)
        else:
            initials.append(p_name)
        
    return good_names, initials
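
The three searches above could probably be folded into one pattern. The sketch below is one hypothetical way to do it, not a drop-in replacement: `looks_like_full_name` is an invented name, the title check uses word boundaries rather than substring matching, and 'Ms' is added to the titles. Like the original, it only recognises ASCII letters at the start and end of the string.

```python
import re

# "Starts with 3 letters, contains whitespace somewhere, ends with 3 letters"
FULL_NAME = re.compile(r'^[a-zA-Z]{3}.*\s.*[a-zA-Z]{3}$')
# Word-bounded titles; 'Ms' is an addition that get_rid_of_gunk does not check
TITLE = re.compile(r'\b(?:Mrs|Miss|Ms)\b')


def looks_like_full_name(p_name):
    """Hypothetical single-name version of the check in get_rid_of_gunk."""
    return bool(FULL_NAME.search(p_name) or TITLE.search(p_name))


# 'Smith, Mrs A.' is a made-up example; the others follow patterns seen in the data
examples = ['Axel Blytt', 'A. Blytt', 'Blytt', 'Smith, Mrs A.', 'Nordstedt CFO']
flags = {name: looks_like_full_name(name) for name in examples}
```

Note that 'Nordstedt CFO' still passes because 'CFO' looks like a three-letter word, which matches the behaviour seen later in the notebook.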

In [5]:
"""
Break up strings that include common list delimiter characters 

:param input_names: iterable of names that might be a stringified list
:param delimiters: Pipe delimited, single-string list of split-on characters. e.g., '&|\|| and |;'
:return: Input list with additional items resulting from split appended
"""
def split_multiple_names(input_names, delimiters):
    
    return [s.strip() for s in re.split(delimiters, input_names)]
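
A quick illustration of what the delimiter pattern does and does not split. The function is repeated so the snippet runs on its own; apart from 'Axel Blytt, Arnell', which appears in the GBIF sample, the input strings are made up.

```python
import re


def split_multiple_names(input_names, delimiters):
    return [s.strip() for s in re.split(delimiters, input_names)]


delimiters = r'&|\|| and |;'

ampersand = split_multiple_names('Axel Blytt & Arnell', delimiters)
# ['Axel Blytt', 'Arnell']
and_word = split_multiple_names('Groves, James and Groves, Henry', delimiters)
# ['Groves, James', 'Groves, Henry']
pipe = split_multiple_names('Anderson|Holm', delimiters)
# ['Anderson', 'Holm'] - the 'and' inside 'Anderson' is untouched because ' and ' needs surrounding spaces
comma = split_multiple_names('Axel Blytt, Arnell', delimiters)
# ['Axel Blytt, Arnell'] - commas are not delimiters, so this stays as one string
```

Leaving commas alone avoids breaking `surname, forename` strings, at the cost of keeping comma-separated lists of people together.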

In [6]:
"""
Broad-brush check for matches against Bionomia

"""
def search_bionomia_people_auto(names, cutoff_score=50):
    autocomplete_base_url =  'https://api.bionomia.net/user.json?q=' # Useful to get confidence scores of match
    
    matches = []
    unmatches = []
    
    # url encode each name string and get result + score from autocomplete_base_url
    for name in names:
        response = requests.get(f"{autocomplete_base_url}{urllib.parse.quote_plus(name)}&limit=1")
        response.raise_for_status()
                           
        # Un-matching queries return an empty list
        json_response = response.json()
        if not json_response:
            print(f"No matches found for {name}")
            unmatches.append(name)
        else:
            # stash top result dict with original query/name string
            # keys: ['id', 'score', 'orcid', 'wikidata', 'fullname', 'fullname_reverse', 'thumbnail', 'lifespan', 'description']
            top_match = json_response[0]
            if top_match['score'] < cutoff_score:
                print(f"No matches > score {cutoff_score} for {name}. Best: {top_match['score']}: {top_match['fullname']}")
                unmatches.append(name)
            else:
                print(f"Match found for {name}: {top_match['fullname']}. Score: {top_match['score']}")
                matches.append({'original_name': name, 'bionomia_match': top_match})

    return matches, unmatches

In [7]:
"""
Stricter check for matches against Bionomia - won't return matches if the collection date is not within the 
lifespan of the person defined in wikidata/bionomia (e.g. year of birth)

"""
def search_bionomia_people_detail(names, year):
    details_base_url = 'https://api.bionomia.net/users/search?'
    
    for name in names:
        
        # throttle the connection - getting connection timeout errors so maybe this will help...
        time.sleep(1)
        
        query_params = {'q': name['original_name'], 
                        'date': year, 
                        'strict': 'true',
                        'limit': 1
                       }
        response = requests.get(details_base_url, params=query_params)
        response.raise_for_status()
        
        print(response.url)
        
        json_response = response.json()
        
        # Check we've returned results - no score/confidence cutoff available here though
        result_count = json_response['opensearch:totalResults']
        
        # unpack the first ['item'] in dataElement and handle no results scenario
        if result_count == 0:
            continue
        else:
            # store in the original dict to help compare results from the different approaches
            name['bionomia_detail_match'] = json_response['dataFeedElement'][0]['item']
            
    return names

____
### 1. Get GBIF botanist sample
____

In [8]:
# Set daterange of interest and taxon-id (easily grabbable from occ search GUI url)
gbif_collectors = get_gbif_recordedBy(start_year_range, end_year_range, gbif_taxon_id)

Other things that might be interesting:
* How much is recordedByID being used? What kind of IDs are in there?
* Distribution/frequency of each name variant within result set
* Are names consistent within datasets/institutions
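
The first two questions could be explored with a small tally over the raw occurrence records. This is only a sketch: `recorded_by_summary` is a hypothetical helper, the sample records are hand-made, and the name of the ID field in the GBIF response is an assumption, so it is passed in as a parameter and should be checked against a real record.

```python
from collections import Counter


def recorded_by_summary(records, id_field='recordedByID'):
    """Count each recordedBy variant and how many records carry an ID field.

    id_field is an assumption about the key used in the GBIF JSON response.
    """
    name_counts = Counter(r['recordedBy'] for r in records if 'recordedBy' in r)
    with_id = sum(1 for r in records if r.get(id_field))
    return name_counts, with_id


# Hand-made records for illustration only (not real GBIF output)
sample_records = [
    {'recordedBy': 'Axel Blytt'},
    {'recordedBy': 'Axel Blytt', 'recordedByID': 'https://www.wikidata.org/wiki/Q610981'},
    {'recordedBy': 'Blytt, Arnell'},
    {},  # a record with no collector at all
]
name_counts, with_id = recorded_by_summary(sample_records)
most_common = name_counts.most_common(1)
```

To use this, `get_gbif_recordedBy` would need to keep the full result dicts rather than only the set of names.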

_______

### 2. ID easier-to-resolve names
_______


In [9]:
good_names, initials = get_rid_of_gunk(gbif_collectors)

In [10]:
good_names.sort()

# Peek at the top 10 names
good_names[:10]

['Alexander Karl (Carl) Heinrich Braun',
 'Axel Blytt',
 'Axel Blytt, Arnell',
 'Axel Tullberg',
 'Blytt, Arnell',
 'Bror Tydell',
 'Carl Fredrik Otto Nordstedt',
 'Carl Fredrik Otto Norstedt',
 'Collector unknown',
 'Conr. Indebetou']

In [11]:
f"Full name count: {len(good_names)}, initials count: {len(initials)}"

'Full name count: 37, initials count: 99'

_To-do : add summary chart of count of fuller names vs less full?_

_______

### 3. Try to resolve against Bionomia
_______

Using two endpoints: the autocomplete widget endpoint and the JSON-LD people search endpoint  
Docs: https://bionomia.net/developers

##### Notes and other interesting stuff

1. Does the order of words in a name matter? i.e., is there any difference between the `forename, surname` pattern and `surname, forename`?  
    * Doesn't look like it. e.g., the two calls below bring back the same match with an identical score (43.65668):   
        `https://api.bionomia.net/user.json?q=Nilsson+Alb.&limit=1`  
        `https://api.bionomia.net/user.json?q=Alb.+Nilsson&limit=1`  
        
            
2. Does having a title/indicator of marital status make any difference to number of matches/scores?
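
One way to test question 2 would be to query each titled name twice, once as found and once with the title removed, and compare the scores. The sketch below only builds the two query URLs for the autocomplete endpoint used above; `strip_title`, `autocomplete_url` and `title_variants` are hypothetical helpers, and the title pattern only handles 'Mrs', 'Miss' and 'Ms'.

```python
import re
import urllib.parse

AUTOCOMPLETE_URL = 'https://api.bionomia.net/user.json?q='
# A title, an optional full stop, and any whitespace that follows it
TITLE = re.compile(r'\b(?:Mrs|Miss|Ms)\b\.?\s*')


def strip_title(name):
    """Remove a Mrs/Miss/Ms title from a name string."""
    return TITLE.sub('', name).strip()


def autocomplete_url(name):
    """Build the same single-result autocomplete query used in this notebook."""
    return f"{AUTOCOMPLETE_URL}{urllib.parse.quote_plus(name)}&limit=1"


def title_variants(name):
    """Map each distinct variant (with and without title) to its query URL."""
    variants = dict.fromkeys([name, strip_title(name)])  # de-duplicates, keeps order
    return {variant: autocomplete_url(variant) for variant in variants}


# 'Miss Smith' is a made-up name for illustration
variants = title_variants('Miss Smith')
```

Names without a title produce a single variant, so they cost only one request.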


In [12]:
# pass in list of names (cutoff_score defaults to 50)
matches, unmatches = search_bionomia_people_auto(good_names)

# See if there's much diff between confidence/match level for full vs initial-based names (manually spoof names for now)

Match found for Alexander Karl (Carl) Heinrich Braun: Alexander Braun. Score: 67.03193
Match found for Axel Blytt: Axel Gudbrand Blytt. Score: 56.20684
No matches > score 50 for Axel Blytt, Arnell. Best: 48.653996: J. Hal Arnell
Match found for Axel Tullberg: Sven Axel Tullberg. Score: 61.397266
No matches > score 50 for Blytt, Arnell. Best: 48.653996: J. Hal Arnell
No matches > score 50 for Bror Tydell. Best: 25.896307: Bror Hamfelt
Match found for Carl Fredrik Otto Nordstedt: Carl Fredrik Otto Nordstedt. Score: 75.72169
No matches > score 50 for Carl Fredrik Otto Norstedt. Best: 33.059036: Carl Friedrich Eduard Otto
No matches > score 50 for Collector unknown. Best: 15.09475: Walter Scott
Match found for Conr. Indebetou: Carl Gustaf Indebetou. Score: 50.336353
Match found for Conrad Indebetou: Carl Gustaf Indebetou. Score: 50.336353
Match found for Edouard Rostan: Edouard Rostan. Score: 66.8318
No matches > score 50 for Flodén, Alexis. Best: 23.354412: Alexis Chassang
Match found for

In [13]:
matches[0:20]

[{'original_name': 'Alexander Karl (Carl) Heinrich Braun',
  'bionomia_match': {'id': 21926,
   'score': 67.03193,
   'orcid': None,
   'wikidata': 'Q62855',
   'fullname': 'Alexander Braun',
   'fullname_reverse': 'Braun, Alexander',
   'thumbnail': 'https://abekpgaoen.cloudimg.io/crop/24x24/n/https://commons.wikimedia.org/wiki/Special:FilePath/Alexander%20Braun%20(1805-1877).png',
   'lifespan': '&#42; May 10, 1805 &ndash; March 29, 1877 &dagger;',
   'description': 'German botanist and university teacher (1805-1877)'}},
 {'original_name': 'Axel Blytt',
  'bionomia_match': {'id': 10622,
   'score': 56.20684,
   'orcid': None,
   'wikidata': 'Q610981',
   'fullname': 'Axel Gudbrand Blytt',
   'fullname_reverse': 'Blytt, Axel Gudbrand',
   'thumbnail': 'https://abekpgaoen.cloudimg.io/crop/24x24/n/https://commons.wikimedia.org/wiki/Special:FilePath/Axel%20blytt.jpg',
   'lifespan': '&#42; May 19, 1843 &ndash; July 18, 1898 &dagger;',
   'description': 'Norwegian botanist and geologist (

In [14]:
unmatches[:10]

['Axel Blytt, Arnell',
 'Blytt, Arnell',
 'Bror Tydell',
 'Carl Fredrik Otto Norstedt',
 'Collector unknown',
 'Flodén, Alexis',
 'Herb Suringar WFR',
 'Jair Törnblad',
 'Johan Elias Strömberg',
 'Lundquist, Per Fredrik Alexander']

In [15]:
f"Out of {len(good_names)} non-initialised names, {len(matches)} were matched against the basic Bionomia endpoint (precision/accuracy tbc!)"

'Out of 37 non-initialised names, 21 were matched against the basic Bionomia endpoint (precision/accuracy tbc!)'

In [16]:
# Quick eyeball to see how many bad matches the first pass against Bionomia returned
# Will using year of collection + json-ld endpoint remove/improve these? Looking for improved accuracy, not precision
for match in matches:
    print(f"{match['original_name']} -> {match['bionomia_match']['fullname']} ({match['bionomia_match']['wikidata']})")

Alexander Karl (Carl) Heinrich Braun -> Alexander Braun (Q62855)
Axel Blytt -> Axel Gudbrand Blytt (Q610981)
Axel Tullberg -> Sven Axel Tullberg (Q6335227)
Carl Fredrik Otto Nordstedt -> Carl Fredrik Otto Nordstedt (Q6015981)
Conr. Indebetou -> Carl Gustaf Indebetou (Q5823030)
Conrad Indebetou -> Carl Gustaf Indebetou (Q5823030)
Edouard Rostan -> Edouard Rostan (Q21607448)
Frederic Stratton -> Frederic Stratton (Q21609937)
Frederick Arnold Lees -> Frederick Arnold Lees (Q21518563)
Gust. Bernoulli -> Carl Gustav Bernoulli (Q121991)
Hampus Wilhelm Arnell -> Hampus Wilhelm Arnell (Q5559534)
Herb Weber-van Bosse -> Anna Weber-van Bosse (Q1940785)
James Groves -> James Groves (Q21512721)
Jean Armand Isidore Pancher -> Jean Armand Isidore Pancher (Q3170415)
John Thomas Irvine Boswell Syme -> John Thomas Irvine Boswell Syme (Q5933649)
Lars Johan Wahlstedt -> Lars Johan Wahlstedt (Q16650555)
Mag. Östman -> Magnus Östman (Q21522469)
Nordstedt CFO -> Carl Fredrik Otto Nordstedt (Q6015981)
Otto N

In [17]:
strict_matches = search_bionomia_people_detail(matches, start_year_range)

https://api.bionomia.net/users/search?q=Alexander+Karl+%28Carl%29+Heinrich+Braun&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Axel+Blytt&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Axel+Tullberg&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Carl+Fredrik+Otto+Nordstedt&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Conr.+Indebetou&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Conrad+Indebetou&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Edouard+Rostan&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Frederic+Stratton&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Frederick+Arnold+Lees&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Gust.+Bernoulli&date=1870&strict=true&limit=1
https://api.bionomia.net/users/search?q=Hampus+Wilhelm+Arnell&date=1870&strict=true&limit=1
https://ap

In [18]:
for i in strict_matches:
    if 'bionomia_detail_match' in i:
        strict_fullname = i['bionomia_detail_match']['name']
    else:
        strict_fullname = 'Not found'
        
    print(f"{i['original_name']} -> {i['bionomia_match']['fullname']} ({i['bionomia_match']['score']}) -> {strict_fullname}")

Alexander Karl (Carl) Heinrich Braun -> Alexander Braun (67.03193) -> Alexander Braun
Axel Blytt -> Axel Gudbrand Blytt (56.20684) -> Axel Gudbrand Blytt
Axel Tullberg -> Sven Axel Tullberg (61.397266) -> Sven Axel Tullberg
Carl Fredrik Otto Nordstedt -> Carl Fredrik Otto Nordstedt (75.72169) -> Carl Fredrik Otto Nordstedt
Conr. Indebetou -> Carl Gustaf Indebetou (50.336353) -> Carl Gustaf Indebetou
Conrad Indebetou -> Carl Gustaf Indebetou (50.336353) -> Carl Gustaf Indebetou
Edouard Rostan -> Edouard Rostan (66.8318) -> Edouard Rostan
Frederic Stratton -> Frederic Stratton (60.23281) -> Frederic Stratton
Frederick Arnold Lees -> Frederick Arnold Lees (61.451565) -> Frederick Arnold Lees
Gust. Bernoulli -> Carl Gustav Bernoulli (52.890484) -> Carl Gustav Bernoulli
Hampus Wilhelm Arnell -> Hampus Wilhelm Arnell (69.93576) -> Hampus Wilhelm Arnell
Herb Weber-van Bosse -> Anna Weber-van Bosse (59.31223) -> Anna Weber-van Bosse
James Groves -> James Groves (56.52253) -> James Groves
Jean 

In [19]:
print(strict_matches[4]['bionomia_detail_match'])

{'@type': 'Person', '@id': 'https://bionomia.net/Q5823030', 'name': 'Carl Gustaf Indebetou', 'givenName': 'Carl Gustaf', 'familyName': 'Indebetou', 'alternateName': ['Indebetou, Carl Gustaf'], 'knowsAbout': [{'@type': 'ItemList', 'name': 'families_identified', 'itemListElement': [{'@type': 'ListItem', 'name': 'Poaceae'}]}, {'@type': 'ItemList', 'name': 'families_collected', 'itemListElement': [{'@type': 'ListItem', 'name': 'Poaceae'}]}], 'sameAs': 'http://www.wikidata.org/entity/Q5823030', 'birthDate': '1801-01-16', 'deathDate': '1893-03-06'}
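
Since the JSON-LD record carries `birthDate` and `deathDate`, the lifespan check can also be repeated locally, for example to sanity-check results from the autocomplete endpoint. A minimal sketch, assuming both dates start with a four-digit year; `year_within_lifespan` is a hypothetical helper and the dict is a trimmed copy of the output above.

```python
def year_within_lifespan(person, year):
    """Return True/False if both dates exist, or None when either is missing."""
    birth, death = person.get('birthDate'), person.get('deathDate')
    if not birth or not death:
        return None
    # Only the year is compared, so partial dates such as '1801' would also work
    return int(birth[:4]) <= year <= int(death[:4])


# Trimmed copy of the record printed above
indebetou = {'name': 'Carl Gustaf Indebetou', 'birthDate': '1801-01-16', 'deathDate': '1893-03-06'}

alive_in_1870 = year_within_lifespan(indebetou, 1870)
alive_in_1790 = year_within_lifespan(indebetou, 1790)
no_dates = year_within_lifespan({'name': 'Unknown'}, 1870)
```

A tighter check might also require the collector to be older than some minimum age in the collection year.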


Next steps:
- run queries that found a match + year of interest against the JSON-LD bionomia endpoint
- see if abbreviated names work against the JSON-LD endpoint, providing there's a year
- Stash any good matches for now
- for full names that don't match: ID women and then try to run them against wikidata
- If there is a match in wikidata: figure out the best info etc to dump out to encourage folks to make profiles on bionomia
- No match on wikidata or bionomia: clustering names/families of research? Look in BHL?
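
For the "stash any good matches" step, one option is to keep only the names where both endpoints point at the same Wikidata item. The sketch below assumes the dict layout produced by `search_bionomia_people_auto` and `search_bionomia_people_detail` above, and that `sameAs` is a single URL string as in the printed example; `confirmed_qids` is a hypothetical helper.

```python
def confirmed_qids(matches):
    """Split matches into QIDs agreed by both endpoints and names needing review."""
    confirmed, needs_review = {}, []
    for match in matches:
        auto_qid = match['bionomia_match'].get('wikidata')
        detail = match.get('bionomia_detail_match', {})
        # sameAs looks like 'http://www.wikidata.org/entity/Q5823030'
        detail_qid = (detail.get('sameAs') or '').rsplit('/', 1)[-1]
        if auto_qid and auto_qid == detail_qid:
            confirmed[match['original_name']] = auto_qid
        else:
            needs_review.append(match['original_name'])
    return confirmed, needs_review


# First entry is trimmed from the output above; the second is made up
sample_matches = [
    {'original_name': 'Conr. Indebetou',
     'bionomia_match': {'wikidata': 'Q5823030'},
     'bionomia_detail_match': {'sameAs': 'http://www.wikidata.org/entity/Q5823030'}},
    {'original_name': 'Example Name',
     'bionomia_match': {'wikidata': None}},
]
confirmed, needs_review = confirmed_qids(sample_matches)
```

The confirmed QIDs can then feed the gender-label lookup, while the review list goes on to the Wikidata and BHL steps.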