## BiCIKL-Hackathon Topic 9: Hidden Women in Science

____
### Workplan
____

1. ~Get a list of botanists from GBIF who are associated with a test subset of records (taxon, event year)~
2. ~Try to pick out full names, both to make resolution easier and to identify likely not-men. What's the frequency of each in the data?~
3. ~Try to resolve names against Bionomia - what fails and why~
4. For names that aren't found in Bionomia, try to resolve against Wikidata
5. Use any Wikidata QIDs resulting from 3. and 4. to get a gender label
6. For names that aren't found in either source: try a few broad-brush approaches to identifying which ones might be women (e.g. look for titles such as 'Ms', 'Miss', 'Mrs', or use a gender-guessing API)
7. For records that might be women, run against GBIF occurrence IDs again to get an idea of years of activity (slightly nonsensical in the test dataset, admittedly), major taxonomic groups of study, and institutions of deposition. A useful output would be a set of unknown women with enough hooks about their activities to encourage human research
8. Could also run names against BHL OCR text search to get a list of possible references together to augment 7.  
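
As a rough illustration of the title-spotting idea in step 6, here is a minimal sketch. The function name `flag_titled_names` and the exact title list are assumptions made for illustration. It only catches English-language titles, so it will miss most women in the data and should be treated as a first pass, not a classifier.

```python
import re

# Hypothetical title list: only the three titles named in the workplan.
# Case-sensitive on purpose, so that "MS" (e.g. manuscript) is not flagged.
TITLE_PATTERN = re.compile(r"\b(Mrs|Miss|Ms)\b")


def flag_titled_names(names):
    """Return the names that contain one of the titles as a whole word."""
    return [name for name in names if TITLE_PATTERN.search(name)]


if __name__ == "__main__":
    sample = ["Mrs E. Smith", "Adolf Roth", "Miss Jane Doe", "Missouri Botanical Garden"]
    print(flag_titled_names(sample))
```

The word boundaries keep strings like "Missouri" from being flagged because they happen to start with "Miss".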

[Topic overview](https://github.com/pensoft/BiCIKL/tree/main/Topic%209%20Hidden%20women%20in%20science)



___
### Imports and params
___

In [18]:
import requests
import itertools
import re
import json
import urllib.parse

In [19]:
start_year_range = 1870
end_year_range = 1870
gbif_taxon_id = 6

___
### Functions
___

In [20]:
"""
Retrieve a unique list of values in GBIF occurrence dwc.recordedBy fields for a given date range and taxon  

:param start_year: start year of query date range (YYYY)
:param end_year: end year of query date range (YYYY)
:param taxon_key: GBIF taxon key
:return: Set of names (unique strings)
"""
def get_gbif_recordedBy(start_year, end_year, taxon_key):
    # Get the value of recordedBy for each record, where it exists
    collectors = set()
    
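    for offset in itertools.count(step=300):
        # Query parameters are passed via `params` so requests handles the URL encoding
        response = requests.get(
            "https://api.gbif.org/v1/occurrence/search",
            params={
                "year": f"{start_year},{end_year}",
                "basisOfRecord": "PRESERVED_SPECIMEN",
                "taxonKey": taxon_key,
                "limit": 300,
                "offset": offset,
            },
        )
        response.raise_for_status()
        r = response.json()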
        
        # Comment this out if you don't want to keep an eye on the progress of the queries
        # print(f"offset: {offset}")
        
        # Get the info we're interested in
        # Could beef this up in the future to grab recordedBy ID, inst/dataset ID, taxa of interest, years of activity?
        for d in r['results']:
            if 'recordedBy' in d:
                recordedBy = d['recordedBy']
                # These delimiters seem to usually indicate > 1 name, so split and add back in
                collectors.update(split_multiple_names(recordedBy, r'&|\|| and |;'))
            else:
                continue

        if r['endOfRecords']:
            break

    return collectors

In [21]:
"""
Identify strings which are more likely to contain a full given name vs. those that probably don't. 
Filters out single-word strings, leading or trailing single-character initials, unless the string also includes
a title in (Mrs, Miss)

:param input_names: iterable of names
:return: List of full names, list of thinner/harder-to-resolve names
"""
def get_rid_of_gunk(input_names):
    good_names = []
    initials = []
    
    for p_name in input_names: # there's probably a better way of combining all these regex eh sarah
        front_match = re.search(r'^[a-zA-Z]{3}', p_name)
        tail_match = re.search(r'[a-zA-Z]{3}$', p_name)
        multi_words = re.search(r'\s', p_name)
        
        # Only keep if there's a miss/mrs title, or if it doesn't start or end with an initial
        if (front_match and tail_match and multi_words) or 'Mrs' in p_name or 'Miss' in p_name:
            good_names.append(p_name)
        else:
            initials.append(p_name)
        
    return good_names, initials

In [22]:
"""
Split a single string on common list-delimiter characters

:param input_names: a recordedBy string that might contain more than one name
:param delimiters: regex alternation of delimiters to split on, e.g., '&|[|]| and |;'
:return: List of the resulting name strings, stripped of surrounding whitespace
"""
def split_multiple_names(input_names, delimiters):
    
    return [s.strip() for s in re.split(delimiters, input_names)]

In [23]:
"""
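Check for Bionomia matches using the autocomplete endpoint

:param names: iterable of name strings to look up
:param start_year: start year of the date range of interest (YYYY); not yet used in the query
:param end_year: end year of the date range of interest (YYYY); not yet used in the query
:param cutoff_score: minimum score the top result needs to count as a match
:return: List of match dicts (original name + top Bionomia result), list of unmatched names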
"""
def search_binomia_people_auto(names, start_year, end_year, cutoff_score=40):
    autocomplete_base_url =  'https://api.bionomia.net/user.json?q=' # Useful to get confidence scores of match
    details_base_url = 'https://api.bionomia.net/users/search?q=' # Holds more metadata, but no score
    
    matches = []
    unmatches = []
    
    # url encode each name string and get result + score from autocomplete_base_url
    for name in names:
        response = requests.get(f"{autocomplete_base_url}{urllib.parse.quote_plus(name)}&limit=1")
        response.raise_for_status()
                           
        # Un-matching queries return an empty list
        json_response = response.json()
        if not json_response:
            # print(f"No matches found for {name}")
            unmatches.append(name)
        else:
            # stash top result dict with original query/name string
            # keys: ['id', 'score', 'orcid', 'wikidata', 'fullname', 'fullname_reverse', 'thumbnail', 'lifespan', 'description']
            top_match = json_response[0]
            if top_match['score'] < cutoff_score:
                # print(f"No matches > score {cutoff_score} for {name}. Best: {top_match['score']}: {top_match['fullname']}")
                unmatches.append(name)
            else:
                # print(f"Match found for {name}: {top_match['fullname']}. Score: {top_match['score']}")
                matches.append({'original_name': name, 'binomia_match': top_match})

    return matches, unmatches

____
### 1. Get GBIF botanist sample
____

In [24]:
# Set daterange of interest and taxon-id (easily grabbable from occ search GUI url)
gbif_collectors = get_gbif_recordedBy(start_year_range, end_year_range, gbif_taxon_id)

Other things that might be interesting:
* How much is recordedByID being used? What kind of IDs are in there?
* Distribution/frequency of each name variant within result set
* Are names consistent within datasets/institutions
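
For the name-variant frequency question, one possible approach is sketched below. `get_gbif_recordedBy` returns a set, so counts are lost; this sketch assumes you have kept the raw occurrence dicts (the items in `r['results']`) in a list instead. The function name `count_name_variants` is hypothetical.

```python
import re
from collections import Counter


def count_name_variants(records, delimiters=r"&|\|| and |;"):
    """Count how often each collector name string appears across occurrence records.

    `records` is assumed to be a list of GBIF occurrence dicts; records
    without a recordedBy value are skipped.
    """
    counts = Counter()
    for record in records:
        recorded_by = record.get("recordedBy")
        if not recorded_by:
            continue
        # Same splitting logic as split_multiple_names, dropping empty fragments
        for name in re.split(delimiters, recorded_by):
            name = name.strip()
            if name:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"recordedBy": "A. Smith & B. Jones"},
        {"recordedBy": "A. Smith"},
        {"basisOfRecord": "PRESERVED_SPECIMEN"},  # no recordedBy
    ]
    print(count_name_variants(sample).most_common(5))
```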

_______

### 2. ID easier-to-resolve names
_______


In [25]:
good_names, initials = get_rid_of_gunk(gbif_collectors)

In [26]:
good_names.sort()

# Peek at the top 10 names
good_names[:10]

['ATP in H. Vandenbroeck',
 'Abbott, James',
 'Abbé M. Gandoger',
 'Abeleven THAJ',
 'Abrahamson, Knut',
 'Adalbert Geheeb',
 'Addison, Rev Frederick',
 'Adolf Grape',
 'Adolf Roth',
 'Adolphus Pansch']

In [27]:
f"Full name count: {len(good_names)}, initials count: {len(initials)}"

'Full name count: 1283, initials count: 5095'

_To-do : add summary chart of count of fuller names vs less full?_

_______

### 3. Try to resolve against Bionomia
_______

Using two endpoints: autocomplete widget and JSON-LD search for people endpoint  
Docs: https://bionomia.net/developers

##### Notes and other interesting stuff

1. Does the order of words in a name matter, i.e. is there any difference between the `forename surname` and `surname forename` patterns?  
    * Doesn't look like it. e.g., the two calls below bring back the same match with an identical score (43.65668):   
        `https://api.bionomia.net/user.json?q=Nilsson+Alb.&limit=1`  
        `https://api.bionomia.net/user.json?q=Alb.+Nilsson&limit=1`  
        
            
2. Does having a title/indicator of marital status make any difference to number of matches/scores?
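
Both questions above could be checked systematically with something like the sketch below, which fetches the top autocomplete hit for several variants of the same name and returns the scores side by side. It uses the same `user.json` endpoint as the rest of this notebook; the helper names (`build_query_url`, `fetch_json`, `compare_variants`) are hypothetical, and the `fetch` argument exists only so the comparison can be tried without network access.

```python
import json
import urllib.parse
import urllib.request

AUTOCOMPLETE_URL = "https://api.bionomia.net/user.json"  # endpoint used elsewhere in this notebook


def build_query_url(name, limit=1):
    """Build the autocomplete URL for one name string."""
    return f"{AUTOCOMPLETE_URL}?q={urllib.parse.quote_plus(name)}&limit={limit}"


def fetch_json(url):
    """Default fetcher: HTTP GET parsed as JSON (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def compare_variants(variants, fetch=fetch_json):
    """Map each variant to (score, fullname) of its top hit, or None if there is no hit."""
    results = {}
    for variant in variants:
        hits = fetch(build_query_url(variant))
        results[variant] = (hits[0]["score"], hits[0]["fullname"]) if hits else None
    return results
```

For example, `compare_variants(["Nilsson Alb.", "Alb. Nilsson"])` would cover question 1, and passing the same name with and without a leading "Mrs" would cover question 2.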


In [28]:
# pass in list of names and date range
matches, unmatches = search_binomia_people_auto(good_names, start_year=start_year_range, end_year=end_year_range)

# See if there's much diff between confidence/match level for full vs initial-based names (manually spoof names for now)

In [29]:
matches[0]

{'original_name': 'Abbé M. Gandoger',
 'binomia_match': {'id': 11855,
  'score': 52.513565,
  'orcid': None,
  'wikidata': 'Q2601687',
  'fullname': 'Michel Gandoger',
  'fullname_reverse': 'Gandoger, Michel',
  'thumbnail': 'https://bionomia.net/images/photo24X24.png',
  'lifespan': '&#42; May 10, 1850 &ndash; October  4, 1926 &dagger;',
  'description': 'French botanist and mycologist (1850-1926)'}}
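
The `lifespan` value in the match above is an HTML-entity-encoded string, so it needs decoding before it can be compared with the year range of interest. Below is a minimal sketch of one way to do that. It assumes the birth and death dates are always separated by an en dash, as in this one example; the function names and the `min_age` cutoff are assumptions made for illustration.

```python
import html
import re


def lifespan_years(lifespan):
    """Extract (birth_year, death_year) from a Bionomia lifespan string; either may be None."""
    text = html.unescape(lifespan or "")
    # Assumes "birth date <en dash> death date", as in the example above
    birth_part, _, death_part = text.partition("\u2013")
    birth = re.search(r"\d{4}", birth_part)
    death = re.search(r"\d{4}", death_part)
    return (
        int(birth.group()) if birth else None,
        int(death.group()) if death else None,
    )


def plausibly_active(lifespan, year, min_age=10):
    """True unless the known lifespan rules out collecting in `year`."""
    birth, death = lifespan_years(lifespan)
    if birth is not None and year < birth + min_age:
        return False
    if death is not None and year > death:
        return False
    return True


if __name__ == "__main__":
    example = "&#42; May 10, 1850 &ndash; October  4, 1926 &dagger;"
    print(lifespan_years(example), plausibly_active(example, 1870))
```

A check like this could help weed out high-scoring matches for people who were not alive in the queried year.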

Next steps:
- run queries that found a match + year of interest against the JSON-LD bionomia endpoint
- see if abbreviated names work against the JSON-LD endpoint, providing there's a year
- Stash any good matches for now
- For full names that don't match: ID women and then try to run them against Wikidata
- If there is a match in Wikidata: figure out the best info to dump out to encourage folks to make profiles on Bionomia
- No match on Wikidata or Bionomia: clustering names/families of research? Look in BHL?
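
For matches that already carry a `wikidata` QID (like `Q2601687` above), the gender statement could be read straight from the Wikidata entity JSON. The sketch below is one possible way to do this and is not part of the notebook's current code. It assumes the `Special:EntityData` JSON export and the "sex or gender" property (P21); the function names, the User-Agent string and the two-entry label table are illustrative only.

```python
import json
import urllib.request

ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
# Minimal label lookup for illustration; Wikidata has other P21 values too
GENDER_LABELS = {"Q6581097": "male", "Q6581072": "female"}


def fetch_entity(qid):
    """Download one Wikidata entity as a dict (needs network access)."""
    request = urllib.request.Request(
        ENTITY_URL.format(qid=qid),
        headers={"User-Agent": "hidden-women-notebook-sketch/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        # Take the first entity returned, in case the QID was redirected
        return next(iter(json.load(resp)["entities"].values()))


def gender_qids(entity):
    """Return the QIDs of all 'sex or gender' (P21) values on an entity dict."""
    qids = []
    for claim in entity.get("claims", {}).get("P21", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if "id" in value:
            qids.append(value["id"])
    return qids


def gender_labels(entity):
    """Translate P21 QIDs to labels, falling back to the raw QID when unknown."""
    return [GENDER_LABELS.get(qid, qid) for qid in gender_qids(entity)]
```

Usage would look like `gender_labels(fetch_entity(match['binomia_match']['wikidata']))` for each entry in `matches` that has a non-empty `wikidata` value.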