## BiCIKL-Hackathon Topic 9: Hidden Women in Science

____
### Workplan
____

1. Get a list of botanists from GBIF who are associated with records from around 1850/1880 (test subset)
2. Try to pick out full names, both to make resolution easier and to identify likely not-men. What's the % of each in the data?
3. Try to resolve names against Bionomia: what fails, and why?
4. For names that aren't found in Bionomia, try to resolve against Wikidata
5. Use any Wikidata QIDs resulting from 3. and 4. to get a gender label
6. For names that aren't found in either source, try a few broad-brush approaches to identifying which ones might be women (could look for titles, e.g. 'Ms', 'Miss', 'Mrs', or use a gender-spotting API etc.)
7. For records that might be women, run against GBIF occurrence ID again to get an idea of years of activity (slightly nonsensical in the test dataset, admittedly), major taxonomic groups of study, and institutions of deposition. A useful output would be a set of unknown women with enough hooks about their activities to encourage human research
8. Could also run names against BHL OCR text search to pull together a list of possible references to augment 7.
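
Step 5 is not implemented in this notebook yet. As a rough sketch only, one way to read a gender label from a Wikidata QID is to fetch the entity's claims through the `wbgetentities` action and look at property `P21` (sex or gender). The function names `gender_from_claims` and `fetch_gender` are hypothetical, the User-Agent string is a placeholder, and only the two most common values (`Q6581072` female, `Q6581097` male) are mapped to labels; anything else is returned as a bare QID.

```python
SEX_OR_GENDER = "P21"  # Wikidata property: sex or gender
GENDER_LABELS = {"Q6581072": "female", "Q6581097": "male"}


def gender_from_claims(claims):
    """Pull gender labels out of the 'claims' dict of a Wikidata entity."""
    labels = []
    for statement in claims.get(SEX_OR_GENDER, []):
        # 'novalue'/'somevalue' snaks carry no datavalue, hence the defensive gets
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        qid = value.get("id")
        if qid:
            labels.append(GENDER_LABELS.get(qid, qid))
    return labels


def fetch_gender(qid):
    """Fetch one entity from Wikidata and return its gender labels (network call)."""
    import requests  # imported here so the parsing helper above works offline

    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"},
        headers={"User-Agent": "hidden-women-notebook-sketch"},  # placeholder value
        timeout=30,
    )
    response.raise_for_status()
    return gender_from_claims(response.json()["entities"][qid]["claims"])
```

The Bionomia autocomplete results already carry a `wikidata` key, so QIDs from step 3 could be passed straight to `fetch_gender`.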

[Topic overview](https://github.com/pensoft/BiCIKL/tree/main/Topic%209%20Hidden%20women%20in%20science)



___
### Prep
___

In [43]:
import requests
import itertools
import re
import json
import urllib.parse

In [59]:
start_year_range = 1870
end_year_range = 1900
gbif_taxon_id = 36

In [76]:
def get_gbif_recordedBy(start_year, end_year, taxon_key):
    """
    Retrieve a unique list of values in GBIF occurrence dwc.recordedBy fields for a given date range and taxon

    :param start_year: start year of query date range (YYYY)
    :param end_year: end year of query date range (YYYY)
    :param taxon_key: GBIF taxon key
    :return: Set of names (unique strings)
    """
    # Get the value of recordedBy for each record, where it exists
    collectors = set()
    
    for offset in itertools.count(step=300):
        r = requests.get(f"https://api.gbif.org/v1/occurrence/search?year={start_year},{end_year}&basisOfRecord=PRESERVED_SPECIMEN&taxonKey={taxon_key}&limit=300&offset={offset}").json()
        
        # Comment this out if you don't want to keep an eye on the progress of the queries
        print(f"offset: {offset}")
        
        # Get the info we're interested in
        # Could beef this up in the future to grab recordedBy ID, inst/dataset ID, taxa of interest, years of activity?
        for d in r['results']:
            if 'recordedBy' in d:
                recordedBy = d['recordedBy']
                # These delimiters seem to usually indicate > 1 name, so split and add back in
                collectors.update(split_multiple_names(recordedBy, r'&|\|| and |;'))

        if r['endOfRecords']:
            break

    return collectors


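def search_binomia_people_auto(names, start_year, end_year, cutoff_score=40):
    """
    Check each name for a Bionomia match using the autocomplete endpoint

    :param names: iterable of name strings to look up
    :param start_year: start year of date range (YYYY); not yet used in the query
    :param end_year: end year of date range (YYYY); not yet used in the query
    :param cutoff_score: minimum score for the top result to count as a match
    :return: List of matches, list of names with no match at or above the cutoff
    """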
    autocomplete_base_url =  'https://api.bionomia.net/user.json?q=' # Useful to get confidence scores of match
    details_base_url = 'https://api.bionomia.net/users/search?q=' # Holds more metadata, but no score
    
    matches = []
    unmatches = []
    
    # url encode each name string and get result + score from autocomplete_base_url
    for name in names:
        response = requests.get(f"{autocomplete_base_url}{urllib.parse.quote_plus(name)}&limit=1")
        response.raise_for_status()
                           
        # Un-matching queries return an empty list
        json_response = response.json()
        if not json_response:
            print(f"No matches found for {name}")
            unmatches.append(name)
        else:
            # stash top result dict with original query/name string
            # keys: ['id', 'score', 'orcid', 'wikidata', 'fullname', 'fullname_reverse', 'thumbnail', 'lifespan', 'description'] 
            top_match = json_response[0]
            if top_match['score'] < cutoff_score:
            #    print(f"No matches above score {cutoff_score} found for {name}... Best score is {top_match['score']}: {top_match['fullname']}")
                unmatches.append(name)
            else:
            #    print(f"Match found for {name}: {top_match['fullname']}. Score: {top_match['score']}")
                matches.append({'original_name': name, 'match': top_match})

                            
                            
    # Next steps for matches (might be better in its own function?)
    # if score is below confidence level/no match against query once year is included, tag w reason for fail 
    # + store in a separate structure for info. 

    return matches, unmatches

def split_multiple_names(input_names, delimiters):
    r"""
    Break up a string that includes common list delimiter characters

    :param input_names: a single string that might hold a delimited list of names
    :param delimiters: Regex alternation of split-on delimiters, as a single string. e.g., r'&|\|| and |;'
    :return: List of the names resulting from the split, stripped of surrounding whitespace
    """
    return [s.strip() for s in re.split(delimiters, input_names)]


def get_rid_of_gunk(input_names):
    """
    Identify strings which are more likely to contain a full given name vs. those that probably don't.
    Filters out single-word strings and strings that start or end with fewer than three letters
    (typically initials), unless the string also includes a title in (Mrs, Miss)

    :param input_names: iterable of names
    :return: List of full names, list of thinner/harder-to-resolve names
    """
    good_names = []
    initials = []
    
    for p_name in input_names: # there's probably a better way of combining all these regex eh sarah
        front_match = re.search(r'^[a-zA-Z]{3}', p_name)
        tail_match = re.search(r'[a-zA-Z]{3}$', p_name)
        multi_words = re.search(r'\s', p_name)
        
        # Only keep if there's a miss/mrs title, or if it doesn't start or end with an initial
        if (front_match and tail_match and multi_words) or 'Mrs' in p_name or 'Miss' in p_name:
            good_names.append(p_name)
        else:
            initials.append(p_name)
        
    return good_names, initials
    

____
### 1. Get GBIF botanist sample
____

In [60]:
# Set daterange of interest and taxon-id (easily grabbable from occ search GUI url)
gbif_collectors = get_gbif_recordedBy(start_year_range, end_year_range, gbif_taxon_id)

offset: 0
offset: 300
offset: 600
...
offset: 21900
[output truncated]

Other things that might be interesting:
* How much is recordedByID being used? What kind of IDs are in there?
* Distribution/frequency of each name variant within result set
* Are names consistent within datasets/institutions?
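
A minimal sketch of how the first two questions could be explored, assuming the raw occurrence dicts from the GBIF search are kept around (the current `get_gbif_recordedBy` only keeps the names). The helper name `summarise_collectors` is hypothetical, and the shape of `recordedByIDs` (a list of dicts with `type` and `value` keys) is an assumption about the API response that should be checked against real records.

```python
from collections import Counter


def summarise_collectors(results):
    """Count recordedBy variants and recordedByIDs usage in a list of occurrence dicts."""
    name_counts = Counter()     # frequency of each raw recordedBy string
    id_type_counts = Counter()  # frequency of each identifier type (assumed 'type' key)
    records_with_ids = 0

    for rec in results:
        if "recordedBy" in rec:
            name_counts[rec["recordedBy"]] += 1
        agent_ids = rec.get("recordedByIDs", [])
        if agent_ids:
            records_with_ids += 1
        for agent_id in agent_ids:
            id_type_counts[agent_id.get("type", "UNKNOWN")] += 1

    return name_counts, id_type_counts, records_with_ids


# Tiny made-up sample, just to show the shape of the output
sample = [
    {"recordedBy": "Alb. Nilsson"},
    {"recordedBy": "Alb. Nilsson", "recordedByIDs": [{"type": "WIKIDATA", "value": "placeholder"}]},
    {"recordedBy": "Miss Ryan"},
]
names, id_types, with_ids = summarise_collectors(sample)
print(names.most_common(2), dict(id_types), with_ids)
```

To use this on the real data, the `r['results']` pages would need to be accumulated inside the paging loop rather than discarded.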

_______

### 2. ID easier-to-resolve names
_______


In [25]:
good_names, initials = get_rid_of_gunk(gbif_collectors)

In [50]:
good_names.sort()

# Peek at the top 10 names
good_names[:10]

['Abeleven THAJ',
 'Alan Dersin',
 'Alb. Nilsson',
 'Albatross Expedition',
 'Albert Grunow',
 'Albert R. Sweetster',
 "Albertis (d'), Enrico Alberto",
 'Alberto Löfgren',
 'Alfred Rehder',
 'Allen Hiram Curtiss']

In [52]:
f"Full name count: {len(good_names)}, initials count: {len(initials)}"

'Full name count: 355, initials count: 1699'

_To-do: add a summary chart of the count of fuller names vs. less full ones?_
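
Until a proper chart exists, a plain-text summary may be enough to answer the "what's the % of each" question from the workplan. This is only a sketch; `name_quality_summary` and its `width` parameter are hypothetical, and the counts passed in are the ones printed above.

```python
def name_quality_summary(full_count, initials_count, width=40):
    """Return text rows with count, share and a crude bar for each group of names."""
    total = full_count + initials_count
    rows = []
    for label, count in (("fuller names", full_count), ("initials/thin", initials_count)):
        share = count / total if total else 0.0
        bar = "#" * round(share * width)
        rows.append(f"{label:<14}{count:>6} {share:6.1%} {bar}")
    return rows


for row in name_quality_summary(355, 1699):
    print(row)
```

In the notebook the call would be `name_quality_summary(len(good_names), len(initials))`.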

_______

### 3. Try to resolve against Bionomia
_______

Using two endpoints: the autocomplete widget endpoint and the JSON-LD people search endpoint  
Docs: https://bionomia.net/developers

##### Notes and other interesting stuff

1. Does the order of words in a name matter? i.e., is there any difference between the `forename, surname` pattern and `surname, forename`?  
    * Doesn't look like it. e.g., the two calls below bring back the same match with an identical score (43.65668):   
        `https://api.bionomia.net/user.json?q=Nilsson+Alb.&limit=1`  
        `https://api.bionomia.net/user.json?q=Alb.+Nilsson&limit=1`  
        
            
2. Does having a title/indicator of marital status make any difference to number of matches/scores?
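
Question 2 is still open. One possible way to test it is to look up each titled name twice, once as-is and once with the title stripped, and compare the scores. The sketch below is an assumption about how that could be set up: `split_title` and `compare_title_effect` are hypothetical helpers, and `score_lookup` stands in for any callable that returns a score for a query string (for example a thin wrapper around the autocomplete call used in `search_binomia_people_auto`).

```python
import re

# Matches Mrs / Miss / Ms anywhere in the string, with an optional trailing full stop
TITLE_RE = re.compile(r"\b(Mrs|Miss|Ms)\b\.?\s*")


def split_title(name):
    """Return (title or None, name with the first title removed)."""
    found = TITLE_RE.search(name)
    if not found:
        return None, name.strip()
    bare = TITLE_RE.sub("", name, count=1)
    return found.group(1), " ".join(bare.split())


def compare_title_effect(name, score_lookup):
    """Score the name with and without its title using the supplied lookup callable."""
    title, bare = split_title(name)
    return {
        "query": name,
        "title": title,
        "with_title": score_lookup(name),
        "without_title": score_lookup(bare) if title else None,
    }


print(split_title("Mrs. G.A. Hall"))
print(split_title("Alfred Rehder"))
```

Passing the lookup in as a callable keeps the comparison testable offline with a fake scorer before any API calls are made.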


In [77]:
# pass in list of names and date range
matches, unmatches = search_binomia_people_auto(good_names, start_year=start_year_range, end_year=end_year_range)

# See if there's much diff between confidence/match level for full vs initial-based names (manually spoof names for now)

No matches found for Davy, J.Burtt
No matches found for Exp. Vanadis
No matches found for Kroon JSA
No matches found for Museo Nacional


_______

#### Scratch
_______


In [28]:
# To-do: Test against Bionomia and Wikidata resolution endpoint...
# If we don't need to shuffle the names around to get a match, don't bother with this bit.
names_with_commas = [(x, len(x.split())) for x in good_names if ',' in x]

In [29]:
# From this batch, it looks like strings with more than 4 words hold more than one name?
joined_back_up = []

for cn in names_with_commas:
    if cn[1] < 5:
        # split and reverse it
        reversed_name = cn[0].split(', ')
        reversed_name.reverse()
        back_together = ' '.join(reversed_name) # this doesn't work all the time but literally nothing does sooo
        joined_back_up.append(back_together)
        print(back_together)
    else:
        # probably more than one name, so leave it as-is
        print(cn[0])

Holden Isaac
Setchell, W. A., C. P. Nott
George William Traill
Julio Augusto Henriques
Hugo Lojka
Enrico Alberto Albertis (d')
Lamy de la Chapelle, Pierre Marie Édouard
Cora E. Pease, Eloise Butler
Frank Shipley Collins
Mrs. E. Snyder
Setchell, W. A., I. Holden
Setchell, W. A., R. A. Lawson
William Nylander
Isaac Newton
W.A. Mrs Weymouth
Ernst Bernhard Almquist
Pease, C. E., E. Butler
Butler Eloise
Mrs Gale
Robert Collector: Wollny
Otto Nordstedt
Mrs. G.A. Hall
J.Burtt Davy
Isaac Holden
Miss Ryan
Paul Collector: Kuckuck
Henry Albert Green
R.A. Mrs Bastow
Setchell, W. A., A. A. Lawson
Mrs. Bainbridge
Setchell, W. A., R. E. Gibbs
Albert Des Méloizes
Miss Gale
Mrs A. Beal
Stephen Johnson
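
Some of the reversed strings above come out scrambled, e.g. `W.A. Mrs Weymouth` and `Robert Collector: Wollny`. Given how the loop works, these must have started as `Mrs Weymouth, W.A.` and `Collector: Wollny, Robert`. A slightly more careful reversal, sketched here under the assumption that only a `Collector:` label and the titles Mrs/Miss/Ms need special handling, could look like this (the helper name `tidy_reversed_name` is hypothetical):

```python
import re

TITLE_RE = re.compile(r"\b(Mrs|Miss|Ms)\b\.?\s*")
LABEL_RE = re.compile(r"^\s*Collector:\s*", re.IGNORECASE)


def tidy_reversed_name(raw):
    """Turn 'Surname, Given' into 'Given Surname', moving any title to the front."""
    name = LABEL_RE.sub("", raw)
    parts = [p.strip() for p in name.split(",")]
    if len(parts) != 2:
        # Not a simple 'surname, given' pair (maybe several people): leave it alone
        return name.strip()

    surname, given = parts
    title = ""
    found = TITLE_RE.search(surname) or TITLE_RE.search(given)
    if found:
        title = found.group(1)
        surname = TITLE_RE.sub("", surname).strip()
        given = TITLE_RE.sub("", given).strip()

    return " ".join(p for p in (title, given, surname) if p)


for raw in ["Mrs Weymouth, W.A.", "Collector: Wollny, Robert", "Setchell, W. A., C. P. Nott"]:
    print(tidy_reversed_name(raw))
```

Whether any reordering is needed at all depends on the word-order finding in the notes for section 3; if the order doesn't affect the match, only the label and title clean-up matter.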
