## BiCIKL-Hackathon Topic 9: Hidden Women in Science

____
### Workplan
____

1. ~Get a list of botanists off GBIF that are affiliated with records from around a test subset of records (taxon, eveny year)~
2. ~Try to pick out full names, both to make resolution easier and to identify likely not-men - whats the frequency of each in the data?~
3. ~Try to resolve names against Bionomia - what fails and why~
4. For names that aren't found in Binomia, try and resolve against wikidata
5. Use any wikidata QIDs resulting from 3. and 4. to get gender label
6. For names that aren't found in either source: try a few broad-brush approaches to identifying which ones might be women (could look for titles eg 'Ms Miss Mrs' or use gender-spotting API etc)
7. For recs that might be women, run against GBIF occ ID again to get an idea of years of activity (slightly nonsensical in the test dataset, admittedly), major taxonomic groups of study, institutions of deposition? Useful output would be a set of unk women with enough hooks about their activities to encourage human research
8. Could also run names against BHL OCR text search to get a list of possible references together to augment 7.  

[Topic overview](https://github.com/pensoft/BiCIKL/tree/main/Topic%209%20Hidden%20women%20in%20science)



___
### Imports and params
___

In [1]:
import requests
import itertools
import re
import json
import urllib.parse

In [2]:
start_year_range = 1870
end_year_range = 1870
gbif_taxon_id = 6

___
### Functions
___

In [3]:
"""
Retrieve a unique list of values in GBIF occurrence dwc.recordedBy fields for a given date range and taxon  

:param start_year: start year of query date range (YYYY)
:param end_year: end year of query date range (YYYY)
:param taxon_key: GBIF taxon key
:return: Set of names (unique strings)
"""
def get_gbif_recordedBy(start_year, end_year, taxon_key):
    # Get the value of recordedBy for each record, where it exists
    collectors = set()
    
    for offset in itertools.count(step=300):
        r = requests.get(f"""https://api.gbif.org/v1/occurrence/search?year={start_year},{end_year}
        &basisOfRecord=PRESERVED_SPECIMEN&taxon_key={gbif_taxon_id}&limit=300&offset={offset}""").json()
        
        # Comment this out if you don't want to keep an eye on the progress of the queries
        # print(f"offset: {offset}")
        
        # Get the info we're interested in
        # Could beef this up in the future to grab recordedBy ID, inst/dataset ID, taxa of interest, years of activity?
        for d in r['results']:
            if 'recordedBy' in d:
                recordedBy = d['recordedBy']
                # These delimiters seem to usually indicate > 1 name, so split and add back in
                collectors.update(split_multiple_names(recordedBy, '&|\|| and |;'))  
            else:
                continue

        if r['endOfRecords']:
            break

    return collectors

In [4]:
"""
Identify strings which are more likely to contain a full given name vs. those that probably don't. 
Filters out single-word strings, leading or trailing single-character initials, unless the string also includes
a title in (Mrs, Miss)

:param input_names: iterable of names
:return: List of full names, list of thinner/harder-to-resolve names
"""
def get_rid_of_gunk(input_names):
    good_names = []
    initials = []
    
    for p_name in input_names: # there's probably a better way of combining all these regex eh sarah
        front_match = re.search('^[a-zA-Z]{3}', p_name)
        tail_match = re.search('[a-zA-Z]{3}$', p_name)
        multi_words = re.search('\s', p_name)
        
        # Only keep if there's a miss/mrs title, or if it doesn't start or end with an initial
        if (front_match and tail_match and multi_words) or 'Mrs' in p_name or 'Miss' in p_name:
            good_names.append(p_name)
        else:
            initials.append(p_name)
        
    return good_names, initials

In [5]:
"""
Break up strings that include common list delimiter characters 

:param input_names: iterable of names that might be a stringified list
:param delimiters: Pipe delimited, single-string list of split-on characters. e.g., '&|\|| and |;'
:return: Input list with additional items resulting from split appended
"""
def split_multiple_names(input_names, delimiters):
    
    return [s.strip() for s in re.split(delimiters, input_names)]

In [37]:
"""
Broad-brush check for matches against Bionomia

"""
def search_bionomia_people_auto(names, cutoff_score=40):
    autocomplete_base_url =  'https://api.bionomia.net/user.json?q=' # Useful to get confidence scores of match
    
    matches = []
    unmatches = []
    
    # url encode each name string and get result + score from autocomplete_base_url
    for name in names:
        response = requests.get(f"{autocomplete_base_url}{urllib.parse.quote_plus(name)}&limit=1")
        response.raise_for_status()
                           
        # Un-matching queries return an empty list
        if len(response.text) == 2:
            print(f"No matches found for {name}")
            unmatches.append(name)
        else:
        # stash top result dict with original query/name string
        # keys: ['id', 'score', 'orcid', 'wikidata', 'fullname', 'fullname_reverse', 'thumbnail', 'lifespan', 'description'] 
            json_response = response.json()
            top_match = json_response[0]
            if top_match['score'] < cutoff_score:
                print(f"No matches > score {cutoff_score} for {name}. Best: {top_match['score']}: {top_match['fullname']}")
                unmatches.append(name)
            else:
                print(f"Match found for {name}: {top_match['fullname']}. Score: {top_match['score']}")
                matches.append({'original_name': name, 'binomia_match': json_response[0]})

    return matches, unmatches

In [40]:
"""
Stricter check for matches against Bionomia - won't return matches if the collection data is not within the 
lifespan of the person defined in wikidata/bionomia (ex. year of birth)

"""
def search_bionomia_people_detail(names, year):
    details_base_url = 'https://api.bionomia.net/users/search?'
    
    for name in names:
        query_params = {'q': urllib.parse.quote_plus(name), 
                        'date': year, 
                        'strict': 'true',
                        'limit': 1
                       }
        response = requests.get(details_base_url, params=query_params)
        response.raise_for_status()
        
        json_response = response.json()
        
        # Check we've returned results - no score/confidence cutoff availabel here though
        result_count = json_response['opensearch:totalResults']
        
        # unpack the first ['item'] in dataElement and handle no results scenario
        if result_count == 0:
            continue
        else:
            # store in the original dict to help compare results from the different approachs
            name['bionomia_detail_match'] = json_response['dataFeedElement'][0]['item']
            
    return names

____
### 1. Get GBIF botanist sample
____

In [7]:
# Set daterange of interest and taxon-id (easily grabbable from occ search GUI url)
gbif_collectors = get_gbif_recordedBy(start_year_range, end_year_range, gbif_taxon_id)

Other things that might be interesting:
* How much is recordedByID being used? What kind of IDs are in there?
* Distribution/frequency of each name variant within result set
* Are names consistent within datasets/institutions

_______

### 2. ID easier-to-resolve names
_______


In [8]:
good_names, initials = get_rid_of_gunk(gbif_collectors)

In [9]:
good_names.sort()

# Peek at the top 10 names
good_names[:10]

['ATP in H. Vandenbroeck',
 'Abbott, James',
 'Abbé M. Gandoger',
 'Abeleven THAJ',
 'Abrahamson, Knut',
 'Adalbert Geheeb',
 'Addison, Rev Frederick',
 'Adolf Grape',
 'Adolf Roth',
 'Adolphus Pansch']

In [10]:
f"Full name count: {len(good_names)}, initials count: {len(initials)}"

'Full name count: 1277, initials count: 4836'

_To-do : add summary chart of count of fuller names vs less full?_

_______

### 3. Try to resolve against Binomia
_______

Using two endpoints: autocomplete widget and JSON-LD search for people endpoint  
Docs: https://bionomia.net/developers

##### Notes and other interesting stuff

1. Does the order of words in a name matter? aka, any difference from `forename, surname` pattern vs `surname, forename`?  
    * Doesn't look like it. e.g., the two calls below bring back the same match with an identical score (43.65668):   
        `https://api.bionomia.net/user.json?q=Nilsson+Alb.&limit=1`  
        `https://api.bionomia.net/user.json?q=Alb.+Nilsson&limit=1`  
        
            
2. Does having a title/indicator of marital status make any difference to number of matches/scores?


In [38]:
# pass in list of names and date range
matches, unmatches = search_bionomia_people_auto(good_names)

# See if there's much diff between confidence/match level for full vs initial-based names (manually spoof names for now)

No matches > score 40 for ATP in H. Vandenbroeck. Best: 20.669916: Frank H.
No matches > score 40 for Abbott, James. Best: 21.423473: John Angell James
Match found for Abbé M. Gandoger: Michel Gandoger. Score: 52.514725
Match found for Abeleven THAJ: Theodoor Hendrik Arnold Jacob Abeleven. Score: 52.89057
No matches > score 40 for Abrahamson, Knut. Best: 23.354465: Knut Fægri
Match found for Adalbert Geheeb: Adalbert Geheeb. Score: 77.08774
No matches > score 40 for Addison, Rev Frederick. Best: 25.522219: Lafayette Frederick
Match found for Adolf Grape: Anders Grape. Score: 44.217564
No matches > score 40 for Adolf Roth. Best: 37.070232: Santiago Roth
Match found for Adolphus Pansch: Adolf Pansch. Score: 48.654083
No matches > score 40 for Afzelius, Arwid. Best: 15.5228405: Ivar Arwidsson
No matches > score 40 for Agostino Daldini. Best: 24.19717: Agostino Bassi
No matches > score 40 for Ahlberg, Fredrik. Best: 20.226503: Fredrik Hasselqvist
No matches > score 40 for Ahlberg, Nils Fre

Match found for Baglietto G. di Voltri: Francesco Baglietto. Score: 52.89057
No matches > score 40 for Bagnall, Mr James Eustace. Best: 21.423473: John Angell James
Match found for Baguet Charles: Charles Baguet. Score: 64.82505
No matches > score 40 for Bailey, George Henry. Best: 27.992971: George Morrison Reid Henry
No matches > score 40 for Bailey, Mr Charles. Best: 19.882156: Vera Katharine Charles
Match found for Balansa, Benedict: Francis Gano Benedict. Score: 40.32904
No matches > score 40 for Ball, John. Best: 18.1349: Hans John
No matches > score 40 for Baptiste Jacob. Best: 38.15323: Baptiste Jaques Jacob
Match found for Barcelo Combis, Francisco: Francisco Barceló. Score: 65.643295
Match found for Barceló y Combis, Francisco: Francisco Barceló. Score: 65.643295
No matches > score 40 for Barker, Prof. Thomas MA - Botanist. Best: 26.829056: Mathieu Thomas
Match found for Baron Haussmann: David Haussmann. Score: 52.89057
No matches > score 40 for Barrow, Mr John - Botanist. Be

Match found for Chr. Sommerfelt: Christian Sommerfelt. Score: 46.394154
Match found for Christian Gregor Brügger: Christian Georg Brügger. Score: 54.10313
Match found for Christian Kaurin, Elling Ryan: Elling Jacobsen Ryan. Score: 53.67875
Match found for Claes Eric Grill: Claes Grill. Score: 70.75803
Match found for Claes Gritz: Jolien Claes. Score: 40.06582
Match found for Claes Mebius: Jolien Claes. Score: 40.06582
No matches > score 40 for Clarence E. Lyman. Best: 38.724503: Benjamin Smith Lyman
No matches > score 40 for Clark, H. James. Best: 26.914068: Sidney H. James
No matches > score 40 for Clark, James. Best: 21.423473: John Angell James
No matches > score 40 for Clarke, Charles Baron. Best: 33.43147: Richard Baron
No matches > score 40 for Clarke, Mr Charles Baron MA, FRS, FLS. Best: 33.43147: Richard Baron
Match found for Clas Bolin: Jolien Claes. Score: 40.06582
Match found for Claude Thomas Alexis Jordan: Claude Thomas Alexis Jordan. Score: 64.487045
No matches > score 40

No matches > score 40 for Emil Berroyer. Best: 15.946018: Emil Bretschneider
No matches > score 40 for Emil Borroyer. Best: 15.946018: Emil Bretschneider
No matches > score 40 for Emil Lindell. Best: 15.946018: Emil Bretschneider
Match found for Emil Lyttkens: August Lyttkens. Score: 50.33644
Match found for Emil Viktor Ekstrand: Emil Viktor Ekstrand. Score: 70.86043
Match found for Emile Burnat: Émile Burnat. Score: 69.01325
Match found for Emile Florentin Favre: Pierre Favre. Score: 44.84338
Match found for Emile Levier: Emilio Levier. Score: 52.89057
Match found for Emilio Marcucci: Emilio Marcucci. Score: 75.68788
No matches > score 40 for Enander, Sven Johan. Best: 21.846577: Sven Hedin
Match found for Engler HGA: Adolf Engler. Score: 50.33644
No matches > score 40 for Engström, Johan. Best: 11.443588: Johan Sandström
No matches found for Ericsson, Antipas
No matches > score 40 for Eriksson, Henrik. Best: 20.743404: Henrik Mohn
No matches > score 40 for Eriksson, Jacob. Best: 27.7

No matches > score 40 for Gibelli, Giuseppe. Best: 17.908594: Giuseppe Olivi
Match found for Gielens in François Crépin: François Crépin. Score: 68.98529
No matches > score 40 for Gillman, Henry. Best: 21.29573: William Arnon Henry
Match found for Gillot Xavier: François-Xavier Gillot. Score: 52.89057
Match found for Glaziou AFM: Auguste François Marie Glaziou. Score: 52.89057
No matches found for Glaziou, AFM
No matches found for Glaziou, Roosmalen
No matches found for Gorden, Legrev
Match found for Graf Ferdinand: Ferdinand Graf. Score: 55.26113
Match found for Gravet PJF: Pierre Joseph Frédéric Gravet. Score: 52.89057
Match found for Grill, Claes: Jolien Claes. Score: 40.06582
No matches > score 40 for Grimus,K.Ritter von Grimburg. Best: 8.236407: Veit Hanns Schnorr von Carolsfeld
No matches > score 40 for Groves, Henry. Best: 21.29573: William Arnon Henry
No matches > score 40 for Gtt. Peters. Best: 34.746693: James A. Peters
No matches > score 40 for Guillon, Anatole. Best: 25.896

Match found for Herb. Musée Préhistoire Nemours: Pierre Samuel du Pont de Nemours. Score: 52.89057
No matches > score 40 for Herbarium Musei Florentini. Best: 15.094803: MITCHELL POWER
Match found for Herbarium of the New York Botanical Garden: Edward of Norwich Duke of York. Score: 56.553444
No matches found for Herbier Brimmeyr
No matches > score 40 for Herbier d'un pharmacien de Wissembourg - grand for. Best: 39.35032: Larry F. Grand
Match found for Herbier d'un pharmacien de Wissembourg - petit for: Paul Charles Mirbel Petit. Score: 40.901093
Match found for Herman Milde: Carl Julius Milde. Score: 50.33644
No matches > score 40 for Heufler L. J. von. Best: 24.852144: Ludwig Samuel Joseph David Alexander Heufler zu Rasen und Perdonneg
Match found for Hewett Cottrell Watson: Hewett Watson. Score: 81.6737
No matches > score 40 for Hieronymus GHEW. Best: 38.168373: Tobin Hieronymus
Match found for Hieronymus Gander: Hieronymus Gander. Score: 76.00962
Match found for Hieronymus von Rens

No matches > score 40 for Johanson, Nils Abraham. Best: 33.49996: Rudolf Abraham
Match found for John  Macoun: John Macoun. Score: 61.22096
Match found for John Alpheus Allen: John Alpheus Allen. Score: 49.645756
Match found for John Barrow: Lisa Barrow. Score: 44.217564
Match found for John Blackstone: John Blackstone. Score: 63.77509
Match found for John Byrne Leicester Warren: John Byrne. Score: 54.54595
Match found for John Fergusson: John Fergusson. Score: 58.282024
Match found for John Firminger Duthie: John Firminger Duthie. Score: 66.46532
Match found for John G. Baker: John Gilbert Baker. Score: 45.439728
Match found for John G. Lemmon: John Gill Lemmon. Score: 60.086166
Match found for John H. Redfield: John Howard Redfield. Score: 55.772972
Match found for John Harbord Lewis: John Lewis. Score: 41.80284
Match found for John Kirk: John Kirk. Score: 50.234837
No matches > score 40 for John L. Wall. Best: 34.40468: Sven Wall
Match found for John L. Warren: James L. L. F. Warren

No matches found for Letourneux, Aristide-Horace
Match found for Levier et Sommier: Emilio Levier. Score: 52.89057
No matches > score 40 for Levier, Emile. Best: 18.67681: Émile Marchoux
Match found for Levy Paul: Josef Levý. Score: 42.28925
Match found for Lewis E. Foote: Albert E. Foote. Score: 49.543793
Match found for Lewis Foote: Robert Bruce Foote. Score: 44.217564
No matches > score 40 for Ley, A. Augustin. Best: 35.209988: Karl Wilhelm Augustin
No matches > score 40 for Ley, Agustin. Best: 24.363882: Agustín Yáñez
No matches > score 40 for Ley, Rev. Augustin. Best: 35.209988: Karl Wilhelm Augustin
No matches > score 40 for Lindberg, Alfred. Best: 15.688007: Alfred Heilbronn
Match found for Lindemann EE von: Emanuel von Lindemann. Score: 50.10295
No matches > score 40 for Lindemann, E. von. Best: 13.114628: Eduard von Bodemeyer
No matches > score 40 for Lindemann, Eduard Émanuilovichs von. Best: 24.84917: Eduard von Martens
No matches > score 40 for Linden, Charles. Best: 19.882

No matches > score 40 for NSW Forestry Commission. Best: 15.094803: Rahmat Asy'Ari
No matches > score 40 for NULL Rodriguez Fernández. Best: 36.41139: Salustio Alvarado Fernández
No matches > score 40 for Nathorst, Alfred Gabriel. Best: 30.677315: Bernardo Panuncialman Gabriel
No matches > score 40 for National Herbarium Victoria. Best: 24.363882: Victoria Leachman
Match found for Naumann in Berat in F. Crépin: François Crépin. Score: 52.89057
No matches > score 40 for Neander, Anders. Best: 31.768578: Jason Anders
No matches > score 40 for Neuman, Leopold Martin. Best: 34.620304: Philipp Leopold Martin
Match found for Nicholas Pike: Oliver G. Pike. Score: 43.16102
Match found for Nicolas Pike: Nicolas Pike. Score: 61.614586
No matches > score 40 for Niels Moe. Best: 30.949669: Nils Green Moe
Match found for Niels Wulfsberg: Nils Gregers Ingvald Wulfsberg. Score: 52.89057
No matches > score 40 for Nielsen, Peter. Best: 25.158176: Robert Peter
Match found for Nils Bryhn: Nils Bryhn. Sco

Match found for Professor C. Haussknecht: Heinrich Carl Haussknecht. Score: 52.89057
No matches > score 40 for Rabenhorst, Dr Gottlob Ludwig. Best: 26.162975: Christian Friedrich Ludwig
Match found for Rae Natalie P. Goodall: Rae Natalie Prosser de Goodall. Score: 68.84723
No matches > score 40 for Ragnar Oldberg. Best: 26.196608: Ragnar Öller
Match found for Randor Eretius Fridtz: Randor Eretius Fridtz. Score: 83.56357
Match found for Rankin in F. Crépin: François Crépin. Score: 52.89057
Match found for Raphaël Ritz: Rafael Ritz. Score: 60.65416
No matches > score 40 for Rastern, Nicomedes. Best: 31.734089: Nicomedes Valenzuela López
Match found for Regnell III: Anders Fredrik Regnell. Score: 50.33644
Match found for Reichenbach HGL: Ludwig Reichenbach. Score: 43.16102
No matches > score 40 for René Du Parquet. Best: 32.22809: Albert Reñé
Match found for Reuss A von: August Emanuel von Reuss. Score: 54.766483
No matches > score 40 for Reuss,A.E.R. von. Best: 8.236407: Veit Hanns Schno

ConnectionError: HTTPSConnectionPool(host='api.bionomia.net', port=443): Max retries exceeded with url: /user.json?q=Rudolf+Thelander&limit=1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000025F14DE6EB0>: Failed to establish a new connection: [WinError 10053] An established connection was aborted by the software in your host machine'))

In [41]:
matches[0:20]

[{'original_name': 'Abbé M. Gandoger',
  'binomia_match': {'id': 11855,
   'score': 52.51468,
   'orcid': None,
   'wikidata': 'Q2601687',
   'fullname': 'Michel Gandoger',
   'fullname_reverse': 'Gandoger, Michel',
   'thumbnail': 'https://bionomia.net/images/photo24X24.png',
   'lifespan': '&#42; May 10, 1850 &ndash; October  4, 1926 &dagger;',
   'description': 'French botanist and mycologist (1850-1926)'}},
 {'original_name': 'Abeleven THAJ',
  'binomia_match': {'id': 14282,
   'score': 52.89057,
   'orcid': None,
   'wikidata': 'Q10381986',
   'fullname': 'Theodoor Hendrik Arnold Jacob Abeleven',
   'fullname_reverse': 'Abeleven, Theodoor Hendrik Arnold Jacob',
   'thumbnail': 'https://bionomia.net/images/photo24X24.png',
   'lifespan': '&#42; December 28, 1822 &ndash; February 21, 1904 &dagger;',
   'description': 'Dutch botanist, apothecary and teacher (1822-1904)'}},
 {'original_name': 'Adalbert Geheeb',
  'binomia_match': {'id': 9632,
   'score': 77.08774,
   'orcid': None,
  

In [42]:
unmatches[:10]

['ATP in H. Vandenbroeck',
 'Abbott, James',
 'Abrahamson, Knut',
 'Addison, Rev Frederick',
 'Adolf Roth',
 'Afzelius, Arwid',
 'Agostino Daldini',
 'Ahlberg, Fredrik',
 'Ahlberg, Nils Fredrik',
 'Ahlner, Klas']

In [43]:
f"Out of {len(good_names)} non-initialised names, {len(matches)} were matched against the basic Bionomia endpoint (precision/accuracy tbc!)"

'Out of 1277 non-initialised names, 649 were matched against the basic Bionomia endpoint (precision/accuracy tbc!)'

In [45]:
# Quick eyeball to see how many bad matches the first pass against Bionomia returned
# Will using year of collection + json-ld endpoint removed/improve these? Looking for improved accuracy, not precision
for match in matches:
    print(f"{match['original_name']} -> {match['binomia_match']['fullname']} ({match['binomia_match']['wikidata']})")

Abbé M. Gandoger -> Michel Gandoger (Q2601687)
Abeleven THAJ -> Theodoor Hendrik Arnold Jacob Abeleven (Q10381986)
Adalbert Geheeb -> Adalbert Geheeb (Q69278)
Adolf Grape -> Anders Grape (Q5768874)
Adolphus Pansch -> Adolf Pansch (Q15989586)
Aitchison JET -> James Edward Tierney Aitchison (Q1680330)
Alban Voigt -> Johann Christian Voigt (Q55073802)
Albert Commons -> A. Commons (Q21508915)
Albert Forssell -> Nils Edvard Forssell (Q5737689)
Albert Julius Otto (Albertus Giulio Ottone) Penzig -> Albert Julius Otto Penzig (Q3887349)
Albert Kellogg -> Albert Kellogg (Q1368910)
Albert Spear Hitchcock -> A. S. Hitchcock (Q2063155)
Albert Üksip -> Albert Üksip (Q4711556)
Alberto Franzoni -> Alberto Franzoni (Q97655472)
Alexander Karl (Carl) Heinrich Braun -> Alexander Braun (Q62855)
Alf. Hanson -> Bror Hanson (Q55042523)
Alf. Hansson -> Christer Hansson (None)
Alfr. Beckman -> Kaj Beckman (Q4937885)
Alfr. Wahlstedt -> Lars Johan Wahlstedt (Q16650555)
Alfred C. Hance -> Henry Fletcher Hance (Q26

Next steps:
- run queries that found a match + year of interest against the JSON-LD bionomia endpoint
- see if abbreviated names work against the JSON-LD endpoint, providing there's a year
- Stash any good matches for now
- for full names what don't match: ID women and then try to run them against wikidata
- If there is a match in wikidata: figure out the best info etc to dumpout to encourage folks to make profiles on bionomia
- No match on wikidata or bionomia: clustering names/families of research? Look in BHL?