BirdLife LitScan app for scraping and analysing journal articles relevant to species red-list assessment.

billoxbury/blitscan

BirdLife LitScan

This repository contains code for LitScan, a BirdLife International project on search and information discovery in the scientific literature.

The aim of the project is to compile in one place links to web resources relevant to BirdLife's work in assessing the IUCN red-list status of species.

The LitScan codebase divides into three component services:

  1. Scan the web for content, which gets stored in a PostgreSQL database.
  2. Process text in the database: translate to English, score for relevance, locate species mentions.
  3. Web app: UI for access to database results, plus dashboard.

Services 1, 2 and 3 are treated as independent micro-services. In this repo, their code lives in the directories scrape, process and webapp respectively. Each of these directories has its own README file that describes the service in more detail.

Services 1, 2 and 3 are run as a single end-to-end process by the script run_all.sh. This calls one script per service; more details on each can be found in the respective README files.
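The end-to-end flow can be pictured as a sequential runner that stops at the first failing service. The sketch below is written in Python rather than shell and is not the contents of run_all.sh: the stage commands are placeholder one-liners, and the stop-on-failure behaviour is an assumption about what a runner of this kind would want.

```python
import subprocess
import sys


def run_pipeline(stages):
    """Run (name, command) stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, command in stages:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"stage {name!r} failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
        completed.append(name)
    return completed


# Hypothetical stand-ins for the per-service scripts. Each "service" is a
# trivial Python one-liner so that the example runs anywhere.
STAGES = [
    ("scrape", [sys.executable, "-c", "print('scanning the web')"]),
    ("process", [sys.executable, "-c", "print('processing text')"]),
    ("webapp", [sys.executable, "-c", "print('refreshing web app')"]),
]

if __name__ == "__main__":
    run_pipeline(STAGES)
```

Stopping at the first failure keeps later stages from working on a partially updated database; whether that is the right policy for LitScan is a design decision for the real script.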

The rest of this README covers the database and the Azure deployment.

PostgreSQL database

All three services talk to a PostgreSQL database. It contains various tables, two of which are described here.

links is the main table of documents. Its primary key is the field link, which holds the URL of the document. Its structure is:

                       Table "public.links"
        Column        |       Type       | Collation | Nullable | Default
----------------------+------------------+-----------+----------+---------
 date                 | text             |           |          |
 link                 | text             |           | not null |
 link_name            | text             |           |          |
 snippet              | text             |           |          |
 language             | text             |           |          |
 title                | text             |           |          |
 abstract             | text             |           |          |
 pdf_link             | text             |           |          |
 domain               | text             |           |          |
 search_term          | text             |           |          |
 query_date           | text             |           |          |
 badlink              | integer          |           |          |
 donepdf              | integer          |           |          |
 gottext              | integer          |           |          |
 gotscore             | integer          |           |          |
 gotspecies           | integer          |           |          |
 score                | double precision |           |          |
 species              | text             |           |          |
 doi                  | text             |           |          |
 title_translation    | text             |           |          |
 abstract_translation | text             |           |          |
 gottranslation       | integer          |           |          |
 donecrossref         | integer          |           |          |
 pdftext              | text             |           |          |
 pdftext_translation  | text             |           |          |
 datecheck            | integer          |           |          |
Indexes:
    "links_pkey" PRIMARY KEY, btree (link)

The integer fields (badlink, donepdf, gottext, gotscore, gotspecies, gottranslation, donecrossref, datecheck) are used as boolean flags for processing control.
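To illustrate how flags of this kind can drive processing, here is a minimal sketch that selects the documents still waiting for a given stage and marks one as done. It makes several assumptions that are not confirmed by this README: that 1 means "done" and 0 or NULL means "pending", that rows with badlink set are skipped, and that the helper names pending_links and mark_done are purely illustrative. It also uses an in-memory SQLite database with a subset of the columns as a stand-in for PostgreSQL so that it runs without a server.

```python
import sqlite3

# Flag columns taken from the links table above.
FLAG_COLUMNS = {"donepdf", "gottext", "gotscore", "gotspecies",
                "gottranslation", "donecrossref", "datecheck"}


def _check_flag(flag):
    # Column names cannot be passed as query parameters, so they are
    # validated against a fixed whitelist before being interpolated.
    if flag not in FLAG_COLUMNS:
        raise ValueError(f"unknown flag column: {flag}")


def pending_links(conn, flag):
    """Return URLs whose flag is unset (NULL or 0) and that are not bad links."""
    _check_flag(flag)
    query = (
        f"SELECT link FROM links "
        f"WHERE COALESCE({flag}, 0) = 0 AND COALESCE(badlink, 0) = 0 "
        f"ORDER BY link"
    )
    return [row[0] for row in conn.execute(query)]


def mark_done(conn, link, flag):
    """Set the flag to 1 for one document (assumed meaning: stage complete)."""
    _check_flag(flag)
    conn.execute(f"UPDATE links SET {flag} = 1 WHERE link = ?", (link,))


def demo_connection():
    """Build a tiny in-memory stand-in for the links table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links (link TEXT PRIMARY KEY, badlink INTEGER, "
        "gotscore INTEGER, score REAL)"
    )
    conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", [
        ("https://example.org/a", 0, 1, 0.8),         # already scored
        ("https://example.org/b", 0, 0, None),        # pending
        ("https://example.org/c", 1, 0, None),        # bad link, skipped
        ("https://example.org/d", None, None, None),  # flags never set
    ])
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(pending_links(conn, "gotscore"))
```

Against PostgreSQL the same queries would apply, but the parameter placeholder depends on the driver (for example %s rather than ? with psycopg2).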

species contains BirdLife International's species information. Its structure is:

            Table "public.species"
   Column   |  Type   | Collation | Nullable | Default
------------+---------+-----------+----------+---------
 link       | text    |           |          |
 name_com   | text    |           |          |
 name_sci   | text    |           |          |
 SISRecID   | integer |           |          |
 date       | text    |           |          |
 text_main  | text    |           |          |
 text_short | text    |           |          |
 status     | text    |           |          |
 recog      | text    |           |          |
 syn        | text    |           |          |
 alt        | text    |           |          |
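One way the name_com and name_sci columns could support locating species mentions is plain whole-word name matching. The sketch below is an illustration of that idea only, not the method used in the process service; the function name find_species_mentions is hypothetical and the SISRecID values in the example are placeholders, not real record IDs.

```python
import re


def find_species_mentions(text, species_rows):
    """Return the SISRecIDs of species named in text.

    species_rows is an iterable of (SISRecID, name_com, name_sci) tuples.
    A species is reported once if either of its names occurs in the text.
    """
    found = []
    for rec_id, name_com, name_sci in species_rows:
        for name in (name_com, name_sci):
            if not name:
                continue
            # Whole-word, case-insensitive match. re.escape protects against
            # punctuation in names, such as apostrophes or hyphens.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append(rec_id)
                break
    return found


# Placeholder IDs; the names are real species used only as examples.
EXAMPLE_SPECIES = [
    (1, "Northern Lapwing", "Vanellus vanellus"),
    (2, "Eurasian Curlew", "Numenius arquata"),
    (3, "Atlantic Puffin", "Fratercula arctica"),
]

EXAMPLE_TEXT = (
    "Breeding success of the northern lapwing declined, "
    "while Numenius arquata was stable."
)

if __name__ == "__main__":
    print(find_species_mentions(EXAMPLE_TEXT, EXAMPLE_SPECIES))
```

Exact matching of this kind misses abbreviated forms such as "N. arquata" and any alternative names, which is presumably part of what the syn and alt columns are for.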

The file pg_views.sh in this directory contains informal notes and some examples of views into the database.

Azure deployment

The whole system is deployed in Microsoft Azure. It lives under a single subscription and is subdivided into three resource groups: scrapeRG, procRG and webappRG.

Resources common to all three components, such as the database blitscan-pg, are hosted under webappRG. This also includes a storage account blitstore, which contains a single file share blitshare.

blitshare has the following directory structure:

/costing        - contains some cost estimate reports
/data           - mainly temporary storage for PDF processing; some legacy data sets
/pg             - protected PostgreSQL credentials for access to the database
/reports        - location for reports, including a subdirectory of 'scraper' dashboard files

Code from scrape and process (LitScan services 1 and 2) is run in an Ubuntu virtual machine, blitscanVM. The file azure_vm.sh in this directory contains code for installing the necessary software stack in this VM.

The webapp (LitScan service 3) is deployed as an Azure web app, blitscanapp, which runs a Docker container held in the container registry blitscanappcontainers. More details can be found in the scrape README file.

Tasks (system)

  • Complete migration to Azure VM
  • Design log files/system analytics
  • Automate with daily cron job
  • System dashboard: metrics on LitScan successes/failures and cost monitoring
  • Network: monitor TLS certificates (sometimes they expire and the system breaks!)
  • Agree BirdLife adoption route

The last point refers to the fact that LitScan is currently hosted in an isolated Azure subscription, which, for example, prevents BirdLife from accessing the LitScan file shares. A plan is therefore needed for how operational management of the tool will be resourced. This in turn will affect requirements for the Azure architecture.
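For the TLS monitoring task in the list above, a small check of certificate expiry dates could run alongside the daily job. The following is a sketch under assumptions rather than part of the current codebase: the function names, the 14-day warning window and the choice of hosts to check are all hypothetical. It uses only the Python standard library.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the 'notAfter' string of the certificate served by host:port.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError here. A monitor should
    treat that exception as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between now (epoch seconds) and a certificate 'notAfter' string."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def needs_renewal(not_after, warn_days=14, now=None):
    """True if the certificate expires within warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now) < warn_days
```

A monitor would call fetch_not_after for each host the scraper or web app depends on and raise an alert when needs_renewal returns True or the handshake itself fails.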
