# A Baseline for Models That Leverage the Schwartz-Hearst Algorithm

In the Kaggle competition, Kaggle Models 1 and 2 both used the Schwartz-Hearst algorithm [(Schwartz & Hearst, 2003)](https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf) to detect abbreviations.

Model 2 uses this approach to extract all possible dataset candidates. These are
then filtered by a binary classifier and the model's other heuristics. So, for
at least Kaggle Model 2, the algorithm's output can be used as a baseline for
precision and recall. This notebook specifically uses Model 2's implementation
of the algorithm, which is based on the [AllenAI scispacy
implementation](https://github.com/allenai/scispacy/blob/cc1a71701f1b83bf81d36c47c271ad5ee1c03d45/scispacy/abbreviation.py)
but may have introduced some bugs.

For Kaggle Model 1 this baseline is less directly informative because that model
used the actual AllenAI scispacy code (which may or may not produce different
results) and additionally used a custom algorithm to extract more candidate
datasets. Further, it used the extracted entities to build up a repository of
text snippets for training the model. That method should be evaluated in another
notebook to establish better baselines.
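
For readers unfamiliar with the algorithm, its core step is matching a parenthesized short form against the words that precede it. The sketch below loosely follows the character-matching procedure described in the Schwartz & Hearst paper. It is not the code used by Kaggle Model 2 or scispacy; the function names `find_best_long_form` and `extract_pairs`, the candidate regex, and the guard conditions are assumptions made for illustration, and only the `long form (short form)` direction is handled.

```python
import re


def find_best_long_form(short_form, long_form):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `long_form` that contains every
    alphanumeric character of `short_form` in order, or None if no match.
    """
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue
        # Walk left until the character matches. The first short-form
        # character must additionally sit at the start of a word.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


def extract_pairs(sentence):
    """Return (short form, long form) pairs found in one sentence."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not any(ch.isalpha() for ch in short_form):
            continue
        words_before = sentence[: match.start()].split()
        # Bound the search window by the short form's length
        # (the paper uses min(|A| + 5, |A| * 2) words).
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words_before[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form is not None:
            pairs.append((short_form, long_form))
    return pairs


print(extract_pairs(
    "GWAS data derived from the database of Genotypes and Phenotypes (dbGaP)"
))
print(extract_pairs("made publicly available in dbGAP, accession number phs000759.v1.p1."))
```

The second call returns an empty list: a bare acronym with no parenthesized definition gives the algorithm nothing to match, which is the kind of miss examined later in this notebook.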

In [2]:
import json
from itertools import chain

import numpy as np
import pandas as pd
from tqdm import tqdm
from thefuzz import fuzz, process

from democratizing_data_ml_algorithms.models.schwartz_hearst_model import SchwartzHearstModel
from democratizing_data_ml_algorithms.evaluate.model import evaluate_kaggle_private

In [3]:
# the `scorer` and `processor` arguments are explained in the notebook
# `defining_a_match_1.ipynb`

evaluation = evaluate_kaggle_private(
    SchwartzHearstModel(),
    dict(),  # this model doesn't have any configuration params
    -1,  # no batching the data
    scorer=fuzz.partial_ratio,  # use fuzzy string matching
    processor=lambda s: s.lower(),  # convert to lowercase
)

`evaluation` is an instance of the `ModelEvaluation` class, which aggregates
some statistics about the model output for us.

In [4]:
evaluation


        Model Evaluation:

        - Run time: 304.2231593132019 seconds, avg: 0.037730765138683106 seconds per sample
        - True Postive Count: 6217, avg: 0.7710529579560957 per sample
        - Precision: 0.09721201507356965
        - Recall: 0.17481652279054072
        

We can see that the recall is much lower than we would hope (~0.17). Let's take
a look at which datasets we are most commonly missing. To do so, we look at the
`output_statistics` dataframe.

In [5]:
# output_statistics is a dataframe with the following columns:
# ids: list of ids for the sample
# labels: list of labels given by the dataset
# statistics: list of dictionaries with the following keys:
#   labels: set union of the input labels and the model predictions
#   stats: list with TP, FP, or FN for each of the entries in `labels`
evaluation.output_statistics

Unnamed: 0,ids,labels,statistics
0,7bfd8bb51,dbgap,"{'labels': ['dbgap', 'copy-number variants', '..."
1,7d3e31302,sass|sass data|school and staffing sass data|s...,"{'labels': ['sass', 'sass data', 'school and s..."
2,3644f959a,foodaps|national household food acquisition an...,"{'labels': ['foodaps', 'national household foo..."
3,ed3527cf3,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
4,236061129,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
...,...,...,...
8058,3cb8b0e09,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
8059,324fe1310,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
8060,39e66a274,1000 genomes project|1000 genomes project 1000...,"{'labels': ['1000 genomes project', '1000 geno..."
8061,c411b1b6c,dbgap|american cancer society cancer preventio...,{'labels': ['american cancer society cancer pr...
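
To make the layout of the `statistics` column concrete, here is a small sketch that tallies false negatives per label. The records in `toy_statistics` are hand-written for illustration and are not taken from the real evaluation output; only the `labels`/`stats` structure mirrors what is described above.

```python
from collections import Counter

# Hypothetical records shaped like entries of the `statistics` column:
# `labels` and `stats` are parallel lists, one TP/FP/FN flag per label.
toy_statistics = [
    {"labels": ["dbgap", ""], "stats": ["FN", "FP"]},
    {"labels": ["dbgap", "sass"], "stats": ["FN", "TP"]},
    {"labels": ["sass", "gwas"], "stats": ["FN", "FP"]},
]

# Pair each label with its flag and count only the false negatives.
fn_counts = Counter(
    label
    for record in toy_statistics
    for label, stat in zip(record["labels"], record["stats"])
    if stat == "FN"
)

print(fn_counts.most_common())
```

Pairing the two lists with `zip` keeps each label attached to its own flag, which is the same relationship the flattened dataframe in the following cells relies on.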


In [6]:
stats = evaluation.output_statistics
all_labels = list(chain.from_iterable(x["labels"] for x in stats["statistics"]))
global_stats = list(chain.from_iterable(x["stats"] for x in stats["statistics"]))

In [7]:
stats_df = pd.DataFrame({"labels": all_labels, "stats": global_stats})
stats_df.loc[stats_df["stats"] == "FN", :].groupby("labels").count().sort_values(
    "stats", ascending=False
)

Unnamed: 0_level_0,stats
labels,Unnamed: 1_level_1
dbgap,3310
database of genotypes and phenotypes,928
program for international student assessment,799
database of genotypes and phenotypes dbgap,519
schools and staffing survey,359
...,...
genotype and phenotype data and analyses,1
genotype and phenotype correlation studies,1
genotype and phenotype,1
genotype and clinical data,1


We see the most common false negative in this case is `dbgap`, which is commonly
written as `dbGaP` and stands for the "**d**ata**b**ase of **G**enotypes **a**nd
**P**henotypes". Let's look at some documents where Schwartz-Hearst misses the dataset.



In [8]:
def model_missed_label(label, row):
    if label in row["labels"]:
        lbl_idx = row["labels"].index(label)
        return row["stats"][lbl_idx] == "FN"
    else:
        return False


missed_mask = stats["statistics"].apply(lambda row: model_missed_label("dbgap", row))
missed = stats.loc[missed_mask, :]
missed

Unnamed: 0,ids,labels,statistics
3,ed3527cf3,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
4,236061129,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
9,bf7f868aa,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
11,7d9c76c33,1 000 genomes|1 000 genomes reference data|dat...,"{'labels': ['1 000 genomes', '1 000 genomes re..."
13,bca78640a,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
...,...,...,...
8055,44c7743d6,dbgap,"{'labels': ['dbgap', 'variance stabilizing tra..."
8056,5f8929ee0,dbgap,"{'labels': ['dbgap', 'untranslated regions', '..."
8059,324fe1310,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
8060,39e66a274,1000 genomes project|1000 genomes project 1000...,"{'labels': ['1000 genomes project', '1000 geno..."


First, the relevant snippet from `ed3527cf3.json`:

*The National Institutes of Health (NIH) has established NIH-designated data repositories (e.g., **database of Genotypes and Phenotypes (dbGaP)**, Sequence Read Archive (SRA), NIH Established Trusted Partnerships) for securely storing and sharing controlled-access human data submitted to NIH under the NIH Genomic Data Sharing (GDS) Policy.*

This entry is peculiar because the mention is itself inside a set of parentheses, which likely interferes with how the algorithm matches the `long form (short form)` pattern. It is also probably an uncommon occurrence.

Next, the relevant snippet from `236061129.json`:

*The accession number for data generated from small RNA-sequencing of the maternal and non-maternal plasma samples reported in this paper is **dbGaP**: phs001892.v1.p1 (https://www.ncbi.nlm.nih.gov/gap/).*

This doesn't follow the expected `long form (short form)` pattern, so it is an expected false negative.

Next, the relevant snippet from `bf7f868aa.json`:

*35 All primary sequencing data will be made publicly available in **dbGAP**, accession number phs000759.v1.p1.*

This also doesn't follow the expected pattern. It may be that dbGaP is so well known that texts commonly refer to it by acronym alone, without spelling out the long form.

Finally, here is the snippet from `7d9c76c33.json`:

*We evaluate the performance of the proxy algorithm with data sets that were simulated using realistic linkage disequilibrium patterns obtained from the 1,000 Genomes project. Moreover, we successfully applied our approach to Type II diabetes (T2D) GWAS data derived from the **database of Genotypes and Phenotypes (dbGaP)***

This seems to be correctly formatted, so there is likely an issue with the Schwartz-Hearst implementation from Kaggle Model 2.

Below we rerun the analysis, this time using the Schwartz-Hearst algorithm as implemented by AllenAI in scispacy. Note that this will take longer because of the time spaCy takes to parse a document, but it may split sentences better than the simple implementation from Kaggle Model 2.

In [9]:
from importlib import reload

In [11]:
import src.models.schwartz_hearst_allenai_model as allenai

reload(allenai)

evaluation_allenai = evaluate_kaggle_private(
    allenai.SchwartzHearstModel_AllenAI(model="en_core_sci_sm"),
    dict(
        char_limit=50000
    ),  # limit the amount of the doc that gets processed for memory issues
    -1,  # no batching the data
    scorer=fuzz.partial_ratio,  # use fuzzy string matching
    processor=lambda s: s.lower(),  # convert to lowercase
)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.




In [12]:
evaluation_allenai


        Model Evaluation:

        - Run time: 2594.315016269684 seconds, avg: 0.3217555520612283 seconds per sample
        - True Postive Count: 21559, avg: 2.6738186779114472 per sample
        - Precision: 0.09690527025508484
        - Recall: 0.6106500495680498
        

Using the AllenAI implementation improved recall significantly, from 0.17 to 0.61, while precision stayed essentially unchanged at about 0.10. However, the computational overhead was much higher: total run time rose from about 304 seconds on one core to about 2,594 seconds over 4 cores, and the per-document average increased from about 0.04 seconds to about 0.32 seconds.

Let's look at the same attributes of the output to see how things have changed.

In [13]:
stats_allen = evaluation_allenai.output_statistics
all_labels_allen = list(
    chain.from_iterable(x["labels"] for x in stats_allen["statistics"])
)
global_stats_allen = list(
    chain.from_iterable(x["stats"] for x in stats_allen["statistics"])
)

In [14]:
stats_df_allen = pd.DataFrame({"labels": all_labels_allen, "stats": global_stats_allen})
stats_df_allen.loc[stats_df_allen["stats"] == "FN", :].groupby(
    "labels"
).count().sort_values("stats", ascending=False)

Unnamed: 0_level_0,stats
labels,Unnamed: 1_level_1
dbgap,2742
database of genotypes and phenotypes,397
program for international student assessment,208
national postsecondary student aid study,159
database of genotypes and phenotypes dbgap,146
...,...
genotype and clinical data,1
genotype and phenotype correlation studies,1
genotype and phenotype data and analyses,1
genotype and phenotype database,1


Let's compare the top 5 false negatives between the two. An empty cell
indicates the label is not in that implementation's top 5.

| dataset                                      | Model 2 (FN count) | AllenAI (FN count) |
|----------------------------------------------|--------------------|--------------------|
| dbgap                                        | 3,310              | 2,742              |
| database of genotypes and phenotypes         | 928                | 397                |
| program for international student assessment | 799                | 208                |
| database of genotypes and phenotypes dbgap   | 519                | 146                |
| schools and staffing survey                  | 359                |                    |
| national postsecondary student aid study     |                    | 159                |

Let's look at some of the missed examples to see if this approach also seems to
be missing things it should catch. First let's check `dbgap`:

In [15]:
def model_missed_label(label, row):
    if label in row["labels"]:
        lbl_idx = row["labels"].index(label)
        return row["stats"][lbl_idx] == "FN"
    else:
        return False


missed_mask = stats_allen["statistics"].apply(
    lambda row: model_missed_label("dbgap", row)
)
missed = stats_allen.loc[missed_mask, :]
missed

Unnamed: 0,ids,labels,statistics
0,7bfd8bb51,dbgap,"{'labels': ['dbgap', 'fl', 'Whole-exome sequen..."
4,236061129,dbgap,"{'labels': ['dbgap', 'representative of relati..."
9,bf7f868aa,dbgap,"{'labels': ['dbgap', 'hprt', 'MEFs', 'N-methyl..."
18,de28ee4d4,1000 genomes project|clinvar|database of genot...,{'labels': ['database of genotypes and phenoty...
21,d58f1872e,gtex|genotype tissue expression gtex consortiu...,"{'labels': ['gtex', 'genotype tissue expressio..."
...,...,...,...
8056,5f8929ee0,dbgap,"{'labels': ['dbgap', 'GWAS', 'gene B', 'genome..."
8058,3cb8b0e09,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
8059,324fe1310,dbgap,"{'labels': ['dbgap', 'loss of heterozygosity',..."
8060,39e66a274,1000 genomes project|1000 genomes project 1000...,"{'labels': ['lhs', 'lhs study', 'lung health s..."


The mention candidate in `7bfd8bb51.json`:

*Whole-exome sequencing data have been deposited in the **dbGaP** database (http://www. ncbi.nlm.nih.gov/gap) under the accession number phs000474.v3.p2.*

This doesn't follow the expected pattern, so it is an expected false negative. The mention in document `236061129.json`:

*The accession number for data generated from small RNA-sequencing of the maternal and non-maternal plasma samples reported in this paper is **dbGaP**: phs001892.v1.p1 (https://www.ncbi.nlm.nih.gov/gap/).*

This is also an expected false negative. The mention in document `bf7f868aa.json` is the following:

*35 All primary sequencing data will be made publicly available in **dbGAP**, accession number phs000759.v1.p1.*

This is also an expected false negative. The mention in document `de28ee4d4.json` looks like:

*Whole-exome sequencing data have been submitted to the **Database of Genotypes and Phenotypes (dbGaP; https://www.ncbi.nlm.nih.gov/gap)** per NIH study protocol and patient consent (accession number phs001232.v1.p1).*

This is an interesting false negative. It's nearly the correct format, but it uses a not-uncommon convention for attaching a citation to a reference: **LONG FORM (ACRONYM; CITE)**. I wonder if the parser's candidate search should be extended to handle this?
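
One way such a pattern could be accommodated is to keep only the text before the first `;` or `,` inside the parentheses as the short-form candidate. The sketch below is hypothetical: `split_short_form` and the `PAREN` regex are illustrative names that are not part of scispacy or Kaggle Model 2, and whether either implementation already handles this case is not checked here.

```python
import re

# Innermost parenthesized spans, e.g. "(dbGaP; https://...)".
PAREN = re.compile(r"\(([^()]+)\)")


def split_short_form(parenthesized):
    """Keep only the text before the first '; ' or ', ' separator."""
    return re.split(r"[;,]\s+", parenthesized.strip(), maxsplit=1)[0]


text = (
    "Whole-exome sequencing data have been submitted to the Database of "
    "Genotypes and Phenotypes (dbGaP; https://www.ncbi.nlm.nih.gov/gap) "
    "per NIH study protocol"
)

candidates = [split_short_form(m.group(1)) for m in PAREN.finditer(text)]
print(candidates)
```

The trimmed candidate could then be passed to the usual long-form matching step. Splitting on commas as well may be too aggressive for short forms that legitimately contain them, so that part is a design choice to validate against the data.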

How do the precision and recall change if we exclude `dbgap` from the calculation?

In [16]:
non_dbgap = stats_df_allen.loc[stats_df_allen["labels"] != "dbgap", :]

tp = (non_dbgap["stats"] == "TP").sum()
fp = (non_dbgap["stats"] == "FP").sum()
fn = (non_dbgap["stats"] == "FN").sum()
print("precision", tp / (tp + fp))
print("recall", tp / (tp + fn))

precision 0.09316747759051806
recall 0.6522783290147254
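
The same exclusion can be repeated for other labels. The helper below is a sketch that generalizes the calculation above to an arbitrary set of excluded labels; `precision_recall` is a hypothetical name, and the toy `labels`/`stats` lists are made up for illustration rather than taken from the evaluation.

```python
def precision_recall(labels, stats, exclude=()):
    """Micro precision and recall over parallel label/flag lists.

    Entries whose label is in `exclude` are dropped before counting.
    Returns NaN for a metric whose denominator would be zero.
    """
    excluded = set(exclude)
    kept = [stat for label, stat in zip(labels, stats) if label not in excluded]
    tp, fp, fn = (kept.count(flag) for flag in ("TP", "FP", "FN"))
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Toy data: two missed `dbgap` mentions dominate the false negatives.
labels = ["dbgap", "dbgap", "sass", "gwas", "pisa"]
stats = ["FN", "FN", "TP", "FP", "TP"]

print(precision_recall(labels, stats))
print(precision_recall(labels, stats, exclude=["dbgap"]))
```

With the dataframes in this notebook, the two lists would be the `labels` and `stats` columns of `stats_df_allen`.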




### Summary

This notebook explores setting a baseline for all models that use the
Schwartz-Hearst algorithm to extract datasets from context in the private
Kaggle dataset. From the experiments we can see that, although slower, the
AllenAI implementation of the Schwartz-Hearst algorithm gives a higher recall
than the faster implementation from Model 2.

The best recall (at least on the private Kaggle dataset) is $\approx$ 0.61.

If we exclude `dbgap` from the calculation, the best recall we can have is
$\approx$ 0.65.

So if the distribution of dataset references in the private Kaggle validation
set is similar to the distribution of dataset references in general, then other
methods like Kaggle Model 1's approach will likely be needed to improve the
recall.