# A Baseline for Models That Leverage the Schwartz-Hearst Algorithm

In the Kaggle competition, Kaggle Models 1 and 2 both used the Schwartz-Hearst algorithm [(Schwartz & Hearst, 2003)](https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf) for detecting abbreviations.

Model 2 uses this approach to extract all possible dataset candidates. These are
then filtered by a binary classifier and the algorithm's other heuristics. So,
for Kaggle Model 2 at least, this can serve as a Precision/Recall baseline.
This notebook specifically uses Model 2's implementation of the algorithm, which
is based on the [AllenAI scispacy
implementation](https://github.com/allenai/scispacy/blob/cc1a71701f1b83bf81d36c47c271ad5ee1c03d45/scispacy/abbreviation.py)
but may have introduced some bugs.
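
To make the extraction step concrete, here is a minimal sketch of the core idea from Schwartz & Hearst (2003): find a parenthesized short form, then match its characters right to left against the preceding words. It is a simplified illustration, not Model 2's or scispacy's implementation. The function names `find_best_long_form` and `extract_pairs` are hypothetical, and the candidate filters are assumptions based on the paper.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        char = short_form[s_idx].lower()
        if not char.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != char) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None  # ran out of text before matching every character
        l_idx -= 1
        s_idx -= 1
    # Extend the match back to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Return {short_form: long_form} for 'long form (SF)' patterns."""
    pairs = {}
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        # Assumed filters: 2-10 characters, at most two words, has a letter.
        if not (2 <= len(short_form) <= 10 and short_form[0].isalnum()):
            continue
        if len(short_form.split()) > 2 or not any(c.isalpha() for c in short_form):
            continue
        # Only look at a bounded window of words before the parenthesis.
        words = sentence[: match.start()].split()
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        long_form = find_best_long_form(short_form, " ".join(words[-max_words:]))
        if long_form:
            pairs[short_form] = long_form
    return pairs


print(extract_pairs(
    "GWAS data derived from the database of Genotypes and Phenotypes (dbGaP)"
))
```

A sentence that mentions only the bare acronym, with no parenthesized definition, yields an empty result in this sketch, because there is nothing for the pattern to anchor on.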

For Kaggle Model 1, this baseline is less directly informative because they used
the actual AllenAI scispacy code (which may or may not produce different
results) and additionally used a custom algorithm to extract more candidate
datasets. Further, they used the extracted entities to build up a repository of
text snippets for training their model. Their method should be evaluated in
another notebook to establish better baselines.

In [1]:
import json
from itertools import chain

import numpy as np
import pandas as pd
from tqdm import tqdm
from thefuzz import fuzz, process

from src.models.schwartz_hearst_model import SchwartzHearstModel
from src.evaluate.model import evaluate_kaggle_private

In [2]:
# the `scorer` and `processor` arguments are explained in the notebook
# `defining_a_match_1.ipynb`

evaluation = evaluate_kaggle_private(
    SchwartzHearstModel(),
    dict(),  # this model doesn't have any configuration params
    -1,  # no batching the data
    scorer=fuzz.partial_ratio,  # use fuzzy string matching
    processor=lambda s: s.lower(),  # convert to lowercase
)

WES|Whole-exome sequencing|SNVs|single-nucleotide variants|CNV|copy-number variant|CNVs|copy-number variants|WGS|whole-genome sequencing|MLPA|multiplex ligation-dependent probe amplification|GATK|genome analysis toolkit|http|have been deposited in the dbGaP database|IRB|Institutional Review Board dbgap
NCES|National Center for Education Statistics|SASS|Schools and Staffing Survey sass|sass data|school and staffing sass data|school and staffing survey|schools and staffing survey
 foodaps|national household food acquisition and purchase survey|arms|arms data|agricultural resources management survey
NIH|National Institutes of Health|GDS|Genomic Data Sharing|PI|Principal Investigator|IC|Institute or Center|DAC|Data Access Committee|DAR|Data Access Request|CSP|cloud service provider|PCS|Private Cloud System|IRB|Institutional Review Board|IaaS|Infrastructure as a Service|SaaS|Software as a Service|PaaS|Platform as a Service|IP|intellectual property database of genotypes and phenotypes|databa
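
As a rough illustration of why a `processor` and a fuzzy `scorer` are passed together, the sketch below compares a prediction with a label after preprocessing. It uses only the standard library: `partial_ratio_sketch` is a hypothetical stand-in built on `difflib`, not thefuzz's `partial_ratio`, and `is_match` with its threshold of 90 is an assumption, not necessarily how `evaluate_kaggle_private` combines the two arguments.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a, b):
    """Best 0-100 similarity between the shorter string and any equally
    long window of the longer string (stand-in for a partial-ratio scorer)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        chunk = longer[start : start + window]
        best = max(best, SequenceMatcher(None, shorter, chunk).ratio())
    return round(100 * best)


def is_match(prediction, label, scorer, processor, threshold=90):
    # The threshold is an illustrative assumption, not a value from this notebook.
    return scorer(processor(prediction), processor(label)) >= threshold


def lowercase(s):
    return s.lower()


def identity(s):
    return s


# With lowercasing, the cased prediction matches the lowercase label.
print(is_match("dbGaP", "dbgap", partial_ratio_sketch, lowercase))
# Without it, the case differences pull the score below the threshold.
print(is_match("dbGaP", "dbgap", partial_ratio_sketch, identity))
```

The lowercase processor matters here because the labels are stored in lowercase while the extracted short forms keep their original casing.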

`evaluation` is an instance of the `ModelEvaluation` class, which aggregates
some statistics about the model output for us.

In [7]:
evaluation


        Model Evaluation:

        - Run time: 318.7753312587738 seconds, avg: 0.03953557376395558 seconds per sample
        - True Postive Count: 6217, avg: 0.7710529579560957 per sample
        - Precision: 0.09721201507356965
        - Recall: 0.17481652279054072
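
For reference, precision and recall can be derived from per-entry TP/FP/FN tags like this. This is a minimal sketch that assumes micro-averaging over all entries; whether `ModelEvaluation` aggregates in exactly this way is an assumption, and `micro_precision_recall` and the toy rows are hypothetical.

```python
from collections import Counter


def micro_precision_recall(statistics):
    """Pool TP/FP/FN tags across all samples, then compute both metrics."""
    counts = Counter(tag for entry in statistics for tag in entry["stats"])
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy rows shaped like the `statistics` entries; the values are made up.
toy = [
    {"labels": ["dbgap", ""], "stats": ["FN", "FP"]},
    {"labels": ["sass", "NCES"], "stats": ["TP", "FP"]},
    {"labels": ["foodaps", "arms"], "stats": ["TP", "TP"]},
]
precision, recall = micro_precision_recall(toy)
print(precision, recall)  # 3 TP, 2 FP, 1 FN -> 0.6 0.75
```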
        

We can see that the recall is much lower than we would hope (~0.17). Let's take
a look at which datasets we are most commonly missing. To do so, we look at the
`output_statistics` dataframe.

In [5]:
# output_statistics is a dataframe with the following columns:
# ids: list of ids for the sample
# labels: list of labels given by the dataset
# statistics: list of dictionaries with the following keys:
#   labels: set union of the input labels and the model predictions
#   stats: list with TP, FP, or FN for each of the entries in `labels`
evaluation.output_statistics

Unnamed: 0,ids,labels,statistics
0,7bfd8bb51,dbgap,"{'labels': ['dbgap', 'CNVs', 'Whole-exome sequ..."
1,7d3e31302,sass|sass data|school and staffing sass data|s...,"{'labels': ['sass', 'sass data', 'school and s..."
2,3644f959a,foodaps|national household food acquisition an...,"{'labels': ['foodaps', 'national household foo..."
3,ed3527cf3,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
4,236061129,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
...,...,...,...
8058,3cb8b0e09,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
8059,324fe1310,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
8060,39e66a274,1000 genomes project|1000 genomes project 1000...,"{'labels': ['1000 genomes project', '1000 geno..."
8061,c411b1b6c,dbgap|american cancer society cancer preventio...,{'labels': ['american cancer society cancer pr...


In [4]:
stats = evaluation.output_statistics
# flatten the per-sample lists into one list of labels and one list of TP/FP/FN tags
all_labels = list(chain.from_iterable(x["labels"] for x in stats["statistics"]))
global_stats = list(chain.from_iterable(x["stats"] for x in stats["statistics"]))

In [5]:
stats_df = pd.DataFrame({"labels": all_labels, "stats": global_stats})
stats_df.loc[stats_df["stats"] == "FN", :].groupby("labels").count().sort_values(
    "stats", ascending=False
)

labels,stats
dbgap,3310
database of genotypes and phenotypes,928
program for international student assessment,799
database of genotypes and phenotypes dbgap,519
schools and staffing survey,359
...,...
genotype and phenotype data and analyses,1
genotype and phenotype correlation studies,1
genotype and phenotype,1
genotype and clinical data,1


We see that the most common false negative in this case is `dbgap`, which is
commonly written as `dbGaP` and stands for the "**d**ata**b**ase of
**G**enotypes **a**nd **P**henotypes". Let's look at some documents where
Schwartz-Hearst misses the dataset.



In [8]:
def model_missed_label(label, row):
    if label in row["labels"]:
        lbl_idx = row["labels"].index(label)
        return row["stats"][lbl_idx] == "FN"
    else:
        return False


missed_mask = stats["statistics"].apply(lambda row: model_missed_label("dbgap", row))
missed = stats.loc[missed_mask, :]
missed

Unnamed: 0,ids,labels,statistics
3,ed3527cf3,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
4,236061129,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
9,bf7f868aa,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
11,7d9c76c33,1 000 genomes|1 000 genomes reference data|dat...,"{'labels': ['1 000 genomes', '1 000 genomes re..."
13,bca78640a,database of genotypes and phenotypes|database ...,{'labels': ['database of genotypes and phenoty...
...,...,...,...
8055,44c7743d6,dbgap,"{'labels': ['dbgap', 'WGCNA', 'genome-wide ass..."
8056,5f8929ee0,dbgap,"{'labels': ['dbgap', 'genome-wide association ..."
8059,324fe1310,dbgap,"{'labels': ['dbgap', ''], 'stats': ['FN', 'FP']}"
8060,39e66a274,1000 genomes project|1000 genomes project 1000...,"{'labels': ['1000 genomes project', '1000 geno..."
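
To show what the mask selects, here is `model_missed_label` applied to a few toy rows. The first row mirrors the shape seen in the output above; the other two are made up for illustration.

```python
def model_missed_label(label, row):
    """True only when `label` is present in the row and tagged 'FN'."""
    if label in row["labels"]:
        return row["stats"][row["labels"].index(label)] == "FN"
    return False


rows = [
    {"labels": ["dbgap", ""], "stats": ["FN", "FP"]},     # label missed
    {"labels": ["dbgap", "WGS"], "stats": ["TP", "FP"]},  # label found
    {"labels": ["sass", "NCES"], "stats": ["FN", "FP"]},  # label not in this row
]
mask = [model_missed_label("dbgap", row) for row in rows]
print(mask)  # [True, False, False]
```

Only rows where the label is both expected and tagged `FN` are kept, so samples that never mention the dataset do not inflate the list of misses.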


First, the relevant snippet from `ed3527cf3.json`:

*The National Institutes of Health (NIH) has established NIH-designated data repositories (e.g., **database of Genotypes and Phenotypes (dbGaP)**, Sequence Read Archive (SRA), NIH Established Trusted Partnerships) for securely storing and sharing controlled-access human data submitted to NIH under the NIH Genomic Data Sharing (GDS) Policy.*

This entry is peculiar because the mention sits inside an enclosing set of parentheses, so the nesting may affect the algorithm. It is also probably an uncommon occurrence.

Next, the relevant snippet from `236061129.json`:

*The accession number for data generated from small RNA-sequencing of the maternal and non-maternal plasma samples reported in this paper is **dbGaP**: phs001892.v1.p1 (https://www.ncbi.nlm.nih.gov/gap/).*

This mention doesn't follow the expected `long form (short form)` pattern, so it is an expected false negative.

Next, the relevant snippet from `bf7f868aa.json`:

*35 All primary sequencing data will be made publicly available in **dbGAP**, accession number phs000759.v1.p1.*

This also doesn't follow the expected pattern. It may be that dbGaP is so well known that authors refer to it by acronym alone without defining it.

Finally, here is the snippet from `7d9c76c33.json`:

*We evaluate the performance of the proxy algorithm with data sets that were simulated using realistic linkage disequilibrium patterns obtained from the 1,000 Genomes project. Moreover, we successfully applied our approach to Type II diabetes (T2D) GWAS data derived from the **database of Genotypes and Phenotypes (dbGaP)***

This seems to be correctly formatted, so there is likely an issue with Kaggle Model 2's implementation of the Schwartz-Hearst algorithm.

Below we rerun the analysis, this time using the Schwartz-Hearst algorithm as implemented in AllenAI's scispacy. Note that this will take longer because of the time spaCy takes to parse a document, but it may split the text into sentences better than the simple implementation from Kaggle Model 2.