# Trying out a Regular Expression as a Baseline for High Recall

Kaggle Model 2 uses the Schwartz-Hearst (SH) algorithm to extract candidate 
entities and then classifies them with a binary classifier (see 
`explore_schwartz_heart_baseline.ipynb` for more info). The SH algorithm
misses any entity that doesn't match the pattern LONG FORM (ACRONYM). In the 
evaluation on the Kaggle private data set, this produced a recall of 0.65.
Because the classifier can only reject candidates, never add new ones, any
model that relies on the SH algorithm is capped at a recall of 0.65.

This notebook tries a regular-expression-based extraction method for getting 
candidates, which is more flexible than the SH algorithm.
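To make the limitation concrete, here is a minimal sketch. The pattern is a deliberately simplified stand-in for the `LONG FORM (ACRONYM)` shape, not the Schwartz-Hearst algorithm itself, and `find_long_form_acronym_pairs` and both example sentences are hypothetical, made up for illustration. It shows that a data set mentioned without a parenthesized acronym yields no candidate at all.

```python
import re

# Simplified stand-in for "LONG FORM (ACRONYM)": a run of capitalized words
# followed by an all-caps acronym in parentheses. This is NOT the real
# Schwartz-Hearst algorithm, only an illustration of the pattern's shape.
LONG_FORM_ACRONYM = re.compile(r"((?:[A-Z][\w'-]*\s+)+)\(([A-Z]{2,})\)")


def find_long_form_acronym_pairs(text):
    """Return (long form, acronym) pairs found in `text` (hypothetical helper)."""
    return [
        (match.group(1).strip(), match.group(2))
        for match in LONG_FORM_ACRONYM.finditer(text)
    ]


# Made-up sentences: the same data set, with and without the acronym.
with_acronym = "Data came from the Multicenter AIDS Cohort Study (MACS) visits."
without_acronym = "Data came from the Multicenter AIDS Cohort Study visits."

pairs_with = find_long_form_acronym_pairs(with_acronym)
pairs_without = find_long_form_acronym_pairs(without_acronym)

print(pairs_with)
print(pairs_without)
```

The second sentence produces an empty list, so no downstream classifier could recover that mention.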


In [5]:
from itertools import chain
import json

import pandas as pd
from thefuzz import fuzz, process

from src.models.regex_model import RegexModel
from src.evaluate.model import evaluate_kaggle_private

In [2]:
# the `scorer` and `processor` arguments are explained in the notebook
# `defining_a_match_1.ipynb`

evaluation = evaluate_kaggle_private(
    RegexModel(dict()),
    dict(),  # this model doesn't have any configuration params
    scorer=fuzz.partial_ratio,  # use fuzzy string matching
    processor=lambda s: s.lower(),  # convert to lowercase
)

In [3]:
evaluation


        Model Evaluation:

        - Run time: 3.1389195919036865 seconds, avg: 0.00038929921764897514 seconds per sample
        - True Postive Count: 16087, avg: 1.9951630906610442 per sample
        - Precision: 0.07309182936304198
        - Recall: 0.4545120641916709
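
The reported numbers can be turned back into raw counts to see where the errors come from. This is a rough sketch that assumes the evaluation uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN); the notebook does not state this, so treat the derived counts as estimates.

```python
# Values copied from the evaluation output above.
true_positives = 16087
precision = 0.07309182936304198
recall = 0.4545120641916709

# Assumption: precision = TP / (TP + FP) and recall = TP / (TP + FN).
n_predicted = true_positives / precision  # TP + FP: candidates the model emitted
n_labels = true_positives / recall        # TP + FN: ground-truth labels

false_positives = round(n_predicted) - true_positives
false_negatives = round(n_labels) - true_positives

print(f"predicted candidates: {round(n_predicted)}, false positives: {false_positives}")
print(f"ground-truth labels: {round(n_labels)}, false negatives: {false_negatives}")
```

Under that assumption the regex emits roughly 220,000 candidates against roughly 35,000 labels, which is why precision is so low even though fewer than half of the labels are found.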
        

In [6]:
stats = evaluation.output_statistics
# flatten the per-sample lists into one list of labels and one list of outcomes
all_labels = list(chain.from_iterable(row["labels"] for row in stats["statistics"]))
global_stats = list(chain.from_iterable(row["stats"] for row in stats["statistics"]))

In [7]:
stats_df = pd.DataFrame({"labels": all_labels, "stats": global_stats})
stats_df.loc[stats_df["stats"] == "FN", :].groupby("labels").count().sort_values(
    "stats", ascending=False
)

labels,stats
dbgap,3712
women s interagency hiv study,325
multicenter aids cohort study,276
pisa,263
women s interagency hiv study wihs,226
...,...
genetics neuroimaging and clinical data,1
genetic variants gene expression and phenotype data,1
genetic study,1
genetic epidemiology research on aging,1


The regex doesn't include acronyms, so `dbgap` was expected to be missed. Let's look at `women s interagency hiv study` instead.

In [9]:
def model_missed_label(label, row):
    if label in row["labels"]:
        lbl_idx = row["labels"].index(label)
        return row["stats"][lbl_idx] == "FN"
    else:
        return False


missed_mask = stats["statistics"].apply(lambda row: model_missed_label("women s interagency hiv study", row))
missed = stats.loc[missed_mask, :]
missed

Unnamed: 0,id,label,statistics
7,a3f023f3d,aids clinical trials group study a5078|multice...,{'labels': ['aids clinical trials group study ...
38,407584a01,hers|hiv epidemiology research study|wihs|wome...,"{'labels': ['hers', 'hiv epidemiology research..."
53,5b470b7ef,wihs|women s interagency hiv study|women s int...,"{'labels': ['wihs', 'women s interagency hiv s..."
76,7a5f5a8b6,hiv epidemiology research study|women s intera...,"{'labels': ['hiv epidemiology research study',..."
106,9dec63cff,wihs|women s interagency hiv study|women s int...,"{'labels': ['wihs', 'women s interagency hiv s..."
...,...,...,...
7968,62acc2610,women s interagency hiv study|women s interage...,"{'labels': ['women s interagency hiv study', '..."
7995,e231de2ba,women s interagency hiv study|women s interage...,"{'labels': ['women s interagency hiv study', '..."
8024,5684db52a,multicenter aids cohort study|women s interage...,"{'labels': ['multicenter aids cohort study', '..."
8036,3f93412d2,women s interagency hiv study|women s interage...,"{'labels': ['women s interagency hiv study', '..."


The missed sample from `5b470b7ef.json` looks like this:

*This investigation was a part of the **Women's Interagency HIV Study (WIHS)**, an ongoing U.S. multicenter prospective cohort investigation of HIV infection among HIV seropositive women and seronegative comparison women at risk for HIV.*

The current regex doesn't allow acronyms within the entity name (here, `HIV`). It also doesn't allow `'s` at the end of a word in a group (here, `Women's`), which it should.

The regex should be updated to accept acronyms inside the entity name itself,
not just at the end. Additionally, there should be a whitelist of ubiquitous
names or acronyms that are always caught.
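
The proposed changes could look something like the sketch below. Everything in it is an assumption for illustration: `TOKEN`, `CONNECTOR`, `CANDIDATE`, `WHITELIST`, `normalize`, and `extract_candidates` are hypothetical names, not the contents of `src.models.regex_model`, and the whitelist entries are just two frequently missed labels. The `normalize` step (lowercase, punctuation replaced by spaces) is a guess at how labels such as `women s interagency hiv study` were cleaned. The last sentence of the example text is made up to exercise the whitelist.

```python
import re

# One name token: a capitalized word, optionally possessive ("Women's"),
# or an acronym, optionally wrapped in parentheses ("HIV", "(WIHS)").
TOKEN = r"(?:\b[A-Z][a-z]+(?:'s)?\b|\(?\b[A-Z]{2,}\b\)?)"
# Lowercase connector words that may appear inside a name.
CONNECTOR = r"(?:of|and|for|the|in|on)"
# A candidate is at least two tokens, optionally joined by connectors.
CANDIDATE = re.compile(rf"{TOKEN}(?:\s+(?:{CONNECTOR}\s+)*{TOKEN})+")

# Illustrative whitelist of names to always catch, matched case-insensitively.
WHITELIST = ("dbgap", "pisa")
WHITELIST_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, WHITELIST)) + r")\b", re.IGNORECASE
)


def normalize(text):
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def extract_candidates(text):
    """Return the sorted, normalized, de-duplicated candidates found in `text`."""
    found = [m.group(0) for m in CANDIDATE.finditer(text)]
    found += [m.group(0) for m in WHITELIST_RE.finditer(text)]
    return sorted({normalize(candidate) for candidate in found})


text = (
    "This investigation was a part of the Women's Interagency HIV Study (WIHS), "
    "an ongoing U.S. multicenter prospective cohort investigation. "
    "Genotypes are available through dbGaP."
)
candidates = extract_candidates(text)
print(candidates)
```

The pattern picks up the whole span `Women's Interagency HIV Study (WIHS)` as one candidate; whether that also counts as a hit for the shorter label depends on the `scorer` used in the evaluation.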