# AHLT - Lab - NERC Baseline

This notebook contains the deliverables for the AHLT Lab NERC Baseline assignment, corresponding to Goals 1 and 2.
The notebook contains the following sections:

- [Entity extractor function *extract entities*](#function), with subset of features function to achieve Goals 1 and 2.
    
- [Evaluator output for Devel/Test sets for Goal 1.](#output_1)
- [Evaluator output for Devel/Test sets for Goal 2.](#output_2)
    

<a id='function'></a>
## Entity extractor function *extract_entities*

To better detect the different kinds of entities, by devised rules that characterise each type's tokens, on average, from the others. For instance, upper cased tokens are more probable to be brands.

This way of creating rules allows for fast improvement of the F1 score up to a certain point, given the rules are not absolute and each one may introduce incorrect detections.

Our most successful rules were the ones with regards to the suffixes/prefixes, where we took the most common 5-character suffixes/prefixes for each type and match if the token ends/begins with those strings. As the frequencies of each suffix/prefix were taken from the *Train* data-set entities, we expect them to generalise worse than other rules for the *Test* data-set, given the samples in Train data-set do not have the same distribution of suffixes/prefixes.

In [2]:
def extract_entities(token_list):
    """
    Extract entitites.
    Fuction to extract and tag the entites of the give token lists, taggin each
    foun entity with a type given a set of rules.

    Args:
        - token_list: list of token strings with token words
    Returns:
        - ents: list of dictionaries with entities' name, type and offset.
    """
    # For use in entity recognition rules
    # Common drug suffixes
    with open("drug_suffixes.txt", "r") as fp:
        drug_suffixes = [s.replace("\n", "") for s in fp.readlines()]
    with open("group_suffixes.txt", "r") as fp:
        group_suffixes = [s.replace("\n", "") for s in fp.readlines()]
    with open("brand_suffixes.txt", "r") as fp:
        brand_suffixes = [s.replace("\n", "") for s in fp.readlines()]
    with open("drug_n_suffixes.txt", "r") as fp:
        drug_n_suffixes = [s.replace("\n", "") for s in fp.readlines()]
    with open("data/Rules/drug_prefixes.txt", "r") as fp:
        drug_prefixes = [s.replace("\n", "") for s in fp.readlines()]
    with open("data/Rules/group_prefixes.txt", "r") as fp:
        group_prefixes = [s.replace("\n", "") for s in fp.readlines()]
    with open("data/Rules/brand_prefixes.txt", "r") as fp:
        brand_prefixes = [s.replace("\n", "") for s in fp.readlines()]
    with open("data/Rules/drug_n_prefixes.txt", "r") as fp:
        drug_n_prefixes = [s.replace("\n", "") for s in fp.readlines()]
    # Init output list
    ents = []
    # Iterate over index of token_list and detect entity type with rules
    # if no entity type detected, token is discarded and next token evaluated.
    i = 0
    i_max = len(token_list)
    while i < i_max:
        # Take token info
        token, start, end = token_list[i]
        # We take next token and previous token (if any) to evaluate
        # composite entity names.
        nxt_token, nxt_stat, nxt_end = token_list[i+1] if i < (i_max-1) else \
            ("EOS", inf, inf)
        prv_token, prv_stat, prv_end = token_list[i-1] if i > 0 else \
            ("BOS", inf, inf)
        type = None
        # Rules to detect if token is entity and which type
        # Detect "XX acid" drugs
        if nxt_token == "acid":
            type = "drug"
            token = f"{token} {nxt_token}"
            end = nxt_end
            i += 1
        # Detect "XX agents", "XX drugs" and "XX drug" groups
        elif ((nxt_token == "agents") or (nxt_token == "drugs")):
            type = "group"
            token = f"{token} {nxt_token}"
            end = nxt_end
            i += 1
        # Detect drug_n with usual suffixes
        elif (
              token.endswith(tuple(drug_n_suffixes)) or
              token.startswith(tuple(drug_n_prefixes))
              # Features for Goal 2
              # Tokens of type drug_n have on average more number of
              # non-alphanumeric symbols.
              or sum(w in ["+", "-", "(", ")"] for w in token)
              ):
            type = "drug_n"
        # Detect common brand suffixes
        elif (
            token.endswith(tuple(brand_suffixes)) or
            token.startswith(tuple(group_prefixes))
            # Features for Goal 2
            # Uppercase brand names
            # Avoid numerals and acronyms by limiting length
            or token.isupper() and len(token) > 4
            ):
            type = "brand"
        # Detect common group suffixes
        elif (
            token.endswith(tuple(group_suffixes)) or
            token.startswith(tuple(group_prefixes))
            # Features for Goal 2
            # Detect plural acronyms i.e. ADs
            or match(r"[A-Z]+s$", token)
            ):
            type = "group"
        # Detect common drug suffixes
        elif (
            token.endswith(tuple(drug_suffixes)) or 
            token.startswith(tuple(drug_prefixes))
            ):
            type = "drug"
        # If type was set, then it's an entity
        if type is not None:
            ent = {"name": token, "offset": f"{start}-{end}", "type": type}
            ents.append(ent)
        # Pass to next token
        i += 1
    return ents

<a id='output_1'></a>
## Evaluator output for Devel/Test sets for Goal 1

### Output for the Devel data-set with features indicated to obtain Goal 1

With the subset of features indicated in the previous section as minimal features, we obtain a F1 average score of 0.5 with the Devel data-set:

#### SCORES FOR THE GROUP: BASELINE 

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
787|349|0|635|396|1771|0.51|0.44|0.48

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
934|202|0|635|396|1771|0.61|0.53|0.57

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
934|0|202|635|396|1771|0.61|0.58|0.6

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
913|223|0|635|396|1771|0.6|0.52|0.55


#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
494|17|0|534|48|1045|0.88|0.47|0.62

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
112|3|0|65|38|180|0.73|0.62|0.67

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
164|106|0|184|98|454|0.45|0.36|0.4

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
17|0|0|75|7|92|0.71|0.18|0.29


#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.69|0.41|0.5


We see **drug** and **brand** have F1 scores over **0.6**, meaning our rules capture well this types of entities, while for **group** and **drug_n** we have much lower scores of **0.4** and **0.39**. This was to be expected since these last types have more common multiple token entities that are not well detected through our rules. 

### Output for the Test data-set with features indicated to obtain Goal 1

We know apply these minimal features to the Test data-set to see how well they generalise:

#### SCORES FOR THE GROUP: BASELINE 

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
252|102|0|332|234|686|0.43|0.37|0.4

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
292|62|0|332|234|686|0.5|0.43|0.46

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
292|0|62|332|234|686|0.5|0.47|0.48

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
293|61|0|332|234|686|0.5|0.43|0.46


#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
164|18|0|169|29|351|0.78|0.47|0.58

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
32|0|0|27|7|59|0.82|0.54|0.65

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
56|16|0|83|55|155|0.44|0.36|0.4

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
0|7|0|114|1|121|0|0|0


#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.51|0.34|0.41


As expected, when applying the extracted rules for recognizing and classifying our entities to the Test data-set, we realize the metrics go down below the intended threshold. This is due to the fact that our rules overfit the data in our development data-set but have a big generalization error, and thus do not apply well in the general case.

In particular, we see the greatest deviation with the validation metrics in the F1 score for **drug_n**.

<a id='output_2'></a>
## Evaluator output for Devel/Test sets for Goal 2

### Output for the Devel data-set with features indicated to obtain Goal 2

In this case we add the extra features to achieve the maximum F1 score on the Devel data-set:

#### SCORES FOR THE GROUP: BASELINE RUN=1

Strict matching (boundaries + type)

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|797|441|0|533|664|1771|0.42|0.45|0.43|

Exact matching

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|930|308|0|533|664|1771|0.49|0.53|0.51|

Partial matching

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|930|0|308|533|664|1771|0.49|0.61|0.54|

type matching

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|907|331|0|533|664|1771|0.48|0.51|0.49|


#### SCORES FOR ENTITY TYPE

Exact matching on drug

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|485|10|0|550|42|1045|0.9|0.46|0.61|

Exact matching on brand

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|111|0|0|69|26|180|0.81|0.62|0.7|

Exact matching on group

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|147|91|0|216|59|454|0.49|0.32|0.39|

Exact matching on drug_n

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|54|9|0|29|70|92|0.41|0.59|0.48|

#### MACRO-AVERAGE MEASURES:

|P|R|F1|
|--|--|--|
|0.65|0.5|0.55|

Compared to the previous section, we know have a significantly better F1 for the previous lower types **group** and **drug_n**, with a higher score for **brand** too, while not changing the F1 score for **drug**.

### Output for the Test data-set with features indicated to obtain Goal 2

Like before, we see how the extra features generalise with the Test data-set:

#### SCORES FOR THE GROUP: BASELINE RUN=2

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
255|146|0|285|401|686|0.32|0.37|0.34

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
294|107|0|285|401|686|0.37|0.43|0.4

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
294|0|107|285|401|686|0.37|0.51|0.43

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
311|90|0|285|401|686|0.39|0.45|0.42

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
164|7|0|180|28|351|0.82|0.47|0.6

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
32|0|0|27|2|59|0.94|0.54|0.69

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
49|15|0|91|30|155|0.52|0.32|0.39

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
10|35|0|76|73|121|0.08|0.08|0.08

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.59|0.35|0.44

Although we obtain a significantly bettwe score, we still have a high generalization error. Again, being the main discrepancy the F1 score for **drug_n** type.