# AHLT - Lab - NERC ML

This notebook contains the deliverables for the AHLT Lab NERC Machine Learning assignment, corresponding to Goals 3 and 4.
The notebook contains the following sections:

- [Feature extractor function *extract_features*](#features), with the feature subsets used to achieve Goals 3 and 4.
- [Classifier function *classifier*](#classifier)
- [Output generator function *output_entities*](#output)
- [Evaluator output for Devel/Test sets for Goal 3.](#output_1)
- [Evaluator output for Devel/Test sets for Goal 4.](#output_2)

<a id='features'></a>
## Feature extractor function *extract_features*

To improve upon the baseline entity classification obtained through rule-based entity recognition, we devise a set of features with which we train three different classifiers to detect our entities.

The three classifiers we used are:

- **Conditional Random Fields classifier** (**CRF**), through its implementation in the *pycrfsuite* Python package.
- **Maximum Entropy classifier** (**MaxEnt**), through its implementation as a command-line executable (*megam*); details [here](http://users.umiacs.umd.edu/~hal/megam/version0_3/).
- **Random Forest classifier** (**RandomForest**), through its implementation in the *scikit-learn* Python package.

For each classifier, we devise a set of features that characterise the tokens, so that they can be classified into drug entity types following the B-I-O scheme. That is, we have 9 classes to classify tokens into: eight **B**(eginning) or **I**(nternal) tags, two for each of the four entity types (e.g. B-drug, I-group); and one **O** tag for non-entities.
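
As an illustration of the tagging scheme described above, the following minimal sketch enumerates the nine classes and labels a toy sentence. The entity types are the four that appear in the evaluator tables of this notebook; the helper `bio_tags` and its span-based input format are assumptions made for this example, not part of the lab code.

```python
# Entity types reported in the evaluator tables of this notebook.
ENTITY_TYPES = ["drug", "brand", "group", "drug_n"]

# Eight B-/I- tags plus the single O tag give the nine classes.
BIO_CLASSES = [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")] + ["O"]


def bio_tags(tokens, entities):
    """Label tokens with B-I-O tags.

    `entities` maps (first_token_index, last_token_index) -> entity type.
    This input format is hypothetical and only used for illustration.
    """
    tags = ["O"] * len(tokens)
    for (first, last), etype in entities.items():
        tags[first] = f"B-{etype}"
        for k in range(first + 1, last + 1):
            tags[k] = f"I-{etype}"
    return tags


tokens = ["Calcium", "channel", "blockers", "and", "aspirin"]
tags = bio_tags(tokens, {(0, 2): "group", (4, 4): "drug"})
print(list(zip(tokens, tags)))
```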

In [1]:
from re import match


def extract_features(token_list):
    """
    Extract Features
    Function to extract features from each token of the given token list.
    The feature subset is selected by the global variable `model`.
    Args:
        - token_list: list of (token, start, end) tuples, one per token
    Returns:
        - features: list of feature lists, one per token of the given list.
    """
    features = []
    for i, token_t in enumerate(token_list):
        token, start, end = token_t
        # Token form
        form = f"form={token}"
        # Suffix's 4 last letters
        suf4 = token[-4:]
        suf4 = f"suf4={suf4}"
        # Suffix's 3 last letters
        suf3 = token[-3:]
        suf3 = f"suf3={suf3}"
        # Suffix's 2 last letters
        suf2 = token[-2:]
        suf2 = f"suf2={suf2}"
        # Prefix's 4 first letters
        pre4 = token[:4]
        pre4 = f"pre4={pre4}"
        # Prefix's 3 first letters
        pre3 = token[:3]
        pre3 = f"pre3={pre3}"
        # Prefix's 2 first letters
        pre2 = token[:2]
        pre2 = f"pre2={pre2}"
        # Prev token
        if i == 0:
            prev = "prev=_BoS_"
        else:
            prev = f"prev={token_list[i - 1][0]}"
        # Next token
        if i == (len(token_list) - 1):
            nxt = "next=_EoS_"
        else:
            nxt = f"next={token_list[i + 1][0]}"
        # All token in capital letters
        capital = f"capital={token.isupper()}"
        # Begin with capital letter
        b_capital = f"b_capital={token[0].isupper()}"
        # Number of capitals in token
        capitals = str(sum(ch.isupper() for ch in token))
        # Number of digits in token
        digits = str(sum(ch.isdigit() for ch in token))
        # Number of hyphens in token
        hyphens = str(token.count("-"))
        # Token length
        leng = str(len(token))
        # Token starts with a Digit-Capital combination (digits, hyphen, capitals)
        dig_cap = bool(match(r"\d+-[A-Z]+", token))
        dig_cap = f"dig_cap={dig_cap}"
        # Feats list
        if model == "MaxEnt":
            feats = [form, pre2, pre3, pre4, suf2, suf4]
        elif model == "CRF":
            feats = [form, capital, nxt, pre2, pre3, pre4, suf2, suf4,
                     capitals, hyphens, leng]
        elif model == "RandomForest":
            feats = [b_capital, capital, dig_cap, suf2,
                     capitals, digits, hyphens, leng]
        else:
            feats = [form, b_capital, capital, dig_cap,
                     nxt, pre2, pre3, pre4, prev, suf2, suf3, suf4,
                     capitals, digits, hyphens, leng]
        features.append(feats)
    return features
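
To make the surface features concrete, here is a small standalone sketch that recomputes a few of them for a single token string. The helper name `token_shape_features` and the example tokens are hypothetical; the slicing and the regular expression mirror the ones in `extract_features`.

```python
import re


def token_shape_features(token):
    """Recompute a few of the surface features for a single token string."""
    return {
        "pre3": token[:3],
        "suf3": token[-3:],
        "capitals": sum(ch.isupper() for ch in token),
        "digits": sum(ch.isdigit() for ch in token),
        "hyphens": token.count("-"),
        # re.match anchors at the start of the token, as in extract_features
        "dig_cap": bool(re.match(r"\d+-[A-Z]+", token)),
    }


print(token_shape_features("5-HT"))
print(token_shape_features("Aspirin"))
```

Note that `re.match` only tests the beginning of the token, so a digit-capital pattern appearing later in the token would not set `dig_cap`.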

<a id='classifier'></a>
## Classifier function *classifier*

The classifier function takes the generated features for the data and the trained model selected by the **model** parameter, and outputs the predictions given by that model. The different prediction formats of each model type are normalized into the same format (one list of tags per sentence) and finally passed on to the [output_entities](#output) function.

The *classifier* function makes use of an auxiliary function, *get_sentence_features*, to extract the features from the input file; it is attached after the main function's body.
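
The normalization step can be sketched in isolation. Models that return one flat list of labels for the whole file need their output cut back into per-sentence lists, using the sentence lengths. The helper name `regroup_by_sentence` and the toy data are assumptions for this example.

```python
def regroup_by_sentence(flat_labels, sentences):
    """Split one flat label list into per-sentence lists, preserving order."""
    grouped, start = [], 0
    for sent in sentences:
        end = start + len(sent)
        grouped.append(list(flat_labels[start:end]))
        start = end
    return grouped


# Two toy sentences with three and two tokens respectively.
sentences = [["tok1", "tok2", "tok3"], ["tok4", "tok5"]]
flat = ["B-drug", "O", "O", "B-brand", "I-brand"]
grouped = regroup_by_sentence(flat, sentences)
print(grouped)
```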

In [None]:
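from os import system
import pickle

import numpy as np
import pycrfsuite


def classifier(model, feature_input, model_input, outputfile):
    # `tmp_path` and `megam` are expected to be defined as notebook globals.
    sentences, X_valid, Y_valid = get_sentence_features(feature_input)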
    if model == "CRF":
        # CRF classifier flow
        tagger = pycrfsuite.Tagger()
        tagger.open(f"{model_input}.crfsuite")
        predictions = [tagger.tag(x) for x in X_valid]

    elif model == "MaxEnt":
        # MaxEnt classifier flow
        megam_features = f"{tmp_path}/megam_valid_features.dat"
        megam_predictions = f"{tmp_path}/megam_predictions.dat"
        system(f"cat {feature_input} | cut -f5- | grep -v '^$' > \
            {megam_features}")
        system(f"./{megam} -nc -nobias -predict {model_input}.megam multiclass\
            {megam_features} > {megam_predictions}")
        with open(megam_predictions, "r") as fp:
            lines = fp.readlines()
        pred_classes = [line.split("\t")[0] for line in lines]
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    elif model == "RandomForest":
        with open(f"{model_input}.randomForest", "rb") as fp:
            # Use a separate name so the `model` string is not overwritten
            rf_model, encoder = pickle.load(fp)
        # Flatten sentences into a single list of tokens; the first four
        # features are categorical and the remaining ones numeric
        x_cat = []
        x_num = []
        for x_sent in X_valid:
            x_cat_sent = [f[:4] for f in x_sent]
            x_num_sent = [f[4:] for f in x_sent]
            x_cat.extend(x_cat_sent)
            x_num.extend(x_num_sent)
        # One hot encoder to turn categorical variables to binary
        x_encoded = encoder.transform(x_cat).toarray()
        x = np.concatenate((x_encoded, x_num), axis=1)
        pred_classes = rf_model.predict(x)
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    else:
        print(f"[ERROR] Model {model} not implemented")
        raise NotImplementedError
    # Output entities for each sentence
    with open(outputfile, "w") as out:
        for sent, classes in zip(sentences, predictions):
            id = sent[0][0]
            tokens = [(word[1], word[2], word[3]) for word in sent if word]
            output_entities(id, tokens, classes, out)

In [None]:
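def get_sentence_features(input):
    """
    Read a tab-separated feature file with sentences separated by blank lines.
    Returns the full token rows, the feature lists (X) and the tags (Y),
    each grouped per sentence.
    """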
    with open(input, "r") as fp:
        lines = fp.read()
    sentences = lines.split("\n\n")[:-1]
    X_feat = []
    Y_feat = []
    full_tokens = []
    for sent in sentences:
        tokens = sent.split("\n")
        feats = [token.split("\t") for token in tokens if len(token)]
        x = [f[5:] for f in feats if len(f)]
        # Turn back numeric variables
        # only for RandomForest model
        if model == "RandomForest":
            for i, token in enumerate(x):
                val = [int(elem) if elem.isdigit() else elem for elem in token]
                x[i] = val
        y = [f[4] for f in feats if len(f)]
        full_tokens.append(feats)
        X_feat.append(x)
        Y_feat.append(y)
    return full_tokens, X_feat, Y_feat
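
The column layout that this function relies on can be read off the indices used in the code: the sentence id, the token, its start and end offsets, the gold tag, and then the features. A minimal sketch of parsing one such line follows; the helper name `parse_feature_line` and the sample line (id, offsets and feature values) are made up for illustration.

```python
def parse_feature_line(line):
    """Split one tab-separated line into token fields, gold tag and features."""
    fields = line.rstrip("\n").split("\t")
    sent_id, token, start, end, tag = fields[:5]
    return {
        "sent_id": sent_id,
        "token": token,
        "start": start,
        "end": end,
        "tag": tag,
        "features": fields[5:],
    }


# Hypothetical line; real files are produced by the feature extractor.
line = "s0\tAspirin\t0\t6\tB-drug\tform=Aspirin\tsuf2=in\n"
parsed = parse_feature_line(line)
print(parsed["tag"], parsed["features"])
```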

<a id='output'></a>
## Output generator function *output_entities*

This function receives the token list **tokens** for each sentence, identified by the **id** parameter, and their predicted classes **classes**, in the B-I-O class convention, and writes the detected entities for the given sentence to the output file object **outf**.

In [None]:
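def output_entities(id, tokens, classes, outf):
    """
    Merge B-/I- tagged tokens into entities and write one line per entity,
    formatted as id|start-end|text|type, to outf.
    """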
    ind = 0
    while ind < len(tokens):
        tag = classes[ind]
        type = tag.split("-")[-1]
        if tag == "O":
            ind += 1
            continue
        elif "B" in tag:  # If Beginning of an entity
            name, start, end = tokens[ind]
            # Check if next token I-same_type
            # Continue search until EoS or no-match
            ind += 1
            tag_nxt = classes[ind] if ind < len(tokens) else "O"
            type_nxt = tag_nxt.split("-")[-1]
            while ind < len(tokens) and "I" in tag_nxt and type_nxt == type:
                name_nxt, _, end_nxt = tokens[ind]
                name = f"{name} {name_nxt}"
                end = end_nxt
                ind += 1
                tag_nxt = classes[ind] if ind < len(tokens) else "O"
                type_nxt = tag_nxt.split("-")[-1]
        else:  # I-tag
            name, start, end = tokens[ind]
            ind += 1
        # Print entity and continue
        offset = f"{start}-{end}"
        txt = f"{id}|{offset}|{name}|{type}\n"
        outf.write(txt)
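
The decoding logic can be tried without an output file using the simplified sketch below, which returns the lines instead of writing them. The function name `bio_to_entities`, the sentence id and the character offsets are hypothetical; the merging rule (a B- tag followed by I- tags of the same type forms one entity, and a stray I- tag becomes a single-token entity) follows the function above.

```python
def bio_to_entities(sent_id, tokens, tags):
    """Merge B-/I- tagged tokens into `id|start-end|text|type` lines."""
    lines, ind = [], 0
    while ind < len(tokens):
        tag = tags[ind]
        if tag == "O":
            ind += 1
            continue
        etype = tag.split("-")[-1]
        name, start, end = tokens[ind]
        ind += 1
        if tag.startswith("B"):
            # Absorb the following I- tokens of the same type.
            while ind < len(tokens) and tags[ind] == f"I-{etype}":
                name = f"{name} {tokens[ind][0]}"
                end = tokens[ind][2]
                ind += 1
        lines.append(f"{sent_id}|{start}-{end}|{name}|{etype}")
    return lines


tokens = [("calcium", 0, 6), ("channel", 8, 14), ("blockers", 16, 23),
          ("and", 25, 27), ("aspirin", 29, 35)]
tags = ["B-group", "I-group", "I-group", "O", "B-drug"]
entities = bio_to_entities("s0", tokens, tags)
print("\n".join(entities))
```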

<a id='output_1'></a>
## Evaluator output for Devel/Test sets for Goal 1

### Output for the Devel data-set with features indicated to obtain Goal 1

With the subset of features indicated in the previous section as minimal features, we obtain a macro-average F1 score of 0.5 on the Devel data-set:

#### SCORES FOR THE GROUP: BASELINE RUN=1

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
776|310|0|685|296|1771|0.56|0.44|0.49

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
912|174|0|685|296|1771|0.66|0.51|0.58

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
912|0|174|685|296|1771|0.66|0.56|0.61

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
889|197|0|685|296|1771|0.64|0.5|0.56
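
For reference, the precision, recall and F1 values of the strict and exact rows above can be reproduced from the counts. The formulas below are an assumption that happens to match those two rows (where `par` is 0); the evaluator's own code is not shown in this notebook, and the partial-matching row is not covered by this sketch.

```python
def prf(cor, inc, mis, spu):
    """Precision, recall and F1 from evaluator counts, assuming par == 0."""
    predicted = cor + inc + spu  # entities produced by the system
    expected = cor + inc + mis   # entities in the gold standard
    prec = cor / predicted
    rec = cor / expected
    f1 = 2 * prec * rec / (prec + rec)
    return round(prec, 2), round(rec, 2), round(f1, 2)


strict = prf(cor=776, inc=310, mis=685, spu=296)
exact = prf(cor=912, inc=174, mis=685, spu=296)
print(strict, exact)
```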

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
495|16|0|534|44|1045|0.89|0.47|0.62

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
112|3|0|65|37|180|0.74|0.62|0.67

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
152|94|0|208|61|454|0.5|0.33|0.4

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
17|0|0|75|7|92|0.71|0.18|0.29

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.71|0.4|0.5

We see that **drug** and **brand** have F1 scores over **0.6**, meaning our rules capture these types of entities well, while for **group** and **drug_n** we have much lower scores of **0.4** and **0.29**. This was to be expected, since entities of these last two types are more often multi-token and are not well detected by our rules.
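
The macro-average figures appear to be plain means of the per-type values. Averaging the rounded per-type scores from the tables above gives about 0.71, 0.40 and 0.495, consistent with the reported 0.71, 0.4 and 0.5. The sketch below shows that arithmetic; the helper name `macro_average` is hypothetical, and small differences can arise because the inputs are already rounded.

```python
# (precision, recall, F1) per entity type, copied from the Devel tables.
devel = {
    "drug": (0.89, 0.47, 0.62),
    "brand": (0.74, 0.62, 0.67),
    "group": (0.5, 0.33, 0.4),
    "drug_n": (0.71, 0.18, 0.29),
}


def macro_average(per_type):
    """Unweighted mean of each metric over the entity types."""
    n = len(per_type)
    return tuple(sum(s[i] for s in per_type.values()) / n for i in range(3))


p, r, f1 = macro_average(devel)
print(round(p, 3), round(r, 3), round(f1, 3))
```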

### Output for the Test data-set with features indicated to obtain Goal 1

We now apply these minimal features to the Test data-set to see how well they generalise:

#### SCORES FOR THE GROUP: BASELINE RUN=2

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
245|92|0|349|167|686|0.49|0.36|0.41

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
283|54|0|349|167|686|0.56|0.41|0.48

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
283|0|54|349|167|686|0.56|0.45|0.5

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
278|59|0|349|167|686|0.55|0.41|0.47

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
164|11|0|176|28|351|0.81|0.47|0.59

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
32|0|0|27|6|59|0.84|0.54|0.66

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
49|15|0|91|30|155|0.52|0.32|0.39

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
0|7|0|114|1|121|0|0|0

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.54|0.33|0.41

As expected, when applying the extracted rules for recognizing and classifying our entities to the Test data-set, the metrics drop below the intended threshold. Our rules overfit the development data-set and have a large generalization error, so they do not apply well in the general case.

In particular, the greatest deviation from the Devel metrics is in the F1 score for **drug_n**, which falls from 0.29 to 0.
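
The per-type gap between the two data-sets can be tabulated directly from the F1 values reported above. This is only a convenience sketch over the rounded table values.

```python
# Per-type F1 scores copied from the Devel and Test tables of this section.
devel_f1 = {"drug": 0.62, "brand": 0.67, "group": 0.4, "drug_n": 0.29}
test_f1 = {"drug": 0.59, "brand": 0.66, "group": 0.39, "drug_n": 0.0}

# Drop in F1 when moving from Devel to Test, rounded to two decimals.
drops = {t: round(devel_f1[t] - test_f1[t], 2) for t in devel_f1}
worst = max(drops, key=drops.get)
print(drops, worst)
```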

<a id='output_2'></a>
## Evaluator output for Devel/Test sets for Goal 2

### Output for the Devel data-set with features indicated to obtain Goal 2

In this case we add the extra features to achieve the maximum F1 score on the Devel data-set:

#### SCORES FOR THE GROUP: BASELINE RUN=1

Strict matching (boundaries + type)

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|797|441|0|533|664|1771|0.42|0.45|0.43|

Exact matching

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|930|308|0|533|664|1771|0.49|0.53|0.51|

Partial matching

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|930|0|308|533|664|1771|0.49|0.61|0.54|

type matching

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|907|331|0|533|664|1771|0.48|0.51|0.49|


#### SCORES FOR ENTITY TYPE

Exact matching on drug

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|485|10|0|550|42|1045|0.9|0.46|0.61|

Exact matching on brand

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|111|0|0|69|26|180|0.81|0.62|0.7|

Exact matching on group

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|147|91|0|216|59|454|0.49|0.32|0.39|

Exact matching on drug_n

|cor|inc|par|mis|spu|total|prec|recall|F1|
|---|---|---|---|---|----|---|---|---|
|54|9|0|29|70|92|0.41|0.59|0.48|

#### MACRO-AVERAGE MEASURES:

|P|R|F1|
|--|--|--|
|0.65|0.5|0.55|

Compared to the previous section, we now have a significantly better F1 for **drug_n** (0.48 versus 0.29) and a slightly higher score for **brand** (0.7 versus 0.67), while the F1 scores for **group** (0.39) and **drug** (0.61) remain essentially unchanged.

### Output for the Test data-set with features indicated to obtain Goal 2

Like before, we see how the extra features generalise with the Test data-set:

#### SCORES FOR THE GROUP: BASELINE RUN=2

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
255|146|0|285|401|686|0.32|0.37|0.34

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
294|107|0|285|401|686|0.37|0.43|0.4

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
294|0|107|285|401|686|0.37|0.51|0.43

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
311|90|0|285|401|686|0.39|0.45|0.42

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
164|7|0|180|28|351|0.82|0.47|0.6

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
32|0|0|27|2|59|0.94|0.54|0.69

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
49|15|0|91|30|155|0.52|0.32|0.39

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
10|35|0|76|73|121|0.08|0.08|0.08

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.59|0.35|0.44

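Although we obtain a better macro-average F1 score than with the minimal features (0.44 versus 0.41), we still have a high generalization error. Again, the main discrepancy is the F1 score for the **drug_n** type, which falls from 0.48 on the Devel data-set to 0.08 on the Test data-set.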