# AHLT - Lab - NERC ML

Ri

This notebook contains the deliverables for the AHLT Lab NERC Machine Learning assignment, corresponding to Goals 3 and 4.
The notebook contains the following sections:

- [Feature extractor function *extract_features*](#features), with the feature subsets used to achieve Goals 3 and 4.
- [Classifier function *classifier*](#classifier)
- [Output generator function *output_entities*](#output)
- [Evaluator output for Devel/Test sets for Goal 3.](#output_1)
- [Evaluator output for Devel/Test sets for Goal 4.](#output_2)

<a id='features'></a>
## Feature extractor function *extract_features*

To improve upon the baseline entity classification through rule-based entity recognition, we devise a set of features with which we train three different classifiers to detect our entities.

The three classifiers we used are:

- **Conditional Random Fields classifier** (**CRF**), through its implementation in the *pycrfsuite* Python package.
- **Maximum Entropy classifier** (**MaxEnt**), through its implementation as a command-line executable (MegaM); details [here](http://users.umiacs.umd.edu/~hal/megam/version0_3/).
- **Random Forest classifier** (**RandomForest**), through its implementation in the *scikit-learn* (*sklearn*) Python package.

For each classifier, we devise a set of features that characterise each token, and classify the tokens into drug entity types following the B-I-O scheme. That is, we have 9 classes to classify tokens into: eight **B**(eginning) or **I**(nternal) tags, two for each of the four types (e.g. B-drug, I-group), and an **O** tag for non-entities.
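
As a small illustration of the tag set described above, the nine classes can be enumerated from the four entity types that appear in the evaluation tables. This is only a sketch: the names `ENTITY_TYPES` and `bio_tags` are illustrative and are not part of the lab code.

```python
# Entity types as they appear in the evaluation tables of this notebook.
ENTITY_TYPES = ["drug", "brand", "group", "drug_n"]


def bio_tags(entity_types):
    """Build the B-I-O tag set: B-<type> and I-<type> per type, plus O."""
    tags = [f"{prefix}-{etype}"
            for etype in entity_types
            for prefix in ("B", "I")]
    return tags + ["O"]


tags = bio_tags(ENTITY_TYPES)
print(len(tags))  # 9
print(tags)
```

Each token of a sentence receives exactly one of these tags, so a two-token group mention would be labelled `B-group` followed by `I-group`.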

In [11]:
import re


def extract_features(token_list):
    """
    Extract Features
    Function to extract features from each token of the given token list.
    Relies on the global variable `model` to select the feature subset.
    Args:
        - token_list: list of (token, start, end) tuples, one per token
    Returns:
        - features: list of lists of features, one list per token.
    """
    features = []
    for i, token_t in enumerate(token_list):
        token, start, end = token_t
        # Token form
        form = f"form={token.lower()}"
        # Suffix's 4 last letters
        suf4 = f"suf4={token[-4:].lower()}"
        # Suffix's 3 last letters
        suf3 = f"suf3={token[-3:]}"
        # Suffix's 2 last letters
        suf2 = f"suf2={token[-2:]}"
        # Prefix's 4 first letters
        pre4 = f"pre4={token[:4]}"
        # Prefix's 3 first letters
        pre3 = f"pre3={token[:3]}"
        # Prefix's 2 first letters
        pre2 = f"pre2={token[:2]}"
        # Prev token
        if i == 0:
            prev = "prev=_BoS_"
        else:
            prev = f"prev={token_list[i - 1][0].lower()}"
        # Next token
        if i == (len(token_list) - 1):
            nxt = "next=_EoS_"
            nxt_end = nxt
        else:
            nxt = f"next={token_list[i + 1][0].lower()}"
            # Next token ending: the two characters before the last one
            # (note that it shares the "next=" prefix with nxt)
            nxt_end = f"next={token_list[i + 1][0][-3:-1]}"
        # All token in capital letters
        capital_num = str(int(token.isupper()))
        capital = f"capital={capital_num}"
        # Begin with capital letter
        b_capital_num = str(int(token[0].isupper()))
        b_capital = f"b_capital={b_capital_num}"
        # Number of digits in token
        digits = f"digits={sum(ch.isdigit() for ch in token)}"
        # Number of capitals in token
        capitals = f"capitals={sum(ch.isupper() for ch in token)}"
        # Number of hyphens in token
        hyphens = f"hyphens={token.count('-')}"
        # Number of symbols in token
        symbols = f"symbols={len(re.findall(r'[()+-]', token))}"
        # Token length
        length = f"length={len(token)}"
        # Token starts with a Capital-Digit or Digit-Capital combination
        dig_cap_num = str(int(bool(
            re.match(r"[A-Z]+[0-9]+", token) or re.match(r"[0-9]+[A-Z]+", token))))
        dig_cap = f"dig_cap={dig_cap_num}"
        # Feats list
        if model == "MaxEnt":
            # Features to reach Goal 3
            feats = [form, pre2, pre3, pre4, suf2, suf4]
        elif model == "CRF":
            # Features to reach Goal 3
            feats = [form, capital, nxt, pre2, suf2, prev,
                    capitals, 
                    # Features to reach Goal 4
                    pre3, pre4, suf4, dig_cap, hyphens, length
                    ]
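        elif model == "RandomForest":
            # Features to reach Goal 3
            # [-1] keeps only the last character of each "name=value" string,
            # i.e. the numeric value (only its last digit when it is 10 or more)
            feats = [suf2, pre2, nxt_end, b_capital, capital, dig_cap,
                     capitals[-1], digits[-1], hyphens[-1], symbols[-1], length[-1]]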
        else:
            print(f"[ERROR] Model {model} not implemented")
            raise NotImplementedError
        features.append(feats)
    return features
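
To make the feature definitions above more concrete, the following sketch computes a handful of them for a single token. The helper `sketch_features` and the sample token are illustrative assumptions; the sketch returns a dictionary rather than the `name=value` strings used by the notebook, and it covers only a subset of the features.

```python
import re


def sketch_features(token):
    """Compute a few of the token-level features for one token string."""
    return {
        "form": token.lower(),                  # lower-cased token form
        "pre3": token[:3],                      # first three characters
        "suf3": token[-3:],                     # last three characters
        "capital": int(token.isupper()),        # all letters are capitals
        "digits": sum(ch.isdigit() for ch in token),
        "hyphens": token.count("-"),
        # Token starts with capitals followed by digits, or the reverse
        "dig_cap": int(bool(re.match(r"[A-Z]+[0-9]+|[0-9]+[A-Z]+", token))),
    }


feats = sketch_features("CYP3A4")
print(feats)
```

For this token the sketch reports two digits, no hyphens, and a positive capital-digit combination, which is the kind of surface pattern the features are meant to expose.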

<a id='classifier'></a>
## Classifier function *classifier*

The classifier function takes the features generated for the data and the trained model selected by the **model** parameter, and outputs the predictions given by that model. The different prediction formats of each model type are normalized into a common format and finally passed on to the [output_entities](#output) function.

The *classifier* function makes use of an auxiliary function to extract the features from the input file, shown after the main function's body.

In [3]:
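import pickle
from os import system

import numpy as np
import pycrfsuite


# Note: relies on the module-level names tmp_path and megam defined elsewhere
def classifier(model, feature_input, model_input, outputfile):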
    sentences, X_valid, Y_valid = get_sentence_features(feature_input)
    if model == "CRF":
        # CRF classifier flow
        tagger = pycrfsuite.Tagger()
        tagger.open(f"{model_input}.crfsuite")
        predictions = [tagger.tag(x) for x in X_valid]

    elif model == "MaxEnt":
        # MaxEnt classifier flow
        megam_features = f"{tmp_path}/megam_valid_features.dat"
        megam_predictions = f"{tmp_path}/megam_predictions.dat"
        system(f"cat {feature_input} | cut -f5- | grep -v '^$' > \
            {megam_features}")
        system(f"./{megam} -nc -nobias -predict {model_input}.megam multiclass\
            {megam_features} > {megam_predictions}")
        with open(megam_predictions, "r") as fp:
            lines = fp.readlines()
        pred_classes = [line.split("\t")[0] for line in lines]
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    elif model == "RandomForest":
        with open(f"{model_input}.randomForest", "rb") as fp:
            rf_model, encoder = pickle.load(fp)
        # Unlist sentences, splitting each token's features into the
        # first 6 (categorical) and the remaining (numeric) ones
        x_cat = []
        x_num = []
        for x_sent in X_valid:
            x_cat_sent = [f[:6] for f in x_sent]
            x_num_sent = [f[6:] for f in x_sent]
            x_cat.extend(x_cat_sent)
            x_num.extend(x_num_sent)
        # One hot encoder to turn categorical variables to binary
        x_encoded = encoder.transform(x_cat).toarray()
        x = np.concatenate((x_encoded, x_num), axis=1)
        pred_classes = rf_model.predict(x)
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    else:
        print(f"[ERROR] Model {model} not implemented")
        raise NotImplementedError
    # Output entities for each sentence
    with open(outputfile, "w") as out:
        for sent, classes in zip(sentences, predictions):
            id = sent[0][0]
            tokens = [(word[1], word[2], word[3]) for word in sent if word]
            output_entities(id, tokens, classes, out)
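
The MaxEnt and RandomForest branches above both predict over a flat list of tokens and then regroup the labels per sentence. That normalization step can be sketched in isolation as follows; the helper name `regroup` and the sample data are illustrative and not part of the lab code.

```python
def regroup(flat_labels, sentences):
    """Split a flat list of labels into one list per sentence."""
    grouped = []
    start = 0
    for sent in sentences:
        end = start + len(sent)
        grouped.append(flat_labels[start:end])
        start = end
    return grouped


# Two sentences with three and two tokens respectively (made-up data).
sentences = [["tok1", "tok2", "tok3"], ["tok4", "tok5"]]
flat = ["B-drug", "O", "O", "B-group", "I-group"]
print(regroup(flat, sentences))
# [['B-drug', 'O', 'O'], ['B-group', 'I-group']]
```

After this step every model yields one label list per sentence, which is the shape the CRF tagger produces directly.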

In [4]:
def get_sentence_features(input):
    with open(input, "r") as fp:
        lines = fp.read()
    sentences = lines.split("\n\n")[:-1]
    X_feat = []
    Y_feat = []
    full_tokens = []
    for sent in sentences:
        tokens = sent.split("\n")
        feats = [token.split("\t") for token in tokens if len(token)]
        x = [f[5:] for f in feats if len(f)]
        # Turn back numeric variables
        # only for RandomForest model
        if model == "RandomForest":
            for i, token in enumerate(x):
                val = [int(elem) if elem.isdigit() else elem for elem in token]
                x[i] = val
        y = [f[4] for f in feats if len(f)]
        full_tokens.append(feats)
        X_feat.append(x)
        Y_feat.append(y)
    return full_tokens, X_feat, Y_feat
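
The parsing above implies a tab-separated layout with one token per line (sentence id, token, start offset, end offset, gold tag, then the features) and a blank line between sentences. The sketch below parses a small invented sample in that layout; the sample content and the name `parse_features` are assumptions made for illustration only.

```python
# Invented sample in the layout implied by the parser above.
sample = (
    "s1\tAspirin\t0\t6\tB-drug\tform=aspirin\tsuf2=in\n"
    "s1\tinhibits\t8\t15\tO\tform=inhibits\tsuf2=ts\n"
    "\n"
    "s2\tbeta\t0\t3\tB-group\tform=beta\tsuf2=ta\n"
    "s2\tblockers\t5\t12\tI-group\tform=blockers\tsuf2=rs\n"
    "\n"
)


def parse_features(text):
    """Return per-sentence feature lists (X) and gold tags (Y)."""
    blocks = [block for block in text.split("\n\n") if block.strip()]
    X, Y = [], []
    for block in blocks:
        rows = [line.split("\t") for line in block.split("\n") if line]
        X.append([row[5:] for row in rows])  # columns 5+ hold the features
        Y.append([row[4] for row in rows])   # column 4 holds the gold tag
    return X, Y


X, Y = parse_features(sample)
print(Y)  # [['B-drug', 'O'], ['B-group', 'I-group']]
```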

<a id='output'></a>
## Output generator function *output_entities*

This function receives the token list **tokens** for each sentence, identified by the **id** parameter, and their predicted classes **classes**, in the B-I-O class convention, and writes the detected entities for the given sentence into the output file object **outf**.

In [1]:
def output_entities(id, tokens, classes, outf):
    ind = 0
    while ind < len(tokens):
        tag = classes[ind]
        type = tag.split("-")[-1]
        if tag == "O":
            ind += 1
            continue
        elif "B" in tag:  # If Beginning of an entity
            name, start, end = tokens[ind]
            # Check if next token I-same_type
            # Continue search until EoS or no-match
            ind += 1
            tag_nxt = classes[ind] if ind < len(tokens) else "O"
            type_nxt = tag_nxt.split("-")[-1]
            while ind < len(tokens) and "I" in tag_nxt and type_nxt == type:
                name_nxt, _, end_nxt = tokens[ind]
                name = f"{name} {name_nxt}"
                end = end_nxt
                ind += 1
                tag_nxt = classes[ind] if ind < len(tokens) else "O"
                type_nxt = tag_nxt.split("-")[-1]
        else:  # I-tag
            name, start, end = tokens[ind]
            ind += 1
        # Print entity and continue
        offset = f"{start}-{end}"
        txt = f"{id}|{offset}|{name}|{type}\n"
        outf.write(txt)
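
The merging logic above can be exercised on a small example. The following is a compact re-statement of the same idea that returns the output lines instead of writing them to a file; the function name `bio_to_entities` and the sample sentence are illustrative assumptions, not the notebook's code.

```python
def bio_to_entities(sent_id, tokens, tags):
    """Merge B-/I- tagged tokens into 'id|start-end|text|type' lines."""
    entities = []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag == "O":
            i += 1
            continue
        etype = tag.split("-", 1)[1]
        name, start, end = tokens[i]
        i += 1
        if tag.startswith("B-"):
            # Absorb the following I- tokens of the same type.
            while i < len(tokens) and tags[i] == f"I-{etype}":
                name = f"{name} {tokens[i][0]}"
                end = tokens[i][2]
                i += 1
        # An I- tag without a preceding B- is emitted as a one-token entity.
        entities.append(f"{sent_id}|{start}-{end}|{name}|{etype}")
    return entities


tokens = [("beta", 0, 3), ("blockers", 5, 12), ("and", 14, 16), ("aspirin", 18, 24)]
tags = ["B-group", "I-group", "O", "B-drug"]
print(bio_to_entities("s1", tokens, tags))
# ['s1|0-12|beta blockers|group', 's1|18-24|aspirin|drug']
```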

<a id='output_1'></a>
## Evaluator output for Devel/Test sets for Goal 3

### Output for the Devel data-set with features indicated to obtain Goal 3
We present the output of all the models. The **CRF** model, the best of the three, is given with the minimal set of features needed to reach the F1 score of 0.6 corresponding to Goal 3:

#### SCORES FOR THE GROUP: ML 

Strict matching (boundaries + type)

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|1277|179|0|315|75|1771|0.83|0.72|0.77
**MaxEnt**|993|374|0|404|162|1771|0.65|0.56|0.6
**RandomForest**|1231|260|0|280|155|1771|0.75|0.7|0.72

Exact matching

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|1342|114|0|315|75|1771|0.88|0.76|0.81
**MaxEnt**|1077|290|0|404|162|1771|0.7|0.61|0.65
**RandomForest**|1294|197|0|280|155|1771|0.79|0.73|0.76

Partial matching

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|1342|0|114|315|75|1771|0.88|0.79|0.83
**MaxEnt**|1077|0|290|404|162|1771|0.7|0.69|0.7
**RandomForest**|1294|0|197|280|155|1771|0.79|0.79|0.79

type matching

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|1370|86|0|315|75|1771|0.89|0.77|0.83
**MaxEnt**|1223|144|0|404|162|1771|0.8|0.69|0.74
**RandomForest**|1407|84|0|280|155|1771|0.85|0.79|0.82
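
The rows above with no partial matches (`par` = 0) are consistent with precision = cor / (cor + inc + par + spu), recall = cor / total, and F1 as their harmonic mean. This reading is inferred from the reported numbers, not taken from the evaluator's source, and the partial-matching rows may weight `par` differently. A minimal sketch, with the illustrative helper name `prf`:

```python
def prf(cor, inc, par, spu, total):
    """Precision, recall and F1 from evaluator counts (assumed formulas)."""
    predicted = cor + inc + par + spu
    precision = cor / predicted if predicted else 0.0
    recall = cor / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# CRF row of the strict-matching table for the Devel set.
p, r, f1 = prf(cor=1277, inc=179, par=0, spu=75, total=1771)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 0.83 0.72 0.77
```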

#### SCORES FOR ENTITY TYPE

Exact matching on drug

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|853|32|0|160|66|1045|0.9|0.82|0.85
**MaxEnt**|731|94|0|220|139|1045|0.76|0.7|0.73
**RandomForest**|837|33|0|175|80|1045|0.88|0.8|0.84

Exact matching on brand

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|115|0|0|65|5|180|0.96|0.64|0.77
**MaxEnt**|98|2|0|80|1|180|0.97|0.54|0.7
**RandomForest**|112|0|0|68|2|180|0.98|0.62|0.76

Exact matching on group

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|304|66|0|84|20|454|0.78|0.67|0.72
**MaxEnt**|144|134|0|176|41|454|0.45|0.32|0.37
**RandomForest**|270|143|0|41|46|454|0.59|0.59|0.59

Exact matching on drug_n

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|5|0|0|87|1|92|0.83|0.05|0.1
**MaxEnt**|20|0|0|72|1|92|0.95|0.22|0.35
**RandomForest**|12|0|0|80|1|92|0.92|0.13|0.23

#### MACRO-AVERAGE MEASURES:

*Model*|P|R|F1
-------|---|---|---
**CRF**|0.87|0.54|0.61
**MaxEnt**|0.78|0.44|0.54
**RandomForest**|0.84|0.54|0.61
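
The macro-average figures appear to be the unweighted mean of the four per-type scores, so a single weak type pulls the average down regardless of how few entities it has. The sketch below reproduces the CRF macro F1 from the rounded per-type values in the tables above; the helper name `macro_average` is illustrative, and small differences in other columns can arise because the inputs here are already rounded.

```python
def macro_average(values):
    """Unweighted mean over entity types."""
    return sum(values) / len(values)


# Per-type exact-matching F1 of the CRF model on the Devel set.
crf_f1 = {"drug": 0.85, "brand": 0.77, "group": 0.72, "drug_n": 0.10}
macro_f1 = round(macro_average(list(crf_f1.values())), 2)
print(macro_f1)  # 0.61
```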

We see that **drug** and **brand** have F1 scores of **0.7** or higher for all three models, while for **group** only the **CRF** model (0.72) and, to a lesser extent, the **RandomForest** model (0.59) score well, meaning our features capture these types of entities well.

In contrast, the **CRF** model has a much lower score of **0.1** for **drug_n** entities, while the **RandomForest** (0.23) and **MaxEnt** (0.35) models have increasingly better scores. In spite of this, all models have 0 incorrect predictions on this type: the errors are almost entirely missed entities.

These results indicate that our features do not characterise *drug_n* entities well, while they characterise the other types fairly well.

### Output for the Test data-set with features indicated to obtain Goal 3

We test the generalisation of the previously presented models with the Test data-set:

#### SCORES FOR THE GROUP: ML RUN=2

Strict matching (boundaries + type)

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|372|113|0|201|47|686|0.7|0.54|0.61
**MaxEnt**|326|191|0|169|132|686|0.5|0.48|0.49
**RandomForest**|372|149|0|165|105|686|0.59|0.54|0.57

Exact matching

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|431|54|0|201|47|686|0.81|0.63|0.71
**MaxEnt**|385|132|0|169|132|686|0.59|0.56|0.58
**RandomForest**|430|91|0|165|105|686|0.69|0.63|0.66

Partial matching

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|431|0|54|201|47|686|0.81|0.67|0.73
**MaxEnt**|385|0|132|169|132|686|0.59|0.66|0.62
**RandomForest**|430|0|91|165|105|686|0.69|0.69|0.69

type matching

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|404|81|0|201|47|686|0.76|0.59|0.66
**MaxEnt**|409|108|0|169|132|686|0.63|0.6|0.61
**RandomForest**|435|86|0|165|105|686|0.69|0.63|0.66

#### SCORES FOR ENTITY TYPE

Exact matching on drug

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|260|20|0|71|46|351|0.8|0.74|0.77
**MaxEnt**|238|48|0|65|76|351|0.66|0.68|0.67
**RandomForest**|250|36|0|65|57|351|0.73|0.71|0.72

Exact matching on brand

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|28|0|0|31|0|59|1|0.47|0.64
**MaxEnt**|21|0|0|38|0|59|1|0.36|0.53
**RandomForest**|26|0|0|33|0|59|1|0.44|0.61

Exact matching on group

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|84|13|0|58|12|155|0.77|0.54|0.64
**MaxEnt**|62|38|0|55|18|155|0.53|0.4|0.45
**RandomForest**|91|27|0|37|27|155|0.63|0.59|0.61

Exact matching on drug_n

*Model*|cor|inc|par|mis|spu|total|prec|recall|F1
-------|---|---|---|---|---|----|---|---|---
**CRF**|0|0|0|121|0|121|0|0|0
**MaxEnt**|5|0|0|116|2|121|0.71|0.04|0.08
**RandomForest**|5|0|0|116|0|121|1|0.04|0.08

#### MACRO-AVERAGE MEASURES:

*Model*|P|R|F1
-------|---|---|---
**CRF**|0.64|0.44|0.51
**MaxEnt**|0.72|0.37|0.43
**RandomForest**|0.84|0.45|0.5

We see that the Test set shows the same trend as the Devel set: the types *drug*, *brand*, and *group* are classified better than *drug_n*, which is handled slightly better by the **MaxEnt** and **RandomForest** models (F1 of 0.08) than by the **CRF** model (F1 of 0).

Nevertheless, the best overall model on the test data-set is the **CRF** model.

<a id='output_2'></a>
## Evaluator output for CRF and Devel/Test sets for Goal 4

### Output for the Devel data-set with features indicated to obtain Goal 4

We improve the previous **CRF** model by adding some extra features, marked as *Features to reach Goal 4* in the function code, and achieve the maximum F1 score for the Devel data-set.

#### SCORES FOR THE GROUP: CRF

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1415|108|0|248|63|1771|0.89|0.8|0.84

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1415|0|108|248|63|1771|0.89|0.83|0.86

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1440|83|0|248|63|1771|0.91|0.81|0.86

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
442|71|0|173|35|686|0.81|0.64|0.72

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
886|43|0|116|65|1045|0.89|0.85|0.87

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
112|3|0|65|2|180|0.96|0.62|0.75

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
329|51|0|74|17|454|0.83|0.72|0.77

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
16|1|0|75|0|92|0.94|0.17|0.29

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.9|0.59|0.67


We see that **drug** has a high F1 score of 0.87, and **brand** and **group** have F1 scores of 0.75 and 0.77, meaning our features capture these types of entities well. For **drug_n**, however, the score of 0.29 is still too low: the features are not capturing its pattern.

### Output for the Test data-set with features indicated to obtain Goal 4

Now we test our CRF model on the Test data-set. The best results we have obtained using the features marked for Goal 4 in the code are the following:

#### SCORES FOR THE GROUP: CRF 

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
397|116|0|173|35|686|0.72|0.58|0.64

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
441|72|0|173|35|686|0.8|0.64|0.71

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
441|0|72|173|35|686|0.8|0.7|0.75

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
442|71|0|173|35|686|0.81|0.64|0.72

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
265|28|0|58|35|351|0.81|0.75|0.78

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
27|0|0|32|0|59|1|0.46|0.63

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
102|16|0|37|9|155|0.8|0.66|0.72

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
3|3|0|115|0|121|0.5|0.02|0.05

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.78|0.47|0.54


We see that **drug** and **group** have F1 scores over 0.7, meaning our features capture these types of entities well, while for **brand** and, again, **drug_n** we have lower scores of 0.63 and 0.05. These values tell us that we need to improve **drug_n** recognition to improve our model: it is the entity type with consistently poor results, which pulls the macro-average F1 score down.