# AHLT - Lab - NERC ML

This notebook contains the deliverables for the AHLT Lab NERC Machine Learning assignment, corresponding to Goals 3 and 4.
The notebook contains the following sections:

- [Feature extractor function *extract_features*](#features), with the feature subsets used to achieve Goals 3 and 4.
- [Classifier function *classifier*](#classifier)
- [Output generator function *output_entities*](#output)
- [Evaluator output for Devel/Test sets for Goal 3.](#output_1)
- [Evaluator output for Devel/Test sets for Goal 4.](#output_2)

<a id='features'></a>
## Feature extractor function *extract_features*

To improve upon the baseline, which classified entities through rule-based entity recognition, we devise a set of features and use them to train three different classifiers to detect our entities.

The three classifiers we used are:

- **Conditional Random Fields classifier** (**CRF**), through its implementation in the *pycrfsuite* Python package.
- **Maximum Entropy classifier** (**MaxEnt**), through its implementation as a command line executable, detailed [here](http://users.umiacs.umd.edu/~hal/megam/version0_3/).
- **Random Forest classifier** (**RandomForest**), through its implementation in the *Sklearn* Python package.

For each classifier, we devise a set of features that characterise each token, so that tokens can be classified into drug entity types following the B-I-O scheme. That is, we have 9 classes: eight **B**(eginning) or **I**(nternal) tags, two for each of the four entity types (e.g. B-drug, I-group); and one **O** tag for non-entities.
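
As a minimal sketch of the label set described above, the nine classes can be enumerated from the four entity types that appear in the evaluation tables of this notebook. The names `ENTITY_TYPES` and `LABELS`, and the example sentence, are illustrative assumptions and are not part of the lab code.

```python
# Entity types named in the evaluation tables of this notebook
ENTITY_TYPES = ["drug", "brand", "group", "drug_n"]

# One B- and one I- tag per type, plus a single O tag for non-entities
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

# Hypothetical example sentence, tagged by hand for illustration
tokens = ["Oral", "contraceptives", "may", "interact", "with", "aspirin"]
tags = ["B-group", "I-group", "O", "O", "O", "B-drug"]

print(len(LABELS))
print(list(zip(tokens, tags)))
```

A multi-token entity such as "Oral contraceptives" gets a B- tag on its first token and I- tags of the same type on the rest.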

In [2]:
import re


def extract_features(token_list):
    """
    Extract Features
    Function to extract features from each token of the given token list.
    Relies on the module-level variable `model` to select the feature subset.
    Args:
        - token_list: list of (token, start, end) tuples, one per token
    Returns:
        - features: list of list of features for each token of the given list.
    """
    features = []
    for i, token_t in enumerate(token_list):
        token, start, end = token_t
        # Token form
        form = f"form={token.lower()}"
        # Suffix's 4 last letters
        suf4 = token[-4:].lower()
        suf4 = f"suf4={suf4}"
        # Suffix's 3 last letters
        suf3 = token[-3:]
        suf3 = f"suf3={suf3}"
        # Suffix's 2 last letters
        suf2 = token[-2:]
        suf2 = f"suf2={suf2}"
        # Prefix's 4 first letters
        pre4 = token[:4]
        pre4 = f"pre4={pre4}"
        # Prefix's 3 first letters
        pre3 = token[:3]
        pre3 = f"pre3={pre3}"
        # Prefix's 2 first letters
        pre2 = token[:2]
        pre2 = f"pre2={pre2}"
        # Prev token
        if i == 0:
            prev = "prev=_BoS_"
        else:
            prev = f"prev={token_list[i - 1][0].lower()}"
        # Next token
        if i == (len(token_list) - 1):
            nxt = "next=_EoS_"
            nxt_end = nxt
        else:
            nxt = f"next={token_list[i + 1][0].lower()}"
            # Next token end: its third- and second-to-last characters
            nxt_end = f"next={token_list[i + 1][0][-3:-1]}"
        # All token in capital letters
        capital_num = str(int(token.isupper()))
        capital = f"capital={capital_num}"
        # Begin with capital letter
        b_capital_num = str(int(token[0].isupper()))
        b_capital = f"b_capital={b_capital_num}"
        # Ends s for plurals
        ends_s_num = str(int(token.endswith('s')))
        ends_s = f"ends_s={ends_s_num}"
        # Number of digits in token
        digits = f"digits={sum(c.isdigit() for c in token)}"
        # Number of capitals in token
        capitals = f"capitals={sum(c.isupper() for c in token)}"
        # Number of hyphens in token
        hyphens = f"hyphens={sum(c == '-' for c in token)}"
        # Number of symbols in token
        symbols = f"symbols={len(re.findall(r'[()+-]', token))}"
        # Token length
        length = f"length={len(token)}"
        # Token has Digit-Capital combination
        dig_cap_num = str(int(bool(re.match(r"[A-Z]+[0-9]+", token)
                                   or re.match(r"[0-9]+[A-Z]+", token))))
        dig_cap = f"dig_cap={dig_cap_num}"
        # Feats list
        if model == "MaxEnt":
            feats = [form, pre2, pre3, pre4, suf2, suf4]
        elif model == "CRF":
            # Minimum features to reach Goal 3
            feats = [form, capital, nxt, pre2, suf2, prev,
                    capitals,
                    # Additional features to reach the maximum F1
                    pre3, pre4, suf4, dig_cap, hyphens, length
                    ]
        elif model == "RandomForest":
            # Features to reach Goal 3
            # Note: [-1] keeps only the last character of each "name=value" string
            feats = [suf2, pre2, nxt_end, b_capital, capital, dig_cap,
                     capitals[-1], digits[-1], hyphens[-1], symbols[-1], length[-1]]
        else:
            feats = [form, b_capital, ends_s, capital, dig_cap,
                     nxt, pre2, pre3, pre4, prev, suf2, suf3, suf4,
                     capitals, digits, hyphens, symbols, length]
        features.append(feats)
    return features
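
In the RandomForest branch above, the numeric features are passed as `capitals[-1]`, `length[-1]` and so on, that is, as the last character of each `name=value` string. This recovers the value only while it is a single digit. The sketch below illustrates the difference with a hypothetical helper, `feature_value`, which is not part of the lab code. Swapping it in would change the features the model sees, so the scores reported in this notebook would not necessarily be reproduced.

```python
def feature_value(feature):
    """Return the value part of a "name=value" feature string as an int."""
    # Split on the first "=" only, so the whole value is kept
    return int(feature.split("=", 1)[1])


length = f"length={len('acetaminophen')}"  # "length=13"

last_char = length[-1]              # only the final digit survives
full_value = feature_value(length)  # the complete number

print(last_char, full_value)
```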

<a id='classifier'></a>
## Classifier function *classifier*

The classifier function takes the features generated for the data and the trained model selected by the **model** parameter, and outputs the model's predictions. The different prediction formats of each model type are normalized into the same format and finally passed on to the [output_entities](#output) function.

The *classifier* function uses an auxiliary function, *get_sentence_features*, to read the features from the input file. It is shown after the main function's body.

In [3]:
def classifier(model, feature_input, model_input, outputfile):
    sentences, X_valid, Y_valid = get_sentence_features(feature_input)
    if model == "CRF":
        # CRF classifier flow
        tagger = pycrfsuite.Tagger()
        tagger.open(f"{model_input}.crfsuite")
        predictions = [tagger.tag(x) for x in X_valid]

    elif model == "MaxEnt":
        # MaxEnt classifier flow
        megam_features = f"{tmp_path}/megam_valid_features.dat"
        megam_predictions = f"{tmp_path}/megam_predictions.dat"
        system(f"cat {feature_input} | cut -f5- | grep -v '^$' > \
            {megam_features}")
        system(f"./{megam} -nc -nobias -predict {model_input}.megam multiclass\
            {megam_features} > {megam_predictions}")
        with open(megam_predictions, "r") as fp:
            lines = fp.readlines()
        pred_classes = [line.split("\t")[0] for line in lines]
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    elif model == "RandomForest":
        with open(f"{model_input}.randomForest", "rb") as fp:
            model, encoder = pickle.load(fp)
        # Unlist sentences
        x_cat = []
        x_num = []
        for x_sent in X_valid:
            x_cat_sent = [f[:6] for f in x_sent]
            x_num_sent = [f[6:] for f in x_sent]
            x_cat.extend(x_cat_sent)
            x_num.extend(x_num_sent)
        # One hot encoder to turn categorical variables to binary
        x_encoded = encoder.transform(x_cat).toarray()
        x = np.concatenate((x_encoded, x_num), axis=1)
        pred_classes = model.predict(x)
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    else:
        print(f"[ERROR] Model {model} not implemented")
        raise NotImplementedError
    # Output entities for each sentence
    with open(outputfile, "w") as out:
        for sent, classes in zip(sentences, predictions):
            id = sent[0][0]
            tokens = [(word[1], word[2], word[3]) for word in sent if word]
            output_entities(id, tokens, classes, out)
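
In the function above, the CRF tagger is called once per sentence, while the MaxEnt and RandomForest branches obtain one flat list with a label per token and then cut it back into sentences. That normalization step can be sketched in isolation as follows. The function name `regroup` and the toy data are assumptions made for illustration.

```python
def regroup(flat_labels, sentences):
    """Split one flat list of labels into one list per sentence."""
    grouped, start = [], 0
    for sent in sentences:
        end = start + len(sent)
        grouped.append(flat_labels[start:end])
        start = end
    return grouped


# Two hypothetical sentences with 3 and 2 tokens
sentences = [["t1", "t2", "t3"], ["t4", "t5"]]
flat = ["B-drug", "O", "O", "O", "B-brand"]
per_sentence = regroup(flat, sentences)
print(per_sentence)
```

The slicing relies on the flat predictions being in the same token order as the sentences, which holds here because both come from the same feature file.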

In [4]:
def get_sentence_features(input):
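    """
    Read a tab-separated feature file with one token per line and a blank
    line between sentences. Columns: sentence id, token, start, end, tag,
    then the features. Relies on the module-level variable `model`.
    Returns:
        - full_tokens: all columns of each token, grouped by sentence
        - X_feat: features of each token, grouped by sentence
        - Y_feat: B-I-O tag of each token, grouped by sentence
    """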
    with open(input, "r") as fp:
        lines = fp.read()
    sentences = lines.split("\n\n")[:-1]
    X_feat = []
    Y_feat = []
    full_tokens = []
    for sent in sentences:
        tokens = sent.split("\n")
        feats = [token.split("\t") for token in tokens if len(token)]
        x = [f[5:] for f in feats if len(f)]
        # Turn back numeric variables
        # only for RandomForest model
        if model == "RandomForest":
            for i, token in enumerate(x):
                val = [int(elem) if elem.isdigit() else elem for elem in token]
                x[i] = val
        y = [f[4] for f in feats if len(f)]
        full_tokens.append(feats)
        X_feat.append(x)
        Y_feat.append(y)
    return full_tokens, X_feat, Y_feat
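
To make the expected file layout concrete, the sketch below parses a small in-memory string with the same column order that the function above reads: sentence id, token, start offset, end offset, tag, then the features. The two sentences and their feature values are made up for illustration.

```python
# Two hypothetical sentences in the tab-separated layout read above:
# sentence id, token, start offset, end offset, gold tag, then features.
raw = (
    "s1\tAspirin\t0\t6\tB-drug\tform=aspirin\tsuf2=in\n"
    "s1\thelps\t8\t12\tO\tform=helps\tsuf2=ps\n"
    "\n"
    "s2\tIbuprofen\t0\t8\tB-drug\tform=ibuprofen\tsuf2=en\n"
    "\n"
)

X, Y = [], []
# Sentences are separated by a blank line; the final split item is empty
for block in raw.split("\n\n")[:-1]:
    rows = [line.split("\t") for line in block.split("\n") if line]
    X.append([row[5:] for row in rows])  # feature strings per token
    Y.append([row[4] for row in rows])   # gold B-I-O tag per token

print(Y)
```

Note that the `[:-1]` slice assumes the file ends with a blank line after the last sentence. Without it, the last sentence would be dropped.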

<a id='output'></a>
## Output generator function *output_entities*

This function receives the token list **tokens** for each sentence, identified by the **id** parameter, and their predicted classes **classes**, in the B-I-O class convention, and writes the detected entities for the given sentence to the output file object **outf**.

In [1]:
def output_entities(id, tokens, classes, outf):
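    """
    Write the entities of one sentence to outf, one per line, as
    id|start-end|text|type. A B- tag starts an entity, which is extended by
    the following I- tags of the same type; an I- tag with no preceding B-
    tag is written as a single-token entity.
    """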
    ind = 0
    while ind < len(tokens):
        tag = classes[ind]
        type = tag.split("-")[-1]
        if tag == "O":
            ind += 1
            continue
        elif "B" in tag:  # If Beginning of an entity
            name, start, end = tokens[ind]
            # Check if next token I-same_type
            # Continue search until EoS or no-match
            ind += 1
            tag_nxt = classes[ind] if ind < len(tokens) else "O"
            type_nxt = tag_nxt.split("-")[-1]
            while ind < len(tokens) and "I" in tag_nxt and type_nxt == type:
                name_nxt, _, end_nxt = tokens[ind]
                name = f"{name} {name_nxt}"
                end = end_nxt
                ind += 1
                tag_nxt = classes[ind] if ind < len(tokens) else "O"
                type_nxt = tag_nxt.split("-")[-1]
        else:  # I-tag
            name, start, end = tokens[ind]
            ind += 1
        # Print entity and continue
        offset = f"{start}-{end}"
        txt = f"{id}|{offset}|{name}|{type}\n"
        outf.write(txt)
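
The merging behaviour and the `id|start-end|text|type` output format can be checked on a toy sentence. The sketch below is a simplified stand-in that returns the lines instead of writing them to a file. The name `bio_to_entities`, the sentence id `s1` and the offsets are assumptions for illustration only.

```python
def bio_to_entities(sent_id, tokens, tags):
    """Merge B-/I- tagged tokens into 'id|start-end|text|type' lines."""
    lines = []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag == "O":
            i += 1
            continue
        etype = tag.split("-", 1)[1]
        text, start, end = tokens[i]
        i += 1
        # Only a B- tag may be extended by following I- tags of the same type
        if tag.startswith("B-"):
            while i < len(tokens) and tags[i] == f"I-{etype}":
                text = f"{text} {tokens[i][0]}"
                end = tokens[i][2]
                i += 1
        lines.append(f"{sent_id}|{start}-{end}|{text}|{etype}")
    return lines


tokens = [("Oral", 0, 3), ("contraceptives", 5, 18), ("and", 20, 22), ("aspirin", 24, 30)]
tags = ["B-group", "I-group", "O", "B-drug"]
lines = bio_to_entities("s1", tokens, tags)
print("\n".join(lines))
```

The two-token group entity is emitted with the start offset of its first token and the end offset of its last one.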

<a id='output_1'></a>
## Evaluator output for CRF and Devel/Test sets for Goal 3

### Output for CRF and the Devel data-set with the minimal features indicated to obtain Goal 3

We implemented the CRF model. With the subset of features marked in the previous section as the minimum needed to reach a macro-average F1 score over 0.6 on the Devel data-set, we obtain:

#### SCORES FOR THE GROUP: BASELINE RUN=1

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1277|179|0|315|75|1771|0.83|0.72|0.77

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1342|114|0|315|75|1771|0.88|0.76|0.81

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1342|0|114|315|75|1771|0.88|0.79|0.83

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1370|86|0|315|75|1771|0.89|0.77|0.83

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
853|32|0|160|66|1045|0.9|0.82|0.85

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
115|0|0|65|5|180|0.96|0.64|0.77

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
304|66|0|84|20|454|0.78|0.67|0.72

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
5|0|0|87|1|92|0.83|0.05|0.1

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.87|0.54|0.61


We see that **drug** and **brand** have F1 scores of **0.85** and **0.77**, meaning the model captures these entity types well, while **group** and **drug_n** score lower, at **0.72** and **0.1**. This was to be expected, since entities of these last two types more often span multiple tokens, which are not well detected with this feature subset.
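
For readers who want to relate the count columns to the scores, the sketch below recomputes precision, recall and F1 from the strict-matching row of the tables above. The formulas are an assumption inferred from the reported numbers, not taken from the evaluator's source, and the sketch does not cover the partial-matching rows, where `par` is non-zero.

```python
def prf(cor, inc, par, mis, spu):
    """Precision, recall and F1 from evaluator counts (assumed formulas)."""
    predicted = cor + inc + par + spu  # entities the system produced
    expected = cor + inc + par + mis   # entities in the gold standard
    prec = cor / predicted
    rec = cor / expected
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1


# Strict-matching row of the Devel table above
prec, rec, f1 = prf(cor=1277, inc=179, par=0, mis=315, spu=75)
print(f"{prec:.2f} {rec:.2f} {f1:.2f}")
```

Under these assumptions the `total` column equals `cor + inc + par + mis`, so missed entities lower recall while spurious ones lower precision.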

### Output for CRF and the Devel data-set with features indicated to obtain the maximum F1

With the CRF model and the full subset of features indicated in the previous section to obtain the maximum macro-average F1 score on the Devel data-set, we obtain:

#### SCORES FOR THE GROUP: CRF RUN=1

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1343|180|0|248|63|1771|0.85|0.76|0.8

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1415|108|0|248|63|1771|0.89|0.8|0.84

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1415|0|108|248|63|1771|0.89|0.83|0.86

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1440|83|0|248|63|1771|0.91|0.81|0.86

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
886|43|0|116|65|1045|0.89|0.85|0.87

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
112|3|0|65|2|180|0.96|0.62|0.75

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
329|51|0|74|17|454|0.83|0.72|0.77

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
16|1|0|75|0|92|0.94|0.17|0.29

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.9|0.59|0.67
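
The macro-average figures are consistent with an unweighted mean of the per-type scores. The sketch below checks this against the exact-matching tables above, assuming that definition. Because the inputs are already rounded to two decimals, the result only agrees with the reported values to within rounding.

```python
# Per-type (precision, recall, F1) from the exact-matching tables above
per_type = {
    "drug": (0.89, 0.85, 0.87),
    "brand": (0.96, 0.62, 0.75),
    "group": (0.83, 0.72, 0.77),
    "drug_n": (0.94, 0.17, 0.29),
}


def macro_average(scores):
    """Unweighted mean of each metric across entity types."""
    n = len(scores)
    return tuple(sum(s[k] for s in scores.values()) / n for k in range(3))


macro_p, macro_r, macro_f1 = macro_average(per_type)
print(f"{macro_p:.3f} {macro_r:.3f} {macro_f1:.3f}")
```

Since every type counts equally, **drug_n** (92 entities) weighs as much as **drug** (1045 entities), which is why its low recall pulls the macro-average recall well below the overall recall.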


We see that the only remaining entity type giving us bad results is **drug_n**, with an F1 score of **0.29**.

### Output for Random Forest and the Test data-set with features indicated to obtain Goal 3

Now we test our Random Forest model on the Test data-set. The best results we obtained, using the features listed for it in the feature extractor, are the following:

#### SCORES FOR THE GROUP: CRF 

Strict matching (boundaries + type)

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1231|260|0|280|155|1771|0.75|0.7|0.72

Exact matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1294|197|0|280|155|1771|0.79|0.73|0.76

Partial matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1294|0|197|280|155|1771|0.79|0.79|0.79

type matching

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
1407|84|0|280|155|1771|0.85|0.79|0.82

#### SCORES FOR ENTITY TYPE

Exact matching on drug

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
837|33|0|175|80|1045|0.88|0.8|0.84

Exact matching on brand

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
112|0|0|68|2|180|0.98|0.62|0.76

Exact matching on group

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
270|143|0|41|46|454|0.59|0.59|0.59

Exact matching on drug_n

cor|inc|par|mis|spu|total|prec|recall|F1
---|---|---|---|---|----|---|---|---
12|0|0|80|1|92|0.92|0.13|0.23

#### MACRO-AVERAGE MEASURES:

P|R|F1
---|---|---
0.84|0.54|0.61
