# AHLT - Lab - DDI ML

Ricard Monge and Cristina Capdevila

This notebook contains the deliverables for the AHLT Lab DDI Machine Learning assignment.
The notebook contains the following sections:
@TOREVISE
- [Feature extractor function *extract_features*](#features), with the feature subsets used to achieve Goals 3 and 4.
- [Classifier function *classifier*](#classifier)
- [Output generator function *output_entities*](#output)
- [Evaluator output for Devel/Test sets for Goal 3.](#output_1)
- [Evaluator output for Devel/Test sets for Goal 4.](#output_2)

<a id='features'></a>
## Feature extractor function *extract_features*

To improve upon the baseline DDI classification we devise a set of features with which we train four different classifiers to detect the interactions.

The four classifiers we have tested are:
- **Maximum Entropy classifier** (**MaxEnt**), through its implementation as a command-line executable; details [here](http://users.umiacs.umd.edu/~hal/megam/version0_3/).
- **Multi-layer Perceptron Classifier** (**MLP**), through its implementation in the *Sklearn* Python package. We have used a single hidden layer of size 45.
- **Support Vector Classification** (**SVC**), through its implementation in the *Sklearn* Python package.
- **Logistic Regression** (**LR**), through its implementation in the *Sklearn* Python package.


In [11]:
import re


def extract_features(token_list):
    """
    Extract Features
    Function to extract features from each token of the given token list.
    The feature subset is selected by the module-level variable `model`.
    Args:
        - token_list: list of (token, start, end) tuples, one per token
    Returns:
        - features: list of list of features for each token of the given list.
    """
    features = []
    for i, token_t in enumerate(token_list):
        token, start, end = token_t
        # Token form
        form = f"form={token.lower()}"
        # Suffix's 4 last letters
        suf4 = f"suf4={token[-4:].lower()}"
        # Suffix's 3 last letters
        suf3 = f"suf3={token[-3:]}"
        # Suffix's 2 last letters
        suf2 = f"suf2={token[-2:]}"
        # Prefix's 4 first letters
        pre4 = f"pre4={token[:4]}"
        # Prefix's 3 first letters
        pre3 = f"pre3={token[:3]}"
        # Prefix's 2 first letters
        pre2 = f"pre2={token[:2]}"
        # Prev token
        if i == 0:
            prev = "prev=_BoS_"
        else:
            prev = f"prev={token_list[i - 1][0].lower()}"
        # Next token
        if i == (len(token_list) - 1):
            nxt = "next=_EoS_"
            nxt_end = nxt
        else:
            nxt = f"next={token_list[i + 1][0].lower()}"
            # Next token end
            nxt_end = f"next={token_list[i + 1][0][-3:-1]}"
        # All token in capital letters
        capital_num = str(int(token.isupper()))
        capital = f"capital={capital_num}"
        # Begin with capital letter
        b_capital_num = str(int(token[0].isupper()))
        b_capital = f"b_capital={b_capital_num}"
        # Number of digits in token
        digits = f"digits={sum(i.isdigit() for i in token)}"
        # Number of capitals in token
        capitals = f"capitals={sum(i.isupper() for i in token)}"
        # Number of hyphens in token
        hyphens = f"hyphens={sum(['-' == i for i in token])}"
        # Number of symbols in token
        symbols = f"symbols={len(re.findall(r'[()+-]', token))}"
        # Token length
        length = f"length={len(token)}"
        # Token has Digit-Capital combination
        dig_cap_num = str(int(bool(
            re.match(r"[A-Z]+[0-9]+|[0-9]+[A-Z]+", token))))
        dig_cap = f"dig_cap={dig_cap_num}"
        # Feats list
        if model == "MaxEnt":
            # Features to reach Goal 3
            feats = [form, pre2, pre3, pre4, suf2, suf4]
        elif model == "CRF":
            # Features to reach Goal 3
            feats = [form, capital, nxt, pre2, suf2, prev,
                    capitals, 
                    # Features to reach Goal 4
                    pre3, pre4, suf4, dig_cap, hyphens, length
                    ]
        elif model == "RandomForest":
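            # Features to reach Goal 3. Numeric features keep the full value
            # after "=" (taking only the last character truncates values >= 10)
            num_feats = [f.split("=")[1] for f in
                         (capitals, digits, hyphens, symbols, length)]
            feats = [suf2, pre2, nxt_end, b_capital, capital, dig_cap,
                     *num_feats]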
        else:
            print(f"[ERROR] Model {model} not implemented")
            raise NotImplementedError
        features.append(feats)
    return features
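
To illustrate the kind of feature strings produced above, the following minimal sketch reimplements only a handful of the features (form, two-character prefix and suffix, previous and next token) on a toy sentence. The function name `sketch_features` and the example tokens and offsets are assumptions made for illustration; they are not part of the lab code.

```python
def sketch_features(token_list):
    """Toy subset of the features computed by extract_features.

    token_list holds (token, start, end) tuples, as in the function above.
    """
    features = []
    last = len(token_list) - 1
    for i, (token, _start, _end) in enumerate(token_list):
        # Sentence boundaries get the same placeholder values as above
        prev = "prev=_BoS_" if i == 0 else f"prev={token_list[i - 1][0].lower()}"
        nxt = "next=_EoS_" if i == last else f"next={token_list[i + 1][0].lower()}"
        features.append([
            f"form={token.lower()}",
            f"pre2={token[:2]}",
            f"suf2={token[-2:]}",
            prev,
            nxt,
        ])
    return features


# Hypothetical tokens with made-up character offsets
toy_sentence = [("Aspirin", 0, 6), ("inhibits", 8, 15), ("COX-1", 17, 21)]
for feats in sketch_features(toy_sentence):
    print("\t".join(feats))
```

Each token yields one tab-separated line of `name=value` strings, which is the shape the classifiers later read back from the feature file.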

<a id='classifier'></a>
## Classifier function *classifier*

The classifier function takes the generated features for the data and the trained model selected by the **model** parameter, and outputs the predictions given by that model. The different prediction formats of each model type are normalized into a common format and finally passed on to the [output_entities](#output) function.

The *classifier* function uses an auxiliary function to extract the features from the input file; it is listed after the main function's body.

In [3]:
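import pickle
from os import system

import numpy as np
import pycrfsuite


# `tmp_path` and `megam` are module-level settings defined elsewhere
# in the notebook
def classifier(model, feature_input, model_input, outputfile):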
    sentences, X_valid, Y_valid = get_sentence_features(feature_input)
    if model == "CRF":
        # CRF classifier flow
        tagger = pycrfsuite.Tagger()
        tagger.open(f"{model_input}.crfsuite")
        predictions = [tagger.tag(x) for x in X_valid]

    elif model == "MaxEnt":
        # MaxEnt classifier flow
        megam_features = f"{tmp_path}/megam_valid_features.dat"
        megam_predictions = f"{tmp_path}/megam_predictions.dat"
        system(f"cat {feature_input} | cut -f5- | grep -v '^$' > \
            {megam_features}")
        system(f"./{megam} -nc -nobias -predict {model_input}.megam multiclass\
            {megam_features} > {megam_predictions}")
        with open(megam_predictions, "r") as fp:
            lines = fp.readlines()
        pred_classes = [line.split("\t")[0] for line in lines]
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    elif model == "RandomForest":
        with open(f"{model_input}.randomForest", "rb") as fp:
            model, encoder = pickle.load(fp)
        # Unlist sentences
        x_cat = []
        x_num = []
        for x_sent in X_valid:
            x_cat_sent = [f[:6] for f in x_sent]
            x_num_sent = [f[6:] for f in x_sent]
            x_cat.extend(x_cat_sent)
            x_num.extend(x_num_sent)
        # One hot encoder to turn categorical variables to binary
        x_encoded = encoder.transform(x_cat).toarray()
        x = np.concatenate((x_encoded, x_num), axis=1)
        pred_classes = model.predict(x)
        predictions = []
        start = 0
        for sent in X_valid:
            end = start + len(sent)
            predictions.append(pred_classes[start:end])
            start = end

    else:
        print(f"[ERROR] Model {model} not implemented")
        raise NotImplementedError
    # Output entities for each sentence
    with open(outputfile, "w") as out:
        for sent, classes in zip(sentences, predictions):
            id = sent[0][0]
            tokens = [(word[1], word[2], word[3]) for word in sent if word]
            output_entities(id, tokens, classes, out)
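
Both the MaxEnt and RandomForest branches above receive one flat list of predicted tags and cut it back into per-sentence lists using the sentence lengths. A minimal standalone sketch of that regrouping step follows; the helper name `regroup_predictions` and the example tags are hypothetical.

```python
def regroup_predictions(flat_tags, sentences):
    """Split a flat list of tags into one list per sentence."""
    grouped = []
    start = 0
    for sent in sentences:
        end = start + len(sent)
        grouped.append(list(flat_tags[start:end]))
        start = end
    return grouped


# Two invented sentences with three and two tokens respectively
sentence_feats = [["tok1", "tok2", "tok3"], ["tok4", "tok5"]]
flat = ["B-drug", "O", "O", "O", "B-brand"]
print(regroup_predictions(flat, sentence_feats))
```

After this step all three model types deliver the same structure, a list of tag lists aligned with the sentences, so the output loop does not need to know which model produced it.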

In [4]:
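def get_sentence_features(input_path):
    """Read a feature file; return per-sentence tokens, features and tags."""
    with open(input_path, "r") as fp:
        lines = fp.read()
    # Sentences are separated by blank lines; the file ends with one
    sentences = lines.split("\n\n")[:-1]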
    X_feat = []
    Y_feat = []
    full_tokens = []
    for sent in sentences:
        tokens = sent.split("\n")
        feats = [token.split("\t") for token in tokens if len(token)]
        x = [f[5:] for f in feats if len(f)]
        # Turn back numeric variables
        # only for RandomForest model
        if model == "RandomForest":
            for i, token in enumerate(x):
                val = [int(elem) if elem.isdigit() else elem for elem in token]
                x[i] = val
        y = [f[4] for f in feats if len(f)]
        full_tokens.append(feats)
        X_feat.append(x)
        Y_feat.append(y)
    return full_tokens, X_feat, Y_feat
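
The indexing in the two functions above implies a feature file with one token per line, tab-separated columns (sentence id, token, start offset, end offset, tag, then the features) and a blank line between sentences. The sketch below parses a small in-memory sample with that layout; the helper name `parse_feature_text` and the sample rows are assumptions for illustration, and the numeric conversion used for RandomForest is left out.

```python
def parse_feature_text(text):
    """Parse blank-line separated sentences of tab-separated token lines."""
    tokens_out, x_out, y_out = [], [], []
    for block in text.strip("\n").split("\n\n"):
        rows = [line.split("\t") for line in block.split("\n") if line]
        tokens_out.append(rows)
        x_out.append([row[5:] for row in rows])  # feature columns
        y_out.append([row[4] for row in rows])   # tag column
    return tokens_out, x_out, y_out


# Invented sample: two sentences, three tokens in total
sample = (
    "s1\tAspirin\t0\t6\tB-drug\tform=aspirin\tsuf2=in\n"
    "s1\thelps\t8\t12\tO\tform=helps\tsuf2=ps\n"
    "\n"
    "s2\tRest\t0\t3\tO\tform=rest\tsuf2=st\n"
    "\n"
)
tokens, X, Y = parse_feature_text(sample)
print(Y)
```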

<a id='output'></a>
## Output generator function *output_entities*

This function receives the token list **tokens** for each sentence, identified by the **id** parameter, and their predicted classes **classes**, in the B-I-O class convention, and writes the detected entities for the given sentence to the output file object **outf**.

In [1]:
def output_entities(id, tokens, classes, outf):
    """Write one 'id|start-end|text|type' line per entity found in a sentence."""
    ind = 0
    while ind < len(tokens):
        tag = classes[ind]
        type = tag.split("-")[-1]
        if tag == "O":
            ind += 1
            continue
        elif "B" in tag:  # If Beginning of an entity
            name, start, end = tokens[ind]
            # Check if next token I-same_type
            # Continue search until EoS or no-match
            ind += 1
            tag_nxt = classes[ind] if ind < len(tokens) else "O"
            type_nxt = tag_nxt.split("-")[-1]
            while ind < len(tokens) and "I" in tag_nxt and type_nxt == type:
                name_nxt, _, end_nxt = tokens[ind]
                name = f"{name} {name_nxt}"
                end = end_nxt
                ind += 1
                tag_nxt = classes[ind] if ind < len(tokens) else "O"
                type_nxt = tag_nxt.split("-")[-1]
        else:  # I-tag
            name, start, end = tokens[ind]
            ind += 1
        # Print entity and continue
        offset = f"{start}-{end}"
        txt = f"{id}|{offset}|{name}|{type}\n"
        outf.write(txt)
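
As a worked example of the B-I-O merging described above, the following sketch applies the same idea to an invented sentence and returns the output lines instead of writing them to a file. The function name `bio_to_entities`, the tokens and the offsets are hypothetical, and the sketch assumes well-formed tags of the form `B-type`, `I-type` or `O`.

```python
def bio_to_entities(sent_id, tokens, tags):
    """Merge B-/I- tagged tokens into 'id|start-end|text|type' lines."""
    entities = []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag == "O":
            i += 1
            continue
        etype = tag.split("-")[-1]
        name, start, end = tokens[i]
        i += 1
        # Only a B- tag absorbs the following I- tags of the same type;
        # a stray I- tag is emitted as a single-token entity
        if tag.startswith("B"):
            while i < len(tokens) and tags[i] == f"I-{etype}":
                name = f"{name} {tokens[i][0]}"
                end = tokens[i][2]
                i += 1
        entities.append(f"{sent_id}|{start}-{end}|{name}|{etype}")
    return entities


# Invented tokens as (text, start, end) with offsets kept as strings
toy_tokens = [("ascorbic", "10", "17"), ("acid", "19", "22"),
              ("and", "24", "26"), ("Tylenol", "28", "34")]
toy_tags = ["B-drug", "I-drug", "O", "B-brand"]
for line in bio_to_entities("s1", toy_tokens, toy_tags):
    print(line)
```

The two-token span is merged into a single entity whose offset runs from the start of the first token to the end of the last one.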

<a id='output_1'></a>
## Evaluator output for Logistic Regression and Devel set

### Output for the Devel data-set
WRONG! We present the output of all the models; the CRF model, the best one, is run with the minimal set of features needed to obtain the F1 score of 0.6 corresponding to Goal 3:

#### SCORES FOR THE GROUP: ML 

SCORES FOR the file: task9.2_MLLR.txt
Gold Dataset: /Devel

Partial Evaluation: only detection of DDI (regardless of the type)

| tp | fp | fn | total | prec | recall | F1 |
|---|---|---|---|---|---|---|
| 252 | 155 | 232 | 484 | 0.6192 | 0.5207 | 0.5657 |


Detection and Classification of DDI

| tp | fp | fn | total | prec | recall | F1 |
|---|---|---|---|---|---|---|
| 190 | 217 | 294 | 484 | 0.4668 | 0.3926 | 0.4265 |



#### SCORES FOR DDI TYPE

Scores for ddi with type mechanism

| tp | fp | fn | total | prec | recall | F1 |
|---|---|---|---|---|---|---|
| 54 | 85 | 147 | 201 | 0.3885 | 0.2687 | 0.3176 |


Scores for ddi with type effect

| tp | fp | fn | total | prec | recall | F1 |
|---|---|---|---|---|---|---|
| 89 | 105 | 73 | 162 | 0.4588 | 0.5494 | 0.5 |


Scores for ddi with type advise

| tp | fp | fn | total | prec | recall | F1 |
|---|---|---|---|---|---|---|
| 45 | 24 | 74 | 119 | 0.6522 | 0.3782 | 0.4787 |


Scores for ddi with type int

| tp | fp | fn | total | prec | recall | F1 |
|---|---|---|---|---|---|---|
| 2 | 3 | 0 | 2 | 0.4 | 1 | 0.5714 |


MACRO-AVERAGE MEASURES:

| P | R | F1 |
|---|---|---|
| 0.4749 | 0.549 | 0.5093 |
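
The figures above follow from the tp/fp/fn counts. As a minimal sketch, assuming the usual definitions (precision = tp / (tp + fp), recall = tp / (tp + fn), F1 as their harmonic mean) and assuming the macro F1 is taken from the macro-averaged precision and recall, which is what the reported numbers suggest, the per-type counts can be checked like this. The helper name `prf` is hypothetical and is not part of the evaluator.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (tp, fp, fn) per DDI type, copied from the evaluator output above
per_type = {
    "mechanism": (54, 85, 147),
    "effect": (89, 105, 73),
    "advise": (45, 24, 74),
    "int": (2, 3, 0),
}
scores = {name: prf(*counts) for name, counts in per_type.items()}

# Macro averages: unweighted mean over the four types
macro_p = sum(s[0] for s in scores.values()) / len(scores)
macro_r = sum(s[1] for s in scores.values()) / len(scores)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Because every type weighs the same in a macro average, the two-instance **int** type (recall 1) lifts the macro recall above the overall recall of the classification task.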

WRONG!!! We see **drug** and **brand** have high F1 scores over **0.7** for all three models, while only the **CRF** and **MaxEnt** models score highly on **group**, meaning our features capture these types of entities well.

In contrast, the **CRF** model has a much lower score of **0.1** for **drug_n** entities, while the **RandomForest** and **MaxEnt** models have increasingly better scores. In spite of this, all models have 0 incorrect predictions on this type.

These results indicate that our features do not characterise *drug_n* entities well, while they characterise the other types quite well.

We see **drug** and **group** have F1 scores over 0.7, meaning our rules capture these types of entities well, while for **brand** and, again, **drug_n** we have much lower scores of 0.63 and 0.05. These values tell us that we need to improve **drug_n** recognition to improve our model: it is the entity type with consistently poor results, which drags the average F1 score down.