# AHLT - Lab - DDI ML

Ricard Monge and Cristina Capdevila

This notebook contains the deliverables for the AHLT Lab DDI Machine Learning assignment.
The notebook contains the following sections:
@TOREVISE
- [Feature extractor function *extract features*](#features), with subset of features function to achieve Goals 3 and 4.
- [Classifier function *classifier*](#classifier)
- [Output generator function *output entities*](#output)
- [Evaluator output for Devel/Test sets for Goal 3.](#output_1)
- [Evaluator output for Devel/Test sets for Goal 4.](#output_2)

<a id='features'></a>
## Feature extractor function *extract_features*

To improve upon the baseline DDI classification we devise a set of features with which we train four different classifiers to detect the interactions.

The fours classifiers we have tested are:
- **Maximum Entropy classifier** (**MaxEnt**), through its implementation as a command line executable, details in [here](http://users.umiacs.umd.edu/~hal/megam/version0_3/)
- **Multi-layer Perceptron Classifier** (**MLP**), through its implementation in *Sklearn* Python package. We have used a unique hidden layer with size of 45.
- **Support Vector Classification** (**SVC**), through its implementation in *Sklearn* Python package.
- **Logistic Regression** (**LR**), through its implementation in *Sklearn* Python package.


!!! @COMENTAR PER QUIN MODEL ENS HEM DECANTAT (resultats a baix) [table of results].(#table_results)

!!! @COMENTAR LES FEATURES QUE HEM TESTEJAT

In [11]:
def extract_features(analysis, entities, e1, e2):
    """
    Extract Features.
    Function which receives an analyzed sentence tree, the entities
    present in the sentence, and the ids of the two target entities and returns
    a list of features to pass to a ML model to predict DDI.
    Args:
        - analysis: DependencyGraph object instance with sentence parsed
            information.
        - entities: dictionary of entities indexed by id with offset as value.
        - e1: string with id of the first entity to consider.
        - e2: string with id of the second entity to consider.
    Return:
        - feats: list of features extracted from the tree and e1, e2.
    """
    feats = []

    # Get entity nodes from tree
    n1 = get_entity_node(analysis, entities, e1)
    n2 = get_entity_node(analysis, entities, e2)

    # Get verb ancestor from entities
    v1 = get_verb_ancestor(analysis, n1)
    v2 = get_verb_ancestor(analysis, n2)

    # Get ancestors nodes list for entity nodes and verb nodes
    ance1 = get_ancestors(analysis, n1)
    ance2 = get_ancestors(analysis, n2)
    ancev1 = get_ancestors(analysis, v1)
    ancev2 = get_ancestors(analysis, v2)

    # DDI-type characteristic lemmas
    advise_lemmas = ["administer", "use", "recommend", "consider", "approach",
                     "avoid", "monitor", "advise", "require", "contraindicate"]
    effect_lemmas = ["increase", "report", "potentiate", "enhance", "decrease",
                     "include", "result", "reduce", "occur", "produce",
                     "prevent", "effect"]
    int_lemmas = ["interact", "interaction"]
    mechanism_lemmas = ["reduce", "increase", "decrease"]
    # Mix lemmas
    mix_lemmas = list(set(
        advise_lemmas + effect_lemmas + int_lemmas + mechanism_lemmas))
    # Modal verbs lemmas
    modal_vb = ["can", "could", "may", "might", "must", "will", "would",
                "shall", "should"]

    # Modal verbs and DDI-type lemmas present in sentence
    modal_present = check_lemmas(analysis, modal_vb)
    lemma_present = check_lemmas(analysis, mix_lemmas)
    advise_present = check_lemmas(analysis, advise_lemmas)
    effect_present = check_lemmas(analysis, effect_lemmas)
    int_present = check_lemmas(analysis, int_lemmas)
    mechanism_present = check_lemmas(analysis, effect_lemmas)

    # e2<-*-VB is part DDI-type lemmas
    advise_v1 = True if v1["lemma"] in advise_lemmas else "null"
    effect_v1 = True if v1["lemma"] in effect_lemmas else "null"
    int_v1 = True if v1["lemma"] in int_lemmas else "null"
    mechanism_v1 = True if v1["lemma"] in mechanism_lemmas else "null"
    # e2<-*-VB is part DDI-type lemmas
    advise_v2 = True if v2["lemma"] in advise_lemmas else "null"
    effect_v2 = True if v2["lemma"] in effect_lemmas else "null"
    int_v2 = True if v2["lemma"] in int_lemmas else "null"
    mechanism_v2 = True if v2["lemma"] in mechanism_lemmas else "null"


    # Check if entities hang from the same verb
    v1_lemma = v1["lemma"]
    v2_lemma = v2["lemma"]
    v1_equal_v2 = v1 == v2

    # Get head dependencies
    e1_rel = n1["rel"]
    e2_rel = n2["rel"]
    v1_rel = v1["rel"]
    v2_rel = v2["rel"]

    # Get node dependencies
    e1_deps = "_".join(n1["deps"].keys()) if len(n1["deps"]) else "null"
    e2_deps = "_".join(n2["deps"].keys()) if len(n2["deps"]) else "null"
    v1_deps = "_".join(v1["deps"].keys()) if len(v1["deps"]) else "null"
    v2_deps = "_".join(v2["deps"].keys()) if len(v2["deps"]) else "null"
    ance1_deps = "_".join([a["rel"] for a in ance1]) if len(ance1) else "null"
    ance2_deps = "_".join([a["rel"] for a in ance2]) if len(ance2) else "null"

    # Get node order
    e1_over_e2 = n1 in ance2
    v1_over_v2 = v1 in ancev2
    v2_over_v1 = v2 in ancev1

    # Common ancestor features
    common = ([n for n in ance1 if n in ance2] if len(ance1) > len(ance2) else
              [n for n in ance2 if n in ance1])
    common_rel = common[0]["rel"] if len(common) else "null"
    common_deps = ("_".join(common[0]["deps"].keys())
                   if len(common) and len(common[0]["deps"]) else "null")
    common_tag = common[0]["tag"] if len(common) else "null"
    common_tag = dict_tags[common_tag]
    common_dist_root = (len(ance1) - 1 - ance1.index(common[0])
                        if len(common) else 99)
    common_dist_e1 = ance1.index(common[0]) if len(common) else 99
    common_dist_e2 = ance2.index(common[0]) if len(common) else 99

    # Common ancestor son's rel for each entity's branch
    common_dep11_rel = (
        ance1[ance1.index(common[0]) - 1]["rel"]
        if len(common) and ance1.index(common[0]) > 0 else "null")
    common_dep12_rel = (
        ance1[ance1.index(common[0]) - 2]["rel"]
        if len(common) and ance1.index(common[0]) > 1 else "null")
    common_dep13_rel = (
        ance1[ance1.index(common[0]) - 3]["rel"]
        if len(common) and ance1.index(common[0]) > 2 else "null")
    common_dep21_rel = (
        ance2[ance2.index(common[0]) - 1]["rel"]
        if len(common) and ance2.index(common[0]) > 0 else "null")
    common_dep22_rel = (
        ance2[ance2.index(common[0]) - 2]["rel"]
        if len(common) and ance2.index(common[0]) > 1 else "null")
    common_dep23_rel = (
        ance2[ance2.index(common[0]) - 3]["rel"]
        if len(common) and ance2.index(common[0]) > 2 else "null")

    # Common ancestor son's tag for each entity's branch
    common_dep11_tag = (
        dict_tags[ance1[ance1.index(common[0]) - 1]["tag"]]
        if len(common) and ance1.index(common[0]) > 0 else "null")

    common_dep22_tag = (
        dict_tags[ance2[ance2.index(common[0]) - 2]["tag"]]
        if len(common) and ance2.index(common[0]) > 1 else "null")

    # Tree address features
    # e1<-conj-x<-dobj-VB-nmod->e2
    e2_nmod = get_dependency_address(v2, "nmod") == n2["address"]
    x_dobj = get_dependency_address(v1, "dobj")
    nx = analysis.nodes[x_dobj] if x_dobj != -1 else v1
    e1_conj_dobj = get_dependency_address(nx, "conj") == n1["address"]

    # NER features
        
    # Entity lemma features
    lemma1 = str(n1["lemma"])
    lemma2 = str(n2["lemma"])
    
    # 3-Prefix/Suffix from lemma
    pre3_1 = lemma1[:3].lower()
    pre3_2 = lemma2[:3].lower()
    suf3_1 = lemma1[-3:].lower()
    suf3_2 = lemma2[-3:].lower()
    
    # Number of capitals in token
    capitals2 = sum(i.isupper() for i in lemma2)

    # Gather variables
    feats = [
        modal_present,  
        lemma_present,
        advise_present,
        effect_present,
        int_present,
        mechanism_present,  
        advise_v1,
        effect_v1,
        int_v1,
        mechanism_v1,
        advise_v2,  
        effect_v2,
        int_v2,
        mechanism_v2,
        v1_equal_v2,
        e1_rel,
        e2_rel,
        v1_rel,
        v2_rel,
        e1_deps,  
        e2_deps,
        e1_over_e2,
        v1_over_v2,
        v2_over_v1,
        common_rel,  
        common_tag,
        common_dist_root,
        common_dist_e1,
        common_dist_e2,
        common_deps, 
        common_dep11_rel,
        common_dep12_rel,
        common_dep13_rel,
        common_dep21_rel,
        common_dep22_rel, 
        common_dep23_rel,
        common_dep11_tag,
        common_dep22_tag,
        v1_deps,
        v2_deps,  
        ance1_deps,
        ance2_deps,
        pre3_1,
        pre3_2,
        suf3_1,
        suf3_2,
    ]
    # Turn variables f to categorical var_i=f
    feats = [f"var_{i}={f}" for i, f in enumerate(feats)]
    return feats

We have created some auxiliary functions to get the desired features for the tree.

In [2]:
def get_entity_node(analysis, entities, entity):
    """
    Get Entity Node.
    Function which finds the node in the Dependency Tree which corresponds to
    the root of the entity.
    Args:
        - analysis: DependencyTree object instance with sentence analysis.
        - entities: dictionary with entity information.
        - entity: string with id of entity to get.
    Returns:
        - node: dictionary with node from DependencyTree.
    """
    # Get nodes list
    nodes = [analysis.nodes[k] for k in analysis.nodes]
    ents = entities[entity]["text"].split()
    # Capture possible tree nodes containing or that are contained in entity
    possible = sorted(
        [node for node in nodes if node["word"] is not None and
         any(ent in node["word"] for ent in ents)],
        key=lambda x: x["head"])
    node = possible[0] if len(possible) else nodes[0]
    return node


def get_verb_ancestor(analysis, node):
    """
    Get Verb Ancestor.
    Function which looks in the node's antecessor nodes inthe analysis tree
    until it finds a verb VB, and returns such verb.
    Args:
        - analysis: DependencyTree object instance with sentence analysis.
        - node: dictionary with node to start from.
    Return:
        - node: dictionary with verb antecessor node from DependencyTree.
    """
    nodes = analysis.nodes
    while node["tag"] != "TOP" and "VB" not in node["tag"]:
        node = nodes[node["head"]]
        if not node["tag"]:
            break
    return node


def get_dependency_address(node, dependency):
    """
    Get Dependency Address.
    Function which returns the address of a given dependency for a given node,
    or a non tractable value -1, which always evaluates to False in the
    features. To use when extracting features.
    Args:
        - node: dictionary with node to look dependencies from.
        - dependency: string with dependency name to look for in node.
    Return:
        - _: string with address of found dependency, or -1 if not found.
    """
    dep = node["deps"][dependency]
    # If dependency exists, return address
    # If dependency does not exist, return non-value
    return dep[0] if len(dep) else -1


def check_lemmas(analysis, lemmas):
    """
    Check Lemmas.
    Function which checks if the words in the sentence contain the given
    lemmas. Then returns the tree-higher encountered lemma, or "null" if none
    found.
    Args:
        - analysis: DependencyTree object instance with sentence analysis.
        - lemmas: list of strings with lemmas to check.
    Returns:
        - _: string with present lemma or None.
    """
    nds = analysis.nodes
    present = [nds[n] for n in nds
               if (nds[n]["word"] is not None and nds[n]["lemma"] in lemmas)]
    present = sorted(present, key=lambda x: x["head"])
    # return present[0]["lemma"] if len(present) else "null"
    return "True" if len(present) else "False"


def get_ancestors(analysis, node):
    """
    Get Ancestors.
    Function which returns the given node's ancestor nodes.
    Args:
        - analysis: DependencyTree object instance with sentence analysis.
        - node: dictionary with node to start from.
    Return:
        - node: dictionary with verb antecessor node from DependencyTree.
    """
    ancs = []
    nds = analysis.nodes
    while node["tag"] and node["tag"] != "TOP":
        ancs.append(node)
        node = nds[node["head"]]
    return ancs




<a id='classifier'></a>
## Classifier function *classifier*

The classifier function takes the generated features for the data and the trained model, according to the **model** parameter and outputs the predictions given by the model. The different prediction formats of each model type are normalized into the same format and finally passed onto the [ouput_features](#output) function.

The *classifier* function makes use of an auxliary function (get_features_labels) to extract the features from the input file, attached following the main function's body.

In [3]:
def classifier(model, feature_input, model_input, outputfile):
    """
    Classifier.
    Function which retrived a trainer model and predicts the output for a given
    validation set features file, to print output to another file.
    Args:
        - model: string with model type to use.
        - feature_input: string with filename of the file to extract features
            from to validate the model.
        - outputfile: string with filename of output file for validation
            predictions.
    """
    # Retrieve sentences, entities and feature vectos
    ids, x, _ = get_features_labels(feature_input)
    if model == "MaxEnt":
        # MaxEnt classifier flow
        megam_features = f"{tmp_path}/megam_valid_features.dat"
        megam_predictions = f"{tmp_path}/megam_predictions.dat"
        system(f"cat {feature_input} | cut -f {feat_col} > \
            {megam_features}")
        # system(f"cat {feature_input} | cut -f4- > \
        #     {megam_features}")
        system(f"./{megam} -quiet -nc -nobias -predict {model_input}.megam \
            multiclass {megam_features} > {megam_predictions}")
        with open(megam_predictions, "r") as fp:
            lines = fp.readlines()
        predictions = [line.split("\t")[0] for line in lines]

    elif model == "MLP":
        # Retrieve model
        with open(f"{model_input}.MLP", "rb") as fp:
            model, encoder = pickle.load(fp)
        # OneHotEncode variables
        x_ = encoder.transform(x)
        # Predict classes
        predictions = model.predict(x_)

    elif model == "SVC":
        # Retrieve model
        with open(f"{model_input}.SVC", "rb") as fp:
            model, encoder = pickle.load(fp)
        # OneHotEncode variables
        x_ = encoder.transform(x)
        # Predict classes
        predictions = model.predict(x_)

    elif model == "GBC":
        # Retrieve model
        with open(f"{model_input}.GBC", "rb") as fp:
            model, encoder = pickle.load(fp)
        # OneHotEncode variables
        x_ = encoder.transform(x)
        # Predict classes
        predictions = model.predict(x_)

    elif model == "LR":
        # Retrieve model
        with open(f"{model_input}.LR", "rb") as fp:
            model, encoder = pickle.load(fp)
        # OneHotEncode variables
        x_ = encoder.transform(x)
        # Predict classes
        predictions = model.predict(x_)

    else:
        print(f"[ERROR] Model {model} not implemented")
        raise NotImplementedError

    # Ouput entites for each sentence
    with open(outputfile, "w") as outf:
        for (id, id_e1, id_e2), type in zip(ids, predictions):
            output_ddi(id, id_e1, id_e2, type, outf)


In [3]:
def get_features_labels(input):
    """
    Get Features & Labels.
    Function which opens the given filename and extracts the feature and label
    vectors, togehter with the sentence and pair entities ids.
    Args:
        - input: string with filename of file to extract features from.
    Returns:
        - ids: list of lists with sentence id and entity pairs ids.
        - feats: list of lists with binary feature vector.
        - labels: list of labels for each entity pair, for the trainer to use.
    """
    with open(input, "r") as fp:
        lines = fp.read()
    pairs = [sent.split("\t") for sent in lines.split("\n")[:-1]]
    ids = []
    labels = []
    feats = []
    for p in pairs:
        ids.append((p[0], p[1], p[2]))
        labels.append(p[3])
        feat = [elem.split("=")[1] for elem in p[4:]]
        feats.append(feat)
    return ids, feats, labels



<a id='output'></a>
## Output generator function *output features*

This function recieves the token list  **tokens** for each sentence, identified by the **id** parameter, the ids of each entity of the considered pairs (**e1**,**e2**) in the given sentence, their predicted classes **type** of interaction and the list of extracted **features**. Then it outputs the correspondng line to write in the output features file object **outf**.

In [1]:
def output_features(id, e1, e2, type, features, out):
    """
    Output Features.
    Function which outputs to the given opened file object the entity pair
    specified with the features extracted from their sentence.
    Args:
        - id: string with sentence id.
        - e1: string with id of the first entity to consider.
        - e2: string with id of the second entity to consider.
        - type: string with gold class of DDI, for use in training.
        - features: list of extracted features from sentence tree.
        - outf: file object with opened file for writing output features.
    """
    feature_str = "\t".join(features)
    txt = f"{id}\t{e1}\t{e2}\t{type}\t{feature_str}\n"
    out.write(txt)


<a id='table_results'></a>
## Model comparison

Model comparison

|model|prec|recall|F1|
|--|--|--|--|
|MaxEnt|0|0|0|
|--|--|--|--|
|MLP|0|0|0|
|--|--|--|--|
|SVC|0|0|0|
|--|--|--|--|
|KNC|0|0|0|
|--|--|--|--|
|LR|0|0|0|


<a id='output_1'></a>
## Evaluator output for Logistic Regression and Devel set

### Output for the Train data-set 
WRONG! We present the output of all the model, where the CRF model, the best, is given with the minimal features to obtain the F1 score of 0.6 corresponding to Goal 3:

#### SCORES FOR THE GROUP: ML 

Gold Dataset: /Devel

Partial Evaluation: only detection of DDI (regadless to the type)

|tp|fp|fn|total|prec|recall|F1|
|--|--|--|--|--|--|--|
|252|155|232|484|0.6192|0.5207|0.5657|


Detection and Classification of DDI

|tp|fp|fn|total|prec|recall|F1|
|--|--|--|--|--|--|--|
|190|217|294|484|0.4668|0.3926|0.4265|



#### SCORES FOR DDI TYPE

Scores for ddi with type mechanism

|tp|fp|fn|total|prec|recall|F1|
|--|--|--|--|--|--|--|
|54|85|147|201|0.3885|0.2687|0.3176|


Scores for ddi with type effect

|tp|fp|fn|total|prec|recall|F1|
|--|--|--|--|--|--|--|
|89|105|73|162|0.4588|0.5494|0.5|


Scores for ddi with type advise

|tp|fp|fn|total|prec|recall|F1|
|--|--|--|--|--|--|--|
|45|24|74|119|0.6522|0.3782|0.4787|


Scores for ddi with type int

|tp|fp|fn|total|prec|recall|F1|
|--|--|--|--|--|--|--|
|2|3|0|2|0.4|1|0.5714|


#### MACRO-AVERAGE MEASURES:

|P|R|F1|
|--|--|--|
|0.4749|0.549|0.5093|

@ TO COMMENT