# Lab AHLT Drug NER - Part 2
### Authors: David Curto & David Hilazo

This is the second part of the deliverable which is based on building a NERC using Machine Learning that recognizes and classifies drug names in a text. In this second laboratory there are 2 goals that must be achieved:

1. Implement a NERC using machine learning and get an overall F1 score of at least **0.6** on Devel dataset using only information from the training dataset. 

2. Implement a NERC using machine learning and get an overall F1 score of at least **0.7** on Devel dataset using also external knowledge sources

In [27]:
import os
import re
import string
import xml.etree.ElementTree as ET
from collections import Counter

import nltk
import pycrfsuite
from chemdataextractor.nlp.tokenize import ChemWordTokenizer
from nltk import word_tokenize, QuadgramCollocationFinder
from nltk.corpus import stopwords

**Read and extract info from the XML file**

This first function reads the XML file using the The ElementTree XML API and returns the root of the tree that will be used to parse all the information needed.

The second one retrieves the information that will be used in the following functions which is the id of the sentence and the text that it contains.

In [28]:
def parse_xml(file):
    tree = ET.parse(file)
    return tree.getroot()


def get_sentence_info(child):
    return child.get('id'), child.get('text')

**Tokenizers**


Once we have the sentence text we have to apply some transformations in order to be able to manipulate them easily. This first preprocessing step is called  word Tokenization which splits the sentence into tokens. 

In order to do that, we have tested two different tokenizers. The first word tokenizer tested is one of the most well known in NLP task which is the NLTK word tokenizer. However, also we tested one implemented by a tool named ChemDataExtractor which is designed for automatically extracting chemical information from scientific documents. As this tokenizer is more field oriented, it handles better some chemical word formations than the NLTK word tokenizer which is less field oriented.

For this reason, we decided to just use the ChemWordTokenizer as it retrieves better the tokens for specific cases. 

Both of these function returns all the tokens founds and its starting and ending position for each one of them within the sentence.

In [29]:
def chem_tokenize(text):
    cwt = ChemWordTokenizer()
    tokens = cwt.tokenize(text)
    token_indexs = cwt.span_tokenize(text)
    tokenized_info = []
    for token_index, token in zip(token_indexs, tokens):
        tokenized_info.append((token, token_index[0], token_index[1] - 1))
    return tokenized_info


def tokenize(text):
    tokenized_sent = word_tokenize(text)
    tokenized_info = []
    current_index = 0

    for word in tokenized_sent:

        if not re.match("[" + string.punctuation + "]", word):
            for match in re.finditer(word, text):
                if match.start() >= current_index:
                    tokenized_info.append((word, match.start(), match.end() - 1))
                    current_index = match.end() - 1
                    break
    return tokenized_info

**Get external resources**

This function reads the external file named DrugBank and stores its information (name , type of drug) as in the first part of the laboratory. However in this second part we have also used the HSDB file and csv file that we have generated that contains all drug_n names.

In [30]:
def get_external_resources():
    file = open('../resources/DrugBank.txt', 'r', encoding="utf8")
    Lines = file.readlines()

    resources = {}

    # Strips the newline character
    for line in Lines:
        value = line.split("|")
        resources[value[0].lower()] = value[1][:-1]

    hsdb_resources = set()
    file = open('../resources/HSDB.txt', 'r', encoding="utf8")
    Lines = file.readlines()

    for line in Lines:
        value = line[:-1]
        hsdb_resources.add(value.lower())

    drug_n_resources = set()
    file = open('../resources/drug_n.csv', 'r', encoding="utf8")
    Lines = file.readlines()

    for line in Lines:
        value = line[:-1]
        drug_n_resources.add(value.lower())

    return resources, hsdb_resources, drug_n_resources

**Extract features**

TODO

In [31]:
def extract_features(token_list, entities_dict, drug_n_set, hsdb_set, with_resources=False):
    entities = []
    previous_token_offset = (0, 0)
    stop_words = set(stopwords.words('english'))

    # TODO: Create features for any class (punct, nomes numeros, ...)
    for i, token in enumerate(token_list):
        features = []
        if entities_dict and with_resources:
            if token[0] in entities_dict:
                features.append("in_entities_dict=" + entities_dict[token[0]])
            elif token[0].lower() in entities_dict:
                features.append("in_entities_dict=" + entities_dict[token[0].lower()])
            else:
                features.append("in_entities_dict=False")
        else:
            features.append("in_entities_dict=False")
        
        #Check if token belongs to drug_n (external file)
        if with_resources and token[0] in drug_n_set:
            features.append("is_drug_n=True")
        elif with_resources and token[0].lower() in drug_n_set:
            features.append("is_drug_n=True")
        else:
            features.append("is_drug_n=False")
        
        
        #Check if token is inside hsdb file  (external file)
        if with_resources and token[0] in hsdb_set:
            features.append("in_hsdb_set=True")
        elif with_resources and token[0].lower() in hsdb_set:
            features.append("in_hsdb_set=True")
        else:
            features.append("in_hsdb_set=False")
        
        
        lower_letters = sum(1 for c in token[0] if c.isupper())
        upper_letters = sum(1 for c in token[0] if c.isupper())
        num_digits = sum(1 for c in token[0] if c.isdigit())
        num_punctuation = sum(1 for c in token[0] if c in string.punctuation)
        num_roman = len(re.findall("[IVXDLCM]+", token[0]))
        
        # Custom feature that counts the amount of upper and lowercase letters
        # the number of digits , punctuation and roman characters
        word_shape = lower_letters + upper_letters + num_digits + num_punctuation + num_roman
        features.append("word_shape=" + str(word_shape))
        
        # Check if token starts with a digit
        features.append("starts_with_digit=" + str(token[0][0].isdigit()))
        
        # Check if token belongs to the stop word corpus, only available with resources
        features.append("is_stopword=" + str(token[0].lower() in stop_words and with_resources))
        
        
        # Check if token belongs to the stop word set
        stopword_set = {'of', 'the', 'and', 'in', 'with', 'to', 'be', 'or', 'is', 'not', 'by', 'for',
                        'should', 'on', 'that', 'been', 'have', 'other', 'was', 'when', 'are', 'as', 'were',
                        'no', 'has', 'these', 'an', 'this', 'such', 'at', 'from', 'it', 'if', 'there', 'after',
                        'which', 'can', 'between', 'during', 'because', 'both', 'than', 'did', 'its', 'but',
                        'some', 'who', 'any'}
        features.append("in_stopword_set=" + str(token[0].lower() in stopword_set))
        
        #Check if token has parenthesis and it's size is larger than 1 
        features.append("has_parenthesis=" + str('(' in token[0] and len(token[0]) > 1))
        
        #Check if token is a punctuation character
        features.append("is_punct=" + str(token[0] in {'.', ',', ';', ':', '(', ')', '-', '_', '\'', '/', '\\'}))
        
        #Check if token has lowercase letters separated by a hyphen
        features.append("has_lowercase_hyphen=" + str(bool(re.search("[a-z][\-][a-z]", token[0]))))
        
        #Ckeck if token is only formed by numbers
        features.append(
            "is_only_numbers=" + str(re.search("^(\d+[\-\.]\d+)$|^(\d+\.\d+\-\d+\.\d+)$", token[0]) is None))

        #Check if contains a hyphen
        features.append("has_hyphen=" + str(bool(re.search("\w[_%()\-]\w", token[0]))))
        
        #Check if token is in uppercase
        features.append("is_upper=" + str(token[0].isupper()))
    
        #Check if token contains an uppercase vowel
        pattern = re.compile("[AEIOU]")
        features.append("has_upper_vowel=" + str(bool(pattern.search(token[0]))))
        
        #Check if token contains any common word of the different types
        features.append(
            "has_group_keyword=" + str(bool(len(entities) > 0 and previous_token_offset[1] + 2 == token[1] and any(
                substring in token[0].lower() for substring in
                ['agent', 'inhibitor', 'blocker', 'drug', 'type', 'medication', 'contraceptive', 'anticoagulants']))))

        has_common_drug = token[0].lower() in ['digoxin', 'warfarin', 'phenytoin', 'theophylline', 'lithium',
                                               'ketoconazole', 'cimetidine', 'alcohol', 'cyclosporine', 'erythromycin',
                                               'tricyclic antidepressants', 'aspirin', 'carbamazepine', 'rifampin',
                                               'amiodarone', 'quinidine', 'phenobarbital', 'indinavir', 'propranolol',
                                               'methotrexate', 'diltiazem', 'cisapride', 'ethanol']
        features.append("has_common_drug=" + str(has_common_drug))

        features.append("has_common_group=" + str(any(substring in token[0].lower() for substring in
                                                      ['anticoagulant', 'corticosteroid', 'NSAID', 'antacid',
                                                       'contraceptive', 'diuretic', 'barbiturate'])))
        
        #Add features of prefixes and suffixes of different length(2-4)
        if len(token[0]) >= 4:
            features.append("suff4=" + token[0][-4:])
            features.append("pref4=" + token[0][:4])
        else:
            features.append("suff4=" + token[0])
            features.append("pref4=" + token[0])

        if len(token[0]) >= 3:
            features.append("suff3=" + token[0][-3:])
            features.append("pref3=" + token[0][:3])
        else:
            features.append("suff3=" + token[0])
            features.append("pref3=" + token[0])

        if len(token[0]) >= 2:
            features.append("suff2=" + token[0][-2:])
            features.append("pref2=" + token[0][:2])
        else:
            features.append("suff2=" + token[0])
            features.append("pref2=" + token[0])
        
        # Check if contains POC inside the token (characteristic word)
        features.append("has_poc=" + str("POC" in token[0]))

        features.append("starts_with_uppercase=" + str(token[0][0].isupper()))
        
        # Information from previous token
        if i != 0:
            features.append("prev_ent=" + token_list[i - 1][0])
            features.append("prev_postag=" + nltk.pos_tag([token_list[i - 1][0]])[0][1][0])
            features.append("prev_len=" + str(len(token_list[i - 1][0])))
        else:
            features.append("prev_ent=")
            features.append("prev_postag=")
            features.append("prev_len=0")
        
        #Information from current token
        features.append("curr_ent=" + token_list[i][0])
        features.append("curr_postag=" + nltk.pos_tag([token[0]])[0][1][0])
        features.append("curr_len=" + str(len(token_list[i][0])))
        
        #Information from next token
        if i != len(token_list) - 1:
            features.append("next_ent=" + token_list[i + 1][0])
            features.append("next_postag=" + nltk.pos_tag([token_list[i + 1][0]])[0][1][0])
            features.append("next_len=" + str(len(token_list[i + 1][0])))
        else:
            features.append("next_ent=")
            features.append("next_postag=")
            features.append("next_len=0")
        
        # Check if there are uppercase letters (without taking into account the start)
        features.append("capitals_inside=" + str(token[0][1:].lower() != token[0][1:]))
        
        # Count number of punctuation characters
        features.append("count_punct=" + str(len(re.findall("[\.\-+\,()]", token[0]))))
        
        # Check if token contains roman letters
        features.append("has_roman=" + str(bool(re.search("[IVXDLCM]+", token[0]))))

        entities.append(features)

    return entities

**Output features and entities**

TODO

In [32]:
def output_features(sid, tokens, gold_entities, features, output_file):
    for token, feature_vector, bio in zip(tokens, features, gold_entities):
        output_file.write(
            sid + "\t" + token[0] + "\t" + str(token[1]) + "\t" + str(token[2]) + "\t" + bio + "\t" + "\t".join(
                feature_vector) + "\n")
    output_file.write("\n")


def output_entities(sid, entities, output_file):
    for entity in entities:
        output_file.write(sid + "|" + entity['offset'] + "|" + entity['name'] + "|" + entity['type'] + "\n")

**Get truth entities**

TODO: Ground truth! 

In [33]:
def get_truth_entities(child):
    return list(zip([ent.get('text') for ent in child.findall('entity')],
                    [ent.get('charOffset') for ent in child.findall('entity')])), \
           [ent.get('type') for ent in child.findall('entity')]

**Get gold entities**

TODO: 

In [34]:
def get_gold_entities(token_list, truth_entities):
    gold_entities = []
    entity_counter = 0

    for j, token in enumerate(token_list):
        if not truth_entities or len(truth_entities[0]) <= entity_counter:
            gold_entities.append("O")
            continue
        if ';' in truth_entities[0][entity_counter][1]:
            entity_offsets = truth_entities[0][entity_counter][1].split(';')
            for i, offset in enumerate(entity_offsets):
                entity_offset = offset.split('-')
                entity_offset[0] = int(entity_offset[0])
                entity_offset[1] = int(entity_offset[1])
                entity_offsets[i] = (entity_offset[0], entity_offset[1])
        else:
            entity_offset = truth_entities[0][entity_counter][1].split('-')
            entity_offset[0] = int(entity_offset[0])
            entity_offset[1] = int(entity_offset[1])
            entity_offsets = [(entity_offset[0], entity_offset[1])]

        for offset in entity_offsets:
            if offset[0] == token[1] and offset[1] == token[2]:
                # Exact match
                gold_entities.append("B-" + truth_entities[1][entity_counter])
                entity_counter += 1
                break
            elif offset[0] == token[1]:
                # Beginning match
                gold_entities.append("B-" + truth_entities[1][entity_counter])
                break
            elif offset[0] < token[1] and offset[1] > token[2]:
                # Inside match
                gold_entities.append("I-" + truth_entities[1][entity_counter])
                break
            elif offset[1] == token[2]:
                # End match
                gold_entities.append("I-" + truth_entities[1][entity_counter])
                entity_counter += 1
                break
        if len(gold_entities) == j:
            gold_entities.append('O')

    return gold_entities

In [None]:
**Read features**

Reads the file 

In [35]:
def read_features(filename):
    # Using readlines()
    file1 = open('../output/' + filename, 'r')
    lines = file1.readlines()

    sentences_features = []
    sentences_bios = []

    token_features = []
    bios = []

    # Strips the newline character
    for line in lines:
        if line == "\n":
            # End of sentence
            sentences_features.append(token_features)
            sentences_bios.append(bios)

            token_features = []
            bios = []
            continue
        values = line.split("\t")

        sid = values[0]
        token_name = values[1]
        start_offset = values[2]
        end_offset = values[3]
        bio = values[4]
        feature_vector = values[5:]
        feature_vector[-1].replace('\n', '')

        bios.append(bio)
        token_features.append(feature_vector)

    return sentences_features, sentences_bios

In [36]:
def extract_entities_from_bio(token_list, bio_tags):
    entities = []
    previous_token_offset = (0, 0)

    for token, bio in zip(token_list, bio_tags):
        if bio.startswith('B'):
            entities.append({'name': token[0],
                             'offset': str(token[1]) + "-" + str(token[2]),
                             'type': bio.split("-")[-1]})
            previous_token_offset = (token[1], token[2])
        elif bio.startswith('I') and entities:
            entities[-1]['name'] += " " + token[0]
            entities[-1]['offset'] = str(previous_token_offset[0]) + "-" + str(token[2])
    return entities

In [37]:
def evaluate(inputdir, outputfile):
    return os.system("java -jar ../eval/evaluateNER.jar " + inputdir + " ../output/" + outputfile)

In [None]:
def save_features(with_resources):
    # Get Features:
    output_file_name = "features.txt"
    input_directory = '../data/Train/'

    entities_dict, drug_n_set, hsdb_set = get_external_resources()

    output_file = open('../output/' + output_file_name, 'w+')
    for filename in os.listdir(input_directory):
        root = parse_xml(input_directory + filename)
        for child in root:
            sid, text = get_sentence_info(child)
            token_list = chem_tokenize(text)
            features = extract_features(token_list, entities_dict, drug_n_set, hsdb_set,
                                        with_resources=with_resources)
            truth_entities = get_truth_entities(child)
            gold_entities = get_gold_entities(token_list, truth_entities)
            output_features(sid, token_list, gold_entities, features, output_file)

    # Close the file
    output_file.close()

In [38]:
def train_crf(with_resources):  
    
    # Generate and save features
    save_features(with_resources)
    
    # Train CRF:
    sentences_features, sentences_bios = read_features("features.txt")
    
    trainer = pycrfsuite.Trainer(verbose=False)

    for xseq, yseq in zip(sentences_features, sentences_bios):
        trainer.append(xseq, yseq)

    trainer.set_params({
        'c1': 1.0,  # coefficient for L1 penalty
        'c2': 1e-3,  # coefficient for L2 penalty
        'max_iterations': 50,  # stop earlier

        # include transitions that are possible, but not observed
        'feature.possible_transitions': True
    })

    trainer.train('crf_model.crfsuite')

In [None]:
def evaluate_crf(with_resources):
    # Evaluate Devel
    output_file_name = "task9.1_out_2.txt"
    input_directory = '../data/Devel/'

    entities_dict, drug_n_set, hsdb_set = get_external_resources()

    output_file = open('../output/' + output_file_name, 'w+')
    for filename in os.listdir(input_directory):
        root = parse_xml(input_directory + filename)
        for child in root:
            sid, text = get_sentence_info(child)
            token_list = chem_tokenize(text)
            features = extract_features(token_list, entities_dict, drug_n_set, hsdb_set,
                                        with_resources=with_resources)

            tagger = pycrfsuite.Tagger()
            tagger.open('crf_model.crfsuite')
            bio_tags = tagger.tag(features)

            entities = extract_entities_from_bio(token_list, bio_tags)
            output_entities(sid, entities, output_file)

    # Close the file
    output_file.close()
    evaluate(input_directory, output_file_name)

**Without resources**

In [42]:
with_resources = False
train_crf(with_resources)
evaluate_crf(with_resources)

**With resources**

In [40]:
with_resources = True
train_crf(with_resources)
evaluate_crf(with_resources)

### Get Missing and Failed entities

In [41]:
import json

# Using readlines()
file1 = open('goldNER.txt', 'r')
Lines = file1.readlines()

truth = {}

# Strips the newline character
for line in Lines:
    value = line.split("|")
    if value[0] not in truth:
        truth[value[0]] = []
    truth[value[0]].append((value[-2], value[-1]))

# Using readlines()
file1 = open('../output/task9.1_out_2.txt', 'r')
Lines = file1.readlines()

output = {}
wrong_entities = []
new_sent = ""
matched_entities = []
missing = []

# Strips the newline character
for line in Lines:
    value = line.split("|")
    if value[0] not in output:
        output[value[0]] = []
    output[value[0]].append((value[-2], value[-1]))

    if new_sent != value[0] and new_sent != "":
        if new_sent in truth:
            missing += [item for item in truth[new_sent] if item not in matched_entities]
        matched_entities = []
        new_sent = value[0]

    if new_sent == "":
        new_sent = value[0]

    if value[0] in truth and (value[-2], value[-1]) in truth[value[0]]:
        matched_entities.append((value[-2], value[-1]))
    else:
        wrong_entities.append((value[-2], value[-1]))

missing_dict = {}
for ent, type in missing:
    if type[:-1] not in missing_dict:
        missing_dict[type[:-1]] = []
    missing_dict[type[:-1]].append(ent)

wrong_entities_dict = {}
for ent, type in wrong_entities:
    if type[:-1] not in wrong_entities_dict:
        wrong_entities_dict[type[:-1]] = []
    wrong_entities_dict[type[:-1]].append(ent)

print("MISSING:\n", json.dumps(missing_dict, indent=4))
print("\n\n")
print("WRONG:\n", json.dumps(wrong_entities_dict, indent=4))

MISSING:
 {
    "drug": [
        "antibiotics",
        "gentamicin",
        "99mTc-MDP",
        "99mTc-MDP",
        "25-Dihydroxycholecalciferol D3",
        "1,25(OH)2D3",
        "1,25(OH)2D3",
        "1,25(OH)2D3",
        "norepinephrine",
        "norepinephrine",
        "norepinephrine",
        "EACA",
        "erythromycins",
        "amyl nitrite",
        "drugs",
        "toxin",
        "antiarrhythmics",
        "Calcidiol",
        "Calcium",
        "calcium",
        "aluminum",
        "Niacin",
        "Nicotinic Acid",
        "Magnesium-Aluminum Hydroxide",
        "CIMETlDINE",
        "ACTH",
        "thiazide diuretics",
        "Calcidiol",
        "Calcium",
        "calcium",
        "clorazepate dipotassium",
        "TAO",
        "CBZ",
        "Potassium",
        "amiloride",
        "triamterene",
        "Caffeine",
        "Theobromine",
        "Grepafloxacin",
        "ACTH",
        "Human growth hormone",
        "human growth hormone",
    

# Conclusions

- Objective completed
- CRF comentar les seves avantatges
- I explicar conclusions generals del Laboratori com a bloc complet
