# AHLT Term Project - DDI Classifier
## Alex Paranov, Anthony Nixon
### MIRI Masters - Term 2 2018

## Part 1: Defining Python classes for XML processing

The first component of our project is a set of data structures in which we can store and manipulate the XML data efficiently. This code is available in xml_classes.py (see appendix for source).

We create four classes which correspond to the tagged elements in the XML annotation:

#### Document:
Represents and stores a full text sample consisting of sentence objects. The Document class also provides a "set_features()" method, which calls set_features() on each of its sentences and collects the resulting featured words in the document's featured_words list.

#### Sentence:
A sentence is a discrete segment of text which can be broken down into entities and pairs. Its entities and pairs are stored on the object in lists of the same names.

An important part of the Sentence class is its "set_features()" function which, along with its helper functions, iterates over the entities, splits the text into words, and tags them. For each tagged word, the helper function "get_featured_tuple()" returns a list of features: orthographic features, prefixes and suffixes, word shapes, etc. (We cover these features and their rationale in more detail in Part 3: Feature Extraction.)
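
To make the tagging step concrete, here is a minimal sketch of BIO tagging driven by entity offsets. It is not the code from xml_classes.py: the function name `bio_tags`, the whitespace tokenisation, and the inclusive `(start, end)` offset convention are all assumptions made for illustration.

```python
import re

def bio_tags(text, entity_spans):
    """Tag each whitespace-separated token as B, I or O.

    entity_spans is a list of (start, end) character offsets with an
    inclusive end (an assumed convention, not taken from xml_classes.py).
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        start, end = match.start(), match.end() - 1
        tag = "O"
        for ent_start, ent_end in entity_spans:
            # The token belongs to the entity if the two spans overlap
            if start <= ent_end and end >= ent_start:
                # The token covering the entity's first character begins it
                tag = "B" if start <= ent_start else "I"
                break
        tagged.append((match.group(), tag))
    return tagged

text = "Avoid acetylsalicylic acid with warfarin."
spans = []
for name in ("acetylsalicylic acid", "warfarin"):
    first = text.index(name)
    spans.append((first, first + len(name) - 1))

tags = bio_tags(text, spans)
print(tags)
```

Matching on overlap rather than exact containment keeps a token such as "warfarin." inside its entity even though the trailing full stop lies outside the annotated span.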

#### Entity:
Stores a relevant mention of a drug name / substance / etc. in a sentence as well as the location offset.

#### Pair:
A pair is a drug-drug interaction relating entities in a sentence.
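
As a rough picture of how the four classes fit together, the sketch below uses dataclasses. The `...Sketch` names, the field names, and the placeholder feature are hypothetical; the real definitions are in xml_classes.py and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EntitySketch:
    id: str
    text: str
    char_offset: Tuple[int, int]  # (start, end), assumed inclusive

@dataclass
class PairSketch:
    id: str
    e1: str    # id of the first entity
    e2: str    # id of the second entity
    ddi: bool  # whether the two entities interact

@dataclass
class SentenceSketch:
    id: str
    text: str
    entities: List[EntitySketch] = field(default_factory=list)
    pairs: List[PairSketch] = field(default_factory=list)
    featured_words: List[tuple] = field(default_factory=list)

    def set_features(self):
        # Placeholder feature only: (word, starts with a capital letter)
        self.featured_words = [(w, w[:1].isupper()) for w in self.text.split()]
        return self.featured_words

@dataclass
class DocumentSketch:
    id: str
    sentences: List[SentenceSketch] = field(default_factory=list)
    featured_words: List[tuple] = field(default_factory=list)

    def set_features(self):
        # Delegate to each sentence and collect the results on the document
        self.featured_words = []
        for sentence in self.sentences:
            self.featured_words.extend(sentence.set_features())

sentence = SentenceSketch(
    "s0",
    "Aspirin may increase the effect of warfarin.",
    entities=[EntitySketch("e0", "Aspirin", (0, 6)),
              EntitySketch("e1", "warfarin", (35, 42))],
    pairs=[PairSketch("p0", "e0", "e1", True)],
)
doc = DocumentSketch("d0", [sentence])
doc.set_features()
print(doc.featured_words[0])
```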

In [None]:
# Run cell to import the classes
from xml_classes import Document, Sentence, Entity, Pair
print("DSEP classes loaded")

## Part 2: Parsing

After we have built the structures to store our data, we parse the data. Our parsing code is contained in parser.py (see appendix for source). The primary execution is initiated by the parse_all_files() method. The parser first checks whether the project files have previously been parsed and stored locally; if not, it begins parsing.

The parser stores the data in our Document, Sentence, Entity, and Pair objects. It then takes these objects and writes them to local disk as pickle files. In this manner, we only have to run the parser once.
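
The parse-once behaviour can be sketched as a small caching helper. `load_or_parse` and its `parse_fn` argument are hypothetical names used only for illustration; parser.py may organise this differently.

```python
import os
import pickle
import tempfile

def load_or_parse(pickle_file, parse_fn):
    """Return the cached objects if the pickle exists, else parse and cache."""
    if os.path.exists(pickle_file):
        with open(pickle_file, "rb") as f:
            return pickle.load(f)
    docs = parse_fn()
    folder = os.path.dirname(pickle_file)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(pickle_file, "wb") as f:
        pickle.dump(docs, f)
    return docs

# Demonstration with a stand-in parser that records how often it runs
calls = []

def fake_parse():
    calls.append(1)
    return ["doc-1", "doc-2"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pickle", "train.pkl")
    first = load_or_parse(path, fake_parse)   # parses and writes the pickle
    second = load_or_parse(path, fake_parse)  # reads the pickle back

print(len(calls))
```

The second call never reaches `fake_parse`, which is the property that lets the real parser run only once.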

The parsed data comprises the drug_bank and med_line training sets as well as the drug-drug-interaction and named-entity-recognition test sets for both drug_bank and med_line.
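
For orientation, reading one annotated document could look roughly like the sketch below. The XML layout in `SAMPLE_XML` (a `document` containing `sentence` elements with `entity` and `pair` children, and a `charOffset` attribute of the form `start-end`) is an assumption about the corpus format, and `parse_document` is a hypothetical helper that returns plain dictionaries rather than our classes.

```python
import xml.etree.ElementTree as ET

# Assumed layout of one annotated document (illustrative, not real data)
SAMPLE_XML = """
<document id="d0">
  <sentence id="d0.s0" text="Aspirin may increase the effect of warfarin.">
    <entity id="d0.s0.e0" charOffset="0-6" type="drug" text="Aspirin"/>
    <entity id="d0.s0.e1" charOffset="35-42" type="drug" text="warfarin"/>
    <pair id="d0.s0.p0" e1="d0.s0.e0" e2="d0.s0.e1" ddi="true"/>
  </sentence>
</document>
"""

def parse_document(xml_string):
    root = ET.fromstring(xml_string.strip())
    sentences = []
    for s in root.iter("sentence"):
        entities = []
        for e in s.findall("entity"):
            # Assumes a single continuous span written as "start-end"
            start, end = e.get("charOffset").split("-")
            entities.append({"id": e.get("id"),
                             "text": e.get("text"),
                             "offset": (int(start), int(end))})
        pairs = [{"id": p.get("id"),
                  "e1": p.get("e1"),
                  "e2": p.get("e2"),
                  "ddi": p.get("ddi") == "true"}
                 for p in s.findall("pair")]
        sentences.append({"id": s.get("id"),
                          "text": s.get("text"),
                          "entities": entities,
                          "pairs": pairs})
    return {"id": root.get("id"), "sentences": sentences}

parsed = parse_document(SAMPLE_XML)
print(parsed["sentences"][0]["entities"])
```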

In [None]:
# Run cell to begin parsing
from parser import main as parse_all_files
parse_all_files()

## Part 3: Feature Extraction

The function extract_features() below loads the parsed pickle objects back into memory and calls set_features() on the documents.

We follow an object-oriented paradigm in which each sentence object of a document object returns all features of its constituent text (see appendix, xml_classes.py, for source).

For our feature vectors we include the following features:

- BIO (beginning, inside, outside) tag
- Word windows consisting of [word, pos_tag] for +- 2 words
- Boolean of length >= 7
- Orthographical features: alphanumeric, capitalization, digits, hyphenation
- Booleans indicating whether the prefix and suffix of the word are of length 3, 4, or 5, e.g. [pl3, pl4, pl5, sufl3, sufl4, sufl5]
- Word-shape:
    1. Generalized word shape, where each character is mapped to its class (X = upper, x = lower, 0 = numeric, O = other), e.g. Aspirin1+ maps to Xxxxxxx0O.

    2. Brief word shape, where consecutive characters of the same class are condensed into one, e.g. Aspirin1+ maps to Xx0O.
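
The two word shapes and some of the orthographic flags can be sketched in a few lines. The function names and dictionary keys below are hypothetical and are not taken from xml_classes.py; the character classes follow the legend in the list (X, x, 0, O).

```python
import re

def word_shape(word):
    """Map every character to its class: X upper, x lower, 0 digit, O other."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append("O")
    return "".join(shape)

def brief_word_shape(word):
    """Collapse each run of identical class symbols into a single symbol."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

def orthographic_features(word):
    """A few boolean flags of the kind listed above (names are illustrative)."""
    return {
        "is_alnum": word.isalnum(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "is_long": len(word) >= 7,
    }

print(word_shape("Aspirin1+"))        # Xxxxxxx0O
print(brief_word_shape("Aspirin1+"))  # Xx0O
print(orthographic_features("Aspirin1+"))
```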

In [None]:
# Run cell to load the documents from the pickle files and extract features
# from their sentences.

from os import listdir
from os.path import abspath, join
import pickle

pickle_path = abspath("data/pickle")
pickled_files = [join(pickle_path, f) for f in listdir(pickle_path)]

def extract_features():
    for file_name in pickled_files:
        # Load the parsed documents written by the parser in Part 2
        with open(file_name, 'rb') as f:
            docs = pickle.load(f)

        # Compute the features of every sentence in every document
        all_featured_docs = []
        for doc in docs:
            doc.set_features()
            all_featured_docs.append(doc)

        # Overwrite the pickle file with the featured documents
        with open(file_name, 'wb') as f:
            pickle.dump(all_featured_docs, f)
        print("All documents with features are set in " + file_name)

extract_features()

## Part 4: Classification for Named-Entity Recognition

Our classification takes place through the NERClassifier class. The primary functions of the class are train_drugbank() and test_NER_model(), which train and test a model for the named entity recognition task.

These functions handle loading the data, one-hot encoding, setting the parameters, calling the SVM classifier, and writing the results.
We experimented with both RBF and linear kernels.
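
To show what one-hot encoding does to a feature dictionary before it reaches the SVM, here is a dependency-free sketch. The helpers `fit_vocabulary` and `one_hot` are hypothetical and are not part of classifier.py. In scikit-learn, `DictVectorizer` performs this kind of encoding for string-valued features, and `sklearn.svm.SVC` accepts `kernel='linear'` or `kernel='rbf'`; whether NERClassifier uses exactly those pieces is not shown in this report.

```python
def fit_vocabulary(samples):
    """Assign one column index to every (feature name, value) combination."""
    vocab = {}
    for sample in samples:
        for name, value in sample.items():
            key = f"{name}={value}"
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab

def one_hot(sample, vocab):
    """Encode a feature dictionary as a 0/1 vector over the vocabulary."""
    row = [0] * len(vocab)
    for name, value in sample.items():
        index = vocab.get(f"{name}={value}")
        if index is not None:  # values unseen during fitting are ignored
            row[index] = 1
    return row

train = [
    {"word": "aspirin", "pos": "NN"},
    {"word": "with", "pos": "IN"},
]
vocab = fit_vocabulary(train)

print(one_hot(train[0], vocab))                          # [1, 1, 0, 0]
print(one_hot({"word": "warfarin", "pos": "NN"}, vocab))  # [0, 1, 0, 0]
```

Feature values that did not occur in the training data map to an all-zero slice, so an unseen word at test time still keeps features such as its POS tag.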

In [None]:
# run cell to construct classifier

from classifier import NERClassifier
import warnings

nerCl = NERClassifier()

#### TRAIN A MODEL - based on the feature vectors extracted in Part 3 - Run the cell below. If you would like to use a pre-existing model, skip to the next code cell.

In [None]:
# suppress scikit-learn warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    nerCl.train_drugbank(kernel = 'linear')

#### TEST A MODEL - run the cell below with the relevant parameters: model_index (you will find this appended to the file name of the model) and test_folder (1 = drugbank, 2 = medline)

In [None]:
nerCl.test_NER_model(model_index = 2, test_folder = 1)