# AHLT Term Project - DDI Classifier
## Alex Paranov, Anthony Nixon
### MIRI Masters - Term 2 2018

## Part 1: Defining Python classes for XML processing

The first component of our project is a set of data structures in which we can store and manipulate the XML data efficiently. This code is in xml_classes.py (source in the appendix).

We create four classes which correspond to the tagged elements in the xml annotation:

#### Document:
Represents and stores a full text sample consisting of Sentence objects. The Document class also provides a "set_features()" method, which calls set_features() on each of its sentences and collects the resulting per-word features in the document's featured_words_dict and featured_sent_dict lists.

#### Sentence:
A sentence is a discrete segment of text which can be broken down into entities and pairs. The constituent entities and pairs are stored in the object's lists of the same names (entities and pairs).

An important part of the Sentence class is its "set_features()" function which, along with its helper functions, iterates over the entities, splits their text into words, and tags each word. For each tagged word, the helper function "get_featured_tuple()" returns a list of features: orthographic features, prefix and suffix lengths, word shapes, etc. (We cover these features and their rationale in more detail in Part 3: Feature Extraction.)
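The tagging step can be sketched as follows. This is a simplified, standalone illustration rather than the project code: it splits on whitespace instead of using NLTK tokenization, and bio_tags is a hypothetical helper name. The sample sentence and entity strings are made up for the example.

```python
def bio_tags(sentence_text, entity_texts):
    """Tag each token B (first word of an entity), I (later word), or O."""
    b_words, i_words = set(), set()
    for entity_text in entity_texts:
        for index, word in enumerate(entity_text.split()):
            # the first word of an entity opens it, the rest continue it
            (b_words if index == 0 else i_words).add(word)

    tagged = []
    for token in sentence_text.split():
        if token in b_words:
            tagged.append((token, "B"))
        elif token in i_words:
            tagged.append((token, "I"))
        else:
            tagged.append((token, "O"))
    return tagged


example = bio_tags(
    "Avoid calcium channel blockers with aspirin",
    ["calcium channel blockers", "aspirin"],
)
print(example)
```

Because matching is done on the word string alone, a word that belongs to an entity is tagged the same way wherever it occurs in the sentence.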

#### Entity:
Stores a relevant mention of a drug name, substance, etc. in a sentence, as well as its character offset.

#### Pair:
A pair is a drug-drug interaction relating two entities in a sentence.
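To make the nesting concrete, here is a minimal sketch of how the four classes compose. The dataclasses below are simplified stand-ins that only mirror the constructor fields of the real classes in xml_classes.py (they have no set_features() logic), and the ids, offsets, and sentence are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    id: str
    charOffset: str
    type: str
    text: str


@dataclass
class Pair:
    id: str
    e1: str          # id of the first entity
    e2: str          # id of the second entity
    ddi: bool        # True if the two entities interact
    type: str = ""   # interaction type, only meaningful when ddi is True


@dataclass
class Sentence:
    id: str
    text: str
    entities: List[Entity] = field(default_factory=list)
    pairs: List[Pair] = field(default_factory=list)


@dataclass
class Document:
    id: str
    sentences: List[Sentence] = field(default_factory=list)


sentence = Sentence("d0.s0", "Aspirin may increase the effect of warfarin.")
sentence.entities.append(Entity("d0.s0.e0", "0-6", "drug", "Aspirin"))
sentence.entities.append(Entity("d0.s0.e1", "35-42", "drug", "warfarin"))
sentence.pairs.append(Pair("d0.s0.p0", "d0.s0.e0", "d0.s0.e1", True, "effect"))

document = Document("d0", [sentence])
print(len(document.sentences[0].entities), document.sentences[0].pairs[0].type)
```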

In [None]:
# Run cell to import the classes
from xml_classes import Document, Sentence, Entity, Pair
print("DSEP classes loaded")

## Part 2: Parsing

After we have built the structures to store our data, we parse the data. Our parsing code is contained in parser.py (see appendix for source). The primary execution is initiated by the parse_all_files() function. For each data set, the parser first checks whether the project files have already been parsed and stored locally; only if they have not does it begin parsing.

The parser stores the data in our Document, Sentence, Entity, and Pair objects. It then takes these objects and writes them to local disk as pickle files. In this manner, we only have to run the parser once.
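The parse-once idea can be sketched as a small cache helper. Note the assumptions: load_or_parse and fake_parse are hypothetical names, and unlike parse_all_files(), which only skips parsing when a pickle exists, this sketch also loads the cached objects back.

```python
import pickle
import tempfile
from os.path import exists, join


def load_or_parse(name, parse_fn, cache_dir):
    """Return cached objects if <cache_dir>/<name>.pkl exists, else parse and cache."""
    cache_path = join(cache_dir, name + ".pkl")
    if exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    parsed = parse_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(parsed, f)
    return parsed


calls = []


def fake_parse():
    # stands in for the expensive XML parsing step
    calls.append(1)
    return ["doc-1", "doc-2"]


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_parse("demo_train", fake_parse, cache_dir)
    second = load_or_parse("demo_train", fake_parse, cache_dir)

print(first == second, len(calls))
```

The second call is served from the pickle, so the parse function runs only once.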

The parsed data comprises the DrugBank and MedLine training sets, as well as the drug-drug-interaction (DDI) and named-entity-recognition (NER) test sets for both DrugBank and MedLine.
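For reference, the annotation layout that the parser walks (document, sentence, entity, pair) can be illustrated with a small standalone sketch. The XML snippet is a made-up sample that uses the attribute names read in parser.py, and parse_document is a hypothetical function that returns plain dictionaries instead of our classes.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<document id="d0">
  <sentence id="d0.s0" text="Aspirin may increase the effect of warfarin.">
    <entity id="d0.s0.e0" charOffset="0-6" type="drug" text="Aspirin"/>
    <entity id="d0.s0.e1" charOffset="35-42" type="drug" text="warfarin"/>
    <pair id="d0.s0.p0" e1="d0.s0.e0" e2="d0.s0.e1" ddi="true" type="effect"/>
  </sentence>
</document>
"""


def parse_document(xml_text):
    root = ET.fromstring(xml_text.strip())
    document = {"id": root.attrib["id"], "sentences": []}
    for sentence_el in root.findall("sentence"):
        sentence = {
            "id": sentence_el.attrib["id"],
            "text": sentence_el.attrib["text"],
            "entities": [dict(e.attrib) for e in sentence_el.findall("entity")],
            "pairs": [],
        }
        for pair_el in sentence_el.findall("pair"):
            pair = dict(pair_el.attrib)
            # the ddi attribute is the string "true" or "false" in the XML
            pair["ddi"] = pair["ddi"] == "true"
            sentence["pairs"].append(pair)
        document["sentences"].append(sentence)
    return document


parsed = parse_document(SAMPLE_XML)
print(parsed["sentences"][0]["pairs"][0])
```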

## Part 3: Feature Extraction

The function extract_features() below loads the parsed pickle objects back into memory and calls set_features() on the documents.

We follow an object-oriented paradigm in which each Sentence object of a Document object returns all features of its constituent text (see appendix, xml_classes.py, for source).

For our feature vectors we include the following features:

- BIO (beginning, inside, outside) tag, assigned to each word.
- Word window consisting of [word, pos_tag] for ±n words. For n = 2 this gives word-2, pos-2, word-1, pos-1, word, pos, word+1, pos+1, word+2, pos+2.
- Boolean feature indicating whether the word has length >= 7.
- Orthographic feature: one of alphanumeric, all-capitalized, is-capitalized, or all-digits, with a Y or N suffix indicating whether the word contains a hyphen.
- Prefix and suffix length: booleans indicating whether the prefix and the suffix of the word (relative to its Snowball stem) have length 3, 4, or 5, e.g. [pl3, pl4, pl5, sufl3, sufl4, sufl5].
- Word shapes:
    1. Generalized word shape, which maps every character to its class (X = upper, x = lower, 0 = numeric, O = other), e.g. AAspirin11++ maps to XXxxxxxx00OO.

    2. Brief word shape, where consecutive characters of the same class are condensed into a single symbol, e.g. AAspirin11++ maps to Xx0O.
- Metadata. This last element of a word's feature vector holds drug-specific information that is used to format the predicted output: sentenceId|offsets...|text|drug_type|sentenceId|idDrug1|idDrug2|prediction.
    The first 4 elements are used in the NER task and the last 4 are used in the DDI task.
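Two of the features above, the word shapes and the word window, can be sketched in a few lines. This is an illustrative reimplementation, not the code from xml_classes.py: char_class, word_shapes, and word_window are hypothetical names, and the POS tags in the example are written by hand rather than produced by NLTK.

```python
from itertools import groupby


def char_class(ch):
    """Map a character to X (upper), x (lower), 0 (numeric), or O (other)."""
    if ch.isupper():
        return "X"
    if ch.islower():
        return "x"
    if ch.isnumeric():
        return "0"
    return "O"


def word_shapes(word):
    full = "".join(char_class(ch) for ch in word)
    # the brief shape keeps one symbol per run of identical symbols
    brief = "".join(symbol for symbol, _ in groupby(full))
    return full, brief


def word_window(index, tagged_words, n=2):
    """Return [word-n, pos-n, ..., word+n, pos+n], padding with '' at the edges."""
    window = []
    for offset in range(-n, n + 1):
        position = index + offset
        if 0 <= position < len(tagged_words):
            window.extend(tagged_words[position])
        else:
            window.extend(("", ""))
    return window


shapes = word_shapes("AAspirin11++")
tagged = [("Take", "VB"), ("aspirin", "NN"), ("daily", "RB")]
window = word_window(0, tagged, n=1)
print(shapes)
print(window)
```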

## Part 4: Task 1 - Classification for Named-Entity Recognition

Our classification takes place through the Classifier class. The primary functions of the class are train_NER_model() and test_NER_model(), and train_DDI_model() and test_DDI_model(), which handle training and testing for the NER and DDI tasks respectively.

The functions train_NER_model() and train_DDI_model() handle loading the data, splitting it into classes and feature vectors, one-hot encoding, and training a model, which is either an SVM or a CRF. The functions test_NER_model() and test_DDI_model() take care of loading a saved model from disk, predicting on the test set, and writing the predictions to a file that follows the output format of the task. This file can then be used to evaluate the performance of a model, i.e. accuracy, recall, and F1 score.
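As an illustration of the output step, the pipe-separated prediction lines could be assembled as in the sketch below. The function names are hypothetical, the field order follows the formats described in this report, and using "null" as the interaction type for non-interacting pairs is an assumption of this sketch.

```python
def format_ner_line(sentence_id, offsets, text, drug_type):
    """sentenceId|offsets...|text|type, with multiple offsets joined by ';'."""
    return "|".join([sentence_id, ";".join(offsets), text, drug_type])


def format_ddi_line(sentence_id, id_drug1, id_drug2, has_interaction,
                    interaction_type="null"):
    """sentenceId|idDrug1|idDrug2|prediction|type, where prediction is 1 or 0."""
    prediction = "1" if has_interaction else "0"
    return "|".join([sentence_id, id_drug1, id_drug2, prediction, interaction_type])


ner_line = format_ner_line("d0.s0", ["35-42"], "warfarin", "drug")
ddi_line = format_ddi_line("d0.s0", "d0.s0.e0", "d0.s0.e1", True, "effect")
print(ner_line)
print(ddi_line)
```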

We ran both the SVM and CRF models many times in order to tune their parameters, which can be found in the Appendix. Note, however, that we used a linear kernel for the SVM, since one-hot encoding produces boolean and integer values.
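To show what one-hot encoding does to our dictionary-style feature vectors, here is a plain-Python sketch. It mimics how scikit-learn's DictVectorizer treats string-valued and numeric features, but one_hot_encode is a hypothetical helper written only for illustration, and the feature names in the sample are invented.

```python
def one_hot_encode(samples):
    """Encode feature dicts as numeric rows.

    String values become 0/1 indicator columns named "key=value";
    numeric values are copied into a column named after the key.
    """
    def column_name(key, value):
        return f"{key}={value}" if isinstance(value, str) else key

    columns = sorted({column_name(k, v) for s in samples for k, v in s.items()})
    index = {name: i for i, name in enumerate(columns)}

    rows = []
    for sample in samples:
        row = [0] * len(columns)
        for key, value in sample.items():
            row[index[column_name(key, value)]] = 1 if isinstance(value, str) else value
        rows.append(row)
    return columns, rows


samples = [
    {"pos": "NN", "shape": "Xx", "long_word": 1},
    {"pos": "VB", "shape": "x", "long_word": 0},
]
columns, rows = one_hot_encode(samples)
print(columns)
print(rows)
```

Every distinct string value gets its own column, so the encoded vectors contain only 0/1 indicators plus the original integer features.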

All of this is driven by our command-line interface, whose usage is shown below. For example, to parse the files and then train a CRF model for the NER task on DrugBank, we run:

python main.py -p -t 1 --train -f 1 -c 2

This trains a CRF classifier and saves the model as models/ner_drugbank_model_0.pkl (the index is the number of DrugBank models already saved).

```
usage: main.py [-h] [-p] [-t TASK] [--train] [--test] [-f FOLDER_INDEX]
               [-i MODEL_INDEX] [-r RATIO] [-c CLASSIFIER]

Train or Test model

optional arguments:
  -h, --help            show this help message and exit
  -p, --parse           Parse files
  -t TASK, --task TASK  Task of problem. 1 - NER task, 2 - DDI task.
  --train               Train model
  --test                Test model at index i
  -f FOLDER_INDEX, --folder_index FOLDER_INDEX
                        Folder number. 1 - drugbank, 2 - medline
  -i MODEL_INDEX, --model_index MODEL_INDEX
                        Index of a model to test
  -r RATIO, --ratio RATIO
                        Ratio of data to use for training
  -c CLASSIFIER, --classifier CLASSIFIER
                        Classifier to use. 1 - SVM, 2 - CRF
```
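A command-line interface with this usage can be reconstructed with argparse roughly as follows. This is a sketch derived from the help text above, not the actual main.py: the argument types and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py", description="Train or Test model")
    parser.add_argument("-p", "--parse", action="store_true", help="Parse files")
    parser.add_argument("-t", "--task", type=int,
                        help="Task of problem. 1 - NER task, 2 - DDI task.")
    parser.add_argument("--train", action="store_true", help="Train model")
    parser.add_argument("--test", action="store_true", help="Test model at index i")
    parser.add_argument("-f", "--folder_index", type=int,
                        help="Folder number. 1 - drugbank, 2 - medline")
    parser.add_argument("-i", "--model_index", type=int,
                        help="Index of a model to test")
    parser.add_argument("-r", "--ratio", type=float,
                        help="Ratio of data to use for training")
    parser.add_argument("-c", "--classifier", type=int,
                        help="Classifier to use. 1 - SVM, 2 - CRF")
    return parser


# the example command from this section, passed as an explicit argument list
args = build_parser().parse_args(["-p", "-t", "1", "--train", "-f", "1", "-c", "2"])
print(args)
```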

In [6]:
%run main.py -p -t 1 --train -f 1 -c 2

drug_bank_train objects already parsed - skipping
medline_train objects already parsed - skipping
drug_bank_ddi_test objects already parsed - skipping
medline_ddi_test objects already parsed - skipping
drug_bank_ner_test objects already parsed - skipping
medline_ner_test objects already parsed - skipping
All documents with features are set in /home/anthonyn/Dropbox/FIB Classes/AHLT/Term Project/drugs-interaction/data/pickle/medline_ddi_test.pkl


KeyboardInterrupt: 

----
## Appendix - Source Code for Files:

### xml_classes.py
```python
from nltk import word_tokenize, pos_tag
from nltk.stem import SnowballStemmer

class Document:
    def __init__(self, id):
        self.id = id
        self.sentences = []

    def add_sentence(self, sentence):
        self.sentences.append(sentence)

    def __str__(self):
        st = "DOCUMENT. Id: "+self.id + '\n'
        for sentence in self.sentences:
            st = st + sentence.__str__() + '\n'
        return st

    # Sets features for each sentence
    def set_features(self):
        featured_words_dict = [] #we need dictionary for DictVectorizer
        featured_sent_dict = []
        for sentence in self.sentences:
            sent_features = sentence.set_features()
            sent_dict = []
            for s_feature in sent_features:
                # first index contains BIO tag
                # last index contains DDI bio tag
                # previous to last index contains metadata
                ddi_tag = s_feature.pop()
                metadata = s_feature.pop()

                assert isinstance(metadata, list)

                m_dict = {'-2': metadata, '-1': ddi_tag}
                for i in range(len(s_feature)):
                    m_dict[str(i)] = s_feature[i]

                featured_words_dict.append(m_dict)
                sent_dict.append(m_dict)

            featured_sent_dict.append(sent_dict)

        self.featured_words_dict = featured_words_dict
        self.featured_sent_dict = featured_sent_dict

class Sentence:
    def __init__(self, id, text):
        self.id = id
        self.text = text
        self.entities = []
        self.pairs = []

    def add_entity(self, entity):
        self.entities.append(entity)

    def add_pair(self, pair):
        self.pairs.append(pair)

    def __str__(self):
        st = "\t---SENTENCE. Id: "+self.id+", Text: "+self.text + '\n'
        for entity in self.entities:
            st = st + entity.__str__() +'\n'
        return st

    def set_features(self):
        B_tags = [] #list with words that are of type B tag
        I_tags = [] #list of words that are of type I tag
        for entity in self.entities:
            words = entity.text.split(" ") #split words in text to tag
            for index, word in enumerate(words):
                if index == 0:
                    B_tags.append(word)
                else:
                    I_tags.append(word)

        tagged_words = pos_tag(word_tokenize(self.text))
        all_features = []

        window_size = 2
        for index, tagged_word in enumerate(tagged_words):
            # Skip single-character tokens, which are mostly punctuation
            if len(tagged_word[0]) < 2:
                continue
            if tagged_word[0] in B_tags:
                all_features.append(self.get_featured_tuple(index, tagged_words, 'B', window_size))
            elif tagged_word[0] in I_tags:
                all_features.append(self.get_featured_tuple(index, tagged_words, 'I', window_size))
            else:
                all_features.append(self.get_featured_tuple(index, tagged_words, 'O', window_size))

        all_features = self.get_vector_metadatas(all_features, window_size)

        return all_features

    # We need this loop in order to assign metadata to a drug-type word.
    # It's necessary since our output should be of type:
    # for NER task we need sentenceId|offsets...|text|type
    # for DDI prediction we need sentenceId|idDrug1|idDrug2|prediction (ddi = 1 or ddi = 0)|type (advice, effect, etc.)
    def get_vector_metadatas(self, all_features, window_size):
        pos = 0 #initial search positions
        new_all_features = [] #vector of new features with appended metadata
        word_pos = 2 * window_size + 1 #position where main word is
        for i in range(len(all_features)):
            charOffset = ""
            type = "" #type of drug which is empty by default
            f_vector = all_features[i] #feature vector
            if len(f_vector) <= word_pos:
                continue
            if isinstance(f_vector[len(f_vector)-1], list):
                f_vector.pop()

            f_word = str(f_vector[word_pos]) #word which is contained in position 2*n+1
            w_text = "" # word text
            # if BIO tag of feature vector is B then we proceed with special case assignment
            if f_vector[0] == 'B':
                pos = self.text.find(f_word, pos) #find position where word starts in the sentence
                # this should not be since there are always words in a sentence, but we don't want to deal with negative positions just in case
                if pos < 0:
                    continue

                # beginning and end positions of word, so offset will be set accordingly
                beg = pos; end = pos + len(f_word) - 1
                charOffset = str(beg)+"-"+str(end)
                pos = end #set a new search position to end of previous word, so that we search different words in sentence
                w_text = f_word

                metadata = [self.id, charOffset, w_text, type]
                # appending metadata to last extracted feature vector (might be from inner while loop)
                f_vector.append(metadata)
                new_all_features.append(f_vector)

                while i < len(all_features) - 1:
                    f_vector = all_features[i+1] #next word in a feature vectors

                    if len(f_vector) <= word_pos:
                        # advance to the next word; a bare continue would loop forever
                        i += 1
                        continue

                    if isinstance(f_vector[len(f_vector)-1], list):
                        f_vector.pop()

                    # As soon as next words BIO tag is not I, we break the inner loop
                    # otherwise we continue appending to charOffsetString. So eventually it looks like
                    # 100-150;155-170;190-200...
                    if f_vector[0] != 'I':
                        break

                    f_word = str(f_vector[word_pos])
                    next_pos = self.text.find(f_word, pos)

                    if next_pos < 0:
                        # advance to the next word; a bare continue would loop forever
                        i += 1
                        continue

                    pos = next_pos

                    w_text += " "+ f_word
                    beg = pos; end = pos + len(f_word) - 1
                    charOffset += ";" + str(beg)+"-"+str(end)
                    pos = end
                    i += 1

                    metadata = [self.id, charOffset, w_text, type]
                    # appending metadata to last extracted feature vector (might be from inner while loop)
                    f_vector.append(metadata)
                    new_all_features.append(f_vector)
            else:
                # Otherwise BIO tag is O so we simply have charOffset and empty type
                f_word = str(f_vector[word_pos])
                w_text = f_word
                pos = self.text.find(f_word, pos)
                if pos < 0:
                    continue

                beg = pos; end = pos + len(f_word) - 1
                charOffset += str(beg)+"-"+str(end)
                pos = end

                metadata = [self.id, charOffset, w_text, type]
                # appending metadata to last extracted feature vector (might be from inner while loop)
                f_vector.append(metadata)
                new_all_features.append(f_vector)

        updated_features = []
        for f_vector in new_all_features:
            # Update tags. It means each tag will be of type B_drug/B_group/I_drug/I_group/etc.
            if not isinstance(f_vector[len(f_vector)-1], list):
                continue

            metadata = f_vector.pop()

            word_ddi = self.get_word_ddi(str(f_vector[word_pos]))
            metadata.extend(word_ddi)

            assert len(metadata) == 8
            # if ddi = True then it's 1, otherwise it's 0
            ddi_tag = int(metadata[4])

            # append type of interaction in both cases
            if ddi_tag > 0:
                ddi_tag = str(ddi_tag)+"_"+metadata[len(metadata)-1]
            else:
                ddi_tag = str(ddi_tag)+"_null"

            # update metadata
            f_vector.append(metadata)

            # set class of ddi to the last element
            f_vector.append(ddi_tag)

            tag = f_vector[0]
            if tag == 'B' or tag == 'I':
                type = self.get_word_entity(str(f_vector[word_pos]))
                tag = tag + "_"+type
                f_vector[0] = tag

            # remove words at those indexes. They are located at positions word_pos +/- 2*i where i is in interval [-window_size,window_size and i != 0]
            skipping_indexes = [word_pos + 2*i for i in range(-window_size,window_size+1) if i != 0]
            ff_vector = [f_vector[j] for j in range(len(f_vector)) if j not in skipping_indexes]
            updated_features.append(ff_vector)

        return updated_features

    # since words is of type BI tag, then it must have type.
    # So we search through all entities and if word is contained then we set type
    # NOTE that all types of word in offsets like this 100-150;155-170;190-200 will be the same
    def get_word_entity(self, f_word):
        for entity in self.entities:
            text_ar = entity.text.split()
            if f_word in text_ar:
                return entity.type

    def get_word_ddi(self, f_word):
        ddi = False
        idDrug1 = ""
        idDrug2 = ""
        type = ""
        for entity in self.entities:
            text_ar = entity.text.split()
            if f_word in text_ar:
                for pair in self.pairs:
                    if pair.e1 == entity.id or pair.e2 == entity.id:
                        ddi = pair.ddi
                        idDrug1 = pair.e1
                        idDrug2 = pair.e2
                        type = pair.type

        return [ddi, idDrug1, idDrug2, type]

    # Following some guidelines from this table https://www.hindawi.com/journals/cmmm/2015/913489/tab1/
    def get_featured_tuple(self, index, tagged_words, bio_tag, window_size = 2):
        features = [bio_tag]
        word = tagged_words[index][0]

        # get array of [word,pos_tag] for +-window_size word window. Default is 2
        if len(tagged_words) > window_size:
            windows = get_words_window(index, tagged_words, window_size)
            features.extend(windows)

        # add boolean feature: 1 if word length >= 7, else 0
        features.append(int(len(word) >= 7))

        orthographical_feature = get_orthographical_feature(word)
        features.append(orthographical_feature)

        # Prefix and suffix is of lengths 3,4,5 respectively
        prefix_suffix_features = get_prefix_suffix_feature(word)
        features.extend(prefix_suffix_features)

        # General word shape and brief word shape
        word_shapes = get_word_shapes(word)
        features.extend(word_shapes)

        # May be add Y,N if drug is in drugbank or FDA approved list of drugs?
        return features

# Getting words and pos tags of window +/- n
# return will be [word-n,pos_tag-n,.....word+n,pos_tag+n]
def get_words_window(index, tagged_words, n):
    windows = []
    if n >= len(tagged_words):
        raise ValueError("n must be less than length of tagged_words")

    for i in range(-n,n+1):
        # we can reach the first and last element, so we are safe to get them
        if index + i >= 0 and index + i < len(tagged_words):
            word = tagged_words[index + i][0]
            pos_tag = tagged_words[index + i][1]
        else:
            word = ''
            pos_tag = ''

        windows.append(word)
        windows.append(pos_tag)
    return windows

def get_orthographical_feature(word):
    orthographical_feature = "alphanumeric"
    f_uppercase = lambda w: 1 if ord(w) >= 65 and ord(w) <= 90 else 0
    upper_case = list(map(f_uppercase, word))

    if sum(upper_case) == len(word):
        orthographical_feature = "all-capitalized"
    elif f_uppercase(word[0]) == 1:
        orthographical_feature = "is-capitalized"

    # Lambda function which flags numeric characters
    f_numerics = lambda w: 1 if w.isnumeric() else 0
    numerics = list(map(f_numerics, word))

    if sum(numerics) == len(word):
        orthographical_feature = "all-digits"

    if "-" in word:
        orthographical_feature += "Y"
    else:
        orthographical_feature += "N"

    return orthographical_feature

def get_prefix_suffix_feature(word):
    snowball_stemmer = SnowballStemmer("english")
    stemmed_word = snowball_stemmer.stem(word)
    ind = word.find(stemmed_word)

    prefix_len = len(word[:ind])
    suffix_len = len(word) - prefix_len - len(stemmed_word)

    pl3 = int(prefix_len == 3); sufl3 = int(suffix_len == 3)
    pl4 = int(prefix_len == 4); sufl4 = int(suffix_len == 4)
    pl5 = int(prefix_len == 5); sufl5 = int(suffix_len == 5)

    return (pl3, pl4, pl5, sufl3, sufl4, sufl5)

def get_word_shapes(word):
    # Generalized Word Shape Feature. Map upper case, lower case, digit and
    # other characters to X,x,0 and O respectively
    # Aspirin1+ will be mapped to Xxxxxxx0O, for example
    word_shape = ""
    for w in word:
        if w.isupper():
            word_shape += "X"
        elif w.islower():
            word_shape += "x"
        elif w.isnumeric():
            word_shape += "0"
        else:
            word_shape += "O"

    # Brief word shape. maps consecutive uppercase letters, lowercase letters,
    # digits, and other characters to “X,” “x,” “0,” and “O,” respectively.
    # Aspirin1+ will be mapped to Xx0O

    # Lambda function to determine if character belongs to category other based on its ascii value
    # We assume ascii unicode, which is true since our XML has UTF-8 encoding (English text)
    f_other = lambda w: True if (ord(w) < 48 or (ord(w) >= 58 and ord(w) <= 64) or
    (ord(w) >= 91 and ord(w) <= 96) or ord(w) > 122) else False

    word_shape_brief = ""
    i = 0
    while i < len(word):
        if word[i].isupper():
            word_shape_brief += "X"
            while i < len(word) and word[i].isupper():
                i += 1
            if i == len(word):
                break
        if word[i].islower():
            word_shape_brief += "x"
            while i < len(word) and word[i].islower():
                i += 1
            if i == len(word):
                break
        if word[i].isnumeric():
            word_shape_brief += "0"
            while i < len(word) and word[i].isnumeric():
                i += 1
            if i == len(word):
                break
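        if f_other(word[i]):
            word_shape_brief += "O"
            while i < len(word) and f_other(word[i]):
                i += 1
        # no extra increment here: each branch above already advances i past
        # its run, and an unconditional i += 1 would skip one character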

    return (word_shape, word_shape_brief)

class Entity:
    def __init__(self, id, charOffset, type, text):
        self.id = id
        self.charOffset = charOffset
        self.type = type
        self.text = text

    def __str__(self):
        st = "\t\t---ENTITY. Id: "+self.id+", CharOffSet: "+self.charOffset+", Type: "+self.type+", Text: "+self.text
        return st

class Pair:
    def __init__(self, id, e1, e2, ddi):
        self.id = id
        self.e1 = e1
        self.e2 = e2
        self.ddi = ddi
        self.type = ""

    def set_type(self, type):
        self.type = type

    def __str__(self):
        st = "\t\t---PAIR. Id: "+self.id+", E1: "+self.e1+", E2: "+self.e2+", DDI: "+str(self.ddi)
        if self.ddi:
            st += ", Type: "+self.type
        return st

```

### parser.py
```python
#!/usr/bin/python3
from xml_classes import *
import xml.etree.ElementTree as ET
from os.path import abspath, join, isdir, exists
from os import listdir, makedirs
import sys
import pickle

# Each dictionary contains name of dictionary and data, which is paths of all files in specified directory
train_path = abspath("data/train/DrugBank")
drug_bank_train = {'name': 'drug_bank_train', 'data': [join(train_path, f) for f in listdir(train_path)]}

train_path = abspath("data/train/MedLine")
medline_train =   {'name':'medline_train', 'data': [join(train_path, f) for f in listdir(train_path)]}

# Test for DDI extraction task

test_path = abspath("data/test/Test_DDI_Extraction_task/DrugBank")
drug_bank_ddi_test = {'name': 'drug_bank_ddi_test', 'data': [join(test_path, f) for f in listdir(test_path)]}
test_path = abspath("data/test/Test_DDI_Extraction_task/MedLine")
medline_ddi_test =   {'name': 'medline_ddi_test', 'data': [join(test_path, f) for f in listdir(test_path)]}

# Test for DrugNER task
test_path = abspath("data/test/Test_DrugNER_task/DrugBank")
drug_bank_ner_test = {'name': 'drug_bank_ner_test', 'data': [join(test_path, f) for f in listdir(test_path)]}
test_path = abspath("data/test/Test_DrugNER_task/MedLine")
medline_ner_test =   {'name': 'medline_ner_test', 'data': [join(test_path, f) for f in listdir(test_path)]}

class Parser:
    def set_path(self, xml_path):
        self.path = xml_path

    def parse_xml(self):
        tree = ET.parse(self.path)
        root = tree.getroot()
        document = Document(root.attrib['id'])
        for child in root:
            if child.tag == "sentence":
                sentence = Sentence(child.attrib['id'], child.attrib['text'])
                if len(sentence.text) < 2:
                    continue
                for second_child in child:
                    attr = second_child.attrib
                    if second_child.tag == "entity":
                        entity = Entity(attr['id'], attr['charOffset'], attr['type'], attr['text'])
                        sentence.add_entity(entity)
                    elif second_child.tag == "pair":
                        ddi = False
                        if attr['ddi'] == "true":
                            ddi = True

                        pair = Pair(attr['id'],attr['e1'],attr['e2'], ddi)
                        if pair.ddi and 'type' in attr:
                            pair.set_type(attr['type'])

                        sentence.add_pair(pair)

                document.add_sentence(sentence)
        return document

    def parse_save_xml_dict(self, xml_dict):
        parsed_docs = []
        for doc in xml_dict['data']:
            print("Parsing: "+doc)
            self.set_path(doc)
            d = self.parse_xml()
            parsed_docs.append(d)

        dir_path = abspath("data/pickle")
        if not isdir(dir_path):
            makedirs(dir_path)

        pickle_name = xml_dict['name']+".pkl"
        with open(join(dir_path, pickle_name),"wb") as f:
            pickle.dump(parsed_docs, f)
            print("Saved parsed documents from " + pickle_name + " into pickle!\n")

def parse_all_files():
    parser = Parser()
    datasets = [drug_bank_train, medline_train,
                drug_bank_ddi_test, medline_ddi_test,
                drug_bank_ner_test, medline_ner_test]
    for dataset in datasets:
        # parse a data set only if its pickle file does not exist yet
        if not exists("data/pickle/"+dataset['name']+".pkl"):
            parser.parse_save_xml_dict(dataset)
        else:
            print(dataset['name']+" objects already parsed - skipping")

def main():
    parse_all_files()

if __name__ == "__main__":
    main()

```

### classifier.py

```python
#!/usr/bin/python3
import pickle
from os.path import join, abspath, isdir
from os import listdir, makedirs
from operator import contains

from numpy.random import randint
import scipy.stats  # imported explicitly so that scipy.stats.expon is available

import sklearn_crfsuite

from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
# sklearn.externals.joblib and sklearn.grid_search were removed in later scikit-learn releases
import joblib
from sklearn.model_selection import RandomizedSearchCV

# Files are in the following order (on Alex's computer):
# 0 - medline_ner_test.pkl
# 1 - medline_train.pkl
# 2 - drug_bank_ddi_test.pkl
# 3 - drug_bank_train.pkl
# 4 - drug_bank_ner_test.pkl
# 5 - medline_ddi_test.pkl

pickle_path = "data/pickle"
pickled_files = [join(abspath(pickle_path), f) for f in listdir(abspath(pickle_path))]

# function that returns full path of a file
def get_file_full_path(file_name, pickled_files):
    for p_f in pickled_files:
        if contains(p_f, file_name):
            return p_f
    return ""

class Classifier:
    def __init__(self):
        self.path = ""

    def set_path(self, path):
        self.path = path

    # split dataset into classes and sub-dictionaries
    # return classes and dictionaries (i.e. feature vectors)
    def split_dataset(self):
        if len(self.path) == 0:
            raise ValueError("Path can't be empty")

        with open(self.path, 'rb') as f:
            docs = pickle.load(f)

        feature_vectors_dict = [] # feature vectors expressed as dicts. train data
        ner_classes = [] # B,I,O classes
        ddi_classes = [] # DDI classes, e.g. 0_null or 1_<interaction type>
        dict_metadatas = []

        for doc in docs:
            for m_dict in doc.featured_words_dict:
                ner_classes.append(m_dict['0'])
                ddi_classes.append(m_dict['-1'])
                dict_metadatas.append(m_dict['-2'])

                # we want sub-dictionary of all elements besides the class
                sub_dict = {k:v for k,v in  m_dict.items() if k > '0' and not isinstance(v, list)}
                feature_vectors_dict.append(sub_dict)

        return (feature_vectors_dict, ner_classes, ddi_classes, dict_metadatas)

    def split_dataset_crf(self):
        if len(self.path) == 0:
            raise ValueError("Path can't be empty")

        with open(self.path, 'rb') as f:
            docs = pickle.load(f)

        all_sentences_features = []
        for doc in docs:
            # sent is a list of dictionaries
            for sent in doc.featured_sent_dict:
                ner_classes = []
                ddi_classes = []
                dict_metadatas = []
                sub_dicts = []
                for m_dict in sent:
                    ner_classes.append(m_dict['0'])
                    ddi_classes.append(m_dict['-1'])
                    dict_metadatas.append(m_dict['-2'])

                    # we want sub-dictionary of all elements besides the class
                    sub_dict = {k:v for k,v in  m_dict.items() if k > '0' and not isinstance(v, list)}
                    sub_dicts.append(sub_dict)

                all_sentences_features.append((sub_dicts, ner_classes, ddi_classes, dict_metadatas))

        return all_sentences_features

    # train dataset, where X is a list of feature vectors expressed as dictionary
    # and Y is class variable, which is BIO tag_type in our case. ratio is proportion of data to use to train
    def train_dataset_svm(self, X, Y, kernel, ratio):
        vec = DictVectorizer(sparse=False)
        svm_clf = svm.SVC(kernel = kernel, cache_size = 1800, C = 20, verbose = True, tol = 0.01)
        vec_clf = Pipeline([('vectorizer', vec), ('svm', svm_clf)])
        assert len(X) == len(Y)

        # subset of indexes to use in training; randint samples with replacement
        # and high is exclusive, so the last index is never drawn
        r_indexes = randint(low = 0, high = len(X)-1, size = round(ratio*(len(X)-1)))

        X_subset = [X[i] for i in r_indexes]
        Y_subset = [Y[i] for i in r_indexes]

        vec_clf.fit(X_subset, Y_subset)

        return vec_clf

    def train_dataset_crf(self, X, Y, ratio):
        crf = sklearn_crfsuite.CRF(algorithm = 'lbfgs', max_iterations = 100, all_possible_transitions = True)

        params_space = { 'c1': scipy.stats.expon(scale = 0.5), 'c2': scipy.stats.expon(scale = 0.05)}

        import multiprocessing
        cpus = multiprocessing.cpu_count()
        rs = RandomizedSearchCV(crf, params_space, cv = 3, verbose = 1, n_jobs = cpus-1, n_iter = 50)

        assert len(X) == len(Y)

        # subset of indexes to use in training; randint samples with replacement
        # and high is exclusive, so the last index is never drawn
        r_indexes = randint(low = 0, high = len(X)-1, size = round(ratio*(len(X)-1)))

        X_subset = [X[i] for i in r_indexes]
        Y_subset = [Y[i] for i in r_indexes]

        rs.fit(X_subset, Y_subset)

        return rs

    def train_NER_model(self, train_folder, kernel = 'linear', ratio = 1, classifier = 1):
        if not isdir('models'):
            makedirs('models')

        model_name = ""
        model_index = 0
        model_names = [join(abspath("models"), f) for f in listdir(abspath("models"))]

        if train_folder == 1:
            path = get_file_full_path("drug_bank_train.pkl", pickled_files)
            self.set_path(path)
            drugbank_models = list(filter(lambda x: contains(x, 'drugbank_model_'), model_names))
            model_index = len(drugbank_models) # save next model
            model_name = 'models/ner_drugbank_model_'+str(model_index)+'.pkl'
            print("Started training NER Drugbank model...")
        elif train_folder == 2:
            path = get_file_full_path("medline_train.pkl", pickled_files)
            self.set_path(path)
            medline_models = list(filter(lambda x: contains(x, 'ner_medline_model_'), model_names))
            model_index = len(medline_models) # save next model
            model_name = 'models/ner_medline_model_'+str(model_index)+'.pkl'
            print("Started training NER Medline model...")
        else:
            raise ValueError('train_folder value should be 1 - drugbank, or 2 - medline')

        if classifier == 2:
            featured_sent_dict = self.split_dataset_crf()
            X_train = [f[0] for f in featured_sent_dict]
            Y_train = [f[1] for f in featured_sent_dict]
            clf = self.train_dataset_crf(X_train, Y_train, ratio)
        else:
            # we ignore Y_ddi classes since they are not used for NER model training
            X_train, Y_train, Y_ddi, metadatas = self.split_dataset()
            clf = self.train_dataset_svm(X_train, Y_train, kernel, ratio)

        joblib.dump(clf, model_name)
        print("\nNER Model trained and saved into", model_name)

    def test_NER_model(self, model_index, test_folder, classifier = 1):
        model_name = ""
        predictions_name = ""
        if test_folder == 1:
            model_name = 'models/ner_drugbank_model_'+str(model_index)+'.pkl'
            predictions_name = 'predictions/ner_drugbank_model_'+str(model_index)+'.txt'
            path = get_file_full_path("drug_bank_ner_test.pkl", pickled_files)
            self.set_path(path)
            print("Testing NER Drugbank model", model_index,"...")
        elif test_folder == 2:
            model_name = 'models/ner_medline_model_'+str(model_index)+'.pkl'
            predictions_name = 'predictions/ner_medline_model_'+str(model_index)+'.txt'
            path = get_file_full_path("medline_ner_test.pkl", pickled_files)
            self.set_path(path)
            print("Testing NER Medline model", model_index,"...")
        else:
            raise ValueError('test_folder value should be 1 - drugbank, or 2 - medline')

        vec_clf = joblib.load(model_name)

        if classifier == 2:
            featured_sent_dict = self.split_dataset_crf()
            X_test = [f[0] for f in featured_sent_dict]
            Y = [f[1] for f in featured_sent_dict] # ner classes
            Y_test = []
            for y in Y:
                for y_test in y:
                    Y_test.append(y_test)

            sentence_metadatas = [f[3] for f in featured_sent_dict]
            metadatas = []
            for sentence_metadata in sentence_metadatas:
                for token_metadata in sentence_metadata:
                    metadatas.append(token_metadata)
        else:
            # metadatas are of type: sentenceId | offsets... | text | type
            # we ignore Y_ddi classes since they are not used for NER model testing
            X_test, Y_test, Y_ddi, metadatas = self.split_dataset()

        if classifier == 2:
            preds = vec_clf.predict(X_test)
            predictions = []
            for pred in preds:
                for prediction in pred:
                    predictions.append(prediction)
        else:
            predictions = vec_clf.predict(X_test)

        assert len(predictions) == len(Y_test) == len(metadatas)

        if not isdir('predictions'):
            makedirs('predictions')

        # opening in 'w' mode already truncates any previous predictions file
        with open(predictions_name, 'w') as pr_f:
            for i, pred in enumerate(predictions):
                metadata = metadatas[i]
                # only predictions tagged B_type are written; the drug type follows the "B_" prefix
                if pred[0] == 'B':
                    line = metadata[0] + '|' + metadata[1] + '|' + metadata[2] + '|' + pred[2:]
                    pr_f.write(line + '\n')

        print("\nNER Predictions are saved in file", predictions_name)

    def train_DDI_model(self, train_folder, kernel = 'linear', ratio = 1, classifier = 1):
        if not isdir('models'):
            makedirs('models')

        model_name = ""
        model_index = 0
        model_names = [join(abspath("models"), f) for f in listdir(abspath("models"))]
        if train_folder == 1:
            path = get_file_full_path("drug_bank_train.pkl", pickled_files)
            self.set_path(path)
            drugbank_ddi_models = list(filter(lambda x: contains(x, 'ddi_drugbank_model_'), model_names))
            model_index = len(drugbank_ddi_models) # save next model
            model_name = 'models/ddi_drugbank_model_'+str(model_index)+'.pkl'
            print("Started training DDI Drugbank model...")
        elif train_folder == 2:
            path = get_file_full_path("medline_train.pkl", pickled_files)
            self.set_path(path)
            medline_ddi_models = list(filter(lambda x: contains(x, 'ddi_medline_model_'), model_names))
            model_index = len(medline_ddi_models) # save next model
            model_name = 'models/ddi_medline_model_'+str(model_index)+'.pkl'
            print("Started training DDI Medline model...")
        else:
            raise ValueError('train_folder value should be 1 - drugbank, or 2 - medline')

        if classifier == 2:
            featured_sent_dict = self.split_dataset_crf()
            X_train = [f[0] for f in featured_sent_dict]
            Y_train = [f[2] for f in featured_sent_dict] # ddi classes
            clf = self.train_dataset_crf(X_train, Y_train, ratio)
        else:
            # we ignore Y_ner classes since they are not used for DDI model training
            X_train, Y_ner, Y_train, metadatas = self.split_dataset()
            clf = self.train_dataset_svm(X_train, Y_train, kernel, ratio)

        joblib.dump(clf, model_name)
        print("\nDDI Model trained and saved into", model_name)

    def test_DDI_model(self, model_index, test_folder, classifier):
        model_name = ""
        predictions_name = ""
        if test_folder == 1:
            model_name = 'models/ddi_drugbank_model_'+str(model_index)+'.pkl'
            predictions_name = 'predictions/ddi_drugbank_model_'+str(model_index)+'.txt'
            path = get_file_full_path("drug_bank_ddi_test.pkl", pickled_files)
            self.set_path(path)
            print("Testing DDI Drugbank model", model_index,"...")
        elif test_folder == 2:
            model_name = 'models/ddi_medline_model_'+str(model_index)+'.pkl'
            predictions_name = 'predictions/ddi_medline_model_'+str(model_index)+'.txt'
            path = get_file_full_path("medline_ddi_test.pkl", pickled_files)
            self.set_path(path)
            print("Testing DDI Medline model", model_index,"...")
        else:
            raise ValueError('test_folder value should be 1 - drugbank, or 2 - medline')

        vec_clf = joblib.load(model_name)

        if classifier == 2:
            featured_sent_dict = self.split_dataset_crf()
            X_test = [f[0] for f in featured_sent_dict]
            Y = [f[2] for f in featured_sent_dict] # ddi classes
            Y_test = []
            for y in Y:
                for y_test in y:
                    Y_test.append(y_test)

            sentence_metadatas = [f[3] for f in featured_sent_dict]
            metadatas = []
            for sentence_metadata in sentence_metadatas:
                for token_metadata in sentence_metadata:
                    metadatas.append(token_metadata)
        else:
            # metadatas are of type: sentenceId | offsets... | text | type
            # we ignore Y_ner classes since they are not used for DDI model testing
            X_test, Y_ner, Y_test, metadatas = self.split_dataset()

        if classifier == 2:
            preds = vec_clf.predict(X_test)
            predictions = []
            for pred in preds:
                for prediction in pred:
                    predictions.append(prediction)
        else:
            predictions = vec_clf.predict(X_test)

        assert len(predictions) == len(Y_test) == len(metadatas)

        if not isdir('predictions'):
            makedirs('predictions')

        # opening in 'w' mode already truncates any previous predictions file
        with open(predictions_name, 'w') as pr_f:
            for i, pred in enumerate(predictions):
                metadata = metadatas[i]
                # predictions look like 1_type: pred[0] is the DDI flag and pred[2:] the interaction type
                if len(metadata[5]) > 0 and len(metadata[6]) > 0:
                    # some sentence ids are malformed; keep only ids that start with 'DDI-'
                    if metadata[0][:4] == 'DDI-':
                        line = metadata[0]+'|'+metadata[5]+'|'+metadata[6]+'|'+pred[0]+'|'+pred[2:]
                        pr_f.write(line + '\n')

        print("\nDDI Predictions are saved in file", predictions_name)

```
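
Both training methods above pick a random subset of the data according to `ratio` by drawing indexes with `randint`, which samples with replacement, so some examples may be repeated while others are never selected. As a possible alternative, here is a minimal sketch of sampling without replacement using only the standard library. The helper name `sample_subset` and its `seed` parameter are hypothetical and not part of the project code.

```python
import random


def sample_subset(X, Y, ratio, seed=None):
    """Return a random subset of (X, Y) pairs, drawn without replacement.

    Hypothetical helper: `ratio` in (0, 1] is the fraction of examples to keep.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same length")
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)
    # keep at least one example when the data is non-empty
    size = min(len(X), max(1, round(ratio * len(X))))
    # sample() draws distinct indexes, so no example is repeated
    indexes = rng.sample(range(len(X)), size)
    return [X[i] for i in indexes], [Y[i] for i in indexes]


# toy feature dicts and tags, only for illustration
X = [{"word": w} for w in ["aspirin", "and", "warfarin", "interact"]]
Y = ["B_drug", "O", "B_drug", "O"]
X_sub, Y_sub = sample_subset(X, Y, ratio=0.5, seed=0)
print(len(X_sub), len(Y_sub))  # prints "2 2"
```

With `ratio = 1` this returns every example exactly once in shuffled order, which the `randint` approach does not guarantee.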