# AHLT - First delivery - Task 9.1 NERC
**Albert Rial**   
**Karen Lliguin**   


This delivery consists of solving the task 9.1 of the SemEval-2013 challenge. The task concerns the named entity recognition and classification of drug names.  

The dataset provided contains XML files with sentences and the entities appearing on it and their corresponding type. There are four general types: drug, brand, group and drug_n. The data is already splitted in three subsets: Train, Devel and Test.

We are also provided with some external resources containing knowledge extracted from other databases (DrugBank, HSDB) and with evaluation scripts.

To do so, we use different methods and resources and we divide the task in different subtasks/goals.

## Goal 1: Rule-based, no external knowledge

### Introduction
First, we develop a simple rule-based baseline system to carry out the task. In this first version we only use the information from the Train dataset and we do not use external knowledge. 

With this we want to achieve an overall F1 score of at least 0.5 on the Devel dataset.

### Data exploration
To do this system, first we do a data exploration to know which common characteristics have the drugs in each type.

Given only the Train dataset, we analyze the following aspects:
- The most common words that appear after and before each type of drug.
- The most common prefixes and sufixes of each type of drug (given different number of characters).
- The most common drug entities of each type.
    
You can find the full code of the data exploration in the jupyter notebook called *data_exploration.ipynb*.

### Details
The main function of our system is *extract_entities(tokens)*, where, given a list with the tokens of a sentence and taking into account the rules defined, we return a list of the entities we found and their type.

Basically we loop over all the tokens and using the following rules we try to recognize and classify the drugs appearing. The rules are:
- Sufixes and prefixes: we use the most common prefixes and sufixes found on the data exploration for each type of drug. If we find a word with an specific prefix or sufix that is in our list of common prefixes and sufixes of a type, we consider the word as an entity and we classify it in the corresponding type.
- For the types drug_n and brand we also check if the word is uppercase. If is uppercase and the length is greater than 4, we classify the word as brand. Otherwise, if the word is uppercase and the length is less or equal than 4, we classify it as drug_n.
- As in data exploration we observed that type group contained several drug names formed by two words, we add a rule taking care of checking whether two consequent words are classified as drug, in order to distinguish this type of drug names. 
- As drug_n still showed a low F1 score, a rule checking whether the word contain the character "-" or digits is added, as from the previous analysis this pattern was observed in the names of the drugs of that partucular type. 

In [1]:
def extract_entities(tokens):
    entities = [] # Output list of entities
    prev_drug = "" # Previous drug found (if any)
    
    for i in range(len(tokens)):
        token = tokens[i]
        word = token[0]
        
        # Prefix and sufix rules
        drug_prefixes = ('pheny', 'digox', 'warfa', 'meth', 'theophy', 'lith', 'keto', 'cime', 'insu', 'fluox', 'alcoh', 'cyclos', 'eryth', 'carba', 'rifa', 'caffe')
        drug_sufixes = ('pitant', 'dine', 'azole', 'mide', 'pine', 'line', 'mine', 'tine', 'arin', 'avir', 'azem', 'rine', 'rone', 'arbital', 'olol', 'afil', 'inol', 'zolam')
        
        group_prefixes = ('benzo', 'beta', 'antico', 'antide', 'antibi', 'antihi', 'nsai', 'contra')
        group_sufixes = ('steroids','tics', 'ants', 'ents', 'tors', 'acid', 'acids', 'ceptives', 'gens', 'pines', 'lines', 'mines')
        
        brand_prefixes = ('aspi', 'accu', 'beza', 'star', 'exja')
        brand_sufixes = ('tane', 'dine', 'anil')
        
        drug_n_prefixes = ('ibog', 'endo')
        drug_n_sufixes = ('ate', 'sin', 'toxin', 'orfon')
        
        # Rules for drug type
        if word.lower().startswith(drug_prefixes) or word.lower().endswith(drug_sufixes):
            entities.append({'name':word, 'offset': str(token[1])+'-'+str(token[2]),'type':'drug'})
            prev_drug = str(word)+ " "+str(token[1])+" "+"drug"
        
        # Rules for group type
        elif word.lower().startswith(group_prefixes) or word.lower().endswith(group_sufixes):
            if prev_drug != '':
                info = prev_drug.split(" ")
                if len(entities) > 0 and info[2]=='group':
                    entities.pop()
                    entities.append({'name':str(info[0])+word, 'offset': str(info[1])+'-'+str(token[2]),'type':'group'})
            else:
                entities.append({'name':word, 'offset': str(token[1])+'-'+str(token[2]),'type':'group'})
            prev_drug = str(word)+ " "+str(token[1])+ " "+"group"
        
        # Rules for brand type
        elif (word.isupper() and len(word)>4) or word.lower().startswith(brand_prefixes) or word.lower().endswith(brand_sufixes):
            entities.append({'name':word, 'offset': str(token[1])+'-'+str(token[2]),'type':'brand'})
            prev_drug = str(word)+ " "+str(token[1])+ " "+ "brand"
        
        # Rules for drug_n type
        elif word.isupper() or word.lower().startswith(drug_n_prefixes) or word.lower().endswith(drug_n_prefixes)\
        or (bool(re.search(r'\d', word)) and '-' in word):
            entities.append({'name':word, 'offset': str(token[1])+'-'+str(token[2]),'type':'drug_n'})
            prev_drug = str(word)+ " "+str(token[1])+ " "+"drug_n"
            
        else:
            prev_drug = ""
        
    return entities

### Results

#### Devel

(The results of the analysis are contaied in the notebook _data_exploration_ on the Apendix folder)

```
The evaluator output for this version on devel is shown below

SCORES FOR THE GROUP: develGoal1 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
773	292	0	706	496	1771	0.5	0.44	0.46



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
815	250	0	706	496	1771	0.52	0.46	0.49



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
815	0	250	706	496	1771	0.52	0.53	0.53



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
904	161	0	706	496	1771	0.58	0.51	0.54



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
511	14	0	520	45	1045	0.9	0.49	0.63


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
115	0	0	65	25	180	0.82	0.64	0.72


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
120	117	0	217	83	454	0.38	0.26	0.31


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
27	0	0	65	33	92	0.45	0.29	0.36


MACRO-AVERAGE MEASURES:
P	R	F1
0.64	0.42	0.5
```

#### Test

The evaluator output for this version on test is shown below

```
SCORES FOR THE GROUP: testGoal1 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
214	109	0	363	331	686	0.33	0.31	0.32



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
242	81	0	363	331	686	0.37	0.35	0.36



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
242	0	81	363	331	686	0.37	0.41	0.39



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
252	71	0	363	331	686	0.39	0.37	0.38



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
124	8	0	219	31	351	0.76	0.35	0.48


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
26	0	0	33	1	59	0.96	0.44	0.6


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
49	16	0	90	48	155	0.43	0.32	0.37


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
15	15	0	91	34	121	0.23	0.12	0.16


MACRO-AVERAGE MEASURES:
P	R	F1
0.6	0.31	0.4
```

### Conclusions

## Goal 2 : Rule-based, using external knowledge

### Introduction
Once accomplished the first goal we use the rule-based system defined but also using information from external knowledge sources.

The goal is to obtain a F1 score of at least 0.6 on the Devel dataset.

### Details
The most part of the code is reused from the first approach. The only modifications done are:
- extract_drug_bank(drug_bank_path): function where we read the DrugBank dataset and we store in a dictionary the drug name and the drug type given by the external dataset.
- extract_entities(tokens): for each word we do:
    - First, we check if the word is inside the DrugBank dataset. If it exists there, we classify it according to the type specified by the DrugBank.
    - If it is not present, we check the drug_n rules that we had in our first approach. We do this because there is not any drug_n inside the DrugBank and is the class with worst F1 score.
    - Finally, as a lot of drug names inside the DrugBank have more than one word, we check if the union of the word and its previous words appear in the DrugBank. In case they appear there, we add all the words as a single entity and with the corresponding type.

In [None]:
def extract_drug_bank(drug_bank_path):
    drug_bank = {}
    with open(drug_bank_path, encoding="utf8") as f:
        for line in f:
            data = line.strip().split('|')
            drug_name = data[0]
            drug_type = data[1]
            drug_bank[drug_name.lower()] = drug_type
    return drug_bank

drug_bank = extract_drug_bank(drug_bank_path)

In [3]:
def extract_entities(tokens):
    entities = []
    
    for i in range(len(tokens)):
        token = tokens[i]
        word = token[0]
        
        drug_n_prefixes = ('18-m', 'ibog', 'endo', 'toxi')
        drug_n_sufixes = ('ine', 'ate', '8-mc', 'sin', 'xin', 'pge2', 'mhd')
        
        # Check if single word is in bank
        if word.lower() in drug_bank:
            entities.append({'name':word, 'offset': str(token[1])+'-'+str(token[2]),'type':drug_bank[word.lower()]})
        
        # Check drug_n rules
        elif word.isupper() or word.lower().startswith(drug_n_prefixes) or word.lower().endswith(drug_n_prefixes):
            entities.append({'name':word, 'offset': str(token[1])+'-'+str(token[2]),'type':'drug_n'})
        
        # Check if multiple consecutive words appear in the drug bank as a single drug
        else:
            for j in range(1, 5):
                if i>=j-1:
                    words_joined  = ' '.join([t[0] for t in tokens[i-j:i+1]])   
                    if words_joined.lower() in drug_bank:
                        entities.append({'name':words_joined, 'offset': str(tokens[i-j][1])+'-'+str(tokens[i][2]),'type':drug_bank[words_joined.lower()]})
    return entities


### Results
#### Devel
```
SCORES FOR THE GROUP: develGoal2 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
1177	350	0	244	524	1771	0.57	0.66	0.62



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
1270	257	0	244	524	1771	0.62	0.72	0.66



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
1270	0	257	244	524	1771	0.62	0.79	0.69



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
1291	236	0	244	524	1771	0.63	0.73	0.68



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
866	44	0	135	76	1045	0.88	0.83	0.85


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
170	3	0	7	32	180	0.83	0.94	0.88


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
115	52	0	287	52	454	0.53	0.25	0.34


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
26	15	0	51	37	92	0.33	0.28	0.31


MACRO-AVERAGE MEASURES:
P	R	F1
0.64	0.58	0.6
```

### Test
```
SCORES FOR THE GROUP: testGoal2 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
401	160	0	125	334	686	0.45	0.58	0.51



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
482	79	0	125	334	686	0.54	0.7	0.61



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
482	0	79	125	334	686	0.54	0.76	0.63



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
439	122	0	125	334	686	0.49	0.64	0.56



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
299	18	0	34	62	351	0.79	0.85	0.82


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
50	4	0	5	8	59	0.81	0.85	0.83


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
41	10	0	104	21	155	0.57	0.26	0.36


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
11	6	0	104	31	121	0.23	0.09	0.13


MACRO-AVERAGE MEASURES:
P	R	F1
0.6	0.51	0.53
```

## Goal 3 : ML, no external knowledge

After trying a rule-based system and viewing its limitations, in this goal we try to solve the same problem (NERC) using machine learning techniques.

Specifically we use a CRF model to solve the task. We do the following steps:
- Extract and define different features to encode the data.
- Train and tune a model with the obtained feature vectors of the Train dataset.
- Test the model using Devel and Test datasets.

### Data exploration
To do this system, we reused the data exploration done on the first goal and we defined most of the features based on that. In the next section we will detail all the features used (the ones finally used and the ones tested but discarded).

### Details
In the following function *extract_features(tokens, pos_tag)* we receive the tokens of a sentence and their POS tag, and we extract different features to encode the data.

The features defined are:
- Word in lowercase
- Prefixes and sufixes (from 2 characters to 5)
- Word length
- POS tag of the word
- First character of the POS tag
- Booleans indicating if the word is uppercase, if contains uppercase and lowercase characters, if contains digits, if contains dash, if is only composed by letters and if is title.

Other features have been tested but it have been discarded because they were not improving the performance. Some of them are:
- Booleans indicating if the word is a punctuation, if contains special characters and if it starts with a digit.
- Similar features as the ones above but with information about the previous and next word (previous and next word in lowercase, length of both, their POS tag, ...)

In [None]:
def extract_features(tokens, pos_tag):
    features = []
    
    for i in range(len(tokens)):
        token = tokens[i]
        word = token[0]
        
        lower = re.compile(r'.*[a-z]+')
        upper = re.compile(r'.*[A-Z]+')
        
        feature_vector = [
            'word.lower=' + word.lower(),
            'word[-5:]=' + word[-5:],
            'word[-4:]=' + word[-4:],
            'word[-3:]=' + word[-3:],
            'word[-2:]=' + word[-2:],
            'word[:2]=' + word[:2],
            'word[:3]=' + word[:3],
            'word[:4]=' + word[:4],
            'word[:5]=' + word[:5],
            'word.length=%s' % len(word),
            'word.isupper=%s' % word.isupper(),
            'word.isupperandlower=%s' % bool(lower.match(word) and upper.match(word)),
            'word.containdigit=%s' % bool(re.search(r'\d', word)),
            'word.containdash=%s' % ('-' in word),
            'word.postag=' + pos_tag[i],
            'word.postag_1=' + pos_tag[i][0],
            'word.isalpha=%s' % word.isalpha(),
            'word.istitle=%s' % word.istitle()
        ]
            
        features.append(feature_vector)
      
    return features

We need a function that is able to read the feature vectors generated by the *extract_features* function and the ground truth of each vector. For that reason we define *read_features_and_classes(inputfile)* function.

Basically it receives the path of the input file containing all the feature vectors. For a given sentence id it reads all its features and it appends it in a feature vector. Finally it returns all the features, a vector containing all feature_vectors (one for sentence), and the classes of each feature_vector.

In [None]:
def read_features_and_classes(inputfile):
    features = []
    classes = []
    prev_sent_id = ''
    with open(inputfile) as f:
        for i, line in enumerate(f):
            saved_features = line.split()
            sent_id = saved_features[0]
            
            if i == 0:
                feature_vector = []
                classes_vector = []
                feature_vector.append(saved_features[5:])
                classes_vector.append(saved_features[4])
            
            elif sent_id == prev_sent_id:
                feature_vector.append(saved_features[5:])
                classes_vector.append(saved_features[4])
            
            else:
                features.append(feature_vector)
                classes.append(classes_vector)
                feature_vector = []
                classes_vector = []
            
            prev_sent_id = sent_id
    
    return features, classes    

The function defined above

In [None]:
def train(features_file, model_name):
    # Get features of train data
    features_train, gs_train = read_features_and_classes(features_file)

    crf = pycrfsuite.Trainer(algorithm='pa', verbose=False)
    
    params = {
        'c': 0.21600273890535607,
        'epsilon': 0.004802939229551229,
        'type': 2,
        'feature.possible_transitions': True,
        'feature.possible_states': True,
        'max_iterations': 100
    }
    
    crf.set_params(params)
    
    for xseq, yseq in zip(features_train, gs_train):
        crf.append(xseq, yseq)

    crf.train(model_name)
    
    return

In [None]:
def read_features(sent_id, inputfile):
    features = []
    with open(inputfile) as f:
        features = [line.split()[5:] for line in f if line.split()[0] == sent_id]
        
    return features    

In [None]:
def predict_classes(tagger, features):
    classes = []
    for ch in tagger.tag(features):
        classes.append(ch)
    return classes

In [None]:
def output_entities(sent_id, tokens, classes, outf):
    B_indices = [i for i in range(len(classes)) if classes[i].startswith('B')]
    for b in B_indices:
        I_indices = []
        i = b + 1
        while i < len(classes) and classes[i].startswith('I'):
            I_indices.append(i)
            i+=1
        
        if len(I_indices) == 0:
            outf.write(sent_id+'|'+str(tokens[b][1])+'-'+str(tokens[b][2])+'|'+tokens[b][0]+'|'+classes[b][2:])
        else:
            joined_tokens = ' '.join([tokens[j][0] for j in [b] + I_indices])
            outf.write(sent_id+'|'+str(tokens[b][1])+'-'+str(tokens[I_indices[-1]][2])+'|'+joined_tokens+'|'+classes[b][2:])
        
        outf.write("\n")   
    return

### Results
#### Devel
```
SCORES FOR THE GROUP: develGoal3 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
1458	101	0	212	65	1771	0.9	0.82	0.86



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
1512	47	0	212	65	1771	0.93	0.85	0.89



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
1512	0	47	212	65	1771	0.93	0.87	0.9



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
1500	59	0	212	65	1771	0.92	0.85	0.88



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
925	3	0	117	39	1045	0.96	0.89	0.92


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
152	0	0	28	4	180	0.97	0.84	0.9


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
354	39	0	61	22	454	0.85	0.78	0.81


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
27	0	0	65	1	92	0.96	0.29	0.45


MACRO-AVERAGE MEASURES:
P	R	F1
0.94	0.7	0.77
```

#### Test
```
SCORES FOR THE GROUP: testGoal3 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
463	81	0	142	48	686	0.78	0.67	0.72



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
510	34	0	142	48	686	0.86	0.74	0.8



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
510	0	34	142	48	686	0.86	0.77	0.81



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
477	67	0	142	48	686	0.81	0.7	0.75



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
288	6	0	57	25	351	0.9	0.82	0.86


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
40	0	0	19	0	59	1	0.68	0.81


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
126	9	0	20	11	155	0.86	0.81	0.84


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
9	1	0	111	1	121	0.82	0.07	0.14


MACRO-AVERAGE MEASURES:
P	R	F1
0.9	0.6	0.66
```


## Goal 4 : ML, using external knowledge

### Introduction


### Details

In [None]:
def extract_drug_bank(drug_bank_path):
    drug_bank = {}
    positions_drug_bank = {}
    with open(drug_bank_path, encoding="utf8") as f:
        for line in f:
            data = line.strip().split('|')
            drug_name = data[0].lower()
            drug_type = data[1]
            drug_bank[drug_name] = drug_type
            
            for i in range(len(drug_name.split())):
                positions_drug_bank[drug_name[i]] = i
                
    return drug_bank, positions_drug_bank

drug_bank, positions_drug_bank = extract_drug_bank(drug_bank_path)

In [None]:
def extract_HSDB(HSDB_path):
    HSDB = []
    with open(HSDB_path, encoding="utf8") as f:
        for line in f:
            data = line.strip()
            HSDB.append(data.lower())
    return HSDB

HSDB = extract_HSDB(HSDB_path)

In [None]:
def extract_features(tokens, pos_tag):
    features = []
    
    for i in range(len(tokens)):
        token = tokens[i]
        word = token[0]
        
        lower = re.compile(r'.*[a-z]+')
        upper = re.compile(r'.*[A-Z]+')
        
        feature_vector = [
            'word.lower=' + word.lower(),
            'word[-5:]=' + word[-5:],
            'word[-4:]=' + word[-4:],
            'word[-3:]=' + word[-3:],
            'word[-2:]=' + word[-2:],
            'word[:2]=' + word[:2],
            'word[:3]=' + word[:3],
            'word[:4]=' + word[:4],
            'word[:5]=' + word[:5],
            'word.length=%s' % len(word),
            'word.isupper=%s' % word.isupper(),
            'word.isupperandlower=%s' % bool(lower.match(word) and upper.match(word)),
            'word.containdigit=%s' % bool(re.search(r'\d', word)),
            'word.containdash=%s' % ('-' in word),
            'word.postag=' + pos_tag[i],
            'word.postag_1=' + pos_tag[i][0],
            'word.specialchar=%s' % bool(re.search('^[a-zA-Z0-9]*$',word)),
            'word.isalpha=%s' % word.isalpha(),
            'word.istitle=%s' % word.istitle(),
            'word.startswithdigit=%s' % word[0].isdigit(),
            'word.inbank=%s' % (word.lower() in drug_bank.keys()),
            'word.inHSDB=%s' % (word.lower() in HSDB),
            'word.stopword=%s' % (word.lower() in stopwords),
        ]
        
        if word.lower() in positions_drug_bank.keys():
                feature_vector.append('word.position_inbank=%s' %(positions_drug_bank[word.lower()]))
        
        if word.lower() in drug_bank.keys():
                feature_vector.append('word.type_inbank=' + drug_bank[word.lower()])
        
            
        features.append(feature_vector)
        
    return features

In [None]:
def train(features_file, model_name):
    # Get features of train data
    features_train, gs_train = read_features_and_classes(features_file)
    
    crf = pycrfsuite.Trainer(algorithm='pa', verbose=False)

    for xseq, yseq in zip(features_train, gs_train):
        crf.append(xseq, yseq)

    crf.train(model_name)
    
    return

### Results
#### Devel
```
SCORES FOR THE GROUP: develGoal4 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
1511	74	0	186	56	1771	0.92	0.85	0.89



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
1538	47	0	186	56	1771	0.94	0.87	0.9



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
1538	0	47	186	56	1771	0.94	0.88	0.91



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
1552	33	0	186	56	1771	0.95	0.88	0.91



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
963	4	0	78	20	1045	0.98	0.92	0.95


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
171	0	0	9	1	180	0.99	0.95	0.97


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
351	37	0	66	21	454	0.86	0.77	0.81


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
26	0	0	66	1	92	0.96	0.28	0.44


MACRO-AVERAGE MEASURES:
P	R	F1
0.95	0.73	0.79
```

#### Test
```
SCORES FOR THE GROUP: testGoal4 RUN=1

Strict matching (boundaries + type)
cor	inc	par	mis	spu	total	prec	recall	F1
485	84	0	117	38	686	0.8	0.71	0.75



Exact matching
cor	inc	par	mis	spu	total	prec	recall	F1
543	26	0	117	38	686	0.89	0.79	0.84



Partial matching
cor	inc	par	mis	spu	total	prec	recall	F1
543	0	26	117	38	686	0.89	0.81	0.85



type matching
cor	inc	par	mis	spu	total	prec	recall	F1
500	69	0	117	38	686	0.82	0.73	0.77



SCORES FOR ENTITY TYPE
Exact matching on drug
cor	inc	par	mis	spu	total	prec	recall	F1
308	6	0	37	27	351	0.9	0.88	0.89


Exact matching on brand
cor	inc	par	mis	spu	total	prec	recall	F1
48	0	0	11	0	59	1	0.81	0.9


Exact matching on group
cor	inc	par	mis	spu	total	prec	recall	F1
118	10	0	27	11	155	0.85	0.76	0.8


Exact matching on drug_n
cor	inc	par	mis	spu	total	prec	recall	F1
11	1	0	109	1	121	0.85	0.09	0.16


MACRO-AVERAGE MEASURES:
P	R	F1
0.9	0.64	0.69
```
