# Final Assignment

In this assignment, you will build a relation extraction model for scientific articles based on the [ScienceIE dataset](https://scienceie.github.io/) in a group of up to 3 students. This is the same dataset that was used for Assignment 2, where you had to train a model to extract keyphrases. You are welcome to build on code any team member already wrote for Assignment 2.

You will build and train relation extraction models on the ScienceIE dataset. For this, you will also need to preprocess the ScienceIE data into a format suitable for training relation extraction models.

Your mark will depend on:

* your **reasoning behind modelling choices** made
* the correct **implementations** of your relation extraction models, and
* the **performance** of your models on a held-out test set.

To develop your model you have access to:

* The data in `data/scienceie/`. Remember to un-tar the data.tar.gz file.
* Libraries on the [docker image](https://cloud.docker.com/repository/docker/bjerva/stat-nlp-book) which contains everything in [this image](https://github.com/jupyter/docker-stacks/tree/master/scipy-notebook), including scikit-learn, torch 1.2.0 and tensorflow 1.14.0. 


As with the previous assignment, since we have to run the notebooks of all students, and because writing efficient code is important, your notebook should run in at most 10 minutes on your machine, including package loading time.
Furthermore, you are welcome to provide a saved version of your model with loading code. In this case, loading, testing, and evaluation have to finish within 10 minutes. You can use the dev set to check whether this is the case, and assume that it will be fine for the held-out test set if so.
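
One way to check the time budget is to wrap each expensive step in a timer. The sketch below is only an illustration: `run_timed` and `TIME_LIMIT_SECONDS` are hypothetical names, and the call to `sum` stands in for your own loading, prediction, and evaluation steps.

```python
import time

TIME_LIMIT_SECONDS = 10 * 60  # the 10-minute budget stated above


def run_timed(step, *args, **kwargs):
    """Run step(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = step(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Placeholder workload; replace with e.g. model loading or prediction on the dev set.
result, elapsed = run_timed(sum, range(1000))
print("elapsed: {:.3f}s, within limit: {}".format(elapsed, elapsed < TIME_LIMIT_SECONDS))
```

Summing the elapsed times of all steps gives a rough estimate of the total notebook runtime on your machine.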

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2019/final_assignment/problem/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to. After you have placed it there, **rename the file** to your UCPH ID (of the form `xxxxxx`).

## General Instructions
This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit these**. 
2. **Assessment** Sections: these sections are used for both evaluating the output of your code, and for markers to enter their marks. **Do not edit these**. 
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free text answers simply edit the markdown field.  

Note that you are free to **create additional notebook cells** within a task section. 

**Do not share** this assignment publicly, by uploading it online, emailing it to friends etc. 

**Do not** copy code from the Web or from other students; this will count as plagiarism.

## Submission Instructions

To submit your solution:

* Make sure that your solution is fully contained in this notebook. 
* **Rename this notebook to your UCPH ID** (of the form "xxxxxx"), if you have not already done so.
* Download the notebook in Jupyter via *File -> Download as -> Notebook (.ipynb)*.
* Upload the notebook to Absalon.


## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change it.**

In [1]:
#! SETUP 1
import sys, os
_snlp_book_dir = "../../../../"
sys.path.append(_snlp_book_dir) 
import math
from glob import glob
from os.path import isfile, join
from statnlpbook.vocab import Vocab
from statnlpbook.scienceie import calculateMeasures
import shutil
import string

## <font color='blue'>Task 1</font>: Convert dataset between standoff and IOB format

We want to work with [the ScienceIE dataset](https://scienceie.github.io) that can be found in the `data/scienceie/` directory of the repository.  This dataset comes with **standoff annotation** for keyphrases and relations between them.  This means that for each document in the dataset, there are two files: a `.txt` file with the raw sentences, and a `.ann` file with the annotated keyphrases.  

For example, this is one of the `.txt` files from the training set:

```
Failure of structural components is a major concern in the nuclear power industry and represents not only a safety issue, but also a hazard to economic performance. Stress corrosion cracking (SCC), and especially intergranular stress corrosion cracking (IGSCC), have proved to be a significant potential cause of failures in the nuclear industry in materials such as Alloy 600 (74% Ni, 16% Cr and 8% Fe) and stainless steels, especially in Pressurised Water Reactors (PWR) [1–5]. Stress corrosion cracking in pressurized water reactors (PWSCC) occurs in Alloy 600 in safety critical components, such as steam generator tubes, heater sleeves, pressurized instrument penetrations and control rod drive mechanisms [2,6,7]. Understanding the mechanisms that control SCC in this alloy will allow for continued extensions of life in current plant as well as safer designs of future nuclear reactors.
```

And this is the corresponding `.ann` file:

```
T1	Material 11 32	structural components
T2	Process 0 32	Failure of structural components
T3	Process 254 259	IGSCC
T4	Process 213 252	intergranular stress corrosion cracking
*	Synonym-of T4 T3
T5	Process 165 190	Stress corrosion cracking
T6	Process 192 195	SCC
*	Synonym-of T5 T6
T7	Material 367 376	Alloy 600
T8	Material 378 402	74% Ni, 16% Cr and 8% Fe
*	Synonym-of T7 T8
T9	Material 408 424	stainless steels
T10	Material 440 466	Pressurised Water Reactors
T11	Material 468 471	PWR
T12	Process 480 505	Stress corrosion cracking
T13	Material 509 535	pressurized water reactors
T14	Material 537 542	PWSCC
*	Synonym-of T13 T14
T15	Material 554 563	Alloy 600
T16	Material 603 624	steam generator tubes
T17	Material 626 640	heater sleeves
T18	Material 642 677	pressurized instrument penetrations
T19	Material 682 710	control rod drive mechanisms
T20	Material 762 765	SCC
T21	Material 774 779	alloy
T22	Material 835 840	plant
T23	Task 852 892	safer designs of future nuclear reactors
T24	Material 876 892	nuclear reactors
T25	Material 567 593	safety critical components
R1	Hyponym-of Arg1:T16 Arg2:T25	
R2	Hyponym-of Arg1:T17 Arg2:T25	
R3	Hyponym-of Arg1:T18 Arg2:T25	
R4	Hyponym-of Arg1:T19 Arg2:T25
```
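
In the brat standoff format, the two numbers on each `T` line are character offsets into the `.txt` file, with the end offset exclusive. The short check below illustrates this on the first sentence of the example above; the variable names are chosen for illustration only.

```python
# Beginning of the example .txt file shown above
text = "Failure of structural components is a major concern in the nuclear power industry"

# (id, type, start, end, surface text) copied from the example .ann file
annotations = [
    ("T1", "Material", 11, 32, "structural components"),
    ("T2", "Process", 0, 32, "Failure of structural components"),
]

for ann_id, label, start, end, surface in annotations:
    # Slicing the raw text with the offsets recovers the annotated keyphrase
    assert text[start:end] == surface, ann_id
    print(ann_id, label, repr(text[start:end]))
```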

Note: Besides keyphrases, which you are already familiar with from Assignment 2, the `.ann` files also contain relation annotations labeled `Hyponym-of` and `Synonym-of`. These are relations between keyphrases. 

`Synonym-of` is an undirected relation, meaning that if you see a line like this:

```*	Synonym-of T13 T14```

The order of keyphrases could be swapped, i.e. the following would also hold:

```*	Synonym-of T14 T13```

The evaluation script is therefore agnostic to the order of the two keyphrases in a `Synonym-of` relation.

`Hyponym-of`, on the other hand, is a directed relation, meaning that it is order-sensitive, and that the evaluation script will take the order of keyphrases between which `Hyponym-of` relations hold into account.
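
The difference between the two relation types can be made concrete with a small sketch. The helpers `relation_key` and `prf` are hypothetical and are not the course's evaluation script (`calculateMeasures`); they only illustrate one way to compare predicted and gold relations so that `Synonym-of` ignores argument order while `Hyponym-of` does not.

```python
def relation_key(label, arg1, arg2):
    """Return a hashable key; Synonym-of arguments are sorted so order is ignored."""
    if label == "Synonym-of":
        return (label,) + tuple(sorted((arg1, arg2)))
    return (label, arg1, arg2)


def prf(gold, predicted):
    """Precision, recall and F1 over two iterables of (label, arg1, arg2) triples."""
    gold_keys = {relation_key(*r) for r in gold}
    pred_keys = {relation_key(*r) for r in predicted}
    tp = len(gold_keys & pred_keys)  # relations found in both sets
    precision = tp / len(pred_keys) if pred_keys else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [("Synonym-of", "T13", "T14"), ("Hyponym-of", "T16", "T25")]
pred = [("Synonym-of", "T14", "T13"), ("Hyponym-of", "T25", "T16")]
print(prf(gold, pred))
```

Here the swapped `Synonym-of` prediction counts as correct, while the swapped `Hyponym-of` prediction does not.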

The `.ann` standoff format is **documented in [the brat documentation](http://brat.nlplab.org/standoff.html).**  
You may want to convert the format into some internal representation for training models; however, how you do that is up to you, i.e. you do not have to use IOB format like in Assignment 2. 
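
As one possible internal representation, the sketch below reads `.ann` lines into a dictionary of keyphrases and a list of relation triples. `parse_ann` is a hypothetical helper, not part of the provided code, and it assumes contiguous spans (a single start and end offset per `T` line) as in the example above.

```python
from itertools import combinations


def parse_ann(lines):
    """Split brat standoff lines into keyphrases and relations."""
    entities = {}   # id -> (type, start, end, surface text)
    relations = []  # (label, arg1_id, arg2_id)
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0].startswith("T"):
            # e.g. T1 <TAB> Material 11 32 <TAB> structural components
            label, start, end = fields[1].split(" ")
            entities[fields[0]] = (label, int(start), int(end), fields[2])
        elif fields[0].startswith("R"):
            # e.g. R1 <TAB> Hyponym-of Arg1:T16 Arg2:T25
            label, arg1, arg2 = fields[1].split(" ")
            relations.append((label, arg1.split(":", 1)[1], arg2.split(":", 1)[1]))
        elif fields[0] == "*":
            # e.g. * <TAB> Synonym-of T4 T3 (an equivalence line may list more than two ids)
            label, *args = fields[1].split(" ")
            for arg1, arg2 in combinations(args, 2):
                relations.append((label, arg1, arg2))
    return entities, relations


sample = [
    "T16\tMaterial 603 624\tsteam generator tubes\n",
    "T25\tMaterial 567 593\tsafety critical components\n",
    "*\tSynonym-of T4 T3\n",
    "R1\tHyponym-of Arg1:T16 Arg2:T25\t\n",
]
entities, relations = parse_ann(sample)
print(entities["T16"])
print(relations)
```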

**Further Notes**:
- At training time, you will be provided with plain text documents and `.ann` files with keyphrases and relations.
- At test time, you will be provided with plain text documents and `.ann` files **with keyphrases only**. This is because your task is to predict relations.
- The evaluation script is agnostic to the order of relation triples and to relation ids, but your predictions should preserve the ids of the keyphrases used in the predicted relations. The evaluation script requires the entity annotations to be present in the prediction file as well.
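
To illustrate the last note, the sketch below builds the text of a prediction file by copying the given keyphrase lines unchanged and appending one line per predicted relation. `format_prediction` is a hypothetical helper, and the line layout simply mirrors the example `.ann` file above.

```python
def format_prediction(entity_lines, relations):
    """Return .ann text: the given keyphrase lines, then one line per relation."""
    # Keep the keyphrase (T) lines exactly as provided, so their ids are preserved
    out = [line.rstrip("\n") for line in entity_lines if line.startswith("T")]
    rel_id = 1
    for label, arg1, arg2 in relations:
        if label == "Synonym-of":
            out.append("*\tSynonym-of {} {}".format(arg1, arg2))
        else:
            out.append("R{}\t{} Arg1:{} Arg2:{}".format(rel_id, label, arg1, arg2))
            rel_id += 1
    return "\n".join(out) + "\n"


entity_lines = [
    "T16\tMaterial 603 624\tsteam generator tubes\n",
    "T25\tMaterial 567 593\tsafety critical components\n",
]
predicted = [("Hyponym-of", "T16", "T25")]
print(format_prediction(entity_lines, predicted), end="")
```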

In [2]:
### THIS CELL IS MERELY HELPER FUNCTIONS

import re
import numpy as np

def load_txt_str(filename, datadir):
    with open(join(datadir,filename), 'r') as f: #open the file
        contents = f.readlines()
        assert len(contents) == 1
    return contents[0]

def split_txt_str(text_string):
        newline_striped = text_string.rstrip() #Remove trailing whitespace, including the newline
        split_str = re.split(r'(\W)', newline_striped) #Split on every non-word character, keeping the separators as tokens
            
        #Remove empty string and function application:
        for i in range(len(split_str)-1,-1,-1):
            if split_str[i] in ['','\u2061']:
                del split_str[i]

        #Save the location, so it can be put back.   
        assert(len(split_str) >=1)
        
        str_lengths = np.array([len(token) for token in split_str])
        #assert(len(str_lengths) >= 1)

        begin_char = np.append(np.array([0]),np.cumsum(str_lengths[:-1]))
        end_char = np.cumsum(str_lengths)
        locations = [(begin_char[i],end_char[i]) for i in range(len(begin_char))]
 
        for i in range(len(split_str)-1,-1,-1):
        #Remove space, non-breaking space, zero-width-space, thin space:
            if split_str[i] in [' ','\xa0','\u200b','\u2009']:
                del split_str[i]
                del locations[i]

        return split_str, locations

def load_ann_file(filename, datadir):
    #open the file
    with open(join(datadir,filename), 'r',newline = '\n') as f:
        contents = f.readlines()
    return contents

def split_and_sort_ann(contents):
    
    #Only keep text-bound annotation
    only_entity = [line for line in contents if line[0] == 'T'] 

    splittet = [line.rstrip().split('\t') for line in only_entity]
    for line in splittet:
        line[1] = line[1].split(' ')
        line[1][1] = int(line[1][1])
        line[1][2] = int(line[1][2])
            
    ### Order according to start of label, with highest end coming first
    sort_indices = np.argsort(np.array([line[1][1] + 1/(2+line[1][2]) for line in splittet]))
    sorted_annotation = [splittet[i] for i in sort_indices]
        
    ### Remove double occurrences or overlapping labels from annotation files
    label_number = 1
    while True:
        if label_number >= len(sorted_annotation):
            break

        #If next label starts before the last ends
        if sorted_annotation[label_number-1][1][2] > sorted_annotation[label_number][1][1]:
            del sorted_annotation[label_number]
        else:
            label_number +=1

    
    return sorted_annotation

def ann_to_entities(annotations):
    
    for line in annotations:
        line[2],_ = split_txt_str(line[2])
    
    entity_types = [line[1][0] for line in annotations]
    entity_words = [line[2] for line in annotations]
    ann_names = [line[0] for line in annotations]
    entity_locations = [(int(line[1][1]),int(line[1][2])) for line in annotations]
    
    return entity_types, entity_words, ann_names, entity_locations


In [3]:
## This cell has the relevant functions 'load_scienceie' and 'save_to_ann'

from os import listdir
import re

def load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "train")):
    """
    Load the ScienceIE dataset from a given directory and return it in IOB format.
    Args:
        datadir: The directory to read from, e.g. data/scienceie/train or data/scienceie/dev
    Returns:
        A dictionary keyed by file name (without extension), with entries of the form
        
        data['file_name'] = {'tokens': tokens, #A list of tokens
                             'IOBtags': IOBtags, #A list of IOB tags with the same length as tokens
                             'locations': locations, #A list of tuples with start-position and end-position of every token
                             'annotation_names': ann_names, #A list of names ['T2','T1',...]
                             'relations': relations #A list of tuples such as ('R1','Hyponym','T16','T25') or ('*','Synonym','T4','T3')
                             }
    """

    txt_files  = [f for f in listdir(datadir) if f[-3:] == 'txt']
    
    
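    #The list comprehension itself never fails, so check explicitly that the .ann files exist
    ann_files = [f[:-3] + 'ann' for f in txt_files]
    ann_file_exists = all(isfile(join(datadir, f)) for f in ann_files)
    if not ann_file_exists:
        print('Cannot find annotation files. The returned data set cannot be used for training.')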
    
    #ann_file_exists = False
    
    data = {}
    
    for i in range(len(txt_files)):
        ann_line = 0
        ann_word = 0

        org_text = load_txt_str(txt_files[i],datadir)
        tokens, locations = split_txt_str(org_text)
        
        ### This creates the IOBtags (labels) for the training
        IOBtags = []
        relations = []
        ann_names = None #Stays None when no .ann file is available
        
        if ann_file_exists:
            ann_content = load_ann_file(ann_files[i],datadir)
            sorted_annotation = split_and_sort_ann(ann_content)
            entity_types, entity_words, ann_names, _ = ann_to_entities(sorted_annotation)
            for word in tokens:


                #If we have been through all annotation-lines
                if ann_line >= len(entity_words):
                    IOBtags.append('O')
                    continue #Don't change ann_line /word 

                #There is an error in one of the .ann files
                tmp = entity_words[ann_line][ann_word]
                if tmp == 'echniques':
                    tmp = 'techniques'

                #If this word is not the next we have in annotation
                if word != tmp:
                    IOBtags.append('O')
                    continue #Don't change ann_line /word

                #If this word is the next in annotation
                prefix = 'B' if ann_word == 0 else 'I'
                suffix = entity_types[ann_line]
                IOBtags.append(prefix + '-' + suffix)

                #Increase annotation word and line
                if ann_word < len(entity_words[ann_line])-1:
                    ann_word += 1
                else:
                    ann_line +=1
                    ann_word = 0
                #end-if
            #end-for tokens
            
            #extract synonym/hyponyms and add them to data
            # get */R
            non_entity = [line for line in ann_content if line[0] != 'T']
            non_entity_split = [re.split(r'[\- :\t]',line.rstrip()) for line in non_entity]
            
            remove = ['','of','Arg1','Arg2']
            relations = [tuple(token for token in line if token not in remove) for line in non_entity_split]
            
        data[txt_files[i][:-4]] = {'tokens': tokens,
                                   'IOBtags': IOBtags,
                                   'locations':locations,
                                   'annotation_names': ann_names,
                                   'relations': relations}
        
    return data


def save_to_ann(data, datadfir):
    """
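    Save keyphrase annotations in IOB format back to .ann files.
    Only keyphrase (T) lines are written; entries in 'relations' are not saved.
    Args:
        data: The annotations in IOB format, as returned by load_scienceie
        datadfir: The directory to save to, e.g. data/scienceie/predictions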
    """
    file_names = list(data.keys())
    
    for file_name in file_names:
        file = data[file_name]
        
        #A 'B-...' tag always starts a new keyphrase; an 'I-...' tag that follows 'O' or a different type is treated as 'B-...'
        
        tokens = np.array(file['tokens'])
        IOBtags = np.array(file['IOBtags'])
        locations = np.array(file['locations'])

        #Put into one
        ann_tags = []
        ann_names = file['annotation_names'] #Maybe None
        ann_locations = []
        ann_entities = []

        #IOBtag (str), locations (tuple), entity (list with start stop at every)
        
        tmp = []
        last_tag = 'O'
        
        for index in range(len(tokens)):
            
            tag = IOBtags[index]
            word = tokens[index]
            location = tuple(locations[index])
            
            if tag == 'O':
                pass
            elif tag[0] == 'B':
                ann_tags.append(tag[2:])
                tmp.append([[word,location]])
            elif tag[0] == 'I' and last_tag == 'O': #Begin new_line
                ann_tags.append(tag[2:])
                tmp.append([[word,location]])
            elif tag[0] == 'I' and last_tag[2:] != tag[2:]: #Begin new_line
                ann_tags.append(tag[2:])
                tmp.append([[word,location]])
            elif tag[0] == 'I' and last_tag[2:] == tag[2:]: #Continue this line
                tmp[-1].append([word,location])
            else:
                print('Tag: ',tag)
                print('Last_tag: ',last_tag)
                raise Exception("This should not happen")
                
            last_tag = tag


        sentences = []
        for sentence in tmp:
            ann_locations.append((sentence[0][1][0],sentence[-1][1][1]))

            entity_str = ''

            n=0

            while True:
                entity_str += sentence[n][0]
                n +=1

                if len(sentence) <= n:
                    break

                n_spaces = sentence[n][1][0] - sentence[n-1][1][1]
                
                #Gaps other than 0 or 1 characters are written as a single space
                if n_spaces not in (0, 1):
                    n_spaces = 1

                entity_str += ' ' * n_spaces

            sentences.append(entity_str)

        if ann_names is None:
            ann_names = ['T' + str(i) for i in range(1,len(sentences)+1)]
        
        ann_sorted = [[ann_names[i],
                       [ann_tags[i],ann_locations[i][0],ann_locations[i][1]],
                       sentences[i]] for i in range(len(ann_names))]
        
        #Re-sort it back. (Only makes difference if we have ann_names)
        sort_indices = np.argsort(np.array([int(line[0][1:]) for line in ann_sorted]))
        org_sorting = [ann_sorted[i] for i in sort_indices]
        
        #Concatenate 
        for line in org_sorting:
            line[1] = line[1][0] + ' ' + str(line[1][1]) + ' ' + str(line[1][2])

        #Save the bloody thing
        prediction_dir = datadfir
        if not os.path.exists(prediction_dir):
            os.makedirs(prediction_dir)

        fname = join(datadfir,file_name + '.ann')
        with open(fname,'w') as file:
            for line in org_sorting:
                tmp = '\t'.join(line)
                tmp += '\n'
                file.write(tmp)

    return

In [4]:
### JUST SHOWS WHAT THE OUTPUT LOOKS LIKE
import pprint
pp = pprint.PrettyPrinter(indent=1,depth = None, compact = True).pprint

dev_data = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "dev"))
#the full file
pp(dev_data['S0021999113005846'])


{'IOBtags': ['O', 'O', 'B-Task', 'I-Task', 'I-Task', 'I-Task', 'O', 'O', 'O',
             'O', 'O', 'O', 'B-Task', 'I-Task', 'I-Task', 'I-Task', 'I-Task',
             'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Process', 'O', 'O',
             'O', 'O', 'O', 'B-Process', 'I-Process', 'I-Process', 'I-Process',
             'I-Process', 'I-Process', 'O', 'O', 'B-Process', 'I-Process',
             'I-Process', 'O', 'O', 'O', 'B-Process', 'O', 'O', 'O', 'O', 'O',
             'B-Process', 'I-Process', 'I-Process', 'I-Process', 'I-Process',
             'I-Process', 'I-Process', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
             'B-Process', 'I-Process', 'O', 'O', 'O', 'B-Material',
             'I-Material', 'I-Material', 'I-Material', 'O', 'O', 'O',
             'B-Material', 'I-Material', 'O', 'O', 'O', 'O', 'O', 'O',
             'B-Material', 'O', 'O', 'B-Process', 'I-Process', 'B-Process', 'O',
             'O', 'O', 'O', 'O', 'B-Process', 'I-Process', 'B-Process', 'O',
          

In [5]:
# We are interested in WORD, RELDIST1, RELDIST2, IOB, POS
# Combine the name of T* and the IOBs
indata = dev_data['S0010938X13003818']

def entityLocator(indata):
    # line for each T: T, start, end, entity
    entities = []
    name_counter = 0
    i = 0
    flag = 0
    for tag in indata['IOBtags']:
        if tag[0] == 'B':
            name = indata['annotation_names'][name_counter]
            tg = tag
            start = i
            name_counter = name_counter + 1
            flag = 1 # flag if we are currently inside iob
        if tag[0] == 'O' and flag == 1:
            # if entity ended - submit to entities
            end = i-1
            entities.append((name, tg, start, end))
            flag = 0
        #end-if
        i = i + 1
    #end-for
    if flag == 1:
        #submit
        entities.append((name, tg, start, i-1))
        flag = 0
    #end-tricks
    return entities
#end-def
if False:
    indata = dev_data['S0010938X13003818']
    print(entityLocator(indata))
#end-if

In [6]:
# Get Part-of-speech (spacy) - Stanford coreNLP is a java program

## Downloads English SpaCy models and performs part-of-speech tagging on our dataset.
## You do not need to modify anything here.
if False:
    !python -m spacy download en
    import spacy
    nlp = spacy.load("en")
    nlp.tokenizer = nlp.tokenizer.tokens_from_list

In [7]:
if False:
    #https://github.com/allenai/scispacy
    import numpy
    #!pip install pybind11
    #!pip install scispacy --no-deps
    !pip install scispacy
    import scispacy
    
    #!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
    !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
    #!python -m spacy download en_core_sci_sm
    import en_core_sci_sm
    import spacy
    #nlp = spacy.load("en_core_sci_lg")
    nlp = spacy.load("en_core_sci_sm")
    nlp.tokenizer = nlp.tokenizer.tokens_from_list
    
    




In [None]:
# Stanford coreNLP is a java program but can be accessed via python:
# https://www.khalidalnajjar.com/setup-use-stanford-corenlp-server-python/

In [None]:
# Get the POS from either Spacy/Stanford-coreNLP

#train_pos = [[token.pos_ for token in sent] for sent in nlp.pipe(train_tokens)]
#dev_pos = [[token.pos_ for token in sent] for sent in nlp.pipe(dev_tokens)]
def getPOS(data_dic, mode = 'spacy'):
    if mode == 'spacy':
        return [[token.pos_ for token in sent] for sent in nlp.pipe([data_dic['tokens']])][0]
    else:
        raise ValueError('The mode: \'%s\' is not implemented' % (mode))
    #end-if
#end-def

# append POS to data if they do not exist yet
#if not 'POS' in indata.keys():

def addPOStoDic(data_dic):
    pos = getPOS(data_dic) # do not override default!
    # assert lengths are equal; a mismatch means the POS tags cannot be aligned with the tokens
    assert len(pos) == len(data_dic['tokens']), 'POS length does not match token lengths!'
    data_dic['POS'] = pos
    pass

if False:
    indata = dev_data['S0010938X13003818']
    addPOStoDic(indata)
    print(indata['POS'])


In [None]:
# we create input pair given entities from entitylocator and data

def inputPair(entityA, entityB, indata):
    out = []
    # get distances
    _, _, A_start, A_end = entityA
    _, _, B_start, B_end = entityB
    
    # check if POS
    if not 'POS' in indata.keys():
        # warn user
        print('No POS in data dictionary - adding them...')
        # add them
        addPOStoDic(indata)
        print('POS added to data dictionary.')
    
    # Get tokens in between
    i = 0
    for token in indata['tokens']:
        # A < B
        if i in range(A_start, B_end+1):
            relA = max(i-A_end, 0)
            relB = min(i-B_start,0)
            # get POS
            out.append([token, relA, relB, indata['IOBtags'][i], indata['POS'][i]])
        # B < A
        if i in range(B_start, A_end+1):
            relA = max(A_start-i, 0)
            relB = min(B_end-i,0)
            # get POS
            out.append([token, relA, relB, indata['IOBtags'][i], indata['POS'][i]])
        #end-if
        i = i+1
    #end-for
    return out
if False:
    indata = dev_data['S0010938X13003818']
    entities = entityLocator(indata)
    A = entities[0]
    B = entities[1]
    print(A); print(B)
    # works both ways
    print(inputPair(A, B, indata))
    print(inputPair(B, A, indata))
#end-if

In [None]:
# SETUP PRE-TRAINED-WORD-EMBEDDINGS
from gensim.models import fasttext
from gensim.models import KeyedVectors

# Create model
if False:
    engmodel = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec', limit=3000)
    engmodel.save("engmodel.model")
    del engmodel

class WordEmbedder:
    def __init__(self, vmodel):
        self.vmodel = vmodel
        self.length = len(vmodel.vectors[0])
        
    def getEmbedding(self, word):
        if word in self.vmodel.vocab:
            return self.vmodel[word]
        else:
            return np.zeros(self.length)
#end-class

# Get model
engmodel = KeyedVectors.load("engmodel.model")
engbedder = WordEmbedder(engmodel)

In [None]:
import copy
def GenerateVocabs(fulldata, maxdistance):
    maxlen = maxdistance

    # Work on a copy so that adding POS tags does not modify the caller's data
    fulldata = copy.deepcopy(fulldata)
    # Create POS for all before we create vocab
    for entry in list(fulldata.values()):
        addPOStoDic(entry)
    # we introduce four vocabs: word, dist, entity, pos
    vocab_w = Vocab.from_iterable([fulldata[i]['tokens'] for i in fulldata])
    vocab_dist = Vocab.from_iterable(range(-maxlen, maxlen))
    vocab_ent = Vocab.from_iterable([fulldata[i]['IOBtags'] for i in fulldata])
    vocab_pos = Vocab.from_iterable([fulldata[i]['POS'] for i in fulldata])
    return vocab_w, vocab_dist, vocab_ent, vocab_pos
if False:
    vocab_w, vocab_dist, vocab_ent, vocab_pos = GenerateVocabs(dev_data, maxdistance = 350)

In [None]:
# Convert input from strings to ints for embedding layer (or use pre-trained embeddings):
def createX(xdata, vocab_words, vocab_distances, vocab_entities, vocab_pos):
    out = [[vocab_words.map_to_index([w[0]])[0],
            vocab_distances.map_to_index([w[1]])[0],
            vocab_distances.map_to_index([w[2]])[0],
             vocab_entities.map_to_index([w[3]])[0],
             vocab_pos.map_to_index([w[4]])[0]] for w in xdata]
    return out

def createXEmbeddings(xdata, embedder, vocab_distances, vocab_entities, vocab_pos):
    out = [[embedder.getEmbedding(w[0]),
            vocab_distances.map_to_index([w[1]])[0],
            vocab_distances.map_to_index([w[2]])[0],
             vocab_entities.map_to_index([w[3]])[0],
             vocab_pos.map_to_index([w[4]])[0]] for w in xdata]
    return out

In [None]:
### The final function for generating data_X and data_Y
from collections import Counter

def dataX_Y_format(data, indices = False):
    """
    This takes the raw data from 'load_scienceie' and converts it to data_X and data_Y for input to tensorflow.
    
    Input:
        data: The raw data output from 'load_scienceie' (a dictionary)
        
        indices: If False, it returns the names in the features (word, type-of-entity, POS). 
                 If True, it converts word, type-of-entity and POS to indices, which can be used for embedding.
                 
    Output:
        data_X, data_Y
        
        data_X is a list of 2d np-arrays, each with shape (sentence_length, feature_length)
        data_Y is a 1d np-array
    """
    
    
    if indices:
        vocab_w, vocab_dist, vocab_ent, vocab_pos = GenerateVocabs(data, 350)
    
    files = data.values()
    
    data_X = []
    data_Y = []
        
    for file in files:
        addPOStoDic(file)
        entities = entityLocator(file)
        
        #Labels
        labels = {}
        labels['synonyms']= [(rel[2],rel[3]) for rel in file['relations'] if rel[1] == 'Synonym']
        labels['hyponyms'] = [(rel[2],rel[3]) for rel in file['relations'] if rel[1] == 'Hyponym']
        
        # create a 'stair' of combinations
        
        for i in range(len(entities)):
            for j in range(i+1, len(entities)):
                
                #Extract X and annotation names
                ann_names = (entities[i][0],entities[j][0])
                ann_names_reverted = (entities[j][0],entities[i][0])
                
                xdata = inputPair(entities[i], entities[j], file)
                
                #Reformat X
                if indices:
                    xout = createX(xdata, vocab_w, vocab_dist, vocab_ent, vocab_pos)
                    #xout = createXEmbeddings(xdata, engbedder, vocab_dist, vocab_ent, vocab_pos)
                    data_X.append(np.array(xout))
                else:
                    data_X.append(np.array(xdata))
                
                ### Extract label
                if ann_names in labels['hyponyms']:
                    data_Y.append('hyponym')
                elif ann_names_reverted in labels['hyponyms']:
                    data_Y.append('hyponym_reverted')
                elif ann_names in labels['synonyms'] or ann_names_reverted in labels['synonyms']:
                    data_Y.append('synonym')
                else:
                    data_Y.append('NONE')
                    
    data_Y_np = np.array(data_Y)
    
    print('X is a list of datapoints, where each datapoint is an np.array with shape (sentence_length, feature_length). These can\'t be stacked into a 3d np.array because sentence lengths vary.')
    print('Length of Y: {}'.format(len(data_Y_np)))
    print('The 4 possible labels with count:')
    pp(Counter(data_Y_np).most_common())
    
    return data_X, data_Y_np

In [None]:
#Generate data_X and data_Y (where data_X contains words)
data_X, data_Y = dataX_Y_format(dev_data,indices = False)

if True:
    pp(data_X[0])
    print(data_Y[0])

In [None]:
#Generate data_X and data_Y (where data_X contains indices)
data_X, data_Y = dataX_Y_format(dev_data,indices = True)

if True:
    pp(data_X[0])
    print(data_Y[0])

In [None]:
# You can test if conversion and back-conversion works well 
dev_data = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "dev"))
save_to_ann(dev_data, join(_snlp_book_dir, "data", "scienceie", "dev_copy"))

## <font color='blue'>Task 1.1</font>: Develop and Train a Relation Extraction Model with Gold Keyphrases

In this task, you develop a relation extraction model and apply it to the ScienceIE dataset.
As input to it, at test time, you will have the plain input texts as well as `.ann` files containing gold (i.e. correct) keyphrase annotations. The output should be `.ann` files containing relations between those keyphrases (you should include the keyphrase annotations in the output file as well).

A test input/output example is given in folders `data/scienceie/test/`,`data/scienceie/test_pred/`.

There are no strict requirements for how to design this model. You are expected to use the knowledge you have gathered throughout this course to design and implement this model. 
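
One possible framing, shown here only as a sketch, is to treat relation extraction as classification over pairs of gold keyphrases: every unordered pair in a document becomes one candidate with a label. The helper `label_pairs` is hypothetical, and the four label names follow the ones used in the data conversion code earlier in this notebook.

```python
from itertools import combinations


def label_pairs(entity_ids, relations):
    """Label every unordered pair of keyphrase ids with a relation class."""
    gold = {}
    for label, arg1, arg2 in relations:
        if label == "Synonym-of":
            # Undirected: both orders get the same label
            gold[(arg1, arg2)] = "synonym"
            gold[(arg2, arg1)] = "synonym"
        elif label == "Hyponym-of":
            # Directed: the reversed pair gets its own class
            gold[(arg1, arg2)] = "hyponym"
            gold.setdefault((arg2, arg1), "hyponym_reverted")
    return [((a, b), gold.get((a, b), "NONE")) for a, b in combinations(entity_ids, 2)]


ids = ["T13", "T14", "T25", "T16"]
rels = [("Synonym-of", "T14", "T13"), ("Hyponym-of", "T16", "T25")]
for pair, label in label_pairs(ids, rels):
    print(pair, label)
```

Most pairs end up labelled `NONE`, so the classes are heavily imbalanced; this is worth keeping in mind when choosing a loss function or a sampling strategy.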

You are welcome to re-use existing code you might have written for other assignments as you see fit.

You are free to implement your solution in either PyTorch or Tensorflow, but if you are not sure where to start, we recommend looking at the [Keras API](https://keras.io) which is [integrated into Tensorflow 1.14.0](https://www.tensorflow.org/beta/guide/keras/overview?hl=en).

In [None]:
# You should improve this cell

def create_model(train_data, dev_data):
    """
    Return an instance of a relation extraction model defined over the dataset.
    Args:
        train_data: the training data the relation extraction model should be defined over.
        dev_data: the development data the relation extraction model can be tuned on.
    Returns:
        a relation extraction model
    """
    pass

def train_model(model, train_data, dev_data):
    """Train a relation extraction model on the given dataset.
    Args:
        model: The model to train
        train_data: The dataset to train on
        dev_data: The development data the relation extraction model can be tuned on
    """
    pass

def make_predictions(model, data):
    """Makes predictions on a list of instances
    Args:
        model: The trained model
        data: The dataset to evaluate on
    Returns:
        The model's predictions for the data.
    """
    pass


In [None]:
# Use this cell to test on the dev set
data_train = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "train"))
dir_dev = join(_snlp_book_dir, "data", "scienceie", "dev")
data_dev = load_scienceie(datadir=dir_dev)

model = create_model(data_train, [])
train_model(model, data_train, [])

data_pred = make_predictions(model, data_dev)
dir_pred = join(_snlp_book_dir, "data", "scienceie", "dev_pred")
save_to_ann(data_pred, dir_pred)

calculateMeasures(dir_dev, dir_pred, "keys") # this will only evaluate the correctness of relations

In [None]:
# DO NOT MODIFY THIS CELL! It will evaluate your model on an unseen dataset!
shutil.rmtree(join(_snlp_book_dir, "data", "scienceie", "test_pred")) # clean after previous

data_train = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "train"))
data_dev = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "dev"))

model = create_model(data_train, data_dev)
train_model(model, data_train, data_dev)

data_test = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "test"))
data_pred = make_predictions(model, data_test)
dir_pred = join(_snlp_book_dir, "data", "scienceie", "test_pred")
save_to_ann(data_pred, dir_pred)

dir_gold = join(_snlp_book_dir, "data", "scienceie", "test_gold")
calculateMeasures(dir_gold, dir_pred, "keys") # this will only evaluate the correctness of relations

## <font color='red'>Assessment 1.1</font>: Correctness of the implementation (20 pts)

We assess if your code implements a correct relation extraction model (10 points):

* 0-5 pts: the model does not run correctly or does not constitute a relation extraction model
* 5-10 pts: the model correctly implements the requirements

Additionally, we will assess how well your model performs on an unseen test set (10 points):

* 0-5 pts: performance worse than a simple baseline model
* 5-10 pts: performance better than a simple baseline model
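
As a rough illustration of how relation predictions can be scored, the sketch below computes precision, recall and F1 over sets of `(label, arg1, arg2)` triples. This is an assumption about the general idea only; the exact matching rules used by `calculateMeasures` may differ, and `relation_prf` is a hypothetical helper.

```python
def relation_prf(gold, pred):
    """Precision, recall and F1 over exact-match relation triples."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy triples, invented for this sketch.
gold_relations = {("Hyponym-of", "T1", "T2"), ("Synonym-of", "T3", "T4")}
pred_relations = {("Hyponym-of", "T1", "T2"), ("Hyponym-of", "T2", "T1")}
precision, recall, f1 = relation_prf(gold_relations, pred_relations)
```

Note that the reversed `Hyponym-of` prediction counts as an error here, because the relation is directed.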

## <font color='blue'>Task 1.2</font>: Describe your Approach

Enter a description of at most 500 words of the model you developed in Task 1.1, its architecture, and the way you trained and tuned it. Motivate your choices, describing potential benefits and downsides.

## <font color='red'>Assessment 1.2</font>: Modelling Choices and Motivation (10 pts)


Finally, we assess your modelling design choices and how you motivated them, which you summarised in the above cell (10 points):

* 0-5 pts: the model design choices do not show high levels of creativity, e.g. re-using code from the lecture out of the box; and they are not motivated well
* 5-10 pts: the model design choices show high levels of creativity, e.g. combining different things learned throughout the course, models inspired by further reading, etc.; and they are motivated well

## <font color='blue'>Task 2.1</font>: Relation Extraction with Weak Supervision

In this task, the goal is to improve the performance of the model you developed in Task 1 by obtaining additional, automatically labelled training data using a weak supervision approach. You are not required to change the relation extraction model architecture, i.e. it is fine to re-use the one from Task 1; the requirement is instead to implement one or more weak supervision strategies.

Some possible weak supervision methods for relation extraction will be introduced in the Week 43 lecture (https://github.com/copenlu/stat-nlp-book/blob/master/chapters/relation_extraction_slides.ipynb); the following blog post also serves as a good introduction to this topic: https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html

For this task, you are not confined to the training data provided to you, but you are welcome to obtain additional unlabelled datasets and automatically label them using weak supervision methods. 
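
As one illustration of what automatic labelling can look like, the sketch below applies two simple pattern-based heuristics to a pair of keyphrase spans: a "such as" / "including" pattern suggesting `Hyponym-of`, and a parenthesised abbreviation pattern suggesting `Synonym-of`. The patterns, the function names and the example sentence are assumptions made for this sketch; they are noisy heuristics, not the method prescribed by the lecture.

```python
import re

# Text allowed between a general term and a more specific one.
SUCH_AS = re.compile(r"\s*,?\s*(?:such as|including)\s+", re.IGNORECASE)


def span_of(text, phrase):
    """Character offsets (start, end) of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def weak_label(text, span_a, span_b):
    """Guess a relation for two spans, where span_a precedes span_b.

    Returns (label, arg1 span, arg2 span), or None if no heuristic fires.
    """
    between = text[span_a[1]:span_b[0]]
    after = text[span_b[1]:span_b[1] + 1]
    if SUCH_AS.fullmatch(between):
        # "X such as Y": Y is taken to be a hyponym of X.
        return ("Hyponym-of", span_b, span_a)
    if between.strip() == "(" and after == ")":
        # "X (Y)": Y is taken to be an abbreviation or synonym of X.
        return ("Synonym-of", span_a, span_b)
    return None


# Example sentence invented for this sketch.
text = "We study polymers such as polyethylene and support vector machines (SVM)."
hyponym = weak_label(text, span_of(text, "polymers"), span_of(text, "polyethylene"))
synonym = weak_label(text, span_of(text, "support vector machines"), span_of(text, "SVM"))
nothing = weak_label(text, span_of(text, "polyethylene"), span_of(text, "SVM"))
```

Heuristics like these trade precision for coverage, so it is worth measuring on the development set how often each pattern agrees with the gold relations before adding its output to the training data.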

The general setup will otherwise be the same as for Task 1:
At test time, the input consists of the plain texts and `.ann` files containing gold (i.e. correct) keyphrase annotations. The output should be `.ann` files containing the relations between those keyphrases.

**Important notes**:
- You must provide code for the functions below. 
- If running them on the full dataset exceeds the 10-minute limit, you are welcome to additionally provide a line of code that loads (or downloads) the already weakly annotated data.
- The weakly annotated data may not exceed 1 GB in file size.
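
A minimal sketch of how precomputed weak annotations could be cached and reloaded, so that the slow labelling step runs only once. The helper `load_or_build`, the pickle format and the interpretation of 1 GB as `10 ** 9` bytes are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

MAX_BYTES = 10 ** 9  # assumed reading of the 1 GB limit


def load_or_build(cache_path, build_fn, max_bytes=MAX_BYTES):
    """Load cached data if present; otherwise build it, save it and check its size."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    data = build_fn()
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle)
    if os.path.getsize(cache_path) > max_bytes:
        os.remove(cache_path)
        raise ValueError("cached weak annotations exceed the size limit")
    return data


build_calls = []


def slow_weak_labelling():
    """Stand-in for an expensive weak labelling run (toy output)."""
    build_calls.append(1)
    return [("Synonym-of", "support vector machine", "SVM")]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weak_data.pkl")
    first = load_or_build(path, slow_weak_labelling)   # builds and saves
    second = load_or_build(path, slow_weak_labelling)  # loads from disk
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.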

An example of test input and output is given in the folders `data/scienceie/test/` and `data/scienceie/test_pred/`.

In [None]:
# You should improve this cell

def create_weak_model(train_data, dev_data, **args):
    """
    Return an instance of a relation extraction model defined over the dataset.
    Args:
        train_data: the training data the relation extraction model should be defined over.
        dev_data: the development data the relation extraction model can be tuned on.
        **args: any additional arguments needed, e.g. additional automatically labelled training data
    Returns:
        a relation extraction model
    """
    pass

def train_weak_model(model, train_data, dev_data, **args):
    """Train a relation extraction model on the given dataset.
    Args:
        model: The model to train
        train_data: The dataset to train on
        dev_data: The development data the relation extraction model can be tuned on
        **args: any additional arguments needed, e.g. additional automatically labelled training data
    """
    pass

def make_predictions_weak(model, data):
    """Makes predictions on a list of instances. Can be the same as function developed in Task 1.
    Args:
        model: The trained model
        data: The dataset to evaluate on
    Returns:
        The model's predictions for the data.
    """
    pass


In [None]:
# Training a model and evaluating it on the development set. 
# Use this to monitor the performance of your model prior to submitting your assignment.
data_train = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "train"))
dir_dev = join(_snlp_book_dir, "data", "scienceie", "dev")
data_dev = load_scienceie(datadir=dir_dev)

model = create_weak_model(data_train, data_dev)
train_weak_model(model, data_train, data_dev)

data_pred = make_predictions_weak(model, data_dev)
dir_pred = join(_snlp_book_dir, "data", "scienceie", "dev_pred")
save_to_ann(data_pred, dir_pred)

calculateMeasures(dir_dev, dir_pred, "keys") 

In [None]:
# DO NOT MODIFY THIS CELL! It will evaluate your model on an unseen dataset!
shutil.rmtree(join(_snlp_book_dir, "data", "scienceie", "test_pred")) # clean after previous

data_train = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "train"))
data_dev = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "dev"))

model = create_weak_model(data_train, data_dev)
train_weak_model(model, data_train, data_dev)

data_test = load_scienceie(datadir=join(_snlp_book_dir, "data", "scienceie", "test"))
data_pred = make_predictions_weak(model, data_test)
dir_pred = join(_snlp_book_dir, "data", "scienceie", "test_pred")
save_to_ann(data_pred, dir_pred)

dir_gold = join(_snlp_book_dir, "data", "scienceie", "test_gold")
calculateMeasures(dir_gold, dir_pred, "keys") # this will only evaluate the correctness of relations

In [None]:
dir_pred = join(_snlp_book_dir, "data", "scienceie", "test_pred")
dir_gold = join(_snlp_book_dir, "data", "scienceie", "test_gold")

calculateMeasures(dir_gold, dir_pred, "keys") # this will only evaluate the correctness of relations

## <font color='red'>Assessment 2.1</font>: Correctness of the implementation (20 pts)

We assess if your code implements correct weak supervision methods (10 points):

* 0-5 pts: the model does not run correctly or the methods implemented do not constitute weak supervision strategies
* 5-10 pts: the model correctly implements the requirements

Additionally, we will assess how well your model performs on an unseen test set (10 points):

* 0-5 pts: performance worse than a simple baseline model
* 5-10 pts: performance better than a simple baseline model

## <font color='blue'>Task 2.2</font>: Describe your Approach

Enter a description of at most 500 words of the weak supervision strategies you developed in Task 2.1 and the way you trained and tuned them. Motivate your choices, describing potential benefits and downsides.

## <font color='red'>Assessment 2.2</font>: Modelling Choices and Motivation (10 pts)


Finally, we assess your modelling design choices and how you motivated them, which you summarised in the above cell (10 points):

* 0-5 pts: the model design choices do not show high levels of creativity, e.g. re-using code from the lecture out of the box; and they are not motivated well
* 5-10 pts: the model design choices show high levels of creativity, e.g. combining different things learned throughout the course, models inspired by further reading, etc.; and they are motivated well

## <font color='blue'>Task 3</font>: Comparison of relation extraction models

Reflect on the models implemented in Tasks 1 and 2. What worked and didn't work well, and how would you explain this? How and when does the performance differ between the models and why might that be? You are expected to perform a small error analysis on the development set in order to answer these questions.
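
A small error analysis could start from per-label counts of true positives, false positives and false negatives, computed separately for each model. The sketch below is one way to do this under the assumption that predictions and gold annotations are available as `(label, arg1, arg2)` triples; `error_breakdown` is a hypothetical helper.

```python
from collections import Counter


def error_breakdown(gold, pred):
    """Count true positives, false positives and false negatives per relation label."""
    gold, pred = set(gold), set(pred)
    counts = Counter()
    for label, _, _ in gold & pred:
        counts[(label, "tp")] += 1
    for label, _, _ in pred - gold:
        counts[(label, "fp")] += 1
    for label, _, _ in gold - pred:
        counts[(label, "fn")] += 1
    return counts


# Toy triples, invented for this sketch.
gold_relations = {("Hyponym-of", "T1", "T2"), ("Synonym-of", "T3", "T4")}
pred_relations = {("Hyponym-of", "T1", "T2"), ("Hyponym-of", "T2", "T1")}
breakdown = error_breakdown(gold_relations, pred_relations)
```

Comparing the breakdowns of the Task 1 and Task 2 models shows which relation types gained or lost from the weakly labelled data; reading a sample of the false positives and false negatives then helps explain why.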

## <font color='red'>Assessment 3</font>: Assess your explanation (20 pts)

We will mark the explanation along the following dimension: 

* Substance (20 pts): a well-designed error analysis and correctly explained reasons for performance differences between the models