## Load the data processed so far
* `dictionary` maps a single keypress to a label, which speeds up annotation: press `i`, `b`, or `c` and the corresponding label is attached to the sentence.
* `done_file` is the set of titles annotated in earlier sessions (read from `done.txt`).
* `done` is the set of titles annotated in the current session.
* `ignored_titles` is the set of all titles that we REJECTED (read from `ignored.txt`).
* `write_to_file` holds the new JSON objects, which now include the structural labels as well.

In [None]:
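import json

# Map a single keypress to its structural label.
dictionary = {"i": "INTRODUCTION", "b": "BODY", "c": "CONCLUSION"}
# Titles annotated in earlier sessions.
with open("done.txt") as f:
    done_file = set(map(str.strip, f))
done = set()
# Titles rejected so far.
with open("ignored.txt") as f:
    ignored_titles = set(map(str.strip, f))
write_to_file = []
c = 1  # Counter for abstracts accepted in this run.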
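
On a first run `done.txt` and `ignored.txt` may not exist yet, in which case the `open` calls above fail. One possible workaround is a small helper, here called `load_titles` (a hypothetical name, not part of the original notebook), that returns an empty set for a missing file. The temporary directory is only there so that the sketch does not touch real files.

```python
import os
import tempfile


def load_titles(path):
    """Return the set of stripped lines in `path`, or an empty set if it is missing."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(map(str.strip, f))


# Usage with a temporary directory (hypothetical paths, for illustration only).
tmp_dir = tempfile.mkdtemp()
titles_path = os.path.join(tmp_dir, "done.txt")
print(load_titles(titles_path))  # set()

with open(titles_path, "w") as f:
    f.write("first title\nsecond title\n")
print(sorted(load_titles(titles_path)))  # ['first title', 'second title']
```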

# The Annotation Job
* Iterate over the abstracts in `train.dat`, which holds one JSON object per line.
* Skip a title/abstract if it has already been annotated or was rejected before.
* We also skip abstracts containing `i.e.`, `e.g.`, or `etc.`. NLTK tokenization splits off the period in these abbreviations (the code matches them as `i.e .`, `e.g .`, and `etc .`), so splitting the abstract on ` . ` would break a sentence in the middle. We therefore exclude these abstracts from manual annotation.
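
To illustrate the splitting problem, here is a minimal sketch with a made-up tokenized abstract (the text is hypothetical and not taken from `train.dat`). It contains two sentences, yet splitting on `" . "` yields three pieces because the period after `i.e` is treated as a sentence boundary.

```python
# Hypothetical tokenized abstract with two real sentences.
abstract = "we use a model , i.e . a classifier . it works well ."

# Splitting on " . " also cuts at the period that belongs to "i.e .".
pieces = abstract.split(" . ")
print(pieces)
# ['we use a model , i.e', 'a classifier', 'it works well .']

# The same kind of membership test the annotation loop uses to skip such abstracts.
skip = any(marker in abstract for marker in ["i.e .", "e.g .", "etc ."])
print(skip)  # True
```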

In [None]:
with open("train.dat") as f:
    for line in f:
        j = json.loads(line)
        title = j["title"]

        if title in done or title in done_file or title in ignored_titles:
            continue

        if any(ignore in j["abstract"] for ignore in ['i.e .', 'e.g .', 'etc .']):
            continue

        abstract = j["abstract"].split(" . ")
        print("\n\n***************************** Abstract is: **************************")
        print("\n".join(abstract))

        consider_this = input("\n is this ok ? \n")
        
        # Any answer other than "yes" or "y" causes the
        # title to be rejected.
        if consider_this in ["yes", "y"]:
            new_abstract = []
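            for a in abstract:
                # Enter exactly one of i, b or c:
                # i = Introduction, b = Body, c = Conclusion.
                labelling = input(a)
                # Ask again on any other input instead of raising a KeyError.
                while labelling not in dictionary:
                    labelling = input("Enter i, b or c: ")
                new_abstract.append((dictionary[labelling], a))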

            j["abstract"] = new_abstract
            write_to_file.append(j)
            done.add(title)
            
            # Break after processing 20 abstracts so that we can save the 
            # progress in the output file. 
            if c % 20 == 0:
                break
            c += 1    
            
        else:
            ignored_titles.add(title)



## Sanity check

In [224]:
len(ignored_titles), len(done), len(write_to_file), len(done_file)

(359, 0, 0, 1500)
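
Beyond eyeballing the counts, a few invariants can be checked automatically. The sketch below uses a hypothetical helper `check_bookkeeping` and made-up sample values. It assumes that each title in `done` has exactly one entry in `write_to_file`, that no title is annotated twice, and that no title is both annotated and rejected.

```python
def check_bookkeeping(done, done_file, ignored_titles, write_to_file):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    if len(done) != len(write_to_file):
        problems.append("done and write_to_file differ in size")
    if done & done_file:
        problems.append("some titles were annotated twice")
    if (done | done_file) & ignored_titles:
        problems.append("some titles are both annotated and rejected")
    return problems


# Made-up sample values that satisfy all three assumptions.
sample = check_bookkeeping(
    done={"title a"},
    done_file={"title b"},
    ignored_titles={"title c"},
    write_to_file=[{"title": "title a", "abstract": []}],
)
print(sample)  # []
```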

## Write the new data to file

In [None]:
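with open("ignored.txt", "w") as f2, \
        open("labelled.txt", "a") as f, \
        open("done.txt", "a") as f1:
    # ignored_titles already includes the titles loaded from ignored.txt,
    # so that file is overwritten rather than appended to.
    for i in ignored_titles:
        f2.write(i + "\n")
    # Append only the titles and records from this run.
    for t in done:
        f1.write(t + "\n")
    for a in write_to_file:
        f.write(json.dumps(a) + "\n")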
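
To read the annotations back, each line of `labelled.txt` can be parsed with `json.loads`. The sketch below writes to a temporary file and uses a made-up record so that it runs on its own; the record contents and the name `path` are assumptions for illustration. Note that JSON has no tuple type, so the `(label, sentence)` tuples come back as two-element lists.

```python
import json
import os
import tempfile

# Made-up record in the same shape the annotation loop produces.
record = {
    "title": "a sample title",
    "abstract": [("INTRODUCTION", "we study x"), ("BODY", "we propose y")],
}

# Write one JSON object per line, as the notebook does for labelled.txt.
path = os.path.join(tempfile.mkdtemp(), "labelled.txt")
with open(path, "a") as f:
    f.write(json.dumps(record) + "\n")

# Read the file back line by line.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["abstract"][0])  # ['INTRODUCTION', 'we study x']
```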