# Tagger and pub-annotation generator for COVID-19
This is a project in the course EDAN70, Projects in Computer Science, at LTH, Sweden. The goal is to use NLP to tag words in articles and generate a collection of pub-annotations, helping the scientific community find research relevant to their field.

## Authors
Annie Tallind, Lund University, Faculty of Engineering <br/>
Kaggle ID: [atllnd](https://www.kaggle.com/atllnd/) <br/>
Github ID: [annietllnd](https://github.com/annietllnd/) <br/>

Sofi Flink, Lund University, Faculty of Engineering <br/>
Kaggle ID: [sofiflinck](https://www.kaggle.com/sofiflinck/) <br/>
Github ID: [obakanue](https://github.com/obakanue) <br/>

## Credit
Dictionaries were generated using the golden and silver standards implemented by Aitslab.

## About project
The scientific field is overwhelmed by the constant influx of newly published research. During the COVID-19 pandemic it is even more crucial to develop tools that help researchers find relevant data. For this purpose a tagger was implemented that generates pub-annotations, giving researchers a tool for summarizing articles. Hopefully this tool will also be good groundwork for filtering articles in other fields.

## The source code
This section guides the reader through the code and the different elements of the tagger. It goes through each step, from loading data into the implementation to tagging, pub-annotation generation, and precision and recall evaluation.

### TODO list

### Load and process datafiles
**Load JSON-file paths** <br/>
We import 'os' in order to use the method 'listdir()', which returns a list containing the names of the entries in the directory given by a path. In this case the path points to the JSON-file articles in the directory 'comm_use_subset_100', which is saved in the constant 'DIRECTORY_NAME'. The file names are loaded into the list 'article_paths'.

In [36]:
import os
DIRECTORY_NAME = 'comm_use_subset_100'
article_paths = os.listdir(DIRECTORY_NAME)
article_dicts_list = []

**Load JSON-files** <br/>
The import 'json' is used to parse the contents of each file into a dictionary. These dictionaries are added to the list 'article_dicts_list'.

In [37]:
import json
for article_name in article_paths:
    if article_name == ".DS_Store":  # For MacOS users skip .DS_Store-file
        continue                     # generated.
    full_path = DIRECTORY_NAME + '/' + article_name
    with open(full_path) as article:
        article_dicts_list.append(json.load(article))
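
The loading step can also be written as one function. The following is a minimal sketch, not the project's own code: the name 'load_articles' is hypothetical, and it assumes every article file ends in '.json', which also skips files such as '.DS_Store'.

```python
import json
import os


def load_articles(directory):
    """Return (file names, parsed articles) for every .json file in directory."""
    names = sorted(name for name in os.listdir(directory)
                   if name.endswith('.json'))
    articles = []
    for name in names:
        # os.path.join picks the right separator for the current platform.
        with open(os.path.join(directory, name), encoding='utf-8') as handle:
            articles.append(json.load(handle))
    return names, articles
```

Returning the file names together with the dictionaries keeps each article paired with the id in its file name.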

**Load CSV-file with metadata** <br/>
We also need the metadata, which contains complementary information for the JSON articles. First we need to import pandas, a library for data analysis with built-in methods for handling CSV files. The method 'load_metadata()' reads a CSV file and creates a dataframe from its content. Next a list with each article's metadata as a dictionary is created, 'metadata_list'.   
Since we want to easily access the position of a particular element in the list, the indices were mapped as values with the 'sha' id as key. This way we can easily find and access any element in the list, even if it has two different ids for the same article, which sometimes occurs. The list with metadata and the dictionary mapping a 'sha' id to an index are returned.

In [38]:
import pandas as pd
def load_metadata():
    """
    Returns list with metadata and dictionary with sha as keys and indices of metadata list as values.
    """
    metadata_csv_path = 'metadata_comm_use_subset_100.csv'
    metadata_frame = pd.read_csv(metadata_csv_path,
                                 na_filter=False,
                                 engine='python')
    metadata_list = metadata_frame.to_dict('records')
    index = 0
    metadata_indices_dict = dict()
    for data in metadata_list:
        shas = data['sha'].split('; ', 1)
        for sha in shas:
            metadata_indices_dict.update({sha: index})
        index += 1
    return metadata_list, metadata_indices_dict

metadata_list, metadata_indices_dict = load_metadata()
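
To make the sha-to-index mapping concrete, here is a small sketch on made-up records, without pandas. The name 'index_by_sha' and the sample values are hypothetical. Unlike the function above, it splits on every '; ' rather than only the first.

```python
def index_by_sha(records):
    """Map every sha in a record's 'sha' field to that record's list index."""
    indices = {}
    for index, record in enumerate(records):
        # A record may list several ids separated by '; '.
        for sha in record['sha'].split('; '):
            indices[sha] = index
    return indices


# Made-up records, for illustration only.
records = [{'sha': 'aaa111', 'cord_uid': 'uid-0'},
           {'sha': 'bbb222; ccc333', 'cord_uid': 'uid-1'}]
sha_to_index = index_by_sha(records)
print(records[sha_to_index['ccc333']]['cord_uid'])  # uid-1
```

Both ids of the second record point to the same index, so either one retrieves the same metadata.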

### Load vocabularies and create patterns used for tagging
**Load vocabularies** <br/>
Vocabularies are text files with words we wish to tag in the articles. The ones used here were generated using Aitslab's golden and silver standards and are called 'Virus_SARS-CoV-2' and 'Disease_COVID-19'. We want an easy and streamlined process where we can simply add as many vocabularies as necessary. We create a dictionary 'VOCABS_COL_DICT' with the class name as key and the list of words as value.

In [39]:
VOCABS_COL_DICT = {'Virus_SARS-CoV-2':
                   [row.strip() for row in
                   open('Supplemental_file1.txt')],
                   'Disease_COVID-19':
                   [row.strip() for row in
                   open('Supplemental_file2.txt')]}

**Create dictionary with patterns** <br/>
In addition to vocabularies we wish to handle certain word patterns, which get their own class. For now we only have one pattern, for all words ending in 'vir', and its class name is 'chemical_antiviral'. Similar to 'VOCABS_COL_DICT', 'PATTERNS_DICT' uses the class name as key, and this time the pattern is the value.

In [40]:
"""
Patterns:
1. All words ending in 'vir' case insensitive in class 'chemical_antiviral'.
"""
PATTERNS_DICT = {'chemical_antiviral':
            r'(?i)\b\S*vir\b'
            }
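
As a quick check of what this pattern captures, the sketch below runs it on a made-up sentence. The constant name 'ANTIVIRAL_PATTERN' and the sample text are illustrative only.

```python
import re

ANTIVIRAL_PATTERN = r'(?i)\b\S*vir\b'

# Made-up sentence, for illustration only.
sample = 'Patients received remdesivir or ritonavir/lopinavir, not a virus.'
matches = [m.group(0) for m in re.finditer(ANTIVIRAL_PATTERN, sample)]
print(matches)  # ['remdesivir', 'ritonavir/lopinavir']
```

Because '\S*' matches any run of non-whitespace characters, a slash-separated pair is captured as a single match, while 'virus' is skipped since 'vir' is not followed by a word boundary.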

### Tag articles and generate pub-annotations
**Define method for processing an article** <br/>
Before constructing a loop that iterates through all articles, we define a function for how a single article should be processed. For now, an article and its corresponding metadata are passed to the function as dictionaries. We only need some of the information from the metadata, so a helper function 'obtain_metadata_args()' extracts 'cord_uid', 'source_x' and 'pmcid' from the dictionary into a list, which is returned and saved in 'metadata_info'. The sections we want to filter through are 'metadata' (which contains 'title'), 'abstract' and 'body_text'.   
Next we visit each section of the article by looping through our 'sections' list. If the section is 'metadata', we retrieve the 'title' from the article dictionary. So that all sections can be processed by the same code, each section is turned into a list 'section_paragraphs'; this is done for the title as well, even though an article never has more than one title. Since we are only interested in the 'title' of the 'metadata' section, we update the 'section' string to 'title'. The other sections may contain several text paragraphs, which are collected into 'section_paragraphs' with a list comprehension. If a section has no paragraphs, we assign a list containing a single empty string. The counter 'paragraph_index' keeps track of the number of paragraphs in a section and resets for every new section.   
Tagging is done per paragraph; this way we can handle further demands on the filtering and tagging of articles before exporting the pub-annotations to a file. The tagging of a paragraph is handled by 'tag_paragraph()', which we walk through a bit later. The complete denotation is constructed by 'get_paragraph_denotation()', which takes the 'url' entry of the corresponding metadata dictionary as its input argument. If the denotation is empty, that is '\[\]', no pub-annotation is created since no matches were found. The function 'construct_pubannotation()' returns the full pub-annotation string, while 'export_pubannotation()' exports the string to a JSON file. 'file_index' increments once per section, so at most 3 files with pub-annotations are generated per article.

In [41]:
def tag_article(article_dict, metadata_dict):
    """
    Process article for each section and paragraph and generate pub-annotations for export to file.
    """
    file_index = 0
    metadata_info = obtain_metadata_args(metadata_dict)
    sections = ['metadata', 'abstract', 'body_text']
    for section in sections:
        if section == 'metadata':
            section_paragraphs = [article_dict[section]['title']]
            section = 'title'
        else:
            section_paragraphs = [paragraph['text'] for paragraph in
                                  article_dict[section]]
        if not bool(section_paragraphs):
            section_paragraphs = ['']
        paragraph_index = 0
        for paragraph in section_paragraphs:
            tag_paragraph(paragraph)
            denotation = get_paragraph_denotation(metadata_dict['url'])
            if not re.fullmatch(r'\[\]', denotation):
                annotation = construct_pubannotation(metadata_info,
                                                      paragraph_index,
                                                      paragraph,
                                                      denotation)
                export_pubannotation(metadata_info[0],
                                      file_index,
                                      section,
                                      annotation)
            paragraph_index += 1
        file_index += 1  # Increment with each file
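
The paragraph extraction can be tried in isolation on a toy article. This is a sketch under assumptions: the helper name 'section_paragraphs' (as a function) and the sample article are hypothetical, and the article is assumed to follow the layout described above.

```python
def section_paragraphs(article_dict, section):
    """Return the paragraph strings of one section, never an empty list."""
    if section == 'title':
        paragraphs = [article_dict['metadata']['title']]
    else:
        paragraphs = [entry['text'] for entry in article_dict.get(section, [])]
    # An empty section is represented by a single empty string.
    return paragraphs or ['']


# Made-up article, for illustration only.
article = {'metadata': {'title': 'A new coronavirus'},
           'abstract': [],
           'body_text': [{'text': 'First paragraph.'},
                         {'text': 'Second paragraph.'}]}
for name in ('title', 'abstract', 'body_text'):
    print(name, section_paragraphs(article, name))
```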

**Helper function for tag_article(): obtain_metadata_args()** <br/>
As mentioned above, we will now walk through the helper functions called in 'tag_article()'. The first is 'obtain_metadata_args()', which takes the metadata dictionary as its input argument. The function simply extracts the entries of interest from the dictionary and returns them as elements in a list.


In [42]:
def obtain_metadata_args(metadata_dict):
    """
    Returns necessary columns from metadata dictionary. Index 0 gives cord_uid, index 1 gives source_x, index 2 gives
    pmcid.
    """
    cord_uid = metadata_dict['cord_uid']
    source_x = metadata_dict['source_x']
    pmcid = metadata_dict['pmcid']
    metadata_info = [cord_uid, source_x, pmcid]
    return metadata_info

**Helper function for tag_article(): tag_paragraph()** <br/>
The function takes the paragraph string as its input argument. For every new paragraph we clear the dictionary 'paragraph_matches'. The dictionary contains match objects as keys and word classes as values, and it is necessary for keeping track of which words are to be tagged with which word class. We will walk through the priority rules further down.   
The function iterates through every vocabulary and pattern in 'VOCABS_COL_DICT' and 'PATTERNS_DICT' and tags every occurrence in the paragraph string. This is done in the helper function 'tag_pattern()'. We create a regular expression pattern for every word in a vocabulary; since pattern matching and word matching use similar code, we avoid repeating ourselves by using 'tag_pattern()' for both.

In [43]:
paragraph_matches = dict()
def tag_paragraph(paragraph):
    """
    For a paragraph, iterate through all vocabularies and patterns and tag using corresponding regex-pattern.
    """
    paragraph_matches.clear()
    for vocabulary in VOCABS_COL_DICT:
        for word in VOCABS_COL_DICT[vocabulary]:
            pattern = fr'(?i)\b{word}(es|s)?\b'
            tag_pattern(pattern, paragraph, vocabulary)

    for word_class in PATTERNS_DICT:
        tag_pattern(PATTERNS_DICT[word_class], paragraph, word_class)
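
The vocabulary pattern inserts each word into the regular expression as-is, so a term containing characters such as '.' or '+' would be read as regex syntax. Below is a hedged variant that escapes the word first. The name 'build_vocab_pattern' is hypothetical, and escaping is an assumption about what is wanted, not something the vocabularies are known to need.

```python
import re


def build_vocab_pattern(word):
    """Case-insensitive whole-word pattern that also accepts -s and -es plurals."""
    # re.escape makes characters such as '.' or '+' in a term match literally.
    return fr'(?i)\b{re.escape(word)}(es|s)?\b'


pattern = build_vocab_pattern('coronavirus')
print(re.search(pattern, 'Several Coronaviruses were studied.').group(0))
# Coronaviruses
```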

**Helper function for tag_paragraph(): tag_pattern()** <br/>
We need to import 're' in order to search with regular expressions. We use 'finditer()', which returns match objects for all matches of a certain pattern when iterating through a string. In our case 'text' will be a paragraph. The input arguments for our helper function are the regex pattern, the string of text we wish to tag, and the word class the pattern belongs to. Every match we find is sent to the helper function 'is_match_priority()', which returns true if the match is prioritized and should be added to our dictionary of match objects.

In [44]:
import re
def tag_pattern(pattern, text, word_class):
    """
    For a particular pattern, find matches in paragraph and add to 'paragraph_matches' dictionary, if match is
    prioritized.
    """
    for match in re.finditer(pattern, text):
        is_priority = is_match_priority(pattern, match.group(0), word_class)
        if is_priority:
            paragraph_matches.update({match: word_class})
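
Each match object carries the offsets that later become the 'begin' and 'end' of a denotation. The sketch below lists them for a made-up sentence; 'find_spans' is a hypothetical helper and skips the priority check.

```python
import re


def find_spans(pattern, text, word_class):
    """Return (begin, end, matched text, class) for every match of pattern."""
    return [(m.start(), m.end(), m.group(0), word_class)
            for m in re.finditer(pattern, text)]


spans = find_spans(r'(?i)\bnew coronavirus\b', 'A new coronavirus emerged.',
                   'Virus_SARS-CoV-2')
print(spans)  # [(2, 17, 'new coronavirus', 'Virus_SARS-CoV-2')]
```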

**Helper function for tag_pattern(): is_match_priority()** <br/>
For now we don't have a lot of rules since we only use three different word classes. For 'Virus_SARS-CoV-2' and 'Disease_COVID-19' only the longest match should be tagged: if an older, shorter match exists, it is deleted from the dictionary 'paragraph_matches' and true is returned.

In [45]:
def is_match_priority(pattern, new_word_match, word_class):
    """
    Checks priorities of tagging for vocabularies. For 'Virus_SARS-CoV-2' and 'Disease_COVID-19', if the pattern
    also matches an existing match in 'paragraph_matches', only the longest match is kept in the dictionary.
    Returns 'True' if the new match is to be added (prioritized).
    """
    for match in paragraph_matches:
        word_match = match.group(0)
        if word_class == 'Virus_SARS-CoV-2' or word_class == 'Disease_COVID-19':
            prev_tagged = re.match(pattern, word_match)
            if prev_tagged:
                longest_match = max(new_word_match, word_match, key=len)
                if longest_match == new_word_match:
                    del paragraph_matches[match]
                    return True
                return False
            # No conflict with this earlier match; keep checking the remaining ones.
    return True
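
The longest-match rule can also be expressed on character offsets. The following is an alternative sketch, not the method used above: it compares spans directly instead of re-matching the pattern, and the name 'keep_longest_spans' and the tuple layout are assumptions.

```python
def keep_longest_spans(spans):
    """Drop every span that overlaps a longer, already accepted span."""
    kept = []
    # Visit the longest spans first so they win over overlapping shorter ones.
    for start, end, label in sorted(spans, key=lambda s: s[0] - s[1]):
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)


# Made-up spans: (begin, end, word class).
spans = [(4, 15, 'Virus_SARS-CoV-2'),
         (0, 15, 'Virus_SARS-CoV-2'),
         (20, 30, 'chemical_antiviral')]
print(keep_longest_spans(spans))
# [(0, 15, 'Virus_SARS-CoV-2'), (20, 30, 'chemical_antiviral')]
```

Comparing offsets means two separate occurrences of the same word never compete with each other; only spans that actually overlap do.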

**Helper function for tag_article(): get_paragraph_denotation()** <br/>
The helper function takes a string representing the url of a particular article as its input argument. For every match in 'paragraph_matches' we create a denotation, which is appended to the 'denotations' list. The denotation for each match is generated by the helper function 'construct_denotation()', which returns a string. All denotations are then concatenated by the helper function 'concat_denotations()' according to the pub-annotation format, and the complete string is returned.

In [46]:
def get_paragraph_denotation(url):
    """
    Constructs complete string denotation for a paragraph.
    """
    denotations = []
    for match in paragraph_matches:
        denotations.append(construct_denotation(paragraph_matches[match],
                                                str(match.start()),
                                                str(match.end()), url))
    return concat_denotations(denotations)

**Helper function for get_paragraph_denotation(): construct_denotation()** <br/>
The function takes four input arguments: 'idd', which is the word class string; the start offset of a match in a paragraph, as a string; the end offset of the same match, as a string; and the url of the article, retrieved from the metadata file. The strings are concatenated into the denotation for a single match, which is then returned.

In [52]:
def construct_denotation(idd, begin, end, url):
    """
    Returns a string denotation for a single match.
    """
    idd = "\"id\":\"" + idd + "\", "

    span = "\"span\":{\"begin\":" + begin + "," + "\"end\":" + end + "}, "

    obj = "\"obj\":\"" + url + "\""
    denotation = "{" + idd + span + obj + "}"
    return denotation

**Helper function for get_paragraph_denotation(): concat_denotations()** <br/>
The argument for the helper function is a list of strings, where every element is a denotation in pub-annotation format for a single match. If there are several matches in a single paragraph, the strings are merged into one. This string is then returned.

In [53]:
def concat_denotations(denotations):
    """
    Returns a complete denotation string of all separate denotations in
    list parameter, or '[]' if there were no elements in the list.
    """
    # join() places ", " between elements only, and an empty list yields "[]".
    return "[" + ", ".join(denotations) + "]"

**Helper function for tag_article(): construct_pubannotation()** <br/>
This helper function creates the complete string for the pub-annotation to be exported.


In [54]:
def construct_pubannotation(metadata_info, paragraph_index, text, denotation):
    """
    Returns a string in pub-annotation format.
    """
    cord_uid = "\"cord_uid\":\"" + metadata_info[0] + "\", "

    source_x = "\"sourcedb\":\"" + metadata_info[1] + "\", "

    pmcid = "\"sourceid\":\"" + metadata_info[2] + "\", "

    divid = "\"divid\":" + str(paragraph_index) + ", "

    text = "\"text\":\"" + text + "\", "

    project = "\"project\":\"cdlai_CORD-19\", "

    denotations_str = "\"denotations\":" + denotation

    return "{" + cord_uid + source_x + pmcid + divid + text + project + \
           denotations_str + "}"
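
Because the paragraph text is inserted between quotes as-is, a paragraph that itself contains a double quote or a backslash would yield invalid JSON. A possible alternative, sketched here with the hypothetical name 'build_pubannotation', builds a dictionary and lets 'json.dumps()' handle the escaping. It assumes the denotations are passed as dictionaries rather than as a pre-built string.

```python
import json


def build_pubannotation(metadata_info, paragraph_index, text, denotations):
    """Serialise one paragraph's annotation; json.dumps escapes the text."""
    cord_uid, source_x, pmcid = metadata_info
    return json.dumps({'cord_uid': cord_uid,
                       'sourcedb': source_x,
                       'sourceid': pmcid,
                       'divid': paragraph_index,
                       'text': text,
                       'project': 'cdlai_CORD-19',
                       'denotations': denotations})


# Made-up values, for illustration only.
denotations = [{'id': 'Virus_SARS-CoV-2',
                'span': {'begin': 5, 'end': 20},
                'obj': 'https://example.org/article'}]
annotation = build_pubannotation(['uid-0', 'PMC', 'PMC0000000'], 0,
                                 'The "new coronavirus" spread.', denotations)
print(json.loads(annotation)['text'])  # The "new coronavirus" spread.
```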

In [50]:
def export_pubannotation(idd, file_index, section, annotation):
    """
    Export pub-annotation string to corresponding section file.
    """
    file_name = idd + "-" + str(file_index) + "-" + section
    with open("out/" + file_name + ".json", "wt") as text_file:
        text_file.write(annotation)
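
The export writes into an 'out' directory, and opening the file fails if that directory does not exist. The sketch below creates it on first use. The name 'export_annotation' and its 'out_dir' parameter are hypothetical.

```python
import os


def export_annotation(out_dir, file_name, annotation):
    """Write annotation to out_dir/file_name.json and return the path."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder on first use
    path = os.path.join(out_dir, file_name + '.json')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(annotation)
    return path
```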

In [51]:
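# Pair each loaded article with its file name; '.DS_Store' was skipped on load.
article_names = [name for name in article_paths if name != '.DS_Store']
for article_name, article_dict in zip(article_names, article_dicts_list):
    metadata_index = metadata_indices_dict[article_name.replace('.json', '')]
    metadata_dict = metadata_list[metadata_index]
    tag_article(article_dict, metadata_dict)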
