# Word Representation in Biomedical Domain

Before you start, please make sure you have read this notebook. You are encouraged to follow the recommendations but you are also free to develop your own solution from scratch. 

## Marking Scheme

- Biomedical imaging project: 40%
    - 20%: accuracy of the final model on the test set
    - 20%: rationale of model design and final report
- Natural language processing project: 40%
    - 30%: completeness of the project
    - 10%: final report
- Presentation skills and team work: 20%


This project forms 40% of the total score for the summer/winter school. The marking scheme for each part of this project is provided below, with a cap of 100%.

You are allowed to use open source libraries as long as the libraries are properly cited in the code and final report. The usage of third-party code without proper reference will be treated as plagiarism, which will not be tolerated.

You are encouraged to develop the algorithms by yourselves (without using third-party code as much as possible). We will factor such effort into the marking process.

## Setup and Prerequisites 

Recommended environment

- Python 3.7 or newer
- Free disk space: 100GB

Download the data

```sh
# navigate to the data folder
cd data

# download the data file
# which is also available at https://www.semanticscholar.org/cord19/download
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz

# decompress the file which may take several minutes
tar -xf document_parses.tar.gz

# which creates a folder named document_parses
```

## Part 1 (20%): Parse the Data

The JSON files are located in two sub-folders in `document_parses`. You will need to scan all JSON files and extract text (i.e. `string`) from relevant fields (e.g. body text, abstract, titles).

You are encouraged to extract full article text from body text if possible. If the hardware resource is limited, you can extract from abstract or titles as alternatives. 

Note: The number of JSON files is around 425k so it may take more than 10 minutes to parse all documents.

For more information about the dataset: https://www.semanticscholar.org/cord19/download

Recommended output:

- A list of text (`string`) extracted from JSON files.
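
If body text is too large for the available hardware, the abstract or title can be read in the same way. The helper below is a minimal sketch, assuming that each parsed document stores `body_text` and `abstract` as lists of paragraphs with a `text` key and keeps the title under `metadata`. The function name `extract_text` and the sample document are hypothetical; check the field names against the actual JSON files before relying on them.

```python
from typing import Any, Dict


def extract_text(paper: Dict[str, Any], field: str = "body_text") -> str:
    """Return the text of one field of a parsed document.

    `field` is "body_text", "abstract" or "title" (assumed schema).
    Missing fields yield an empty string instead of raising KeyError.
    """
    if field == "title":
        return paper.get("metadata", {}).get("title", "")
    paragraphs = paper.get(field, [])
    return "\n".join(para.get("text", "") for para in paragraphs)


# Hand-written document that mimics the assumed schema (not real data).
sample = {
    "metadata": {"title": "A sample title"},
    "abstract": [{"text": "Short abstract."}],
    "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
}

print(extract_text(sample, "title"))
print(extract_text(sample, "abstract"))
```

Returning an empty string for a missing field keeps the output list aligned with the list of files, which may be convenient if some documents lack an abstract.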

In [1]:
###################
# TODO: add your solution
import os
import json
from typing import List
from tqdm import tqdm

def parse_data() -> List[str]:
    """ Scans all JSON files in folder `data` and extracts full text from them.

    Returns:
       List[str]: A list containing the full article text of all papers.
    """

    res = []

    doc_dir = os.path.join("data", "document_parses")
    pdf_dir = os.path.join(doc_dir, "pdf_json")
    pmc_dir = os.path.join(doc_dir, "pmc_json")
    file_paths = [os.path.join(pdf_dir, filename) for filename in os.listdir(pdf_dir)]
    file_paths.extend([os.path.join(pmc_dir, filename) for filename in os.listdir(pmc_dir)])
    
    with tqdm(file_paths, desc="Parsing papers", unit="paper") as pbar:
        for file_path in pbar:
            with open(file_path, encoding='utf-8') as f:
                paper = json.load(f)
                body_text = paper['body_text']
                res.append('\n'.join(para['text'] for para in body_text))

    return res

full_texts = parse_data()
###################

Parsing papers: 100%|██████████| 425257/425257 [04:14<00:00, 1668.98paper/s]


In [2]:
print(len(full_texts))
print(full_texts[0])

425257
Digital technologies have provided support in diverse policy, business, and societal application areas in the COVID-19 outbreak, such as pandemic management (Radanliev et al., 2020b) , corporate communications (Camilleri, 2020) , analysis of research data (Radanliev et al., 2020a) , and education (Crawford et al., 2020) . COVID-19 started as a global infectious disease in the spring of 2020, but the necessary measures to control the virus went beyond treatment and were also directed against its spread. Thus, for months, all interpersonal relationships were characterized by social distancing, and the pandemic raised not only medical but also social, economic and technological issues, among others. Higher education was one domain that the pandemic affected radically (Nuere and de Miguel, 2020; Watermeyer et al., 2020) . During the worldwide lockdown, higher educational institutions had to immediately switch their activities from the classroom and the campus to a virtual space, whi

## Part 2 (30%): Tokenization

Traverse the extracted text and segment the text into words (or tokens).

The following tracks can be developed independently. You are encouraged to divide the workload among team members.

Recommended output:

- Tokenizer(s) that can tokenize any input text.

Note: Because of the computational complexity of the tokenizers, it may take hours or days to process all documents. Which tokenizer is more efficient? Do you have any ideas for speeding it up?
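
One way to approach the efficiency question is to time each candidate on the same small sample before running it on the full corpus. The following is a minimal sketch using only the standard library. The function name `benchmark`, the sample texts and the two candidate tokenizers are illustrative assumptions, not part of the assignment.

```python
import re
import time
from typing import Callable, Dict, List


def benchmark(tokenizers: Dict[str, Callable[[str], List[str]]],
              texts: List[str], repeat: int = 3) -> Dict[str, float]:
    """Return the best wall-clock time in seconds each tokenizer needs for `texts`."""
    timings = {}
    for name, tokenize in tokenizers.items():
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            for text in texts:
                tokenize(text)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Illustrative sample; in practice use a slice of the extracted texts.
sample_texts = ["The main symptoms of COVID-19 are fever and cough."] * 1000
candidates = {
    "str.split": str.split,
    "regex": re.compile(r"\w+").findall,
}
for name, seconds in benchmark(candidates, sample_texts).items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the best of several runs reduces noise from other processes. Absolute numbers depend on the machine, so only the relative ordering is meaningful.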

### Track 2.1 (10%): Use split()

Use Python's standard `split()` method.
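
The short example below contrasts `str.split()` with `re.split(r'\W+', ...)`, which is a common variant of the same idea. The sample sentence is made up for illustration. Note the empty string that the regular-expression version produces when the text ends with punctuation.

```python
import re

text = "COVID-19 is a disease."

# str.split() only cuts at whitespace, so punctuation stays attached.
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['COVID-19', 'is', 'a', 'disease.']

# re.split(r'\W+') cuts at every run of non-word characters. It also
# produces an empty string when the text ends with punctuation.
regex_tokens = re.split(r"\W+", text)
print(regex_tokens)  # ['COVID', '19', 'is', 'a', 'disease', '']

# Dropping empty strings keeps '' out of the vocabulary.
clean_tokens = [tok for tok in regex_tokens if tok]
print(clean_tokens)  # ['COVID', '19', 'is', 'a', 'disease']
```

Neither variant is ideal for biomedical text: the first keeps `disease.` and `disease` as separate tokens, while the second breaks `COVID-19` into two pieces.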

### Track 2.2 (10%): Use NLTK or SciSpaCy

NLTK tokenizer: https://www.nltk.org/api/nltk.tokenize.html

SciSpaCy: https://github.com/allenai/scispacy

Note: You may need to install NLTK and SpaCy so please refer to their websites for installation instructions.

### Track 2.3 (10%): Use Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE): https://huggingface.co/transformers/tokenizer_summary.html

Note: You may need to install Huggingface's transformers so please refer to its website for installation instructions.

### Track 2.4 (Bonus +5%): Build new Byte-Pair Encoding (BPE)

This track may be dependent on track 2.3.

The above pre-built tokenization methods may not be suitable for the biomedical domain, as the words/tokens (e.g. diseases, symptoms, chemicals, medications, phenotypes, genotypes etc.) can be very different from the words/tokens commonly used in daily life. Can you build and train a new BPE model specifically for the biomedical domain?
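
To make the idea of training BPE concrete, here is a minimal from-scratch sketch of the classic merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a teaching sketch rather than a replacement for an optimised library. The names `learn_bpe`, `apply_bpe` and `merge_pair`, the `</w>` end-of-word marker and the toy corpus are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: List[Pair]) -> List[str]:
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_corpus = ["fever"] * 5 + ["fevers"] * 2 + ["cough"] * 3
toy_merges = learn_bpe(toy_corpus, num_merges=10)
print(toy_merges)
print(apply_bpe("feverish", toy_merges))
```

Words that occur often in the training corpus end up as single tokens, while rare or unseen words are split into smaller known pieces. This is why a BPE model trained on biomedical text is likely to keep domain terms intact more often than a general-purpose one.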

### Open Question (Optional):

- What are the pros and cons of the above tokenizers?

In [2]:
import time
from timeit import Timer
###################
# TODO: add your solution

###################
# Track 2.1

import re
from gensim import corpora
from typing import Optional, List
import string
from pprint import pprint
from multiprocessing.dummy import Pool as ThreadPool

class SplitTokenizer(object):
    """ Tokenizer using the standard `split()` by Python.
    Note:
        Also remove the punctuation and stopwords.
    """

    def __init__(self, read_dict:bool=False, dict_path:Optional[str]=None):
        """ Initialize tokenizer or load dictionary from disk.
        Args:
            read_dict (bool): Whether to load an existing dictionary from disk. Defaults `False`.
            dict_path (Optional[str]): The path of the dictionary. Defaults `None`.
        """
        self.dictionary = corpora.Dictionary()
        # stopwords
        self.stopList = set('for a of the and to in'.split())
        if read_dict:
            # Dictionary.load is a classmethod that returns the loaded object
            self.dictionary = corpora.Dictionary.load(dict_path)


    def tokenize(self, in_texts:List[str]):
        """ Tokenize text using split() and clean stopwords. Generate dictionary using tokens.
        Args:
            in_texts (List[str]): Input text list to tokenize.
        """
        res = []
        for text in tqdm(in_texts, desc="tokenize", unit="text"):
            res.append([word for word in re.split(r'\W+', text.lower()) if word not in self.stopList])
        self.dictionary = corpora.Dictionary(res)

    def save_dict(self, dict_path:str):
        """ Save dictionary to disk.
        Args:
            dict_path (str): Dictionary(.dict file) path to save.
        """
        self.dictionary.save(dict_path)

    def print_dict(self):
        """ Print current dictionary with token, token_id and document frequency of tokens.
        """
        token2id = self.dictionary.token2id
        dfs = self.dictionary.dfs
        token_info = {}
        for word in token2id:
            token_info[word] = dict(
                word=word,
                id=token2id[word],
                freq=dfs[token2id[word]]
            )
        token_items = token_info.values()
        token_items = sorted(token_items, key=lambda x: x['id'])
        print('The info of dictionary: ')
        pprint(token_items)
        print('--------------------------')

    def encode(self, text:str) -> List[int]:
        """ Encode input string.
        Args:
            text (str): Input text to encode.
        Returns:
            List[int]: Corresponding corpus.
        """
        return self.dictionary.doc2idx([word for word in re.split(r'\W+', text.lower()) if word not in self.stopList])

    def decode(self, corpus:List[int]) -> str:
        """ Decode input corpus.
        Args:
            corpus (List[int]): Input corpus.
        Returns:
            str: String after decoding.
        """
        return " ".join([self.id_to_token(t_id) for t_id in corpus])

    def token_to_id(self, word:str) -> int:
        """ Token -> Token_id
        """
        return self.dictionary.token2id[word]

    def id_to_token(self, token_id:int) -> str:
        """ Token_id -> Token
        """
        return self.dictionary[token_id]

    def dict_filter(self, below:int = 5, above:float = 0.5, n=10000):
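        """ Filter dictionary.
        Args:
            below (int): delete tokens that appear in fewer than `below` documents
            above (float): delete tokens that appear in more than this fraction of documents
            n (int): keep only the `n` most frequent of the remaining tokens
        """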
        self.dictionary.filter_extremes(no_below=below, no_above=above, keep_n=n)

    # Speed-up strategy: parallelism
    # The performance bottleneck is the CPU rather than I/O, so the acceleration should be based on process-level parallelism.


In [4]:
tokenizer = SplitTokenizer()
tokenizer.tokenize(full_texts)
tokenizer.save_dict("splitDict.dict")

tokenize:  69%|██████▉   | 294122/425257 [15:50<02:26, 893.84text/s]IOStream.flush timed out
tokenize: 100%|██████████| 425257/425257 [18:34<00:00, 381.41text/s]  


In [91]:
# Simple usage example of SplitTokenizer

texts = full_texts[0:100]

tokenizer = SplitTokenizer()
tokenizer.tokenize(texts)

tokenizer.print_dict()
print(tokenizer.encode("GV developed the idea for this empirical research and was involved in all steps of the study and the manuscript preparation. GV and AU prepared the instrument applied in the empirical study."))
print(tokenizer.decode([1011, 648, 99, 360, 255]))
print(tokenizer.id_to_token(77))
print(tokenizer.token_to_id("covid"))

tokenize: 100%|██████████| 100/100 [00:00<00:00, 1114.07text/s]


The info of dictionary: 
[{'freq': 91, 'id': 0, 'word': ''},
 {'freq': 72, 'id': 1, 'word': '0'},
 {'freq': 15, 'id': 2, 'word': '001'},
 {'freq': 23, 'id': 3, 'word': '01'},
 {'freq': 95, 'id': 4, 'word': '1'},
 {'freq': 79, 'id': 5, 'word': '12'},
 {'freq': 12, 'id': 6, 'word': '128'},
 {'freq': 4, 'id': 7, 'word': '147'},
 {'freq': 1, 'id': 8, 'word': '16dii127'},
 {'freq': 88, 'id': 9, 'word': '19'},
 {'freq': 1, 'id': 10, 'word': '1958'},
 {'freq': 1, 'id': 11, 'word': '1966'},
 {'freq': 2, 'id': 12, 'word': '1973'},
 {'freq': 5, 'id': 13, 'word': '1980'},
 {'freq': 3, 'id': 14, 'word': '1989'},
 {'freq': 5, 'id': 15, 'word': '1995'},
 {'freq': 8, 'id': 16, 'word': '1999'},
 {'freq': 94, 'id': 17, 'word': '2'},
 {'freq': 73, 'id': 18, 'word': '20'},
 {'freq': 10, 'id': 19, 'word': '2000'},
 {'freq': 11, 'id': 20, 'word': '2001'},
 {'freq': 12, 'id': 21, 'word': '2002'},
 {'freq': 11, 'id': 22, 'word': '2003'},
 {'freq': 10, 'id': 23, 'word': '2005'},
 {'freq': 11, 'id': 24, 'word'

In [92]:
tokenizer.dict_filter()
tokenizer.print_dict()

The info of dictionary: 
[{'freq': 15, 'id': 0, 'word': '001'},
 {'freq': 23, 'id': 1, 'word': '01'},
 {'freq': 12, 'id': 2, 'word': '128'},
 {'freq': 5, 'id': 3, 'word': '1980'},
 {'freq': 5, 'id': 4, 'word': '1995'},
 {'freq': 8, 'id': 5, 'word': '1999'},
 {'freq': 10, 'id': 6, 'word': '2000'},
 {'freq': 11, 'id': 7, 'word': '2001'},
 {'freq': 12, 'id': 8, 'word': '2002'},
 {'freq': 11, 'id': 9, 'word': '2003'},
 {'freq': 10, 'id': 10, 'word': '2005'},
 {'freq': 11, 'id': 11, 'word': '2006'},
 {'freq': 12, 'id': 12, 'word': '2007'},
 {'freq': 11, 'id': 13, 'word': '2008'},
 {'freq': 13, 'id': 14, 'word': '2009'},
 {'freq': 12, 'id': 15, 'word': '2010'},
 {'freq': 15, 'id': 16, 'word': '2011'},
 {'freq': 19, 'id': 17, 'word': '2012'},
 {'freq': 15, 'id': 18, 'word': '2013'},
 {'freq': 13, 'id': 19, 'word': '2014'},
 {'freq': 16, 'id': 20, 'word': '2015'},
 {'freq': 17, 'id': 21, 'word': '2017'},
 {'freq': 41, 'id': 22, 'word': '34'},
 {'freq': 49, 'id': 23, 'word': '35'},
 {'freq': 30

In [101]:
tokenizer = SplitTokenizer()
tokenizer.tokenize(full_texts[1:1000])
tokenizer.parallel_tokenize(full_texts[1:1000])

tokenize: 100%|██████████| 99/99 [00:00<00:00, 905.92text/s]


TypeError: __init__() got an unexpected keyword argument 'print_tmpl'

In [5]:
###################
# Track 2.2

"""
Note:
    Download the NLTK data. This only needs to be run once. If it fails, download the data manually from https://www.nltk.org/.
"""

import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\KWZhu\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\KWZhu\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\KWZhu\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\KWZhu\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\KWZhu\AppData\Roaming\nltk_data...
[nltk_data]    | 

True

In [94]:
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer

class NLTKTokenizer(object):
    """ Tokenizer using NLTK.
    Note:
        Besides removing punctuation and stopwords, this tokenizer also applies stemming during text cleaning to reduce the size of the dictionary and help follow-up training.
    """

    def __init__(self, read_file:bool=False, dict_path:Optional[str]=None):
        self.remove = str.maketrans('', '', string.punctuation)
        self.dictionary = None
        self.porter_stem = PorterStemmer()
        # cache the stopword list once; rebuilding it for every word is very slow
        self.stop_words = set(stopwords.words('english'))
        if read_file:
            # Dictionary.load is a classmethod that returns the loaded object
            self.dictionary = corpora.Dictionary.load(dict_path)


    def tokenize(self, in_texts:List[str]):
        tokens = []
        for text in tqdm(in_texts, desc="tokenize", unit="text"):
            sens = sent_tokenize(text)
            res = []
            for sen in sens:
                # remove punctuations
                sen = sen.translate(self.remove)
                # stemming
                res.extend([self.porter_stem.stem(word) for word in word_tokenize(sen.lower()) if word not in self.stop_words])
            tokens.append(res)
        self.dictionary = corpora.Dictionary(tokens)

    def save_dict(self, dict_path:str):
        self.dictionary.save(dict_path)

    def print_dict(self):
        token2id = self.dictionary.token2id
        dfs = self.dictionary.dfs
        token_info = {}
        for word in token2id:
            token_info[word] = dict(
                word=word,
                id=token2id[word],
                freq=dfs[token2id[word]]
            )
        token_items = token_info.values()
        token_items = sorted(token_items, key=lambda x: x['id'])
        print('The info of dictionary: ')
        pprint(token_items)
        print('--------------------------')

    def encode(self, text:str) -> List[int]:
        sens = sent_tokenize(text)
        res = []
        for sen in sens:
            sen = sen.translate(self.remove)
            res.extend(self.dictionary.doc2idx([self.porter_stem.stem(word) for word in word_tokenize(sen.lower()) if not word in stopwords.words('english')]))
        return res

    def decode(self, corpus:List[int]) -> str:
        return " ".join([self.id_to_token(t_id) for t_id in corpus])

    def token_to_id(self, word:str) -> int:
        return self.dictionary.token2id[self.porter_stem.stem(word)]

    def id_to_token(self, token_id:int) -> str:
        return self.dictionary[token_id]

    def dict_filter(self, below = 5, above = 0.5, n=10000):
        self.dictionary.filter_extremes(no_below=below, no_above=above, keep_n=n)

In [95]:
# Simple usage example of NLTKTokenizer

texts = full_texts[0:100]

tokenizer = NLTKTokenizer()
tokenizer.tokenize(texts)


tokenizer.print_dict()
print(tokenizer.encode("GV developed the idea for this empirical research and was involved in all steps of the study and the manuscript preparation. GV and AU prepared the instrument applied in the empirical study."))
print(tokenizer.decode([1011, 648, 99, 360, 255]))
print(tokenizer.id_to_token(77))
print(tokenizer.token_to_id("covid"))

tokenize: 100%|██████████| 100/100 [00:22<00:00,  4.51text/s]


The info of dictionary: 
[{'freq': 1, 'id': 0, 'word': '0001level'},
 {'freq': 1, 'id': 1, 'word': '001level'},
 {'freq': 1, 'id': 2, 'word': '0351'},
 {'freq': 1, 'id': 3, 'word': '0573'},
 {'freq': 1, 'id': 4, 'word': '0652'},
 {'freq': 1, 'id': 5, 'word': '0678'},
 {'freq': 1, 'id': 6, 'word': '0758'},
 {'freq': 1, 'id': 7, 'word': '0929'},
 {'freq': 1, 'id': 8, 'word': '0997'},
 {'freq': 93, 'id': 9, 'word': '1'},
 {'freq': 11, 'id': 10, 'word': '119'},
 {'freq': 76, 'id': 11, 'word': '12'},
 {'freq': 12, 'id': 12, 'word': '128'},
 {'freq': 3, 'id': 13, 'word': '147'},
 {'freq': 5, 'id': 14, 'word': '157'},
 {'freq': 1, 'id': 15, 'word': '16dii127'},
 {'freq': 1, 'id': 16, 'word': '1958'},
 {'freq': 1, 'id': 17, 'word': '1966'},
 {'freq': 2, 'id': 18, 'word': '1973'},
 {'freq': 6, 'id': 19, 'word': '1980'},
 {'freq': 3, 'id': 20, 'word': '1989'},
 {'freq': 5, 'id': 21, 'word': '1995'},
 {'freq': 7, 'id': 22, 'word': '1999'},
 {'freq': 92, 'id': 23, 'word': '2'},
 {'freq': 10, 'id':

In [96]:
tokenizer.dict_filter()
tokenizer.print_dict()

The info of dictionary: 
[{'freq': 11, 'id': 0, 'word': '119'},
 {'freq': 12, 'id': 1, 'word': '128'},
 {'freq': 5, 'id': 2, 'word': '157'},
 {'freq': 6, 'id': 3, 'word': '1980'},
 {'freq': 5, 'id': 4, 'word': '1995'},
 {'freq': 7, 'id': 5, 'word': '1999'},
 {'freq': 10, 'id': 6, 'word': '2000'},
 {'freq': 11, 'id': 7, 'word': '2001'},
 {'freq': 11, 'id': 8, 'word': '2002'},
 {'freq': 10, 'id': 9, 'word': '2003'},
 {'freq': 10, 'id': 10, 'word': '2005'},
 {'freq': 11, 'id': 11, 'word': '2006'},
 {'freq': 11, 'id': 12, 'word': '2007'},
 {'freq': 11, 'id': 13, 'word': '2008'},
 {'freq': 13, 'id': 14, 'word': '2009'},
 {'freq': 13, 'id': 15, 'word': '2010'},
 {'freq': 14, 'id': 16, 'word': '2011'},
 {'freq': 19, 'id': 17, 'word': '2012'},
 {'freq': 14, 'id': 18, 'word': '2013'},
 {'freq': 13, 'id': 19, 'word': '2014'},
 {'freq': 16, 'id': 20, 'word': '2015'},
 {'freq': 16, 'id': 21, 'word': '2017'},
 {'freq': 8, 'id': 22, 'word': '2021'},
 {'freq': 50, 'id': 23, 'word': '26'},
 {'freq': 7

In [22]:
###################
# Track 2.3
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from itertools import chain
import re
from nltk import sent_tokenize
from typing import List, Dict, Iterable, Union, Optional


class BPETokenizer:

    def __init__(self, from_file:bool=False, tokenizer_path:Optional[str]=None):
        """ Initializes a BPE tokenizer or loads an existing tokenizer from file.

        Args:
            from_file (bool): Whether to load an existing tokenizer from file. Defaults to `False`.
            tokenizer_path (Optional[str]): The path of the tokenizer (if `from_file` is `True`).
        """
        if from_file:  # load a trained tokenizer
            assert tokenizer_path
            self.tokenizer = Tokenizer.from_file(tokenizer_path)
        else:  # initialize a tokenizer
            self.tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
            self.tokenizer.pre_tokenizer = Whitespace()


    def normalize(self, text: Union[str, List[str]]) -> Iterable[str]:
        """ Normalizes the text, that is, converts uppercase to lowercase and removes punctuations.

        Args:
            text (Union[str, List[str]]): Text to normalize.

        Returns:
            Iterable[str]: Iterator of normalized sentences.
        """
        if isinstance(text, str):
            text = [text]
        # separate sentences
        text = map(lambda doc: sent_tokenize(doc.lower()), text)  # iterator of list of sentences
        text = chain.from_iterable(text)  # iterator of sentences
        # remove punctuations
        punc_pat = re.compile(R"[^\w\s'-]+")  # punctuations that should be removed
        return map(lambda doc: punc_pat.sub("", doc), text)


    def train(self, text:List[str], vocab_size:int=10000, save_path:Optional[str]=None):
        """ Trains on corpus.

        Args:
            text (List[str]): A list of documents.
            vocab_size (int): The desired vocabulary size. Defaults to `10000`.
            save_path (Optional[str]): The path to store the trained tokenizer in.
        """
        trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocab_size)
        text = self.normalize(text)
        self.tokenizer.train_from_iterator(text, trainer)
        if save_path:
            self.tokenizer.save(save_path)


    def save(self, save_path: str):
        """ Saves the tokenizer to file.

        Args:
            save_path (str): The path to store the trained tokenizer in.
        """
        self.tokenizer.save(save_path)


    def encode(self, text: str) -> Dict[str, List[Union[str, int]]]:
        """ Encodes a string.

        Args:
            text (str): A string to encode.

        Returns:
            Dict[str, List[Union[str, int]]]: Output of encoding, including "tokens" and "ids".
        """
        text = self.normalize(text)
        tokens = ['[CLS]']
        ids = [self.token_to_id('[CLS]')]
        for stc in text:
            enc = self.tokenizer.encode(stc)
            tokens.extend(enc.tokens)
            tokens.append("[SEP]")
            ids.extend(enc.ids)
            ids.append(self.token_to_id('[SEP]'))
        return {"tokens": tokens, "ids": ids}


    def decode(self, ids: List[int]) -> str:
        """ Decodes a sequence of ids.

        Args:
            ids (List[int]): A list of ids of tokens.

        Returns:
            str: The decoded string.
        """
        return " ".join([self.id_to_token(id) for id in ids])


    def add_tokens(self, tokens: List[str]):
        """ Adds given tokens to vocabulary.

        Args:
            tokens (List[str]): Tokens to add.
        """
        self.tokenizer.add_tokens([tkn.lower() for tkn in tokens])


    def token_to_id(self, token: str) -> int:
        return self.tokenizer.token_to_id(token)

    def id_to_token(self, id: int) -> str:
        return self.tokenizer.id_to_token(id)

    @property
    def vocab(self):
        return self.tokenizer.get_vocab()

    @property
    def vocab_size(self):
        return self.tokenizer.get_vocab_size()

###################

In [28]:
# Train a BPE tokenizer

tokenizer = BPETokenizer()
tokenizer.train(full_texts, vocab_size=10000, save_path="BPETokenizer10000.json")


In [27]:
# Usage of BPE tokenizer

tokenizer = BPETokenizer(from_file=True, tokenizer_path="BPETokenizer15000.json")
# tokenizer = Tokenizer.from_file("BPETokenizerPro10000.json")

print(tokenizer.vocab_size)

print(tokenizer.id_to_token(9999))
print(tokenizer.token_to_id("college"))

output = tokenizer.encode("How are you? I'm fine, thank you, and you? I'm fine, too.")
print(output)
print(tokenizer.decode(output["ids"]))

print(tokenizer.encode("diseases")["tokens"])
print(tokenizer.encode("symptoms")["tokens"])
print(tokenizer.encode("chemicals")["tokens"])
print(tokenizer.encode("medications")["tokens"])
print(tokenizer.encode("phenotypes")["tokens"])
print(tokenizer.encode("genotypes")["tokens"])
print(tokenizer.encode("COVID-19")["tokens"])
print(tokenizer.encode("I have 30,000 apples.")["tokens"])

tokenizer.add_tokens(["phenotypes", "genotypes", "COVID-19", "30000"])
print(tokenizer.vocab_size)

print(tokenizer.encode("phenotypes")["tokens"])
print(tokenizer.encode("genotypes")["tokens"])
print(tokenizer.encode("COVID-19")["tokens"])
print(tokenizer.encode("I have 30,000 apples.")["tokens"])
print(tokenizer.encode("The main symptoms of COVID-19 are fever and cough.")["tokens"])

print(tokenizer.encode(full_texts[0])["tokens"])


15000
college
9999
{'tokens': ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', "'", 'm', 'fine', 'than', 'k', 'you', 'and', 'you', '[SEP]', 'i', "'", 'm', 'fine', 'too', '[SEP]'], 'ids': [1, 5805, 4877, 7095, 2, 26, 5, 30, 11336, 5161, 28, 7095, 4830, 7095, 2, 26, 5, 30, 11336, 8174, 2]}
[CLS] how are you [SEP] i ' m fine than k you and you [SEP] i ' m fine too [SEP]
['[CLS]', 'diseases', '[SEP]']
['[CLS]', 'symptoms', '[SEP]']
['[CLS]', 'chemicals', '[SEP]']
['[CLS]', 'medications', '[SEP]']
['[CLS]', 'phenotypes', '[SEP]']
['[CLS]', 'genotypes', '[SEP]']
['[CLS]', 'covid', '-', '19', '[SEP]']
['[CLS]', 'i', 'have', '300', '00', 'ap', 'ples', '[SEP]']
15002
['[CLS]', 'phenotypes', '[SEP]']
['[CLS]', 'genotypes', '[SEP]']
['[CLS]', 'covid-19', '[SEP]']
['[CLS]', 'i', 'have', '30000', 'ap', 'ples', '[SEP]']
['[CLS]', 'the', 'main', 'symptoms', 'of', 'covid-19', 'are', 'fever', 'and', 'cough', '[SEP]']
['[CLS]', 'according', 'to', 'current', 'live', 'statistics', 'at', 'the', 'time', 'of', '


## Part 3 (30%): Build Word Representations

Build word representations for each extracted word. If hardware resources are limited, you may limit the vocabulary size to 10k words/tokens (or even fewer) and the dimension of the representations to 256.

The following tracks can be developed independently. You are encouraged to divide the workload to each team member.

### Track 3.1 (15%): Use N-gram Language Modeling

N-gram language modeling predicts a target word from the `n-1` words of its preceding context. Specifically,

$P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-n+1})$

For example, given a sentence, `"the main symptoms of COVID-19 are fever and cough"`, if $n=7$, we use previous context `["the", "main", "symptoms", "of", "COVID-19", "are"]` to predict the next word `"fever"`.
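
The construction of training examples can be sketched as a sliding window over the tokens, where each target word is paired with the `n - 1` words before it. The function name `ngram_examples` is hypothetical, and whitespace splitting is used only to keep the example short.

```python
from typing import List, Tuple


def ngram_examples(tokens: List[str], n: int) -> List[Tuple[List[str], str]]:
    """Return (context, target) pairs where the context holds n - 1 words."""
    context_size = n - 1
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]


sentence = "the main symptoms of COVID-19 are fever and cough".split()
for context, target in ngram_examples(sentence, n=7):
    print(context, "->", target)
```

With $n=7$ the nine-word sentence yields three examples, the first of which is the one described above (six context words predicting `"fever"`). Positions with fewer than `n - 1` preceding words are skipped here; padding the start of the sentence would be an alternative.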

More to read: https://web.stanford.edu/~jurafsky/slp3/3.pdf

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.2 (15%): Use Skip-gram with Negative Sampling

In skip-gram, we use a central word to predict its context. Specifically,

$P(w_{c-m}, ... w_{c-1}, w_{c+1}, ..., w_{c+m} | w_c)$

As the learning objective of skip-gram is computationally inefficient (it requires a summation over the entire vocabulary $|V|$), negative sampling is commonly applied to accelerate training.

In negative sampling, we randomly select one word from the context as a positive sample, and randomly select $K$ words from the vocabulary as negative samples. As a result, the learning objective is updated to
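
The sampling step described above can be sketched as follows. This is a simplified illustration: negatives are drawn uniformly from the vocabulary, whereas word2vec implementations usually draw them from a smoothed unigram distribution, and a negative may occasionally coincide with a true context word. The function name `skipgram_samples` and its parameters are assumptions.

```python
import random
from typing import List, Tuple

Sample = Tuple[int, int, List[int]]  # (center id, positive id, negative ids)


def skipgram_samples(token_ids: List[int], window: int, num_negatives: int,
                     vocab_size: int, rng: random.Random) -> List[Sample]:
    """Build one (center, positive, negatives) sample per token position."""
    samples = []
    for pos, center in enumerate(token_ids):
        lo = max(0, pos - window)
        hi = min(len(token_ids), pos + window + 1)
        context = [token_ids[j] for j in range(lo, hi) if j != pos]
        if not context:
            continue  # a one-token document has no context
        positive = rng.choice(context)  # one word from the context window
        negatives = [rng.randrange(vocab_size) for _ in range(num_negatives)]
        samples.append((center, positive, negatives))
    return samples


rng = random.Random(0)  # fixed seed so the run is reproducible
for sample in skipgram_samples([0, 1, 2, 3, 4], window=2, num_negatives=3,
                               vocab_size=10, rng=rng):
    print(sample)
```

Here `window` plays the role of $m$ and `num_negatives` the role of $K$.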

$L = -\log\sigma(u^T_{t} v_c) - \sum_{k=1}^K\log\sigma(-u^T_k v_c)$, where $u_t$ is the vector embedding of the positive sample from the context, $u_k$ are the vector embeddings of the negative samples, $v_c$ is the vector embedding of the central word, and $\sigma$ refers to the sigmoid function.
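
As a sanity check, the objective above can be evaluated directly for small vectors. The sketch below uses plain Python lists instead of a tensor library, and the function names are illustrative. A real implementation would compute the same quantity in batches with automatic differentiation and a numerically stable log-sigmoid.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_c: Sequence[float], u_t: Sequence[float],
                           u_negs: List[Sequence[float]]) -> float:
    """L = -log sigma(u_t . v_c) - sum_k log sigma(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(u_t, v_c)))
    for u_k in u_negs:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss


v_c = [1.0, 0.0]                   # central word
u_t = [1.0, 0.0]                   # positive sample, aligned with v_c
u_negs = [[-1.0, 0.0], [0.0, 1.0]]  # K = 2 negative samples
print(negative_sampling_loss(v_c, u_t, u_negs))
```

The loss shrinks when the positive vector points in the same direction as $v_c$ and the negative vectors point away from it. With all-zero vectors every sigmoid equals 0.5, so the loss is $(K+1)\log 2$.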

More to read: http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf sections 4.3 and 4.4

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.3 (Bonus +5%): Use Contextualised Word Representation by Masked Language Model (MLM)

BERT introduces a new language model for pre-training named Masked Language Model (MLM). The advantage of MLM is that the word representations by MLM will be contextualised.

For example, "stick" may have different meanings in different contexts. With N-gram language modeling and word2vec (skip-gram, CBOW), the word representation of "stick" is fixed regardless of its context. However, MLM will learn the representation of "stick" dynamically based on context. In other words, "stick" will have different representations in different contexts under MLM.

More to read: http://jalammar.github.io/illustrated-bert/ and https://arxiv.org/pdf/1810.04805.pdf
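
The data-preparation side of MLM can be sketched without any deep-learning library. The example below follows the masking recipe from the BERT paper (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). The function name `mask_tokens` and the toy vocabulary are assumptions, and special tokens such as `[CLS]` are not treated separately in this simplified version.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(tokens: List[str], vocab: List[str], rng: random.Random,
                mask_prob: float = 0.15) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model inputs, labels); labels are None where nothing is predicted."""
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)  # kept unchanged but still predicted
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


toy_vocab = ["fever", "cough", "stick", "virus", "the"]
tokens = "the main symptoms of covid-19 are fever and cough".split()
inputs, labels = mask_tokens(tokens, toy_vocab, random.Random(0), mask_prob=0.3)
print(inputs)
print(labels)
```

Because the model has to predict the original token from the surrounding words, the hidden state at each position depends on the whole sentence, which is what makes the resulting representations contextual.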

Recommended outputs:

- An algorithm that is able to generate contextualised representation in real time.

In [3]:
###################
# TODO: add your solution

###################
# Track 3.2


###################

## Part 4 (20%): Explore the Word Representations

The following tracks can be finished independently. You are encouraged to divide workload to each team member.

### Track 4.1 (5%): Visualise the word representations by t-SNE

t-SNE is a dimensionality reduction algorithm that is commonly used to visualise high-dimensional vectors. Use t-SNE to visualise the word representations. You may visualise up to 1000 words, as t-SNE is computationally expensive.

More about t-SNE: https://lvdmaaten.github.io/tsne/

Recommended output:

- A diagram by t-SNE based on representations of up to 1000 words.

### Track 4.2 (5%): Visualise the Word Representations of Biomedical Entities by t-SNE

Instead of visualising the word representations of the entire vocabulary (or 1000 words selected at random), visualise the word representations of words that are biomedical entities, for example fever, cough, diabetes etc. Based on the category of those biomedical entities, can you assign different colours to the entities and see whether entities from the same category are clustered together by t-SNE? For example, sinusitis and cough are both respiratory conditions, so they should be assigned the same colour, and ideally their representations should be close to each other in the t-SNE plot. As another example, Alzheimer's disease and headache are neurological conditions, which should be assigned another colour.

Examples of biomedical ontologies: https://www.ebi.ac.uk/ols/ontologies/hp and https://en.wikipedia.org/wiki/International_Classification_of_Diseases

Recommended output:

- A diagram with colours by t-SNE based on representations of biomedical entities.

### Track 4.3 (5%): Co-occurrence

- What are the biomedical entities which frequently co-occur with COVID-19 (or coronavirus)?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
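
One simple way to obtain such a list is to count, for each entity, the number of documents in which it appears together with the target term. The sketch below assumes documents are already tokenized and that a set of known entity names is available. The function name `cooccurrence_counts` and the toy documents are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def cooccurrence_counts(documents: Iterable[List[str]], target: str,
                        entities: Set[str]) -> Counter:
    """Count in how many documents each entity appears together with `target`."""
    counts: Counter = Counter()
    for tokens in documents:
        present = set(tokens)
        if target in present:
            counts.update(present & (entities - {target}))
    return counts


toy_docs = [
    ["covid", "fever", "cough"],
    ["covid", "fever"],
    ["flu", "fever"],
    ["covid", "pneumonia"],
]
toy_entities = {"fever", "cough", "pneumonia", "diabetes"}
counts = cooccurrence_counts(toy_docs, "covid", toy_entities)
for entity, n_docs in counts.most_common():
    print(entity, n_docs)
```

Counting at document level favours very common terms. Counting within sentences or a fixed token window, or normalising by each entity's overall frequency (for example with pointwise mutual information), are possible refinements.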

### Track 4.4 (5%): Semantic Similarity

- Which biomedical entities have the closest semantic similarity to COVID-19 (or coronavirus) based on word representations?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
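
Semantic similarity between word vectors is commonly measured with cosine similarity. The sketch below ranks entities by their cosine similarity to a query word. The two-dimensional vectors are invented purely for illustration and the function names are hypothetical; in the project the vectors would come from Part 3.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def most_similar(query: str, embeddings: Dict[str, Sequence[float]],
                 top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the `top_k` words closest to `query`, best first."""
    q = embeddings[query]
    scored = [(w, cosine(q, v)) for w, v in embeddings.items() if w != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Invented vectors, for illustration only.
toy_embeddings = {
    "covid": [1.0, 0.0],
    "sars": [0.9, 0.1],
    "fever": [0.5, 0.5],
    "apple": [0.0, 1.0],
}
for word, score in most_similar("covid", toy_embeddings):
    print(f"{word}: {score:.3f}")
```

To restrict the ranking to biomedical entities, filter `embeddings` to the entity vocabulary before calling `most_similar`.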

### Open Question (Optional): What else can you discover?


In [4]:
###################
# TODO: add your solution

###################

## Part 5 (Bonus +10%): Open Challenge: Mining Biomedical Knowledge

A fundamental task in clinical/biomedical natural language processing is to extract intelligence from biomedical text corpus automatically and efficiently. More specifically, the intelligence may include biomedical entities mentioned in text, relations between biomedical entities, clinical features of patients, progression of diseases, all of which can be used to predict, understand and improve patients' outcomes. 

This open challenge is to build a biomedical knowledge graph based on the CORD-19 dataset and mine useful information from it. We recommend the following steps but you are also encouraged to develop your solution from scratch.

### Extract Biomedical Entities from Text

Extract biomedical entities (such as fever, cough, headache, lung cancer, heart attack) from text. Note that:

- The biomedical entities may consist of multiple words. For example, heart attack, multiple myeloma etc.
- The biomedical entities may be written as synonyms. For example, low blood pressure for hypotension.
- The biomedical entities may be written in different forms. For example, smoking, smokes, smoked.
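
The three points above can be handled, in the simplest case, with a lexicon that maps surface forms to canonical entity names and a greedy longest-match scan. The sketch below is a dictionary-based baseline, not a trained entity recogniser. The lexicon contents and the function name `extract_entities` are hypothetical; a real lexicon could be derived from an ontology.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: surface form (lowercased tokens) -> canonical entity.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("heart", "attack"): "heart attack",         # multi-word entity
    ("low", "blood", "pressure"): "hypotension",  # synonym
    ("hypotension",): "hypotension",
    ("smoking",): "smoking",                      # different word forms
    ("smokes",): "smoking",
    ("smoked",): "smoking",
    ("fever",): "fever",
}


def extract_entities(tokens: List[str],
                     lexicon: Dict[Tuple[str, ...], str]) -> List[str]:
    """Scan left to right, always preferring the longest matching surface form."""
    max_len = max(len(key) for key in lexicon)
    found = []
    i = 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + size])
            if key in lexicon:
                found.append(lexicon[key])
                i += size
                break
        else:
            i += 1  # no entity starts at this position
    return found


text = "He smoked and had low blood pressure before the heart attack"
print(extract_entities(text.split(), LEXICON))
```

Listing every inflected form by hand does not scale; applying a stemmer or lemmatiser to both the lexicon and the text before matching is one way to reduce the number of entries.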

### Extract Relations between Biomedical Entities

Extract relations between biomedical entities based on their appearance in text. You may define a relation between biomedical entities by one or more of the following criteria:

- The biomedical entities frequently co-occur.
- The biomedical entities have similar word representations.
- The biomedical entities have clear relations based on textual narratives. For example, "The most common symptoms for COVID-19 are fever and cough" so we know there are relations between "COVID-19", "fever" and "cough".

### Build a Biomedical Knowledge Graph of COVID-19

Build a knowledge graph based on the entities and relations extracted in the two steps above and visualise it.
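
Before choosing a visualisation library, the graph itself can be represented as a weighted adjacency dictionary. The sketch below uses the co-occurrence criterion: two entities are linked if they are mentioned in the same sentence, and the edge weight is the number of such sentences. The function name `build_graph`, the `min_count` threshold and the toy input are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List


def build_graph(sentence_entities: Iterable[List[str]],
                min_count: int = 1) -> Dict[str, Dict[str, int]]:
    """Build an undirected weighted graph from per-sentence entity lists."""
    weights: Dict[tuple, int] = defaultdict(int)
    for entities in sentence_entities:
        # Each unordered pair is counted at most once per sentence.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    graph: Dict[str, Dict[str, int]] = defaultdict(dict)
    for (a, b), weight in weights.items():
        if weight >= min_count:
            graph[a][b] = weight
            graph[b][a] = weight
    return dict(graph)


toy_sentences = [
    ["COVID-19", "fever", "cough"],
    ["COVID-19", "fever"],
    ["hypotension"],
]
graph = build_graph(toy_sentences)
for node, neighbours in graph.items():
    print(node, "->", neighbours)
```

Raising `min_count` removes weak edges, which may help keep the visualised graph readable. The resulting dictionary can be passed to a graph library of your choice for layout and drawing.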

In [5]:
###################
# TODO: add your solution

###################