# Word Representation in Biomedical Domain

Before you start, please make sure you have read this notebook. You are encouraged to follow the recommendations but you are also free to develop your own solution from scratch.

## Marking Scheme

- Biomedical imaging project: 40%
  - 20%: accuracy of the final model on the test set
  - 20%: rationale of model design and final report
- Natural language processing project: 40%
  - 30%: completeness of the project
  - 10%: final report
- Presentation skills and team work: 20%

This project forms 40\% of the total score for summer/winter school. The marking scheme of each part of this project is provided below with a cap of 100\%.

You are allowed to use open source libraries as long as the libraries are properly cited in the code and final report. The usage of third-party code without proper reference will be treated as plagiarism, which will not be tolerated.

You are encouraged to develop the algorithms by yourselves (without using third-party code as much as possible). We will factor such effort into the marking process.


## Setup and Prerequisites

Recommended environment

- Python 3.7 or newer
- Free disk space: 100GB

Download the data

```sh
# navigate to the data folder
cd data

# download the data file
# which is also available at https://www.semanticscholar.org/cord19/download
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz

# decompress the file which may take several minutes
tar -xf document_parses.tar.gz

# which creates a folder named document_parses
```


## Part 1 (20%): Parse the Data

The JSON files are located in two sub-folders in `document_parses`. You will need to scan all JSON files and extract content (i.e. `string`) from relevant fields (e.g. body text, abstract, titles).

You are encouraged to extract full article text from body text if possible. If the hardware resource is limited, you can extract from abstract or titles as alternatives.

Note: The number of JSON files is around 425k so it may take more than 10 minutes to parse all documents.

For more information about the dataset: https://www.semanticscholar.org/cord19/download

Recommended output:

- A list of text (`string`) extracted from JSON files.
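
As a starting point, the scan described above could look like the sketch below. It assumes each parsed paper stores its title under `metadata.title` and its paragraphs as lists of objects with a `text` key under `abstract` and `body_text`; check this against the dataset documentation before relying on it. The helper names `extract_text` and `parse_folder` are illustrative only.

```python
import json
from pathlib import Path


def extract_text(json_path, fields=("title", "abstract", "body_text")):
    """Return the requested fields of one parsed paper as a single string."""
    with open(json_path, encoding="utf-8") as f:
        paper = json.load(f)

    parts = []
    if "title" in fields:
        # Assumed layout: the title lives under metadata.title.
        parts.append(paper.get("metadata", {}).get("title", ""))
    for field in ("abstract", "body_text"):
        if field in fields:
            # Assumed layout: a list of paragraph dicts with a "text" key.
            parts.extend(p.get("text", "") for p in paper.get(field, []))
    return "\n".join(part for part in parts if part)


def parse_folder(root="document_parses", fields=("title", "abstract", "body_text")):
    """Scan every JSON file below `root` and return a list of strings."""
    return [extract_text(path, fields) for path in sorted(Path(root).rglob("*.json"))]
```

Passing `fields=("title", "abstract")` keeps memory use down when hardware is limited.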


In [None]:
import csv

# Open and read the metadata CSV file
with open(r"all_sources_metadata_2020-03-13.csv", encoding="utf-8") as f_in:
    reader = csv.DictReader(f_in)

    # Open the output text file for writing
    with open("titles_abstracts.txt", "w", encoding="utf-8") as f_out:
        for row in reader:
            title = row["title"].strip() if row["title"] else " "
            abstract = row["abstract"].strip() if row["abstract"] else " "

            # Combine the title and the abstract
            text = f"{title}\n{abstract}\n\n"

            # Write to the file
            f_out.write(text)

print("Titles and abstracts saved to titles_abstracts.txt")

## Part 2 (30%): Tokenization

Traverse the extracted text and segment the text into words (or tokens).

The following tracks can be developed independently. You are encouraged to divide the workload among team members.

Recommended output:

- Tokenizer(s) able to tokenize any input text.

Note: Tokenization is computationally expensive, so processing all documents may take hours or even days. Which tokenizer is the most efficient? Can you think of ways to speed it up?

### Track 2.1 (10%): Use split()

Use Python's built-in `str.split()`.
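
To make the trade-off concrete, here is a small sketch comparing plain whitespace splitting with a regular-expression split. The example sentence is made up for illustration.

```python
import re

text = "Fever, cough and dyspnea are COVID-19 symptoms."

# str.split() with no argument splits on runs of whitespace only,
# so punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Splitting on non-word characters removes punctuation but also breaks
# hyphenated terms such as "COVID-19" into two tokens.
regex_tokens = [tok for tok in re.split(r"\W+", text) if tok]

print(whitespace_tokens)
print(regex_tokens)
```

Neither variant is ideal for biomedical text, which is one motivation for the other tracks.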

### Track 2.2 (10%): Use NLTK or SciSpaCy

NLTK tokenizer: https://www.nltk.org/api/nltk.tokenize.html

SciSpaCy: https://github.com/allenai/scispacy

Note: You may need to install NLTK and SpaCy so please refer to their websites for installation instructions.

### Track 2.3 (10%): Use Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE): https://huggingface.co/transformers/tokenizer_summary.html

Note: You may need to install Huggingface's transformers so please refer to its website for installation instructions.

### Track 2.4 (Bonus +5%): Build new Byte-Pair Encoding (BPE)

This track may be dependent on track 2.3.

The pre-built tokenization methods above may not suit the biomedical domain, as its words/tokens (e.g. diseases, symptoms, chemicals, medications, phenotypes, genotypes) can be very different from those used in daily life. Can you build and train a new BPE model specifically for the biomedical domain?
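
For intuition about what training a BPE model involves, the following is a toy sketch of the merge-learning loop in pure Python. It is not how production libraries implement it, and the function name `learn_bpe` and the `</w>` end-of-word marker are choices made for this illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the chosen merge to every word.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(["virus"] * 3 + ["viral"] * 2, num_merges=2)
print(merges)
```

On a biomedical corpus, frequent domain fragments would be merged early, which is why a domain-specific vocabulary can differ from a general-purpose one.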

### Open Question (Optional):

- What are the pros and cons of the above tokenizers?


### Preparation


#### Import the library


In [None]:
import re
import random
from collections import Counter
import numpy as np
import nltk
from nltk import word_tokenize, sent_tokenize
from transformers import AutoTokenizer
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

#### Read txt file


In [None]:
filename = "./titles_abstracts.txt"
with open(filename, "r", encoding="utf-8") as file:
    content = file.read()

### Track 2.1: Use split()


In [None]:
def Tokenization_1(content):
    words = []
    lines = content.split("\n")
    for line in lines:
        words.extend(re.split(r"\W+", line))  # split on runs of non-word characters
    # Drop empty strings so that only real words remain
    words = [word for word in words if word]

    # Count word frequencies with Counter and draw a frequency-weighted sample
    word_counts = Counter(words)
    weighted_sample = random.choices(
        list(word_counts.keys()), weights=word_counts.values(), k=20
    )
    print("Weighted sample:", " ".join(weighted_sample))

    result = " ".join(words)
    return result

### Track 2.2: Use NLTK


In [None]:
def Tokenization_2(content):
    # Split into sentences
    sents = sent_tokenize(content)

    interpunctuations = [
        ",",
        ".",
        ":",
        ";",
        "?",
        "(",
        ")",
        "[",
        "]",
        "&",
        "!",
        "*",
        "@",
        "#",
        "$",
        "%",
        "/",
        "\\",
        "—",
    ]  # punctuation marks to remove

    words = []
    for sent in sents:
        sent_t = word_tokenize(sent)  # tokenize into words
        sent_t = [
            word.lower() for word in sent_t if word not in interpunctuations
        ]  # drop punctuation and lowercase
        words.append(sent_t)

    # Save the tokenized result
    result = []
    for sent in words:
        result.append(" ".join(sent))
    with open("./nlkt_tokenized.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(result))

    return result

### Track 2.3: Use Byte-Pair Encoding (BPE)


In [None]:
def Tokenization_3(content):
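    # Note: bert-base-uncased uses a WordPiece vocabulary, a subword scheme
    # closely related to BPE.
    bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Split the text into 512-character chunks (to avoid overly long inputs);
    # chunk boundaries can fall inside a word.
    max_length = 512
    result = []

    for i in range(0, len(content), max_length):
        chunk = content[i : i + max_length]
        tokens = bert_tokenizer.tokenize(chunk)
        result.append(" ".join(tokens))  # join tokens with spaces

    # Save the tokenized result
    with open("bert_tokenized.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(result))
    print("BERT tokenization result saved to `bert_tokenized.txt`")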

    return result

### Track 2.4: Build new Byte-Pair Encoding (BPE)


In [None]:
def Tokenization_4(content):
    # Train the BPE model
    print("Training BPE tokenizer...")
    bpe_tokenizer = Tokenizer(models.BPE())
    bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.BpeTrainer(special_tokens=["<unk>", "<pad>", "<s>", "</s>"])
    bpe_tokenizer.train(files=["titles_abstracts.txt"], trainer=trainer)

    # Save the trained BPE model
    bpe_tokenizer.save("biomedical_bpe.json")

    # Reload the BPE tokenizer and tokenize the text
    bpe_tokenizer = Tokenizer.from_file("biomedical_bpe.json")
    bpe_tokens = bpe_tokenizer.encode(content)

    # Save the BPE tokenization result
    with open("bpe_tokenized.txt", "w", encoding="utf-8") as f:
        f.write(" ".join(bpe_tokens.tokens))  # join tokens with spaces

    print("BPE tokenization result saved to `bpe_tokenized.txt`")

    return bpe_tokens

## Part 3 (30%): Build Word Representations

Build word representations for each extracted word. If the hardware resource is limited, you may limit the vocabulary size up to 10k words/tokens (or even smaller) and the dimension of representations up to 256.

The following tracks can be developed independently. You are encouraged to divide the workload to each team member.

### Track 3.1 (15%): Use N-gram Language Modeling

N-gram language modeling predicts a target word from the `n - 1` words that precede it. Specifically,

$P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-n+1})$

For example, given a sentence, `"the main symptoms of COVID-19 are fever and cough"`, if $n=7$, we use previous context `["the", "main", "symptoms", "of", "COVID-19", "are"]` to predict the next word `"fever"`.
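
The example above can be reproduced with a few lines of code. This is a minimal sketch; the helper name `ngram_pairs` is illustrative and not part of any library.

```python
def ngram_pairs(tokens, n):
    """Yield (context, target) pairs; the context holds the n - 1 previous tokens."""
    for i in range(n - 1, len(tokens)):
        yield tokens[i - n + 1 : i], tokens[i]


sentence = "the main symptoms of COVID-19 are fever and cough".split()
pairs = list(ngram_pairs(sentence, n=7))

for context, target in pairs:
    print(context, "->", target)
```

Counting how often each target follows each context gives the statistics from which the conditional probabilities are estimated.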

More to read: https://web.stanford.edu/~jurafsky/slp3/3.pdf

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.2 (15%): Use Skip-gram with Negative Sampling

In skip-gram, we use a central word to predict its context. Specifically,

$P(w_{c-m}, ... w_{c-1}, w_{c+1}, ..., w_{c+m} | w_c)$

As the learning objective of skip-gram is computationally inefficient (it sums over the entire vocabulary $|V|$), negative sampling is commonly applied to accelerate training.

In negative sampling, we randomly select one word from the context as a positive sample, and randomly select $K$ words from the vocabulary as negative samples. As a result, the learning objective is updated to

$L = -\log\sigma(u^T_{t} v_c) - \sum_{k=1}^K\log\sigma(-u^T_k v_c)$, where $u_t$ is the vector embedding of positive sample from context, $u_k$ are the vector embeddings of negative samples, $v_c$ is the vector embedding of the central word, $\sigma$ refers to the sigmoid function.
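
The loss above translates almost directly into NumPy. The sketch below evaluates it for random vectors; the dimension, the number of negatives and the random initialisation are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(v_c, u_t, u_neg):
    """L = -log sigma(u_t . v_c) - sum_k log sigma(-u_k . v_c)."""
    positive_term = -np.log(sigmoid(u_t @ v_c))
    negative_term = -np.sum(np.log(sigmoid(-(u_neg @ v_c))))
    return positive_term + negative_term


rng = np.random.default_rng(0)
dim, K = 8, 5
v_c = rng.normal(size=dim)         # central word vector
u_t = rng.normal(size=dim)         # positive context vector
u_neg = rng.normal(size=(K, dim))  # K negative sample vectors
loss = negative_sampling_loss(v_c, u_t, u_neg)
print(loss)
```

Each evaluation touches only $K + 1$ vectors instead of all $|V|$, which is where the speed-up comes from.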

More to read: http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf, sections 4.3 and 4.4

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.3 (Bonus +5%): Use Contextualised Word Representation by Masked Language Model (MLM)

BERT introduces a new language model for pre-training named Masked Language Model (MLM). The advantage of MLM is that the word representations by MLM will be contextualised.

For example, "stick" may have different meanings in different contexts. With N-gram language modeling and word2vec (skip-gram, CBOW), the word representation of "stick" is fixed regardless of its context. MLM, however, learns the representation of "stick" dynamically based on context. In other words, "stick" will have different representations in different contexts under MLM.

More to read: http://jalammar.github.io/illustrated-bert/ and https://arxiv.org/pdf/1810.04805.pdf

Recommended outputs:

- An algorithm that is able to generate contextualised representation in real time.


### Preparation


#### Import the library


In [None]:
import collections
from collections import defaultdict, Counter
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments
from datasets import Dataset
import torch
from transformers import DataCollatorForLanguageModeling
from peft import get_peft_model, LoraConfig, TaskType

### Track 3.1: Use N-gram Language Modeling


In [None]:
class NGramLanguageModel:
    def __init__(self, n, vocab_size=10000, embedding_dim=100):
        self.n = n
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim

        # Vocabulary
        self.vocab = {"<UNK>", "<START>", "<END>"}
        self.word_to_id = {}
        self.id_to_word = {}

        # N-gram counts
        self.ngram_counts = defaultdict(lambda: defaultdict(float))
        self.context_counts = defaultdict(float)

        # Embeddings
        self.word_embeddings = None
        self.context_embeddings = None

        # Smoothing parameter
        self.alpha = 0.1

    def build_vocab(self, text):
        """Build the vocabulary from the most frequent words."""
        words = text.split()
        word_counts = Counter(words)
        top_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
        top_words = top_words[: self.vocab_size - 3]  # leave room for the special tokens

        self.vocab.update(word for word, _ in top_words)
        self.word_to_id = {word: idx for idx, word in enumerate(self.vocab)}
        self.id_to_word = {idx: word for word, idx in self.word_to_id.items()}

    def build_cooccurrence_matrix(self, text):
        """Build the co-occurrence matrix."""
        words = ["<START>"] + text.split() + ["<END>"]
        vocab_size = len(self.vocab)

        # Buffers for the sparse matrix entries
        rows, cols, data = [], [], []

        # Count co-occurrences
        for i in range(len(words) - self.n + 1):
            context_words = words[i : i + self.n - 1]
            target_word = words[i + self.n - 1]

            # Look up the word IDs
            context_ids = [
                self.word_to_id.get(w, self.word_to_id["<UNK>"]) for w in context_words
            ]
            target_id = self.word_to_id.get(target_word, self.word_to_id["<UNK>"])

            # Update the N-gram counts
            context = tuple(context_ids)
            self.ngram_counts[context][target_id] += 1
            self.context_counts[context] += 1

            # Update the co-occurrence counts
            for context_id in context_ids:
                rows.append(context_id)
                cols.append(target_id)
                data.append(1.0)

        # Build the sparse matrix (duplicate entries are summed)
        M = csr_matrix((data, (rows, cols)), shape=(vocab_size, vocab_size))

        # Normalise each row to get the conditional probability P(wj|wi)
        row_sums = M.sum(axis=1).A.flatten()
        row_sums[row_sums == 0] = 1  # avoid division by zero
        M = csr_matrix(M / row_sums[:, np.newaxis])

        return M

    def factorize_matrix(self, M):
        """Factorise the co-occurrence matrix with SVD."""
        # Perform Singular Value Decomposition
        U, S, Vt = svds(M, k=self.embedding_dim)

        # Scale singular values
        S_sqrt = np.sqrt(S)
        self.word_embeddings = U * S_sqrt  # Scale each column of U
        self.context_embeddings = Vt.T * S_sqrt  # Scale each row of Vt.T

    def train(self, text):
        """Train the model."""
        # Build the vocabulary
        self.build_vocab(text)

        # Build and factorise the co-occurrence matrix
        M = self.build_cooccurrence_matrix(text)
        self.factorize_matrix(M)

    def get_word_vector(self, word):
        """Return the vector of a word (falls back to <UNK>)."""
        word_id = self.word_to_id.get(word, self.word_to_id["<UNK>"])
        return self.word_embeddings[word_id]

    def get_context_vector(self, context):
        """Return the mean context vector of the last n - 1 words."""
        context_words = context.split()[-self.n + 1 :]
        context_ids = [
            self.word_to_id.get(w, self.word_to_id["<UNK>"]) for w in context_words
        ]
        return np.mean([self.context_embeddings[i] for i in context_ids], axis=0)

    def predict_next(self, context):
        """Predict the next word."""
        # Vector representation of the context
        context_vec = self.get_context_vector(context)

        # Score every word against the context
        similarities = np.dot(self.word_embeddings, context_vec)

        # Pick the highest-scoring word
        best_id = np.argmax(similarities)
        best_word = self.id_to_word[best_id]

        # Turn the scores into a probability with a softmax
        exp_similarities = np.exp(similarities - np.max(similarities))
        probability = exp_similarities[best_id] / exp_similarities.sum()

        return best_word, float(probability)

    def generate(self, seed, length=10):
        """Generate text from a seed string."""
        current = ["<START>"] + seed.split()
        result = current.copy()

        for _ in range(length):
            context = " ".join(current[-(self.n - 1) :])
            next_word, _ = self.predict_next(context)

            if next_word == "<END>":
                break

            result.append(next_word)
            current = result[-self.n + 1 :]

        return " ".join(result[1:])  # drop the <START> token

    def get_similar_words(self, word, k=5):
        """Find the k most similar words."""
        word_vec = self.get_word_vector(word)
        similarities = np.dot(self.word_embeddings, word_vec)
        # Skip the top hit, which is assumed to be the word itself
        top_k = np.argsort(similarities)[-k - 1 : -1][::-1]
        return [(self.id_to_word[i], similarities[i]) for i in top_k]

In [None]:
def Representations_1(result):
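    # Use the tokenized text (a single space-separated string)
    tokenized_text = result

    model = NGramLanguageModel(3)
    model.train(tokenized_text)

    # Print the word vectors; get_word_vector() takes one word at a time
    for word in model.word_to_id:
        vector = model.get_word_vector(word)
        print(f"Word: {word}, Vector: {vector[:5]}...")  # first 5 dimensions only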

    return model

### Track 3.2: Use Skip-gram with Negative Sampling


In [None]:
class SkipGramWithNegativeSampling:
    def __init__(self, vocab_size, vector_dim, window_size, negative_samples):
        self.vocab_size = vocab_size
        self.vector_dim = vector_dim
        self.window_size = window_size
        self.negative_samples = negative_samples

        # Randomly initialise the word and context vectors
        self.word_vectors = np.random.uniform(-0.5, 0.5, (vocab_size, vector_dim))
        self.context_vectors = np.random.uniform(-0.5, 0.5, (vocab_size, vector_dim))

    def _sigmoid(self, x):
        # Sigmoid activation function
        return 1 / (1 + np.exp(-x))

    def _loss(self, central_vec, context_vec, negative_vecs):
        # Negative sampling loss
        positive_score = np.dot(context_vec, central_vec)
        positive_loss = -np.log(self._sigmoid(positive_score))

        negative_scores = np.dot(negative_vecs, central_vec)
        negative_loss = -np.sum(np.log(self._sigmoid(-negative_scores)))

        return positive_loss + negative_loss

    def _gradient_update(self, central_vec, context_vec, negative_vecs, learning_rate):
        # Gradient from the positive sample
        positive_score = self._sigmoid(np.dot(context_vec, central_vec))
        positive_grad = (positive_score - 1) * context_vec

        # Gradient from the negative samples
        negative_scores = self._sigmoid(np.dot(negative_vecs, central_vec))
        negative_grad = np.sum(negative_scores[:, np.newaxis] * negative_vecs, axis=0)

        # Keep the pre-update central vector so all gradients use the same values
        old_central = central_vec.copy()

        # Update the vectors in place
        central_vec -= learning_rate * (positive_grad + negative_grad)
        context_vec -= learning_rate * (positive_score - 1) * old_central
        negative_vecs -= learning_rate * negative_scores[:, np.newaxis] * old_central

    def train(self, dataset, epochs, learning_rate):
        # Train the model with stochastic gradient descent
        for epoch in range(epochs):
            total_loss = 0

            for sentence in dataset:
                for i, central_word in enumerate(sentence):
                    # Window boundaries
                    start = max(0, i - self.window_size)
                    end = min(len(sentence), i + self.window_size + 1)

                    # Context words inside the window
                    context_words = [sentence[j] for j in range(start, end) if j != i]

                    for context_word in context_words:
                        # Positive sample
                        central_vec = self.word_vectors[central_word]
                        context_vec = self.context_vectors[context_word]

                        # Negative samples: words that do not occur in the sentence
                        negative_indices = random.sample(
                            [
                                idx
                                for idx in range(self.vocab_size)
                                if idx not in sentence
                            ],
                            self.negative_samples,
                        )
                        negative_vecs = self.context_vectors[negative_indices]

                        # Compute the loss
                        loss = self._loss(central_vec, context_vec, negative_vecs)
                        total_loss += loss

                        # Apply the gradient step
                        self._gradient_update(
                            central_vec, context_vec, negative_vecs, learning_rate
                        )
                        # Fancy indexing returned a copy, so write the negatives back
                        self.context_vectors[negative_indices] = negative_vecs

            print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss}")

    def get_word_vectors(self):
        # Return the word vectors
        return self.word_vectors

In [None]:
def Representations_2(result):
    # Split each sentence into tokens
    tokens = [sent.split(" ") for sent in result]

    # Configure and train the model
    model = Word2Vec(
        sentences=tokens[:20_000],  # preprocessed input sentences
        vector_size=256,  # embedding dimension
        window=5,  # window size (up to 5 words on each side)
        sg=1,  # skip-gram mode (sg=1: skip-gram; sg=0: CBOW)
        negative=10,  # number of negative samples per positive sample
        min_count=2,  # ignore words that occur fewer than 2 times
        workers=4,  # number of worker threads
        epochs=10,  # number of training epochs
    )

    # Print the word vectors
    vocab = model.wv.index_to_key
    for word in vocab:
        vector = model.wv[word]
        print(f"Word: {word}, Vector: {vector[:5]}...")  # first 5 dimensions only

    return model

### Track 3.3: Use Contextualised Word Representation by Masked Language Model (MLM)


In [None]:
def Representations_3():
    # Load BERT (masked language model variant)
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # Configure LoRA
    lora_config = LoraConfig(
        r=4,  # rank of the low-rank adapters
        lora_alpha=32,  # LoRA scaling factor (alpha)
        lora_dropout=0.1,  # dropout probability
        target_modules=["query", "value"],  # adapt the attention Q/V projections
    )
    model = get_peft_model(model, lora_config)

    total_params = sum(p.numel() for p in model.parameters())  # all parameters
    trainable_params = sum(
        p.numel() for p in model.parameters() if p.requires_grad
    )  # trainable parameters only

    print(f"🔹 Total parameters: {total_params:,}")
    print(f"🔸 Trainable parameters: {trainable_params:,}")
    print(f"🔹 Frozen parameters: {total_params - trainable_params:,}")

    # Read the text data
    with open("titles_abstracts.txt", "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f.readlines() if line.strip()]

    # Build a Hugging Face dataset
    dataset = Dataset.from_dict({"text": lines})

    # Split into training and evaluation sets
    dataset = dataset.train_test_split(test_size=0.1)
    train_dataset = dataset["train"]
    eval_dataset = dataset["test"]

    # Preprocessing: tokenization (masking is applied later by the collator)
    def tokenize_function(examples):
        return tokenizer(
            examples["text"], truncation=True, padding="max_length", max_length=256
        )

    train_dataset = train_dataset.map(tokenize_function, batched=True)
    eval_dataset = eval_dataset.map(tokenize_function, batched=True)

    # Data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,  # train with the MLM objective
        mlm_probability=0.15,  # 15% of the tokens are selected for masking
    )

    # Training arguments
    # Note: newer transformers releases name `evaluation_strategy` as `eval_strategy`.
    training_args = TrainingArguments(
        output_dir="./bert_lora_finetuned",
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        save_strategy="epoch",
        evaluation_strategy="epoch",
        logging_dir="./logs",
        logging_steps=500,
    )

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )

    # Start training
    trainer.train()

    # Switch to evaluation mode
    model.eval()
    device = next(model.parameters()).device  # the Trainer may have moved the model

    # Read the text to embed
    with open("titles_abstracts.txt", "r", encoding="utf-8") as f:
        lines = f.readlines()[:500]

    # Process line by line
    all_embeddings = []
    for line in lines:
        inputs = tokenizer(
            line.strip(),
            return_tensors="pt",
            truncation=True,
            padding="max_length",
            max_length=256,
        ).to(device)

        with torch.no_grad():
            # Request hidden states: the logits are vocabulary scores, not embeddings
            outputs = model(**inputs, output_hidden_states=True)

        embeddings = outputs.hidden_states[-1]  # last-layer contextual token vectors
        all_embeddings.append(embeddings.squeeze(0).cpu())

    # Print summary information
    print(f"Number of embedded lines: {len(all_embeddings)}")
    print(f"Shape of a single embedding: {all_embeddings[0].shape}")

    # Save the token embeddings
    torch.save(all_embeddings, "bert_lora_finetuned_embeddings.pt")
    print("Token embeddings from the LoRA fine-tuned model have been saved.")

    return all_embeddings

## Part 4 (20%): Explore the Word Representations

The following tracks can be finished independently. You are encouraged to divide the workload among team members.

### Track 4.1 (5%): Visualise the word representations by t-SNE

t-SNE is a dimensionality reduction algorithm commonly used to visualise high-dimensional vectors. Use t-SNE to visualise the word representations. You may visualise up to 1000 words, as t-SNE is computationally expensive.

More about t-SNE: https://lvdmaaten.github.io/tsne/

Recommended output:

- A diagram by t-SNE based on representations of up to 1000 words.
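
As a minimal sketch of the projection step, the snippet below runs scikit-learn's `TSNE` on random vectors that stand in for real word representations. The number of points, the dimension and the perplexity are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for real word vectors: 200 "words" with 50 dimensions each.
vectors = rng.normal(size=(200, 50))

# perplexity must be smaller than the number of points.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
coords = tsne.fit_transform(vectors)

print(coords.shape)  # one 2D coordinate per input vector
```

The resulting coordinates can be passed straight to a scatter plot, with one text label per word.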

### Track 4.2 (5%): Visualise the Word Representations of Biomedical Entities by t-SNE

Instead of visualising the word representations of the entire vocabulary (or 1000 words selected at random), visualise the representations of words that are biomedical entities, for example fever, cough and diabetes. Based on the category of those entities, can you assign different colours to them and see whether entities from the same category are clustered by t-SNE? For example, sinusitis and cough are both respiratory diseases, so they should be assigned the same colour and ideally their representations should be close to each other under t-SNE. As another example, Alzheimer's and headache are neurological diseases, which should be assigned another colour.

Examples of biomedical ontologies: https://www.ebi.ac.uk/ols/ontologies/hp and https://en.wikipedia.org/wiki/International_Classification_of_Diseases

Recommended output:

- A diagram with colours by t-SNE based on representations of biomedical entities.

### Track 4.3 (5%): Co-occurrence

- What are the biomedical entities which frequently co-occur with COVID-19 (or coronavirus)?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
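
One simple way to approach this is to count, for every candidate entity, the number of documents in which it appears together with the target term. The sketch below does this on three made-up sentences; the function name `cooccurring_entities`, the whitespace tokenization and the document-level window are all assumptions of this example.

```python
from collections import Counter


def cooccurring_entities(documents, target, entities):
    """Count in how many documents each entity appears together with `target`."""
    counts = Counter()
    for doc in documents:
        tokens = set(doc.lower().split())
        if target in tokens:
            counts.update(e for e in entities if e in tokens and e != target)
    return counts.most_common()  # sorted by descending count


docs = [
    "coronavirus infection causes fever and cough",
    "fever is common in influenza",
    "coronavirus patients reported cough",
]
ranked = cooccurring_entities(docs, "coronavirus", {"fever", "cough", "influenza"})
print(ranked)
```

A sentence-level or fixed-size window would give a stricter notion of co-occurrence than whole documents.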

### Track 4.4 (5%): Semantic Similarity

- Which biomedical entities are semantically closest to COVID-19 (or coronavirus) based on word representations?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
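
Semantic similarity between word representations is often measured with cosine similarity. The sketch below ranks a handful of words against a target using hand-made 2D vectors; the words, the vectors and the helper name `most_similar` are invented for illustration.

```python
import numpy as np


def most_similar(target, words, vectors, top_k=5):
    """Rank `words` by cosine similarity of their vectors to the target's vector."""
    idx = words.index(target)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    order = [i for i in np.argsort(-sims) if i != idx][:top_k]
    return [(words[i], float(sims[i])) for i in order]


words = ["coronavirus", "sars-cov-2", "fever", "hospital"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
ranked = most_similar("coronavirus", words, vectors, top_k=3)
print(ranked)
```

Restricting `words` to a list of biomedical entities yields the sorted entity list asked for above.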

### Open Question (Optional): What else can you discover?


### Preparation


#### Import the library


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import random
from random import sample
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

#### Define BME dictionary


In [None]:
gpt_BME = {
    "disease": [
        "asthma",
        "bronchitis",
        "cancer",
        "cirrhosis",
        "disease",
        "fibrosis",
        "hypertension",
        "syndrome",
        "tuberculosis",
    ],
    "symptom": ["chills", "cough", "dyspnea", "fatigue", "fever", "headache", "nausea"],
    "virus": [
        "coronavirus",
        "hepatitis",
        "herpesvirus",
        "hiv",
        "influenza",
        "norovirus",
        "rabies",
    ],
    "bacteria": ["listeria", "mycobacterium", "salmonella", "streptococcus"],
    "protein": [
        "antibody",
        "cytokine",
        "enzyme",
        "hemoglobin",
        "interleukin",
        "protein",
        "receptor",
    ],
    "drug": [
        "antibiotic",
        "antiviral",
        "chloroquine",
        "oseltamivir",
        "remdesivir",
        "vaccine",
    ],
}


kimi_BME = {
    "viruses": [
        "virus",
        "coronavirus",
        "covid-19",
        "influenza",
        "2019-ncov",
        "sars-cov-2",
        "h5n1",
        "mers-cov",
        "norovirus",
        "hiv-1",
        "h1n1",
    ],
    "diseases": [
        "infection",
        "disease",
        "pneumonia",
        "syndrome",
        "mortality",
        "cancer",
        "pathogen",
        "illness",
    ],
    "proteins": [
        "protein",
        "proteins",
        "antibodies",
        "spike",
        "glycoprotein",
        "capsid",
        "membrane",
        "antigen",
        "antigens",
        "epitopes",
    ],
    "genes": [
        "gene",
        "genes",
        "genome",
        "rna",
        "dna",
        "receptor",
        "genetic",
        "mutation",
        "mutations",
    ],
    "cells": [
        "cells",
        "cell",
        "host",
        "cellular",
        "blood",
        "serum",
        "tissue",
        "antibody",
        "antibodies",
    ],
    "treatments": [
        "treatment",
        "therapy",
        "vaccine",
        "vaccines",
        "antiviral",
        "care",
        "diagnosis",
        "diagnostic",
        "drug",
        "drugs",
    ],
    "epidemiology": [
        "outbreak",
        "epidemic",
        "pandemic",
        "transmission",
        "quarantine",
        "prevention",
        "public",
        "surveillance",
        "population",
        "spread",
    ],
    "research": [
        "study",
        "analysis",
        "data",
        "research",
        "model",
        "models",
        "sequencing",
        "clinical",
        "review",
        "findings",
    ],
    "symptoms": [
        "symptoms",
        "fever",
        "cough",
        "dyspnea",
        "fatigue",
        "headache",
        "sore throat",
        "shortness of breath",
    ],
    "methods": [
        "methods",
        "assay",
        "pcr",
        "rt-pcr",
        "sequencing",
        "imaging",
        "tools",
        "test",
        "tests",
        "diagnostic",
    ],
    "public_health": [
        "health",
        "public",
        "hospital",
        "hospitals",
        "healthcare",
        "prevention",
        "quarantine",
        "epidemiology",
        "surveillance",
    ],
    "miscellaneous": [
        "cases",
        "patients",
        "human",
        "novel",
        "background",
        "severity",
        "risk",
        "control",
        "management",
        "response",
    ],
}

### Track 4.1: Visualise the word representations by t-SNE


In [None]:
def Exploration_1(model, BME=kimi_BME, max_words=1000, seed=1022):
    random.seed(seed)
    # Collect the vocabulary; always keep the biomedical entities it contains
    BMEs = {word for words in BME.values() for word in words}
    vocab = model.wv.index_to_key
    vocab_1 = [word for word in vocab if word in BMEs][:max_words]
    # Fill the remaining slots with randomly sampled non-entity words
    others = [word for word in vocab if word not in BMEs]
    vocab_2 = sample(others, min(len(others), max_words - len(vocab_1)))

    vocab = vocab_1 + vocab_2  # words to visualise (at most max_words)
    word_vectors = model.wv[vocab]  # their vectors

    # Reduce to 2D with t-SNE (perplexity must be below the number of points)
    # Note: newer scikit-learn releases use `max_iter` instead of `n_iter`.
    tsne = TSNE(
        n_components=2, random_state=seed, perplexity=75, init="pca", n_iter=1000
    )
    reduced_embeddings = tsne.fit_transform(word_vectors)

    # Plot the t-SNE projection
    plt.figure(figsize=(14, 10))
    for i, word in enumerate(vocab):
        x, y = reduced_embeddings[i]
        plt.scatter(x, y, color="blue", s=10)
        plt.text(x + 0.2, y + 0.2, word, fontsize=8, color="gray")
    plt.title("t-SNE Visualization of Word Embeddings")
    plt.show()

    return None

### Track 4.2: Visualise the Word Representations of Biomedical Entities by t-SNE


In [None]:
# Visualise the word vectors of a Word2Vec model, coloured by entity category
def Exploration_2(model, BME=kimi_BME, max_words=1000, seed=1022):
    random.seed(seed)
    # Collect the vocabulary; always keep the biomedical entities it contains
    BMEs = {word for words in BME.values() for word in words}
    BME_classes = list(BME.keys())
    vocab = model.wv.index_to_key
    vocab_1 = [word for word in vocab if word in BMEs][:max_words]
    # Fill the remaining slots with randomly sampled non-entity words
    others = [word for word in vocab if word not in BMEs]
    vocab_2 = sample(others, min(len(others), max_words - len(vocab_1)))

    vocab = vocab_1 + vocab_2  # words to visualise (at most max_words)
    word_vectors = model.wv[vocab]  # their vectors

    # Reduce to 2D with t-SNE (perplexity must be below the number of points)
    # Note: newer scikit-learn releases use `max_iter` instead of `n_iter`.
    tsne = TSNE(
        n_components=2, random_state=seed, perplexity=75, init="pca", n_iter=1000
    )
    reduced_embeddings = tsne.fit_transform(word_vectors)

    # One colour per entity category
    colors = [
        "pink",
        "blue",
        "green",
        "yellow",
        "purple",
        "magenta",
        "red",
        "orange",
        "cyan",
        "lawngreen",
        "olive",
        "lightblue",
    ]
    colors = dict(zip(BME_classes, colors[: len(BME_classes)]))

    # Plot the t-SNE projection
    plt.figure(figsize=(14, 10))
    legend_points = {}  # one representative point per category for the legend
    for i, word in enumerate(vocab):
        x, y = reduced_embeddings[i]
        color, size = "gray", 5  # default style for non-entity words
        for c in BME_classes:
            if word in BME[c]:
                color, size = colors[c], 30
                legend_points[c] = (x, y, color)
                break
        plt.scatter(x, y, color=color, s=size)
        plt.text(x + 0.2, y + 0.2, word, fontsize=0.3 * size, color=color)
    for key, value in legend_points.items():
        plt.scatter(value[0], value[1], color=value[2], s=30, label=key)
    plt.title("t-SNE Visualization of Word Embeddings")
    plt.legend()
    plt.show()

    return None

### Track 4.3: Co-occurrence


In [None]:
def Exploration_3(
    model, target_word="coronavirus", k=10, output_file="cooccurrence_words.xlsx"
):
    # `model` is expected to expose `vocab` and `get_similar_words(word, k)`
    if target_word in model.vocab:
        # Get the k most similar words
        cooccurrence_words = model.get_similar_words(target_word, k)
        print(f"Top {k} similar words:")
        for word, similarity in cooccurrence_words:
            print("{0}: {1:.5f}".format(word, similarity))

        # Save to Excel
        df = pd.DataFrame(cooccurrence_words, columns=["Word", "Similarity"])
        df.to_excel(output_file, index=False)
        print(f"Co-occurrence words saved to {output_file}")
        return True
    else:
        print(f"{target_word} not in vocabulary.")
        return False

### Track 4.4: Semantic Similarity


In [None]:
def Exploration_4(
    model, target_word="coronavirus", output_file="semantic_similar_words.xlsx"
):
    if target_word.lower() in model.wv:
        target_vector = model.wv[target_word.lower()]
        all_words = model.wv.index_to_key
        similarities = []

        for word in all_words:
            if word != target_word.lower():
                similarity = cosine_similarity([target_vector], [model.wv[word]])[0][0]
                similarities.append((word, similarity))

        sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
        top_similar_words = sorted_similarities[:10]

        # Save to Excel
        df = pd.DataFrame(top_similar_words, columns=["Word", "Similarity"])
        df.to_excel(output_file, index=False)
        print(f"Top similar words saved to {output_file}")

        print("Top 10 similar words:")
        for word, similarity in top_similar_words:
            print("{0}:{1:.5f}".format(word, similarity))
        return True

    else:
        print(f"{target_word.lower()} not in vocabulary.")

        return False

## Part 5 (Bonus +10%): Open Challenge: Mining Biomedical Knowledge

A fundamental task in clinical/biomedical natural language processing is to extract intelligence from a biomedical text corpus automatically and efficiently. More specifically, the intelligence may include biomedical entities mentioned in the text, relations between biomedical entities, clinical features of patients, and the progression of diseases, all of which can be used to predict, understand and improve patients' outcomes.

This open challenge is to build a biomedical knowledge graph based on the CORD-19 dataset and mine useful information from it. We recommend the following steps but you are also encouraged to develop your solution from scratch.

### Extract Biomedical Entities from Text

Extract biomedical entities (such as fever, cough, headache, lung cancer, heart attack) from text. Note that:

- The biomedical entities may consist of multiple words. For example, heart attack, multiple myeloma etc.
- The biomedical entities may be expressed through synonyms. For example, low blood pressure for hypotension.
- The biomedical entities may be written in different forms. For example, smoking, smokes, smoked.
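
The three points above can be illustrated with a small dictionary-based matcher. This is only a minimal sketch under stated assumptions: the `LEXICON` entries, the crude `stem` rule and the function names are hypothetical and made up for illustration; a real solution would use a much larger vocabulary and a proper lemmatiser or a trained entity recogniser.

```python
import re

# Hypothetical mini-lexicon: surface form -> canonical entity name.
# Synonyms map to the same canonical name.
LEXICON = {
    "heart attack": "heart attack",
    "multiple myeloma": "multiple myeloma",
    "low blood pressure": "hypotension",
    "hypotension": "hypotension",
    "smoke": "smoking",
    "fever": "fever",
    "cough": "cough",
}


def stem(token):
    """Crude suffix stripping (an assumption, not a real stemmer):
    smoke, smokes, smoked and smoking all become 'smok'."""
    for suffix in ("ing", "es", "ed", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenise(text):
    """Lowercase, split on non-alphanumeric characters, and stem each token."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


# Index the lexicon by tuples of stemmed tokens so multi-word entities can match.
INDEX = {tuple(tokenise(surface)): name for surface, name in LEXICON.items()}
MAX_LEN = max(len(key) for key in INDEX)


def extract_entities(text):
    """Scan left to right, always preferring the longest lexicon match."""
    tokens = tokenise(text)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i : i + n])
            if key in INDEX:
                found.append(INDEX[key])
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return found


print(extract_entities("He smoked heavily and had low blood pressure before the heart attack."))
# ['smoking', 'hypotension', 'heart attack']
```

Stemming both the lexicon and the text with the same rule is what lets different word forms match; the longest-match-first scan is what keeps "low blood pressure" from being split into unrelated single words.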

### Extract Relations between Biomedical Entities

Extract relations between biomedical entities based on their appearance in text. You may define a relation between biomedical entities by one or more of the following criteria:

- The biomedical entities frequently co-occur.
- The biomedical entities have similar word representations.
- The biomedical entities have clear relations based on textual narratives. For example, "The most common symptoms for COVID-19 are fever and cough" so we know there are relations between "COVID-19", "fever" and "cough".
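
The first criterion (frequent co-occurrence) could be sketched as follows. The function name `cooccurrence_relations`, the `min_count` threshold and the toy input are assumptions made for illustration; the input is taken to be one list of already-extracted entities per sentence (or per document).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_relations(entity_lists, min_count=2):
    """Count how often each unordered pair of entities appears together
    and keep the pairs seen at least `min_count` times."""
    pair_counts = Counter()
    for entities in entity_lists:
        # Deduplicate and sort so that (a, b) and (b, a) are counted as one pair
        unique = sorted(set(entities))
        pair_counts.update(combinations(unique, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Toy example: entities found in four sentences
sentences = [
    ["COVID-19", "fever", "cough"],
    ["COVID-19", "fever"],
    ["fever", "headache"],
    ["COVID-19", "cough"],
]
relations = cooccurrence_relations(sentences, min_count=2)
print(relations)
# {('COVID-19', 'cough'): 2, ('COVID-19', 'fever'): 2}
```

Raw counts favour very common entities, so on the full corpus it may be worth normalising them (for example by the individual entity frequencies) before applying a threshold.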

### Build a Biomedical Knowledge Graph of COVID-19

Build a knowledge graph based on the results from tracks 5.1 and 5.2 and visualise it.
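
One possible starting point, shown here only as a sketch: store the graph as a weighted adjacency dictionary and export it as Graphviz DOT text, which many graph tools can render. The helper names `build_graph` and `to_dot` and the toy `relations` dictionary are hypothetical; the relation format (entity pair mapped to a weight) is an assumption.

```python
from collections import defaultdict


def build_graph(relations):
    """Turn {(entity_a, entity_b): weight} into an undirected adjacency dict."""
    graph = defaultdict(dict)
    for (a, b), weight in relations.items():
        graph[a][b] = weight
        graph[b][a] = weight
    return dict(graph)


def to_dot(graph):
    """Serialise the graph as Graphviz DOT text, writing each edge once."""
    lines = ["graph knowledge_graph {"]
    seen = set()
    for a in sorted(graph):
        for b in sorted(graph[a]):
            edge = tuple(sorted((a, b)))
            if edge not in seen:
                seen.add(edge)
                lines.append(f'  "{edge[0]}" -- "{edge[1]}" [label="{graph[a][b]}"];')
    lines.append("}")
    return "\n".join(lines)


# Toy relations: entity pairs with co-occurrence counts as weights
relations = {("COVID-19", "fever"): 2, ("COVID-19", "cough"): 2}
graph = build_graph(relations)

print(sorted(graph["COVID-19"]))  # neighbours of COVID-19
# ['cough', 'fever']
print(to_dot(graph))
```

The adjacency dictionary makes neighbourhood queries (for example, all entities linked to "COVID-19") a simple lookup, while the DOT export keeps the visualisation step independent of any particular plotting library.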


In [None]:
###################
# TODO: add your solution

###################