# Data Processing

The aim of this notebook is to outline the steps taken to produce the data - a series of relevant **subject, relationship, object** triplets - that will be visualised in a network in future steps. The process has been broken down as follows:

1. Downloading the Data with E-Utilities and Biopython.
2. Relationship Extraction using REBEL.
3. Semantic Matching using BioBERT.
4. Key-Word Extraction using KeyBERT and BioBERT.

## 1. Downloading the Data with E-Utilities and Biopython.
To download the data, we will make use of E-Utilities (NCBI Entrez Programming Utilities), a set of tools designed to facilitate the process of downloading large sets of bioinformatic data.

A general introduction to the E-Utilities:
- https://www.ncbi.nlm.nih.gov/books/NBK25497/
- *'A set of nine server-side programs that provide a stable interface into the Entrez query and database system at the NCBI'*.
- Uses a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.
- The E-utilities are therefore the structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
- To access data, a piece of software posts an E-utility URL to NCBI, then retrieves the results of this and processes the data.
- It can use any computer languages that can send a URL to the E-utilities server and interpret the XML response (i.e. Python, Perl, Java, C++).
- NCBI requests that users limit requests to no more than 3 per second.

A combination of **ESearch** and **EFetch** can be used to find and retrieve the relevant data.

In order to make use of the tools, a program must post an 'E-Utility URL' to NCBI. **BioPython** is a library that provides a tool called Entrez to send these URLs using Python. 

Below, it is used to **fetch** all article **abstracts** related to the term **'biology'**. It then saves the data to one text file, 'bio_corpus.txt'.

In [None]:
from Bio import Entrez
import xml.etree.ElementTree as ET

# Setting email.
Entrez.email = 'aidanlowrie@example.com'

# Setting search word.
search_word = 'biology'

# Search for PubMed articles related to the search word 'biology', returning up to 100,000 results. 
search_handle = Entrez.esearch(db='pubmed', term=search_word, retmax=100000)
record = Entrez.read(search_handle)
search_handle.close()

# A list of uids is necessary for fetching the actual abstracts.
uids = record['IdList']

# Fetch the abstracts in XML form, so that the actual abstract may be extracted.
fetch_handle = Entrez.efetch(db="pubmed", id=','.join(uids), 
                             rettype="abstract", retmode="xml") # Return data in XML form.
data = fetch_handle.read()
fetch_handle.close()

# Parsing the XML.
root = ET.fromstring(data)
# Extracting the abstracts themselves from the returned data.
abstracts = root.findall(".//AbstractText")

# Write the abstracts to a file.
with open("data/bio_corpus.txt", "w") as file:
    for abstract in abstracts:
        file.write(str(abstract.text) + "\n\n") # Two newlines are added between abstracts, for clarity.


## 2. Relationship Extraction with REBEL 
Relationship extraction involves finding **triplets** - **subject (head)**, **relationship (type)** and **object (tail)** - in a corpus. While this can be achieved through a variety of machine learning techniques including pattern matching and supervised machine learning, we have chosen to use REBEL (Relationship Extraction By End-to-end Language generation).

[REBEL](https://aclanthology.org/2021.findings-emnlp.204.pdf) is an open source relationship extraction seq2seq model released in 2021. We have chosen to use it due to its **state-of-the-art performance** and **ease-of-use**.

Before being passed into the model, we used **NLTK** to **break the corpus into sentences**.

In [1]:
# Importing nltk resources.
import nltk
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize

# Opening corpus.
with open("data/bio_corpus.txt", "r") as file:
    corpus_text = file.read()

corpus_sentences = sent_tokenize(corpus_text)

#### The ```extract_triplets``` Function.
This function was lifted directly from the [huggingface REBEL docs](https://huggingface.co/Babelscape/rebel-large). It is designed to **parse the text** generated by REBEL into a **list of triplets**. 

In [2]:
# Parse REBEL output into a list of triplets. 
def extract_triplets(text):
    triplets = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets


The code below **loads the REBEL model** and **writes** the **extracted triplets to a file**. 

*The process had to be carried out in batches over several nights, which is why the code was adapted to include a 'start line'.*

In [None]:
import csv
from transformers import pipeline

# Loading the model.
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')

# Establishing a start line.
start_line = 15000

# The model can only handle tokens of max length 1024 tokens. Those exceeding this capacity aren't considered in the dataset. (This is very rare, but the check is necessary.)
max_token_length = 1024

# Opening a csv file for triplet storage.
with open("data/triplets_batch4.csv", "w") as file:
    # Field names.
    field_names = ['head', 'type', 'tail']
    writer = csv.DictWriter(file, fieldnames=field_names)
    writer.writeheader()
    for i, sentence in enumerate(corpus_sentences):
        if i > start_line and len(triplet_extractor.tokenizer.encode(sentence)) <= max_token_length:
            extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(sentence, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
            triplets = extract_triplets(extracted_text[0])
            for triplet in triplets:
                writer.writerow(triplet)    

*The triplet batches are then merged into a single file and duplicates are removed.*

In [77]:
import pandas as pd

def csv_union(csv_path_list, output_csv_path):
    dfs = []
    for csv in csv_path_list:
        dfs.append(pd.read_csv(csv))
    df_union = pd.concat(dfs).drop_duplicates()
    df_union.to_csv(output_csv_path, index=False)

csv_union(csv_path_list=['data/triplets_batch1.csv', 'data/triplets_batch2.csv', 'data/triplets_batch3.csv', 'data/triplets_batch4.csv'], output_csv_path='data/triplets.csv')

## 3. Keyword Extraction with KeyBERT and BioBERT

https://towardsdatascience.com/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae (```KeyBert``` useful info.)

In [56]:
# Loading corpus.
with open('data/bio_corpus.txt', 'r') as file:
    abstracts = file.readlines()

In [78]:
import csv
from keybert import KeyBERT
from nltk.corpus import stopwords
import torch
import random
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

# Loading the BioBERT model.
model_name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
biobert_model = AutoModel.from_pretrained(model_name)

# Loading the KeyBERT model running on BioBERT 
kw_model = KeyBERT(model=biobert_model)

# Loading triplets dataframe.
df = pd.read_csv('data/triplets.csv', names=['head', 'type', 'tail'])
# Remove null values.
df = df[df['head'].notna() & df['tail'].notna()]

# Creating a list of stopwords from NLTK.
stop_words = list(set(stopwords.words('english')))

keywords = set()
with open("data/keywords.csv", "w") as file:
    writer = csv.DictWriter(file, fieldnames=['Keyword'])
    writer.writeheader()   
    for abstract in abstracts:
        # Extracting unigrams.
        new_keyword_list = set([keyword for keyword, _ in kw_model.extract_keywords(abstract, keyphrase_ngram_range=(1, 1), stop_words=stop_words)])
        novel_keywords = new_keyword_list - keywords
        keywords.update(novel_keywords)
        for novel_keyword in novel_keywords:
            writer.writerow({'Keyword': novel_keyword})
        # Extracting bigrams.
        new_keyword_list = set([keyword for keyword, _ in kw_model.extract_keywords(abstract, keyphrase_ngram_range=(2, 2), stop_words=stop_words)])
        novel_keywords = new_keyword_list - keywords
        keywords.update(novel_keywords)
        for novel_keyword in novel_keywords:
            writer.writerow({'Keyword': novel_keyword})
        # Extracting trigrams.
        new_keyword_list = set([keyword for keyword, _ in kw_model.extract_keywords(abstract, keyphrase_ngram_range=(3, 3), stop_words=stop_words)])
        novel_keywords = new_keyword_list - keywords
        keywords.update(novel_keywords)
        for novel_keyword in novel_keywords:
            writer.writerow({'Keyword': novel_keyword})

In [79]:
import pandas as pd

def filter_dataframe(df, relevant_column_names, filter_set):
    for column_name in relevant_column_names:
        # Split hyphenated words.
        df[column_name] = df[column_name].str.replace('-', ' ')
        # Remove words that are in the filter set.
        df = df[~df[column_name].isin(filter_set)]
    return df

keyword_data = pd.read_csv('data/keywords.csv')
keywords = keyword_data['Keyword'].tolist()
filtered_df = filter_dataframe(df=df, relevant_column_names=['head', 'tail'], filter_set=keywords)
filtered_df.to_csv('data/filtered_triplets.csv', index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_name] = df[column_name].str.replace('-', ' ')


## 4. Semantic Matching using BioBERT

In the previous steps, triplets were extracted from the corpus using REBEL and filtered depending on whether they were flagged as key terms by **KeyBERT**. But many extracted terms are semantically near-identical. In such cases, they should be **merged** so that only one word is considered. This is called **Semantic Matching**. In order to carry out this process, we will use **BioBERT**.

BioBERT is a BERT model **pre-trained on PubMed articles** as well as other biological content. It is used as follows in the code:
1. **Semantic data** for each 'head' and 'tail' entry are obtained from BioBERT embeddings. 
2. **Cosine similarity** scores are then used to compare terms, merging them if they meet a certain similarity threshold by **replacing every instance of one term with the other**.

In [80]:
# Get a list of each unique term in the df.
unique_words = list(set(df["head"].unique().tolist() 
                        + df["tail"].unique().tolist()))

# Tokenise.
unique_words_tokenised = [tokenizer(word, return_tensors="pt") for word in unique_words]

# Feeding the tokenised words into the model to get a list of unique_word_embeddings.
with torch.no_grad():
    unique_word_outputs = [biobert_model(**tokens) for tokens in unique_words_tokenised]
unique_word_embeddings = [output.last_hidden_state.mean(dim=1).numpy() for output in unique_word_outputs]

# Create a DataFrame with the words and their embeddings
df_embeddings = pd.DataFrame({
    'word': unique_words,
    'embedding': unique_word_embeddings
})

In [81]:
import Levenshtein
def get_cosine_similarity(df_embeddings, word_1, word_2):
    # Get the embeddings for the two words
    embedding1_df = df_embeddings[df_embeddings['word'] == word_1]['embedding']
    embedding2_df = df_embeddings[df_embeddings['word'] == word_2]['embedding']
    if embedding1_df.empty or embedding2_df.empty:
        return 0
    else:
        embedding1 = embedding1_df.values[0]
        embedding2 = embedding2_df.values[0]
        # Compute and return the cosine similarity
        return cosine_similarity(embedding1, embedding2)[0][0] # type: ignore


def get_levenshtein_similarity(word_1, word_2):
    return 1 - Levenshtein.distance(word_1, word_2) / max(len(word_1), len(word_2))

# Merge similar words by rippling through the keyword list and comparing against others, then removing from list.
def merge_similar_words(df, df_embeddings, cosine_threshold, levenshtein_threshold):
    word_replacements = {}
    unique_words = df_embeddings['word'].to_list()
    for word_1 in unique_words:
        unique_words.remove(word_1)
        for word_2 in unique_words:
            levenshtein_score = get_levenshtein_similarity(word_1=word_1, word_2=word_2)
            cosine_score = get_cosine_similarity(df_embeddings=df_embeddings, word_1=word_1, word_2=word_2)
            # print('FAILURE', '1', word_1, '2', word_2, 'cos', cosine_score, 'lev', levenshtein_score)
            if  (cosine_score > cosine_threshold and levenshtein_score > levenshtein_threshold) or levenshtein_score > 0.85:
                print('SUCCESS', '1', word_1, '2', word_2, 'cos', cosine_score, 'lev', levenshtein_score)
                if word_1.lower() == word_1 or word_2.lower() == word_2:
                    word_1 = word_1.lower()
                    word_2 = word_2.lower()
                if len(word_1) < len(word_2):
                    winner = word_1
                elif len(word_2) < len(word_1):
                    winner = word_2
                else:
                    winner = random.choice((word_1, word_2))
                if winner == word_1:
                    loser = word_2
                else:
                    loser = word_1
                word_replacements[loser] = winner
    df_replaced = df.replace(word_replacements)
    return df_replaced

df_replaced = merge_similar_words(df=df, df_embeddings=df_embeddings, cosine_threshold=0.97, levenshtein_threshold=0.6)
df_replaced.to_csv()

SUCCESS 1 Coding variants 2 coding variants cos 0.91758 lev 0.9333333333333333
SUCCESS 1 socioeconomic factors 2 socioeconomic indicators cos 0.9773717 lev 0.7916666666666666
SUCCESS 1 postmenopausal 2 post-menopausal cos 0.98054916 lev 0.9333333333333333
SUCCESS 1 Parkinsonism 2 parkinsonism cos 0.8776088 lev 0.9166666666666666
SUCCESS 1 plaque 2 plaques cos 0.8970699 lev 0.8571428571428572
SUCCESS 1 microglial 2 microglia cos 0.9648881 lev 0.9
SUCCESS 1 diplonemid 2 diplonemids cos 0.99051666 lev 0.9090909090909091
SUCCESS 1 university hospital 2 University Hospital cos 0.8423207 lev 0.8947368421052632
SUCCESS 1 quality-of-life 2 quality of life cos 0.9663539 lev 0.8666666666666667
SUCCESS 1 COL8A1 2 COL11A1 cos 0.97448915 lev 0.7142857142857143
SUCCESS 1 COL8A1 2 COL6A3 cos 0.9819741 lev 0.6666666666666667
SUCCESS 1 COL8A1 2 COL10A1 cos 0.98273283 lev 0.7142857142857143
SUCCESS 1 intermediate state 2 intermediate stage cos 0.8887357 lev 0.9444444444444444
