# Construct a Biomedical Knowledge Graph with NLP

Natural Language Processing (NLP) is a crucial tool for analyzing human language and transforming it into a form that computers can understand. In the biomedical field, NLP techniques can extract information from research papers, journals, and databases, and represent it as a structured knowledge graph. This representation offers a comprehensive view of the relationships and interactions between biomedical concepts, such as genes, diseases, drugs, and proteins.

In this tutorial, a single research paper is used to demonstrate the steps involved in constructing a biomedical knowledge graph, including (a) reading a PDF document with OCR, (b) text preprocessing, (c) biomedical concept recognition and linking, (d) relation extraction, and (e) external database enrichment. The resulting graph is stored in Neo4j, a graph database that uses the labeled property graph model and the Cypher query language to provide a comprehensive and easily searchable view of biomedical concepts and their relationships. This knowledge graph can be used for further analysis and insights.

### Installing Required Tools for NLP-based Knowledge Graph

In [None]:
!sudo apt-get install -y tesseract-ocr # on mac: brew install tesseract
!sudo apt-get install -y poppler-utils # on mac: brew install poppler
!pip install pytesseract pdf2image zero-shot-re neo4j

### 1. Reading a PDF document with OCR
This code downloads a PDF document from a URL with the 'requests' library, converts each page into an image with the 'pdf2image' library, runs OCR on the page images with the 'pytesseract' library, keeps the text of the first 5 pages (the later pages hold the references), and finally joins the text into a single string.

In [1]:
# Import the necessary libraries
import requests
import pdf2image
import pytesseract

# Retrieve the PDF document from the URL
pdf = requests.get('https://arxiv.org/pdf/2110.03526.pdf')

# Convert the PDF to a list of images (pages)
doc = pdf2image.convert_from_bytes(pdf.content)

# Initialize an empty list to store the article text
article = []

# Iterate over each page in the document (page_number is 0-based)
for page_number, page_data in enumerate(doc):
    # Keep only the first 5 pages and skip OCR for the rest
    if page_number >= 5:
        break
    # Extract the text from the page image and add it to the article text
    article.append(pytesseract.image_to_string(page_data))

# Join the extracted text from each page into a single string
article_txt = " ".join(article)

### 2. Text Preprocessing
We now process the article text obtained in the previous step. To focus on the core content, we keep only the part after the word "INTRODUCTION" and drop section titles and figure descriptions with a simple heuristic: a row is removed if it has three or fewer words or starts with "Figure" or "(a)". The remaining text is then split into sentences.

In [2]:
# Import the Natural Language Toolkit library
import nltk

# Download the 'punkt' package for sentence tokenization
nltk.download('punkt')

# Define a function to remove section titles and figure descriptions from text
def clean_text(text):
    """Remove section titles and figure descriptions from text"""
    rows = [
        row for row in text.split("\n")
        if len(row.split(" ")) > 3           # drop short rows such as section titles
        and not row.startswith("(a)")        # drop sub-figure descriptions
        and not row.startswith("Figure")     # drop figure captions
    ]
    return "\n".join(rows)

# Split the text on the word "INTRODUCTION" and only keep the second part
text = article_txt.split("INTRODUCTION")[1]

# Clean the text
ctext = clean_text(text)

# Tokenize the cleaned text into sentences
sentences = nltk.tokenize.sent_tokenize(ctext)

[nltk_data] Downloading package punkt to /Users/Farzad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
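
To see what the row filter keeps and drops, here is a minimal sketch that applies the same rules to a few made-up rows. The sample text is hypothetical and is not taken from the paper.

```python
def clean_text(text):
    """Same filtering rules as the tutorial's clean_text."""
    return "\n".join(
        row for row in text.split("\n")
        if len(row.split(" ")) > 3
        and not row.startswith("(a)")
        and not row.startswith("Figure")
    )

# Hypothetical rows that imitate OCR output
sample = "\n".join([
    "2 RELATED WORK",
    "Skin grafts are widely used to treat severe burns.",
    "Figure 1: Overview of the proposed pipeline for this study.",
    "(a) Schematic of the scaffold used here.",
    "Collagen is a major component of the dermis.",
])

cleaned = clean_text(sample)
print(cleaned)
```

Only the two full sentences survive. Note that the word-count rule is a heuristic: a section title with four or more words would pass the filter, and a short but meaningful row would be dropped.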


### 3. Biomedical Named Entity Linking (NEL)
Named entity recognition (NER) is a key technique in NLP that identifies relevant entities or concepts in text. In the biomedical domain, for example, NER can detect genes, drugs, diseases, and other relevant concepts. An upgraded version of NER is named entity linking (NEL), which not only detects relevant concepts in text but also maps them to a target knowledge base. The primary reason for using NEL is to help deal with entity disambiguation and to enrich our graph model by fetching information from the knowledge base.

One of the NEL models available is BERN, a fine-tuned BioBert model with integrated NEL models that map concepts to biomedical knowledge bases such as MESH, CHEBI, OMIM, ENSEMBL, and others. BERN provides a free REST endpoint, making it easy to use without having to deal with installation and dependencies.

In this example, we use BERN to recognize and link biomedical entities, but it does not always assign a target knowledge base ID to a concept. To overcome this limitation, the script below uses the ID prefixed with "BERN" when one is returned and otherwise falls back to the entity name as the ID; IDs from other knowledge bases are kept in a separate list. Additionally, we compute the SHA-256 hash of each sentence's text so that specific sentences can be easily identified during relation extraction.

In conclusion, the use of NEL techniques and models such as BERN helps to accurately identify and link relevant concepts in text to a target knowledge base, leading to a more comprehensive and enriched analysis.

*** This part could be time-consuming! ***

In [3]:
# Importing the hashlib library which provides secure hash algorithms like sha256
import hashlib 

# A function to query the biomedical entity linking API, using the "requests" library
def query_raw(text, url="https://bern.korea.ac.kr/plain"):
    """
    Query the biomedical entity linking API to get named entities
    text: str, the text to be processed by the API
    url: str, the API endpoint
    returns: JSON, the response from the API
    """
    # Note: verify=False disables TLS certificate verification for this request
    return requests.post(url, data={'sample_text': text}, verify=False).json()

entity_list = []
# Loop through each sentence except the last one
for s in sentences[:-1]:
    # Get the named entities for each sentence
    entity_list.append(query_raw(s))

# Make a list that stores information about entities extracted from text, 
# including entity type, entity name, entity id, and other entity ids, 
# as well as the text itself and its SHA-256 hash value.
parsed_entities = []
# Loop through the list of named entities for each sentence
for entities in entity_list:
    e = []
    # If there are no entities in the text
    if not entities.get('denotations'):
        parsed_entities.append({'text':entities['text'], 'text_sha256': hashlib.sha256(entities['text'].encode('utf-8')).hexdigest()})
        continue
    # Loop through each entity in the text
    for entity in entities['denotations']:
        # Get a list of ids that are not from BERN
        other_ids = [i for i in entity['id'] if not i.startswith("BERN")]
        entity_type = entity['obj'] # Get the entity type
        entity_name = entities['text'][entity['span']['begin']:entity['span']['end']] # Get the entity name
        # Use the BERN id if it exists; otherwise fall back to the entity name
        bern_ids = [i for i in entity['id'] if i.startswith("BERN")]
        entity_id = bern_ids[0] if bern_ids else entity_name
        # Append the entity information to the e list
        e.append({'entity_id': entity_id, 'other_ids': other_ids, 'entity_type': entity_type, 'entity': entity_name})
    # Append the entities, text, and sha256 hash of the text to the parsed_entities list
    parsed_entities.append({'entities':e, 'text':entities['text'], 'text_sha256': hashlib.sha256(entities['text'].encode('utf-8')).hexdigest()})
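
The parsing logic can be tried offline on a single response. The sketch below wraps the same steps in a helper and feeds it a hand-written response. The helper name `parse_response`, the sample sentence, and all IDs are hypothetical placeholders; the response shape is only inferred from the fields the code above reads (`text`, `denotations`, `id`, `obj`, `span`).

```python
import hashlib

def parse_response(response):
    """Turn one API response into the record format used for the graph import."""
    text = response["text"]
    record = {
        "text": text,
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    entities = []
    for entity in response.get("denotations") or []:
        ids = entity["id"]
        bern_ids = [i for i in ids if i.startswith("BERN")]
        name = text[entity["span"]["begin"]:entity["span"]["end"]]
        entities.append({
            # BERN id if present, otherwise the entity name
            "entity_id": bern_ids[0] if bern_ids else name,
            "other_ids": [i for i in ids if not i.startswith("BERN")],
            "entity_type": entity["obj"],
            "entity": name,
        })
    if entities:
        record["entities"] = entities
    return record

# Hypothetical response with made-up placeholder ids
sample_response = {
    "text": "Collagen promotes wound healing.",
    "denotations": [
        {"id": ["CHEBI:EXAMPLE", "BERN:EXAMPLE-1"], "obj": "drug",
         "span": {"begin": 0, "end": 8}},
        {"id": ["MESH:EXAMPLE"], "obj": "disease",
         "span": {"begin": 18, "end": 31}},
    ],
}

parsed = parse_response(sample_response)
for e in parsed["entities"]:
    print(e["entity"], "->", e["entity_id"], e["other_ids"])
```

The first entity keeps its BERN-prefixed ID, while the second has none and therefore uses its name, "wound healing", as the ID.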



### 4. Construct a Knowledge Graph
Here, we're going to create a biomedical knowledge graph using only entities. To store the graph, we will be using Neo4j, and you can use a free Neo4j Sandbox instance for this purpose.

To get started, go to the Neo4j Sandbox website (https://sandbox.neo4j.com/?usecase=blank-sandbox) and start a new Blank project. Once the project is started, copy the connection details to the Colab notebook. Now, we're ready to connect to our Neo4j instance and start building our knowledge graph!

In [4]:
from neo4j import GraphDatabase
import pandas as pd

# Connection information for the Neo4j database
# (replace these values with the connection details of your own Sandbox instance)
host = 'bolt://44.201.239.175:7687'
user = 'neo4j'
password = 'nameplate-certificate-helmsmen'
# Creating a driver to connect to the Neo4j database
driver = GraphDatabase.driver(host, auth=(user, password))

# Function to run a query on the Neo4j database
def neo4j_query(query, params=None):
    with driver.session() as session:
        # Running the query and returning the result
        result = session.run(query, params)
        # Converting the result to a Pandas dataframe for easier manipulation
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

Import the author and the article into the graph. The article node will contain only the title.

In [5]:
author = article_txt.split("\n")[0]
title = " ".join(article_txt.split("\n")[2:4])

neo4j_query("""
MERGE (a:Author{name:$author})
MERGE (b:Article{title:$title})
MERGE (a)-[:WROTE]->(b)
""", {'title':title, 'author':author})

To visualize the author-title relationship, query Neo4j for all nodes labeled "Author" and "Article" together with their relationships of type "WROTE". When the query is run in the Neo4j Browser interface, the result is rendered as a graph.
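
One way to write such a query is sketched below. The variable name is our own, and the query is an assumption based on the labels and relationship type created in the previous cell.

```python
# Returning the whole path lets the Neo4j Browser draw both nodes
# and the WROTE relationship between them.
AUTHOR_ARTICLE_GRAPH_QUERY = """
MATCH p = (a:Author)-[:WROTE]->(b:Article)
RETURN p
"""

# Paste the printed query into the Neo4j Browser to see the graph view
print(AUTHOR_ARTICLE_GRAPH_QUERY.strip())
```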

![alt text](https://github.com/fvfarahani/BioNLP/blob/main/author_title.png?raw=true)

The next cell defines a Cypher query that retrieves all the relationships between "Author" and "Article" nodes. The query is executed with the neo4j_query function, which returns the results as a pandas DataFrame, and the results are then printed.

In [6]:
# Define the Cypher query that you want to execute
query = """
MATCH (a:Author)-[:WROTE]->(b:Article)
RETURN a.name AS Author, b.title AS Article
"""

# Execute the query and store the results in a pandas DataFrame
result = neo4j_query(query)

# Print the results
print(result)

                Author                                            Article
0  Mohammadreza Ahmadi  Tissue Engineering and Regeneration of Skin an...


The following Cypher query imports the sentences and the entities they mention. Each sentence is stored as a Sentence node keyed by its SHA-256 hash and linked to the article, and each entity is merged by its ID so that repeated mentions reuse the same node.

In [7]:
neo4j_query("""
MATCH (a:Article)
UNWIND $data as row
MERGE (s:Sentence{id:row.text_sha256})
SET s.text = row.text
MERGE (a)-[:HAS_SENTENCE]->(s)
WITH s, row.entities as entities
UNWIND entities as entity
MERGE (e:Entity{id:entity.entity_id})
ON CREATE SET e.other_ids = entity.other_ids,
              e.name = entity.entity,
              e.type = entity.entity_type
MERGE (s)-[m:MENTIONS]->(e)
ON CREATE SET m.count = 1
ON MATCH SET m.count = m.count + 1
""", {'data': parsed_entities})

The constructed graph can be inspected in the Neo4j Browser by matching the Article, Sentence, and Entity nodes along their HAS_SENTENCE and MENTIONS relationships.
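
A possible inspection query is sketched below. It is an assumption built from the labels and relationship types used in the import cell, and the `LIMIT` of 25 paths is an arbitrary choice to keep the visualization readable.

```python
# Follow the import structure: Article -> Sentence -> Entity
INSPECT_GRAPH_QUERY = """
MATCH p = (a:Article)-[:HAS_SENTENCE]->(s:Sentence)-[:MENTIONS]->(e:Entity)
RETURN p
LIMIT 25
"""

# Paste the printed query into the Neo4j Browser to see the graph view
print(INSPECT_GRAPH_QUERY.strip())
```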

![alt text](https://github.com/fvfarahani/BioNLP/blob/main/main_graph.png?raw=true)

### 5. Knowledge Graph Applications

#### 5.1. Search Engine

We could use our graph as a search engine. For example, a Cypher query can match an Entity node by name and return the sentences or articles that mention it.
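
A minimal sketch of such a search is shown below. The query and the `$name` parameter are our own, and the entity name "collagen" is a hypothetical example that may not appear in your graph.

```python
# Find all sentences that mention an entity with the given name.
# toLower() makes the comparison case-insensitive.
SEARCH_QUERY = """
MATCH (e:Entity)<-[:MENTIONS]-(s:Sentence)
WHERE toLower(e.name) = toLower($name)
RETURN s.text AS sentence
"""

# Hypothetical search term
search_params = {"name": "collagen"}

# With a live connection, the query could be run with the helper defined earlier:
# result = neo4j_query(SEARCH_QUERY, search_params)
print(SEARCH_QUERY.strip())
```

Passing the name as a query parameter, rather than formatting it into the string, keeps the query reusable for other search terms.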

#### 5.2. Co-occurrence Analysis

The second option is co-occurrence analysis. Two medical entities co-occur if they appear in the same sentence or article. A Cypher query can count, for each pair of entities, the sentences that mention both and return the most frequent pairs (the results would of course be better with thousands of articles or more).
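
The sketch below shows one possible sentence-level co-occurrence query, followed by an in-memory version of the same count so the idea can be checked without a database. The query, the helper `cooccurrence_counts`, and the sample data are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Count entity pairs mentioned by the same sentence.
# Comparing the id properties keeps each unordered pair only once.
COOCCURRENCE_QUERY = """
MATCH (e1:Entity)<-[:MENTIONS]-(:Sentence)-[:MENTIONS]->(e2:Entity)
WHERE e1.id < e2.id
RETURN e1.name AS entity1, e2.name AS entity2, count(*) AS cooccurrence
ORDER BY cooccurrence DESC
LIMIT 3
"""

def cooccurrence_counts(parsed):
    """Count unordered entity-id pairs that share a sentence."""
    counts = Counter()
    for sentence in parsed:
        ids = sorted({e["entity_id"] for e in sentence.get("entities", [])})
        counts.update(combinations(ids, 2))
    return counts

# Hypothetical records in the same shape as parsed_entities
sample = [
    {"entities": [{"entity_id": "collagen"}, {"entity_id": "fibroblast"}]},
    {"entities": [{"entity_id": "fibroblast"}, {"entity_id": "collagen"},
                  {"entity_id": "keratin"}]},
    {"text": "a sentence without entities"},
]

counts = cooccurrence_counts(sample)
print(counts.most_common(1))
```

In the sample, "collagen" and "fibroblast" share two sentences, so that pair ranks first.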

#### 5.3. Inspect Author Expertise

We could also use this graph to find an author's expertise by examining the medical entities they most frequently write about. With this information, we could also suggest future collaborations. For our single author, this amounts to counting how often each medical entity is mentioned in the sentences of the research paper.
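
A possible query for this is sketched below. It is an assumption based on the graph structure built above, and the limit of five entities is arbitrary.

```python
# Walk from the author through the article and its sentences to the entities,
# then count how many sentences mention each entity.
AUTHOR_EXPERTISE_QUERY = """
MATCH (a:Author)-[:WROTE]->(:Article)-[:HAS_SENTENCE]->(:Sentence)-[:MENTIONS]->(e:Entity)
RETURN a.name AS author, e.name AS entity, count(*) AS mentions
ORDER BY mentions DESC
LIMIT 5
"""

# With a live connection, the query could be run with the helper defined earlier:
# result = neo4j_query(AUTHOR_EXPERTISE_QUERY)
print(AUTHOR_EXPERTISE_QUERY.strip())
```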

### 6. Relation Extraction
Now, we will be exploring the concept of relation extraction in the field of biomedical research. While named entity extraction is relatively straightforward, relation extraction can prove to be a bit challenging and often requires fine-tuning of the models.

However, we will not be going down that route in this tutorial. Instead, we will be using a zero-shot relation extractor based on the paper "Exploring the zero-shot limit of FewRel". This model is available on HuggingFace and does not require any training or setup on our part, making it well suited to a simple demonstration.

In [13]:
!pip install transformers
!pip install huggingface-hub

from transformers import AutoTokenizer
from zero_shot_re import RelTaggerModel, RelationExtractor

# Loading the pre-trained relation tagging model from HuggingFace
model = RelTaggerModel.from_pretrained("fractalego/fewrel-zero-shot")

# Loading the pre-trained tokenizer for the BERT-large-uncased-whole-word-masking-finetuned-squad model
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

# Defining the list of relations of interest
relations = ['associated', 'interacts']

# Initializing the relation extractor with the model, tokenizer, and list of relations
extractor = RelationExtractor(model, tokenizer, relations)



The relations variable is a list that defines the types of relationships you want the model to detect. In this example, the two relationships specified are 'associated' and 'interacts'. More specific relationship types, such as 'treats' and 'causes', were also tried, but they produced less reliable results.

The zero-shot relation extractor uses this information to search for connections between entities mentioned in the text. The result of the named entity linking process is used as input to the relation extraction process. The code finds all sentences where two or more entities are mentioned and then runs them through the model to extract any connections. A threshold value of 0.85 is also defined, meaning that if the model predicts a relationship between entities with a probability less than 0.85, it will be ignored.
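
The pairing and thresholding logic can be illustrated without loading the model. In the sketch below, `stub_rank` is a hypothetical stand-in for `extractor.rank` with made-up scores; we assume it returns (relation, score) pairs sorted by descending score, which is how the extraction code indexes the result.

```python
import itertools

THRESHOLD = 0.85  # same cut-off as described above

def stub_rank(text, head, tail):
    """Hypothetical stand-in for extractor.rank with made-up scores."""
    fake_scores = {
        ("collagen", "fibroblast"): [("interacts", 0.93), ("associated", 0.41)],
        ("collagen", "burn"): [("associated", 0.62), ("interacts", 0.20)],
    }
    return fake_scores.get((head, tail), [("associated", 0.0), ("interacts", 0.0)])

# Hypothetical entities found in one sentence
entities = ["collagen", "fibroblast", "burn"]

kept = []
# combinations() yields each unordered pair once: 3 entities give 3 pairs
for head, tail in itertools.combinations(entities, 2):
    relation, score = stub_rank("toy sentence", head, tail)[0]
    if score > THRESHOLD:
        kept.append((head, relation, tail))

print(kept)
```

Because each pair is generated only once, a relation is tested only in the order in which the two entities appear in the list, not in the reverse direction.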

In [12]:
import itertools
# Candidate sentence where there is more than a single entity present
candidates = [s for s in parsed_entities if (s.get('entities')) and (len(s['entities']) > 1)]
# Initialize an empty list to store the predicted relationships
predicted_rels = []

# Loop through the candidate sentences
for c in candidates:
    # Get all combinations of two entities from the sentence
    combinations = itertools.combinations([{'name':x['entity'], 'id':x['entity_id']} for x in c['entities']], 2)
    for combination in list(combinations):
        try:
            # Use the extractor to rank the relationships between the entity pair
            ranked_rels = extractor.rank(text=c['text'].replace(",", " "), head=combination[0]['name'], tail=combination[1]['name'])
            # Define threshold for the most probable relation
            if ranked_rels[0][1] > 0.85:
                # Add the relationship to the list of predicted relationships
                predicted_rels.append({'head': combination[0]['id'], 'tail': combination[1]['id'], 'type':ranked_rels[0][0], 'source': c['text_sha256']})
        except Exception:
            # Skip pairs the extractor fails on instead of aborting the whole run
            pass

# Store relations to Neo4j
neo4j_query("""
UNWIND $data as row
MATCH (source:Entity {id: row.head})
MATCH (target:Entity {id: row.tail})
MATCH (text:Sentence {id: row.source})
MERGE (source)-[:REL]->(r:Relation {type: row.type})-[:REL]->(target)
MERGE (text)-[:MENTIONS]->(r)
""", {'data': predicted_rels})

We store the relationships as well as the source text used to extract that relationship in the graph.

You can examine the extracted relationships between entities, together with their source text, by matching each Relation node with its source entity, its target entity, and the sentence that mentions it.
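
One way to write such a query is sketched below. It is an assumption that follows the structure created by the storage query above, where a Relation node sits between the two entities and is linked to its source sentence.

```python
# Source entity -> Relation node -> target entity, plus the sentence
# from which the relation was extracted.
RELATIONS_QUERY = """
MATCH (s:Entity)-[:REL]->(r:Relation)-[:REL]->(t:Entity),
      (r)<-[:MENTIONS]-(st:Sentence)
RETURN s.name AS source_entity,
       r.type AS relation,
       t.name AS target_entity,
       st.text AS source_text
"""

# With a live connection, the query could be run with the helper defined earlier:
# result = neo4j_query(RELATIONS_QUERY)
print(RELATIONS_QUERY.strip())
```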