# Implementing Biomedical NER with TinyLlama 

This notebook implements the core workflows presented in the paper **"LLMs in Biomedical: A Study on Named Entity Recognition"**. I will adapt the paper's methods for an open-source model, replacing the proprietary GPT-4 with **TinyLlama**.

The goal is to perform **Named Entity Recognition (NER)** on biomedical text, identifying entities like diseases, treatments, and tests.


--- 
## 1. Setup and Dependencies


In [3]:
import torch
import numpy as np
import pandas as pd
from dask.multiprocessing import exceptions
from sklearn.externals.array_api_compat.dask.array import astype
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel
from datasets import load_dataset
from sklearn.neighbors import NearestNeighbors
from transformers import pipeline
import os
import requests
import  ast

# Check for GPU availability for faster processing
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


---
## 2. Loading Models and Data

Here, we load the TinyLlama chat model, the BioClinicalBERT model for embeddings.

In [4]:
os.getenv(".env")

In [5]:
# loading the tiny llama model 
pipe = pipeline('text-generation', model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")

# loading the BioClinicalBert model for encodings
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

print("Models Loaded Successfully!!")



Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.16s/it]
Device set to use cuda:0


Models Loaded Successfully!!


In [6]:
torch.cuda.is_available()

True

In [7]:
# Convert data to a dataframe
db = pd.read_csv("data-words/train.tsv", sep='\t')
db = db.dropna()
db

Unnamed: 0,Identification,O
0,of,O
1,APC2,O
2,",",O
3,a,O
4,homologue,O
...,...,...
135994,and,O
135995,increased,O
135996,survival,O
135997,.,O


# TANL

## Creating a function which inputs the search term and the database. It then looks up for the term for documents and returns the list of the documents. These documents will be used by the LLm as a context for identifying entities.

In [8]:

def get_context(search_term,search_db):

    BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

    esearch_url = BASE_URL + "esearch.fcgi"

    esearch_params = {
        "db": search_db,
        "term": search_term,
        "retmax": 5,
        "retmode": "json",
        "usehistory": "y",
        "tool": "MyPubMedAPIScript",
        "email": os.getenv("email"),
        "api_key": os.getenv("ncbi_token")
    }

    response = requests.get(esearch_url,params=esearch_params)
    response.raise_for_status

    result_json = response.json()

    result = result_json.get("esearchresult",{})
    ids = result.get("idlist",[])
    count = result.get("count",0)
    webenv = result.get("webenv")

    final_result = {"search term": search_term,
                    "total results:": count,
                    "id list": ids,
                    "web environment": webenv}


    return final_result


## Creating a test for testing the first chunk with 1500 words
### Results:- Best performance with 25 batch size

In [9]:
# Writing the prompt usning TANL technique

current_words = db["Identification"].iloc[0:1500]
current_text = " ".join(current_words)

prompt_first_classification = [
  {
    "role": "system",
    "content": "You are an expert in medical domain. Given the following document, your task is to identify that it could potentially be a medical entity, do not add any other text just return the output format specified. The output should be a list of strings where the strings will be the potential medical entities nothing else, for example: the output format should be: ['entity 1', 'entity 2' ... and so on]"
     },
    {
    "role": "user",
    "content": f"data: {current_text}"
    },
    ]

prompt = pipe.tokenizer.apply_chat_template(prompt_first_classification, tokenize=False, add_generation_prompt=True)
test_entities_from_doc = pipe(prompt,
        max_new_tokens=1000,
        temperature=0.1,
        batch_size=16,
        return_full_text=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [10]:
test_result_preclassification = ast.literal_eval(test_entities_from_doc[0]["generated_text"])
print(test_result_preclassification)

['adenomatous polyposis coli', 'Wnt signalling pathway', 'glycogen synthase kinase 3beta', 'axin/conductin', 'betacatenin', 'Tcf-4 transcription factor', 'MSH2 mutation', 'hereditary non-polyposis cancer syndrome', 'HNPCC', 'colorectal cancer', 'Huntington disease', 'apolipoprotein E', 'CAG repeat length', 'complement component 7', 'Neisseria', 'cartilage-hair hypoplasia', 'dihydropyrimidine dehydrogenase deficiency']


## Final LLM calling for the entire document

## Creating chinks of the column of size 1500 and adding them to the list of prompts

In [12]:
# # Writing the prompt usning TANL technique
#
# chunks = 1500
# overlap = 200
# all_potential_entities = set()
# all_prompts = []
#
# for i in range(0,len(db["Identification"]),chunks-overlap):
#
#         current_words = db["Identification"][i:i+chunks]
#         current_text = " ".join(current_words)
#
#         prompt_tanl = [
#             {
#               "role": "system",
#               "content": "You are an expert in medical domain. Given the following document, your task is to identify that it could potentially be a medical entity, do not add any other text just return the output format specified. The output should be a list of strings where the strings will be the potential medical entities nothing else, for example: the output format should be: ['entity 1', 'entity 2' ... and so on]"
#                         },
#                 {
#                  "role": "user",
#                  "content": f"data: {current_text}"
#                 },
#                 ]
#
#         prompt = pipe.tokenizer.apply_chat_template(prompt_tanl, tokenize=False, add_generation_prompt=True)
#         all_prompts.append(prompt)
#
# print("All prompts added")
# pipe.tokenizer.pad_token = pipe.tokenizer.eos_token


## Calling the pipeline

In [13]:
# entities_from_doc = pipe(all_prompts,
#         max_new_tokens=1024,
#         temperature=0.1,
#         return_full_text=False,
#         batch_size=8)

## Implimenting the RAG

In [62]:
from Bio import Entrez
from urllib.error import HTTPError
Entrez.email = "atripathi2024@fau.edu"

In [78]:
def ragpredict(test_result_preclassification):

    predictions = []
    for i in test_result_preclassification:
       search_term = i
       ids = get_context(i,"pubmed")
       valid_ids = [str(j) for j in ids if j]
       context_xml = ""
       if valid_ids:
           try:
               list_of_ids = ",".join(valid_ids)
               handle = Entrez.efetch(db="pubmed", id=list_of_ids, retmode="xml")
               context_xml = handle.read()
               handle.close()
           except HTTPError as e:
               print(f"HTTP Error fetching IDs for term '{search_term}': {e}")
               pass
           except Exception as e:
               print(f"An error occurred for term '{search_term}': {e}")
               pass



       prompt_final_classification = [
        {
          "role": "system",
          "content": "You are an expert in medical domain. Given the following word, and a context which has been taken from Pubmed clinical database, your task is to identify that it could potentially be a medical entity, do not add any other text just return the output format specified. The output should be either 'o'- which represents outside clinical term (for non clinical terms), 'B-CLINICAL'- for the words which are beginning of a clinical term and 'I-CLINICAL' - for the words which can be a subset of a clinical term "
        },
        {
          "role": "user",
          "content": f"word: {search_term}, context: {context_xml}"
        },
        ]
       result = pipe(prompt_final_classification,
                     max_new_tokens=1000,
                     temperature=0.1,
                     return_full_text=False
                     )
       result = result[0]["generated_text"]
       result_dict = {
           search_term: result
       }
       predictions.append(result_dict)

    return  predictions




In [79]:
predictions = ragpredict(test_result_preclassification)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'adenomatous polyposis coli': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'Wnt signalling pathway': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'glycogen synthase kinase 3beta': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'axin/conductin': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'betacatenin': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'Tcf-4 transcription factor': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'MSH2 mutation': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'hereditary non-polyposis cancer syndrome': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'HNPCC': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'colorectal cancer': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'Huntington disease': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'apolipoprotein E': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'CAG repeat length': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'complement component 7': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'Neisseria': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'cartilage-hair hypoplasia': HTTP Error 400: Bad Request


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


HTTP Error fetching IDs for term 'dihydropyrimidine dehydrogenase deficiency': HTTP Error 400: Bad Request


In [80]:
print(predictions)

[{'adenomatous polyposis coli': 'B-CLINICAL \nI-CLINICAL'}, {'Wnt signalling pathway': 'B-I-CLINICAL'}, {'glycogen synthase kinase 3beta': 'B'}, {'axin/conductin': 'B-CLINICAL'}, {'betacatenin': 'B-I-CLINICAL'}, {'Tcf-4 transcription factor': 'B-CLINICAL'}, {'MSH2 mutation': 'B-CLINICAL'}, {'hereditary non-polyposis cancer syndrome': 'B-CLINICAL \nI-CLINICAL \nI-CLINICAL'}, {'HNPCC': 'B'}, {'colorectal cancer': 'B-CLINICAL \nI-CLINICAL \nI-CLINICAL'}, {'Huntington disease': 'B-Hunting'}, {'apolipoprotein E': 'B-CLINICAL'}, {'CAG repeat length': 'B-CLINICAL'}, {'complement component 7': 'B-CLINICAL'}, {'Neisseria': 'B-CLINICAL'}, {'cartilage-hair hypoplasia': 'B-CLINICAL'}, {'dihydropyrimidine dehydrogenase deficiency': 'B-CLINICAL'}]


In [83]:
reformatted_data = [{'Term': list(item.keys())[0], 'Classification': list(item.values())[0]} for item in predictions]
predictions_data= pd.DataFrame(reformatted_data)
predictions_data

Unnamed: 0,Term,Classification
0,adenomatous polyposis coli,B-CLINICAL \nI-CLINICAL
1,Wnt signalling pathway,B-I-CLINICAL
2,glycogen synthase kinase 3beta,B
3,axin/conductin,B-CLINICAL
4,betacatenin,B-I-CLINICAL
5,Tcf-4 transcription factor,B-CLINICAL
6,MSH2 mutation,B-CLINICAL
7,hereditary non-polyposis cancer syndrome,B-CLINICAL \nI-CLINICAL \nI-CLINICAL
8,HNPCC,B
9,colorectal cancer,B-CLINICAL \nI-CLINICAL \nI-CLINICAL


# Dice