# Implementing Biomedical NER with TinyLlama 

This notebook implements the core workflows presented in the paper **"LLMs in Biomedical: A Study on Named Entity Recognition"**. I will adapt the paper's methods for an open-source model, replacing the proprietary GPT-4 with **TinyLlama**.

The goal is to perform **Named Entity Recognition (NER)** on biomedical text, identifying entities like diseases, treatments, and tests.


--- 
## 1. Setup and Dependencies


In [1]:
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel, pipeline
from datasets import load_dataset
from sklearn.neighbors import NearestNeighbors
import os
import requests
import ast

# Check for GPU availability for faster processing
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")



Using device: cuda


---
## 2. Loading Models and Data

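Here we load the chat model used for generation and the Bio_ClinicalBERT model used for embeddings. Note that the cell below loads `meta-llama/Llama-3.2-3B-Instruct` rather than a TinyLlama checkpoint.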

In [2]:
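# Load variables such as `email` and `ncbi_token` from a local .env file.
# Assumes the python-dotenv package is installed; os.getenv(".env") does not read the file.
from dotenv import load_dotenv
load_dotenv()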

In [3]:
# Load the chat model for generation (this cell loads Llama-3.2-3B-Instruct, not TinyLlama)
pipe = pipeline('text-generation', model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")

# loading the BioClinicalBert model for encodings
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

print("Models Loaded Successfully!!")



Loading checkpoint shards: 100%|██████████| 2/2 [00:16<00:00,  8.02s/it]
Device set to use cuda:0


Models Loaded Successfully!!


In [4]:
torch.cuda.is_available()

True

In [5]:
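# Read the token/tag pairs into a dataframe.
# Note: the file has no header row, so pandas uses the first row ("Identification", "O") as column names.
db = pd.read_csv("data-words/train.tsv", sep='\t')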
db = db.dropna()
db

| | Identification | O |
|---|---|---|
| 0 | of | O |
| 1 | APC2 | O |
| 2 | , | O |
| 3 | a | O |
| 4 | homologue | O |
| ... | ... | ... |
| 135994 | and | O |
| 135995 | increased | O |
| 135996 | survival | O |
| 135997 | . | O |


# TANL

## Retrieving context documents for a search term

The function below takes a search term and an NCBI database name, queries that database for the term, and returns the IDs of the top five matching documents together with the total hit count. These documents are meant to serve as context for the LLM when it identifies entities.

In [6]:

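def get_context(search_term, search_db):
    """Search an NCBI database with E-utilities `esearch` and return the matching IDs and hit count."""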

    BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

    esearch_url = BASE_URL + "esearch.fcgi"

    esearch_params = {
        "db": search_db,
        "term": search_term,
        "retmax": 5,
        "retmode": "json",
        "usehistory": "y",
        "tool": "MyPubMedAPIScript",
        "email": os.getenv("email"),
        "api_key": os.getenv("ncbi_token")
    }

    response = requests.get(esearch_url, params=esearch_params)
    # raise_for_status must be called; without parentheses HTTP errors pass silently
    response.raise_for_status()

    result_json = response.json()

    result = result_json.get("esearchresult",{})
    ids = result.get("idlist",[])
    count = result.get("count",0)
    webenv = result.get("webenv")

    final_result = {"search term": search_term,
                    "total results:": count,
                    "id list": ids,
                    "web environment": webenv}


    return final_result
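
`get_context` returns only document IDs, so the text of those documents still has to be downloaded before it can be given to the LLM. The following is a minimal sketch of that step, assuming the E-utilities `efetch` endpoint with plain-text abstracts is what is wanted. The helper names `build_efetch_params` and `fetch_abstracts` are hypothetical and are not part of the original notebook.

```python
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_params(ids, db="pubmed"):
    """Build the query parameters for an efetch call (hypothetical helper)."""
    return {
        "db": db,
        "id": ",".join(ids),      # efetch accepts a comma-separated list of IDs
        "rettype": "abstract",
        "retmode": "text",
    }


def fetch_abstracts(ids, db="pubmed", timeout=30):
    """Download plain-text abstracts for the given IDs (needs network access)."""
    if not ids:
        return ""
    query = urllib.parse.urlencode(build_efetch_params(ids, db))
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Only the parameter construction runs here; no request is sent.
params = build_efetch_params(["21498506", "19966865"])
print(params["id"])
```

In the notebook, `fetch_abstracts(result["id list"])` could be called on each dictionary returned by `get_context`; the `email` and `api_key` parameters used for `esearch` could be added to the parameter dictionary in the same way.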


## Testing on the first chunk of 1,500 words
### Result: best performance with a batch size of 25

In [20]:
# Writing the prompt using the TANL technique

current_words = db["Identification"].iloc[0:1500]
current_text = " ".join(current_words)

prompt_tanl = [
  {
    "role": "system",
    "content": "You are an expert in medical domain. Given the following document, your task is to identify that it could potentially be a medical entity, do not add any other text just return the output format specified. The output should be a list of strings where the strings will be the potential medical entities nothing else, for example: the output format should be: ['entity 1', 'entity 2' ... and so on]"
     },
    {
    "role": "user",
    "content": f"data: {current_text}"
    },
    ]

prompt = pipe.tokenizer.apply_chat_template(prompt_tanl, tokenize=False, add_generation_prompt=True)
test_entities_from_doc = pipe(prompt,
        max_new_tokens=1000,
        temperature=0.1,
        batch_size=16,
        return_full_text=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [23]:
test_result_preclassification = ast.literal_eval(test_entities_from_doc[0]["generated_text"])
print(test_result_preclassification)

['adenomatous polyposis coli', 'Wnt signalling pathway', 'glycogen synthase kinase 3beta', 'axin/conductin', 'betacatenin', 'Tcf-4 transcription factor', 'MSH2 mutation', 'hereditary non-polyposis cancer syndrome', 'HNPCC', 'colorectal cancer', 'Huntington disease', 'apolipoprotein E', 'CAG repeat length', 'complement component 7', 'cartilage-hair hypoplasia', 'dihydropyrimidine dehydrogenase deficiency']
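
Calling `ast.literal_eval` directly on the generated text works only when the model returns exactly one well-formed Python list. As a hedged sketch of a more forgiving parser, the hypothetical helper `parse_entity_list` below extracts the bracketed part of the reply and returns an empty list when nothing can be parsed. It is an illustration, not part of the original workflow.

```python
import ast
import re


def parse_entity_list(generated_text):
    """Return the list of strings found in a model reply, or [] if none can be parsed."""
    # Take everything from the first '[' to the last ']' so surrounding chatter is ignored.
    match = re.search(r"\[.*\]", generated_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        value = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, list):
        return []
    # Keep only non-empty strings, stripped of surrounding whitespace.
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]


print(parse_entity_list("Sure! ['colorectal cancer', 'HNPCC']"))
print(parse_entity_list("['unterminated"))
```

The first call prints the two entities and the second prints an empty list, because a reply cut off by `max_new_tokens` has no closing bracket to match.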


In [24]:
list_of_context_file = []
for entity in test_result_preclassification:
    # The lookup queries PubMed, so the variable is named accordingly
    result_from_pubmed = get_context(entity, "pubmed")
    list_of_context_file.append(result_from_pubmed)
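
The loop above sends one request per entity with no pause in between. NCBI limits E-utilities traffic to about 3 requests per second without an API key and 10 per second with one, so a long entity list may need throttling. The sketch below is one possible way to do that; `lookup_all` and its `min_interval` default are assumptions made for illustration, and repeated terms are skipped case-insensitively.

```python
import time


def lookup_all(terms, lookup, min_interval=0.11, sleep=time.sleep):
    """Call lookup(term) once per distinct term, pausing between consecutive calls."""
    results = []
    seen = set()
    for term in terms:
        key = term.lower()
        if key in seen:
            continue              # same term already looked up
        seen.add(key)
        if results:
            sleep(min_interval)   # pause before every call except the first
        results.append(lookup(term))
    return results


# Demo with a stand-in lookup so that no network request is made.
pauses = []
demo = lookup_all(
    ["HNPCC", "hnpcc", "colorectal cancer"],
    lambda term: {"search term": term},
    sleep=pauses.append,
)
print(len(demo), pauses)
```

With the notebook's function, the call would look like `lookup_all(test_result_preclassification, lambda t: get_context(t, "pubmed"))`.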

In [25]:
list_of_context_file

[{'search term': 'adenomatous polyposis coli',
  'total results:': '10671',
  'id list': ['41118559', '41112420', '41104667', '41067867', '41045501'],
  'web environment': 'MCID_68f8519ac0fb9ce1550a7395'},
 {'search term': 'Wnt signalling pathway',
  'total results:': '33267',
  'id list': ['41110101', '41103766', '41103435', '41102854', '41102431'],
  'web environment': 'MCID_68f8519b785fe9b67b06592a'},
 {'search term': 'glycogen synthase kinase 3beta',
  'total results:': '10072',
  'id list': ['41113442', '41105605', '41091591', '41086612', '41084480'],
  'web environment': 'MCID_68f8519cca8d189b560fc01b'},
 {'search term': 'axin/conductin',
  'total results:': '12',
  'id list': ['21498506', '19966865', '18854359', '12183362', '12023307'],
  'web environment': 'MCID_68f8519d0e7384fd3a0640f3'},
 {'search term': 'betacatenin',
  'total results:': '33292',
  'id list': ['41115563', '41113613', '41111081', '41110481', '41110101'],
  'web environment': 'MCID_68f8519de46b8848fc032201'},


## Final LLM call for the entire document

## Creating chunks of 1,500 words from the column and adding them to the list of prompts

In [15]:
# Writing the prompt using the TANL technique

chunks = 1500
overlap = 200
all_potential_entities = set()
all_prompts = []

for i in range(0, len(db["Identification"]), chunks - overlap):

    current_words = db["Identification"][i:i + chunks]
    current_text = " ".join(current_words)

    prompt_tanl = [
        {
            "role": "system",
            "content": "You are an expert in medical domain. Given the following document, your task is to identify that it could potentially be a medical entity, do not add any other text just return the output format specified. The output should be a list of strings where the strings will be the potential medical entities nothing else, for example: the output format should be: ['entity 1', 'entity 2' ... and so on]"
        },
        {
            "role": "user",
            "content": f"data: {current_text}"
        },
    ]

    prompt = pipe.tokenizer.apply_chat_template(prompt_tanl, tokenize=False, add_generation_prompt=True)
    all_prompts.append(prompt)

print("All prompts added")
pipe.tokenizer.pad_token = pipe.tokenizer.eos_token


All prompts added
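
To make the windowing arithmetic in the cell above easier to check, here is a small standalone sketch of the same overlapping-chunk scheme on a toy token list. The function name `chunk_tokens` is hypothetical; the step size is the chunk size minus the overlap, as in the loop above.

```python
def chunk_tokens(tokens, chunk_size=1500, overlap=200):
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[start:start + chunk_size] for start in range(0, len(tokens), step)]


tokens = [f"w{i}" for i in range(10)]
toy_chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

With the notebook's values of 1500 and 200, consecutive windows start 1300 words apart. The toy run also shows that the final window can be very short and fully contained in the previous one, which costs one extra prompt.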


## Calling the pipeline

In [18]:
entities_from_doc = pipe(all_prompts,
        max_new_tokens=1024,
        temperature=0.1,
        return_full_text=False,
        batch_size=8)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


KeyboardInterrupt: 

## Parsing the pipeline output into lists and merging them into a set to remove duplicates

In [16]:
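for i, result in enumerate(entities_from_doc):
    res = result[0]['generated_text']
    try:
        res_list = ast.literal_eval(res)
    except (ValueError, SyntaxError):
        # Skip chunks whose output is not a valid Python literal
        continue
    if isinstance(res_list, list):
        all_potential_entities.update(res_list)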

NameError: name 'entities_from_doc' is not defined

In [13]:
all_potential_entities

set()

In [7]:
result_from_pubmed = get_context("Wilson disease", "pubmed")

# Dice