## Document/text processing and embedding creation

* PDF document of choice (or other file types)
* Embedding model of choice

1. Import PDF document
2. Process text for embedding (split into chunks of sentences)
3. Embed text chunks with embedding model
4. Save embeddings to files for later use
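
Step 4 can take many forms. As a minimal sketch, assuming each chunk is a dict that holds its text and an embedding stored as a plain list of floats, the records could be written to a JSON Lines file and read back later. The helper names `save_records` and `load_records`, the file name, and the toy three-dimensional vectors are hypothetical and only illustrate the round trip.

```python
import json
import os
import tempfile


def save_records(records: list[dict], path: str) -> None:
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_records(path: str) -> list[dict]:
    """Read the records back in the order they were written."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy data: real embeddings would come from the embedding model.
records = [
    {"page_number": 0, "sentence_chunk": "First chunk.", "embedding": [0.1, 0.2, 0.3]},
    {"page_number": 1, "sentence_chunk": "Second chunk.", "embedding": [0.4, 0.5, 0.6]},
]

path = os.path.join(tempfile.mkdtemp(), "chunks_and_embeddings.jsonl")
save_records(records, path)
restored = load_records(path)
```

Embeddings returned as NumPy arrays would need converting with `.tolist()` before being serialized this way.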

In [1]:
import os
import requests

pdf_path = "A MANUAL OF ADVERSE DRUG INTERACTIONS.pdf"

if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    url = "http://repo.upertis.ac.id/1645/1/A%20MANUAL%20OF%20ADVERSE%20DRUG%20INTERACTIONS.pdf"

    filename = pdf_path

    response = requests.get(url)

    # if successful
    if response.status_code == 200:
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
    print(f"File {pdf_path} exists.")

File A MANUAL OF ADVERSE DRUG INTERACTIONS.pdf exists.


In [2]:
import pymupdf 
from tqdm.auto import tqdm


def text_formatter(text: str) -> str:
    """Performs pre-processing text formatting"""
    cleaned_text = text.replace("\n", " ").strip()
    cleaned_text = " ".join(cleaned_text.split())
    cleaned_text = cleaned_text.replace("/", " | ")
    return cleaned_text 

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns the cleaned text and basic statistics for each page."""
    doc = pymupdf.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        # Skip the first 15 PDF pages so that numbering starts at 0 on the 16th page
        if page_number < 15:
            continue
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_number": page_number - 14 - 1,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # rough estimate: ~4 characters per token
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)

pages_and_texts[238]

{'page_number': 238,
 'page_char_count': 1842,
 'page_word_count': 266,
 'page_sentence_count_raw': 11,
 'page_token_count': 460.5,
 'text': 'INTERACTIONS WITH SYMPATHOMIMETIC AMINES 239 Combination Interaction administration of dopamine should receive substantially reduced dosage of dopamine. The starting dose in such patients should be reduced to one tenth (1110) of the usual dose. MAOI | dopexamine (6) Dopexamine is contraindicated in patients receiving MA01 MAOIhoradrenaline (7, 8) This interaction may produce a hypertensive crisis. However, noradrenaline is very rapidly taken up from circulation by adrenergic nerves and is inactivated by COMT enzyme, Therefore since monamine oxidase is probably little involved in this metabolism, the interaction is less likely than with other sympathomimetic amines which rely upon monoamine oxidase for their inactivation. Although noradrenaline is less likely to induce a hypertensive episode in patients receiving MA01 antidepressants than other sy
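
The ` | ` separators in the page text above come from `text_formatter`, which replaces every `/` after collapsing newlines and repeated spaces. The sketch below repeats the function so it runs on its own and applies it to a made-up input string (the sample text is an assumption, not taken from the PDF).

```python
def text_formatter(text: str) -> str:
    """Performs pre-processing text formatting"""
    cleaned_text = text.replace("\n", " ").strip()
    cleaned_text = " ".join(cleaned_text.split())  # collapse runs of whitespace
    cleaned_text = cleaned_text.replace("/", " | ")
    return cleaned_text


# Hypothetical raw page text with a line break, a double space and a slash
raw = "MAOI/dopexamine\n(6)  Dopexamine is\ncontraindicated"
formatted = text_formatter(raw)
print(formatted)  # MAOI | dopexamine (6) Dopexamine is contraindicated
```

Note that the replacement applies to every slash, so fractions or units written with `/` are rewritten in the same way.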

In [3]:
'''
import openparse
basic_doc_path = pdf_path
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

op_pages_and_texts = []
for node in parsed_basic_doc.nodes:
    op_pages_and_texts.append(node)
'''

In [4]:
import random
random.sample(pages_and_texts, k=3)

[{'page_number': 395,
  'page_char_count': 1217,
  'page_word_count': 148,
  'page_sentence_count_raw': 13,
  'page_token_count': 304.25,
  'text': '396 A MANUAL OF ADVERSE DRUG INTERACTIONS Combination Antacids and Absorbants |  isoniazid Antibioticslisoniazid (cycloserine) (222) Anti-epileptics | isoniazid e.g. carbamazepine ethosuximide phenytoin (1 , 217-220) Benzodiazepines | isoniazid e.g. diazepam Xanthineshsoniazid (theophylline) (221) Interaction Reduced absorption of isoniazid. Increased CNS toxicity of cycloserine due to isoniazid inhibition of its metabolism. Isoniazid inhibits the hepatic metabolism of these anti- epileptics and increases risk of toxicity. Metabolism of diazepam is inhibited. Isoniazid increases plasma levels of theophylline. 6. Cycloserine Cycloserine is contraindicated in patients with epilepsy, depression, renal insuf- ficiency or alcohol abuse. Combination Interaction Alcohokycloserine (223) Cycloserine enhances the CNS effects of alcohol and concurren

In [5]:
import pandas as pd 

df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
| --- | --- | --- | --- | --- | --- |
| count | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 |
| mean | 324.5 | 1960.83 | 293.93 | 21.74 | 490.21 |
| std | 187.78 | 794.46 | 132.54 | 24.95 | 198.61 |
| min | 0.0 | 34.0 | 4.0 | 1.0 | 8.5 |
| 25% | 162.25 | 1718.0 | 240.0 | 11.0 | 429.5 |
| 50% | 324.5 | 2025.0 | 291.5 | 14.0 | 506.25 |
| 75% | 486.75 | 2432.75 | 365.75 | 18.75 | 608.19 |
| max | 649.0 | 3925.0 | 598.0 | 106.0 | 981.25 |


### Splitting pages into sentences

In [15]:
# !pip install spacy


In [6]:
from spacy.lang.en import English

nlp = English()

# Add spaCy's rule-based sentencizer to split texts into sentences
nlp.add_pipe("sentencizer")

# Create a document instance 
doc = nlp("This is a sent. This is another sent. This is a 3rd.")
assert len(list(doc.sents)) == 3 

list(doc.sents)


[This is a sent., This is another sent., This is a 3rd.]

In [7]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    item["page_sentence_count_spacy"] = len(item["sentences"])

In [8]:
random.sample(pages_and_texts, k=1)

[{'page_number': 0,
  'page_char_count': 59,
  'page_word_count': 9,
  'page_sentence_count_raw': 1,
  'page_token_count': 14.75,
  'text': 'PART 1 Commentary on Drug Interactions and Their Mechanisms',
  'sentences': ['PART 1 Commentary on Drug Interactions and Their Mechanisms'],
  'page_sentence_count_spacy': 1}]

In [9]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
| --- | --- | --- | --- | --- | --- | --- |
| count | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 |
| mean | 324.5 | 1960.83 | 293.93 | 21.74 | 490.21 | 20.33 |
| std | 187.78 | 794.46 | 132.54 | 24.95 | 198.61 | 22.66 |
| min | 0.0 | 34.0 | 4.0 | 1.0 | 8.5 | 1.0 |
| 25% | 162.25 | 1718.0 | 240.0 | 11.0 | 429.5 | 10.0 |
| 50% | 324.5 | 2025.0 | 291.5 | 14.0 | 506.25 | 14.0 |
| 75% | 486.75 | 2432.75 | 365.75 | 18.75 | 608.19 | 18.0 |
| max | 649.0 | 3925.0 | 598.0 | 106.0 | 981.25 | 103.0 |


### Chunking sentences into groups of n sentences

The group size is an arbitrary choice, and frameworks such as LangChain can help with this step. Chunking is useful because:

1. Texts are easier to filter.
2. Text chunks can fit into the embedding model's context window.
3. Contexts passed to the LLM can be more specific and focused.
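
Point 2 can be checked cheaply before embedding. The sketch below assumes the rough rule of thumb of about four characters per token and an assumed limit of 512 tokens. Both `MAX_TOKENS` and the helper names are illustrative, and the real limit depends on the embedding model you choose.

```python
MAX_TOKENS = 512  # assumed context window; check the model card of your embedding model
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> float:
    """Approximate the token count from the character count."""
    return len(text) / CHARS_PER_TOKEN


def fits_context(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= max_tokens


short_chunk = "Isoniazid increases plasma levels of theophylline."
long_chunk = "word " * 1000  # 5000 characters, roughly 1250 estimated tokens

short_fits = fits_context(short_chunk)
long_fits = fits_context(long_chunk)
```

For an exact count, the model's own tokenizer would be the better source; the heuristic only gives a quick first filter.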

In [10]:
num_sentence_chunk_size = 5

def split_list(input_list: list[str], 
               slice_size: int = num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4],
 [5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14],
 [15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
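
The test list above divides evenly. When the number of items is not a multiple of the slice size, the final slice is simply shorter, as this small sketch shows. The function is repeated here, with a literal default of 5, so that the snippet runs on its own.

```python
def split_list(input_list: list, slice_size: int = 5) -> list[list]:
    """Split a list into consecutive slices of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


chunks = split_list(list(range(13)))
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # [5, 5, 3]
```

A page with 14 sentences therefore yields three chunks, and the trailing chunk can be much shorter than the others.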

In [11]:
# Split each page's sentences into chunks

for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"], slice_size=num_sentence_chunk_size)

    item["num_chunks"] = len(item["sentence_chunks"])

random.sample(pages_and_texts, k=1)

[{'page_number': 573,
  'page_char_count': 2551,
  'page_word_count': 394,
  'page_sentence_count_raw': 11,
  'page_token_count': 637.75,
  'text': '574 A MANUAL OF ADVERSE DRUG INTERACTIONS does not always imply that a lesser amount of drug is absorbed, but rather that the time for a drug to reach peak levels after a single dose in lengthened ((11). Vitamin and mineral supplements are a common combination and are readily obtained as non-prescription medicines. The possibility of their interaction with prescribed medicines may not be appreciated. It is the mineral component of the combination product, as well as the mineral content of antacids and some laxatives that is responsible for the majority of these interactions. It should not be forgotten that many foods are rich in minerals and these may also enter into interactions (e.g. with tetracyclines and dairy products), in addition branded breakfast cereals have added minerals and vitamins. A list showing the mineral content of common

In [12]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 |
| mean | 324.5 | 1960.83 | 293.93 | 21.74 | 490.21 | 20.33 | 4.5 |
| std | 187.78 | 794.46 | 132.54 | 24.95 | 198.61 | 22.66 | 4.49 |
| min | 0.0 | 34.0 | 4.0 | 1.0 | 8.5 | 1.0 | 1.0 |
| 25% | 162.25 | 1718.0 | 240.0 | 11.0 | 429.5 | 10.0 | 2.0 |
| 50% | 324.5 | 2025.0 | 291.5 | 14.0 | 506.25 | 14.0 | 3.0 |
| 75% | 486.75 | 2432.75 | 365.75 | 18.75 | 608.19 | 18.0 | 4.0 |
| max | 649.0 | 3925.0 | 598.0 | 106.0 | 981.25 | 103.0 | 21.0 |


### Splitting each chunk into its own item

Giving each chunk its own record provides the granularity needed to inspect individual text samples.

In [13]:
import re 
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences into one paragraph. The sentences are concatenated without a
        # separator, so the regex below restores the space after a full stop that is
        # directly followed by a capital letter (".A" -> ". A").
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: ~4 characters per token

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

2928

In [14]:
random.sample(pages_and_chunks, k = 1)


[{'page_number': 342,
  'sentence_chunk': 'Changes in plasma concentration of phenytoin when administered concurrently with ciprofloxacin have been anticipated by the manufacturer (Miles Inc, package insert) but this is the first published report of such an interaction. The mechanism involved was thought to be ciprofloxacin’s induction of the P450 oxidative enzymes responsible for phenytoin’s metabolism. Nightly administration of dichloralphenazone significantly increased the clearance of phenytoin in an epileptic patient due to induced microsomal enzyme activity with loss of epileptic control. The interaction has been confirmed in healthy subjects. Phenobarbitone induces liver microsomal enzymes and thus increases the metabolism of phenytoin.',
  'chunk_char_count': 710,
  'chunk_word_count': 95,
  'chunk_token_count': 177.5}]

In [15]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
| --- | --- | --- | --- | --- |
| count | 2928.0 | 2928.0 | 2928.0 | 2928.0 |
| mean | 319.92 | 433.27 | 64.0 | 108.32 |
| std | 176.5 | 413.84 | 58.07 | 103.46 |
| min | 0.0 | 2.0 | 1.0 | 0.5 |
| 25% | 172.0 | 140.0 | 26.0 | 35.0 |
| 50% | 323.5 | 269.0 | 41.0 | 67.25 |
| 75% | 468.0 | 668.0 | 94.0 | 167.0 |
| max | 649.0 | 3387.0 | 518.0 | 846.75 |


In [16]:
# Inspect a sample of short chunks that may be too small to be useful
min_token_length = 42
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f"Chunk token count: {row['chunk_token_count']} | Text: {row['sentence_chunk']}")


Chunk token count: 32.25 | Text: 1969) Br Med J 1, 845.35 Humberstone PM. (1969) Br Med J 1, 846.36 Committee on Safety of Medicines. (1988) Curr Problems, NO 22.
Chunk token count: 31.25 | Text: 84 Silver BA, Bell WSR. (1979) Ann Intern Med 90, 348.85 Wallin BA et al. (1979) Ann Intern Med 90, 993.86 Breckenridge AM. (
Chunk token count: 25.5 | Text: 58 Kater RMH et al. (1969) Am J Med Sci 258, 35.59 Kater RMH et al. (1969) JAMA 207, 363.60 Iber FL. (
Chunk token count: 28.75 | Text: 1977) Br Med J 2, 773.24 Serlin MJ et al. (1980) Br J Clin Pharmacol9, 287.25 Orme M et al. (1976) Br Med J 1, 200.
Chunk token count: 40.5 | Text: Verapamil should not be used in the treatment of Wolff-Parkinson-White syndrome. Verapamil should not be injected into patients recently treated with p-adrenergic


In [17]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 2,
  'sentence_chunk': 'INTRODUCTION: A WIDENING PROBLEM Over 20 years ago, an editorial on drug interactions in the Lancet (19 April 1975) said that the “publication of huge lists and tables will induce in doctors a drug- interaction-anxiety syndrome and lead to therapeutic paralysis”. This prediction has not come about, although the problem of drug interactions is still with us and the spectrum is widening as new drugs are introduced. Indeed it could be said that the nature of the problem has also widened for in the intervening years drug interactions have come to embrace interactions with food and with herbal medicines as well as the more numerous and better recognized drug-drug interactions. There is little doubt that drug-drug interactions can often be serious, even life- threatening. They can also be very expensive, and evidence from the Medical Defence Unions’ Reports of over 10 years ago reveals that one case which settled for &44 000 was due to phenylbutazone-

In [18]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 568,
  'sentence_chunk': 'A disulfiram-like reaction also occurs with the parenteral cephalosporin, cephamandole (68), but not with other cephaloporins available in the UK. Patients should abstain from alcohol while taking the antibiotic',
  'chunk_char_count': 211,
  'chunk_word_count': 29,
  'chunk_token_count': 52.75}]

### Embedding text chunks to numerical representations

In [20]:
from sentence_transformers import SentenceTransformer
embedding_model = model = SentenceTransformer('BAAI/bge-large-en-v1.5',
                                             device="cuda")
print(f"Model is on: {embedding_model.device}")

sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]
# capture the meanings behind sentences through embedding

embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")

Model is on: cuda:0
Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-0.00298849  0.02823196 -0.03635975 ...  0.00782592 -0.00111846
  0.01117564]

Sentence: Sentences can be embedded one by one or as a list of strings.
Embedding: [-0.00834338 -0.01538954 -0.01188452 ... -0.00581899 -0.00116553
  0.0249307 ]

Sentence: Embeddings are one of the most powerful concepts in machine learning!
Embedding: [ 0.0153289   0.01436272 -0.00261183 ...  0.01016328 -0.01206179
  0.00810741]

Sentence: Learn to use embeddings well and you'll be well on your way to being an AI engineer.
Embedding: [ 0.00196116  0.05934048 -0.01474938 ... -0.02595543  0.01763771
  0.00390331]



In [21]:
import torch
print(torch.cuda.is_available())  # Should return True if GPU is available

True
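
The model above is loaded with `device="cuda"` before GPU availability is checked. As a small sketch of the alternative, the device could be chosen first, falling back to the CPU when no GPU (or no PyTorch install) is found. `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to a safe default.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The result could then be passed as `device=device` when constructing the `SentenceTransformer`, so the same notebook also runs on a machine without a GPU, only more slowly.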
