## AI Doctor Assistant
***
* Here we demonstrate how to use Gemini's long context window to read the EMA (European Medicines Agency) overview documents of many medicines and answer questions from doctors and patients.

* The data comes from the [EMA](https://www.ema.europa.eu/en/medicines/download-medicine-data) website. The list of medicines was filtered to keep only the authorised human medicines, and the English overview PDF of each medicine was then obtained by web scraping. [Here is](https://www.ema.europa.eu/en/documents/overview/qdenga-epar-medicine-overview_en.pdf) an example of such a PDF file.

* Warning: The AI's responses should not be considered a substitute for consultation with a qualified healthcare professional.

## Import Python Packages

In [1]:
import os
import time
import shutil
import requests
import datetime
import pandas as pd
import urllib.request
from tqdm import tqdm
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import google.generativeai as genai
from IPython.display import Markdown
from google.generativeai import caching
from kaggle_secrets import UserSecretsClient

## Get the Data

In [2]:
medicines_df = pd.read_excel('/kaggle/input/medicines/ema_medicines.xlsx')
medicines_df['file_downloaded'] = ''
medicines_df.head()

| | Name of medicine | Pharmacotherapeutic group | Medicine URL | file_downloaded |
|---|---|---|---|---|
| 0 | Pylclari | Diagnostic radiopharmaceuticals | https://www.ema.europa.eu/en/medicines/human/E... | |
| 1 | Ganfort | Ophthalmologicals | https://www.ema.europa.eu/en/medicines/human/E... | |
| 2 | Lumigan | Prostaglandin analogues | https://www.ema.europa.eu/en/medicines/human/E... | |
| 3 | MenQuadfi | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 4 | Celsunax | Diagnostic radiopharmaceuticals | https://www.ema.europa.eu/en/medicines/human/E... | |


In [3]:
print('There are', len(medicines_df), 'medicines.')

There are 1437 medicines.


In [4]:
print('There are', medicines_df['Pharmacotherapeutic group'].nunique(), 'pharmacotherapeutic groups.')

There are 157 pharmacotherapeutic groups.


* With 1437 medicines, loading everything is not feasible: the PDF files would add up to thousands of pages. Our strategy is therefore to filter by the pharmacotherapeutic group we want to ask questions about. For example, to get information about vaccines, we load only the medicines whose group name contains "vaccine".

* Note: You can choose any other group by changing the line below.

In [5]:
group = 'vaccine'

In [6]:
# Filter only the group chosen above
medicines_df = medicines_df[medicines_df["Pharmacotherapeutic group"].str.lower().str.contains(group, na=False)]
medicines_df = medicines_df.reset_index(drop=True)
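
The filter above is a case-insensitive substring match, which is why groups such as "Covid-19 vaccines" and "Other viral vaccines" are selected alongside "Vaccines". The sketch below reproduces that rule in plain Python on a small hand-written list. The helper name `matches_group` is hypothetical, and missing group names are treated as non-matches, mirroring `na=False`. One difference worth keeping in mind: pandas `str.contains` interprets the pattern as a regular expression by default, so a keyword containing regex metacharacters would behave differently from this literal match.

```python
from typing import Optional


def matches_group(group_name: Optional[str], keyword: str) -> bool:
    """Return True if `keyword` occurs in `group_name`, ignoring case.

    Hypothetical helper: a missing group name (None) never matches.
    """
    if group_name is None:
        return False
    return keyword.lower() in group_name.lower()


# Hand-written sample of group names taken from the tables in this notebook
groups = ["Vaccines", "Covid-19 vaccines", "Ophthalmologicals", None, "Other viral vaccines"]
selected = [g for g in groups if matches_group(g, "vaccine")]
print(selected)
```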

In [7]:
medicines_df.head(10)

| | Name of medicine | Pharmacotherapeutic group | Medicine URL | file_downloaded |
|---|---|---|---|---|
| 0 | MenQuadfi | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 1 | M-M-RVaxPro | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 2 | Qdenga | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 3 | Aflunov | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 4 | Gardasil | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 5 | Nimenrix | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 6 | Nuvaxovid | Covid-19 vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 7 | Imvanex | Other viral vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 8 | Comirnaty | Vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |
| 9 | Gardasil 9 | Papillomavirus vaccines | https://www.ema.europa.eu/en/medicines/human/E... | |


In [8]:
print('There are', len(medicines_df), 'vaccines.')

There are 56 vaccines.


## Web Scraping

Now we need to scrape the EMA website to download the overview PDF file of each medicine in the selected group (vaccines).

In [9]:
# Create a folder to download the files
folder_location = r'ema_medicines'
os.makedirs(folder_location, exist_ok=True)

In [10]:
# Create the function to download the English overview files
downloads = 0

def download_files():
    """
    Function to download the pdf file of each medicine.
    """
    global medicines_df, downloads
    
    for i in tqdm(range(medicines_df.shape[0]), desc="Downloading files"):
        if medicines_df.loc[i,'file_downloaded'] != '':
            continue
        url = medicines_df.loc[i,'Medicine URL']
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        # Look at every link to an English PDF and keep only the overview/summary document
        for link in soup.select("a[href$='en.pdf']"):
            filename = os.path.join(folder_location, link['href'].split('/')[-1])
            if not any(marker in filename for marker in (
                    'epar-medicine-overview', 'epar-medicines-overview',
                    'epar-overview', 'epar-summary-public')):
                continue
            urllib.request.urlretrieve(urljoin(url, link['href']), filename)
            # Store only the file name, independent of the path separator
            medicines_df.loc[i, 'file_downloaded'] = os.path.basename(filename)
            downloads += 1
            time.sleep(7)
            break
    print('downloads:', downloads)
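
The link selection inside `download_files` combines two checks: the link must end in `en.pdf`, and the file name must contain one of four overview markers. As a sketch, that rule can be pulled out into a small pure function, which makes it easy to try against individual links without touching the network. The name `is_overview_pdf` and the constant `OVERVIEW_MARKERS` are hypothetical, and the third link in the demo is an invented counterexample, not a real EMA URL.

```python
import posixpath
from urllib.parse import urlparse

# Markers used in this notebook to recognise the overview/summary document
OVERVIEW_MARKERS = (
    "epar-medicine-overview",
    "epar-medicines-overview",
    "epar-overview",
    "epar-summary-public",
)


def is_overview_pdf(href: str) -> bool:
    """Return True if `href` points to an English overview PDF (hypothetical helper)."""
    name = posixpath.basename(urlparse(href).path)
    return name.endswith("en.pdf") and any(marker in name for marker in OVERVIEW_MARKERS)


links = [
    "https://www.ema.europa.eu/en/documents/overview/qdenga-epar-medicine-overview_en.pdf",
    "twinrix-adult-epar-summary-public_en.pdf",
    "example-epar-product-information_en.pdf",  # invented link, should be rejected
]
for link in links:
    print(is_overview_pdf(link), link)
```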

In [11]:
# Set a number of attempts and start downloading
retry = 5
for k in range(retry):
    if downloads == medicines_df.shape[0]:
        break
    download_files()

Downloading files: 100%|██████████| 56/56 [06:53<00:00,  7.39s/it]

downloads: 56





In [12]:
# Get a list of the downloaded files
pdf_files = os.listdir('/kaggle/working/ema_medicines')
print('The first two files:', pdf_files[:2])

The first two files: ['abrysvo-epar-medicine-overview_en.pdf', 'twinrix-adult-epar-summary-public_en.pdf']


In [13]:
print('There are', len(pdf_files), 'pdf files.')

There are 56 pdf files.


## Create a single file

We will use pymupdf to merge all the pdf files into a single file for easier manipulation.

In [14]:
!pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.24.14


In [15]:
# Merge pdf files
import pymupdf

doc = pymupdf.open()
for filename in pdf_files:
    doc.insert_file(f'/kaggle/working/ema_medicines/{filename}')

doc.save('medicines.pdf')
doc.close()

In [16]:
pdf_file = '/kaggle/working/medicines.pdf'

## Authenticate with Google Generative AI

In [17]:
user_secrets = UserSecretsClient()
ai_studio_token = user_secrets.get_secret("ai_studio_token")
genai.configure(api_key=ai_studio_token)

## Define Helper Functions

In [18]:
def upload_to_gemini(file, mime_type=None):
    """Uploads the given files to Gemini.

    See https://ai.google.dev/gemini-api/docs/prompting_with_media
    """
    file = genai.upload_file(file, mime_type=mime_type)
    print(f"Uploaded file '{file.display_name}' as: {file.uri}")
    return file

def wait_for_files_active(files):
    """Waits for the given files to be active.
    
    Some files uploaded to the Gemini API need to be processed before they can be
    used as prompt inputs. The status can be seen by querying the file's "state"
    field.
    
    This implementation uses a simple blocking polling loop. Production code
    should probably employ a more sophisticated approach.
    """
    for name in (file.name for file in files):
        file = genai.get_file(name)
        while file.state.name == "PROCESSING":
            print(".", end="", flush=True)
            time.sleep(10)
            file = genai.get_file(name)
        if file.state.name != "ACTIVE":
            raise Exception(f"File {file.name} failed to process")
    print("All files ready")
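
The docstring of `wait_for_files_active` notes that production code should probably use something more sophisticated than a fixed 10-second polling loop. One possible direction, sketched below, is to add an overall timeout and an exponentially growing delay. This is only an illustration: `wait_until_active` and its parameters are hypothetical, and the state is supplied through a callable so the sketch runs here without the Gemini API.

```python
import time
from typing import Callable


def wait_until_active(
    get_state: Callable[[], str],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `get_state` until it returns "ACTIVE" (hypothetical helper).

    Raises RuntimeError on any state other than "PROCESSING" or "ACTIVE",
    and TimeoutError if the deadline passes while still processing.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        state = get_state()
        if state == "ACTIVE":
            return
        if state != "PROCESSING":
            raise RuntimeError(f"File failed to process (state={state})")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"File still processing after {timeout_s} seconds")
        sleep(delay)
        # Double the wait between polls, up to a ceiling
        delay = min(delay * 2, max_delay_s)


# Demo with a scripted sequence of states and a fake sleep that records the delays
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
recorded_delays = []
wait_until_active(lambda: next(states), sleep=recorded_delays.append)
print(recorded_delays)
```

With the functions used in this notebook, the callable could be `lambda: genai.get_file(file.name).state.name`.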

## Upload the Medicine Files to Gemini

In [19]:
files = [upload_to_gemini(pdf_file, mime_type="application/pdf")]

wait_for_files_active(files)

Uploaded file 'medicines.pdf' as: https://generativelanguage.googleapis.com/v1beta/files/2v8l2dsswlq4
All files ready


## Load the Gemini 1.5 model

In [20]:
# Create a cache with a 15-minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-002',
    display_name='medicines',
    system_instruction=(
        'You are a doctor, and your job is to answer '
        'the user\'s query based on the file you have access to.'
    ),
    contents=files,
    ttl=datetime.timedelta(minutes=15),
)

In [21]:
# Construct a GenerativeModel which uses the created cache

generation_config = {
  "temperature": 0,
  "top_p": 0.7,
  "top_k": 32,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel.from_cached_content(cached_content=cache, generation_config=generation_config)
chat_session = model.start_chat()

In [22]:
response = model.count_tokens(files)
print(f"Token Count: {response.total_tokens}")

Token Count: 296420
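
To get a feel for why caching matters at this size, the sketch below does the simple arithmetic: without a cache, every request has to carry the full document again, so the document tokens are multiplied by the number of requests. The document size comes from the count above and the number of requests matches the nine questions asked in the next section. The function name is hypothetical, and pricing is deliberately left out because it depends on the model and the current rates.

```python
DOCUMENT_TOKENS = 296_420  # token count reported above
NUM_QUESTIONS = 9          # number of questions asked in the next section


def document_tokens_resent(document_tokens: int, num_requests: int) -> int:
    """Document tokens sent in total if each request re-sends the whole document."""
    return document_tokens * num_requests


total = document_tokens_resent(DOCUMENT_TOKENS, NUM_QUESTIONS)
print(f"Without caching, {NUM_QUESTIONS} requests would carry {total:,} document tokens.")
```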


***

## Ask Gemini 1.5 Questions About Medicines and Diseases

In [23]:
response = chat_session.send_message("Tell me about what you read in one sentence.")
Markdown(response.text)

The provided text contains summaries of European Public Assessment Reports (EPARs) for various vaccines, detailing their uses, administration, mechanisms of action, benefits, risks, and authorization statuses within the EU.


In [24]:
response = chat_session.send_message("Please write a summary of Supemtek.")
Markdown(response.text)

Supemtek is a quadrivalent influenza vaccine (containing four different influenza strains) given as a single intramuscular injection to adults.  Studies showed it to be at least as effective as a comparator vaccine in preventing influenza, with mostly mild side effects.  It received EU marketing authorization in November 2020.


In [25]:
response = chat_session.send_message("Where on the body should I get the Supemtek vaccine?")
Markdown(response.text)

The Supemtek vaccine should be injected into a muscle, preferably in the upper arm.


In [26]:
response = chat_session.send_message("What is the recommended dose for Fendrix?")
Markdown(response.text)

The recommended dose for Fendrix is four injections given one month apart, with the fourth injection given four months after the third.


In [27]:
response = chat_session.send_message("What are the risks associated with Aflunov?")
Markdown(response.text)

The most common side effects of Aflunov in adults are headache, myalgia, injection site reactions, tiredness, malaise, and chills.  Children may experience additional side effects such as nausea, diarrhea, and vomiting.  Aflunov should not be given to those with severe allergic reactions to its components or to those with a severe sudden fever.  The full list of side effects and restrictions can be found in the package leaflet.


In [28]:
response = chat_session.send_message("Please list the vaccines for Covid-19.")
Markdown(response.text)

Based on the provided text, the COVID-19 vaccines mentioned are:

* **Nuvaxovid:**  This vaccine and its adapted versions (XBB.1.5 and JN.1) target specific variants of the SARS-CoV-2 virus.
* **Comirnaty:** This vaccine and its adapted versions (Original/Omicron BA.4-5, Omicron XBB.1.5, JN.1, and KP.2) target specific variants of the SARS-CoV-2 virus.
* **Spikevax:** This vaccine and its adapted versions (bivalent Original/Omicron BA.1, bivalent Original/Omicron BA.4-5, XBB.1.5, and JN.1) target specific variants of the SARS-CoV-2 virus.
* **Bimervax:** This vaccine targets the Alpha and Beta variants of the SARS-CoV-2 spike protein.

Please note that this list may not be exhaustive of all COVID-19 vaccines available.  The document only covers those specifically mentioned.


In [29]:
response = chat_session.send_message("What are the recommended vaccines to prevent dengue? Give me only their names.")
Markdown(response.text)

Qdenga and Dengvaxia


In [30]:
response = chat_session.send_message("Can pregnant women be vaccinated with Qdenga?")
Markdown(response.text)

No, the provided text states that Qdenga should not be used in women who are pregnant or breastfeeding.


In [31]:
response = chat_session.send_message("In which country is a person more likely to be infected with dengue: Brazil or USA? Consider that only Brazil is a tropical country.")
Markdown(response.text)

Based solely on the fact that Brazil is a tropical country and the USA is not, a person would be more likely to be infected with dengue in Brazil.  However, this is a simplification and the actual risk depends on many factors beyond climate, including specific geographic location within each country, time of year, and mosquito control measures.
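
The question cells above repeat the same two lines each time. A small helper could send a list of questions in order and collect the answers. The sketch below assumes only that the chat object has a `send_message` method returning an object with a `.text` attribute, as `chat_session` does in this notebook. `ask_all`, `FakeChat`, and `FakeResponse` are hypothetical names, and the fake classes exist only so the example runs without an API key.

```python
from typing import Iterable, List, Tuple


def ask_all(chat, questions: Iterable[str]) -> List[Tuple[str, str]]:
    """Send each question to `chat` in order and return (question, answer) pairs."""
    results = []
    for question in questions:
        response = chat.send_message(question)
        results.append((question, response.text))
    return results


class FakeResponse:
    """Stand-in for an API response; only the `.text` attribute is modelled."""

    def __init__(self, text: str) -> None:
        self.text = text


class FakeChat:
    """Stand-in for a chat session so the sketch runs offline."""

    def send_message(self, message: str) -> FakeResponse:
        return FakeResponse(f"(answer to: {message})")


questions = [
    "Please write a summary of Supemtek.",
    "What is the recommended dose for Fendrix?",
]
for question, answer in ask_all(FakeChat(), questions):
    print(f"Q: {question}\nA: {answer}\n")
```

In the notebook itself, `ask_all(chat_session, questions)` would take the place of the fake chat.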


## Conclusion
***

Here we demonstrated an interesting use case of Gemini's long context window: an assistant that can help doctors and patients get information from a trusted data source (the European Medicines Agency in this case). The answers to the sample questions were accurate enough even with a very simple prompt. Such a use case has the potential to positively impact public health and lives.

The long context window let the model answer directly from roughly 296,000 tokens of source files, without the need for other techniques such as retrieval-augmented generation (RAG). We also used context caching, which is a great feature for reducing operational costs in scenarios where a substantial initial context is referenced repeatedly by shorter requests.

Many thanks to the Google team for providing an excellent model like Gemini and an API that can be used for free. The use cases that can emerge with Gemini from now on will certainly be amazing.