# MySL (My Science Liaison): a Literature-Informed, LLM-Powered Medical Science Liaison

## Context & Problem Statement
Medical Science Liaisons, or MSLs for short, are industry-facing medical scientists who have a wide variety of responsibilities in the pharmaceutical industry. One of these responsibilities is to maintain relationships with physicians, pharmacists, and other key opinion leaders (KOLs) in the medical community. These relationships are usually maintained through frequent interactions initiated by the MSL, in which the MSL answers detailed scientific questions about their company's products. Through these interactions, the MSL ensures that members of the medical community have up-to-date, scientifically accurate information.

Though MSLs usually hold advanced degrees in the life sciences and are more than capable of fulfilling the role of science communicator, the amount of effort required to be constantly available to their clients is disproportionately high. MSLs spend a significant portion of their time chasing interaction quotas and traveling to meet KOLs in person. Though not all of this travel is Q&A-related, even a small reduction in workload could allow MSLs to better allocate their time and efforts to more pressing problems, especially in a world where relevant medical information is readily available online. 

## Solution
This problem has a convenient and complementary solution in the form of Large Language Models (LLMs). These transformer-based models are trained on a diverse and substantial corpus of data, and are engineered to answer queries in natural language. This feature of LLMs resolves one of the primary sources of friction keeping doctors and pharmacists from simply searching for technical drug information themselves. Though pharmaceutical companies digitally publish all of the scientific information needed to answer frequently asked questions, their customers have trouble accessing it in a reliable and digestible format. For this reason, an MSL must dedicate a significant amount of time to reviewing publicly available drug information and relaying it to their clients.

My Science Liaison (MySL) is an LLM-powered chat application that aims to fill an MSL's role as science communicator. By using Retrieval Augmented Generation (RAG) pipelines to reference the most up-to-date info available, MySL is built to be scientifically informed and reliable. Instead of spending hours reviewing new literature, getting MySL up to speed on the latest product information takes as long as uploading/accessing a simple text document. To ensure interpretability, MySL's responses will even reference the source documents it used to create an answer, allowing the user to easily verify the information themselves in the case of model hallucinations. Future versions will also incorporate pretraining on scientific documents to tailor MySL's speech patterns and built-in knowledge to medical information. 

## Methods
MySL was written in Python, making extensive use of the LangChain package to facilitate the model's RAG pipelines. Though there is room to experiment with different LLMs, OpenAI's GPT-4o-mini was used in this particular implementation of the model simply because of its fast inference time and low cost. For commercial applications where cost is less of an issue and external compute resources are available, testing the use of larger models like GPT-4 from OpenAI and Claude 3.5 Sonnet from Anthropic is recommended.

MySL's knowledge base was compiled from a variety of different publicly available sources, and was curated to have specialized knowledge of some of the most popular ulcerative colitis (UC) medications. UC was chosen simply because I have a friend who currently works as an MSL in the UC indication, so I'll have someone to test the project out on. Pharmaceutical companies make prescribing information for their products available online, so for this implementation, documents were pulled from the healthcare provider (HCP) websites of Eli Lilly (mirikizumab/Omvoh), Janssen (ustekinumab/Stelara), and Takeda (vedolizumab/Entyvio).

## Miscellaneous Notes
This project was made as a minimum viable product, and as such does not contain many of the available features and improvements it should have before being used in production. Future improvements include, but are not limited to:  
* pretraining on relevant medical and pharmaceutical knowledge
* an intuitive user interface that shows chat history and links each reference to the specific passage in its source document
* testing on various RAG pipeline hyperparameters such as chunk size, similarity functions, query modification/translation methods, and more
* inclusion of conversation history in the model's context

## Model Implementation:
First, the necessary packages are imported and the tracing environment variable is set. The specific LLM powering the application is also initialized:

In [1]:
import getpass
import os
from langchain_openai import ChatOpenAI

os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
# os.environ["OPENAI_API_KEY"] = getpass.getpass()

llm = ChatOpenAI(model="gpt-4o-mini")


Next, the data is loaded in and cleaned:

In [2]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

def clean_text_string(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space.
    # str.split() with no arguments also drops leading/trailing whitespace and blank lines.
    return ' '.join(text.split())

# Load web data
data_loader = WebBaseLoader(
    web_paths=("https://uspl.lilly.com/omvoh/omvoh.html#pi",),
)
docs = data_loader.load()

# Add relevant metadata
docs[0].metadata["title"] = "Omvoh Prescribing Information"
docs[0].metadata["page"] = 0

# Clean web data
for doc in docs:
    doc.page_content = clean_text_string(doc.page_content)

# Display results
print(f"{docs[0].metadata}\n\n")
print(docs[0].page_content[:500])


USER_AGENT environment variable not set, consider setting it to identify your requests.


{'source': 'https://uspl.lilly.com/omvoh/omvoh.html#pi', 'title': 'Omvoh Prescribing Information', 'language': 'en', 'page': 0}


These highlights do not include all the information needed to use OMVOH safely and effectively. See full prescribing information for OMVOH. OMVOH (mirikizumab-mrkz) injection, for intravenous or subcutaneous use Initial U.S. Approval: 2023 OMVOH- mirikizumab-mrkz injection, solution Eli Lilly and Company ---------- HIGHLIGHTS OF PRESCRIBING INFORMATION These highlights do not include all the information needed to use OMVOH safely and effectively. See full prescribing information for OMVOH. OMVOH


In [3]:
from langchain_community.document_loaders import PyPDFLoader

# Loading PDF files

pdf_paths = {"Entyvio Prescribing Information": "data/ENTYVIOPI.pdf", "Stelara Prescribing Information": "data/STELARA-pi.pdf"}
loaded_pdfs = []

for title, path in pdf_paths.items():
    loader = PyPDFLoader(path)
    # Top-level `async for` runs in a notebook cell; a plain script would need an async function
    async for page in loader.alazy_load():
        page.metadata["title"] = title
        page.metadata["language"] = "en"
        loaded_pdfs.append(page)

print(f"{len(pdf_paths)} PDFS LOADED; {len(loaded_pdfs)} TOTAL PAGES: \n\n")
for page in loaded_pdfs:
    print(f"{page.metadata}\n\n")

# Consolidate all docs in existing docs var
docs.extend(loaded_pdfs)
    


2 PDFS LOADED; 32 TOTAL PAGES: 


{'source': 'data/ENTYVIOPI.pdf', 'page': 0, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 1, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 2, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 3, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 4, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 5, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 6, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 7, 'title': 'Entyvio Prescribing Information', 'language': 'en'}


{'source': 'data/ENTYVIOPI.pdf', 'page': 8, 'title': 'Entyvio Prescribing Information', 'language': 'e

After that, the documents are split using recursive splitting so as to conserve token space in calls to the LLM. They are then indexed in a vector store: 

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

str(len(splits[0].page_content))+" characters"

'996 characters'
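
To make the roles of `chunk_size` and `chunk_overlap` a bit more concrete, here is a simplified sketch of overlapping windows over a string. This is an illustration only, not how `RecursiveCharacterTextSplitter` works internally (the real splitter tries to break on separators such as paragraph breaks, line breaks, and spaces before cutting mid-word), and `sliding_window_chunks` is a hypothetical helper that is not part of the project code.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration: a real recursive splitter prefers to break
    on natural separators instead of at fixed character offsets.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last four characters of the one before it, which is the point of the overlap: a sentence that straddles a chunk boundary still appears whole in at least one chunk.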

A document retriever is made from the vector store, and a system prompt template is then written:

In [5]:
from langchain_core.prompts import ChatPromptTemplate

retriever = vectorstore.as_retriever(search_type="mmr")

system_prompt = (
    "You are an expert in medical pharmacology. You work as a "
    "Medical Science Liaison, and your job is to answer technical "
    "questions from other medical professionals while being as"
    " scientifically accurate as possible. You specialize in the Ulcerative"
    " Colitis indication, and you focus on three competing medications:"
    " Omvoh, Stelara, and Entyvio. Use the following pieces "
    "of retrieved context to answer any questions you receive."
    " If you don't know the answer, say that you don't know. "
    "Politely REFUSE TO ANSWER any questions"
    " unrelated to pharmacology and medicine."
    "\n\n"
    "CONTEXT:\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

print(prompt.messages[0].prompt.template)

You are an expert in medical pharmacology. You work as a Medical Science Liaison, and your job is to answer technical questions from other medical professionals while being as scientifically accurate as possible. You specialize in the Ulcerative Colitis indication, and you focus on three competing medications: Omvoh, Stelara, and Entyvio. Use the following pieces of retrieved context to answer any questions you receive. If you don't know the answer, say that you don't know. Politely REFUSE TO ANSWER any questions unrelated to pharmacology and medicine.

CONTEXT:
{context}


The RAG chain is then defined:

In [6]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

qa_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, qa_chain)


Finally, the user's prompt is entered and processed: 

"What are some alternatives to Omvoh for treating Ulcerative Colitis, and how do they compare in terms of dosage?"

In [7]:
query = "What are some alternatives to Omvoh for treating Ulcerative Colitis, and how do they compare in terms of dosage?"
output = rag_chain.invoke({"input": query})

In [8]:
print("RESPONSE:\n" + output["answer"])

RESPONSE:
Some alternatives to OMVOH for the treatment of moderately to severely active ulcerative colitis include Stelara (ustekinumab) and Entyvio (vedolizumab).

1. **Stelara (ustekinumab)**:
   - **Dosage**: The initial dose is administered as a weight-based intravenous infusion (typically 260 mg for those weighing up to 100 kg, 390 mg for those weighing 100 kg to 120 kg, and 520 mg for those over 120 kg) followed by a subcutaneous injection at Week 8 and then every 12 weeks thereafter.

2. **Entyvio (vedolizumab)**:
   - **Dosage**: The initial dose is given as an intravenous infusion of 300 mg at Weeks 0, 2, and 6, followed by maintenance infusions every 8 weeks.

These medications have different mechanisms of action and dosing schedules, which may influence their selection based on individual patient needs and response to therapy. Always consult with a healthcare provider for personalized treatment plans.


In [9]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

print("CONTEXT: \n" + format_docs(output["context"]))

CONTEXT: 
trials are conducted under widely varying conditions, adverse reaction rates observed in the clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice. OMVOH was studied up to 12 weeks in subjects with moderately to severely active ulcerative colitis in a randomized, double-blind, placebo-controlled induction study (UC-1). In subjects who responded to induction therapy in UC-1, long term safety up to 52 weeks was evaluated in a randomized, double-blind, placebo-controlled maintenance study (UC-2) and a long-term extension study [see Clinical Studies (14)]. In the induction study (UC-1), 1279 subjects were enrolled of whom 958 received OMVOH 300 mg administered as an intravenous infusion at Weeks 0, 4, and 8. In the maintenance study (UC-2), 581 subjects were enrolled of whom 389 received OMVOH 200 mg administered as a subcutaneous injection every 4 weeks. Table 1 summarizes the adve
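
Because every chunk keeps the metadata attached during loading (`title`, `page`, `source`), a list of references could be assembled straight from the retrieved context. The sketch below shows one way that might look. `Chunk` is a stand-in for the retrieved document objects and `format_sources` is a hypothetical helper; neither is part of the project code.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(chunks):
    """Build a de-duplicated reference list from chunk metadata."""
    seen = set()
    lines = []
    for chunk in chunks:
        title = chunk.metadata.get("title", "Unknown source")
        page = chunk.metadata.get("page")  # page index as stored by the loader
        if (title, page) in seen:
            continue  # several chunks can come from the same page
        seen.add((title, page))
        lines.append(f"- {title}" if page is None else f"- {title}, page {page}")
    return "\n".join(lines)


example_chunks = [
    Chunk("...", {"title": "Omvoh Prescribing Information", "page": 0}),
    Chunk("...", {"title": "Entyvio Prescribing Information", "page": 3}),
    Chunk("...", {"title": "Omvoh Prescribing Information", "page": 0}),
]
print(format_sources(example_chunks))
```

Since the function only reads a `.metadata` attribute, it should also accept the documents in `output["context"]`, though that has not been tested here.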

## Discussion of Preliminary Results

There is a lot to observe from our first round of results. After fact-checking the response, the dosage information MySL gave seems to be correct for all medications. All of the facts, from dosage amounts to frequencies and administration methods, can be verified in the indexed source documents. Mission success, right?

Sort of.

If we take a closer look at the retrieved chunks from the RAG pipeline, we can see that only Omvoh was mentioned in the context. This presumably means that the model is relying on its training data for any dosage advice about non-Omvoh medications. Although this response ended up being correct, this is probably only because the necessary information was published before the model's knowledge cutoff and made it into its training data. In other words, we got lucky. For future queries, we want to ensure that any information necessary to answer the question is present in the context, or has been verifiably included in the model's training. LLMs are prone to "hallucinations" when they are missing information, and by relying on a black box of knowledge to make up for lack of context, we are increasing the probability that MySL gives false information to its users.
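
This kind of gap can also be caught automatically instead of by eye. The sketch below compares the medications named in an answer against the ones that actually appear in the retrieved chunks. It is a rough keyword check under the assumption that brand names are enough to identify a medication, and `drugs_mentioned` and `missing_drugs` are hypothetical helpers, not part of the project code.

```python
def drugs_mentioned(texts, drug_names):
    """Return the subset of drug_names found in any of the given texts."""
    combined = " ".join(texts).lower()
    return {name for name in drug_names if name.lower() in combined}


def missing_drugs(answer, chunk_texts, drug_names):
    """Drugs named in the answer that no retrieved chunk mentions."""
    in_answer = drugs_mentioned([answer], drug_names)
    in_context = drugs_mentioned(chunk_texts, drug_names)
    return sorted(in_answer - in_context)


drug_names = ["Omvoh", "Stelara", "Entyvio"]
answer = (
    "Some alternatives to OMVOH include Stelara (ustekinumab) "
    "and Entyvio (vedolizumab)."
)
chunk_texts = [
    "OMVOH was studied up to 12 weeks in subjects with moderately "
    "to severely active ulcerative colitis",
]
print(missing_drugs(answer, chunk_texts, drug_names))
```

A non-empty result means part of the answer is not backed by the retrieved context, which is a signal to warn the user or retry retrieval.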

## Possible Solutions to Missing Context

To decrease our chances of model hallucinations, we can take a few different approaches:
1. Make sure our model is trained on comprehensive and up-to-date info that covers any question our users might ask
2. Make sure this info is at least indexed in the RAG vector store
3. Assuming this info is present in our vector store, we can improve our RAG pipeline's retrieval methods to get a more relevant sampling of information from the index
4. Make sure our model does not answer anything it doesn't have the references for
5. Use web search as a fallback

Given that this implementation is an MVP, and that getting "comprehensive and up-to-date info that covers any question our users might ask" would take a lot of user surveying/data monitoring, we'll settle for improving the retrieval methods for the time being.

## Analysis of Retrieval Methods

Our current implementation does not use any query translation/modification steps to enhance its results. We retrieve with maximal marginal relevance (MMR), which selects chunks that are similar to the user's query while penalizing chunks that are redundant with ones already selected, so we should be getting a diverse mix of context chunks from our library. The only issue is that certain questions, like the example asked above, need more than just a good selection strategy to get the right context.
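
For readers unfamiliar with MMR, here is a minimal from-scratch sketch of the greedy selection it performs. It is not LangChain's implementation: the toy vectors are made up, and the `lambda_mult` name is borrowed from LangChain's retriever option only to keep the terminology consistent (1.0 means pure relevance, lower values favor diversity).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of k documents, in selection order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 1.0, 0.0]
doc_vecs = [
    [1.0, 0.2, 0.0],   # most relevant
    [1.0, 0.18, 0.0],  # near-duplicate of the first
    [0.1, 1.0, 0.0],   # slightly less relevant, but different
]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the third vector, while `lambda_mult=1.0` falls back to plain relevance ranking. Note that diversity is only enforced among chunks that are already close to the query, so MMR cannot surface chunks about medications the query never points toward.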

"What are some alternatives to Omvoh, and how do they compare in terms of dosage?" is an interesting question because although it explicitly asks the model to think of other medications besides Omvoh, Omvoh is the only medication mentioned in the query. The LLM itself understands this, but our context retrieval is a bit more naive. When running similarity calculations between a query and the indexed documents, the retriever returns the chunks whose embeddings are closest to the query's, which in practice favors chunks that share the query's terms and topic. Given that Omvoh is the only medication mentioned in the query, chunks about Omvoh are the most likely to be returned. We can see from the context returned in the results above that this is exactly what happened. The only reason Entyvio and Stelara were mentioned in the response is that they are named as competing medications in the system prompt.

Even though Omvoh is the only medication mentioned in the query, we as humans know that to answer the question correctly, we need context on other medications besides Omvoh. We would begin to attack this question by asking "what are the known alternatives to Omvoh?" Then, we would ask "what are the dosage recommendations for each alternative, and how do they compare to that of Omvoh?"  We can mimic this multi-query approach to answering the question by using Query Translation to augment our retrieval.
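
To see why a single query skews retrieval toward Omvoh, here is a toy ranking that uses word counts as a crude stand-in for embeddings. Real embeddings capture meaning beyond shared words, so this only illustrates the tendency. The three chunk strings are made-up placeholders, and `bag_of_words`, `cosine_similarity`, and `rank_chunks` are hypothetical helpers that are not part of the project code.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts, used here as a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_chunks(query, chunks):
    """Return (score, chunk) pairs sorted from most to least similar."""
    query_vec = bag_of_words(query)
    scored = [(cosine_similarity(query_vec, bag_of_words(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


query = "What are some alternatives to Omvoh, and how do they compare in terms of dosage?"
chunks = [
    "Stelara dosage and administration",
    "Omvoh dosage and administration",
    "Entyvio dosage and administration",
]
for score, chunk in rank_chunks(query, chunks):
    print(f"{score:.3f}  {chunk}")
```

The Omvoh chunk ranks first simply because the query names it, even though the question is really about the other two medications.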

## Implementation of Recursive Query Decomposition

Query Translation is a general term for taking the user's query and modifying/augmenting it somehow to improve retrieval results. One commonly used Query Translation method is called Query Decomposition. This approach takes the original query given to the model, instructs an LLM to create N sub-questions that would lead to an answer for the user's query, then feeds those to the retrieval system for additional context. We are going to answer the sub-questions recursively, so that each one has the Q&A pairs of the sub-questions before it as context. We will then feed the results of this process, along with the user's original query, to the model to construct an adequate answer (this time, with better context).
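
Stripped of any library, the control flow described above looks roughly like the sketch below. The `retrieve` and `generate` callables are placeholders for the retriever and the LLM call, and the fake versions used in the demo only report what they were given, so no real retrieval or generation happens here.

```python
def answer_recursively(sub_questions, retrieve, generate):
    """Answer sub-questions in order, feeding earlier Q&A pairs forward.

    retrieve(question) -> list of context strings
    generate(question, context, q_a_pairs) -> answer string
    Both callables are placeholders for a retriever and an LLM call.
    """
    q_a_pairs = []
    for question in sub_questions:
        context = retrieve(question)
        # Pass a copy so the generator sees only the pairs answered so far
        answer = generate(question, context, list(q_a_pairs))
        q_a_pairs.append((question, answer))
    return q_a_pairs


def fake_retrieve(question):
    return [f"context for: {question}"]


def fake_generate(question, context, q_a_pairs):
    return f"answer built from {len(context)} chunk(s) and {len(q_a_pairs)} earlier pair(s)"


pairs = answer_recursively(["Q1", "Q2", "Q3"], fake_retrieve, fake_generate)
for question, answer in pairs:
    print(f"{question}: {answer}")
```

The second and third sub-questions see one and two earlier pairs respectively, which is what lets a later question like "how do their dosages compare?" build on an earlier answer that has already named the alternatives.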

We'll start by creating 3 sub-questions from our original query:

In [10]:
# Creating decomposition template
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)

from langchain_core.output_parsers import StrOutputParser

# LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Chain
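# Split the model output into one question per line, skipping any blank lines
generate_queries_decomposition = ( prompt_decomposition | llm | StrOutputParser() | (lambda x: [line for line in x.split("\n") if line.strip()]))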

# Run
questions = generate_queries_decomposition.invoke({"question":query})

for question in questions: 
    print(question)

1. What are the most common alternatives to Omvoh for treating Ulcerative Colitis?
2. How do the dosages of alternative medications for Ulcerative Colitis compare to Omvoh?
3. What are the side effects of alternative treatments for Ulcerative Colitis compared to Omvoh?
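
Note that the sub-questions come back with their list numbering attached ("1. ", "2. ", ...), and that numbering travels along into the retrieval queries. If that turns out to matter, a small parsing step could strip it. The helper below is a hypothetical sketch of such a step and is not part of the chain above.

```python
import re


def parse_sub_questions(raw_output):
    """Turn an LLM's list-style output into clean question strings."""
    questions = []
    for line in raw_output.split("\n"):
        # Drop a leading list marker such as "1. ", "2) ", "- ", or "* "
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if cleaned:  # skip blank lines
            questions.append(cleaned)
    return questions


raw_output = (
    "1. What are the most common alternatives to Omvoh for treating Ulcerative Colitis?\n"
    "2. How do the dosages of alternative medications for Ulcerative Colitis compare to Omvoh?\n"
    "\n"
    "3. What are the side effects of alternative treatments for Ulcerative Colitis compared to Omvoh?"
)
for question in parse_sub_questions(raw_output):
    print(question)
```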


Next, we'll set up some additional infrastructure to answer the sub-questions in sequence, passing each one the Q&A pairs that came before it:

In [11]:
# Master prompt template
template = """
You are an expert in medical pharmacology. 
You work as a Medical Science Liaison, and your job is to answer technical
questions from other medical professionals while being as
scientifically accurate as possible. You specialize in the Ulcerative
Colitis indication, and you focus on three competing medications:
Omvoh, Stelara, and Entyvio. Use the following pieces
of retrieved context to answer any questions you receive.
If you don't know the answer, say that you don't know.
Politely REFUSE TO ANSWER any questions
unrelated to pharmacology and medicine.
\n\n

Here's the question you need to answer:

\n --- \n {question} \n --- \n

Here is any available background question + answer pairs:

\n --- \n {q_a_pairs} \n --- \n

Here is additional context relevant to the question: 

\n --- \n {context} \n --- \n

Use the above context and any background question + answer pairs to answer the question: \n {question}
"""

decomposition_prompt = ChatPromptTemplate.from_template(template)

from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

def format_qa_pair(question, answer):
    """Format a question and its answer as a single text block."""
    return f"Question: {question}\nAnswer: {answer}".strip()


# Build the chain once; each call retrieves context for the current sub-question
rag_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question"),
     "q_a_pairs": itemgetter("q_a_pairs")}
    | decomposition_prompt
    | llm
    | StrOutputParser())

# Answer each sub-question in turn, passing along the Q&A pairs accumulated so far
q_a_pairs = ""
for question in questions:
    answer = rag_chain.invoke({"question": question, "q_a_pairs": q_a_pairs})
    q_a_pair = format_qa_pair(question, answer)
    q_a_pairs = q_a_pairs + "\n---\n" + q_a_pair

print(q_a_pairs)


---
Question: 1. What are the most common alternatives to Omvoh for treating Ulcerative Colitis?
Answer: The most common alternatives to Omvoh for treating Ulcerative Colitis include Stelara (ustekinumab) and Entyvio (vedolizumab). Both of these medications are indicated for the treatment of moderately to severely active ulcerative colitis and have been studied in clinical trials for their efficacy and safety profiles. If you have specific questions about their mechanisms of action, dosing, or clinical trial data, feel free to ask!
---
Question: 2. How do the dosages of alternative medications for Ulcerative Colitis compare to Omvoh?
Answer: The dosages of alternative medications for Ulcerative Colitis, such as Stelara (ustekinumab) and Entyvio (vedolizumab), differ from those of Omvoh (mirikizumab). 

For Omvoh, the recommended dosing regimen involves an intravenous infusion of 300 mg at Weeks 0, 4, and 8 for induction, followed by a subcutaneous injection of 200 mg every 4 weeks for

Now that we have Q&A pairs for each of our sub-questions, we will send them, together with the original query, to our retriever to get context, then pass the context and the Q&A pairs to our model to answer the original question:

In [28]:
from langchain_core.runnables import RunnablePassthrough

# Plain (non-f) string: {q_a_pairs} and {context} are both template variables,
# so any braces inside the generated Q&A text are not parsed as placeholders
system_prompt = """
You are an expert in medical pharmacology. 
You work as a Medical Science Liaison, and your job is to answer technical
questions from other medical professionals while being as
scientifically accurate as possible. You specialize in the Ulcerative
Colitis indication, and you focus on three competing medications:
Omvoh, Stelara, and Entyvio. Use the following pieces
of retrieved context to answer any questions you receive.
If you don't know the answer, say that you don't know.
Politely REFUSE TO ANSWER any questions
unrelated to pharmacology and medicine.
\n\n

Here is any available background question + answer pairs:

\n --- \n {q_a_pairs} \n --- \n

Here is additional context relevant to the question: 

\n --- \n {context} \n --- \n

Use the above context and any background question + answer pairs to answer the question
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
).partial(q_a_pairs=q_a_pairs)  # fill in the Q&A pairs collected above

rag_chain_from_docs = (
    {
        "input": lambda x: x["input"],  # input query
        "context": lambda x: format_docs(x["context"]),  # context
    }
    | prompt  # format query and context into prompt
    | llm  # generate response
    | StrOutputParser()  # coerce to string
)

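# The chain input is a dict, so pull out the query text before building the retrieval string
retrieved_docs = (lambda x: f"{x['input']}\n\n{q_a_pairs}") | retriever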

chain = RunnablePassthrough.assign(context=retrieved_docs).assign(answer=rag_chain_from_docs)

results = chain.invoke({"input": query})

print(f"ANSWER:\n{results['answer']}\n\n")
print("CONTEXT (aside from the Q&A pairs):\n")
for doc in results['context']:
    print(doc.page_content+"\n\n")    

ANSWER:
The most common alternatives to Omvoh for treating Ulcerative Colitis include Stelara (ustekinumab) and Entyvio (vedolizumab). 

In terms of dosage:

- **Omvoh (mirikizumab)**: The recommended dosing regimen involves an intravenous infusion of 300 mg at Weeks 0, 4, and 8 for induction, followed by a subcutaneous injection of 200 mg every 4 weeks for maintenance.

- **Stelara (ustekinumab)**: It is typically administered as an initial intravenous dose of 260 mg for adults, followed by a subcutaneous injection of 90 mg every 8 weeks after the initial dose.

- **Entyvio (vedolizumab)**: This medication is administered as an intravenous infusion of 300 mg at Weeks 0, 2, and 6, followed by maintenance infusions every 8 weeks.

These differences in dosing schedules and routes of administration reflect the unique pharmacokinetics and mechanisms of action of each medication. If you have further questions about specific dosing or clinical considerations, feel free to ask!


CONTEXT (asi

## Final Observations (for now)
Taking a look at the final results from our recursive decomposition implementation, it seems we have succeeded in making our RAG pipeline a bit less naive. Chunks from all relevant documents appear in the context, though not all relevant info was present in the chunks. Overall, however, this is a substantial improvement over the original implementation. Next steps for improving the model would involve experimenting with chunk size and the number of chunks to retrieve from the vector store. 