# Custom Chatbot Project

In this project, I create a custom chatbot that answers medical questions using knowledge drawn from medical journals.

PubMed is one of the largest openly accessible collections of medical literature, and the PubMedQA dataset derived from it contains questions paired with answers drawn from medical journal articles. This makes the dataset appropriate for the custom chatbot task.

## Data Wrangling

In [1]:
from datasets import load_dataset # Import HuggingFace library
import pandas as pd

In [2]:
# Download PubMedQA Dataset from HuggingFace
dataset = load_dataset("pubmed_qa", "pqa_labeled") # Source: https://huggingface.co/datasets/qiaojin/PubMedQA

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
        num_rows: 1000
    })
})

In [4]:
# Load the dataset into a pandas dataframe
df = pd.DataFrame(dataset['train'])

In [5]:
df.head()

| | pubid | question | context | long_answer | final_decision |
|---|---|---|---|---|---|
| 0 | 21645374 | Do mitochondria play a role in remodelling lac... | {'contexts': ['Programmed cell death (PCD) is ... | Results depicted mitochondrial dynamics in viv... | yes |
| 1 | 16418930 | Landolt C and snellen e acuity: differences in... | {'contexts': ['Assessment of visual acuity dep... | Using the charts described, there was only a s... | no |
| 2 | 9488747 | Syncope during bathing in infants, a pediatric... | {'contexts': ['Apparent life-threatening event... | "Aquagenic maladies" could be a pediatric form... | yes |
| 3 | 17208539 | Are the long-term results of the transanal pul... | {'contexts': ['The transanal endorectal pull-t... | Our long-term study showed significantly bette... | no |
| 4 | 10808977 | Can tailored interventions increase mammograph... | {'contexts': ['Telephone counseling and tailor... | The effects of the intervention were most pron... | yes |


### Combine `question`, `context` and `long_answer` into `text` column

#### Analyzing the data format of each column

In [6]:
df['question'].loc[100]

'Mammographic screening in Sami speaking municipalities and a control group. Are early outcome measures influenced by ethnicity?'

In [7]:
df['context'].loc[100]

{'contexts': ['Female citizens of Sami (the indigenous people of Norway) municipalities in northern Norway have a low risk of breast cancer. The objective of this study was to describe the attendance rate and outcome of the Norwegian Breast Cancer Screening Program (NBCSP) in the Sami-speaking municipalities and a control group.',
  'A retrospective registry-based study.',
  'The 8 municipalities included in the administration area of the Sami language law (Sami) were matched with a control group of 11 municipalities (non-Sami). Population data were accessed from Statistics Norway. Data regarding invitations and outcome in the NBCSP during the period 2001-2010 was derived from the Cancer Registry of Norway (CRN). The NBCSP targets women aged 50-69 years. Rates and percentages were compared using chi-square test with a p-value<0.05 as statistical significant.',
  'The attendance rate in the NBCSP was 78% in the Sami and 75% in the non-Sami population (p<0.01). The recall rates were 2.4 

In [8]:
df['long_answer'].loc[100]

'Despite a lower risk of breast cancer, the Sami attended the NBCSP more frequently than the control group. The recall and cancer detection rate was lower among the Sami compared with the non-Sami group.'

#### Merging all strings from the three columns into a one-column dataframe

In [9]:
# Example merge for the first row (index 0)
print(
    (df['question'] + '\n\n' + df['context'].apply(lambda x: ' '.join(x['contexts'])) + '\n\n' + df['long_answer']).loc[0]
)

Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?

Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants. The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (

In [10]:
text_df = pd.DataFrame(
    (df['question'] + '\n\n' + df['context'].apply(lambda x: ' '.join(x['contexts'])) + '\n\n' + df['long_answer']),
    columns=['text']
)

In [11]:
text_df.head()

| | text |
|---|---|
| 0 | Do mitochondria play a role in remodelling lac... |
| 1 | Landolt C and snellen e acuity: differences in... |
| 2 | Syncope during bathing in infants, a pediatric... |
| 3 | Are the long-term results of the transanal pul... |
| 4 | Can tailored interventions increase mammograph... |
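
The merge above can also be written as a small named function, which makes the handling of the nested `context` dictionary explicit. The following is a minimal sketch on toy data: `merge_record` and `toy_df` are hypothetical names introduced for illustration, and the toy values are not taken from PubMedQA.

```python
import pandas as pd

def merge_record(question: str, context: dict, long_answer: str) -> str:
    """Join one record's question, context passages, and long answer into one string."""
    # The 'contexts' key holds a list of passages; join them with single spaces
    passages = " ".join(context["contexts"])
    # Separate the three parts with blank lines, as in the notebook cell above
    return "\n\n".join([question, passages, long_answer])

# Toy row that mimics the PubMedQA column layout (illustrative values only)
toy_df = pd.DataFrame({
    "question": ["Is A related to B?"],
    "context": [{"contexts": ["Background sentence.", "Results sentence."]}],
    "long_answer": ["A appears related to B."],
})

toy_text_df = pd.DataFrame({
    "text": [
        merge_record(q, c, a)
        for q, c, a in zip(toy_df["question"], toy_df["context"], toy_df["long_answer"])
    ]
})
print(toy_text_df["text"].loc[0])
```

A named function like this is easier to unit-test than an inline lambda, at the cost of a few extra lines.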


#### Save `text_df` into a CSV file

In [12]:
text_df.to_csv('./text_df.csv', index=False)

## Custom Query Completion

#### Connect to OpenAI

In [13]:
import os
import openai

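# Read the OpenAI API key from the environment; alternatively, assign the key string directly.
# Note: openai.Completion and openai.Embedding used below belong to the legacy (pre-1.0) openai package interface.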
openai.api_key=os.environ.get("OPENAI_API_KEY")

In [15]:
prompt = "When did Russia invade Ukraine?"
answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt
)

#### Create Embeddings

In [20]:
EMBEDDING_MODEL = 'text-embedding-ada-002'
embed_resp = openai.Embedding.create(
    model=EMBEDDING_MODEL,
    input=text_df['text'].tolist()
)

In [24]:
embeddings = [data.embedding for data in embed_resp.data]

In [27]:
len(embeddings)

1000
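
The request above sends all 1000 texts in a single call, which worked here. For a larger dataset it may be necessary to split the inputs into batches. The sketch below shows one way to do that; `embed_in_batches`, `fake_embed`, and the batch size are assumptions for illustration, and the stand-in embedder replaces the real API call so the example runs offline.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on consecutive slices of texts and concatenate the results in order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder for the demo: maps each text to a one-element vector.
# With the legacy client used in this notebook, embed_fn could instead wrap
# openai.Embedding.create(model=EMBEDDING_MODEL, input=batch) and return
# [d.embedding for d in response.data].
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

Because the batches are processed in order, the returned list lines up with the input texts and can be assigned to a DataFrame column directly.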

In [28]:
text_df['embedding'] = embeddings

In [29]:
text_df.head()

| | text | embedding |
|---|---|---|
| 0 | Do mitochondria play a role in remodelling lac... | [-0.015485687181353569, -0.009461959823966026,... |
| 1 | Landolt C and snellen e acuity: differences in... | [0.0042733196169137955, 0.020611297339200974, ... |
| 2 | Syncope during bathing in infants, a pediatric... | [0.007913217879831791, 0.02330002933740616, 0.... |
| 3 | Are the long-term results of the transanal pul... | [-0.013925756327807903, 0.020487992092967033, ... |
| 4 | Can tailored interventions increase mammograph... | [-0.040134165436029434, -0.0025001068133860826... |


#### Save `text_df` with embeddings into a CSV file

In [30]:
text_df.to_csv('./text_embedding_df.csv', index=False)
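
One caveat when reusing this file: CSV stores each embedding list as its string representation, so the column has to be parsed back into lists after loading. The sketch below illustrates this on a toy frame written to an in-memory buffer; the toy values and the buffer are assumptions for the demo, standing in for `text_embedding_df.csv`.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the saved file (illustrative values only)
toy = pd.DataFrame({"text": ["a", "b"], "embedding": [[0.1, 0.2], [0.3, 0.4]]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# At this point each embedding cell is a string such as "[0.1, 0.2]"
raw_cell_type = type(reloaded["embedding"].loc[0])

# Parse the string representation back into a Python list of floats
reloaded["embedding"] = reloaded["embedding"].apply(ast.literal_eval)
print(raw_cell_type, reloaded["embedding"].loc[0])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.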

#### Semantic Similarity Searching

In [47]:
from scipy.spatial.distance import cosine as cosine_distance

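def get_rows_sorted_by_relevance(question, df):
    """
    Return a copy of df sorted by ascending cosine distance between the
    question embedding and each row's embedding (most relevant rows first)
    """
    # Get embedding for the question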
    question_resp = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=question
    )
    question_embeddings = question_resp.data[0].embedding
    
    # Get cosine distance of all text in the df from the question
    question_distance = [cosine_distance(question_embeddings, data_embeddings) for data_embeddings in df['embedding'].tolist()]
    q_df = df.copy()
    q_df['distance'] = question_distance

    # Sort the DF
    q_df.sort_values('distance', inplace=True, ascending=True)

    return q_df
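
To show what the ranking step does without calling the API, here is a minimal sketch that computes cosine distance (1 minus cosine similarity) with NumPy on toy two-dimensional vectors. `rank_by_cosine_distance` and the toy embeddings are hypothetical; the function above uses SciPy's `cosine` in a Python loop instead, and this vectorized variant is only one possible alternative.

```python
import numpy as np
import pandas as pd

def rank_by_cosine_distance(query_vec, df):
    """Return a copy of df sorted by cosine distance to query_vec (closest first)."""
    matrix = np.array(df["embedding"].tolist(), dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    # (assumes no zero-length vectors)
    similarity = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    ranked = df.copy()
    ranked["distance"] = 1.0 - similarity
    return ranked.sort_values("distance", ascending=True)

# Toy embeddings chosen so the expected order is easy to verify by hand
toy_df = pd.DataFrame({
    "text": ["orthogonal", "same direction", "opposite"],
    "embedding": [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]],
})
ranked = rank_by_cosine_distance([1.0, 0.0], toy_df)
print(ranked[["text", "distance"]])
```

A vector pointing the same way as the query has distance 0, an orthogonal one has distance 1, and an opposite one has distance 2, so the rows come out in that order.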

#### Composing Custom Prompt

In [52]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context of questions and answers below, and if the question
can't be answered based on the context, say "I don't know".

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break


    return prompt_template.format("\n\n###\n\n".join(context), question)
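
The context-selection loop in `create_prompt` is a greedy packing under a token budget: rows are taken in relevance order until the next row would exceed the limit. The sketch below isolates that idea. `pack_context` and `count_words` are hypothetical names, and the whitespace word counter is a stand-in assumption so the example runs without `tiktoken`; the notebook itself counts tokens with the `cl100k_base` encoding.

```python
def pack_context(texts, count_tokens, budget):
    """Greedily keep texts, in order, until the next one would exceed the budget."""
    selected, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            # Stop at the first text that does not fit, as create_prompt does
            break
        selected.append(text)
        used += cost
    return selected, used

# Stand-in token counter for the demo: whitespace-separated words
def count_words(text):
    return len(text.split())

rows = ["one two three", "four five", "six seven eight nine"]
selected, used = pack_context(rows, count_words, budget=5)
print(selected, used)
```

Stopping at the first row that does not fit keeps the most relevant rows but may leave part of the budget unused; skipping that row and trying smaller ones would be a different trade-off.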

In [54]:
# Print example prompt
print(create_prompt('Landolt C and snellen e acuity: differences in strabismus amblyopia?', text_df, 1000))


Answer the question based on the context of questions and answers below, and if the question
can't be answered based on the context, say "I don't know".

Context: 

Landolt C and snellen e acuity: differences in strabismus amblyopia?

Assessment of visual acuity depends on the optotypes used for measurement. The ability to recognize different optotypes differs even if their critical details appear under the same visual angle. Since optotypes are evaluated on individuals with good visual acuity and without eye disorders, differences in the lower visual acuity range cannot be excluded. In this study, visual acuity measured with the Snellen E was compared to the Landolt C acuity. 100 patients (age 8 - 90 years, median 60.5 years) with various eye disorders, among them 39 with amblyopia due to strabismus, and 13 healthy volunteers were tested. Charts with the Snellen E and the Landolt C (Precision Vision) which mimic the ETDRS charts were used to assess visual acuity. Three out of 5 optot

#### Querying Based on Custom Prompt

In [55]:
COMPLETION_MODEL_NAME = 'gpt-3.5-turbo-instruct'

def answer_question(question, df, max_prompt_tokens=1800, max_answer_tokens=150):
    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

In [60]:
# Use a while loop with input() to ask questions interactively; type 'stop' to exit
while True:
    question = input('Write your question here: ')
    if question == 'stop':
        # Exit before sending the sentinel word to the model
        break
    answer = answer_question(question, text_df)
    print(answer + '\n\n')

Write your question here:  Landolt C and snellen e acuity: differences in strabismus amblyopia?


Yes, there were small differences in visual acuity measured with the Snellen E and Landolt C charts among patients with strabismus amblyopia.




Write your question here:  stop




## Custom Performance Demonstration

### Question 1

#### Default Response

In [89]:
question = "Landolt C and snellen e acuity: differences in strabismus amblyopia?"
# max_tokens is not set, so the completion is cut off at the API default of 16 tokens
answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question
)

In [90]:
answer.choices[0]['text']

' | BMC Ophthalmology | Full Text\n\nTable 2 Precision in visual ac'

#### Custom Prompt

In [91]:
question = "Landolt C and snellen e acuity: differences in strabismus amblyopia?"
answer = answer_question(question, text_df)
print(answer)

Yes, there were small differences in the lower visual acuity range when comparing Landolt C and Snellen E acuity in patients with various eye disorders, including those with amblyopia due to strabismus. These differences were confirmed even when evaluating eyes with the lower visual acuity and the right eyes of healthy individuals. Additionally, there was only a slight overestimation of visual acuity by the Snellen E compared to the Landolt C method.


### Question 2

#### Default Response

In [107]:
question = "Double balloon enteroscopy: is it efficacious and safe in a community setting?"
answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question
)

In [108]:
answer.choices[0]['text']

'\n\nDouble balloon enteroscopy (DBE) is a minimally invasive end'

#### Custom Prompt

In [112]:
question = "Double balloon enteroscopy: is it efficacious and safe in a community setting?"
answer = answer_question(question, text_df)
print(answer)

Yes, according to the data presented in the context, double-balloon enteroscopy appears to be both safe and effective when performed in a community setting.
