# <u>Fabric + OpenAI + RAG</u>

## Use Case: A solution built on Fabric and Azure OpenAI that uses Retrieval-Augmented Generation (RAG) to help patients find clinical trials that match their specific needs and preferences.

> Source data
  - Source: ClinicalTrials.gov; File: /lakehouse/default/Files/ctg-onc_studies.csv
  - Source: https://www.unitedstateszipcodes.org/zip-code-database/; File: /lakehouse/default/Files/US Zip Codes from 2013 Government Data.csv



# Step-1 Installation

In [1]:
#!pip install torchvision
#!pip install sentencepiece
!pip install -U sentence_transformers
!pip install faiss-cpu

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 3, Finished, Available)

Collecting sentence_transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Collecting transformers<5.0.0,>=4.32.0 (from sentence_transformers)
  Downloading transformers-4.38.2-py3-none-any.whl (8.5 MB)
Collecting huggingface-hub>=0.15.1 (from sentence_transformers)
  Downloading huggingface_hub-0.21.3-py3-none-any.whl (346 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers<5.0.0,>=4.32.0->sentence_transformers)
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)

In [2]:
import os

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 4, Finished, Available)

2024-03-04 05:01:03.360039: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-04 05:01:10.165120: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Step-2 RAG - define functions to load the CSV and to create, save, and index embeddings
## Embedding model: SentenceTransformer("all-MiniLM-L6-v2")
## Index: FAISS, a library for efficient similarity search and clustering of dense vectors
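
To make the indexing step concrete, here is a minimal pure-Python sketch of what an exact (flat) L2 index does conceptually: it compares the query with every stored vector and returns the `k` smallest squared Euclidean distances together with the matching row ids. The function name `flat_l2_search` and the toy 3-dimensional vectors are illustrative assumptions; the notebook itself relies on `faiss.IndexFlatL2`, which performs the same exhaustive comparison much more efficiently.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Return (distances, ids) of the k stored vectors closest to `query`.

    Distances are squared Euclidean distances, sorted ascending.
    Hypothetical helper for illustration only; it is not part of FAISS.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "embeddings" (real ones from all-MiniLM-L6-v2 are much longer)
toy_vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]
distances, ids = flat_l2_search(toy_vectors, query=[0.9, 0.1, 0.0], k=2)
print(ids)        # row ids of the two nearest vectors
print(distances)  # their squared L2 distances
```

The returned pair mirrors the `D, I` pair that the notebook later gets back from `index.search`: one list of distances and one list of row positions into the original documents.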

In [3]:
def load_and_process_data(csv_file):
    # Load data from a CSV file and process it to create document strings
    df = pd.read_csv(csv_file)
    #df = df.iloc[:10000,:20]
    # Specify fields to include in each document
    fields = df.columns
    # Combine selected fields for each row into a single document string
    documents = df.apply(lambda row: ". ".join([str(row[field]) for field in fields if pd.notnull(row[field])]), axis=1).tolist()
    return documents

def create_embeddings(documents, model_name='all-MiniLM-L6-v2'):
    # Load the sentence transformer model and generate embeddings for the documents
    model = SentenceTransformer(model_name)
    embeddings = model.encode(documents, convert_to_tensor=False)
    return np.array(embeddings)

def save_embeddings_and_index(embeddings, index, embeddings_file='embeddings.npy', index_file='faiss_index.index'):
    # Save the document embeddings and FAISS index to disk for later use
    np.save(embeddings_file, embeddings)
    faiss.write_index(index, index_file)

def load_embeddings_and_index(embeddings_file='embeddings.npy', index_file='faiss_index.index'):
    # Load the document embeddings and FAISS index from disk
    embeddings = np.load(embeddings_file)
    index = faiss.read_index(index_file)
    return embeddings, index

def initialize_or_load_vector_database(csv_file, embeddings_file='embeddings.npy', index_file='faiss_index.index', model_name='all-MiniLM-L6-v2'):
    # Initialize or load the vector database from disk
    # If the embeddings and index files do not exist, create them    
    if not os.path.exists(embeddings_file) or not os.path.exists(index_file):
        documents = load_and_process_data(csv_file)
        embeddings = create_embeddings(documents, model_name=model_name)
        dimension = embeddings.shape[1]
        index = faiss.IndexFlatL2(dimension)  # Use L2 distance for similarity
        index.add(embeddings)  # Add document embeddings to the FAISS index
        save_embeddings_and_index(embeddings, index, embeddings_file, index_file)
        #np.save('/lakehouse/default/Files/embeddings.npy', embeddings)
        #faiss.write_index(index, "/lakehouse/default/Files/faiss.index")
    else:
        # If the embeddings and index files exist, load them
        embeddings, index = load_embeddings_and_index(embeddings_file, index_file)        
    return embeddings, index


StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 5, Finished, Available)
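
As a small illustration of the document-building step in `load_and_process_data`, the sketch below joins the non-empty fields of each CSV row into one string using only the standard library. The helper name `rows_to_documents` and the sample rows (including the placeholder NCT numbers) are made up for this example; the notebook does the equivalent with `pandas` and `pd.notnull`.

```python
import csv
import io
from typing import List


def rows_to_documents(csv_text: str, sep: str = ". ") -> List[str]:
    """Join the non-empty fields of every CSV row into one document string.

    Hypothetical stdlib-only counterpart of the pandas-based loader.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        # Skip empty cells so they do not add noise to the embedded text
        parts = [value for value in row.values() if value not in (None, "")]
        documents.append(sep.join(parts))
    return documents


# Placeholder data, not taken from ClinicalTrials.gov
sample_csv = (
    "NCT Number,Study Title,Acronym\n"
    "NCT00000001,Example Lung Study,\n"
    "NCT00000002,Another Study,ABC\n"
)
docs = rows_to_documents(sample_csv)
for doc in docs:
    print(doc)
```

Each resulting string is what gets embedded, so the row order of `docs` must stay aligned with the row order of the vectors added to the index.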

In [4]:
csv_file = '/lakehouse/default/Files/ctg-onc_studies.csv'
embeddings, index = initialize_or_load_vector_database(csv_file)
embeddings, index

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 6, Finished, Available)


(array([[-0.0795683 ,  0.00045708, -0.02926963, ..., -0.05581854,
         -0.00356966, -0.05569702],
        [-0.00046371, -0.01605904, -0.04317908, ..., -0.14403085,
         -0.07639643, -0.0027623 ],
        [ 0.011615  , -0.00724567, -0.04716434, ..., -0.0912191 ,
          0.00375993,  0.04593056],
        ...,
        [ 0.02679865, -0.1174694 , -0.03375481, ..., -0.02427036,
         -0.03802202, -0.0260945 ],
        [ 0.04042743,  0.00930498, -0.06377677, ..., -0.08025651,
         -0.02048424, -0.01428712],
        [-0.00496481, -0.05500593, -0.05620575, ..., -0.11521147,
         -0.02170976, -0.00855382]], dtype=float32),
 <faiss.swigfaiss_avx512.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x79bf2f5aea00> >)
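
One thing to keep in mind with the load-or-build logic above: it reuses `embeddings.npy` and `faiss_index.index` whenever both files exist, so a changed CSV is not picked up until those files are deleted. A possible safeguard, sketched here under the assumption that file modification times are reliable in your environment, is to rebuild when the CSV is newer than any cached artifact. The helper name `cache_is_fresh` is hypothetical.

```python
import os
import tempfile


def cache_is_fresh(source_path, *artifact_paths):
    """Return True only if every artifact exists and is at least as new as the source.

    Hypothetical helper: compares modification times, nothing more.
    """
    source_mtime = os.path.getmtime(source_path)
    for path in artifact_paths:
        if not os.path.exists(path):
            return False
        if os.path.getmtime(path) < source_mtime:
            return False
    return True


# Demonstration with temporary files and explicitly set timestamps
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "studies.csv")
    emb_path = os.path.join(tmp, "embeddings.npy")
    for p in (csv_path, emb_path):
        with open(p, "w") as fh:
            fh.write("x")

    os.utime(csv_path, (1000, 1000))   # CSV written first
    os.utime(emb_path, (2000, 2000))   # embeddings written later
    fresh_before = cache_is_fresh(csv_path, emb_path)

    os.utime(csv_path, (3000, 3000))   # CSV updated after the cache was built
    fresh_after = cache_is_fresh(csv_path, emb_path)

    missing = cache_is_fresh(csv_path, os.path.join(tmp, "faiss_index.index"))

print(fresh_before, fresh_after, missing)
```

In the notebook, a check like this could replace the plain `os.path.exists` test inside `initialize_or_load_vector_database`.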

In [8]:
%%sparkr
# "api_call_clinicaltrials" is a separate notebook that calls the ClinicalTrials.gov API
%run api_call_clinicaltrials

StatementMeta(, , -1, Finished, Available)

Installing package into ‘/nfs4/R/user-lib/application_1709528053988_0001’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/janitor_2.2.0.tar.gz'
Content type 'application/x-gzip' length 236408 bytes (230 KB)
downloaded 230 KB

Loading required package: usethis
* installing *source* package ‘janitor’ ...
** package ‘janitor’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Loading required package: usethis
** help
*** installing help indices
*** copying figures
** building package indices
Loading required package: usethis
** installing vignettes
** testing if installed package can be loaded from temporary location
Loading required package: usethis
** testing if installed package can be loaded from final location
Loading required package: usethis
** testing if installed package keeps a record of temporary installation path
* DONE (janitor)

The downloaded source packages 

In [9]:
%%sparkr
condition=" lung cancer"
usa_zip_codes_csv <- "/lakehouse/default/Files/zip_code_database.csv"
studies_near_me <- get_studies(usa_zip_codes_csv,zipcode="08840",radius=100,indication=condition) %>%
rename(`NCT Number`=`nctId`) 
write.csv(studies_near_me,"/lakehouse/default/Files/local_studies.csv")

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 14, Finished, Available)

Rows: 42735 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): zip, type, primary_city, acceptable_cities, unacceptable_cities, s...
dbl  (4): decommissioned, latitude, longitude, irs_estimated_population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "Geo loc: 40.54,-74.36,100mi"
[1] "Total records fetched: 424"
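
The `get_studies` helper lives in the separate `api_call_clinicaltrials` notebook and is not shown here; the printed `Geo loc: 40.54,-74.36,100mi` suggests it resolves the ZIP code to a centre point and applies a 100-mile radius. As a rough illustration of what such a radius means, the sketch below computes the great-circle (haversine) distance in Python. The function names and the two comparison cities are assumptions made for this example, not part of the author's helper.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, in miles


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(center, point, radius_miles):
    """True if `point` lies within `radius_miles` of `center` (both are (lat, lon))."""
    return haversine_miles(*center, *point) <= radius_miles


zip_08840 = (40.54, -74.36)         # centre point printed by the notebook
new_york = (40.7128, -74.0060)      # example site, chosen for illustration
los_angeles = (34.0522, -118.2437)  # example site, chosen for illustration

print(within_radius(zip_08840, new_york, 100))
print(within_radius(zip_08840, los_angeles, 100))
```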

# Step-3 Index search based on user query
## User query = diagnosis, e.g. lung cancer

In [10]:
def handle_query(query, model_name, local_studies_csv):
    # Load the same embedding model that was used to build the index
    model = SentenceTransformer(model_name)

    print("nct_string:", query)
    # Embed the query; encoding a one-element list yields a (1, dim) array
    query_vec = model.encode([query])

    # Search the FAISS index (the global `index` built in Step-2)
    # for the 15 closest documents
    D, I = index.search(query_vec, 15)
    # Retrieve the closest document based on the index search
    all_studies =pd.read_csv('/lakehouse/default/Files/ctg-onc_studies.csv')
    fields = all_studies.columns
    # Combine selected fields for each row into a single document string
    documents = all_studies.apply(lambda row: " . ".join([str(row[field]) for field in fields if pd.notnull(row[field])]), axis=1).tolist()
        
    data = []  # Initialize an empty list to store the data

    for i, idx in enumerate(I[0]):
        # Assuming `documents` is a list where each element is a document string
        document = documents[idx]  # Retrieve the document using the index from FAISS        
        # Create a dictionary for the current document with its index, distance, and content
        row = {
            'Index': idx,
            'Distance': D[0][i],
            'Document': document
        }
        # Add the dictionary to our list
        data.append(row)

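    # Convert the list of dictionaries into a DataFrame
    df = pd.DataFrame(data)
    # The NCT number is the first field of each document string.
    # regex=False splits on the literal ' . ' (requires pandas >= 1.4)
    df['NCT Number'] = df['Document'].str.split(' . ', n=1, expand=True, regex=False)[0]
    merged_df = pd.merge(df, all_studies, on='NCT Number', how='inner')

    # Load the studies near the user's ZIP code.
    # NOTE: the location filter (the merge below) is commented out,
    # so the results are not yet restricted to local studies.
    local_studies = pd.read_csv(local_studies_csv)
    #user_df = pd.merge(local_studies, merged_df, on='NCT Number', how='inner')
    user_df = merged_df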
    for col in merged_df:
        # Prefix every value with its column name, e.g. "Study Status:NOT_YET_RECRUITING".
        # astype(str) converts all values (including NaN) to strings,
        # so missing values show up as "<column>:nan" in the output.
        user_df[col] = col + ":" + user_df[col].astype(str)
    # Join the labeled fields of each row into one document string for the LLM prompt
    rag_documents = user_df.apply(lambda row: " . ".join([str(row[field]) for field in fields if pd.notnull(row[field])]), axis=1).tolist()
    return rag_documents

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 15, Finished, Available)
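
Because `handle_query` converts every value to a string before joining, missing values end up in the prompt as text such as `Acronym:nan`. One possible variation, sketched below with plain dictionaries instead of a DataFrame, is to drop missing fields before labeling them. The helper names `is_missing` and `label_fields` are hypothetical.

```python
import math
from typing import Any, Dict


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (similar in spirit to pd.isnull for scalars)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def label_fields(record: Dict[str, Any], sep: str = " . ") -> str:
    """Build 'column:value' pairs for the non-missing fields and join them."""
    return sep.join(
        f"{name}:{value}" for name, value in record.items() if not is_missing(value)
    )


# Field values taken from the first study in the notebook output
record = {
    "NCT Number": "NCT06251570",
    "Acronym": float("nan"),
    "Study Status": "NOT_YET_RECRUITING",
}
labeled = label_fields(record)
print(labeled)
```

Leaving out empty fields keeps the prompt shorter and avoids the model treating `nan` as a real value.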

In [11]:
model_name = 'all-MiniLM-L6-v2'
condition=" lung cancer"
local_studies_csv = "/lakehouse/default/Files/local_studies.csv"
RAG_response = handle_query(condition,model_name,local_studies_csv)
print(f"RAG response: {RAG_response}")

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 16, Finished, Available)

nct_string:  lung cancer
RAG response: ["NCT Number:NCT06251570 . Study Title:Evaluation of Dual Task Performance and Respiratory Muscle Endurance in Patients With Lung Cancer . Study URL:https://clinicaltrials.gov/study/NCT06251570 . Acronym:nan . Study Status:NOT_YET_RECRUITING . Brief Summary:Lung cancer is one of the most diagnosed cancer types worldwide, according to GLOBOCAN data published in 2020. According to these data, lung cancer comes second after breast cancer with 2,206,771 new diagnoses worldwide in 2020.\n\nAccording to Türkiye's data for 2020, 41,264 new lung cancer diagnoses made. Lung cancer tumors are divide into two main histological groups non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC).\n\nCurrent medical treatment methods for lung cancer are surgical resection, chemotherapy, radiotherapy, and targeted therapies. Cancer treatments can be administered as a combination of these methods appropriately selected for patients.\n\nAdvances in treatme
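
In `handle_query` the merge with `local_studies.csv` is commented out, so the retrieved studies are not restricted to the user's area. The sketch below shows, with plain Python records, how such a restriction could work; in the notebook an inner `pd.merge` on `NCT Number` would play the same role. The function name `filter_to_local` and the contents of `local_ids` are assumptions for illustration only.

```python
from typing import Dict, Iterable, List


def filter_to_local(
    retrieved: List[Dict[str, str]],
    local_ids: Iterable[str],
    id_key: str = "NCT Number",
) -> List[Dict[str, str]]:
    """Keep only retrieved studies whose id appears in the local study list.

    Preserves the retrieval order (closest match first).
    """
    local = set(local_ids)
    return [study for study in retrieved if study.get(id_key) in local]


# NCT numbers from the notebook output; the "local" set is invented for the example
retrieved = [
    {"NCT Number": "NCT06251570"},
    {"NCT Number": "NCT05557487"},
    {"NCT Number": "NCT06255197"},
]
local_ids = ["NCT06255197", "NCT06251570"]

local_hits = filter_to_local(retrieved, local_ids)
print([study["NCT Number"] for study in local_hits])
```

If the intersection turns out to be small, retrieving more than 15 candidates from the index before filtering may be worth trying.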

# Step-4 Azure OpenAI patient interactions

In [12]:
import os
import openai

def openai_response(user_query):
    # Configure the (pre-1.0) openai SDK for an Azure-style endpoint
    openai.api_type = "azure"
    openai.api_base = "https://polite-ground-030dc3103.4.azurestaticapps.net/api/v1"
    # Read the key from the environment instead of hard-coding it;
    # the variable name AZURE_OPENAI_API_KEY is an assumption, adjust it to your setup
    openai.api_key = os.environ["AZURE_OPENAI_API_KEY"]
    openai.api_version = "2023-05-15"

    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo-16k",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            # RAG_response is the global list of documents returned by handle_query in Step-3
            {"role": "user", "content": f"Here is the information I found:{RAG_response}"},
            {"role": "user", "content": f"Based on this, {user_query}"}
        ]
    )

    # Print the assistant's reply text
    print(response.choices[0].message['content'])



StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 17, Finished, Available)
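
`openai_response` reads the global `RAG_response` and places the whole list into the prompt. A minimal sketch of an alternative is shown below: the retrieved documents are passed in explicitly and the context is capped at a character budget so that the prompt size stays predictable. The function name `build_messages` and the default budget of 20,000 characters are assumptions for illustration, not values taken from the notebook.

```python
from typing import Dict, List


def build_messages(
    rag_documents: List[str],
    user_query: str,
    max_context_chars: int = 20000,  # arbitrary budget, tune for your model
) -> List[Dict[str, str]]:
    """Build the chat messages, keeping whole documents until the budget is reached."""
    context_parts = []
    used = 0
    for doc in rag_documents:
        if used + len(doc) > max_context_chars:
            break  # stop before exceeding the budget
        context_parts.append(doc)
        used += len(doc)
    context = "\n\n".join(context_parts)
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Here is the information I found:\n{context}"},
        {"role": "user", "content": f"Based on this, {user_query}"},
    ]


# Tiny stand-in documents; real ones come from handle_query
docs = ["a" * 10, "b" * 10, "c" * 10]
messages = build_messages(docs, "how many studies are there?", max_context_chars=25)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument used in the cell above, so it could be passed to the chat completion call in its place.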

In [13]:
# Example call to the function
openai_response("how many studies and their indications")



StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 18, Finished, Available)

Based on the information provided, there are 10 studies mentioned. I can provide you with the indications for each study. Here they are:

1. Evaluation of Dual Task Performance and Respiratory Muscle Endurance in Patients With Lung Cancer
   Indication: Lung Cancer

2. Taiwan Real-world LDCT Screening Behavior and Outcome Research for High Risk Subjects Based on Health Promotion Administration
   Indication: Lung Cancer, High-Risk Cancer, Smoking, Pollution; Exposure, Family Relations

3. Characteristics, Treatment Patterns and Outcomes for Patients With Surgically Resected Lung Cancers
   Indication: Lung Cancer, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Non-Small Cell Lung Cancer, Small-cell Lung Cancer

4. Formatting the Risk Prediction Models for Never-Smoking Lung Cancer
   Indication: Lung Cancer

5. Registry Study on Biological Disease Profile and Clinical Outcome in Patients With Newly Diagnosed or Existent Lung Cancer The LuCa Biology and Outcome (BiO)-Project
   Indi

In [14]:
openai_response("give me the duration of each trial and the number of subjects with primary outcome")

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 19, Finished, Available)

Based on the information provided, here are the durations and the number of subjects with primary outcomes for each trial:

1. NCT06251570 (Evaluation of Dual Task Performance and Respiratory Muscle Endurance in Patients With Lung Cancer): Duration: 4 months (Primary Completion Date: 2024-06-15). Number of subjects with primary outcome: 60.0.

2. NCT05557487 (Taiwan Real-world LDCT Screening Behavior and Outcome Research for High Risk Subjects Based on Health Promotion Administration): Duration: Ongoing (Primary Completion Date: 2024-12-31). Number of subjects with primary outcome: 6,618.0.

3. NCT06255197 (Characteristics, Treatment Patterns and Outcomes for Patients With Surgically Resected Lung Cancers): Duration: Ongoing (Primary Completion Date: 2024-12-31). Number of subjects with primary outcome: 60,000.0.

4. NCT05572944 (Formatting the Risk Prediction Models for Never-Smoking Lung Cancer): Duration: Ongoing (Primary Completion Date: 2025-12-31). Number of subjects with primary

In [15]:
openai_response("what are the side effects for each study; if there is no information, display the study URL of all studies")

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 20, Finished, Available)

I apologize, but the provided information does not include the specific side effects for each study. However, I can provide you with the study URLs for each study:

1. Evaluation of Dual Task Performance and Respiratory Muscle Endurance in Patients With Lung Cancer:
Study URL: https://clinicaltrials.gov/study/NCT06251570

2. Taiwan Real-world LDCT Screening Behavior and Outcome Research for High Risk Subjects Based on Health Promotion Administration:
Study URL: https://clinicaltrials.gov/study/NCT05557487

3. Characteristics, Treatment Patterns and Outcomes for Patients With Surgically Resected Lung Cancers:
Study URL: https://clinicaltrials.gov/study/NCT06255197

4. Formatting the Risk Prediction Models for Never-Smoking Lung Cancer:
Study URL: https://clinicaltrials.gov/study/NCT05572944

5. Lung Cancer Registry:
Study URL: https://clinicaltrials.gov/study/NCT04654364

6. A Prospective Study for the Early Diagnosis of Lung Cancer in Korean Females:
Study URL: https://clinicaltrials.g

In [16]:
openai_response("what is the age group for each study")

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 21, Finished, Available)

The age group for each study is specified in the information provided:

1. Study: "Evaluation of Dual Task Performance and Respiratory Muscle Endurance in Patients With Lung Cancer"
   Age Group: Adult, Older Adult

2. Study: "Taiwan Real-world LDCT Screening Behavior and Outcome Research for High Risk Subjects Based on Health Promotion Administration"
   Age Group: Adult, Older Adult

3. Study: "Characteristics, Treatment Patterns and Outcomes for Patients With Surgically Resected Lung Cancers"
   Age Group: Adult, Older Adult

4. Study: "Formatting the Risk Prediction Models for Never-Smoking Lung Cancer"
   Age Group: Adult, Older Adult

5. Study: "Registry Study on Biological Disease Profile and Clinical Outcome in Patients With Newly Diagnosed or Existent Lung Cancer The LuCa Biology and Outcome (BiO)-Project"
   Age Group: Adult, Older Adult

6. Study: "Biology of Young Lung Cancer Study: The YOUNG LUNG Study"
   Age Group: Child, Adult

7. Study: "Lung Cancer Registry"
   Age Gr

In [17]:
openai_response("what is the end of enrollment for each study")

StatementMeta(, 64748098-ba40-436c-9f32-780b853b8d62, 22, Finished, Available)

Based on the information provided, the end of enrollment for each study is as follows:

1. Evaluation of Dual Task Performance and Respiratory Muscle Endurance in Patients With Lung Cancer (NCT06251570) - Enrollment: 60.0. Completion Date: 2024-07-30.
2. Taiwan Real-world LDCT Screening Behavior and Outcome Research for High-Risk Subjects Based on Health Promotion Administration (NCT05557487) - Enrollment: 6,618. Completion Date: 2028-12-31.
3. Characteristics, Treatment Patterns and Outcomes for Patients With Surgically Resected Lung Cancers (NCT06255197) - Enrollment: 60,000. Completion Date: 2031-12-31.
4. Formatting the Risk Prediction Models for Never-Smoking Lung Cancer (NCT05572944) - Enrollment: 10,000. Completion Date: 2025-12-31.
5. Registry Study on Biological Disease Profile and Clinical Outcome in Patients With Newly Diagnosed or Existent Lung Cancer (NCT02613637) - Enrollment: 2,000. Completion Date: 2023-09.
