<a href="https://colab.research.google.com/github/andresmmujica/products-with-openai/blob/main/Summarize_Clinical_Trials_for_Heart_Disease_and_Nutrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarize Clinical Trials for Heart Disease and Nutrition

This project is an LLM app that summarizes important studies and clinical trials for any given disease.

# The Problem

There are so many clinical trials for each disease, but it's very hard to decipher and understand them.

# Solution

Show the clinical trials in a summarized form that is easy for the general population to understand.

**Flow**

1. Get the first 50 clinical trials for the user-selected disease using the API from https://clinicaltrials.gov.
2. Keep only the studies that have posted results, to narrow down the list.
3. To stay within the gpt-3.5-turbo-16k context limit, keep just the first 5 studies from the narrowed-down list.
4. In the Streamlit app, initially show the clinical trials related to "heart disease" and "nutrition".
5. The Streamlit app also lets the user select another disease and search term to view other results.
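
Steps 2 and 3 of the flow can be tried on their own, without calling the API. The snippet below is a minimal sketch, assuming each study is a dict shaped like the legacy `full_studies` response, where a study with submitted results carries a `ResultsFirstSubmitDate` field in its `StatusModule`. The function name `select_studies_with_results` and the sample records are hypothetical and exist only for illustration.

```python
def select_studies_with_results(studies, limit=5):
    """Keep studies that have submitted results, stopping after `limit` matches."""
    selected = []
    for item in studies:
        protocol = item.get("Study", {}).get("ProtocolSection", {})
        status = protocol.get("StatusModule", {})
        # A results submission date is used here as the signal that results exist.
        if status.get("ResultsFirstSubmitDate"):
            selected.append(protocol)
            if len(selected) >= limit:
                break
    return selected


# Hypothetical sample records: only even-numbered studies have results.
sample = [
    {"Study": {"ProtocolSection": {
        "IdentificationModule": {"NCTId": f"NCT{i:08d}"},
        "StatusModule": {"ResultsFirstSubmitDate": "January 1, 2020"} if i % 2 == 0 else {},
    }}}
    for i in range(20)
]

picked = select_studies_with_results(sample)
print([p["IdentificationModule"]["NCTId"] for p in picked])
```

Stopping at the first five matches keeps the prompt small, at the cost of ignoring any later studies that might be more relevant.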



In [None]:
# The cells below call openai.ChatCompletion, which exists only in openai releases before 1.0
!pip install "openai<1"
!pip install tiktoken


In [None]:
import openai
from getpass import getpass

openai.api_key = getpass('Enter the OpenAI API Key in the cell  ')

In [None]:
def get_clinicaltrials(disease, additional_term):
  import requests
  import json

  base_url = "https://ClinicalTrials.gov/api/query/full_studies"
  # Use the caller's arguments; requests URL-encodes the spaces in the query string
  search_term = (disease + " " + additional_term).strip()
  params = {
      "expr": search_term,
      "min_rnk": "1",
      "max_rnk": "50",
      "fmt": "json"
    }
  response = requests.get(base_url, params=params)

  # Parse the JSON
  data = json.loads(response.text)
  print(data)
  results_list = []
  # Attributes to include
  attributes_to_include = [
      "DescriptionModule",
      "EnrollmentInfo",
      "OutcomesModule",
      "EligibilityModule",
      "IdentificationModule",
      "StatusModule",
      "SponsorCollaboratorsModule",
      "ConditionsModule"
  ]

  index = 0

  if (data["FullStudiesResponse"]["NStudiesFound"] > 0):

    # Extract the list of studies
    studies = data["FullStudiesResponse"]["FullStudies"]

    # Loop through the studies
    for study_data in studies:
    # Extract relevant information
      study = study_data["Study"]

      # Extract status, start date, and completion date if available
      status_info = study["ProtocolSection"].get("StatusModule", {})

      # Get the results submission date; it is present only when results have been submitted for the study
      results_submit_date = status_info.get("ResultsFirstSubmitDate")
      if results_submit_date:
        # Include only specified attributes and their child attributes
        included_attributes = {}
        for attr in attributes_to_include:
          if attr in study["ProtocolSection"]:
            included_attributes[attr] = study["ProtocolSection"][attr]
            #print(f"Added attribute: {attr}")

        index += 1
        results_list.append({"Study {}".format(index): included_attributes})
        # Only get up to 5 studies due to the token limit for OpenAI
        if (index >= 5):
          break
  else:
      print("No Studies found")

  # Create a dictionary for the collected results
  collected_results = {"Results": results_list}

  # Convert the collected results to JSON format
  collected_results_json = json.dumps(collected_results, indent=4)
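  # Create the output directory if it is missing (assumes the Colab /content path is writable)
  import os
  os.makedirs("/content/clinical_trials", exist_ok=True)
  with open("/content/clinical_trials/collected_results"+disease+".json", "w") as outfile:
    outfile.write(collected_results_json)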

  return collected_results_json

collected_results_json = get_clinicaltrials("heart disease", "nutrition")




**Get the summary for each study by calling OpenAI API**

In [None]:
summary = ""
if collected_results_json:
  instructPrompt = """
  You are an expert at analyzing clinical data for scientific studies for diseases. Summarize the Outcome of each study. Use bullet points. Compare the Outcome parameters to hypotheses and see if the outcome matched the hypotheses or not.
  Focus more on the outcome parameters for each study. Don't use any scientific language. Make it easily understandable by kids. Below is all the information from the studies of various clinical trials. Don't write any of the identification attributes.
  Don't put the title of the study in your response.
  """

  request = instructPrompt + collected_results_json

  chatOutput = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                              messages=[{"role": "system", "content": "You are an expert at analyzing clinical data for scientific studies for diseases."},
                                                        {"role": "user", "content": request}
                                                        ]
                                              )

  summary = chatOutput.choices[0].message.content


summary

'Study 1: Nutrition and Aerobic Exercise in Chronic Stroke (NEXIS)\n\n- Brief Summary: This study aims to examine the effects of dietary modification and treadmill training on fuel utilization and physical function in chronic stroke patients.\n- Hypotheses: The researchers hypothesized that dietary modification and exercise training can reverse abnormalities in fuel utilization and improve physical function in chronic stroke patients.\n- Primary Outcome: The change in total daily energy expenditure was measured using an accelerometer activity monitor. The researchers found that treadmill training or stretching did not significantly affect total daily energy expenditure.\n- Secondary Outcomes:\n  - The change in substrate oxidation was measured using open circuit spirometry during a treadmill walking task. The researchers found no significant change in substrate oxidation after 6 months of treadmill training or stretching.\n  - The change in circulating nitrotyrosine, a marker of oxidat
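
The reply above labels each study with a `Study N:` marker, which is what the later cells rely on to pull out one study's summary. The snippet below is a minimal sketch of splitting such a reply into per-study sections, assuming the model keeps to that marker format at the start of a line (this is not guaranteed). The function name `split_summary_by_study` and the example text are hypothetical.

```python
import re


def split_summary_by_study(summary_text):
    """Return {study_number: text} for each 'Study N:' section in the summary."""
    # Match "Study 3:" at the start of a line and capture the number.
    marker = re.compile(r"^Study (\d+):", flags=re.MULTILINE)
    matches = list(marker.finditer(summary_text))
    sections = {}
    for pos, match in enumerate(matches):
        start = match.end()
        # A section ends where the next marker begins, or at the end of the text.
        end = matches[pos + 1].start() if pos + 1 < len(matches) else len(summary_text)
        sections[int(match.group(1))] = summary_text[start:end].strip()
    return sections


example = "Study 1: Diet and exercise\n- No big change.\n\nStudy 2:\n- Helped a little."
sections = split_summary_by_study(example)
print(sections[2])
```

If the model omits or renames the markers, the returned dict is simply empty, so callers can fall back to showing the full summary.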

**Get all the important attributes for all the studies that matched the criteria**

In [None]:
def get_clinicaltrials(disease, additional_term):
  import requests
  import json

  base_url = "https://ClinicalTrials.gov/api/query/full_studies"
  # Use the caller's arguments; requests URL-encodes the spaces in the query string
  search_term = (disease + " " + additional_term).strip()
  params = {
      "expr": search_term,
      "min_rnk": "1",
      "max_rnk": "50",
      "fmt": "json"
    }
  response = requests.get(base_url, params=params)

  # Parse the JSON
  data = json.loads(response.text)
  print(data)
  results_list = []
  # Attributes to include
  attributes_to_include = [
      "DescriptionModule",
      "EnrollmentInfo",
      "OutcomesModule",
      "EligibilityModule",
      "IdentificationModule",
      "StatusModule",
      "SponsorCollaboratorsModule",
      "ConditionsModule"
  ]

  index = 0

  if (data["FullStudiesResponse"]["NStudiesFound"] > 0):

    # Extract the list of studies
    studies = data["FullStudiesResponse"]["FullStudies"]

    # Loop through the studies
    for study_data in studies:
    # Extract relevant information
      study = study_data["Study"]

      # Extract status, start date, and completion date if available
      status_info = study["ProtocolSection"].get("StatusModule", {})

      # Get the results submission date; it is present only when results have been submitted for the study
      results_submit_date = status_info.get("ResultsFirstSubmitDate")
      if results_submit_date:
        # Include only specified attributes and their child attributes
        included_attributes = {}
        for attr in attributes_to_include:
          if attr in study["ProtocolSection"]:
            included_attributes[attr] = study["ProtocolSection"][attr]
            #print(f"Added attribute: {attr}")

        index += 1
        results_list.append({"Study {}".format(index): included_attributes})
        # Only get up to 5 studies due to the token limit for OpenAI
        if (index >= 5):
          break
  else:
      print("No Studies found")

  # Create a dictionary for the collected results
  collected_results = {"Results": results_list}

  # Convert the collected results to JSON format
  collected_results_json = json.dumps(collected_results, indent=4)
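  # Create the output directory if it is missing (assumes the Colab /content path is writable)
  import os
  os.makedirs("/content/clinical_trials", exist_ok=True)
  with open("/content/clinical_trials/collected_results"+disease+".json", "w") as outfile:
    outfile.write(collected_results_json)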

  #getstudyinfo("stroke", "")
  return collected_results_json

# Function to get summaries based on study number
def get_summary_by_study_number(summary_text, study_number):
    study_marker = f"Study {study_number}:"
    next_study_marker = f"Study {study_number + 1}:"

    start_index = summary_text.find(study_marker)
    end_index = summary_text.find(next_study_marker) if next_study_marker in summary_text else len(summary_text)

    if start_index != -1:
      summary_for_study = summary_text[start_index:end_index].strip()
      # Remove the study marker from the extracted section, not from the full text
      summary_for_study = summary_for_study.replace(study_marker, '').strip()
      return summary_for_study
    else:
      return ""

def get_attributes_by_study_number(json_data, summary, study_number):
  import json
  output = {}

  # Parse the JSON
  data = json.loads(json_data)

  # Define the study number you're interested in
  study_key = f"Study {study_number}"
  index = study_number - 1

  study_data = None

  # Extract the study data using the study number
  if index < len(data["Results"]):
    print(index)
    study_data = data["Results"][index].get(study_key)
  #  print(study_data)

    if study_data is not None:
      output['brief_title'] = study_data["IdentificationModule"]["BriefTitle"]
      output['nct_id'] = study_data["IdentificationModule"]["NCTId"]
      output['overall_status'] = study_data["StatusModule"]["OverallStatus"]
      output['start_date'] = study_data["StatusModule"]["StartDateStruct"]["StartDate"]
      output['completion_date'] = study_data["StatusModule"]["CompletionDateStruct"]["CompletionDate"]
      output['lead_sponsor'] = study_data["SponsorCollaboratorsModule"]["LeadSponsor"]["LeadSponsorName"]
      output['summary'] = get_summary_by_study_number(summary, study_number)
      # print(output)
      output_json = json.dumps(output, indent=4)
      with open("/content/clinical_trials/" + study_key + ".json", "w") as outfile:
        outfile.write(output_json)

    else:
      print(f"Study {study_number} not found.")


  return output

def get_study_info(selected_disease, additional_term):
  output = get_clinicaltrials(selected_disease, additional_term)
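  # NOTE: get_summarized_clinicaltrials is defined only in the Modal backend file written
  # further down, so running this cell in the notebook raises the NameError shown below.
  summary = get_summarized_clinicaltrials.local(output)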
  #summary = 'Study 1:\n- The study aimed to examine the effects of dietary modification and treadmill training on fuel utilization and physical function in chronic stroke patients.\n- The outcomes of the study were not explicitly compared to hypotheses.'
  #print(output)
  allAttrsForStudies = []
  for i in range(1, 6):
    attributes = get_attributes_by_study_number(output, summary, i)
    allAttrsForStudies.append(attributes)

  # print(allAttrsForStudies)
  return allAttrsForStudies


get_study_info("diabetes", "")



{'FullStudiesResponse': {'APIVrs': '1.01.05', 'DataVrs': '2023:08:16 23:57:56.293', 'Expression': 'stroke nutrition', 'NStudiesAvail': 462815, 'NStudiesFound': 319, 'MinRank': 1, 'MaxRank': 50, 'NStudiesReturned': 50, 'FullStudies': [{'Rank': 1, 'Study': {'ProtocolSection': {'IdentificationModule': {'NCTId': 'NCT03825419', 'OrgStudyIdInfo': {'OrgStudyId': 'MASS'}, 'Organization': {'OrgFullName': 'Turkish Stroke Research and Clinical Trials Network', 'OrgClass': 'NETWORK'}, 'BriefTitle': 'Detection of Muscle Loss in Acute Stroke Patients Who Need Enteral Nutrition (MASS)', 'OfficialTitle': 'Muscle Assessment in Stroke Study (MASS): Detection of Muscle Loss in Acute Stroke Patients Who Need Enteral Nutrition', 'Acronym': 'MASS'}, 'StatusModule': {'StatusVerifiedDate': 'April 2022', 'OverallStatus': 'Completed', 'ExpandedAccessInfo': {'HasExpandedAccess': 'No'}, 'StartDateStruct': {'StartDate': 'January 23, 2019', 'StartDateType': 'Actual'}, 'PrimaryCompletionDateStruct': {'PrimaryComplet

NameError: ignored
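
The attribute extraction above indexes several nested keys directly, so a study that lacks one of them (for example `CompletionDateStruct`) would raise a `KeyError`. The snippet below is a minimal sketch of a more tolerant lookup, assuming a missing field should simply come back as `None`. The helper `get_nested` and the sample record are hypothetical.

```python
def get_nested(mapping, *keys, default=None):
    """Walk nested dicts by key, returning `default` if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical study record with no completion date.
study = {
    "IdentificationModule": {"NCTId": "NCT00000000", "BriefTitle": "Example trial"},
    "StatusModule": {"OverallStatus": "Completed"},
}

print(get_nested(study, "IdentificationModule", "NCTId"))
print(get_nested(study, "StatusModule", "CompletionDateStruct", "CompletionDate"))
```

With a helper like this, each `output[...]` assignment could tolerate incomplete records instead of aborting the whole loop.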

In [None]:
!pip install modal

In [None]:
!modal token new --source corise > authenticationURL.txt

In [None]:
import getpass
import subprocess

def set_modal_token():
  token_id = getpass.getpass('Please enter your Modal token ID in the cell: ')
  token_secret = getpass.getpass('Please enter your Modal token secret in the cell:  ')

  # Run the Modal CLI directly; "!" is notebook syntax and is not valid inside subprocess
  subprocess.run(["modal", "token", "set", "--token-id", token_id, "--token-secret", token_secret], check=True)

set_modal_token()

Please enter your Modal token ID in the cell: ··········
Please enter your Modal token secret in the cell:  ··········


In [None]:
%%writefile /content/clinical_trials/clinicaltrials_backend.py
import modal

def download_packages():
  # Placeholder build step; this app has no model or data to pre-download
  print ("Installing needed packages")


stub = modal.Stub("clinicaltrials-project")
corise_image = modal.Image.debian_slim().pip_install(
                                                     "requests",
                                                     "ffmpeg",
                                                     "openai",
                                                     "tiktoken").run_function(download_packages)

def get_clinicaltrials(disease, additional_term):
  import requests
  import json

  base_url = "https://ClinicalTrials.gov/api/query/full_studies"
  # Use the caller's arguments; requests URL-encodes the spaces in the query string
  search_term = (disease + " " + additional_term).strip()
  params = {
      "expr": search_term,
      "min_rnk": "1",
      "max_rnk": "50",
      "fmt": "json"
    }
  response = requests.get(base_url, params=params)

  # Parse the JSON
  data = json.loads(response.text)
  # print(data)
  results_list = []
  # Attributes to include
  attributes_to_include = [
      "DescriptionModule",
      "EnrollmentInfo",
      "OutcomesModule",
      "EligibilityModule",
      "IdentificationModule",
      "StatusModule",
      "SponsorCollaboratorsModule",
      "ConditionsModule"
  ]

  index = 0

  if (data["FullStudiesResponse"]["NStudiesFound"] > 0):

    # Extract the list of studies
    studies = data["FullStudiesResponse"]["FullStudies"]

    # Loop through the studies
    for study_data in studies:
      # Extract relevant information
      study = study_data["Study"]

      # Extract status, start date, and completion date if available
      status_info = study["ProtocolSection"].get("StatusModule", {})

      # Get the results submission date; it is present only when results have been submitted for the study
      results_submit_date = status_info.get("ResultsFirstSubmitDate")
      if results_submit_date:
        # Include only specified attributes and their child attributes
        included_attributes = {}
        for attr in attributes_to_include:
          if attr in study["ProtocolSection"]:
            included_attributes[attr] = study["ProtocolSection"][attr]
            #print(f"Added attribute: {attr}")

        index += 1
        results_list.append({"Study {}".format(index): included_attributes})
        # Only get up to 5 studies due to the token limit for OpenAI
        if (index >= 5):
          break
  else:
      print("No Studies found")

  # Create a dictionary for the collected results
  collected_results = {"Results": results_list}

  # Convert the collected results to JSON format
  collected_results_json = json.dumps(collected_results, indent=4)
  #with open("/content/clinical_trials/collected_results"+disease+".json", "w") as outfile:
  #  outfile.write(collected_results_json)

  #getstudyinfo("stroke", "")
  return collected_results_json



@stub.function(image=corise_image, gpu="any")
def get_summary_by_study_number(summary_text, study_number):
    study_marker = f"Study {study_number}:"
    next_study_marker = f"Study {study_number + 1}:"

    start_index = summary_text.find(study_marker)
    end_index = summary_text.find(next_study_marker) if next_study_marker in summary_text else len(summary_text)
    #print("start_index: " + start_index)

    if start_index != -1:
      summary_for_study = summary_text[start_index:end_index].strip()
      # Remove the study marker from the extracted section, not from the full text
      summary_for_study = summary_for_study.replace(study_marker, '').strip()
      return summary_for_study
    else:
      return ""

@stub.function(image=corise_image, gpu="any")
def get_attributes_by_study_number(json_data, summary, study_number):
  import json
  output = {}

  # Parse the JSON
  data = json.loads(json_data)

  # Define the study number you're interested in
  study_key = f"Study {study_number}"
  index = study_number - 1

  study_data = None

  # Extract the study data using the study number
  if index < len(data["Results"]):
    print(index)
    study_data = data["Results"][index].get(study_key)
  #  print(study_data)

    if study_data is not None:
      output['brief_title'] = study_data["IdentificationModule"]["BriefTitle"]
      output['nct_id'] = study_data["IdentificationModule"]["NCTId"]
      output['overall_status'] = study_data["StatusModule"]["OverallStatus"]
      output['start_date'] = study_data["StatusModule"]["StartDateStruct"]["StartDate"]
      output['completion_date'] = study_data["StatusModule"]["CompletionDateStruct"]["CompletionDate"]
      output['lead_sponsor'] = study_data["SponsorCollaboratorsModule"]["LeadSponsor"]["LeadSponsorName"]
      output['summary'] = get_summary_by_study_number.local(summary, study_number)
      # print(output)
      output_json = json.dumps(output, indent=4)
      #with open("/content/clinical_trials/" + study_key + ".json", "w") as outfile:
      #  outfile.write(output_json)

    else:
      print(f"Study {study_number} not found.")


  return output

@stub.function(image=corise_image, gpu="any", secret=modal.Secret.from_name("my-openai-secret-2"))
def get_summarized_clinicaltrials(collected_results_json):
  import openai
  print ("Starting get_summarized_clinicaltrials Function")

  instructPrompt = """
    You are an expert at analyzing clinical data for scientific studies for diseases. Summarize the Outcome of each study. Use bullet points. Compare the Outcome parameters to hypotheses and see if the outcome matched the hypotheses or not.
  Focus more on the outcome parameters for each study. Don't use any scientific language. Make it easily understandable by kids. Below is all the information from the studies of various clinical trials. Don't write any of the identification attributes.
  Don't put the title of the study in your response. Keep the response concise.
  """

  request = instructPrompt + collected_results_json

  chatOutput = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                            messages=[{"role": "system", "content": "You are an expert at analyzing clinical data for scientific studies for diseases."},
                                                      {"role": "user", "content": request}])

  summary = chatOutput.choices[0].message.content
  return summary

@stub.function(image=corise_image, gpu="any")
def get_study_info(selected_disease, additional_term):
  download_packages()
  output = get_clinicaltrials(selected_disease, additional_term)
  summary = get_summarized_clinicaltrials.local(output)
  #summary = 'Study 1:\n- The study aimed to examine the effects of dietary modification and treadmill training on fuel utilization and physical function in chronic stroke patients.\n- The outcomes of the study were not explicitly compared to hypotheses.'
  #print(output)
  allAttrsForStudies = []
  for i in range(1, 6):
    attributes = get_attributes_by_study_number.local(output, summary, i)
    allAttrsForStudies.append(attributes)

  print(allAttrsForStudies)
  return allAttrsForStudies

@stub.local_entrypoint()
def main():
  get_study_info.local("stroke", "nutrition")


Overwriting /content/clinical_trials/clinicaltrials_backend.py


In [None]:
!modal run /content/clinical_trials/clinicaltrials_backend.py

✓ Initialized. View app at https://modal.com/apps/ap-H7EMzV4V28vlgvyeLro1Rk
├── 🔨 Created get_summary_by_study_number.
├── 🔨 Created mount /content/clinical_trials/clinicaltrials_backend.py
├── 🔨 Created download_packages.

**Deploy the backend code to Modal**


In [None]:
!modal deploy /content/clinical_trials/clinicaltrials_backend.py

├── 🔨 Created get_summary_by_study_number.
├── 🔨 Created mount /content/clinical_trials/clinicaltrials_backend.py
├── 🔨 Created download_packages.

# Front End