<a href="https://colab.research.google.com/github/atlasfutures/memex/blob/docs_private/docs/tutorial/tutorials/clinical-trials-matching/Clinical_Trials_Matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Match Patients to Clinical Trials

### An implementation of [Zero-Shot Clinical Trial Patient Matching with LLMs](https://arxiv.org/pdf/2402.05125.pdf) from Stanford Medicine's Shah Lab using Memex


# Introduction

This tutorial walks through matching patients to clinical trials with the Memex SDK in a Python environment, following the state-of-the-art approach from Stanford's Shah Lab. It begins by installing the Memex SDK and downloading synthetic patient data and clinical trials data from Google Drive, then uploads this data to the Memex platform. From there it formats and summarizes each patient's history, extracts structured eligibility criteria from each trial, evaluates every patient against every criterion with an LLM, and aggregates the results into match scores.

The diagram below illustrates the full analytics pipeline that this notebook implements.


[![](https://mermaid.ink/img/pako:eNp9VNty2jAQ_ZUd9bGEAYMvuJ3MkAC5kWlufWjrPihGttXIEmPJaUgm_961TAy4M9aTjs7Zs961tG8kVitGQpII9TfOaGHgYRZJwKXLx7Sg6wy4XJcGVtTQmqjW9FdE7jfSZIxG5PfufI7nseCSx1SYglOh-6l6biRMrurNFI6OjuGkcjFFGZuyYLCmhjO5zfQRMUchKheonL-YgsYGrC_EBTcMd43yxFqeonChipwaoI2jLvOcFptGemqlsyq7ZfjrLnvGtVF72oXVnlW2ghrDZJMZqFwhUFrDH8UlGNVKyJlubGbW5rwq45mKkpomowaaUi61-b-kszqoBucWXKDDNE0Llu5ZANWaaZ1Xbk3whdVfVkXGCturTcFkajJQSRP39Ri27cSGxdne98I0DMNHUbItPEGIWZnc4tMWniFWBZXpR8D8MH7R5s9aBudtwUVLcHmIY4FFz1gCdRAkXIjwUzLxWrTgaWYONUnS0ljbLTtIJpFsvYAlS5ubC3BVtfR2Cbff53c_9m7_Es9v7r5d3zzAZ0DBHnXVqgWWrWqtfb3F1_N0bzaCwaD6Z-qJ2T5-abPDTtbpZEed7LiTdTtZr5P1O9nggCU9kjN8yHyF4-mt0kYE503OIhLi9lHQ-CkikXxHIS2NwnEUkxCnCeuRco1DhM04xf-XkzDBOYSnbMXxJlzXA8_OvR5ZU_lTqZ0GMQnfyAsJg0l_OPI833V93_Uno6BHNiR0g74_dB1_PBp745HjOOP3Hnm1DoN-4A0dJxi6g_HAGzl-8P4PPLGuSw?type=png)](https://mermaid-js.github.io/mermaid-live-editor/edit#pako:eNp9VNty2jAQ_ZUd9bGEAYMvuJ3MkAC5kWlufWjrPihGttXIEmPJaUgm_961TAy4M9aTjs7Zs961tG8kVitGQpII9TfOaGHgYRZJwKXLx7Sg6wy4XJcGVtTQmqjW9FdE7jfSZIxG5PfufI7nseCSx1SYglOh-6l6biRMrurNFI6OjuGkcjFFGZuyYLCmhjO5zfQRMUchKheonL-YgsYGrC_EBTcMd43yxFqeonChipwaoI2jLvOcFptGemqlsyq7ZfjrLnvGtVF72oXVnlW2ghrDZJMZqFwhUFrDH8UlGNVKyJlubGbW5rwq45mKkpomowaaUi61-b-kszqoBucWXKDDNE0Llu5ZANWaaZ1Xbk3whdVfVkXGCturTcFkajJQSRP39Ri27cSGxdne98I0DMNHUbItPEGIWZnc4tMWniFWBZXpR8D8MH7R5s9aBudtwUVLcHmIY4FFz1gCdRAkXIjwUzLxWrTgaWYONUnS0ljbLTtIJpFsvYAlS5ubC3BVtfR2Cbff53c_9m7_Es9v7r5d3zzAZ0DBHnXVqgWWrWqtfb3F1_N0bzaCwaD6Z-qJ2T5-abPDTtbpZEed7LiTdTtZr5P1O9nggCU9kjN8yHyF4-mt0kYE503OIhLi9lHQ-CkikXxHIS2NwnEUkxCnCeuRco1DhM04xf-XkzDBOYSnbMXxJlzXA8_OvR5ZU_lTqZ0GMQnfyAsJg0l_OPI833V93_Uno6BHNiR0g74_dB1_PBp745HjOOP3Hnm1DoN-4A0dJxi6g_HAGzl-8P4PPLGuSw)


In [None]:
# @title pip install memex
!pip install -q memex

In [None]:
# @title setup Memex instance and api key

MEMEX_INSTANCE_URL = "https://<YOUR_INSTANCE>.memexdata.com"
MEMEX_API_KEY = "<YOUR_API_KEY>"
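
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming you export `MEMEX_INSTANCE_URL` and `MEMEX_API_KEY` as environment variables yourself. The helper name `read_setting` is hypothetical and not part of the Memex SDK.

```python
import os


def read_setting(name, default=None):
    """Return an environment variable, or fail loudly if it is missing.

    Hypothetical helper for illustration; not part of the Memex SDK.
    """
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value
```

For example, `MEMEX_API_KEY = read_setting("MEMEX_API_KEY")` could replace the literal placeholder above.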

In [None]:
# @title connect to MemexSession
from memex import MemexSession

mx = MemexSession(MEMEX_INSTANCE_URL, api_key=MEMEX_API_KEY, verify_ssl=False)

In [None]:
# @title imports
import os
import json
import zipfile
import requests
from pathlib import Path
from IPython.display import Markdown as md

## Download synthetic patient data and clinicaltrials.gov data from Google Drive

You can view the notebook that uses Synthea to generate the synthetic patient population used in this notebook here: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/atlasfutures/memex-sample-public/blob/main/docs/tutorial/tutorials/synthesize-patient-communications/synthesize-patient-communications.ipynb)


In [None]:
# @title download data and project files

os.makedirs("data", exist_ok=True)


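def download(file):
    DOWNLOAD_DIR = "./"
    BASE_URL = "https://sample.memexdata.com/clinical_trials/"
    response = requests.get(BASE_URL + file, allow_redirects=True)
    # fail early instead of silently writing an error page to disk
    response.raise_for_status()
    # the context manager closes the file handle once the write completes
    with open(f"{DOWNLOAD_DIR}{file}", "wb") as out:
        out.write(response.content)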


# synthetic patient population created with Synthea
download("data/synthea501.zip")

with zipfile.ZipFile("data/synthea501.zip", "r") as zip_ref:
    zip_ref.extractall(".")

# psychiatric clinical trials downloaded from clinicaltrials.gov
download("data/ctg.json")

download("code.zip")

with zipfile.ZipFile("code.zip", "r") as zip_ref:
    zip_ref.extractall(".")

SQL_DIR = "./queries/"
FUNCTION_DIR = "./functions/"
csv_files = [
    Path(os.path.join(root, file))
    for root, _, files in os.walk("synthea501")
    for file in files
    if file.endswith(".csv")
]
sql_files = [
    Path(os.path.join(root, file))
    for root, _, files in os.walk(SQL_DIR)
    for file in files
    if file.endswith(".sql")
]
function_files = [
    Path(os.path.join(root, file))
    for root, _, files in os.walk(FUNCTION_DIR)
    for file in files
    if file.endswith(".json")
]
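
The three list comprehensions above repeat the same "walk a directory and filter by extension" pattern. As a sketch of an equivalent, shorter form, the standard library's `Path.rglob` does the recursive walk directly. The helper name `find_files` is hypothetical.

```python
from pathlib import Path


def find_files(root, suffix):
    """Recursively collect files under `root` whose names end in `suffix`.

    Hypothetical helper; the result is sorted so the order is reproducible.
    """
    return sorted(p for p in Path(root).rglob(f"*{suffix}") if p.is_file())
```

With this helper, `csv_files = find_files("synthea501", ".csv")` would yield the same files as the comprehension above, in sorted order.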

In [None]:
# @title load project files
print("Loading datasets\n-----------------\n")

# Clinical trials data
trialpath = Path("data/ctg.json")
print(f"Clinical trials data:\n- {trialpath.stem}")
with trialpath.open() as file:
    mx.upload_dataset(file, trialpath.name)

print("\nPatient data (OMOP format):")
# Upload synthea files
for fpath in csv_files:
    if fpath.stem not in {"claims_transactions", "observations", "procedures"}:
        print(f"- {fpath.stem}")
        with fpath.open() as file:
            mx.upload_dataset(file, fpath.name)

print("\n\nLoading queries\n---------------")
# Load queries
for querypath in sql_files:
    print(f"- {querypath.stem}")
    with querypath.open() as file:
        mx.save_query(query=file.read(), name=querypath.stem, overwrite=True)

print("\n\nLoading functions\n---------------")
# Load functions
for functionpath in function_files:
    print(f"- {functionpath.stem}")
    with functionpath.open() as file:
        mx.save_function({"function": json.load(file), "overwrite": True})

- devices
- immunizations
- patients

Loading queries
---------------
- join_pts_criteria
- extract_criteria
- match_pts
- eval_patients
- format_patient_data
- summarize_pts
- aggregate_criteria
- patients_nested

Loading functions
---------------
- summarize
- shah_individual_py
- match_pts
- format_patient_history
- extract_criteria_py


## Inspecting project assets

Let's take a look at some of the project assets we just imported.

All of these are also available in the UI.


In [None]:
print(mx.get_function("shah_individual_py")["source"])

In [None]:
print(mx.get_query("eval_patients")["query"])

# Transform patient data for data synthesis

[![](https://mermaid.ink/img/pako:eNp9VN1vmzAQ_1dO3uPSKJAECJsqpU3Sr1T93MM29uA4BryCHWHTNq36v-9wUkKYFB6Qz78P353h3glTS05CEmfqhaW0MPA4iSTgo8tFUtBVCkKuSgNLaugGqJ7x74g8rKVJOY3In93-FPdZJqRgNDOFoJnuJuq5pnC5bJnDTnuN2plipQYVg0mFbmBWw5kRSlboPoBZgFSGL5R62stmDEdHx3BS5WqKkpmy4LCiRnC5radJPrHkU5tEkVMDtObqMs9psd6jn1r6pPK2qHjbeWPuRjX4ddlTFKFqhqrpqykoM2C7BKwQhuOqljTfM3vUWZVZRo3hsqYDlUsMlNbwVwkJRrVyFlzXlhNrc16d_Uyzkpo6YQ00oUJq838eZxvRJji3wQU6jJOk4EnDAqjWXOu8cqvFF5Z_WfWIKey9NgWXiUmrK_zUfT-GbQ-w5yxt5AvjMAwXWcm34QmGeCqX2_i0FU8wVgWVyadguq-ftfGzlsF5m3DRIlzuxyzDoic8ho0IYpFl4Zd45LXglxSb-onGcQu1hlu0F48i2fpD5jypvx-Aq6qZd3O4-zG9_9n4IOe4f3t_c337CF8BCQ3oqlUFzFt1WvvNEn_dpwezzjj0qttST9x28FsbdQ6i7kG0fxAdHESHB1HvIOofRIM9lHRIznEKiCXOxveKGxEcMzmPSIjLRUYZjppIfiCRlkbhLGQkxCHDO6Rc4WzhE0Hx_nISxjgEcZcvBc6E6820tUO3Q1ZU_lJqx8GYhO_klYTBqOv0Pc8fDn1_6I_6QYesSTgMur4zdP1Bf-AN-q7rDj465M069LqB57hu4HhBz-n5juN__AN0h81n?type=png)](https://mermaid-js.github.io/mermaid-live-editor/edit#pako:eNp9VN1vmzAQ_1dO3uPSKJAECJsqpU3Sr1T93MM29uA4BryCHWHTNq36v-9wUkKYFB6Qz78P353h3glTS05CEmfqhaW0MPA4iSTgo8tFUtBVCkKuSgNLaugGqJ7x74g8rKVJOY3In93-FPdZJqRgNDOFoJnuJuq5pnC5bJnDTnuN2plipQYVg0mFbmBWw5kRSlboPoBZgFSGL5R62stmDEdHx3BS5WqKkpmy4LCiRnC5radJPrHkU5tEkVMDtObqMs9psd6jn1r6pPK2qHjbeWPuRjX4ddlTFKFqhqrpqykoM2C7BKwQhuOqljTfM3vUWZVZRo3hsqYDlUsMlNbwVwkJRrVyFlzXlhNrc16d_Uyzkpo6YQ00oUJq838eZxvRJji3wQU6jJOk4EnDAqjWXOu8cqvFF5Z_WfWIKey9NgWXiUmrK_zUfT-GbQ-w5yxt5AvjMAwXWcm34QmGeCqX2_i0FU8wVgWVyadguq-ftfGzlsF5m3DRIlzuxyzDoic8ho0IYpFl4Zd45LXglxSb-onGcQu1hlu0F48i2fpD5jypvx-Aq6qZd3O4-zG9_9n4IOe4f3t_c337CF8BCQ3oqlUFzFt1WvvNEn_dpwezzjj0qttST9x28FsbdQ6i7kG0fxAdHESHB1HvIOofRIM9lHRIznEKiCXOxveKGxEcMzmPSIjLRUYZjppIfiCRlkbhLGQkxCHDO6Rc4WzhE0Hx_nISxjgEcZcvBc6E6820tUO3Q1ZU_lJqx8GYhO_klYTBqOv0Pc8fDn1_6I_6QYesSTgMur4zdP1Bf-AN-q7rDj465M069LqB57hu4HhBz-n5juN__AN0h81n)


A key step in this process is building nested patient data: all of a patient's relevant information is consolidated into one structured record that is convenient to analyze and to match against clinical trial criteria. The models used in later steps consume this structured record when they make their matching predictions.

The `patients_nested` query creates a nested representation of each patient's data, including their medical encounters, conditions, medications, allergies, and other relevant health information. Executing this query and saving the result as a table called `patients_nested` gives the later steps a single table of patient data to work from.
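
To make "nested" concrete, here is a minimal pure-Python sketch of the same idea: flat rows from separate tables are attached to the patient they belong to. The function name and all field names (`id`, `patient`, `description`) are illustrative assumptions, not the real schema or the saved query.

```python
from collections import defaultdict


def nest_by_patient(patients, conditions, medications):
    """Attach each patient's condition and medication rows to the patient record.

    All field names here are illustrative assumptions, not the real schema.
    """
    conds = defaultdict(list)
    for row in conditions:
        conds[row["patient"]].append(row["description"])
    meds = defaultdict(list)
    for row in medications:
        meds[row["patient"]].append(row["description"])
    return [
        {
            **p,
            "conditions": conds.get(p["id"], []),
            "medications": meds.get(p["id"], []),
        }
        for p in patients
    ]
```

Each output record carries its own lists, so one row describes one patient completely.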


In [None]:
patients_nested = mx.get_query("patients_nested")
mx.save_as_table("patients_nested", patients_nested["query"], overwrite=True)

Let's take a look at the table we just saved.

In [None]:
mx.query('SELECT * FROM patients_nested LIMIT 5')

In [None]:
md(f"Because of the nested structure of this table, it is easier to explore in the [Memex UI]({MEMEX_INSTANCE_URL}).")

## Formatting patient data

To format the nested structure of patient data into a text summary for LLM processing, we can write a Python UDF. Let's take a look at the `format_patient_history` function.
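
The actual UDF source is printed by the next cell. As a rough, hypothetical illustration of what such a formatter does, the sketch below renders a nested record as plain text. The function name `format_history` and every field name are assumptions made for this example.

```python
def format_history(patient):
    """Render a nested patient record as plain text for an LLM prompt.

    Hypothetical stand-in for illustration; field names are assumptions.
    """
    lines = [f"Patient {patient['id']} (born {patient['birthdate']})"]
    for section in ("conditions", "medications", "allergies"):
        items = patient.get(section) or []
        lines.append(f"\n{section.capitalize()}:")
        if items:
            lines.extend(f"- {item}" for item in items)
        else:
            # state absence explicitly so the model does not have to guess
            lines.append("- none recorded")
    return "\n".join(lines)
```

Writing out empty sections explicitly is a design choice of this sketch: it distinguishes "no known allergies on record" from "the field was omitted".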

In [None]:
print(mx.get_function("format_patient_history")['source'])

We can now run this function over all the patient data to produce a new table, `pt_histories`.

In [None]:
format_patient_data_query = mx.get_query("format_patient_data")['query']
print(format_patient_data_query)

In [None]:
mx.save_as_table("pt_histories", format_patient_data_query, overwrite=True)

Let's take a look at one of these patient histories.

In [None]:
pt_history = mx.query('SELECT * FROM pt_histories LIMIT 1')['history'].values[0]
md(pt_history)

## Summarize

Now that we have the patient history as one large block of text, we can call an LLM to summarize it.

Let's take a look at the `summarize` prompt function.

In [None]:
print(mx.get_function('summarize')['content'])

Let's try it on a single patient history.

In [None]:
summary = mx.query('select id, summarize(history) as pt_summary from pt_histories limit 1', 
                   temperature=0.1, 
                   max_tokens=2000)['pt_summary'].values[0]
md(summary)

### 💡
> At this point we can go to the Memex UI and check `llm_logs` for more statistics on the LLM calls, e.g. the number of input/output tokens.

Now we can run it on the full set of patients and save the result to the `pt_summaries` table.

In [None]:
# This query uses an LLM call to summarize patients:
# summarize = mx.get_function("summarize")
summarize_pts = mx.get_query("summarize_pts")
mx.save_as_table(
    "pt_summaries",
    summarize_pts["query"],
    model="gpt-3.5-turbo",
    temperature=0.1,
    max_tokens=4000,
    use_cache=True,
    overwrite=True,
)

# Format clinical trials criteria

[![](https://mermaid.ink/img/pako:eNp9VNtymzAQ_ZUd9bGOx2AbMO1kxont3Oxpbn1oSx9kLEANSB4kkjiZ_HsXYWObzKAHRqtz9mh3xe47CeWKEZ9EqXwJE5preJwEAnCpYhnndJ0AF-tCw4pqWgHlGv8JyMNG6ITRgPzdn0_xPEy54CFNdc5pqrqxfK4pTKwa4nDgCycnJ6cwQ4npq85pqMFIQJhzzXB3dNMMSvIFkmcp1ZqJmgZUrNCQSsE_yQVoCWuqORMar80ymnOmjqQWpYgMCwUyAp1wdRCUCZaFmktRoscApg9CaraU8ulzjtV3bOI8K-ul8yLURc7qcExNd35nhnhuYskzqoE2wt7U1HNDnZSaVUJve00MX0vD3ccwMfzLsrDPNC2orukKaEy5UPpzkS8qp8q4NMYVKozjOGfxgQRQpZhSWalWO18Z_nUZYSgxY6VzJmKdlDXc-X0_he0DY7ZhcvgoY9_3l2nBtuYZmngrE1v7vGFP0JY5FfHOYXrsP2viFw2ByybhqkG4PrbDFJOesAgqJ4h4mvpfopHTgF8SLOoOjaIGagS3aC8aBaLRG3MW1z8TwE1ZzLs53P2c3v86-H_neH57_2Nx-whfAQkH0E0jC5g38jTy1Rab9ulBb1IGvfK15BMzFfzWRK1W1G5F-63ooBUdtqJOK-q2ot4RSjokY9h_fIVT8b3kBgT7PGMB8XG7TGmIvR6IDyTSQkucgiHxsbVZhxRr7Gg24RTfLyN-hOMPT9mKY0cuqjlrxm2HrKn4LeWegzbx38kr8b1R1-o7jjscuu7QHfW9DtkQf-h1XWtou4P-wBn0bdsefHTIm1HodT3Hsm3Pcrye1XMty_34D-TjzB0?type=png)](https://mermaid-js.github.io/mermaid-live-editor/edit#pako:eNp9VNtymzAQ_ZUd9bGOx2AbMO1kxont3Oxpbn1oSx9kLEANSB4kkjiZ_HsXYWObzKAHRqtz9mh3xe47CeWKEZ9EqXwJE5preJwEAnCpYhnndJ0AF-tCw4pqWgHlGv8JyMNG6ITRgPzdn0_xPEy54CFNdc5pqrqxfK4pTKwa4nDgCycnJ6cwQ4npq85pqMFIQJhzzXB3dNMMSvIFkmcp1ZqJmgZUrNCQSsE_yQVoCWuqORMar80ymnOmjqQWpYgMCwUyAp1wdRCUCZaFmktRoscApg9CaraU8ulzjtV3bOI8K-ul8yLURc7qcExNd35nhnhuYskzqoE2wt7U1HNDnZSaVUJve00MX0vD3ccwMfzLsrDPNC2orukKaEy5UPpzkS8qp8q4NMYVKozjOGfxgQRQpZhSWalWO18Z_nUZYSgxY6VzJmKdlDXc-X0_he0DY7ZhcvgoY9_3l2nBtuYZmngrE1v7vGFP0JY5FfHOYXrsP2viFw2ByybhqkG4PrbDFJOesAgqJ4h4mvpfopHTgF8SLOoOjaIGagS3aC8aBaLRG3MW1z8TwE1ZzLs53P2c3v86-H_neH57_2Nx-whfAQkH0E0jC5g38jTy1Rab9ulBb1IGvfK15BMzFfzWRK1W1G5F-63ooBUdtqJOK-q2ot4RSjokY9h_fIVT8b3kBgT7PGMB8XG7TGmIvR6IDyTSQkucgiHxsbVZhxRr7Gg24RTfLyN-hOMPT9mKY0cuqjlrxm2HrKn4LeWegzbx38kr8b1R1-o7jjscuu7QHfW9DtkQf-h1XWtou4P-wBn0bdsefHTIm1HodT3Hsm3Pcrye1XMty_34D-TjzB0)


Let's take a look at `extract_criteria_py`, a *prompt function* for extracting criteria.

In [None]:
extract_criteria_py = mx.get_function("extract_criteria_py")['source']
print(extract_criteria_py)


This function takes in the trial's eligibility criteria, which are free-form text that might look like this:
```
Inclusion Criteria:

* Diagnosis of Type 2 Diabetes
* Low physical activity (\<150 minutes/week of moderate to vigorous physical activity)

Exclusion Criteria:

* Cognitive deficits impeding a participant's ability to provide informed consent or participate
* Medical conditions likely to lead to death within 6 months.
* Pre-existing coronary artery disease
* Moderate-severe depression (Patient Health Questionnaire-9 \[PHQ-9\] score ≥15)
* Use of non-basal insulin
* Inability to participate in physical activity due to another medical condition
* Inability to receive text messages
* Inability to read, write, or speak in English
* Current participation in another intervention or program that has been designed to promote well-being or physical activity
```
and turns it into a structured list of `inclusion` and `exclusion` criteria:
```
{
  "inclusion": [
    "Diagnosis of Type 2 Diabetes",
    "Low physical activity (\<150 minutes/week of moderate to vigorous physical activity)",
    ...
  ],
  "exclusion": [
    "Use of non-basal insulin",
    "Cognitive deficits impeding a participant's ability to provide informed consent or participate",
    ...
  ]
}
```

This demonstrates how easy it is to extract structured data from text: you describe what you want and specify the output structure (via a Pydantic data model in Python).
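
The output structure described above could be declared and checked roughly as follows. This sketch uses a standard-library dataclass as a stand-in so that it runs without extra dependencies. The real function relies on a Pydantic model, and `TrialCriteria` and `parse_criteria` are hypothetical names introduced only for this example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TrialCriteria:
    """Structured eligibility criteria (hypothetical stand-in for the Pydantic model)."""

    inclusion: list = field(default_factory=list)
    exclusion: list = field(default_factory=list)


def parse_criteria(raw_json):
    """Parse a JSON string and check that both fields are lists of strings."""
    data = json.loads(raw_json)
    criteria = TrialCriteria(
        inclusion=data.get("inclusion", []),
        exclusion=data.get("exclusion", []),
    )
    for name in ("inclusion", "exclusion"):
        values = getattr(criteria, name)
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"{name} must be a list of strings")
    return criteria
```

Validating the shape up front means a malformed model response fails at the extraction step instead of surfacing later in the matching queries.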

Let's see how this works on a single example.

In [None]:
mx.query("""WITH enroll_details AS (
      SELECT
        __uid__ as id,
        protocolSection.eligibilityModule.eligibilityCriteria as eligibility
      FROM ctg LIMIT 1
  ),
  extracted AS (
      SELECT id, eligibility, extract_criteria_py(eligibility) AS criteria FROM enroll_details
  )
  SELECT 
    * EXCEPT(criteria), 
    UNNEST(criteria.inclusion) as inclusion, UNNEST(criteria.exclusion) as exclusion FROM extracted
""")

In [None]:
md(f"Because of the nested structure of the output, the results are best viewed in the [Memex UI]({MEMEX_INSTANCE_URL}).")

In [None]:
extract_criteria_query = mx.get_query("extract_criteria")['query']
print(extract_criteria_query)

Now we can run this over the eligibility criteria of all 100 trials.

In [None]:
mx.save_as_table(
    "extracted_criteria_100",
    extract_criteria_query,
    model="gpt-3.5-turbo",
    max_tokens=1000,
    temperature=0.1,
    use_cache=True,
    overwrite=True,
)

Next, we cross join the inclusion/exclusion criteria with each patient to create **"patient + individual criterion"** pairs, in preparation for the evaluation in the next stage.
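
The saved query performs this join in SQL. As a plain-Python sketch of what a cross join produces, the hypothetical helper below pairs every patient with every criterion. The column names mirror those used in the queries later in this notebook, but the input shapes are assumptions.

```python
from itertools import product


def pair_patients_with_criteria(patients, criteria):
    """Cross join: one output row per (patient, criterion) combination.

    Input record shapes are illustrative assumptions.
    """
    return [
        {
            "trial_id": c["trial_id"],
            "patient_id": pt["patient_id"],
            "criteria_type": c["criteria_type"],
            "criterion": c["criterion"],
            "pt_summary": pt["pt_summary"],
        }
        for pt, c in product(patients, criteria)
    ]
```

With P patients and C criteria the result has P × C rows, so the size of the evaluation stage grows multiplicatively with both inputs.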

In [None]:
join_pts_criteria = mx.get_query("join_pts_criteria")['query']
print(join_pts_criteria)

In [None]:
mx.save_as_table("pt_criteria_and_summary", join_pts_criteria, overwrite=True)

In [None]:
mx.query("SELECT * FROM pt_criteria_and_summary ORDER BY trial_id, patient_id, criteria_type LIMIT 10")

# Evaluate patient medical history against trial criteria

[![](https://mermaid.ink/img/pako:eNp9VNtymzAQ_ZUd9bGOx2AbMO1kxont3Jxpbn1oSx9kLEANSB4kkjiZ_HsXYUNMZuCB0eqcs9pdafeNhHLNiE-iVD6HCc01PMwCAfipYhXndJMAF5tCw5pqWgHlN_0TkPut0AmjAfnb7M9xP0y54CFNdc5pqvqxfKopTKxbziEQjfoa1QsZFgpkBDrhChrMqFiouRQleghgHCCkZispHw_iaVbncHR0DBd4wjSOcxZTzWBDNWdCA1WKKZXhUh2oL4zmssw1lDkDpXMmYp2UAey134_x-DJTyKgOE6Y-J1v9p8bZSelM50Woi7wJwBR3rzsxxFNTixydAq15qsgymm9r6qmhzkqfBuGvjU8sn5aG28QwM_xz5M-faFp8qIECGlMulIYw55phQvUhZ5WoMuZooLUoXbzonIZ6l_0n2cLIzso0Uqo1EzUFqFijIZWCf5IL0LKVIN9VcXcTU9_3V2nBduYJmniDbA-ftuwZ2jKnIt4L5of6RRs_azk4bxMuWoTLQztM8QHNWASVCCKepv6XaOK04OcE89-jUdRCjcMdOogm--zrTlmyuH5RAFflnd8u4fbn_O7Xh0e7xP2bux_XNw_wFZDwAbpqZQHLVp7GfbXEFn6819uUwaB89fKRmQp-a6NWJ2p3osNOdNSJjjtRpxN1O1HvACU9kjFsQr7GGflWcgOCwyZjAfFxuUppiAMnEO9IpIWWOBND4mN_sx4pNtjWbMYp3l9G_AiHIe6yNce2vK6mrhm-PbKh4reUDQdt4r-RF-J7k741dBx3PHbdsTsZej2yJf7Y67vW2HZHw5EzGtq2PXrvkVfjYdD3HMu2PcvxBtbAtSz3_T_LA897?type=png)](https://mermaid-js.github.io/mermaid-live-editor/edit#pako:eNp9VNtymzAQ_ZUd9bGOx2AbMO1kxont3Jxpbn1oSx9kLEANSB4kkjiZ_HsXYUNMZuCB0eqcs9pdafeNhHLNiE-iVD6HCc01PMwCAfipYhXndJMAF5tCw5pqWgHlN_0TkPut0AmjAfnb7M9xP0y54CFNdc5pqvqxfKopTKxbziEQjfoa1QsZFgpkBDrhChrMqFiouRQleghgHCCkZispHw_iaVbncHR0DBd4wjSOcxZTzWBDNWdCA1WKKZXhUh2oL4zmssw1lDkDpXMmYp2UAey134_x-DJTyKgOE6Y-J1v9p8bZSelM50Woi7wJwBR3rzsxxFNTixydAq15qsgymm9r6qmhzkqfBuGvjU8sn5aG28QwM_xz5M-faFp8qIECGlMulIYw55phQvUhZ5WoMuZooLUoXbzonIZ6l_0n2cLIzso0Uqo1EzUFqFijIZWCf5IL0LKVIN9VcXcTU9_3V2nBduYJmniDbA-ftuwZ2jKnIt4L5of6RRs_azk4bxMuWoTLQztM8QHNWASVCCKepv6XaOK04OcE89-jUdRCjcMdOogm--zrTlmyuH5RAFflnd8u4fbn_O7Xh0e7xP2bux_XNw_wFZDwAbpqZQHLVp7GfbXEFn6819uUwaB89fKRmQp-a6NWJ2p3osNOdNSJjjtRpxN1O1HvACU9kjFsQr7GGflWcgOCwyZjAfFxuUppiAMnEO9IpIWWOBND4mN_sx4pNtjWbMYp3l9G_AiHIe6yNce2vK6mrhm-PbKh4reUDQdt4r-RF-J7k741dBx3PHbdsTsZej2yJf7Y67vW2HZHw5EzGtq2PXrvkVfjYdD3HMu2PcvxBtbAtSz3_T_LA897)


Let's take a look at how we can ask an LLM to do clinical trial matching using the top-performing prompt from the Shah Lab paper [Zero-Shot Clinical Trial Patient Matching with LLMs](https://arxiv.org/pdf/2402.05125.pdf).



In [None]:
shah_individual_py = mx.get_function("shah_individual_py")['source']
print(shah_individual_py)

Let's try it out on some of our data.

In [None]:
mx.query("""
WITH eval AS (
    SELECT
      trial_id
      , patient_id
      , criteria_type
      , criterion
      , pt_summary
      , shah_individual_py(pt_summary, criterion) as assessment
    FROM pt_criteria_and_summary LIMIT 2
)
SELECT * EXCEPT(assessment), assessment.* FROM eval""")

We can see that the LLM made a decision, `is_met`, along with a `confidence` level and a `rationale`.
You can also see the structure of the result more clearly in the Memex UI.

We will now run the evaluation query against a larger set of patients and criteria.

In [None]:
eval_patients_query = mx.get_query("eval_patients")['query']
print(eval_patients_query)

In [None]:
mx.save_as_table(
    "evaled_pts",
    eval_patients_query,
    model="gpt-3.5-turbo",
    max_tokens=2000,
    temperature=0.1,
    use_cache=True,
    overwrite=True,
)

Next, we aggregate all the matching decisions per patient.
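
The aggregation itself is done by a saved SQL query. A minimal Python sketch of the same idea follows, assuming that decisions are grouped by trial and patient and split by criteria type. The function name and the row shape are hypothetical.

```python
from collections import defaultdict


def aggregate_assessments(rows):
    """Group per-criterion decisions by (trial_id, patient_id) and criteria type.

    Assumes each row has trial_id, patient_id, criteria_type, criterion, is_met.
    """
    grouped = defaultdict(lambda: {"inclusion": [], "exclusion": []})
    for row in rows:
        key = (row["trial_id"], row["patient_id"])
        grouped[key][row["criteria_type"]].append(
            {"criterion": row["criterion"], "is_met": row["is_met"]}
        )
    return dict(grouped)
```

Each key then holds everything needed to score one patient against one trial.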

In [None]:
aggregate_criteria_query = mx.get_query("aggregate_criteria")['query']
print(aggregate_criteria_query)

In [None]:
mx.save_as_table("assessed_pts", aggregate_criteria_query, overwrite=True)

In [None]:
mx.query("SELECT * FROM assessed_pts")

In the Memex UI, you can browse the aggregated result and inspect the nested structure for each patient.

Now we calculate the patient match scores for the `inclusion` and `exclusion` criteria as follows:
- `inclusion_score` = `# inclusion met` / `total # of inclusion criteria`
- `exclusion_score` = `# exclusion unmet` / `total # of exclusion criteria`

This is encoded in the `match_pts` Python UDF below.

In [None]:
print(mx.get_function("match_pts")['source'])
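
The cell above prints the real `match_pts` source. As a standalone sketch of the two formulas, the hypothetical function below computes both scores from lists of booleans. Scoring an empty criteria list as 1.0 is an assumption of this sketch, not something the source specifies.

```python
def match_scores(inclusion_met, exclusion_met):
    """Compute (inclusion_score, exclusion_score) from lists of booleans.

    inclusion_met[i] is True when the i-th inclusion criterion is met;
    exclusion_met[i] is True when the i-th exclusion criterion is met.
    An empty list scores 1.0 here, which is an assumption of this sketch.
    """

    def fraction(flags, wanted):
        if not flags:
            return 1.0
        return sum(1 for f in flags if bool(f) == wanted) / len(flags)

    inclusion_score = fraction(inclusion_met, True)   # share of inclusion criteria met
    exclusion_score = fraction(exclusion_met, False)  # share of exclusion criteria unmet
    return inclusion_score, exclusion_score
```

A patient who meets every inclusion criterion and no exclusion criterion scores `(1.0, 1.0)` under both formulas.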

Let's calculate the match scores for our dataset.

In [None]:
match_pts_query = mx.get_query("match_pts")['query']
print(match_pts_query)

In [None]:
mx.save_as_table("pt_trial_match_scores", match_pts_query, overwrite=True)

In [None]:
mx.query("SELECT *, matches.inclusion_score, matches.exclusion_score FROM pt_trial_match_scores ORDER BY matches.inclusion_score DESC, matches.exclusion_score DESC")

In [None]:
md(f"You can explore the full match results in detail in the [Memex UI]({MEMEX_INSTANCE_URL}).")