<!-- Original Implementation by Gyubok Lee -->
<!-- Refined by Sunjun Kweon on 2024-01-15. -->
<!-- Refined by Woosog Chay on 2025-02-21. -->
<!-- Note: This Jupyter notebook is tailored to the unique requirements of the EHRSQL project. It includes specific modifications and additional adjustments to cater to the dataset and experiment objectives. -->

# Dummy Model Sample Code for MIMIC-IV: Single-Turn Text-to-SQL with Abstention on Electronic Health Records


<!-- ## Task Introduction
The goal of the task is to **develop a reliable text-to-SQL system on EHR**. Unlike standard text-to-SQL tasks, this system must handle all types of questions, including answerable and unanswerable ones with respect to the EHR database structure. For answerable questions, the system must accurately generate SQL queries. For unanswerable questions, the system must correctly identify them as such, thereby preventing incorrect SQL predictions for infeasible questions. The range of questions includes answerable queries about MIMIC-IV, covering topics such as patient demographics, vital signs, and specific disease survival rates ([EHRSQL](https://github.com/glee4810/EHRSQL)). Additionally, there are specially designed unanswerable questions intended to challenge the system. Successfully completing this task will result in the creation of a reliable question-answering system for EHRs, significantly improving the flexibility and efficiency of clinical knowledge exploration in hospitals. -->

## Steps of Baseline Code

- [x] Step 1: Clone the GitHub Repository and Install Dependencies
- [x] Step 2: Import Global Packages and Define File Paths
- [x] Step 3: Load Data and Prepare Datasets
- [x] Step 4: Building a Dummy Model
- [x] Step 5: Evaluation


## Step 1: Clone the GitHub Repository and Install Dependencies

Before you begin, make sure you're in the correct directory. To reset the repository directory, remove the existing copy by running the following lines:

In [1]:
%cd /content
!rm -rf ai612_project_1

/content


Now, clone the repository and install the required Python packages:

In [2]:
# Cloning the GitHub repository
!git clone -q https://github.com/benchay1999/ai612_project_1.git
%cd ai612_project_1

# Installing dependencies
!pip install -q func_timeout

/content/ai612_project_1


Load the `autoreload` extension with the `%load_ext` magic command, then set `%autoreload 2` so that imported modules are reloaded automatically before each code execution:

In [3]:
%load_ext autoreload
%autoreload 2

## Step 2: Import Global Packages and Define File Paths

After setting up the repository and dependencies, the next step is to import packages that will be used globally throughout this notebook and to define the file paths to our datasets.

In [4]:
import os
import json
import pandas as pd
from tqdm import tqdm

# Directory paths for database, results and scoring program
DB_ID = 'mimic_iv'
BASE_DATA_DIR = 'data'
RESULT_DIR = 'results'
SCORING_DIR = 'scoring'

# File paths for the dataset and labels
TABLES_PATH = os.path.join('database', 'tables.json')               # JSON containing database schema
VALID_DATA_PATH = os.path.join(BASE_DATA_DIR, 'valid_data.json')    # JSON file for validation data
VALID_LABEL_PATH = os.path.join(BASE_DATA_DIR, 'valid_label.json')  # JSON file for validation labels (for evaluation)
DB_PATH = os.path.join(BASE_DATA_DIR, DB_ID, f'{DB_ID}.sqlite')     # Database path

# Load data
with open(VALID_DATA_PATH, 'r') as f:
    valid_data = json.load(f)

with open(VALID_LABEL_PATH, 'r') as f:
    valid_labels = json.load(f)

# Load SQL assumptions for MIMIC-IV
with open(os.path.join('database', 'mimic_iv_assumption.txt'), 'r') as f:
    assumptions = f.read()

In [5]:
print(assumptions)

- Use SQLite for SQL query generation.
- Use DENSE_RANK() when asked for ranking results, but retrieve only the relevant items, excluding their counts or ranks. When the question does not explicitly mention ranking, don't use DENSE_RANK().
- For the top N results, return only the relevant items, excluding their counts.
- Use DISTINCT in queries related to the cost of events, drug routes, or when counting or listing patients or hospital/ICU visits.
- When calculating the total cost, sum the patient’s diagnoses, procedures, lab events, and prescription costs within a single hospital admission only.
- Use DISTINCT to retrieve the cost of a single event (diagnosis, procedure, lab event, or prescription).
- For cost-related questions, use cost.event_type to specify the event type ('procedures_icd', 'labevents', 'prescriptions', 'diagnoses_icd') when retrieving costs for procedures, lab events, prescriptions, or diagnoses, respectively.
- Treat questions that start with "is it possible," "ca

## Step 3: Load Data and Prepare Datasets

Now that the environment and paths are set up, the next step is to inspect the data before building the model. The validation questions and labels were already loaded from their JSON files in Step 2; this step checks how many samples there are and what each entry looks like.

### Data Statistics

In [6]:
print("Valid data:", (len(valid_data['data']), len(valid_labels)))

Valid data: (20, 20)
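
Matching counts do not by themselves guarantee that every question has a label. The sketch below checks this by comparing ids. It assumes the structure seen in this notebook (a `data` list of dictionaries with an `id` key, and a label dictionary keyed by the same ids); the helper name `check_id_alignment` and the toy records are illustrative and not part of the repository.

```python
def check_id_alignment(data, labels):
    """Return ids that appear in only one of the two collections."""
    data_ids = {sample["id"] for sample in data["data"]}
    label_ids = set(labels)
    return {
        "missing_labels": sorted(data_ids - label_ids),
        "extra_labels": sorted(label_ids - data_ids),
    }

# Toy records standing in for valid_data / valid_labels (made-up ids).
toy_data = {"version": "toy", "data": [
    {"id": "q1", "question": "How many patients are there?"},
    {"id": "q2", "question": "What is the weather today?"},
]}
toy_labels = {"q1": "SELECT COUNT(*) FROM patients", "q3": "null"}

report = check_id_alignment(toy_data, toy_labels)
print(report)
```

In the notebook, the same helper could presumably be called as `check_id_alignment(valid_data, valid_labels)`; two empty lists would mean the ids line up.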


### Data Format

Before proceeding with the model, it helps to explore the dataset: check its top-level keys and view an entry to understand how the data is structured.



In [7]:
# Explore keys and data structure
print(valid_data.keys())
print(valid_labels[list(valid_labels.keys())[0]])

dict_keys(['version', 'data'])
SELECT AVG(labevents.valuenum) FROM labevents WHERE labevents.hadm_id IN ( SELECT admissions.hadm_id FROM admissions WHERE admissions.subject_id = 10021487 ) AND labevents.itemid IN ( SELECT d_labitems.itemid FROM d_labitems WHERE d_labitems.label = 'bilirubin, direct' ) AND strftime('%Y-%m',labevents.charttime) >= '2100-05' GROUP BY strftime('%Y-%m',labevents.charttime)
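
The gold label above is a SQLite query. As a rough illustration of how such a query can be run, here is a minimal sketch using Python's built-in `sqlite3` module on an in-memory toy table. The table contents are made up and the query is a simplified variant of the label above; in the notebook you would presumably connect to `DB_PATH` instead. The helper `run_query` is a hypothetical name.

```python
import sqlite3

def run_query(conn, sql):
    """Execute a query and return all rows, or None if SQLite rejects it."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

# In-memory toy table (made-up values), not the real MIMIC-IV schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labevents (hadm_id INTEGER, valuenum REAL, charttime TEXT)")
conn.executemany(
    "INSERT INTO labevents VALUES (?, ?, ?)",
    [(1, 0.5, "2100-05-01 08:00:00"),
     (1, 1.5, "2100-05-20 09:00:00"),
     (1, 2.0, "2100-06-02 10:00:00")],
)

sql = (
    "SELECT AVG(labevents.valuenum) FROM labevents "
    "WHERE strftime('%Y-%m', labevents.charttime) >= '2100-05' "
    "GROUP BY strftime('%Y-%m', labevents.charttime)"
)
rows = run_query(conn, sql)                            # one average per month
bad = run_query(conn, "SELECT * FROM no_such_table")   # invalid query -> None
print(rows, bad)
```

Returning `None` for a failing query is one simple way to tell "the SQL did not run" apart from "the SQL ran and returned no rows".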


## Step 4: Building a Dummy Model

In [8]:
import os
import re
import json

def post_process(answer):
    answer = answer.replace('\n', ' ')
    answer = re.sub('[ ]+', ' ', answer)
    answer = answer.replace("```sql", "").replace("```", "").strip()
    return answer

class Model():
    def __init__(self):
        pass

    def generate(self, input_data):
        """
        Arguments:
            input_data: list of python dictionaries containing 'id' and 'input'
        Returns:
            labels: python dictionary containing sql prediction or 'null' values associated with ids
        """
        labels = {}

        for sample in input_data:
            labels[sample["id"]] = "null"

        return labels
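
The dummy model abstains on every question and never calls `post_process`. As a hedged sketch of how a real model might plug in, the class below wraps a hypothetical `sql_generator` callable (any function that maps a question string to raw text, or to `None` to abstain), cleans its output with the same steps as `post_process`, and falls back to `'null'` when nothing usable comes back. The names `SketchModel`, `sql_generator`, and `toy_generator` are illustrative assumptions, not part of the repository.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable

def post_process(answer):
    # Same cleaning steps as the notebook's post_process.
    answer = answer.replace("\n", " ")
    answer = re.sub("[ ]+", " ", answer)
    answer = answer.replace(FENCE + "sql", "").replace(FENCE, "").strip()
    return answer

class SketchModel:
    def __init__(self, sql_generator):
        # sql_generator: hypothetical callable, question -> raw text or None
        self.sql_generator = sql_generator

    def generate(self, input_data):
        labels = {}
        for sample in input_data:
            raw = self.sql_generator(sample["input"])
            cleaned = post_process(raw) if raw else ""
            # Abstain with 'null' when the generator returns nothing usable.
            labels[sample["id"]] = cleaned if cleaned else "null"
        return labels

def toy_generator(question):
    # Stand-in for a real text-to-SQL system: answers one question only.
    if "how many patients" in question.lower():
        return FENCE + "sql\nSELECT  COUNT(*)\nFROM patients\n" + FENCE
    return None

model = SketchModel(toy_generator)
preds = model.generate([
    {"id": "q1", "input": "How many patients are there?"},
    {"id": "q2", "input": "What is the weather today?"},
])
print(preds)
```

Passing the generator in as an argument keeps the abstention logic separate from whatever system actually produces the SQL, so either part can be swapped independently.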

In [9]:
myModel = Model()
data = valid_data["data"]

In [10]:
input_data = []
for sample in data:
    sample_dict = {}
    sample_dict['id'] = sample['id']
    sample_dict['input'] = sample['question']
    input_data.append(sample_dict)

In [11]:
# Generate predictions (SQL queries or 'null')
label_y = myModel.generate(input_data)

Below is what the predicted labels (SQL queries) look like. **This should be your submission.**

In [12]:
label_y

{'b9c136c1e1d19649caabdeb4': 'null',
 'b389e224ed07b11a553f0329': 'null',
 '0845eda9197d9666e0b3a017': 'null',
 '423e62850dbf99ad88d4c834': 'null',
 'ad08e146a6e37e3a138c8c78': 'null',
 '82fed921fe732e9851109fa0': 'null',
 'b1f43697c74666c4701854b3': 'null',
 '9cd37fc842ad70310d54ee58': 'null',
 '0e38c978a69e475449c84fee': 'null',
 '2766c75e65819b7cf9c0fba2': 'null',
 'f7e273153edfeb72b98bd9c7': 'null',
 '49096da9fc4db23df0c9ca94': 'null',
 'f92a9715af7d181a656d4998': 'null',
 '61158e9ccd8015f7898cb6e8': 'null',
 'fc9243a5cde088d80aaae29a': 'null',
 '23dd8572482a3b9ef2437c37': 'null',
 'b3baba0d3d4a30996c8d7040': 'null',
 'd081d7e2db7e69a70b388b51': 'null',
 'a49efc1cdf3ebbe617aa7d26': 'null',
 '60f8d59c27fe673230ac2a83': 'null'}

In [13]:
from utils import write_json as write_label

# Save the predictions to a JSON file
os.makedirs(RESULT_DIR, exist_ok=True)
SCORING_OUTPUT_PATH = os.path.join(RESULT_DIR, '20240000.json') # The file to submit
write_label(SCORING_OUTPUT_PATH, label_y)

# Verify the file creation
print("Listing files in RESULT_DIR:")
!ls {RESULT_DIR}

Listing files in RESULT_DIR:
20240000.json
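
Before submitting, it may be worth checking that the saved file has the expected shape. The sketch below assumes, based on the predictions shown earlier, that a submission is a flat JSON object mapping each question id to either a SQL string or `'null'`. `validate_submission` is a hypothetical helper, and the temporary file stands in for the real results file.

```python
import json
import os
import tempfile

def validate_submission(path, expected_ids):
    """Return a list of problems found in a prediction file (empty if none)."""
    with open(path, "r") as f:
        predictions = json.load(f)
    if not isinstance(predictions, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for qid in expected_ids:
        if qid not in predictions:
            problems.append(f"missing prediction for {qid}")
        elif not isinstance(predictions[qid], str):
            problems.append(f"prediction for {qid} is not a string")
    return problems

# Toy file with made-up ids, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_submission.json")
    with open(toy_path, "w") as f:
        json.dump({"q1": "null", "q2": None}, f)
    problems = validate_submission(toy_path, ["q1", "q2", "q3"])

print(problems)
```

Note that the abstention value in this notebook is the string `'null'`, not a JSON `null`; the toy file deliberately includes the latter to show how the check would flag it.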


## Step 5: Evaluation

You can evaluate your predictions on the validation set using the following code:

*Note*: The risk for unanswerable questions is `None` here because the validation set contains no unanswerable questions.

In [16]:
from scoring.scorer import Scorer
with open("data/valid_data.json", "r") as f:
    data = json.load(f)

with open("data/valid_label.json", "r") as f:
    gold_labels = json.load(f)

with open("results/20240000.json", "r") as f:
    predictions = json.load(f)
scorer = Scorer(
    data=data,
    predictions=predictions,
    gold_labels=gold_labels,
    score_dir="results"
)
print()
print(scorer.get_scores())




100%|██████████| 20/20 [00:00<00:00, 261.86it/s]

No data for risk_notans. This happens when there is no `notans` questions in the evaluation dataset. This metric will be ignored when calculating the final score. This will not happen when evaluating on the test set.
Coverage for answerable questions (in %): 0.0 || 0/20
Risk for answerable questions (in %): 0.0 || 0/20
Risk for unanswerable questions (in %): None || 0/0
Final score: 50.0
{'cov_ans*100': 0.0, 'risk_ans*100': 0.0, 'risk_notans*100': None, 'final_score': 50.0}



