# SOAP Note Data Pre-Processing

This notebook demonstrates how to pre-process the raw transcript data so that it can be used for prompt evaluation.

In [1]:
import os

from utils.aws import SAGEMAKER_DEFAULT_BUCKET, sagemaker_session
from utils.data.soap_note import process_transcripts, process_transcript_plain


DATA_DIR = "dataset"
RAW_DATA_DIR = os.path.join(DATA_DIR, "raw/")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


The raw transcript data is organized as a set of folders, each containing a `transcript.json` file that follows the Deepgram schema. Each folder also contains either an `.m4a` audio file from which the transcript was generated or a `script.txt` file with the LLM-generated transcript.
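
To make the folder layout concrete, here is a minimal sketch of reading it with the standard library only. It assumes each `transcript.json` uses Deepgram's pre-recorded response layout, where the plain text sits under `results.channels[0].alternatives[0].transcript`. The function names `describe_raw_folder` and `describe_raw_dir` are hypothetical and are not part of `utils.data.soap_note`; the helpers used in this notebook may work differently.

```python
import json
from pathlib import Path


def describe_raw_folder(folder: Path) -> dict:
    """Summarize one raw-data folder (hypothetical helper, for illustration)."""
    with open(folder / "transcript.json", encoding="utf-8") as f:
        response = json.load(f)

    # Assumption: Deepgram pre-recorded layout, first channel and first alternative.
    alternative = response["results"]["channels"][0]["alternatives"][0]

    # The transcript comes either from recorded audio or from an LLM-written script.
    if any(folder.glob("*.m4a")):
        source = "audio"
    elif (folder / "script.txt").exists():
        source = "script"
    else:
        source = "unknown"

    return {
        "name": folder.name,
        "source": source,
        "transcript": alternative["transcript"],
    }


def describe_raw_dir(raw_dir: str) -> list[dict]:
    """Describe every sub-folder of raw_dir that contains a transcript.json file."""
    return [
        describe_raw_folder(path.parent)
        for path in sorted(Path(raw_dir).glob("*/transcript.json"))
    ]
```

Calling `describe_raw_dir(RAW_DATA_DIR)` after the download step would return one dictionary per folder, which is a quick way to check that every folder has a transcript and a known source.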

In [2]:
# download raw data
if not os.path.isdir(RAW_DATA_DIR):
    sagemaker_session.download_data(
        RAW_DATA_DIR,
        SAGEMAKER_DEFAULT_BUCKET,
        "prompt-engineering/soap-notes/dataset/raw"
    )

To use this data for prompt evaluation, we need to create a pandas `DataFrame` where each row contains all the necessary data for a single model invocation. In this case, we only need a transcript to generate the SOAP note.

In [3]:
df = process_transcripts(RAW_DATA_DIR, process_transcript_plain)
df

|   | transcript |
|---|---|
| 0 | Hi, Ava. I'm Doctor. Bennett. It's great to me... |
| 1 | Good morning. I'm Doctor. Chen. You must be Da... |
| 2 | Good morning, Ms. Cooper. I'm Doctor. Bennett.... |
| 3 | Hello, Jasmine. How have you been? Look, doc. ... |
| 4 | Hello, I'm Doctor. Patterson. You must be Jenn... |
| 5 | Good morning. I'm Doctor. Chin. You must be Ka... |
| 6 | Hello Mia, I'm Doctor. Harrison. Please come i... |
| 7 | Good morning, Mrs. Parker. I'm Doctor. Roberts... |
| 8 | Morning Ms. Davis, I'm Doctor. Warren. What br... |
| 9 | Good morning Ms. Wright. How are you today? Gr... |
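
Keeping one row per model invocation means the evaluation step can simply iterate over the frame. The sketch below illustrates the idea on a tiny stand-in frame. The prompt template, the `build_prompts` function, and the two example transcripts are all placeholders invented for this illustration; the actual evaluation prompt and tooling are not shown in this notebook.

```python
import pandas as pd

# Hypothetical template, for illustration only.
PROMPT_TEMPLATE = (
    "Write a SOAP note for the following visit transcript:\n\n{transcript}"
)


def build_prompts(frame: pd.DataFrame) -> list[str]:
    """Render one prompt per row from the `transcript` column."""
    return [PROMPT_TEMPLATE.format(transcript=text) for text in frame["transcript"]]


# Placeholder rows standing in for the real transcripts.
example_df = pd.DataFrame(
    {
        "transcript": [
            "Good morning. What brings you in today?",
            "Hello, how have you been?",
        ]
    }
)

prompts = build_prompts(example_df)
print(len(prompts))  # one prompt per row
```

If a later prompt needs more inputs, they can be added as extra columns without changing the one-row-per-invocation shape.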


After the data is transformed, we can write it to a CSV file and upload it back to S3.

In [4]:
transcripts_plain_path = os.path.join(DATA_DIR, "transcripts-plain.csv")
df.to_csv(transcripts_plain_path, index=False)
transcripts_plain_s3_uri = sagemaker_session.upload_data(
    transcripts_plain_path,
    SAGEMAKER_DEFAULT_BUCKET,
    "prompt-engineering/soap-notes/dataset",
)
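
The `index=False` argument in the cell above matters when the CSV is read back later. The following sketch shows the difference on a small placeholder frame, using an in-memory buffer instead of a file on disk. The `round_trip` helper is hypothetical and exists only for this demonstration.

```python
import io

import pandas as pd

# Placeholder data standing in for the real transcripts.
df_example = pd.DataFrame({"transcript": ["first visit", "second visit"]})


def round_trip(frame: pd.DataFrame, **to_csv_kwargs) -> pd.DataFrame:
    """Write a frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


without_index = round_trip(df_example, index=False)
with_index = round_trip(df_example)  # pandas writes the index by default

print(list(without_index.columns))  # ['transcript']
print(list(with_index.columns))     # ['Unnamed: 0', 'transcript']
```

Writing without the index keeps the uploaded file down to the single `transcript` column, so downstream readers do not have to drop a stray `Unnamed: 0` column.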