# Echo Results Review

This notebook evaluates a fine-tuned language model on an echocardiogram report analysis task using the test split. It loads the model, defines the 19 target labels, runs inference on every test example, and reports the accuracy of the predictions against the true labels for each feature, along with the overall exact-match accuracy (all 19 labels correct for a report).

The model was created by the notebook echo_note_training_final.ipynb.

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set up working directory
import os
WORKING_DIR = '/content/drive/MyDrive/echo_training/'  # Change this to your preferred location
os.makedirs(WORKING_DIR, exist_ok=True)
os.chdir(WORKING_DIR)

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import ast
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [3]:
# Loads the split files created when we trained the model.
#train_df = pd.read_csv('echo_train.csv')
#tune_df = pd.read_csv('echo_tune.csv')
test_df = pd.read_csv('echo_test.csv')

In [4]:
# Making a copy of the test dataframe - so I don't accidentally modify it
test_df_copy = test_df.copy()

In [5]:
test_df_copy = test_df_copy.rename(columns={test_df_copy.columns[0]: 'id_num'})

In [6]:

# ==============================================================================
# SETUP: LOAD MODEL AND DEFINE LABELS
# ==============================================================================

# Define label names
LABEL_NAMES = [
    'LA_cavity', 'RA_dilated', 'LV_systolic', 'LV_cavity',
    'LV_wall', 'RV_cavity', 'RV_systolic', 'AV_stenosis',
    'MV_stenosis', 'TV_regurgitation', 'TV_stenosis',
    'TV_pulm_htn', 'AV_regurgitation', 'MV_regurgitation',
    'RA_pressure', 'LV_diastolic', 'RV_volume_overload',
    'RV_wall', 'RV_pressure_overload'
]

# Load the fine-tuned model
model_path = "final_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

`torch_dtype` is deprecated! Use `dtype` instead!



In [7]:

# ==============================================================================
# INFERENCE FUNCTION
# ==============================================================================

def generate_prediction(text):
    prompt = f"""<start_of_turn>user
Analyze this echocardiogram report and provide assessment values for each cardiac feature. Output should be in the format "feature: value" for each of the 19 features.

Report:
{text}<end_of_turn>
<start_of_turn>model
"""

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():  # Inference only; no gradients needed
        outputs = model.generate(
            **inputs,
            max_new_tokens=300,
            do_sample=False,  # Greedy decoding, so no temperature is passed
            pad_token_id=tokenizer.eos_token_id
        )

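    # Decode only the newly generated tokens. Splitting the full decoded text on
    # "<start_of_turn>model\n" can fail because skip_special_tokens=True may
    # strip that marker from the decoded string.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    model_output = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()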
    return model_output
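
Both inference functions in this notebook build the same prompt string by hand. A small helper could keep the two copies in sync; the sketch below is one way to do it. The name `build_prompt` is hypothetical (it is not defined anywhere in the notebook), and the template text is copied from `generate_prediction`.

```python
def build_prompt(text: str) -> str:
    """Wrap a report in the chat-turn template used by generate_prediction.

    Hypothetical helper: the function name is an assumption; the template
    text is copied from the notebook's prompt.
    """
    instruction = (
        "Analyze this echocardiogram report and provide assessment values "
        "for each cardiac feature. Output should be in the format "
        '"feature: value" for each of the 19 features.'
    )
    return (
        "<start_of_turn>user\n"
        f"{instruction}\n\n"
        "Report:\n"
        f"{text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


# Made-up report text, for illustration only.
example_prompt = build_prompt("Normal LV size and function.")
print(example_prompt)
```

With a helper like this, the single-item and batch functions could both call `build_prompt(text)` instead of repeating the f-string.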


In [8]:
# ==============================================================================
# PARSE PREDICTIONS - SIMPLER VERSION
# ==============================================================================

def parse_prediction(pred_text):
    """Extract predicted label values from model output text"""
    predicted = []
    lines = pred_text.split('\n')

    for label_name in LABEL_NAMES:
        found = False
        for line in lines:
            if label_name in line and ':' in line:
                try:
                    # Get text after colon, remove spaces, convert to int
                    value_str = line.split(':')[1].strip()
                    value = int(value_str)
                    predicted.append(value)
                    found = True
                    break
                except ValueError:
                    # Value is not an integer; keep looking on later lines
                    pass

        if not found:
            predicted.append(None)

    return predicted
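
To make the expected output format concrete, the sketch below runs a standalone version of the same line-by-line parsing on a made-up response. The three-label list and the sample text are illustrative assumptions, not real model output. As in `parse_prediction`, a label that is missing or not an integer comes back as `None`.

```python
from typing import List, Optional

# Shortened label list for illustration only (the notebook uses 19 labels).
DEMO_LABELS = ["LA_cavity", "RA_dilated", "LV_systolic"]


def parse_demo(pred_text: str, label_names: List[str]) -> List[Optional[int]]:
    """Return one int per label, or None when the label is missing or not an integer."""
    parsed: List[Optional[int]] = []
    lines = pred_text.split("\n")
    for name in label_names:
        value = None
        for line in lines:
            if name in line and ":" in line:
                try:
                    value = int(line.split(":")[1].strip())
                    break
                except ValueError:
                    continue  # Not an integer; keep looking on later lines
        parsed.append(value)
    return parsed


# Made-up response: the second value is not an integer and the third label is absent.
sample = "LA_cavity: -2\nRA_dilated: unknown"
print(parse_demo(sample, DEMO_LABELS))  # [-2, None, None]
```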

In [9]:
# ==============================================================================
# BATCH INFERENCE FUNCTION - OPTIMIZED VERSION
# ==============================================================================

def generate_predictions_batch(texts, batch_size=16):
    """
    Generate predictions for a batch of texts at once.

    Args:
        texts: List of echo text strings to process; all of them are run as one batch
        batch_size: Not used inside this function. The caller sets the batch size
            by how many texts it passes in (pass fewer if GPU memory is limited)

    Returns:
        List of prediction strings
    """
    # Create prompts for all texts in batch
    prompts = []
    for text in texts:
        prompt = f"""<start_of_turn>user
Analyze this echocardiogram report and provide assessment values for each cardiac feature. Output should be in the format "feature: value" for each of the 19 features.

Report:
{text}<end_of_turn>
<start_of_turn>model
"""
        prompts.append(prompt)

    # Tokenize all prompts at once with padding
    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=2048  # Adjust if your reports are longer
    ).to(model.device)

    # Generate for entire batch
    with torch.no_grad():  # Saves memory
        outputs = model.generate(
            **inputs,
            max_new_tokens=300,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )

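    # Decode only the newly generated tokens for each item. Every generated
    # sequence starts with the padded prompt, so skip the first prompt_len tokens
    # instead of splitting on "<start_of_turn>model\n", which
    # skip_special_tokens=True may strip from the decoded text.
    prompt_len = inputs["input_ids"].shape[1]
    predictions = []
    for output in outputs:
        model_output = tokenizer.decode(output[prompt_len:], skip_special_tokens=True).strip()
        predictions.append(model_output)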

    return predictions
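
When prompts of different lengths are generated together, decoder-only models are usually padded on the left, so that every prompt ends at the same position and the new tokens follow it directly. The toy sketch below uses plain Python lists instead of tensors to show why slicing at the padded prompt length isolates the generated tokens. The token ids and the `PAD` value are made up for illustration. Whether this notebook's tokenizer pads on the left depends on its configuration, which is not shown here.

```python
PAD = 0  # Made-up pad id for illustration


def left_pad(sequences, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]


# Two "prompts" of different lengths (made-up token ids).
prompts = [[11, 12, 13], [21]]
padded = left_pad(prompts)

# Pretend generation appended two new tokens to each padded prompt.
generated = [row + new for row, new in zip(padded, [[91, 92], [93, 94]])]

# Everything after the padded prompt length is newly generated.
prompt_len = len(padded[0])
new_tokens = [row[prompt_len:] for row in generated]

print(padded)      # [[11, 12, 13], [0, 0, 21]]
print(new_tokens)  # [[91, 92], [93, 94]]
```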



In [12]:

# ==============================================================================
# RUN BATCH INFERENCE ON ALL TEST EXAMPLES
# ==============================================================================

results = []
batch_size = 16  # Reduce to 8 if you run out of GPU memory; try 32 if memory allows

print(f"Running BATCH inference on {len(test_df_copy)} test examples...")
print(f"Batch size: {batch_size}")
print(f"Estimated time: ~{(len(test_df_copy) / batch_size) * 7.27 / 3600:.1f} hours (vs 13+ hours single)")

# Process in batches
for start_idx in tqdm(range(0, len(test_df_copy), batch_size)):
    end_idx = min(start_idx + batch_size, len(test_df_copy))
    batch_df = test_df_copy.iloc[start_idx:end_idx]

    # Get batch data
    batch_texts = batch_df['text'].tolist()
    batch_ids = batch_df['id_num'].tolist()
    batch_true_labels = batch_df['labels'].tolist()

    # Generate predictions for entire batch at once
    batch_predictions = generate_predictions_batch(batch_texts, batch_size)

    # Store results for each item in batch
    for idx, text, true_labels_raw, id_num, pred_text in zip(
        range(start_idx, end_idx), batch_texts, batch_true_labels, batch_ids, batch_predictions
    ):
        # Parse true labels
        if isinstance(true_labels_raw, str):
            true_labels = ast.literal_eval(true_labels_raw)
        else:
            true_labels = true_labels_raw

        # Store result
        result = {
            'idx': idx,
            'id_num': id_num,
            'echo_text': text,
            'true_labels': true_labels,
            'prediction_text': pred_text
        }

        results.append(result)

# Convert to DataFrame
results_df = pd.DataFrame(results)

# Save results
results_df.to_csv('/content/drive/MyDrive/echo_training/test_inference_results_batch_run.csv', index=False)
print(f"\nSaved results to test_inference_results_batch_run.csv")
print(f"Shape: {results_df.shape}")
print(f"\n🚀 Batch processing complete! This was much faster than single-item processing!")

Running BATCH inference on 6608 test examples...
Batch size: 16
Estimated time: ~0.8 hours (vs 13+ hours single)


100%|██████████| 413/413 [1:12:35<00:00, 10.55s/it]



Saved results to test_inference_results.csv
Shape: (6608, 5)

🚀 Batch processing complete! This was much faster than single-item processing!
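
The progress bar above gives enough information to check the run time by hand: the number of batches is the example count divided by the batch size, rounded up, and the total time is that count multiplied by the seconds per batch. The sketch below redoes that arithmetic with the figures from the output (6608 examples, batch size 16, 10.55 s per batch). The 7.27 s figure is the constant used in the notebook's own estimate.

```python
import math

n_examples = 6608
batch_size = 16
n_batches = math.ceil(n_examples / batch_size)

estimated_hours = n_batches * 7.27 / 3600   # Constant from the notebook's estimate
observed_hours = n_batches * 10.55 / 3600   # Rate reported by the progress bar

print(n_batches)                  # 413
print(f"{estimated_hours:.2f}")   # 0.83
print(f"{observed_hours:.2f}")    # 1.21
```

The observed rate gives about 1.2 hours, which agrees with the 1:12:35 shown by the progress bar and is roughly 45% longer than the printed estimate.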


In [13]:

# ==============================================================================
# PART 1: LABEL DISTRIBUTION IN TEST SET
# ==============================================================================

print("\n" + "="*70)
print("LABEL DISTRIBUTION IN TEST SET")
print("="*70)

for i, label_name in enumerate(LABEL_NAMES):
    print(f"\n{label_name}:")

    # Extract the i-th value from each label list (parse string-encoded lists first)
    label_values = [
        (ast.literal_eval(raw) if isinstance(raw, str) else raw)[i]
        for raw in test_df_copy['labels']
    ]

    # Count values
    value_counts = pd.Series(label_values).value_counts().sort_index()
    null_count = pd.Series(label_values).isna().sum()
    total = len(label_values)

    for value, count in value_counts.items():
        pct = (count/total)*100
        print(f"  {value:>3}: {count:>5} ({pct:>5.1f}%)")
    if null_count > 0:
        pct = (null_count/total)*100
        print(f"  Null: {null_count:>5} ({pct:>5.1f}%)")


LABEL DISTRIBUTION IN TEST SET

LA_cavity:
  -50:    91 (  1.4%)
   -3:    10 (  0.2%)
   -2:   486 (  7.4%)
    0:  4379 ( 66.3%)
    1:  1076 ( 16.3%)
    2:   565 (  8.6%)
    3:     1 (  0.0%)

RA_dilated:
    0:  4563 ( 69.1%)
    1:  2045 ( 30.9%)

LV_systolic:
  -50:    30 (  0.5%)
   -3:    38 (  0.6%)
   -2:   123 (  1.9%)
   -1:   262 (  4.0%)
    0:  4719 ( 71.4%)
    1:   452 (  6.8%)
    2:   385 (  5.8%)
    3:   599 (  9.1%)

LV_cavity:
  -50:     8 (  0.1%)
   -3:    28 (  0.4%)
   -2:    31 (  0.5%)
   -1:   138 (  2.1%)
    0:  5806 ( 87.9%)
    1:   232 (  3.5%)
    2:   292 (  4.4%)
    3:    73 (  1.1%)

LV_wall:
  -50:    11 (  0.2%)
   -3:    26 (  0.4%)
   -2:    30 (  0.5%)
    0:  4418 ( 66.9%)
    1:  1768 ( 26.8%)
    2:   280 (  4.2%)
    3:    75 (  1.1%)

RV_cavity:
  -50:    41 (  0.6%)
   -3:   189 (  2.9%)
   -2:   311 (  4.7%)
   -1:    23 (  0.3%)
    0:  5240 ( 79.3%)
    1:   491 (  7.4%)
    2:   313 (  4.7%)

RV_systolic:
  -50:    48 (  0.7%)
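
The same per-label tally can also be produced in a single pass over the rows with `collections.Counter`. The sketch below is a standalone illustration on three made-up label lists with two labels each; it does not use the notebook's data, and the `demo_` names are hypothetical.

```python
from collections import Counter

demo_label_names = ["LA_cavity", "RA_dilated"]  # Shortened for illustration
demo_rows = [[0, 1], [0, 0], [-2, 1]]           # Made-up label lists

# One Counter per label position, filled in a single pass over the rows.
counters = [Counter() for _ in demo_label_names]
for row in demo_rows:
    for counter, value in zip(counters, row):
        counter[value] += 1

# Print in the same layout as the distribution report.
for name, counter in zip(demo_label_names, counters):
    print(f"\n{name}:")
    for value in sorted(counter):
        pct = counter[value] / len(demo_rows) * 100
        print(f"  {value:>3}: {counter[value]:>5} ({pct:>5.1f}%)")
```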
 

In [14]:

# ==============================================================================
# PART 2: CALCULATE ACCURACY
# ==============================================================================

# Parse all predictions
print("\n\nParsing predictions...")
results_df['pred_labels'] = results_df['prediction_text'].apply(parse_prediction)

print("\n" + "="*70)
print("ACCURACY BY LABEL")
print("="*70)

accuracy_results = []

for i, label_name in enumerate(LABEL_NAMES):
    # Extract true values (i-th element from true_labels list)
    true_vals = results_df['true_labels'].apply(lambda x: x[i] if i < len(x) else None).values

    # Extract predicted values
    pred_vals = results_df['pred_labels'].apply(lambda x: x[i] if x and i < len(x) else None).values

    # Remove any None predictions
    valid_mask = ~pd.isna(pred_vals)
    true_vals_valid = true_vals[valid_mask]
    pred_vals_valid = pred_vals[valid_mask]

    # Calculate accuracy
    correct = (true_vals_valid == pred_vals_valid).sum()
    total = len(true_vals_valid)
    accuracy = correct / total if total > 0 else 0

    # Count unparseable predictions
    unparseable = (~valid_mask).sum()

    accuracy_results.append({
        'label': label_name,
        'correct': correct,
        'total': total,
        'accuracy': accuracy,
        'unparseable': unparseable
    })

    print(f"\n{label_name}:")
    print(f"  Correct: {correct}/{total} = {accuracy:.3f}")
    if unparseable > 0:
        print(f"  Unparseable: {unparseable}")




Parsing predictions...

ACCURACY BY LABEL

LA_cavity:
  Correct: 6596/6607 = 0.998
  Unparseable: 1

RA_dilated:
  Correct: 6607/6607 = 1.000
  Unparseable: 1

LV_systolic:
  Correct: 6598/6607 = 0.999
  Unparseable: 1

LV_cavity:
  Correct: 6604/6607 = 1.000
  Unparseable: 1

LV_wall:
  Correct: 6597/6607 = 0.998
  Unparseable: 1

RV_cavity:
  Correct: 6605/6607 = 1.000
  Unparseable: 1

RV_systolic:
  Correct: 6601/6607 = 0.999
  Unparseable: 1

AV_stenosis:
  Correct: 6599/6607 = 0.999
  Unparseable: 1

MV_stenosis:
  Correct: 6602/6607 = 0.999
  Unparseable: 1

TV_regurgitation:
  Correct: 6602/6607 = 0.999
  Unparseable: 1

TV_stenosis:
  Correct: 6606/6607 = 1.000
  Unparseable: 1

TV_pulm_htn:
  Correct: 6603/6607 = 0.999
  Unparseable: 1

AV_regurgitation:
  Correct: 6579/6607 = 0.996
  Unparseable: 1

MV_regurgitation:
  Correct: 6581/6607 = 0.996
  Unparseable: 1

RA_pressure:
  Correct: 6607/6607 = 1.000
  Unparseable: 1

LV_diastolic:
  Correct: 6598/6607 = 0.999
  Unpars
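
The per-label accuracy above excludes predictions that could not be parsed, which is why the denominators are 6607 rather than 6608. The sketch below contrasts that convention with a stricter one that counts an unparseable prediction as wrong. The values are made up and the function name `label_accuracy` is hypothetical.

```python
def label_accuracy(true_vals, pred_vals, count_missing_as_wrong=False):
    """Accuracy for one label; None in pred_vals marks an unparseable prediction."""
    pairs = list(zip(true_vals, pred_vals))
    if not count_missing_as_wrong:
        # Notebook convention: drop unparseable predictions from the denominator.
        pairs = [(t, p) for t, p in pairs if p is not None]
    if not pairs:
        return 0.0
    correct = sum(1 for t, p in pairs if t == p)
    return correct / len(pairs)


true_vals = [0, 1, 2, 0]
pred_vals = [0, 1, None, 1]  # Made-up predictions, one unparseable

lenient = label_accuracy(true_vals, pred_vals)                              # 2 of 3
strict = label_accuracy(true_vals, pred_vals, count_missing_as_wrong=True)  # 2 of 4
print(f"{lenient:.3f} {strict:.3f}")  # 0.667 0.500
```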

In [15]:
# Create accuracy summary DataFrame
accuracy_df = pd.DataFrame(accuracy_results)

# Overall exact match accuracy
exact_matches = sum(
    true == pred
    for true, pred in zip(results_df['true_labels'], results_df['pred_labels'])
)
print("\n" + "="*70)
print(f"EXACT MATCH (all 19 labels correct): {exact_matches}/{len(results_df)} = {exact_matches/len(results_df):.3f}")
print("="*70)

# Save accuracy results
accuracy_df.to_csv('label_accuracy.csv', index=False)
print("\nAccuracy results saved to label_accuracy.csv")

# Display summary
print("\nACCURACY SUMMARY:")
print(accuracy_df.to_string(index=False))


EXACT MATCH (all 19 labels correct): 6481/6608 = 0.981

Accuracy results saved to label_accuracy.csv

ACCURACY SUMMARY:
               label  correct  total  accuracy  unparseable
           LA_cavity     6596   6607  0.998335            1
          RA_dilated     6607   6607  1.000000            1
         LV_systolic     6598   6607  0.998638            1
           LV_cavity     6604   6607  0.999546            1
             LV_wall     6597   6607  0.998486            1
           RV_cavity     6605   6607  0.999697            1
         RV_systolic     6601   6607  0.999092            1
         AV_stenosis     6599   6607  0.998789            1
         MV_stenosis     6602   6607  0.999243            1
    TV_regurgitation     6602   6607  0.999243            1
         TV_stenosis     6606   6607  0.999849            1
         TV_pulm_htn     6603   6607  0.999395            1
    AV_regurgitation     6579   6607  0.995762            1
    MV_regurgitation     6581   6607  0
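
Exact match is stricter than per-label accuracy because a single wrong or unparseable label fails the whole example. A minimal sketch on made-up label lists (not the notebook's data) shows the gap between the two measures:

```python
true_rows = [[0, 1, 2], [0, 0, 0], [1, 1, 1]]
pred_rows = [[0, 1, 2], [0, 0, 1], [1, None, 1]]  # None marks an unparseable label

# An example counts only if every label matches.
exact_matches = sum(t == p for t, p in zip(true_rows, pred_rows))
exact_match_rate = exact_matches / len(true_rows)

# Accuracy over all individual label positions, for comparison.
flat_pairs = [(t, p) for tr, pr in zip(true_rows, pred_rows) for t, p in zip(tr, pr)]
per_label_rate = sum(t == p for t, p in flat_pairs) / len(flat_pairs)

print(exact_matches, round(exact_match_rate, 3))  # 1 0.333
print(round(per_label_rate, 3))                   # 0.778
```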