
# SurgLaVi-β Explorer

This notebook loads the SurgLaVi-β database and explores its captions and video metadata.

**What you'll get:**
- Caption count histograms (unfiltered, surgical, descriptive, surgical & descriptive)
- Caption count per level
- Violin plots: caption duration by level; per-video caption count by level
- Video overview bars (total/silent/narrated/filtered)
- Categorical pie charts: `procedure_modality`, `procedure_type`, `specialty`, `subject`
- FPS and video duration overview
- Sample data per level


In [None]:
# %pip installs into the environment of the running kernel
%pip install pandas numpy plotly matplotlib sqlalchemy nbformat

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import cc_explore_utils as cc

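# Fixed seed so that random samples are reproducible across runs
RNG_SEED = 42
np.random.seed(RNG_SEED)

# SQLAlchemy URL of the SQLite database (path is relative to the working directory)
db_url = "sqlite:///../data/surglavi_beta.db"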


In [None]:

# --- Connect & Load ---
engine = cc.connect(db_url)
data = cc.load_all(engine)

videos = data["videos"]
captions = data["captions"]
levels = data["levels"]
transcriptions = data["transcriptions"]

# Basic summary
summary_df = cc.summary_info(videos, captions)
display(summary_df)
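
If loading fails or a table looks empty, it can help to inspect the SQLite file directly before going through `cc`. The sketch below is not part of `cc_explore_utils`; it uses only the standard library, and the helper name `table_row_counts` and the toy schema are made up for illustration.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every user table in a SQLite database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    # Names come from sqlite_master itself, so double-quoting them is sufficient here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Toy in-memory database standing in for the real file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE captions (id INTEGER PRIMARY KEY, video_id INTEGER)")
conn.executemany("INSERT INTO videos (id) VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO captions (video_id) VALUES (?)", [(1,), (1,), (2,)])

counts = table_row_counts(conn)
print(counts)
```

To run it against the real data, replace the in-memory connection with `sqlite3.connect("../data/surglavi_beta.db")`, which is the file that `db_url` points to.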


## Captions

### 1) Histograms of caption counts (unfiltered, surgical, descriptive, surgical & descriptive)

In [None]:
cc.plot_caption_count_histograms(captions)

### 2) Keep only captions that are **surgical & descriptive**

In [None]:

captions_f = cc.filter_captions_surgical_descriptive(captions)
print(f"Filtered captions (surgical & descriptive): {len(captions_f)} / {len(captions)}")
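
The internals of `cc.filter_captions_surgical_descriptive` are not shown in this notebook. As a rough sketch of the idea, a filter like this usually comes down to a boolean mask over two flag columns. The column names `is_surgical` and `is_descriptive` below are assumptions, not the confirmed schema.

```python
import pandas as pd

# Toy captions table; `is_surgical` and `is_descriptive` are assumed column names.
captions_toy = pd.DataFrame(
    {
        "caption": [
            "grasp the gallbladder",
            "thanks for watching",
            "clip the cystic duct",
            "as you can see here",
        ],
        "is_surgical": [True, False, True, True],
        "is_descriptive": [True, False, True, False],
    }
)

# Keep only rows where both flags are set; .copy() gives an independent frame.
mask = captions_toy["is_surgical"] & captions_toy["is_descriptive"]
captions_toy_f = captions_toy[mask].copy()

print(f"Filtered captions: {len(captions_toy_f)} / {len(captions_toy)}")
```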


### 3) Histogram of filtered caption count per level

In [None]:
cc.plot_caption_counts_by_level(captions_f)

### 4) Violin plots: duration per level

In [None]:
cc.plot_duration_violin_by_level(captions_f)

### 5) Violin plots: per-video caption count per level (filtered)

In [None]:
cc.plot_avg_caption_count_per_video_by_level(captions_f)
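
To check the numbers behind the violin plot, the per-video caption counts can be computed with a two-step `groupby`. This is a minimal sketch on toy data, assuming the captions table has `video_id` and `level_id` columns; it is not necessarily how `cc` computes them.

```python
import pandas as pd

# Toy captions table; `video_id` and `level_id` are assumed column names.
toy = pd.DataFrame(
    {
        "video_id": [1, 1, 1, 2, 2, 3],
        "level_id": [1, 1, 2, 1, 2, 2],
    }
)

# Step 1: number of captions for each (level, video) pair.
per_video = toy.groupby(["level_id", "video_id"]).size().reset_index(name="n_captions")

# Step 2: summarize the distribution of those counts within each level.
summary = per_video.groupby("level_id")["n_captions"].agg(["mean", "median", "max"])
print(summary)
```

Videos with no captions at a given level do not appear in `per_video`, so the per-level mean covers only videos that have at least one caption at that level.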

## Videos metadata

### 1) Bar chart of video counts: total, silent, narrated, filtered

In [None]:
cc.plot_video_overview_bars(videos, captions_f)

### 2) Narrated videos that contain surgical & descriptive captions

In [None]:

vids_narrated = cc.narrated_videos(videos)
vids_filtered = cc.videos_with_filtered_captions(vids_narrated, captions_f)
print(f"Narrated videos: {len(vids_narrated)} / {len(videos)}")
print(f"Narrated videos that contain filtered captions: {len(vids_filtered)}")
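
The two-step selection above (narrated videos, then those with at least one filtered caption) can be sketched with a boolean mask and `isin`. The column names `video_id` and `has_narration` are assumptions for illustration; the real schema and the `cc` implementation may differ.

```python
import pandas as pd

# Toy tables; `video_id` and `has_narration` are assumed column names.
videos_toy = pd.DataFrame(
    {
        "video_id": [1, 2, 3, 4],
        "has_narration": [True, True, False, True],
    }
)
captions_toy_f = pd.DataFrame({"video_id": [1, 1, 4, 3]})

# Step 1: keep narrated videos only.
narrated = videos_toy[videos_toy["has_narration"]]

# Step 2: keep narrated videos that own at least one filtered caption.
ids_with_captions = captions_toy_f["video_id"].unique()
kept = narrated[narrated["video_id"].isin(ids_with_captions)]

print(f"Narrated: {len(narrated)} / {len(videos_toy)}, with filtered captions: {len(kept)}")
```

In this toy example video 2 is narrated but has no filtered captions, and video 3 has a caption but is not narrated, so both are dropped.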


### 3) Categorical distributions

Procedure modality distribution

In [None]:
cc.plot_categorical_pie(vids_filtered, 'procedure_modality')

Procedure type distribution (hover over a slice to see its full category name)

In [None]:
cc.plot_categorical_pie(vids_filtered, 'procedure_type')

Specialty distribution

In [None]:
cc.plot_categorical_pie(vids_filtered, 'specialty')

Subject distribution

In [None]:
cc.plot_categorical_pie(vids_filtered, 'subject')
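
When exact shares are needed rather than a chart, the numbers behind a pie chart can be tabulated with `value_counts`. This is a small sketch on toy data; the `specialty` values are invented, and missing values are kept as their own category via `dropna=False`.

```python
import pandas as pd

# Toy video metadata with one missing specialty (values are invented).
videos_toy = pd.DataFrame({"specialty": ["general", "general", "urology", None]})

# Fraction of videos per category, counting missing values as a category.
shares = videos_toy["specialty"].value_counts(normalize=True, dropna=False)
print(shares)
```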

### 4) FPS distribution (histogram)

In [None]:
cc.plot_fps_hist(vids_filtered)

### 5) Video duration distribution (histogram)

In [None]:
cc.plot_video_duration_hist(vids_filtered)

## Caption samples

### Filtered captions: 10 random samples by level

In [None]:

samples_level_1 = cc.sample_filtered_by_level(captions_f, level_id=1, n=10, seed=RNG_SEED)
samples_level_2 = cc.sample_filtered_by_level(captions_f, level_id=2, n=10, seed=RNG_SEED)
samples_level_3 = cc.sample_filtered_by_level(captions_f, level_id=3, n=10, seed=RNG_SEED)

print("Level 1 (task) samples"); display(samples_level_1)
print("Level 2 (step) samples"); display(samples_level_2)
print("Level 3 (phase) samples"); display(samples_level_3)
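
For reference, seeded sampling by level can be reproduced in plain pandas. The helper below is a hypothetical stand-in, not the `cc.sample_filtered_by_level` implementation, and it assumes a `level_id` column. It caps `n` at the number of available rows so that small levels do not raise an error.

```python
import pandas as pd


def sample_by_level(df: pd.DataFrame, level_id: int, n: int = 10, seed: int = 42) -> pd.DataFrame:
    """Return up to `n` reproducibly sampled rows whose `level_id` matches."""
    subset = df[df["level_id"] == level_id]
    # DataFrame.sample raises if n exceeds the row count, so cap it first.
    return subset.sample(n=min(n, len(subset)), random_state=seed)


# Toy captions table; `level_id` is an assumed column name.
toy = pd.DataFrame(
    {
        "caption": [f"caption {i}" for i in range(8)],
        "level_id": [1, 1, 1, 1, 1, 2, 2, 3],
    }
)

first = sample_by_level(toy, level_id=1, n=3)
second = sample_by_level(toy, level_id=1, n=3)
capped = sample_by_level(toy, level_id=2, n=10)
print(first)
```

Passing the same `seed` returns the same rows on every call, which is what makes the samples comparable across runs.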


### Unfiltered captions: 10 random samples

In [None]:

samples_unfiltered = cc.sample_captions(captions, n=10, seed=RNG_SEED)
display(samples_unfiltered)
