---

# Saraga Audiovisual: a large multimodal open data collection for the analysis of Carnatic music

---

## 1. Accessing the Saraga Audiovisual Dataset

>Saraga Audiovisual is a dataset of **Carnatic vocal performances** comprising **42 concerts** and more than **60 hours of music**. It includes **video recordings** of all concerts and **high-quality human pose estimation data** for the musicians, enabling a wide range of multimodal analyses.

### 1.1. Download the Dataset

The Saraga dataset is available on Zenodo. You can download it using the following link:  
[Zenodo link](https://zenodo.org/records/15102483)  

The dataset is split into multiple parts, each containing specific components:

- **`saraga_audio.zip`** – Multi-track audio files along with their corresponding mixture files.
- **`saraga_gesture.zip`** – Pose estimation files extracted from videos corresponding to each audio track.
- **`saraga_metadata.zip`** – Metadata for all the audio files.
- **`saraga_video.zip`** – Videos from three sample concerts. Due to size constraints, only these three concerts are included. For access to the full video collection, contact the dataset providers.

Visit [Zenodo](https://zenodo.org/records/15102483) and manually download the required zip files.

Alternatively, you can use `wget` to download the files directly.

In [None]:
!wget -O "saraga_audio.zip" "https://zenodo.org/records/15102483/files/saraga%20audio.zip?download=1"
!wget -O "saraga_gesture.zip" "https://zenodo.org/records/15102483/files/saraga%20gesture.zip?download=1"
!wget -O "saraga_metadata.zip" "https://zenodo.org/records/15102483/files/saraga%20metadata.zip?download=1"
!wget -O "saraga_video.zip" "https://zenodo.org/records/15102483/files/saraga%20visual.zip?download=1"

### 1.2. Extract the Dataset

Once the files are downloaded, extract them into a common folder.  
For that, we use `zipfile`, a Python library for handling zip files.

In [None]:
import zipfile

In [None]:
saraga_folder = "./saraga"
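# File names match those saved by the wget commands above; if you downloaded the archives manually, use the names you saved them under
zip_files = ["saraga_gesture.zip", "saraga_metadata.zip", "saraga_video.zip"]  # add "saraga_audio.zip" to extract the audio as well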

In [None]:
# Extract each zip file into saraga_folder
for zip_file in zip_files:
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(saraga_folder)
    print(f"Extracted {zip_file} to {saraga_folder}")
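Since the archive names on disk depend on how they were downloaded, it can help to check which files are actually present before extracting. The following is a minimal sketch of one way to do that; `extract_available` is a hypothetical helper introduced here for illustration and is not part of the Saraga tooling.

```python
import zipfile
from pathlib import Path


def extract_available(zip_paths, dest_folder):
    """Extract every archive in zip_paths that exists and is a valid zip.

    Returns the list of archives that were extracted.
    Hypothetical helper, not part of the Saraga tooling.
    """
    dest = Path(dest_folder)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in map(Path, zip_paths):
        # Skip archives that are missing or are not valid zip files
        if not zip_path.is_file() or not zipfile.is_zipfile(zip_path):
            print(f"Skipping {zip_path}: not found or not a zip archive")
            continue
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(dest)
        extracted.append(str(zip_path))
    return extracted
```

Calling `extract_available(zip_files, saraga_folder)` would then extract whatever is available and report the archives it skipped, instead of stopping at the first missing file.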

## 2. Processing Audiovisual Data

Now we will process the keypoints from the gesture dataset and overlay the skeleton on the performance video. For this tutorial, we will use *Ananda Natana Prakasham* by *Aditi Prahalad*, the performance referenced in the file paths below.

### 2.1. Load Gestures and Video

Let's install and import the libraries required for processing the video and gestures.

In [None]:
!pip install -r requirements.txt

In [1]:
import cv2
import numpy as np

In [2]:
# Define File Paths for the Relevant Performance
keypoints_path = "saraga/saraga gesture/Aditi Prahalad/Ananda Natana Prakasham/singer/singer_0_753_kpts.npy"
scores_path = "saraga/saraga gesture/Aditi Prahalad/Ananda Natana Prakasham/singer/singer_0_753_scores.npy"
video_path = "saraga/saraga visual/Aditi Prahalad/Ananda Natana Prakasham/Ananda Natana Prakasham.mov"
save_path = "output.mp4"

Load the keypoint and score files for the performance.

In [3]:
keypoints = np.load(keypoints_path)
scores = np.load(scores_path)
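Before drawing anything, it is worth confirming that the arrays have the layout the rest of the tutorial relies on. The sketch below assumes that `keypoints` has shape `(frames, keypoints, 2)` with `(x, y)` pixel coordinates and that `scores` has shape `(frames, keypoints)`. This layout is inferred from how the arrays are indexed later in this notebook, not from official dataset documentation, and `summarize_pose_arrays` is a hypothetical helper.

```python
import numpy as np


def summarize_pose_arrays(keypoints, scores):
    """Check the assumed array layout and return (num_frames, num_keypoints)."""
    # Assumption: keypoints is (frames, keypoints, 2) holding (x, y) coordinates
    if keypoints.ndim != 3 or keypoints.shape[-1] != 2:
        raise ValueError(f"Unexpected keypoints shape: {keypoints.shape}")
    # Assumption: scores holds one confidence value per keypoint per frame
    if scores.shape != keypoints.shape[:2]:
        raise ValueError(f"Scores shape {scores.shape} does not match keypoints")
    return keypoints.shape[0], keypoints.shape[1]
```

For example, `num_frames, num_keypoints = summarize_pose_arrays(keypoints, scores)` raises a `ValueError` early if the files do not match these assumptions.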

Now, we will define the skeleton: a list of tuples, each naming a pair of keypoint indices to connect when drawing the human pose. For example, the left shoulder should be connected to the left elbow, and the left elbow to the left wrist.

In [4]:
# Body skeleton (indices 0-16) from the 135 MMPose keypoints
skeleton = [
    (0, 1), (1, 2),     # Nose to left eye, left eye to right eye
    (0, 3), (0, 4),     # Nose to ears (left and right)
    (5, 6),             # Shoulders (left and right)
    (5, 7), (7, 9),     # Left arm (shoulder -> elbow -> wrist)
    (6, 8), (8, 10),    # Right arm (shoulder -> elbow -> wrist)
    (11, 12),           # Hips (left and right)
    (5, 11), (6, 12),   # Shoulders to hips
    (11, 13), (13, 15), # Left leg (hip -> knee -> ankle)
    (12, 14), (14, 16)  # Right leg (hip -> knee -> ankle)
]
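A skeleton definition like this is usually combined with the per-keypoint confidence scores: a connection is drawn only when both of its endpoints were detected reliably. The sketch below illustrates that idea for a single frame. `confident_connections` is a hypothetical helper, and the default threshold of 0.5 is the value used in the drawing loop of this tutorial.

```python
import numpy as np


def confident_connections(skeleton, frame_scores, threshold=0.5):
    """Return the skeleton edges whose two endpoints both exceed threshold."""
    frame_scores = np.asarray(frame_scores)
    return [
        (start, end)
        for start, end in skeleton
        if frame_scores[start] > threshold and frame_scores[end] > threshold
    ]
```

With `frame_scores = scores[frame_idx]`, the returned list contains only the edges that would be drawn for that frame.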

Now, we will open the video file.

In [5]:
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)  # Frames per second (kept as a float, since rates such as 29.97 are common)
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

### 2.2. Process the Frames

Create an output video file to store the processed video with the overlaid skeleton.

In [6]:
out = cv2.VideoWriter(save_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))

Now, let's project the skeleton onto the video frames. First, we will select a 20-second segment to process.

In [7]:
start_time = 10  # Start time in seconds (adjust as needed)
end_time = start_time + 20  # End time in seconds
start_frame = int(start_time * fps)
end_frame = int(end_time * fps)
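The conversion from seconds to frame indices depends on the frame rate, which is often not an integer (29.97 fps is common). As a hedged illustration, the sketch below rounds to the nearest frame rather than truncating; `seconds_to_frame_range` is a hypothetical helper, not part of OpenCV.

```python
def seconds_to_frame_range(start_time, duration, fps):
    """Convert a start time and duration in seconds to (start_frame, end_frame)."""
    # Round to the nearest frame so that non-integer frame rates are handled sensibly
    start_frame = int(round(start_time * fps))
    end_frame = int(round((start_time + duration) * fps))
    return start_frame, end_frame
```

At 30 fps, a 20-second segment starting at 10 seconds spans frames 300 to 900; at 29.97 fps the same segment spans frames 300 to 899.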

In [8]:
frame_idx = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
        
    if start_frame <= frame_idx < end_frame:
        # Get keypoints and scores for the current frame
        if frame_idx < len(keypoints):
            frame_keypoints = keypoints[frame_idx]
            frame_scores = scores[frame_idx]
            
            # Draw keypoints
            for i, (x, y) in enumerate(frame_keypoints):
                # Only draw if confidence score is above threshold
                if frame_scores[i] > 0.5:  # Adjust threshold as needed
                    cv2.circle(frame, (int(x), int(y)), 5, (0, 255, 0), -1)
                    
            # Draw skeleton
            for connection in skeleton:
                start, end = connection
                if frame_scores[start] > 0.5 and frame_scores[end] > 0.5:
                    x1, y1 = frame_keypoints[start]
                    x2, y2 = frame_keypoints[end]
                    cv2.line(frame, (int(x1), int(y1)), (int(x2), int(y2)), (255, 0, 0), 2)
                    
        # Write frame to output video
        out.write(frame)
        
    frame_idx += 1
    
    # Stop processing after the end frame
    if frame_idx >= end_frame:
        break
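The threshold of 0.5 used above is adjustable, and a quick way to judge a candidate value is to measure how many keypoints in the selected segment would survive it. The sketch below is one possible check. It assumes `scores` has shape `(frames, keypoints)`, and `confident_fraction` is a hypothetical helper.

```python
import numpy as np


def confident_fraction(scores, start_frame, end_frame, threshold=0.5):
    """Fraction of keypoint detections in [start_frame, end_frame) above threshold."""
    segment = np.asarray(scores)[start_frame:end_frame]
    if segment.size == 0:
        # The requested frames lie outside the available pose data
        return 0.0
    return float((segment > threshold).mean())
```

A value close to 0 for `confident_fraction(scores, start_frame, end_frame)` suggests that the threshold is too strict for this segment or that the frame range falls outside the pose data.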

Free the resources after processing.

In [9]:
cap.release()
out.release()
cv2.destroyAllWindows()

## 3. Display the Result

Great! We have the results. Now, we can display the video in the notebook using `IPython.display`.

In [10]:
import IPython.display as ipd

In [11]:
ipd.display(ipd.Video(save_path, width=640, height=360))