# YOHO Datasets

In this notebook we download, to local storage, the datasets used in [the original YOHO paper](https://doi.org/10.48550/arXiv.2109.00962), namely:

* the _MuSpeak dataset_,
* the _TUT Sound Event Detection dataset_ and
* the _Urban Sound Event Detection dataset_.

We defined helper functions in the `utils` package to simplify data retrieval.
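
The `utils` package itself is not shown in this notebook. As a rough idea of what download and extraction helpers of this kind might look like, here is a minimal sketch built only on the standard library. The names `download_file_sketch` and `uncompress_file_sketch` are hypothetical and are not the actual `utils` implementation.

```python
import os
import shutil
import urllib.request


def download_file_sketch(url, dest_path):
    """Download `url` to `dest_path`, creating the parent folder if needed (hypothetical helper)."""
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return dest_path


def uncompress_file_sketch(archive_path, extract_to):
    """Unpack a .zip or .tar.gz archive into `extract_to` (hypothetical helper)."""
    os.makedirs(extract_to, exist_ok=True)
    # shutil infers the archive format from the file extension.
    shutil.unpack_archive(archive_path, extract_to)
    return extract_to
```

Relying on `shutil.unpack_archive` lets a single function handle both the `.zip` and the `.tar.gz` archives used later in this notebook.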

In [4]:
import utils as audio_utils

## Download the datasets

### Music-Speech Detection

In [None]:
# MuSpeak Dataset
musp_url = "https://mirg.city.ac.uk/datasets/muspeak/muspeak-mirex2015-detection-examples.zip"
musp_zip_path = "../data/musp.zip"
musp_extract_to = "../data/musp"

audio_utils.download_file(musp_url, musp_zip_path)
audio_utils.uncompress_file(musp_zip_path, musp_extract_to)

### TUT Sound Events Detection

The dataset is downloaded from the [DCASE2017 Challenge official website](https://dcase.community/challenge2017).

In [None]:
import os

from tqdm import tqdm  # progress bar for the extraction loop

# TUT Sound Events 2017 Dataset
tut_urls = [
    ("TUT-sound-events-2017-development", "https://zenodo.org/api/records/814831/files-archive"),
    ("TUT-sound-events-2017-evaluation", "https://zenodo.org/api/records/1040179/files-archive"),
]
tut_zip_path = "../data/tut.zip"
tut_extract_to = "../data/tut"

for tut_name, tut_url in tut_urls:
    tut_extract_to_subfolder = tut_extract_to + '/' + tut_name
    audio_utils.download_file(tut_url, tut_zip_path)
    audio_utils.uncompress_file(tut_zip_path, tut_extract_to_subfolder)

    # The downloaded archive contains further .zip files:
    # extract each one next to itself, then delete the .zip
    for item in tqdm(os.listdir(tut_extract_to_subfolder)):
        if item.endswith('.zip'):
            zipped_file = tut_extract_to_subfolder + '/' + item
            unzipped_file = zipped_file.rsplit(".", 1)[0]  # Remove .zip extension
            audio_utils.uncompress_file(zipped_file, unzipped_file)
            os.remove(zipped_file)

### Urban-SED

In [None]:
# Urban-SED Dataset
urbansed_url = "https://zenodo.org/api/records/1324404/files-archive"
urbansed_zip_path = "../data/urbansed.zip"
urbansed_extract_to = "../data/urbansed"

audio_utils.download_file(urbansed_url, urbansed_zip_path)
audio_utils.uncompress_file(urbansed_zip_path, urbansed_extract_to)
# The extracted folder contains a nested .tar.gz archive: extract it as well
audio_utils.uncompress_file(urbansed_extract_to + "/URBAN-SED_v2.0.0.tar.gz", urbansed_extract_to)

## Extract metadata

1. **Process Annotations**: Read the annotation files of each dataset and convert the event data into a unified format. For example, for the MuSpeak dataset, each annotation file is turned into a list of events, where each event is a tuple containing the event type (`'m'` for music or `'s'` for speech), the start time and the end time.

2. **Save Processed Data**: Save the processed data, i.e. the audio file paths and their corresponding events, into a new CSV file for each dataset. This structured data will serve as the input for the data generator.
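
The two steps above can be sketched on a tiny made-up example. The annotation text below is invented and follows the `start,duration,label` row layout read by the MuSpeak cell further down. `write_events` is a hypothetical stand-in for `audio_utils.write_data_to_csv`, whose actual output layout is not shown here, so the one-row-per-event layout is an assumption.

```python
import csv
import io


def rows_to_events(rows):
    """Turn (start, duration, label) rows into (label, start, end) tuples."""
    events = []
    for row in rows:
        if not row:  # skip empty lines
            continue
        start, duration, label = float(row[0]), float(row[1]), row[2]
        events.append((label, start, start + duration))
    return events


def write_events(data, fh):
    """Hypothetical writer: one CSV row per event (assumed layout)."""
    writer = csv.writer(fh)
    writer.writerow(["file_path", "label", "start", "end"])
    for path, events in data.items():
        for label, start, end in events:
            writer.writerow([path, label, start, end])


# Made-up annotation text in a start,duration,label layout.
annotation_text = "0.0,12.5,s\n\n12.5,30.0,m\n"
events = rows_to_events(csv.reader(io.StringIO(annotation_text)))

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_events({"./data/musp/example.mp3": events}, buffer)
print(buffer.getvalue())
```

Here `events` ends up as `[('s', 0.0, 12.5), ('m', 12.5, 42.5)]`, which is the `(label, start, end)` format described in step 1.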

### Music-Speech Detection

In [None]:
import csv

MUSP_ANNOTATIONS_PATH = './data/musp/'

# Get all the annotation files (name) in the data path:
files = audio_utils.get_files(MUSP_ANNOTATIONS_PATH, extensions='.csv')

# !! Remove some annotation files that are not linked to audio files:
files.remove('theconcert2_v2.csv')  # no file named 'theconcert2_v2.mp3'
files.remove('UTMA-26_v2.csv')      # no file named 'UTMA-26_v2.mp3'

musp_data = {}  # This dictionary will store for each audio file the list of events
                # i.e. {'./data/musp/audio_file.mp3': [('s', 20, 22), ('s', 25, 27), ...]}

for f in files:
    with open(MUSP_ANNOTATIONS_PATH + f, 'r') as file:

        reader = csv.reader(file)

        MP3_FILE_PATH = './data/musp/' + f.split('.')[0] + '.mp3'
        musp_data[MP3_FILE_PATH] = []       # Initialize the list of events for this file

        print(f'Processing {MP3_FILE_PATH}...')

        for row in reader:  # Read the events from the CSV file

            if not row:  # Skip empty lines
                continue

            start = float(row[0])        # Start time of the event (in seconds)
            duration = float(row[1])     # Duration of the event (in seconds)
            end_time = start + duration  # End time of the event (in seconds)
            label = str(row[2])          # Label of the event

            assert (start < end_time), f"Start time ({start}) must be less than end time ({end_time})"
            # Append the event to the list of events for this audio file:
            musp_data[MP3_FILE_PATH].append(
                (label, start, end_time)
            )

audio_utils.write_data_to_csv(musp_data, './data/musp.csv')

### TUT Sound Events Detection

In [None]:
import csv

AUDIO_1_PATH = './data/tut/TUT-sound-events-2017-development/TUT-sound-events-2017-development.audio.1/TUT-sound-events-2017-development/audio/street/'
AUDIO_2_PATH = './data/tut/TUT-sound-events-2017-development/TUT-sound-events-2017-development.audio.2/TUT-sound-events-2017-development/audio/street/'

DEVELOPMENT_ANNOTATIONS_PATH = './data/tut/TUT-sound-events-2017-development/TUT-sound-events-2017-development.meta/TUT-sound-events-2017-development/meta/street/'

files = audio_utils.get_files(DEVELOPMENT_ANNOTATIONS_PATH, extensions='.ann')

tut_data = {}  # Maps each audio file path to its list of (label, start, end) events
for f in files:
    f_name = f.split('.')[0] + '.wav'

    # These four recordings are stored in the second audio archive
    if f_name in ['a128.wav', 'a131.wav', 'b007.wav', 'b093.wav']:
        f_path = AUDIO_2_PATH + f_name
    else:
        f_path = AUDIO_1_PATH + f_name

    tut_data[f_path] = []

    print(f'Processing {f_path}...')

    with open(DEVELOPMENT_ANNOTATIONS_PATH + f, 'r') as file:

        # The annotation files are tab-separated
        reader = csv.reader(file, delimiter='\t')

        for row in reader:
            if row:  # Skip empty lines
                start = float(row[2])  # Start time of the event (in seconds)
                end = float(row[3])    # End time of the event (in seconds)
                label = row[4]         # Label of the event
                tut_data[f_path].append(
                    (label, start, end)
                )

audio_utils.write_data_to_csv(tut_data, './data/tut.csv')
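
The cell above reads columns 2, 3 and 4 of each tab-separated annotation line as start time, end time and label. To make that layout concrete, here is a minimal sketch on a single made-up line. The values, and the contents of the first two columns, are assumptions used only for illustration, and `parse_tab_rows` is a hypothetical helper.

```python
import csv
import io


def parse_tab_rows(fh):
    """Read tab-separated rows and return (label, start, end) tuples."""
    events = []
    for row in csv.reader(fh, delimiter="\t"):
        if len(row) < 5:  # skip empty or incomplete lines
            continue
        events.append((row[4], float(row[2]), float(row[3])))
    return events


# Made-up line: only columns 2-4 (start, end, label) matter here.
sample = "audio/street/a001.wav\tstreet\t1.50\t2.25\tpeople walking\n"
tut_events = parse_tab_rows(io.StringIO(sample))
print(tut_events)
```

Passing `delimiter="\t"` to `csv.reader` splits each line into its columns directly, so labels that contain spaces stay in one field.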

### Urban-SED

In [None]:
import jams
URBAN_SED_ANNOTATIONS_PATH = './data/urbansed/annotations/train/'

files = audio_utils.get_files(URBAN_SED_ANNOTATIONS_PATH, extensions='jams')

def parse_jams_file(jams_file):
    """
    Parse a JAMS file and extract the annotations.

    Parameters:
    - jams_file (str): The path to the JAMS file.

    Returns:
    - events (list): A list of tuples containing the event type, start time, and end time.
    """
    jam = jams.load(jams_file)
    events = []
    for annotation in jam.annotations:
        for obs in annotation.data:
            start_time = obs.value['event_time']
            end_time = obs.value['event_time'] + obs.value['event_duration']
            label = obs.value['label']
            events.append((label, start_time, end_time))

    # Return only after all annotations have been processed
    return events

DATA_PATH = './data/urbansed/annotations/train/'
AUDIO_PATH = './data/urbansed/audio/train/'

urbansed_data = {}
for f in files:
    f_path = DATA_PATH + f
    f = f.split('.')[0] + '.wav'
    urbansed_data[AUDIO_PATH+f] = parse_jams_file(f_path)

audio_utils.write_data_to_csv(urbansed_data, './data/urbansed.csv')