## Data Preprocessing

1. **Process Annotations**: Read the annotation files for each dataset and convert the event data into a unified format. For example, for the MUSP dataset, annotations are transformed into a list of events where each event is represented as a tuple containing the event type ('m' for music or 's' for speech), start time, and end time.

2. **Save Processed Data**: Save the processed data, including the audio file paths and their corresponding events, into a new CSV file. This structured data will serve as the input for the data generator.

In [None]:
import os
import csv

In [42]:
DATA_PATH = '../data/musp/'

def get_files(data_path, extensions):
  """
  Get a list of files in the specified data path with the given extensions.

  Parameters:
  - data_path (str): The path to the directory containing the files.
  - extensions (str or tuple): The file extensions to filter by.

  Returns:
  - files (list): A list of file names that match the specified extensions.
  """
  files = [f for f in os.listdir(data_path) if f.endswith(extensions)]
  return files

def write_data_to_csv(data, output_path):
  """
  Write data to a CSV file.

  Args:
    data (dict): A dictionary containing the data to be written to the CSV file.
    output_path (str): The path to the output CSV file.

  Returns:
    None
  """
  with open(output_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['filepath', 'events'])
    for key, value in data.items():
      if value:  # Skip files that have no events
        writer.writerow([key, value])  # The events list is stored as its string representation

# Get list of files in the data path with the .csv extension:
files = get_files(DATA_PATH, extensions='.csv')

musp_data = {}
for f in files:
    # Each annotation row is: start (seconds), duration (seconds), label ('m' or 's')
    with open(os.path.join(DATA_PATH, f), 'r', newline='') as file:
        reader = csv.reader(file)
        # Swap the .csv extension for .mp3 to match the audio files.
        # os.path.splitext strips only the last extension, so other dots in the name are kept.
        file_name = os.path.splitext(f)[0] + '.mp3'
        musp_data[file_name] = []  # Initialize the list of events for this file

        print(f'Processing {file_name}...')
        for row in reader:
            if row:  # Skip empty lines
                start = float(row[0])
                duration = float(row[1])
                end_time = start + duration
                label = str(row[2])

                # Event tuple: (label, start in seconds, end in seconds),
                # e.g. ('s', 20.0, 22.0) means the event 's' starts at 20 s and ends at 22 s
                musp_data[file_name].append((label, start, end_time))

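# Write the data to a CSV file (create the output directory first if it is missing):
os.makedirs('../data/processed', exist_ok=True)
write_data_to_csv(musp_data, '../data/processed/musp.csv')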

Processing ConscinciasParalelasN11-OEspelhoEOReflexoFantasiasEPerplexidadesParte413-12-1994.mp3...
Processing ConscinciasParalelasN3-OsSentidosOSentirEAsNormasParte318-10-1994.mp3...
Processing ConscinciasParalelasN7-OsSentidosOSentirEAsNormasParte715-1-1994.mp3...
Processing eatmycountry1609.mp3...
Processing theconcert16.mp3...
Processing theconcert2.mp3...
Processing theconcert2_v2.mp3...
Processing UTMA-26.mp3...
Processing UTMA-26_v2.mp3...
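
Because `write_data_to_csv` stores each events list as text, whatever consumes the processed file has to parse that text back into tuples. The sketch below shows one possible way to do this with `ast.literal_eval`. The helper name `read_processed_csv` is hypothetical and not part of the original notebook; it assumes only the `filepath` and `events` columns written above.

```python
import ast
import csv

def read_processed_csv(csv_path):
    """
    Load a processed CSV into a dict mapping each file path to its list of events.

    Hypothetical helper: assumes the 'filepath' and 'events' columns
    produced by write_data_to_csv.
    """
    data = {}
    with open(csv_path, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            # The events column holds the string form of a list of tuples;
            # literal_eval parses it back without executing arbitrary code.
            data[row['filepath']] = ast.literal_eval(row['events'])
    return data
```

Calling `read_processed_csv('../data/processed/musp.csv')` should then return a dictionary with the same structure as `musp_data`, minus any files that had no events.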
