# Create Voice Clips
This program splits a long voice mp3 file into 55-second clips. It also:
- names the clips in the `initials-identifier--[engaging/boring]-num.mp3` format.
- saves the processing records in a CSV file that will be merged later.

## Install the necessary pydub library

In [None]:
!pip install -q pydub

## Import the libraries

In [None]:
from pydub import AudioSegment
import os

In [None]:
from google.colab import drive
drive.mount('/content/drive')

pydub needs ffmpeg (or libav) to read and write non-WAV formats such as mp3: https://ffmpeg.org/about.html
- On Mac, open a terminal and run the following command to check whether ffmpeg is available: `ffmpeg -version`
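
The same check can be run from inside the notebook. The sketch below uses the standard library's `shutil.which` to look for an `ffmpeg` executable on the PATH. The helper name `ffmpeg_available` is hypothetical and is not part of pydub.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("ffmpeg found at:", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found; pydub may fail to read or write mp3 files.")
```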

## Access the Input Voice File
### Mount Google Drive to Colab Notebook:
To access a saved voice file from a Google Colab notebook, you first need to mount your Google Drive to the notebook.
- Click the folder icon on the left margin to open the file navigation panel.
- Click the 'Mount Drive' icon to mount your Google Drive.
- Follow the instructions to authorize your notebook to access your drive.

### Link the Shared folder
- Copy the shared folder to "My Drive/Colab Notebooks"
- Navigate to the input voice file.


### Copy the path of the Input Voice File
- Point to the vertical three dots next to the input voice file.
- Right-click to bring up the context menu.
- Click "Copy path" to copy the path of the input file.
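
A copied path is easy to mistype or truncate, so it can help to confirm that it points at something before loading any audio. The sketch below is a minimal check using only `os.path`. The helper `describe_path` and the example path are hypothetical; replace the path with the one you copied.

```python
import os


def describe_path(path: str) -> str:
    """Report whether a path is a file, a folder, or does not exist."""
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "folder"
    return "missing"


# Hypothetical example path; replace it with the path you copied.
example_path = "/content/drive/MyDrive/Colab Notebooks/example.mp3"
print(example_path, "->", describe_path(example_path))
```

If the result is "missing", check that the drive is mounted and that the path was copied in full.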

### Create Variables for the Input File

- `input_voice_path`: the path to the input file in the drive.
- `source`: the source link where you downloaded the voice file.
- `source_id`: an identifier for the source.
- `initials`: your name initials.
- `label`: "engaging" or "boring".
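
Because the label ends up in every clip name and in the CSV file name, it can be worth checking it before splitting. The sketch below assumes the only valid labels are the two listed above. The names `VALID_LABELS` and `check_label` are hypothetical.

```python
# Assumption: these are the only two labels used in this project.
VALID_LABELS = ("engaging", "boring")


def check_label(label: str) -> str:
    """Return the label stripped and lowercased, or raise ValueError if invalid."""
    cleaned = label.strip().lower()
    if cleaned not in VALID_LABELS:
        raise ValueError(f"label must be one of {VALID_LABELS}, got {label!r}")
    return cleaned


print(check_label(" Engaging "))  # prints: engaging
```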

In [None]:
## CHANGE the path to your input file
input_voice_path = A_PATH_TO_YOUR_INPUT_VOICE_FILE

In [None]:
## CHANGE the source to the link where you downloaded the voice file
source = A_STRING

In [None]:
## CHANGE to the identifier you choose
source_id = A_STRING

In [None]:
## CHANGE to your initials
initials = A_STRING

In [None]:
## CHANGE the label according to your classification
label = A_STRING

## Set up Output Folder
As with the input file, copy the output folder path to a variable.

In [None]:
## CHANGE to the folder path, depending on whether you split engaging or boring files
output_engaging_path = A_PATH_TO_YOUR_OUTPUT_FOLDER

## Load the Audio File

In [None]:
# Load the audio file
audio = AudioSegment.from_file(input_voice_path)

## Set up the Length of Clips
Set up the length of clips and compute how many clips you will generate.

In [None]:
# Calculate the number of full-length segments (any shorter remainder is not counted here)
segment_length_ms = 55 * 1000  # pydub works in milliseconds (55 seconds)
num_segments = len(audio) // segment_length_ms
num_segments
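
As a worked example of this arithmetic, assume a hypothetical recording of 3 minutes 20 seconds (200,000 ms). Floor division gives the number of full 55-second clips, and the remainder is the length of the shorter final clip. The `example_` names are made up for this illustration.

```python
example_segment_ms = 55 * 1000   # 55 seconds in milliseconds
example_length_ms = 200 * 1000   # hypothetical audio length: 3 min 20 s

# divmod returns the floor-division quotient and the remainder together
full_clips, remainder_ms = divmod(example_length_ms, example_segment_ms)

# A non-zero remainder means one extra, shorter clip at the end
total_clips = full_clips + (1 if remainder_ms else 0)

print(full_clips)    # 3
print(remainder_ms)  # 35000
print(total_clips)   # 4
```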

## Test a Segment
Split a segment and test it works.

In [None]:
start_time = 0
end_time = segment_length_ms  # first 55 seconds
segment = audio[start_time:end_time]

In [None]:
from IPython.display import Audio

In [None]:
# CHANGE: Set up the output file path
output_path = os.path.join(output_engaging_path, f'{initials}-{source_id}--{label}-1.mp3')

# Export the segment to the path for playback
segment.export(output_path, format="mp3")

In [None]:
## Playback the saved clip
Audio(output_path)

## Set up the Path to Save Processing Records
As with the input file and the output folder, copy the path to the "data" folder under your name. The processing records will be saved there in a CSV file.

In [None]:
## Set up unique file name based on source
csv_file_name = f'{initials}-{source_id}-{label}.csv'

## CHANGE: Set up the processing records path (A_PATH is the folder path)
processing_records_path = os.path.join(A_PATH, csv_file_name)

## Set up the Processing CSV Records

In [None]:
## This is a list of dictionaries, each of which records the
## source, identifier, file name, created date, process, and label
processing_records = []

## Split and Save the Clips in Batch

In [None]:
from datetime import datetime

for i in range(num_segments + 1):
    start_time = i * segment_length_ms
    end_time = start_time + segment_length_ms
    segment = audio[start_time:end_time]

    # Skip the empty final slice when the audio length is an exact
    # multiple of the segment length
    if len(segment) == 0:
        continue

    # Set up the output file name and path
    file_name = f'{initials}-{source_id}--{label}-{i + 1}.mp3'
    output_path = os.path.join(output_engaging_path, file_name)

    # Export the segment as an mp3 clip
    segment.export(output_path, format="mp3")

    # Record how this clip was produced
    arecord = {
        'source': source,
        'identifier': source_id,
        'label': label,
        'created_date': datetime.now().strftime("%Y-%m-%d"),
        'process': 'split program',
        'file_name': file_name,
    }
    processing_records.append(arecord)

    print(arecord)
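
To preview which time ranges will be cut without exporting any audio, the boundaries can be computed on their own. The sketch below works on plain integers rather than on a pydub `AudioSegment`. The helper `segment_bounds` is hypothetical; it clamps the last end time to the audio length and yields nothing past the end, so an exact multiple of the segment length produces no empty range.

```python
from typing import Iterator, Tuple


def segment_bounds(total_ms: int, segment_ms: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_ms, end_ms) pairs that cover total_ms in segment_ms steps."""
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    for start in range(0, total_ms, segment_ms):
        # The last segment may be shorter than segment_ms
        yield start, min(start + segment_ms, total_ms)


# Hypothetical 200-second recording split into 55-second clips
for number, (start, end) in enumerate(segment_bounds(200_000, 55_000), start=1):
    print(number, start, end)
```

For the hypothetical 200-second input this lists four ranges, the last one running from 165000 to 200000 ms.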

## Playback the Last Segment to Test

In [None]:
Audio(output_path)

## Save the Processing Record
- Create a pandas DataFrame from the list of dictionaries
- Save the DataFrame to CSV

In [None]:
import pandas as pd
processing_records_df = pd.DataFrame(processing_records)
processing_records_df.to_csv(processing_records_path, index=False)
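
One way to confirm the file was written as expected is to read it back and check the columns. The sketch below uses the standard library `csv` module rather than pandas, and writes a single made-up record to a temporary file so that it runs on its own. The helper `read_records`, the set `EXPECTED_COLUMNS`, and all sample values are hypothetical.

```python
import csv
import os
import tempfile

# Column names taken from the keys used in the processing records
EXPECTED_COLUMNS = {"source", "identifier", "label",
                    "created_date", "process", "file_name"}


def read_records(csv_path):
    """Read a processing-records CSV into a list of dictionaries."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Made-up record, used only to demonstrate the check
sample = {
    "source": "https://example.com/talk",
    "identifier": "demo",
    "label": "engaging",
    "created_date": "2024-01-01",
    "process": "split program",
    "file_name": "xx-demo--engaging-1.mp3",
}

with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "xx-demo-engaging.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(sample))
        writer.writeheader()
        writer.writerow(sample)
    records = read_records(sample_path)

print(len(records))                       # 1
print(set(records[0]) == EXPECTED_COLUMNS)  # True
```

To check your own file, call `read_records(processing_records_path)` instead of using the temporary sample.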