## Template / example for generic modeling

There are two options for obtaining discretized data:
1. **Loading from the pre-existing pickle file** (automatically created after running `python compile_annotations.py --di`)
    * Pro: The pickle file has already been prepared, so it only needs to be loaded
    * Con: Prepared with the default parameters for segment length, hop length, and overlap threshold
2. **Using the utility function `discretize_all()`** (found in `common/annotation_utils`)
    * Pro: Allows custom settings for segment length, hop length, and overlap threshold, e.g. for a grid search
    
The examples below show how to use each option:

### Method 1: Loading from the pre-existing pickle file

In [1]:
%load_ext autoreload
%autoreload 2
# autoreload is an IPython extension that reloads imported modules before
# each cell runs, so edits to the repo's modules take effect without a kernel restart.

In [2]:
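import os, pickle, sys, time
import librosa
import numpy as np
import pandas as pd
import pywt

# Make the repository root importable so that `common` can be found
repo_base_directory = os.path.dirname(os.path.dirname(os.getcwd()))
sys.path.append(repo_base_directory)
from common.annotation_utils import *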


In [3]:
# Use a context manager so the file is closed once loading finishes
with open("../../audio_annotation/outputs/discretized_dict.pkl", "rb") as f:
    discretized_dict = pickle.load(f)

In [4]:
discretized_dict.keys()

dict_keys(['data', 'segment_length', 'hop_length', 'overlap_thresh'])
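The per-episode entries live under the `'data'` key. Below is a minimal sketch of how the dictionary could be inspected. It assumes each episode entry holds at least `'y'` and `'audio_duration'` (the two fields read later in this notebook). `example_dict` and `summarize` are hypothetical stand-ins with made-up values, not part of the repo.

```python
# Stand-in with the same top-level keys as the real dictionary; all values are made up.
example_dict = {
    "data": {
        "episode_a": {"y": [0, 1, 1, 0], "audio_duration": 2.5},
        "episode_b": {"y": [0, 0], "audio_duration": 1.5},
    },
    "segment_length": 0.5,
    "hop_length": 0.5,
    "overlap_thresh": 0.5,
}


def summarize(discretized):
    """Return (episode id, number of y values, duration in seconds) per episode."""
    return [
        (episode_id, len(entry["y"]), entry["audio_duration"])
        for episode_id, entry in discretized["data"].items()
    ]


for episode_id, n_labels, duration in summarize(example_dict):
    print(f"{episode_id}: {n_labels} segments, {duration:.1f} s of audio")
```

With the real data, `summarize(discretized_dict)` should give a quick check that every episode has one `y` value per segment.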

### Method 2: Using `discretize_all()`

In [5]:
seg_len = 0.5
hop_len = 0.5
overlap = 0.5
combined_annotations_filepath = '../../audio_annotation/outputs/compiled_annotations_df.pkl'
_, discretized_dict = discretize_all(combined_annotations_filepath, 
                                     audio_filepath = '../../data/audio',
                                     segment_length=seg_len,
                                     hop_length=hop_len,
                                     overlap_thresh=overlap)


Trying ../../data/audio/6hoNS9LR1Lxb4EzMX72kXR.ogg...
Trying ../../data/audio/79aiOALBohH9vEIUHPAoRj.ogg...
Error: Detected audio duration is greater than 10 hours
Trying to use mp3 file
mp3 load successful
Trying ../../data/audio/0pIwpmg5oPcMWJXVSyrx4E.ogg...
Error: Detected audio duration is greater than 10 hours
Trying to use mp3 file
mp3 load successful
Trying ../../data/audio/7LxbgPH3grqf6lCpLKEwo7.ogg...
Error: Detected audio duration is greater than 10 hours
Trying to use mp3 file
mp3 load successful
Trying ../../data/audio/7vxD3WNDRkigLnIDHyy0cu.ogg...
Error: Detected audio duration is greater than 10 hours
Trying to use mp3 file
mp3 load successful
Trying ../../data/audio/7r367wUYs1EvyBbeyOcq39.ogg...
Error: Detected audio duration is greater than 10 hours
Trying to use mp3 file
mp3 load successful
Trying ../../data/audio/0TkGYYIPwRqx8xzP0XGvRG.ogg...
Trying ../../data/audio/1XgTQnRlfJ0zpDdg2DccbR.ogg...
Error: Detected audio duration is greater than 10 hours
Trying to use mp3
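Since Method 2 is meant to support grid search, here is a minimal sketch of how the parameter combinations could be generated. The candidate values and the `parameter_grid` helper are illustrative assumptions. Only the keyword names `segment_length`, `hop_length`, and `overlap_thresh` come from the `discretize_all()` call above.

```python
from itertools import product

# Candidate values are illustrative assumptions, not settings used in this project.
segment_lengths = [0.5, 1.0]
hop_fractions = [0.5, 1.0]        # hop length expressed as a fraction of segment length
overlap_thresholds = [0.25, 0.5]


def parameter_grid(segment_lengths, hop_fractions, overlap_thresholds):
    """Yield one keyword-argument dict per combination of the three settings."""
    for seg_len, hop_frac, overlap in product(
        segment_lengths, hop_fractions, overlap_thresholds
    ):
        yield {
            "segment_length": seg_len,
            "hop_length": seg_len * hop_frac,
            "overlap_thresh": overlap,
        }


grid = list(parameter_grid(segment_lengths, hop_fractions, overlap_thresholds))
print(f"{len(grid)} combinations")
for params in grid[:3]:
    print(params)

# In the notebook, each combination could then be passed on, for example:
# _, discretized_dict = discretize_all(combined_annotations_filepath,
#                                      audio_filepath='../../data/audio',
#                                      **params)
```

Each call to `discretize_all()` appears to reload every audio file (see the log above), so a large grid may take a long time.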

### Processing of All Episodes

In [6]:
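def process_entire_episode(podcast_id: str, segment_length, hop_length, overlap, y, audio_duration):
    '''Processes an entire podcast episode and saves its DWT features.

    Args:
        podcast_id: ID of the episode to process (the audio file name without extension)
        segment_length: length of each segment, in seconds
        hop_length: step between the starts of consecutive segments, in seconds
        overlap: overlap threshold used for discretization (only used in the output file name)
        y: per-segment target values, stored in the "y" column of the output
        audio_duration: duration of the episode's audio, in seconds

    Returns:
        None. The resulting dataframe is pickled to data/dwt_df_pkl for later use.

    '''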

    # find the appropriate path (use mp3 if it's available)
    prev_directory = '../../data/audio/'
    if os.path.exists(f'{prev_directory}{podcast_id}.mp3'):
        audio_filepath = f'{prev_directory}{podcast_id}.mp3'
    else:
        audio_filepath = f'{prev_directory}{podcast_id}.ogg'

    print(f'Found audio file at location {audio_filepath}')
    start_load = time.time()
    waveform, sr = librosa.load(audio_filepath)
    print(f'\n{podcast_id} took {((time.time() - start_load) / 60):.2f} minutes to load')

    # Convert sample indices to time (in seconds)
    time_stamp = librosa.samples_to_time(np.arange(0, len(waveform)), sr=sr)
    audio_df = pd.DataFrame()
    audio_df['time'] = time_stamp
    audio_df['original'] = waveform

    
    # necessary information gained about the podcast
    chunks = np.arange(0, audio_duration - segment_length, hop_length)
    print(f'Number of chunks in podcast id {podcast_id}: {len(chunks)}')

    # getting segments of the waveform data based on the chunks
    audio_time_seg = []
    # getting segments of the time based on the chunks
    time_seg = []
    for chunk in chunks:

        # audio segments
        start_sr = librosa.time_to_samples(chunk, sr=sr)
        end_sr = librosa.time_to_samples(chunk + segment_length, sr=sr)
        
        sub_audio = audio_df.original[start_sr:end_sr + 1]
        audio_time_seg.append(sub_audio)

        sub_time = audio_df.time[start_sr:end_sr + 1]
        time_seg.append(sub_time)


    print(f'Number of audio time segments: {len(audio_time_seg)}')
    print(f'Length of first audio time segment: {len(audio_time_seg[0])/sr}')

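    dwt_time = time.time()
    cA_lst = []
    for sub_episode in audio_time_seg:
        # normalize by the segment maximum; skip the division when the maximum
        # is zero (e.g. a silent segment) to avoid dividing by zero
        peak = sub_episode.max()
        data = sub_episode / peak if peak != 0 else sub_episode
        # single-level 'db1' (Haar) DWT; keep only the approximation coefficients
        cA, _ = pywt.dwt(data, 'db1', 'symmetric')
        cA_lst.append(cA)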

    print(f'Took {(time.time() - dwt_time):.4f} seconds to complete dwt over each audio time segment')

    print(f"length of approximation coefficients cA_lst: {len(cA_lst)}")
    print(
        f"length of each chunk's approximation coefficients in cA_lst: {len(cA_lst[0])}")

    dwt_df = pd.DataFrame(cA_lst)
    dwt_df["y"] = y
    print(
        f'\nSize of final dataframe for podcast id {podcast_id}: {dwt_df.shape}')
    dwt_df.to_pickle(f"../../data/dwt_df_pkl/{podcast_id}_seg{segment_length}_hop{hop_length}_ov{overlap}.pkl")
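
The shapes this function prints can be predicted from the parameters alone. The sketch below is a hypothetical helper (`expected_shapes` is not part of the repo) built on three assumptions:

* the sample rate is 22050 Hz, the default that `librosa.load` resamples to;
* the wavelet filter has length 2, as `'db1'` does;
* the transform uses a non-periodic extension mode such as `'symmetric'`, for which a single-level DWT of `n` samples gives `floor((n + filter_len - 1) / 2)` coefficients.

```python
import math


def expected_shapes(audio_duration, segment_length, hop_length, sr=22050, filter_len=2):
    """Predict chunk count, samples per chunk, and DWT coefficients per chunk."""
    # np.arange(0, stop, step) yields ceil(stop / step) values for a positive stop.
    n_chunks = max(0, math.ceil((audio_duration - segment_length) / hop_length))
    # Converting time to samples truncates time * sr to an integer, and the slice
    # [start_sr:end_sr + 1] used above keeps one extra sample (first chunk shown).
    samples_per_chunk = int(segment_length * sr) + 1
    # Single-level DWT output length for non-periodic extension modes.
    n_coeffs = (samples_per_chunk + filter_len - 1) // 2
    return n_chunks, samples_per_chunk, n_coeffs


print(expected_shapes(2950.974693877551, 0.5, 0.5))  # (5901, 11026, 5513)
```

The extra sample from the `+ 1` in the slice is why a nominal 0.5 s segment has a reported length of 11026 / 22050 ≈ 0.50005 s.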

In [7]:
start_entire_podcast_loop = time.time()

segment_length = discretized_dict['segment_length']
hop_length = discretized_dict['hop_length']
overlap_thresh = discretized_dict['overlap_thresh']

# Loop over every episode in the dictionary. To debug a single episode,
# iterate over a short list instead, e.g. check = ['0pIwpmg5oPcMWJXVSyrx4E'].
for i in discretized_dict['data'].keys():

    start_individual_podcast_time = time.time()

    y = discretized_dict['data'][i]['y']

    audio_duration = discretized_dict['data'][i]['audio_duration']
    print(f'Length of audio file for podcast id {i}: {audio_duration}')

    process_entire_episode(i, segment_length, hop_length,
                           overlap_thresh, y, audio_duration)

    time_entire_individual_podcast = time.time() - start_individual_podcast_time
    print(f'Podcast id {i} took {time_entire_individual_podcast / 60: .3f} minutes to complete\n')

# TODO: add logging so that progress is written to a file

total_num_episodes = len(discretized_dict['data'].keys())

time_entire_podcast_loop = time.time() - start_entire_podcast_loop
print(f'Took {time_entire_podcast_loop / 60: .3f} minutes to loop through all {total_num_episodes} podcast episodes\n\n')


Length of audio file for podcast id 6hoNS9LR1Lxb4EzMX72kXR: 2950.974693877551
Found audio file at location ../../data/audio/6hoNS9LR1Lxb4EzMX72kXR.ogg

6hoNS9LR1Lxb4EzMX72kXR took 2.34 minutes to load
Number of chunks in podcast id 6hoNS9LR1Lxb4EzMX72kXR: 5901
Number of audio time segments: 5901
Length of first audio time segment: 0.5000453514739229
Took 7.2922 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 5901
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 6hoNS9LR1Lxb4EzMX72kXR: (5901, 5514)
Podcast id 6hoNS9LR1Lxb4EzMX72kXR took  2.934 minutes to complete

Length of audio file for podcast id 79aiOALBohH9vEIUHPAoRj: 2801.9214512471654
Found audio file at location ../../data/audio/79aiOALBohH9vEIUHPAoRj.mp3





79aiOALBohH9vEIUHPAoRj took 2.12 minutes to load
Number of chunks in podcast id 79aiOALBohH9vEIUHPAoRj: 5603
Number of audio time segments: 5603
Length of first audio time segment: 0.5000453514739229
Took 8.3585 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 5603
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 79aiOALBohH9vEIUHPAoRj: (5603, 5514)
Podcast id 79aiOALBohH9vEIUHPAoRj took  2.726 minutes to complete

Length of audio file for podcast id 0pIwpmg5oPcMWJXVSyrx4E: 494.88544217687075
Found audio file at location ../../data/audio/0pIwpmg5oPcMWJXVSyrx4E.mp3





0pIwpmg5oPcMWJXVSyrx4E took 0.43 minutes to load
Number of chunks in podcast id 0pIwpmg5oPcMWJXVSyrx4E: 989
Number of audio time segments: 989
Length of first audio time segment: 0.5000453514739229
Took 1.2837 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 989
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 0pIwpmg5oPcMWJXVSyrx4E: (989, 5514)
Podcast id 0pIwpmg5oPcMWJXVSyrx4E took  0.503 minutes to complete

Length of audio file for podcast id 7LxbgPH3grqf6lCpLKEwo7: 414.32090702947846
Found audio file at location ../../data/audio/7LxbgPH3grqf6lCpLKEwo7.mp3





7LxbgPH3grqf6lCpLKEwo7 took 0.33 minutes to load
Number of chunks in podcast id 7LxbgPH3grqf6lCpLKEwo7: 828
Number of audio time segments: 828
Length of first audio time segment: 0.5000453514739229
Took 1.1123 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 828
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 7LxbgPH3grqf6lCpLKEwo7: (828, 5514)
Podcast id 7LxbgPH3grqf6lCpLKEwo7 took  0.394 minutes to complete

Length of audio file for podcast id 7vxD3WNDRkigLnIDHyy0cu: 465.39319727891154
Found audio file at location ../../data/audio/7vxD3WNDRkigLnIDHyy0cu.mp3





7vxD3WNDRkigLnIDHyy0cu took 0.35 minutes to load
Number of chunks in podcast id 7vxD3WNDRkigLnIDHyy0cu: 930
Number of audio time segments: 930
Length of first audio time segment: 0.5000453514739229
Took 1.1130 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 930
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 7vxD3WNDRkigLnIDHyy0cu: (930, 5514)
Podcast id 7vxD3WNDRkigLnIDHyy0cu took  0.418 minutes to complete

Length of audio file for podcast id 7r367wUYs1EvyBbeyOcq39: 460.94367346938776
Found audio file at location ../../data/audio/7r367wUYs1EvyBbeyOcq39.mp3





7r367wUYs1EvyBbeyOcq39 took 0.35 minutes to load
Number of chunks in podcast id 7r367wUYs1EvyBbeyOcq39: 921
Number of audio time segments: 921
Length of first audio time segment: 0.5000453514739229
Took 1.3402 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 921
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 7r367wUYs1EvyBbeyOcq39: (921, 5514)
Podcast id 7r367wUYs1EvyBbeyOcq39 took  0.431 minutes to complete

Length of audio file for podcast id 0TkGYYIPwRqx8xzP0XGvRG: 1507.1825850340135
Found audio file at location ../../data/audio/0TkGYYIPwRqx8xzP0XGvRG.ogg

0TkGYYIPwRqx8xzP0XGvRG took 1.17 minutes to load
Number of chunks in podcast id 0TkGYYIPwRqx8xzP0XGvRG: 3014
Number of audio time segments: 3014
Length of first audio time segment: 0.5000453514739229
Took 3.6234 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 3014
length of 




1XgTQnRlfJ0zpDdg2DccbR took 2.08 minutes to load
Number of chunks in podcast id 1XgTQnRlfJ0zpDdg2DccbR: 5630
Number of audio time segments: 5630
Length of first audio time segment: 0.5000453514739229
Took 7.0135 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 5630
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 1XgTQnRlfJ0zpDdg2DccbR: (5630, 5514)
Podcast id 1XgTQnRlfJ0zpDdg2DccbR took  2.554 minutes to complete

Length of audio file for podcast id 2hgBlERSFYDWndqjWNOV6v: 2382.9696145124717
Found audio file at location ../../data/audio/2hgBlERSFYDWndqjWNOV6v.mp3





2hgBlERSFYDWndqjWNOV6v took 1.80 minutes to load
Number of chunks in podcast id 2hgBlERSFYDWndqjWNOV6v: 4765
Number of audio time segments: 4765
Length of first audio time segment: 0.5000453514739229
Took 6.8827 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 4765
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 2hgBlERSFYDWndqjWNOV6v: (4765, 5514)
Podcast id 2hgBlERSFYDWndqjWNOV6v took  2.401 minutes to complete

Length of audio file for podcast id 1jfWVd1sTapjjkQwDM4yQX: 357.0721088435374
Found audio file at location ../../data/audio/1jfWVd1sTapjjkQwDM4yQX.mp3





1jfWVd1sTapjjkQwDM4yQX took 0.28 minutes to load
Number of chunks in podcast id 1jfWVd1sTapjjkQwDM4yQX: 714
Number of audio time segments: 714
Length of first audio time segment: 0.5000453514739229
Took 0.9887 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 714
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 1jfWVd1sTapjjkQwDM4yQX: (714, 5514)
Podcast id 1jfWVd1sTapjjkQwDM4yQX took  0.335 minutes to complete

Length of audio file for podcast id 2DCEgkjSeRQyMt8KovG8vF: 419.0722902494331
Found audio file at location ../../data/audio/2DCEgkjSeRQyMt8KovG8vF.mp3





2DCEgkjSeRQyMt8KovG8vF took 0.36 minutes to load
Number of chunks in podcast id 2DCEgkjSeRQyMt8KovG8vF: 838
Number of audio time segments: 838
Length of first audio time segment: 0.5000453514739229
Took 1.1095 seconds to complete dwt over each audio time segment
length of approximation coefficients cA_lst: 838
length of each chunk's approximation coefficients in cA_lst: 5513

Size of final dataframe for podcast id 2DCEgkjSeRQyMt8KovG8vF: (838, 5514)
Podcast id 2DCEgkjSeRQyMt8KovG8vF took  0.428 minutes to complete

Took  14.529 minutes to loop through all 11 podcast episodes
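
The TODO in the loop above mentions logging to a file. Below is a minimal sketch of one way to do that with Python's standard `logging` module. The logger name, log file location, and placeholder episode ids are assumptions for illustration, and the call to `process_entire_episode` is left as a comment so the block runs on its own.

```python
import logging
import os
import tempfile
import time


def make_file_logger(log_path, name="episode_processing"):
    """Create a logger that appends timestamped messages to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines when the cell is re-run
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Hypothetical log location; a temporary directory keeps the sketch self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "process_episodes.log")
logger = make_file_logger(log_path)

for podcast_id in ["episode_a", "episode_b"]:  # placeholder ids
    start = time.time()
    # process_entire_episode(...) would be called here in the notebook
    logger.info("Podcast id %s took %.3f minutes", podcast_id, (time.time() - start) / 60)

# Read the file back to confirm that one line was written per episode.
with open(log_path) as f:
    log_lines = f.read().splitlines()
print(len(log_lines))
```

Writing to a file keeps the timing information available after the notebook output is cleared, which helps with runs that take several minutes per episode.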


