# Lab 5: Sound-types

As we saw in class, sound-types is a multi-layer framework for representing and synthesizing sounds. In this lab, we will apply the theory of sound-types to create new audio from input audio. The two main types of synthesis that sound-types make possible are *probabilistic generation* and *sound hybridization*. Sections 1 and 2 walk you through applying probabilistic generation to an input sound. In Section 3, you will combine two input sounds through sound hybridization.

First, let's briefly review the theory of sound-types. The analysis phase of sound-types takes in an input sample as audio, and performs the following steps:
1. **atomize**: the input sound is divided into very small (40 ms) overlapping chunks called *atoms*
2. **make classes**: for each atom, compute low-level descriptors that place the atom in a feature space. In this space we can see which atoms are closer to each other (and therefore more similar), and create groups (clusters) of similar atoms.
3. **compute probabilities**: finally, determine the sequential relationship between the clusters of atoms computed in step 2. Using a Markov chain, we can estimate the probability that one cluster of atoms is followed by another in the input sound.

For further details, see: Cella, Carmine-Emanuele & Burred, Juan José. (2013). *Advanced sound hybridizations by means of the theory of sound-types.* ICMC, 2013.
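
The three analysis steps can be sketched on a toy signal as follows. This is only an illustration, not the implementation in `st_tools`: the helper names (`atomize`, `toy_descriptors`, `assign_classes`, `transition_matrix`) are hypothetical, the descriptors are RMS energy and zero-crossing rate rather than the MFCCs used later in the lab, and a single nearest-centroid assignment with hand-picked centroids stands in for real clustering.

```python
import numpy as np

def atomize(signal, frame_size, hop_size):
    """Step 1: cut the signal into overlapping frames (atoms)."""
    n_atoms = 1 + (len(signal) - frame_size) // hop_size
    return np.stack([signal[i * hop_size : i * hop_size + frame_size]
                     for i in range(n_atoms)])

def toy_descriptors(atoms):
    """Step 2a: describe each atom by (RMS energy, zero-crossing rate)."""
    rms = np.sqrt(np.mean(atoms ** 2, axis=1))
    zcr = np.mean(np.abs(np.diff(np.sign(atoms), axis=1)) > 0, axis=1)
    return np.column_stack([rms, zcr])

def assign_classes(features, centroids):
    """Step 2b: label each atom with the index of its nearest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def transition_matrix(labels, n_classes):
    """Step 3: count label-to-label transitions and normalize each row."""
    counts = np.zeros((n_classes, n_classes))
    for a, b in zip(labels[:-1], labels[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

sr = 8000
t = np.arange(sr) / sr
# Toy input: a quiet tone for the first half second, a loud tone afterwards.
signal = np.where(t < 0.5, 0.1, 0.8) * np.sin(2 * np.pi * 220 * t)

atoms = atomize(signal, frame_size=320, hop_size=160)  # 40 ms atoms, 50% overlap
features = toy_descriptors(atoms)
# Hand-picked centroids (quietest and loudest atom) instead of a clustering run.
centroids = features[[np.argmin(features[:, 0]), np.argmax(features[:, 0])]]
labels = assign_classes(features, centroids)
P = transition_matrix(labels, n_classes=2)
print(P)  # P[a, b] estimates the probability that class a is followed by class b
```

Each row of `P` that has at least one observed transition sums to 1, which is what lets it be read as a set of transition probabilities.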

## Section 1: Probabilistic Generation

Now that we have seen how the analysis phase of sound-types works, we will use the Markov chain generated in step 3 to create new audio based on our input signal. The Markov chain is a series of states, where each state is a cluster of atoms. We have already computed the probabilities of transitioning between states. Therefore, we can create a new sound in a similar way to how we created Bach chorales in lab 2:

1. Randomly select a starting state
2. Select an atom from that state
3. Using the transition probabilities of that state, select the next state
4. Repeat steps 2 and 3 until a stopping condition is met

The atoms that we select in step 2 become our generated audio.
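
The four steps above can be sketched with a small hand-written model. Everything here is hypothetical and chosen for illustration: the three states, the atom indices in `atoms_in_state`, and the probabilities in `transitions`. The lab's `make_soundtypes` may store its Markov chain in a different form.

```python
import random

# Hypothetical toy model: 3 states (clusters), each holding some atom indices.
atoms_in_state = {0: [0, 1, 2], 1: [3, 4], 2: [5, 6, 7]}

# transitions[s] lists the probabilities of moving from state s to states 0, 1, 2.
transitions = {
    0: [0.1, 0.6, 0.3],
    1: [0.5, 0.0, 0.5],
    2: [0.3, 0.7, 0.0],
}

def generate(n_steps, rng):
    states = list(transitions)
    state = rng.choice(states)                      # 1. random starting state
    sequence = []
    for _ in range(n_steps):                        # 4. stop after n_steps atoms
        sequence.append(rng.choice(atoms_in_state[state]))           # 2. pick an atom
        state = rng.choices(states, weights=transitions[state])[0]   # 3. next state
    return sequence

rng = random.Random(0)
atom_sequence = generate(10, rng)
print(atom_sequence)
```

Here the stopping condition is simply a fixed number of atoms, which mirrors the role of `N_FRAMES` in the cells below.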

The following code cells will walk you through running the analysis and synthesis portions of the sound-types. Run each cell in order and listen to the inputs and outputs.

In [1]:
!pip install librosa
import numpy as np
import soundfile as sf
import librosa
import librosa.display
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from st_tools import *
from IPython.display import Audio, display, clear_output
import ipywidgets as widgets
from pathlib import Path

N_COEFF = 14
ST_RATIO = .9
N_FRAMES = 500
FRAME_SIZE = 1024
HOP_SIZE = 512
MAX_LOOPS = 3 
SR = 44100

SAMPLES_PATH = Path('./samples')





Use the following code to choose a sample.

In [2]:
sample_list = [str(file.name) for file in Path('./samples').iterdir() if file.is_file()]

sample_dropdown = widgets.Dropdown(
    options=sample_list,
    description="Sample:"
)

# Create a button widget
button = widgets.Button(description="Listen")

# Create an Output widget to display the generated music
output_widget = widgets.Output()

# Define a function to be called when the button is clicked
def on_button_click(b):
    # Keep the loaded audio and its sample rate in global variables,
    # because the generation cell reads `y` and `sr`.
    global y, sr
    with output_widget:
        clear_output(wait=True)  # Clear the output widget without clearing the dropdowns
        path = SAMPLES_PATH / sample_dropdown.value
        y, sr = librosa.load(path, sr=SR)
        display(Audio(y, rate=SR))

# Attach the function to the button's click event
button.on_click(on_button_click)

# Display the widgets and button
widgets.VBox([sample_dropdown, button, output_widget])

Now run this code to use sound-types to generate a new sound based on the chosen sample. Click **Listen** above first so that the sample is loaded.

In [4]:
print ('[soundtypes - probabilistic generation]\n')
print ('computing features...')
y_pad = np.zeros(len(y) + FRAME_SIZE)
y_pad[1:len(y)+1] = y
C = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_COEFF, n_fft=FRAME_SIZE, 
                         hop_length=HOP_SIZE)

print ('multidimensional scaling...')
mds = MDS(2)
C_scaled = mds.fit_transform (C.T)

print ('computing soundtypes...')
(dictionary, markov, centroids, labels) = \
    make_soundtypes(C_scaled, ST_RATIO)
n_clusters = centroids.shape[0]

# print (markov)
print ('generate new sequence...')
w1 = np.random.randint (n_clusters)
prev_w1 = 0
loops = 0
gen_sequence = []
gen_sound = np.zeros(N_FRAMES * HOP_SIZE + FRAME_SIZE)
for i in range(N_FRAMES):
    prev_w1 = w1  # remember the current state so repeated states can be detected
    l = markov[(w1)]
    if len(l) == 0:
        w1 = np.random.randint(n_clusters)
    else:
        w1 = l[np.random.randint(len(l))]
    if prev_w1 == w1:
        loops += 1

    if loops > MAX_LOOPS:
        w1 = np.random.randint(n_clusters)
        loops = 0

    gen_sequence.append(w1)
    p = dictionary[(w1)]
    atom = p[np.random.randint(len(p))]

    chunk = y_pad[atom*HOP_SIZE:atom*HOP_SIZE+FRAME_SIZE] \
        * np.hanning(FRAME_SIZE)
    gen_sound[i*HOP_SIZE:i*HOP_SIZE+FRAME_SIZE] += chunk

print ('saving audio data...')
sf.write('generated_sound.wav', gen_sound, sr)

print('done.')

print("Generated audio:")
Audio(gen_sound, rate=sr)

[soundtypes - probabilistic generation]

computing features...
multidimensional scaling...
computing soundtypes...
generate new sequence...
saving audio data...
done.


Below you can listen to the generated sound:

## Section 2: Probabilistic Generation with Onsets

In this section we again perform probabilistic generation, but the atoms are created differently. In Section 1, every atom had the same length, around 40 ms. Now, we will detect where each new *onset* occurs in the input sound and start a new atom at each onset. For example, if the input is a recording of a piano playing a melody, each new note in the melody is a new onset. The onsets are determined from the *spectral flux*, which measures how quickly the spectrum of the sound is changing. At a new onset, for example a note played on the piano, the spectral flux has a high value.
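
To make the idea of spectral flux concrete, here is a minimal sketch on a toy signal. It is a generic formulation (sum of the positive changes in the magnitude spectrum between consecutive frames, followed by simple peak picking) and is not necessarily what `get_segments` in `st_tools` does. The function names `spectral_flux` and `pick_onsets`, the frame sizes, and the threshold are assumptions made for this example.

```python
import numpy as np

def spectral_flux(signal, frame_size, hop_size):
    """Sum of the positive magnitude-spectrum changes between consecutive frames."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    mags = np.array([
        np.abs(np.fft.rfft(window * signal[i * hop_size : i * hop_size + frame_size]))
        for i in range(n_frames)
    ])
    diff = np.diff(mags, axis=0)
    flux = np.sum(np.maximum(diff, 0.0), axis=1)  # keep only increases in energy
    return np.concatenate([[0.0], flux])          # frame 0 has no predecessor

def pick_onsets(flux, threshold):
    """Indices of frames where the flux is a local maximum above the threshold."""
    return [i for i in range(1, len(flux) - 1)
            if flux[i] > threshold and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]]

sr = 8000
hop = 128
t = np.arange(sr) / sr
# Toy input: half a second of silence, then a 440 Hz tone begins.
signal = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)

flux = spectral_flux(signal, frame_size=256, hop_size=hop)
onsets = pick_onsets(flux, threshold=0.5 * flux.max())
onset_times = [i * hop / sr for i in onsets]
print(onset_times)  # start times (in seconds) of the frames flagged as onsets
```

The flux stays near zero while the spectrum is stable and spikes in the frames where the tone enters, so the detected onset falls close to 0.5 s.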

In [21]:
import numpy as np
import librosa
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from st_tools import get_segments, make_soundtypes

N_COEFF = 20
ST_RATIO = .5
N_FRAMES = 100
FRAME_SIZE = 1024
HOP_SIZE = 1024
MAX_LOOPS = 3
WIDTH = 16
FADE_MS = 10
SR = 44100

SAMPLES_PATH = Path('./samples')

Just as in the previous section, we first set our input sound:

In [22]:
sample_list = [str(file.name) for file in Path('./samples').iterdir() if file.is_file()]

sample_dropdown = widgets.Dropdown(
    options=sample_list,
    description="Sample:"
)

# Create a button widget
button = widgets.Button(description="Listen")

# Create an Output widget to display the generated music
output_widget = widgets.Output()

# Define a function to be called when the button is clicked
def on_button_click(b):
    # Keep the loaded audio and its sample rate in global variables,
    # because the generation cell reads `y` and `sr`.
    global y, sr
    with output_widget:
        clear_output(wait=True)  # Clear the output widget without clearing the dropdowns
        path = SAMPLES_PATH / sample_dropdown.value
        y, sr = librosa.load(path, sr=SR)
        display(Audio(y, rate=SR))

# Attach the function to the button's click event
button.on_click(on_button_click)

# Display the widgets and button
widgets.VBox([sample_dropdown, button, output_widget])

In [24]:
print ('[soundtypes - probabilistic generation on onsets]\n')
print ('computing segments...')


(segments, onsets, flux) = get_segments (y, sr, FRAME_SIZE, HOP_SIZE, \
    FADE_MS, WIDTH)

print ('computing features...')
features = []
for i in range (len(segments)):
    C = librosa.feature.mfcc(y=segments[i], sr=sr, n_mfcc=N_COEFF,
                             n_fft=FRAME_SIZE, hop_length=HOP_SIZE)
    features.append(np.mean (C, axis=1))

C = np.vstack(features)

print ('multidimensional scaling...')
mds = MDS(2)
C_scaled = mds.fit_transform (C)

print ('computing soundtypes...')
(dictionary, markov, centroids, labels) = \
    make_soundtypes(C_scaled, ST_RATIO)
n_clusters = centroids.shape[0]

print ('generate new sequence...')
w1 = np.random.randint (n_clusters)
prev_w1 = 0
loops = 0
gen_sequence = []
gen_sound = []
for i in range(N_FRAMES):
    prev_w1 = w1  # remember the current state so repeated states can be detected
    l = markov[(w1)]
    if len(l) == 0:
        w1 = np.random.randint(n_clusters)
    else:
        w1 = l[np.random.randint(len(l))]
    if prev_w1 == w1:
        loops += 1

    if loops > MAX_LOOPS:
        w1 = np.random.randint(n_clusters)
        loops = 0

    gen_sequence.append(w1)
    p = dictionary[(w1)]
    atom = p[np.random.randint(len(p))]

    gen_sound.append (segments[atom])

gen_sound = np.hstack (gen_sound)

print ('saving audio data...')
sf.write('generated_sound.wav', gen_sound, sr)

print('done.')

print("Generated audio:")

Audio(gen_sound, rate=sr)

[soundtypes - probabilistic generation on onsets]

computing segments...
computing features...
multidimensional scaling...
computing soundtypes...
generate new sequence...
saving audio data...
done.


Below you can listen to the generated sound. Compare this generation to the output from Section 1. Are there differences? Does one work "better"?

## Section 3: Sound Hybridization

Another operation that sound-types make possible is sound hybridization. Similar to style transfer, sound hybridization takes the spectral content of one sound and applies it to the temporal structure of another sound.
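
As a rough picture of how this can work, each cluster of the source sound can be paired with the closest cluster of the target sound in feature space, and the source's sequence of clusters is then replayed using atoms drawn from the paired target clusters. The sketch below assumes exactly that. The centroids, the label sequence, and the `atoms_dst` dictionary are all made-up values, and it uses only the single nearest target cluster, whereas the lab code further down picks at random among the `K` nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster centroids in a 2-D feature space (one row per sound-type).
centroids_src = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
centroids_dst = np.array([[0.2, -0.1], [3.8, 4.1], [1.1, 0.9], [9.0, 9.0]])

# Pairwise Euclidean distances: rows are source types, columns are target types.
dists = np.linalg.norm(centroids_src[:, None, :] - centroids_dst[None, :, :], axis=2)
match = np.argmin(dists, axis=1)  # closest target type for each source type

# Hypothetical data: the source's cluster label per frame, and the atom
# indices that belong to each target cluster.
labels_src = [0, 0, 1, 2, 2, 1]
atoms_dst = {0: [0, 5], 1: [1, 2], 2: [3], 3: [4]}

# Follow the source's sequence in time, but take every atom from the target.
hybrid_atoms = [int(rng.choice(atoms_dst[int(match[label])])) for label in labels_src]
print(match.tolist(), hybrid_atoms)
```

The timing comes from `labels_src` (the source), while every atom index in `hybrid_atoms` refers to the target's audio.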

Here is a description of the process from the paper on sound-types:

"It is possible to subject two different sounds to separate types and rules inferences, and then impose or merge one sound’s types or rules with the others’. The sound-types inferred from a signal (the source) are replaced by, or merged with, the sound-types inferred from a target signal. Each sound-type from the source is matched with a sound-type from the target, in terms of a similarity measure between the centroids of their corresponding feature clusters."

In [26]:
import numpy as np
import librosa
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from st_tools import make_soundtypes
import soundfile as sf

N_COEFF = 13
FRAME_SIZE = 2048
HOP_SIZE = 1024
ST_RATIO = .7
K = 5
SR = 44100

SAMPLES_PATH = Path('./samples')

Below, set the "source" and "target" files, then run the cell after it to listen to the samples.

In [27]:
# Source File

samples = list(SAMPLES_PATH.iterdir())
samples = [file.name for file in samples]
w_src = widgets.Dropdown(options=samples, value='cage.wav', description='Source File:')
source_file = SAMPLES_PATH / w_src.value
y_src = None
sr = None

y_src, sr = librosa.load(source_file, sr=SR)

def on_change_src(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global source_file
        global y_src
        global sr
        
        source_file = SAMPLES_PATH / change.new
        y_src, sr = librosa.load(source_file, sr=SR)
        print("changed source file to", source_file)
        
w_src.observe(on_change_src)

# Target File

w_dst = widgets.Dropdown(options=samples, value='lachenmann.wav', description='Target File:')
target_file = SAMPLES_PATH / w_dst.value
y_dst = None
sr = None

y_dst, sr = librosa.load(target_file, sr=SR)

def on_change_dst(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global target_file
        global y_dst
        global sr
        
        target_file = SAMPLES_PATH / change.new
        y_dst, sr = librosa.load(target_file, sr=SR)
        print("changed destination file to", target_file)
        
w_dst.observe(on_change_dst)

display(w_src, w_dst)

In [28]:
print("Source audio:", source_file)
display(Audio(y_src, rate=sr))

print("Target audio:", target_file)
display(Audio(y_dst, rate=sr))

Source audio: samples/cage.wav


Target audio: samples/lachenmann.wav


After setting your source and target files, run the following code to generate a new sample that is a hybridization of the two sounds.

In [29]:
print ('[soundtypes - timbre matching]\n')
print ('computing features...')

y_pad_src = np.zeros(len(y_src) + FRAME_SIZE)
y_pad_src[1:len(y_src)+1] = y_src

C_src = librosa.feature.mfcc(y=y_src, sr=sr, n_mfcc=N_COEFF,
                                 n_fft=FRAME_SIZE, hop_length=HOP_SIZE)

y_pad_dst = np.zeros(len(y_dst) + FRAME_SIZE)
y_pad_dst[1:len(y_dst)+1] = y_dst

C_dst = librosa.feature.mfcc(y=y_dst, sr=sr, n_mfcc=N_COEFF,
                                 n_fft=FRAME_SIZE, hop_length=HOP_SIZE)

C_scaled_dst = C_dst.T
C_scaled_src = C_src.T

scaler = StandardScaler ()
C_scaled_dst = scaler.fit_transform (C_scaled_dst)
C_scaled_src = scaler.fit_transform (C_scaled_src)

print ('computing soundtypes...')
(dictionary_src, markov_src, centroids_src, labels_src) = \
    make_soundtypes(C_scaled_src, ST_RATIO)
n_clusters_src = centroids_src.shape[0]
(dictionary_dst, markov_dst, centroids_dst, labels_dst) = \
    make_soundtypes(C_scaled_dst, ST_RATIO)
n_clusters_dst = centroids_dst.shape[0]

print ('matching clusters...')
knn = NearestNeighbors(n_neighbors=K).fit(centroids_dst)
dist, idxs = knn.kneighbors(centroids_src)

print ('generate hybridization...')
n_frames = len(labels_src)
gen_sound = np.zeros(n_frames * HOP_SIZE + FRAME_SIZE)
for i in range(n_frames):
    labels_match = idxs[labels_src[i], :]
    x = labels_match[np.random.randint(K)]
    p = dictionary_dst[x]
    if len(p) == 0:
        atom = 0
    else:
        atom = p[np.random.randint(len(p))]

    amp = np.sum (np.abs(y_pad_src[i * HOP_SIZE : i * HOP_SIZE + \
        FRAME_SIZE]))
    chunk = y_pad_dst[atom * HOP_SIZE : atom * HOP_SIZE + FRAME_SIZE] \
        * np.hanning(FRAME_SIZE)

    norm = np.max (np.abs(chunk))
    if norm == 0:
        norm = 1

    chunk /= norm
    gen_sound[i * HOP_SIZE : i * HOP_SIZE + FRAME_SIZE] += (chunk * amp / n_frames)

print ('saving audio data...')
sf.write('generated_sound.wav', gen_sound, sr)
print('done.')

print("Generated audio:")
Audio(gen_sound, rate=sr)

[soundtypes - timbre matching]

computing features...
computing soundtypes...
matching clusters...
generate hybridization...
saving audio data...
done.


What do you think about this sound hybridization process? Do you consider it successful at style transfer? Why or why not?