# Exercise 8: Sound transformations

In this exercise you will use the HPS model to creatively transform sounds. There are two parts in this exercise. In the first one you should perform a natural sounding transformation on the speech sound that you used in the previous exercise (E7). In the second part you should select a sound of your choice and do a "creative" transformation. You will have to write a short description of the sound and of the transformation you did, giving the link to the original sound and uploading several transformed sounds.

For this exercise, you can use `transformations_GUI.py` (in `software/transformations_interface/`) to try things out. Once you have decided on the parameters, you can fill in the code in this file. You can also do everything from here and add any new code you wish.

In order to perform a good/interesting transformation you should make sure that you have performed an analysis that is adequate for the type of transformation you want to do. Not every HPS analysis representation will work for every type of sound transformation. There will be things in the analysis that, when modified, will result in undesired artifacts. In general, for any transformation, it is best to have the harmonic values as smooth and continuous as possible and a stochastic representation that is as smooth as possible and has as few values as possible. It might be much better to start with an analysis representation that does not result in the best reconstruction, in exchange for having smoother and more compact data.
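
A rough way to check how "smooth and continuous" an analysis is could be to look at the fundamental track. The sketch below is only an illustration: `f0_track_report` is a hypothetical helper (not part of sms-tools), and the toy tracks stand in for real analysis output. It reports the fraction of frames with no detected f0 and the largest frame-to-frame jump in cents.

```python
import numpy as np

def f0_track_report(f0):
    """Return (gap_ratio, max_jump_cents) for a per-frame f0 track in Hz.

    Frames with a value of 0 are treated as frames where no f0 was found.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    gap_ratio = 1.0 - voiced.mean()
    # Jumps are measured only between two consecutive voiced frames
    both = voiced[1:] & voiced[:-1]
    jumps = 1200 * np.abs(np.log2(f0[1:][both] / f0[:-1][both]))
    max_jump = float(jumps.max()) if jumps.size else 0.0
    return gap_ratio, max_jump

# Toy tracks (hypothetical values, in Hz)
smooth_track = [200, 201, 202, 203]
rough_track = [200, 0, 400, 205]

print(f0_track_report(smooth_track))
print(f0_track_report(rough_track))
```

In this notebook the same check could presumably be applied to the first column of `hfreq`, assuming that column holds the fundamental. A lower gap ratio and smaller jumps suggest a representation that will tolerate strong transformations better.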

To help you with the exercise, we give a brief description of the transformation parameters used by the HPS transformation function:

1. `freqScaling`: frequency scaling factors to be applied to the harmonics of the sound, in time-value pairs. The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The scaling factor is multiplicative, so a value of 1 means no change. Example: to transpose the sound up an octave you can specify `[0, 2, 1, 2]`.
2. `freqStretching`: frequency stretching factors to be applied to the harmonics of the sound, in time-value pairs. The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The stretching factor is a multiplicative factor whose effect depends on the harmonic number, higher harmonics being more affected than lower ones, thus resulting in an inharmonic effect. A value of 1 results in no transformation. Example: an array like `[0, 1.2, 1, 1.2]` will result in a perceptually large inharmonic effect.
3. `timbrePreservation`: 1 preserves the original timbre, 0 does not. It can only have a value of 0 or of 1. By setting the value to 1 the spectral shape of the original sound is preserved even when the frequencies of the sound are modified. In the case of speech it would correspond to the idea of preserving the identity of the speaker after the transformation.
4. `timeScaling`: time scaling to be applied to the whole sound, given as pairs of input time and output time. The input time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. An identity mapping such as `[0, 0, 1, 1]` leaves the duration unchanged. Example: to stretch the original sound to twice the original duration, we can specify `[0, 0, 1, 2]`.
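
To make the difference between `freqScaling` and `freqStretching` concrete, here is a small numeric sketch. It assumes that stretching multiplies each harmonic by the factor raised to its harmonic index (0 for the fundamental). That is one plausible reading of the description above, and the fundamental of 220 Hz is a made-up value.

```python
import numpy as np

f0 = 220.0                        # hypothetical fundamental, in Hz
k = np.arange(5)                  # harmonic index, 0 = fundamental
harmonics = f0 * (k + 1)          # a perfectly harmonic series

scaled = harmonics * 2.0          # freqScaling = 2: everything moves up one octave
stretched = harmonics * 1.2 ** k  # freqStretching = 1.2 (assumed form: factor ** index)

# Ratio of each partial to the lowest one: integers mean the series is still harmonic
scaled_ratios = scaled / scaled[0]
stretched_ratios = stretched / stretched[0]
print(scaled_ratios)
print(stretched_ratios)
```

Scaling keeps the integer ratios between partials, so the sound stays harmonic at a new pitch. Stretching pushes the upper partials progressively further away from integer multiples, which is why even a factor of 1.2 sounds strongly inharmonic.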

All the transformation values can have as many points as desired, but they have to be in the form of an array with time-value pairs, so of even size. For example a good array for a frequency stretching of a sound that has a duration of 3.146 seconds could be: `[0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]`.
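
As an illustration of how such an array can be read, the sketch below splits a flat time-value array into breakpoints and interpolates one value per analysis frame. `envelope_from_pairs` is a hypothetical helper, and linear interpolation between breakpoints is an assumption of this example.

```python
import numpy as np

def envelope_from_pairs(pairs, n_frames):
    """Interpolate a flat [t0, v0, t1, v1, ...] array to one value per frame."""
    pairs = np.asarray(pairs, dtype=float)
    if pairs.size % 2 != 0:
        raise ValueError("time-value array must have an even number of elements")
    times, values = pairs[::2], pairs[1::2]
    if np.any(np.diff(times) < 0):
        raise ValueError("time values must not decrease")
    # Place the frames evenly between time 0 and the last breakpoint
    frame_times = np.linspace(0, times[-1], n_frames)
    return np.interp(frame_times, times, values)

freqStretching = [0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]
env = envelope_from_pairs(freqStretching, n_frames=8)
print(np.round(env, 3))
```

With this array the stretching factor stays at 1.2 until 2.01 s, ramps down to 0.7 by 2.679 s, and holds that value to the end of the sound.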

## Part 1. Perform natural sounding transformations of a speech sound

Use the HPS model with the sound `speech-female.wav`, available in the sounds directory, to first analyze and then obtain a natural sounding transformation of the sound. The synthesized sound should sound as different as possible from the original sound while still sounding natural. By natural we mean that it should sound like speech, that it could have been produced by a human, and that by listening we would consider it a speech sound, even though we might not be able to understand it. You should first make sure that you start from a good analysis, then you can do time and/or frequency scaling transformations. The transformation should be done in a single pass, with no mixing of sounds coming from different transformations. Since you used the same sound in E7, use that experience to get a good analysis, but consider that the analysis, given that we now want to use it for applying a very strong transformation, might need to be done differently from what you did in E7.

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import get_window
import sys, os
sys.path.append('../software/models/')
sys.path.append('../software/transformations/')
import utilFunctions as UF
import stft as STFT
import hpsModel as HPS
import hpsTransformations as HPST
import harmonicTransformations as HT
import IPython.display as ipd

In [2]:
# 1.1 perform an analysis/synthesis using the HPS model

input_file = '../sounds/speech-female.wav'

### set the parameters
window ='blackman'
M = 1891
N = 2048
t = -90
minSineDur = 0.1
nH = 40 
minf0 = 100
maxf0 = 300
f0et = 3
harmDevSlope = 0.01
stocf = 0.1

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

In [3]:
# 1.2 Perform a transformation from the previous analysis

### define the transformations
freqScaling = np.array([0, 2.2, .62, 1.7, 2.3, 1.2, 2.5, 2, 2.7, .5, 3.0, .6, 3.5, 2.4, 3.6, .6, 3.8, .8])
freqStretching = np.array([0,1,1,1])
timbrePreservation = 1
timeScaling = np.array([0,0, .584, .6, .585, .8, 1, 1.2])

# no need to modify the following code 
Ns = 512
H = 128

# frequency scaling of the harmonics 
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation 
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

### Explain Part 1
For the analysis of the sound, I kept most of the parameters from E7, except for the following ones: t is -90 (10 dB lower than before; this value is enough to avoid the presence of peaks that do not correspond to sinusoidal peaks), minSineDur is longer (0.1 now, 0.07 in E7; this value is better for the transformations because it tracks smoother and more continuous harmonics), f0et is smaller now (value of 3, so that the f0 values in the parts of the sound without a clear harmonic structure can be discarded) and stocf is 0.1 (lower than in E7, in order to have a more compact representation).

The purpose of the above transformations was to change the original sound as much as possible without making it sound artificial. In particular, the aim was to make it sound like a two-person conversation: the first part would be one speaker with a higher pitch, then a more pronounced pause (than the one in the original sound) and finally, the second speaker with a lower pitch saying "la vaca es cega" with a question intonation. The frequency scaling was also done with the idea of changing the "accent" of the speakers to sound more expressive.
The specific transformations were:
- Frequency scaling: in order to change the "accent" of the voice I made a lot of variations in the pitch, taking into account the idea of having 2 different speakers, the first one with a higher pitch than the second. For that purpose, I used the time-domain waveform of the whole original audio to determine the time in seconds of the different phrase segments where the pitch changes should be introduced. As can be seen from the freqScaling array, the pitch changes are abundant, with many rising and falling variations. For instance, when pronouncing "vaca", to give it a very expressive accent, I introduced pitch scaling factors to raise the intonation in the first "a" and then lower it very suddenly in the second "a".
- Time scaling: I introduced some time scaling factors to make it sound more like a 2-person conversation. In particular, for the first segment (first speaker) I slowed down the speed to make it sound more relaxed, and I introduced a pronounced pause between the two speakers (between time .584 and .585).
- Timbre preservation: I put it to 1 to keep the naturalness of the original voice.
- Frequency stretching: no frequency stretching was introduced, since it results in inharmonic effects, and we want to keep the resulting sound as natural as possible.
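
The effect of the time scaling array above can be checked numerically. The sketch below reads the pairs as (input time, output time), an assumption consistent with the `[0, 0, 1, 2]` example in the parameter description, and computes how much each segment between breakpoints is stretched.

```python
import numpy as np

timeScaling = np.array([0, 0, .584, .6, .585, .8, 1, 1.2])
in_t, out_t = timeScaling[::2], timeScaling[1::2]

# Output duration divided by input duration, for each segment between breakpoints
local_factor = np.diff(out_t) / np.diff(in_t)

for start, end, factor in zip(in_t[:-1], in_t[1:], local_factor):
    print(f"input {start:.3f} to {end:.3f}: duration x {factor:.2f}")
```

Under that reading, the first segment is slowed down slightly, the very short middle segment is expanded by a large factor (this is what produces the pause), and the last segment keeps roughly its original speed.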

## Part 2. Perform creative transformations with a sound of your choice

Pick any natural and harmonic sound from Freesound and use the HPS model to do the most creative and interesting transformation you can come up with. The result should sound as different as possible from the original sound.

It is essential that you start with a natural harmonic sound. Examples include (but are not limited to) any acoustic harmonic instrument, speech, harmonic sounds from nature, etc. As long as the sound has a harmonic structure, you can use it. You can even reuse the sound you used in E7 Part 2 or upload your own sound to Freesound and then use it.

The sound from Freesound can be in any format, but to use the sms-tools software you will first have to convert it to a monophonic file (one channel) with a sampling rate of 44100 Hz and 16-bit samples.
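
If you prefer to do the conversion in Python, one possible approach is sketched below using NumPy and SciPy. The helper name `to_mono_44100_int16` and the synthetic stereo test signal are assumptions of this example. It expects floating-point audio in the range -1 to 1, with shape (frames, channels).

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_mono_44100_int16(x, fs, target_fs=44100):
    """Convert float audio in [-1, 1] to mono 16-bit samples at target_fs."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 2:
        x = x.mean(axis=1)                # average the channels
    if fs != target_fs:
        g = gcd(target_fs, fs)
        x = resample_poly(x, target_fs // g, fs // g)
    x = np.clip(x, -1.0, 1.0)             # guard against resampling overshoot
    return target_fs, np.round(x * 32767).astype(np.int16)

# Hypothetical input: one second of a stereo 48 kHz sine
fs_in = 48000
t = np.arange(fs_in) / fs_in
stereo = np.stack([0.5 * np.sin(2 * np.pi * 440 * t)] * 2, axis=1)

fs_out, mono = to_mono_44100_int16(stereo, fs_in)
print(fs_out, mono.shape, mono.dtype)
```

The result can then be saved with `scipy.io.wavfile.write(path, fs_out, mono)`. An audio editor or a command-line converter would work just as well.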

You can do any interesting transformation with a single pass. It is not allowed to mix sounds obtained from different transformations. The transformed sound need not sound natural. So, time to show some creativity!

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [4]:
# 2.1 perform an analysis/synthesis using the HPS model

### set the parameters
input_file = '../sounds/violin_phrase.wav'
window ='hamming'
M = 401
N = 512
t = -90
minSineDur = 0.3
nH = 30 
minf0 = 400
maxf0 = 800
f0et = 8
harmDevSlope = 0.05
stocf = 0.2

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

In [6]:
# 2.2 Perform a transformation from the previous analysis

### define the transformations
freqScaling = np.array([0, 1.1225, .76, 1.1225, .76, 0.7937, 1.29, 0.7937, 1.29, 1.1225, 1.72, 1.1225,
                        1.72, 0.4455, 2.1, 0.4455, 2.1, 0.7492, 2.64, 0.7492, 2.64, 0.8909, 3.28, 0.8909,
                        3.28, 1.1225, 4.12, 1.1225, 4.12, 1, 4.8142, 1])
freqStretching = np.array([0, 1.2, 1, 1.2])
timbrePreservation = 0
timeScaling = np.array([0, 0, 1.72, 3.44, 2.1, 4.96, 2.11, 4.96, 4.814, 10])

# no need to modify anything after this
Ns = 512
H = 128

# frequency scaling of the harmonics 
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation 
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

### Explain Part 2

I downloaded a short violin phrase from Freesound (link: https://freesound.org/people/wilLOCK/sounds/448200/). I did not have to readjust anything, since the audio was already mono, sampled at 44100 Hz with 16 bits. For the analysis of the sound, I followed the same approach as with other sounds, namely: compute the STFT with a big window size and FFT size to look for the lowest and highest fundamental frequencies, which were 440 Hz and around 740 Hz, respectively; then apply the HPR model and get the most compact representation possible. The parameters for that optimal configuration were: window = "hamming", M = 401 (computed as 4 * 44100 / 440), N = 512, t = -90, minSineDur = 0.3, nH = 30 (still keeps the perceptual characteristics), minf0 = 400, maxf0 = 800, f0et = 8 and harmDevSlope = 0.05. Looking at the harmonics + residual spectrogram, some irregularities in the high harmonics can be seen (and heard in the harmonic component) at the onset of the longest note of the melody. Some inharmonicities might be present here, since the model could not do better than this. The residual audio represents the noise of the bow movement, especially significant at the note transitions. With the same configuration, the HPS model was computed with stocf = 0.2, which gives a fair model of the stochastic component.
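
The window size computation mentioned above (M = 4 * 44100 / 440) can be written as a small helper. This is a sketch under the usual rule of thumb that the window should span at least B periods of the lowest fundamental, where B is the main-lobe width of the window in bins. The function name and the lookup table are illustrative, not part of sms-tools.

```python
import math

# Main-lobe width in bins for some common windows
MAIN_LOBE_BINS = {
    "rectangular": 2,
    "hann": 4,
    "hamming": 4,
    "blackman": 6,
    "blackmanharris": 8,
}

def window_size(window, fs, lowest_f0):
    """Smallest odd window size able to resolve harmonics spaced lowest_f0 Hz apart."""
    m = math.ceil(MAIN_LOBE_BINS[window] * fs / lowest_f0)
    return m if m % 2 == 1 else m + 1

M = window_size("hamming", 44100, 440)
print(M)
```

For a Hamming window, a 44100 Hz sampling rate and a lowest fundamental of 440 Hz, this reproduces the value M = 401 used in the analysis. An odd size is chosen so that the window can be centered on a sample.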

The purpose of the transformations was to change the melody and the timbre of the original audio. In particular, the aim was to make it sound like the typical church bell melody (Westminster chimes), so the most challenging part was determining the timestamps of each note and computing the frequency scaling factors needed to obtain that melody. Also, frequency stretching was performed to make it sound inharmonic and closer to church bells, as well as time scaling to reduce the speed to a more appropriate tempo for this melody. The specific transformations were:
- Frequency scaling: first, the notes of the original melody were identified: B4 - C#5 - A4 - F#5 - D5 - C#5 - B4 - A4. Then, the notes of the new melody were also identified: C#5 - A4 - B4 - E4 - A4 - B4 - C#5 - A4. The frequency scaling factors were computed by dividing the frequency of each new-melody note by that of the corresponding original note. After this, I identified the timestamps of each note transition by looking at the spectrogram. These timestamps and the scaling factors were used to generate the frequency scaling array for this transformation.
- Frequency stretching: in order to make it sound more inharmonic, a constant stretching factor of 1.2 was applied to the whole audio sample.
- Time scaling: I introduced time scaling factors to make the notes longer (like church bells). In particular, the fourth note was specifically made longer to fit the rhythm of the new melody better. The other notes were made longer as well.
- Timbre preservation: I set it to 0 to make it sound as different as possible from the original sound.
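
The frequency scaling factors described above can be derived from the note names. The sketch below assumes equal temperament (one semitone is a ratio of 2^(1/12)). The helper `midi_number` is hypothetical, and the note boundaries are the timestamps used in the `freqScaling` array of the transformation cell.

```python
import numpy as np

PITCH_CLASS = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
               "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def midi_number(note):
    """Convert a note name such as 'C#5' to its MIDI number (A4 = 69)."""
    name, octave = note[:-1], int(note[-1])
    return PITCH_CLASS[name] + 12 * (octave + 1)

original = ["B4", "C#5", "A4", "F#5", "D5", "C#5", "B4", "A4"]
target = ["C#5", "A4", "B4", "E4", "A4", "B4", "C#5", "A4"]
# Note boundaries in seconds, as read from the spectrogram
boundaries = [0, .76, 1.29, 1.72, 2.1, 2.64, 3.28, 4.12, 4.8142]

# Interval from each original note to its replacement, in semitones
semitones = np.array([midi_number(new) - midi_number(old)
                      for old, new in zip(original, target)])
factors = 2.0 ** (semitones / 12)

# Hold each factor from the start to the end of its note (step-like envelope)
pairs = []
for start, end, factor in zip(boundaries[:-1], boundaries[1:], factors):
    pairs += [start, factor, end, factor]
freqScaling = np.round(pairs, 4)
print(freqScaling)
```

Repeating each boundary time with two different factors gives an abrupt change at every note transition instead of a glide between notes.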
