# Exercise 8: Sound transformations

In this exercise you will use the HPS model to creatively transform sounds. There are two parts in this exercise. In the first one you should perform a natural sounding transformation on the speech sound that you used in the previous exercise (E7). In the second part you should select a sound of your choice and do a "creative" transformation. You will have to write a short description of the sound and of the transformation you did, giving the link to the original sound and uploading several transformed sounds.

For this exercise, you can use `transformations_GUI.py` (in `software/transformations_interface/`) to experiment; once you have decided on the parameters, fill in the code in this file. You can also do everything from here and add any new code you wish.

In order to perform a good/interesting transformation you should make sure that you have performed an analysis that is adequate for the type of transformation you want to do. Not every HPS analysis representation will work for every type of sound transformation. There will be things in the analysis that, when modified, will result in undesired artifacts. In general, for any transformation, it is best to have the harmonic values as smooth and continuous as possible and a stochastic representation that is as smooth and has as few values as possible. It might be much better to start with an analysis representation that does not give the best reconstruction, in exchange for smoother and more compact data.

To help you with the exercise, we give a brief description of the transformation parameters used by the HPS transformation function:

1. `freqScaling`: frequency scaling factors to be applied to the harmonics of the sound, in time-value pairs. The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The scaling factor is multiplicative, so a value of 1 means no change. Example: to transpose the sound up an octave you can specify `[0, 2, 1, 2]`.
2. `freqStretching`: frequency stretching factors to be applied to the harmonics of the sound, in time-value pairs. The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The stretching factor is a multiplicative factor whose effect depends on the harmonic number, higher harmonics being more affected than lower ones, thus resulting in an inharmonic effect. A value of 1 results in no transformation. Example: an array like `[0, 1.2, 1, 1.2]` will result in a perceptually large inharmonic effect.
3. `timbrePreservation`: 1 preserves the original timbre, 0 does not. It can only have a value of 0 or of 1. By setting the value to 1 the spectral shape of the original sound is preserved even when the frequencies of the sound are modified. In the case of speech it would correspond to the idea of preserving the identity of the speaker after the transformation.
4. `timeScaling`: time scaling factors to be applied to the whole sound, in time-value pairs (value of 1 is no scaling). The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The time scaling factor is a multiplicative factor, thus 1 is no change. Example: to stretch the original sound to twice the original duration, we can specify `[0, 0, 1, 2]`.
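
As a rough illustration of the difference between `freqScaling` and `freqStretching`, the sketch below applies both to a single frame of ideal harmonics. It assumes that stretching multiplies harmonic `k` by `factor ** k`, with `k` the zero-based harmonic index, so the fundamental is left untouched. This is our reading of the behaviour described above, not code copied from `harmonicFreqScaling`, and `scale_and_stretch` is a hypothetical helper name.

```python
import numpy as np

def scale_and_stretch(harm_freqs, scaling, stretching):
    """Scale every harmonic by `scaling`, then multiply harmonic k by stretching**k.

    Hypothetical helper for illustration only; k is the zero-based harmonic index.
    """
    harm_freqs = np.asarray(harm_freqs, dtype=float)
    k = np.arange(harm_freqs.size)
    return harm_freqs * scaling * stretching ** k

f0 = 200.0
harmonics = f0 * np.arange(1, 6)                      # 200, 400, 600, 800, 1000 Hz

octave_up = scale_and_stretch(harmonics, 2.0, 1.0)    # pure transposition, stays harmonic
stretched = scale_and_stretch(harmonics, 1.0, 1.2)    # ratios grow with the harmonic index

print(octave_up)
print(stretched / harmonics)
```

With a stretching factor of 1.2 the fifth harmonic is already more than doubled, which is why even values close to 1 are clearly audible.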

All the transformation values can have as many points as desired, but they have to be in the form of an array with time-value pairs, so of even size. For example a good array for a frequency stretching of a sound that has a duration of 3.146 seconds could be: `[0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]`.
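
To make the time-value format concrete, here is a minimal sketch that checks an array and turns it into one value per analysis frame. It assumes linear interpolation between the points and rescales the times by the last time value, so normalized times and times in seconds behave the same way. `envelope_to_frames` and `n_frames` are hypothetical names, not part of sms-tools.

```python
import numpy as np

def envelope_to_frames(pairs, n_frames):
    """Turn a flat [t0, v0, t1, v1, ...] array into one value per frame.

    Hypothetical helper: times are divided by the last time value, so the
    envelope always spans the whole sound.
    """
    pairs = np.asarray(pairs, dtype=float)
    if pairs.size % 2 != 0:
        raise ValueError("expected an even number of entries (time-value pairs)")
    times, values = pairs[::2], pairs[1::2]
    if np.any(np.diff(times) < 0) or times[-1] <= 0:
        raise ValueError("times must be non-decreasing and end at a positive value")
    frame_pos = np.arange(n_frames) / max(n_frames - 1, 1)   # positions from 0 to 1
    return np.interp(frame_pos, times / times[-1], values)

freq_stretching = [0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]
env = envelope_to_frames(freq_stretching, n_frames=11)
print(np.round(env, 3))
```

The envelope stays at 1.2 for the first part of the sound, ramps down between 2.01 s and 2.679 s, and holds 0.7 until the end.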

## Part 1. Perform natural sounding transformations of a speech sound

Use the HPS model with the sound `speech-female.wav`, available in the sounds directory, to first analyze and then obtain a natural sounding transformation of the sound. The synthesized sound should sound as different as possible from the original sound while still sounding natural. By natural we mean that it should sound like speech: it could have been produced by a human, and on listening we would consider it a speech sound, even though we might not be able to understand it. You should first make sure that you start from a good analysis; then you can do time and/or frequency scaling transformations. The transformation should be done in a single pass, with no mixing of sounds coming from different transformations. Since you used the same sound in E7, use that experience to get a good analysis, but consider that the analysis, given that we now want to use it for applying a very strong transformation, might need to be done differently from what you did in E7.

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import get_window
import sys, os
sys.path.append('../software/models/')
sys.path.append('../software/transformations/')
import utilFunctions as UF
import stft as STFT
import hpsModel as HPS
import hpsTransformations as HPST
import harmonicTransformations as HT
import IPython.display as ipd

In [80]:
# 1.1 perform an analysis/synthesis using the HPS model

input_file = '../sounds/speech-female.wav'

### set the parameters
window = 'blackman'
M = 2300
N = 4096
t = -90
minSineDur = .1
nH = 100
minf0 = 100
maxf0 = 260
f0et = 7
harmDevSlope = 0.01
stocf = .03

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

### explain your parameter choices
"""
Observing the STFT, we estimate that the fundamental frequency lies between 120 and 150 Hz, so we set
minf0 = 100 and maxf0 = 260, a range that contains it with some margin. We use a Blackman window; to
resolve the harmonics at f0 = 120 Hz the window size should be at least 6*44100/120 = 2205 samples, so
we take a window size of 2300 and an FFT size of 4096. The magnitude threshold is set to -90 dB.
From observing the harmonic model, minSineDur is set to 0.1 s and nH = 100 harmonics are modelled.
The harmDevSlope parameter controls how much more the higher harmonics are allowed to deviate than
the lower ones; we set it to a small value greater than 0, namely 0.01.
The stocf factor sets how strongly the residual spectrum is decimated, that is, what fraction of the
samples of the residual magnitude spectrum is kept as the stochastic envelope. To get the most compact
representation we take stocf = 0.03, which decimates the residual very heavily; the output sounds
similar to a voice heard over a telephone.

"""

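
The window-size rule used in the explanation above (main-lobe width in bins times `fs / f0`) can be written as a small helper. This is a sketch: `min_window_size` and `MAIN_LOBE_BINS` are hypothetical names, and the table holds the usual main-lobe widths of these windows.

```python
import math

# Main-lobe width, in bins, of some common analysis windows.
MAIN_LOBE_BINS = {"rectangular": 2, "hann": 4, "hamming": 4, "blackman": 6, "blackmanharris": 8}

def min_window_size(window, fs, f0_min):
    """Smallest window length (in samples) that separates harmonics of f0_min."""
    return math.ceil(MAIN_LOBE_BINS[window] * fs / f0_min)

print(min_window_size("blackman", 44100, 120))   # 2205
print(min_window_size("blackman", 44100, 150))   # 1764
```

With `minf0 = 100`, as set in the code, the same rule would ask for 2646 samples, so `M = 2300` only resolves the harmonics while the fundamental stays above roughly 115 Hz.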

In [82]:
# 1.2 Perform a transformation from the previous analysis

### define the transformations
freqScaling = np.array([0, 1.2, 0.2, 1.7, 0.25, 1.2, 1, 1.2])
freqStretching = np.array([0, 1, 0.2, 1.01, 0.25, 1, 1, 1])
timbrePreservation = 1
timeScaling = np.array([0, 0, 0.1, 0.1, 0.2, 0.35, 0.5, 0.75, 0.55, 1.0, 1, 1.5])


# no need to modify the following code 
Ns = 512
H = 128

# frequency scaling of the harmonics 
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

### explain your transformations
"""
freqScaling - We scale the whole sound by a factor of 1.2 (the value at the beginning and at the end).
On top of that, the factor rises to 1.7 at normalised time 0.2, where the first vowel (second word) is
pronounced, and returns to 1.2 at time 0.25, so that only that vowel is raised further.

freqStretching - The effect of stretching depends on the harmonic number: the higher the harmonic, the
more it is affected. This produces an inharmonic effect that makes the sound less speech-like, so we
apply very little of it: the factor is 1 everywhere except around the first vowel, where it peaks at
1.01 at time 0.2.

timbrePreservation - We set this value to 1 to keep the speech as natural sounding as possible.

timeScaling - Each pair gives an input time and the output time it is mapped to. The pairs (0, 0) and
(0.1, 0.1) leave the first tenth of the sound unscaled. The pair (0.2, 0.35) makes time 0.2 of the
input coincide with time 0.35 of the output, so the vowel pronounced there is slowed down. The pairs
(0.5, 0.75) and (0.55, 1.0) slow down another vowel in the same way. The last pair (1, 1.5) makes the
output 1.5 times as long as the input, so that the ending does not sound sped up after the two
slowed-down intervals.
"""

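
To see how much each part of the sound is slowed down, one can compute the local stretch ratio of every segment of the `timeScaling` array. The sketch below reads each pair as (input time, output time), as the explanation above does; `segment_stretch` is a hypothetical helper and not part of sms-tools.

```python
import numpy as np

def segment_stretch(time_scaling):
    """Output duration divided by input duration for each segment between pairs."""
    pairs = np.asarray(time_scaling, dtype=float).reshape(-1, 2)
    t_in, t_out = pairs[:, 0], pairs[:, 1]
    return np.diff(t_out) / np.diff(t_in)

time_scaling = [0, 0, 0.1, 0.1, 0.2, 0.35, 0.5, 0.75, 0.55, 1.0, 1, 1.5]
ratios = segment_stretch(time_scaling)

# Print the ratio next to the input interval it applies to.
starts, ends = time_scaling[0:-2:2], time_scaling[2::2]
for start, end, ratio in zip(starts, ends, ratios):
    print(f"input {start:.2f} to {end:.2f}: stretched by {ratio:.2f}")
```

A ratio of 1 means no change and values above 1 mean slowing down; the segments around input times 0.1 to 0.2 and 0.5 to 0.55 get the strongest stretch, which matches the two vowels described above.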

### Explain Part 1

The input is a speech sound, so to keep it natural the transformations have to resemble
variations that occur in natural speech.
We set the stochastic decimation factor to 0.03, which gives a telephone-like reconstruction.
We also slowed the sound down around two specific vowels
and raised the pitch of one vowel.
Both transformations occur naturally in speech; for example, the rising intonation of a
question can be represented as a frequency scaling transformation.
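
The effect of a small `stocf` can be sketched on a single frame. The example assumes that the stochastic envelope is obtained by resampling the residual magnitude spectrum down to `int(stocf * n_bins)` points, and it uses a random stand-in spectrum; `n_bins` and the data are hypothetical, so treat the numbers as illustrative only.

```python
import numpy as np
from scipy.signal import resample

stocf = 0.03
n_bins = 129                                  # hypothetical number of bins in one residual frame

rng = np.random.default_rng(0)
residual_db = -60 + 5 * rng.standard_normal(n_bins)   # stand-in residual magnitude spectrum (dB)

n_coeffs = int(stocf * n_bins)                # number of envelope points kept per frame
envelope = resample(residual_db, n_coeffs)    # coarse envelope, the compact representation
rebuilt = resample(envelope, n_bins)          # smooth envelope recovered for synthesis

print(n_coeffs, envelope.shape, rebuilt.shape)
```

With only a handful of points per frame, all fine spectral detail of the residual is lost, which is consistent with the telephone-like quality described above.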

## Part 2. Perform creative transformations with a sound of your choice

Pick any natural and harmonic sound from Freesound and use the HPS model to do the most creative and interesting transformation you can come up with, sounding as different as possible from the original sound.

It is essential that you start with a natural harmonic sound. Examples include, but are not limited to, any acoustic harmonic instrument, speech, or harmonic sounds from nature. As long as the sound has a harmonic structure, you can use it. You can even reuse the sound you used in E7 Part 2, or upload your own sound to Freesound and then use it.

The sound from Freesound can be in any format, but to use the sms-tools software you will first have to convert it to a monophonic file (one channel) with a sampling rate of 44100 Hz and 16-bit samples.
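
One possible way to do this conversion in Python is sketched below. It assumes the audio has already been loaded as floating-point samples in the range -1 to 1 with shape `(n_samples, n_channels)`; `to_mono_44100_int16` is a hypothetical helper, and the demo signal is synthetic.

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

def to_mono_44100_int16(x, fs, target_fs=44100):
    """Downmix to one channel, resample to target_fs, and quantize to 16 bits."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 2:
        x = x.mean(axis=1)                    # average the channels
    if fs != target_fs:
        ratio = Fraction(target_fs, fs)       # e.g. 147/160 for 48000 -> 44100
        x = resample_poly(x, ratio.numerator, ratio.denominator)
    x = np.clip(x, -1.0, 1.0)                 # avoid integer overflow
    return target_fs, np.round(x * 32767).astype(np.int16)

# Demo: one second of a 440 Hz stereo tone sampled at 48 kHz.
fs_in = 48000
t = np.arange(fs_in) / fs_in
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
stereo = np.column_stack([tone, tone])

fs_out, y = to_mono_44100_int16(stereo, fs_in)
print(fs_out, y.dtype, y.shape)
```

The result can then be saved, for example with `scipy.io.wavfile.write`, and read back with `UF.wavread`. An audio editor or a command-line converter works just as well.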

You can do any interesting transformation with a single pass. It is not allowed to mix sounds obtained from different transformations. The transformed sound need not sound natural. So, time to show some creativity!

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [86]:
# 2.1 perform an analysis/synthesis using the HPS model

### set the parameters
input_file = 'guitar_small.wav'
window = 'blackman'
M = 1800
N = 4096
t = -90
minSineDur = 0.01
nH = 50
minf0 = 150
maxf0 = 600
f0et = 10
harmDevSlope = 0.01
stocf = 0.03

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

### explain your parameter choices
"""
Link to the sound: https://freesound.org/people/polotenchiko/sounds/329336/

From the STFT of this sound, we observe that the fundamental frequency ranges from 150 Hz in the
penultimate note (the lowest note) to 600 Hz in the first note (the highest note), so minf0 and maxf0
are set to those values. The minimum window size for the Blackman window used here is
6*44100/150 = 1764 samples; we use a window size of 1800, an FFT size of 4096, and a magnitude
threshold of -90 dB.
Since the notes in this phrase are played quickly and the higher partials do not last long,
minSineDur is set to 0.01 s, and nH = 50 harmonics are considered.
harmDevSlope is set to a low value of 0.01: a guitar phrase is very close to a perfectly harmonic
sound, so we do not expect the deviation to grow much for the higher harmonics.
In the harmonic plus residual model of the sound obtained with these parameters, the residual
contains some tonal features, so it is not purely stochastic, and lies mostly in the high-frequency
range. We can therefore get away with a small decimation factor: stocf = 0.03 gives a sound
reasonably close to the original and a very compact representation.


"""

In [88]:
# 2.2 Perform a transformation from the previous analysis

### define the transformations
freqScaling = np.array([0, 1, 0.1, 4, 0.3, 0.5, 0.5, 0.2, 0.6, 1, 1, 1])
freqStretching = np.array([0, 1, 0.3, 1, 0.4, 5, 0.5, 0.2, 0.7, 1, 0.8, 1.2, 1, 1])
timbrePreservation = 0
timeScaling = np.array([0, 0, 0.2, 0.4, 0.3, 0.45, 1, 1])


# no need to modify anything after this
Ns = 512
H = 128

# frequency scaling of the harmonics 
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

### explain your transformations
"""
freqScaling - The scaling factor is 1 at the beginning and at the end of the audio, i.e. no scaling is
applied there. In between, the factor is 4 at time 0.1, 0.5 at time 0.3, 0.2 at time 0.5, and back to 1
at time 0.6.

freqStretching - This stretches the harmonics unevenly and makes the audio sound more unnatural. The
factor ramps up from 1 (no stretching) at time 0.3 to 5 at time 0.4, drops to 0.2 at time 0.5, returns
to 1 at time 0.7, rises to 1.2 at time 0.8, and is back to 1 at time 1.

timbrePreservation - Since we do not care about preserving the timbre of the original sound, we set
this factor to 0 instead of 1, which helps produce a very unnatural sound.

timeScaling - The first pair (0, 0) aligns the start of both sounds. Time 0.2 of the input corresponds
to normalised time 0.4 of the output, so the sound is slowed down between 0 and 0.2 of the original.
Time 0.3 of the input corresponds to 0.45 of the output, so the sound is sped up between 0.2 and 0.3
of the original. The last pair (1, 1) means that both sounds have the same duration.

"""
