# Progress Report

# Our setup

We launched a GPU-enabled server on AWS (Amazon) based on ami-b36981d8. We only needed to update torch and some dependencies on the virtual machine and clone neural-style from GitHub.

We use an STFT/ISTFT to create the style and content spectrograms, send them to AWS over ssh2/scp (Ganymed for Matlab), run the neural-style algorithm implemented by @jcjohnson ([github](https://github.com/jcjohnson/neural-style)), which follows _A. Gatys "A Neural Algorithm of Artistic Style"_, and return the transferred image to the local PC for reconstruction of the wav.
This allows us to check several algorithm parameters and wav inputs quickly.
Our code is available at [github](https://github.com/boozebrewer/deepmix); the "full-loop" setup is in matlab/full_loop.m.




# Graphical style transfer demo
## Algorithm parameters review
<pre>
  -style_blend_weights [nil]
  -image_size          Maximum height / width of generated image [512]
  -content_weight      [5]
  -style_weight        [100]
  -tv_weight           [0.001]
  -num_iterations      [1000]
  -normalize_gradients [false]
  -init                random|image [random]
  -optimizer           lbfgs|adam [lbfgs]
  -learning_rate       [10]
  -style_scale         [1]
  -original_colors     [0]
  -pooling             max|avg [max]
  -proto_file          [models/VGG_ILSVRC_19_layers_deploy.prototxt]
  -model_file          [models/VGG_ILSVRC_19_layers.caffemodel]
  -content_layers      layers for content [relu4_2]
  -style_layers        layers for style [relu1_1,relu2_1,relu3_1,relu4_1,relu5_1]
</pre>
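
As a rough illustration of how these options map onto a single run, the sketch below assembles a `neural_style.lua` command line in Python. The flags are the ones listed above plus `-content_image`, `-style_image`, `-output_image` and `-gpu` from the neural-style README. The helper name `neural_style_command` and the file names are hypothetical, and this is not the Matlab code used in our full loop.

```python
import shlex


def neural_style_command(content_png, style_png, output_png,
                         style_weight=100, content_weight=5,
                         num_iterations=1000, init="random",
                         image_size=512, gpu=0):
    """Build the argument list for one neural-style run (hypothetical helper)."""
    if init not in ("random", "image"):
        raise ValueError("init must be 'random' or 'image'")
    return [
        "th", "neural_style.lua",
        "-content_image", content_png,
        "-style_image", style_png,
        "-output_image", output_png,
        "-image_size", str(image_size),
        "-content_weight", str(content_weight),
        "-style_weight", str(style_weight),
        "-num_iterations", str(num_iterations),
        "-init", init,
        "-gpu", str(gpu),
    ]


# Example with made-up file names and settings similar to the audio experiments.
cmd = neural_style_command("content.png", "style.png", "out.png",
                           style_weight=2000, num_iterations=250, init="image")
print(shlex.join(cmd))
```

Returning a list instead of a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.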

## Style transfer - random init

<img src="../jpgs/brad-self-portrait/brad_series_1.png">

## Style transfer - content init

<img src="../jpgs/brad-image_init/brad_init_image_series_1.png">

<img src="../jpgs/gab/gab_series_1.png">

# Audio Signal Processing

* For the STFT we used an 8 kHz sampling rate with 512 (positive) frequency bins.
* For the 2-second signals we used a 32 ms frame with a 4 ms skip. For audio segments of up to 2 seconds this gives no more than 512 time frames, so the largest image is 512x512. We used an ISTFT with matched window parameters for perfect reconstruction.

* For the 6-second signals we used a 32 ms frame with a 16 ms skip to stay within the 512x512 image limit.

* To form the spectrograms we take the log of the STFT magnitude and add an offset so that the image has only positive values. The spectrograms are then saved as 16-bit grayscale PNGs. In reconstruction we remove this offset and, at the end, normalize to the content's gain.

* When reconstructing the audio from the spectrogram we use the original phase of the 'content' audio. This phase contributes little, however: we still get fairly good reconstruction when the original phase is replaced with random noise.

* We used images of at most 512x512 so that the algorithm runs fast enough and does not run into memory problems.
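
The frame-count arithmetic and the log/offset encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame count assumes no padding at the signal edges, and the `offset`, `scale` and `eps` values are hypothetical placeholders, not the constants used in our Matlab code.

```python
import cmath
import math

FRAME_MS = 32      # frame length from the report
MAX_FRAMES = 512   # image width limit from the report


def frame_count(duration_s, hop_ms, frame_ms=FRAME_MS):
    """Number of full frames in the signal, assuming no edge padding."""
    total_ms = int(round(duration_s * 1000))
    if total_ms < frame_ms:
        return 0
    return 1 + (total_ms - frame_ms) // hop_ms


def encode_pixel(magnitude, offset=14.0, scale=2000.0, eps=1e-6):
    """Map an STFT magnitude to a 16-bit gray level: log, offset, quantize."""
    value = math.log(magnitude + eps) + offset
    return max(0, min(65535, round(value * scale)))


def decode_bin(pixel, phase, offset=14.0, scale=2000.0, eps=1e-6):
    """Undo the encoding and reattach a phase to get a complex STFT bin."""
    magnitude = max(math.exp(pixel / scale - offset) - eps, 0.0)
    return cmath.rect(magnitude, phase)


# A 4 ms skip fits a 2 s clip in 512 frames but not a 6 s clip,
# which is why the 6 s clips use a 16 ms skip.
for duration_s, hop_ms in [(2, 4), (6, 4), (6, 16)]:
    n = frame_count(duration_s, hop_ms)
    print(duration_s, "s,", hop_ms, "ms skip:", n, "frames, fits:", n <= MAX_FRAMES)

# Round trip of one complex bin using its original phase.
z = 0.5 + 0.5j
z_rec = decode_bin(encode_pixel(abs(z)), cmath.phase(z))
print(abs(z_rec - z))
```

Passing a random phase to `decode_bin` instead of `cmath.phase(z)` corresponds to the random-phase reconstruction mentioned above.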

# Audio style transfer experiments
We tried style transfer between several contents and several styles, using the following inputs in various configurations:

* voice male (3 words)
* voice female (1 word)
* chirp (from 100Hz to 3000Hz)
* dtmf tones (12 different tones)
* noise
* chromatic scale 261.6Hz (C4) to 1046.5Hz (C6) (pure sine and fm modulated)
* random notes 261.6Hz (C4) to 1046.5Hz (C6) (pure sine)

To generate the last two (notes) we used "MIDI file tools for MATLAB" from this [web-site](http://kenschutte.com/midi). With this Matlab package we were able to control the timing of the generated MIDI notes and convert them to audio using different synthesis methods (e.g. fm-modulation, saw-wave, pure sine).
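
For reference, the note range quoted above follows from the standard equal-temperament mapping from MIDI note numbers to frequency, with A4 = 440 Hz at MIDI note 69. The Python sketch below is only an illustration of that mapping and of a pure-sine tone at our 8 kHz sampling rate. It is not the Matlab MIDI toolbox, and the function names are hypothetical.

```python
import math


def midi_to_hz(note, a4_hz=440.0):
    """Equal-temperament frequency of a MIDI note number (A4 = note 69)."""
    return a4_hz * 2 ** ((note - 69) / 12)


def chromatic_scale(low=60, high=84):
    """Frequencies of every semitone from C4 (60) to C6 (84), inclusive."""
    return [midi_to_hz(n) for n in range(low, high + 1)]


def sine_tone(freq_hz, duration_s, sample_rate=8000):
    """Pure-sine samples in [-1, 1] for a single note."""
    n_samples = int(round(duration_s * sample_rate))
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


scale = chromatic_scale()
print(len(scale), round(scale[0], 1), round(scale[-1], 1))
```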

**We tweaked neural-transfer parameters:**
* weight - controls how much weight is given to the "style" picture
* init   - controls the starting image of the optimization (random noise or the "content" image)
* iter   - number of iterations of the optimization step
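
A parameter sweep over these three settings can be sketched as below. The file-name pattern is inferred from the result files linked later in this report. The helper `result_name` and the particular grid of values are hypothetical and only illustrate how one run per combination could be labelled.

```python
from itertools import product


def _cap(name):
    """Upper-case only the first character, as in the result file names."""
    return name[:1].upper() + name[1:]


def result_name(style, content, weight, iters, init, imsize=513):
    """Label for one run, following the pattern seen in our result files."""
    init_tag = {"image": "InitImage", "random": "InitRandom"}[init]
    return (f"style_{_cap(style)}VsContent_{_cap(content)}"
            f"Params__Weight{weight}_Imsize{imsize}_Iters{iters}_"
            f"{init_tag}_Original_color0")


# Hypothetical grid: every combination of weight, init and iteration count.
weights = (500, 1000, 2000)
inits = ("image", "random")
iterations = (100, 250)

runs = [result_name("scale_fm_6sec", "rand_sine_6sec", w, n, i)
        for w, i, n in product(weights, inits, iterations)]
print(len(runs))
print(runs[0])
```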

## Audio Bank

In [80]:
from IPython.display import Audio

In [81]:
Audio(url="audio/ben_holech.wav") # voice male


In [82]:
Audio(url="audio/matlab.wav") # voice female (1 word)

In [83]:
Audio(url="audio/chirp_sq_na_100_3000.wav") # chirp (from 100Hz to 3000Hz)

In [84]:
Audio(url="audio/dtmf.wav") # dtmf tones (12 different tones)

In [85]:
Audio(url="audio/noise.wav") # noise

In [86]:
Audio(url="audio/scale_sin_6sec.wav") # chromatic scale C4 to C6 (pure sine)

In [87]:
Audio(url="audio/scale_fm_6sec.wav") # chromatic scale C4 to C6 (fm modulated)

In [88]:
Audio(url="audio/rand_sine_6sec.wav") # random notes C4 to C6 (pure sine)

In [89]:
Audio(url="audio/bach_6sec.wav") # 6 seconds from Bach's English Suite No. 2, played by Glenn Gould

## Perfect reconstruction check (identity)
We want to verify that, without applying the network, we get perfect reconstruction.

Spectral domain:
<img src="../jpgs/style_Scale_fm_6sec_perfect_reconstruction.fig.jpg">

Time domain:
<img src="../jpgs/style_Scale_fm_6sec_perfect_reconstruction_time.jpg">
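
Besides comparing the plots, the identity check can be quantified with a reconstruction signal-to-noise ratio. The sketch below is a minimal Python illustration. The function name is hypothetical and the sample lists stand in for the original and reconstructed wav samples.

```python
import math


def reconstruction_snr_db(original, reconstructed):
    """SNR in dB between a signal and its reconstruction (inf if identical)."""
    if len(original) != len(reconstructed):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in original)
    noise_power = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    if noise_power == 0:
        return math.inf
    if signal_power == 0:
        return -math.inf
    return 10 * math.log10(signal_power / noise_power)


# Toy example: a reconstruction with a 10% amplitude error.
original = [1.0, 0.0, -1.0, 0.0]
reconstructed = [0.9, 0.0, -0.9, 0.0]
print(reconstruction_snr_db(original, reconstructed))
```

A very high SNR is what we would expect from a perfect-reconstruction STFT/ISTFT pair. A low value would point to a mismatch in the window parameters or in the log/offset handling.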

In [90]:
### Original content
Audio(url="audio/scale_fm_6sec.wav")

In [91]:
### Perfect reconstruction
Audio(url="../wavs/style_Scale_fm_6sec_perfect_reconstruction.wav")

# Selected results

Generally, the results we got hint at an interesting direction.

Here are examples of interesting transfers we have seen: from an fm-scale style to sine random notes, and from a chirp sound to male voice content.

Notice that the actual spectrograms are gray-level; here we use a colormap only for better visibility.

### Random notes pure sine in style of fm modulation

We took random notes from 261.6Hz (C4) to 1046.5Hz (C6) synthesized with a pure sine (a single harmonic) and tried transferring the acoustic style of an fm-modulated sound from a chromatic scale in the same range.

It is interesting to note that the style transfer was able to weakly mimic the harmonic structure of the fm-modulated signal and to transfer it as a suitable second harmonic of the pure sine (the 2nd harmonic is the most prominent, but it is not at twice the base frequency).

<img src="../jpgs/style_Scale_fm_6secVsContent_Rand_sine_6secParams__Weight2000_Imsize513_Iters250_InitImage_Original_color0.fig.jpg">




In [92]:
### Style
Audio(url="audio/scale_fm_6sec.wav") # chromatic scale C4 to C6 (fm modulated)

In [93]:
### Content
Audio(url="audio/rand_sine_6sec.wav") # random notes C4 to C6 (pure sine)

In [94]:
### Transferred
Audio(url="audio/res/style_Scale_fm_6secVsContent_Rand_sine_6secParams__Weight2000_Imsize513_Iters250_InitImage_Original_color0_time.fig.wav")

### Piano (Bach) in style of FM-synthesizer


<img src="../jpgs/style_Scale_fm_6secVsContent_Bach_6secParams__Weight500_Imsize513_Iters250_InitImage_Original_color0.fig.jpg">




In [95]:
### Style
Audio(url="audio/scale_fm_6sec.wav") # chromatic scale C4 to C6 (fm modulated)

In [96]:
### content
Audio(url="audio/bach_6sec.wav")

In [97]:
### Transferred
Audio(url="../wavs/style_Scale_fm_6secVsContent_Bach_6secParams__Weight500_Imsize513_Iters250_InitImage_Original_color0_time.fig.wav")


### Voice in style of Chirp

<img src="../jpgs/style_Chirp_sq_na_100_3000VsContent_Ben_holechParams__Weight1000_Imsize513_Iters500_InitImage_Original_color0.fig.jpg">

Below are the original audio clips and the transferred one.

In [98]:
### Style
Audio(url="audio/chirp_sq_na_100_3000.wav")

In [99]:
### content
Audio(url="audio/ben_holech.wav")

In [100]:
### transferred
Audio(url="audio/res/style_Chirp_sq_na_100_3000VsContent_Ben_holechParams__Weight1000_Imsize513_Iters500_InitImage_Original_color0_time.fig.wav")

# Style and content the same image
## Initialization from content

<img src="../jpgs/style_Scale_fm_6secVsContent_Scale_fm_6secParams__Weight2000_Imsize513_Iters250_InitImage_Original_color0.fig.jpg">
The results are almost identical, in the time domain as well.

<img src="../jpgs/style_Scale_fm_6secVsContent_Scale_fm_6secParams__Weight2000_Imsize513_Iters250_InitImage_Original_color0_time.fig.jpg">

## Initialization from random high style weight
<img src="../jpgs/style_Scale_fm_6secVsContent_Scale_fm_6secParams__Weight2000_Imsize513_Iters250_InitRandom_Original_color0.fig.jpg">
We can see the content is almost not present, maybe due to the high style weight.

In [101]:
### Output of random initialization high style weight
Audio(url="../wavs/style_Scale_fm_6secVsContent_Scale_fm_6secParams__Weight2000_Imsize513_Iters250_InitRandom_Original_color0_time.fig.wav")

## Initialization from random low style weight
<img src="../jpgs/style_Scale_fm_6secVsContent_Scale_fm_6secParams__Weight500_Imsize513_Iters250_InitRandom_Original_color0.fig.jpg">
We can see the content is present when we lower the style weight.

In [102]:
### Output of random initialization low style weight
Audio(url="../wavs/style_Scale_fm_6secVsContent_Scale_fm_6secParams__Weight500_Imsize513_Iters250_InitRandom_Original_color0_time.fig.wav")

# Other interesting results gallery
<img src="../jpgs/style_DtmfVsContent_Chirp_sq_na_100_3000Params__Weight2000_Imsize513_Iters250_InitImage_Original_color0.fig.jpg">


In [103]:
### Transferred
Audio(url="../wavs/style_DtmfVsContent_Chirp_sq_na_100_3000Params__Weight2000_Imsize513_Iters250_InitImage_Original_color0_time.fig.wav")

<img src="../jpgs/style_DtmfVsContent_Ben_holechParams__Weight10000_Imsize513_Iters200_InitImage_Original_color0.fig.jpg">


In [104]:
### Transferred
Audio(url="../wavs/style_DtmfVsContent_Ben_holechParams__Weight10000_Imsize513_Iters200_InitImage_Original_color0_time.fig.wav")

<img src="../jpgs/style_DtmfVsContent_Ben_holechParams__Weight1000_Imsize513_Iters100_InitImage_Original_color0.fig.jpg">


In [105]:
### Transferred
Audio(url="../wavs/style_DtmfVsContent_Ben_holechParams__Weight1000_Imsize513_Iters100_InitImage_Original_color0_time.fig.wav")

<img src="../jpgs/style_ChirpVsContent_Ben_holechParams__Weight10000_Imsize513_Iters100_InitImage_Original_color0.fig.jpg">



In [106]:
### Transferred
Audio(url="../wavs/style_ChirpVsContent_Ben_holechParams__Weight10000_Imsize513_Iters100_InitImage_Original_color0_time.fig.wav")

<img src="../jpgs/style_Chirp_sq_na_100_3000VsContent_NoiseParams__Weight1000_Imsize513_Iters100_InitImage_Original_color0.fig.jpg">


In [107]:
### Transferred
Audio(url="../wavs/style_Chirp_sq_na_100_3000VsContent_NoiseParams__Weight1000_Imsize513_Iters100_InitImage_Original_color0_time.fig.wav")

<img src="../jpgs/style_Chirp_sq_na_100_3000VsContent_DtmfParams__Weight2000_Imsize513_Iters250_InitImage_Original_color0.fig.jpg">


In [108]:
### Transferred
Audio(url="../wavs/style_Chirp_sq_na_100_3000VsContent_DtmfParams__Weight2000_Imsize513_Iters250_InitImage_Original_color0_time.fig.wav")

<img src="../jpgs/style_Ben_holechVsContent_DtmfParams__Weight10000_Imsize513_Iters200_InitImage_Original_color0.fig.jpg">


In [109]:
### Transferred
Audio(url="../wavs/style_Ben_holechVsContent_DtmfParams__Weight10000_Imsize513_Iters200_InitImage_Original_color0_time.fig.wav")

<img src="../jpgs/style_Ben_holechVsContent_NoiseParams__Weight10000_Imsize513_Iters200_InitImage_Original_color0.fig.jpg">

In [110]:
### Transferred
Audio(url="../wavs/style_Ben_holechVsContent_NoiseParams__Weight10000_Imsize513_Iters200_InitImage_Original_color0_time.fig.wav")

# Bach Jackson style transfer

<img src='../jpgs/bach-jackson.png'>