# Changing Music Genre Using Neural Style Transfer


## Introduction

Music is an important part of many people's lives. It can lift you out of the darkest pits, and it can also make you cry your hardest. Artificially generated music has been popular for many years now, and progress in the field is constant. The objective of this project was to try to change the genre of a song to another genre.

It was through Neural Style Transfer that I discovered this concept and decided to try it myself.

### What is Neural Style Transfer? 

Neural Style Transfer (NST) is a method of creating a new image with the content of one image and the style of another. Essentially, the "content" image is re-rendered in the style of another image, the "style" image. NST uses a deep CNN (Convolutional Neural Network) model to do this. The CNN extracts the main features of both images, and these features are used to create the new image.
Normally, a white noise image is used as the starting point for the new image. The loss between the generated image and the content image and the loss between the generated image and the style image are then minimized through backpropagation, so that both images are preserved as much as possible without one overpowering the other in the final image.
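
Written out, the objective described above has the following general form. The symbols are illustrative notation rather than names from the code: $x$ is the generated image, $c$ and $s$ are the content and style images, $F^l(\cdot)$ is the matrix of feature maps at layer $l$ (one row per map), and $\alpha$ and $\beta$ are weights that balance the two terms. Normalization constants are omitted for brevity.

$$
L_{total}(x) = \alpha \, L_{content}(x, c) + \beta \, L_{style}(x, s)
$$

$$
L_{content}(x, c) = \sum_{l} \lVert F^l(x) - F^l(c) \rVert^2, \qquad
L_{style}(x, s) = \sum_{l} \lVert G^l(x) - G^l(s) \rVert^2, \qquad
G^l(\cdot) = F^l(\cdot) \, F^l(\cdot)^{\top}
$$

Backpropagation computes the gradient of $L_{total}$ with respect to the values of $x$, and the optimizer updates $x$ itself while the CNN weights stay fixed.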

Based on my research into this topic, it seemed better to start from the content image instead of a white noise image. The content loss then starts at 0, so only the style loss needs to be minimized rather than both the style and the content loss.

Applying these concepts to music, the images fed to the model are the spectrograms of audio files, specifically mel spectrograms. A spectrogram is a 2-dimensional representation of a sound file, built from multiple STFTs (short-time Fourier transforms) taken over small intervals of the song. Each Fourier transform expresses the loudness/amplitude of the various frequencies over its interval; put another way, a spectrogram reveals which frequencies are present in a file at each moment. A mel spectrogram is a spectrogram whose frequency axis is rescaled so that equal steps in pitch sound equally different to the listener, which takes into account most humans' inability to differentiate sounds with similar frequencies. (The code in this notebook computes a log-magnitude STFT spectrogram rather than a mel spectrogram.) By performing NST on the spectrograms and then converting the resulting spectrograms back into wav files, we generate new audio that we can listen to.
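
To make the idea concrete, here is a minimal NumPy sketch of a log-magnitude spectrogram. It assumes a Hann window, a frame size of 2048 samples, and a hop of 512 samples. The function names `log_magnitude_stft` and `hz_to_mel` are hypothetical helpers for illustration, and the mel formula shown is the HTK-style convention, which is only one of several in use. Libraries such as librosa handle framing, padding, and mel filter banks in more detail.

```python
import numpy as np

def log_magnitude_stft(signal, n_fft=2048, hop=512):
    """Return log(1 + |STFT|) with shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Cut the signal into overlapping frames and taper each one with the window
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    # One Fourier transform per frame; keep only the magnitude
    return np.log1p(np.abs(np.fft.rfft(frames, axis=0)))

def hz_to_mel(freq_hz):
    """HTK-style mel formula (an assumed convention, others exist)."""
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

sr = 22050
t = np.arange(sr) / sr                    # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)      # a pure 440 Hz tone
spec = log_magnitude_stft(tone)

# The loudest frequency bin should sit next to 440 Hz
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 2048
print(spec.shape, peak_hz, hz_to_mel(1000.0))
```

Each column of `spec` is one short-time Fourier transform, and each row is one frequency bin, which is why a spectrogram can be treated like an image.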


### Implementation

First, we import all of the libraries and modules needed for the Neural Style Transfer. The dataset used was the [GTZAN Dataset - Music Genre Classification](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification). It holds 100 30-second-long files for each of ten genres (blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock), for a total of 1000 song files. One of the files, 'jazz.00054.wav', had to be removed from the dataset because it kept causing errors. Below is a snippet of the .csv that holds information about the sound files. The only information relevant to this particular project, though, was the filenames.

In [None]:
#@title Import and Install Statements
!pip install opendatasets
!pip install pydub
!pip install torchviz

from torch.autograd import Variable
from torchviz import make_dot
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim


import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import numpy as np # we always love numpy
import time

import pandas as pd
import seaborn as sns

import os, json, math, librosa
import opendatasets as od

import IPython.display as ipd
import librosa.display

import tensorflow as tf
import tensorflow.keras as keras

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D

import sklearn.model_selection as sk

from sklearn.model_selection import train_test_split

from keras import layers
from keras.layers import (Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, 
                          Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D)
from keras.models import Model, load_model
from keras.preprocessing import image
from keras.utils import layer_utils
import pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils.vis_utils import plot_model
from tensorflow.keras.optimizers import Adam
from keras.initializers import glorot_uniform
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from tensorflow.keras.layers import Dense, Dropout

# Remaining imports that are not already covered above
from sys import argv
import copy
import soundfile as sf

In [None]:
#@title Read Pandas Dataframe (removed jazz.00054.wav)
df = pd.read_csv("/content/drive/MyDrive/kaggle/gtzan-dataset-music-genre-classification/Data/features_30_sec.csv")
# Drop row 554, the entry for jazz.00054.wav, which kept causing errors
df = df.drop([554])
filenames = df["filename"]
df.head()

Unnamed: 0,filename,length,chroma_stft_mean,chroma_stft_var,rms_mean,rms_var,spectral_centroid_mean,spectral_centroid_var,spectral_bandwidth_mean,spectral_bandwidth_var,rolloff_mean,rolloff_var,zero_crossing_rate_mean,zero_crossing_rate_var,harmony_mean,harmony_var,perceptr_mean,perceptr_var,tempo,mfcc1_mean,mfcc1_var,mfcc2_mean,mfcc2_var,mfcc3_mean,mfcc3_var,mfcc4_mean,mfcc4_var,mfcc5_mean,mfcc5_var,mfcc6_mean,mfcc6_var,mfcc7_mean,mfcc7_var,mfcc8_mean,mfcc8_var,mfcc9_mean,mfcc9_var,mfcc10_mean,mfcc10_var,mfcc11_mean,mfcc11_var,mfcc12_mean,mfcc12_var,mfcc13_mean,mfcc13_var,mfcc14_mean,mfcc14_var,mfcc15_mean,mfcc15_var,mfcc16_mean,mfcc16_var,mfcc17_mean,mfcc17_var,mfcc18_mean,mfcc18_var,mfcc19_mean,mfcc19_var,mfcc20_mean,mfcc20_var,label
0,blues.00000.wav,661794,0.350088,0.088757,0.130228,0.002827,1784.16585,129774.064525,2002.44906,85882.761315,3805.839606,901505.4,0.083045,0.000767,-4.529724e-05,0.008172,8e-06,0.005698,123.046875,-113.570648,2564.20752,121.571793,295.913818,-19.168142,235.574432,42.366421,151.106873,-6.364664,167.934799,18.623499,89.18084,-13.704891,67.660492,15.34315,68.932579,-12.27411,82.204201,10.976572,63.386311,-8.326573,61.773094,8.803792,51.244125,-3.6723,41.217415,5.747995,40.554478,-5.162882,49.775421,0.75274,52.42091,-1.690215,36.524071,-0.408979,41.597103,-2.303523,55.062923,1.221291,46.936035,blues
1,blues.00001.wav,661794,0.340914,0.09498,0.095948,0.002373,1530.176679,375850.073649,2039.036516,213843.755497,3550.522098,2977893.0,0.05604,0.001448,0.0001395807,0.005099,-0.000178,0.003063,67.999589,-207.501694,7764.555176,123.991264,560.259949,8.955127,572.810913,35.877647,264.506104,2.90732,279.932922,21.510466,156.477097,-8.560436,200.849182,23.370686,142.555954,-10.099661,166.108521,11.900497,104.358612,-5.555639,105.17363,5.376327,96.197212,-2.23176,64.914291,4.22014,73.152534,-6.012148,52.422142,0.927998,55.356403,-0.731125,60.314529,0.295073,48.120598,-0.283518,51.10619,0.531217,45.786282,blues
2,blues.00002.wav,661794,0.363637,0.085275,0.17557,0.002746,1552.811865,156467.643368,1747.702312,76254.192257,3042.260232,784034.5,0.076291,0.001007,2.105576e-06,0.016342,-1.9e-05,0.007458,161.499023,-90.722595,3319.044922,140.446304,508.765045,-29.093889,411.781219,31.684334,144.090317,-13.984504,155.493759,25.764742,74.548401,-13.664875,106.981827,11.639934,106.574875,-11.783643,65.447945,9.71876,67.908859,-13.133803,57.781425,5.791199,64.480209,-8.907628,60.385151,-1.077,57.711136,-9.229274,36.580986,2.45169,40.598766,-7.729093,47.639427,-1.816407,52.382141,-3.43972,46.63966,-2.231258,30.573025,blues
3,blues.00003.wav,661794,0.404785,0.093999,0.141093,0.006346,1070.106615,184355.942417,1596.412872,166441.494769,2184.745799,1493194.0,0.033309,0.000423,4.583644e-07,0.019054,-1.4e-05,0.002712,63.024009,-199.544205,5507.51709,150.090897,456.505402,5.662678,257.161163,26.859079,158.267303,1.771399,268.034393,14.234031,126.794128,-4.832006,155.912079,9.286494,81.273743,-0.759186,92.11409,8.137607,71.314079,-3.200653,110.236687,6.079319,48.251999,-2.480174,56.7994,-1.079305,62.289902,-2.870789,51.651592,0.780874,44.427753,-3.319597,50.206673,0.636965,37.31913,-0.619121,37.259739,-3.407448,31.949339,blues
4,blues.00004.wav,661794,0.308526,0.087841,0.091529,0.002303,1835.004266,343399.939274,1748.172116,88445.209036,3579.757627,1572978.0,0.101461,0.001954,-1.756129e-05,0.004814,-1e-05,0.003094,135.999178,-160.337708,5195.291992,126.219635,853.784729,-35.587811,333.792938,22.148071,193.4561,-32.4786,336.276825,10.852294,134.831573,-23.352329,93.257095,0.498434,124.672127,-11.793437,130.073349,1.207256,99.675575,-13.088418,80.254066,-2.813867,86.430626,-6.933385,89.555443,-7.552725,70.943336,-9.164666,75.793404,-4.520576,86.099236,-5.454034,75.269707,-0.916874,53.613918,-4.404827,62.910812,-11.703234,55.19516,blues


In [None]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


root_path = '/content/drive/MyDrive/kaggle/'  # change to your project folder (must sit under the mount point above)

## The Model

First, we start with the CNN model that draws the features from the content and style spectrogram images to generate the new spectrogram image. Its layers are a 1-dimensional convolutional layer, a ReLU activation function that adds non-linearity, an average pooling layer that shortens the time axis to reduce processing time, a fully connected layer, another ReLU function, and a final fully connected layer. Note that the style transfer code below only uses the convolutional layer to compute the style loss.
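
As a rough check of how these layers change the size of the data, the sketch below traces the length of the time axis through the convolution and pooling layers using the standard output-length formulas (dilation 1, no padding in the pooling layer). The helper names are hypothetical, and the input length of 1293 frames is an assumption: it is what a 661794-sample clip gives with a hop of 512 samples and centered frames.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1

def avgpool1d_out_len(length, kernel_size, stride=None):
    """Output length of 1-D average pooling; stride defaults to the kernel size."""
    stride = kernel_size if stride is None else stride
    return (length - kernel_size) // stride + 1

n_frames = 1 + 661794 // 512   # assumed number of spectrogram frames per clip
after_conv = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
after_pool = avgpool1d_out_len(after_conv, kernel_size=5)
flattened = 4096 * after_pool  # 4096 feature maps, flattened for a fully connected layer

print(n_frames, after_conv, after_pool, flattened)
```

The flattened size depends on the clip length, so a fully connected layer placed after the pooling layer has to be sized for the clips that are actually used.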

According to [Intel](https://www.intel.com/content/www/us/en/developer/articles/technical/neural-style-transfer-on-audio-signals.html), the Gram matrix is "the inner product between the feature maps i and j represented by vectors in layer l and Nl is the number of feature maps." Essentially, it captures the essence of the style image, and it is used in the Style Loss function to calculate the style loss of the generated image.
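
A small NumPy sketch may help show why the Gram matrix captures "style" rather than position in time. It assumes feature maps stored as rows of a 2-D array and the same normalization as the code below (division by the total number of elements). The names `gram_matrix` and `style_loss` are hypothetical helpers, not the classes used in the notebook.

```python
import numpy as np

def gram_matrix(features):
    """features: (n_maps, n_frames) -> normalized Gram matrix (n_maps, n_maps)."""
    n_maps, n_frames = features.shape
    # Entry (i, j) is the inner product between feature maps i and j
    return features @ features.T / (n_maps * n_frames)

def style_loss(generated, style):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(generated) - gram_matrix(style)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
style = rng.normal(size=(4, 100))
shifted = np.roll(style, 10, axis=1)        # same features, circularly shifted in time
other = rng.normal(size=(4, 100)) * 3.0     # unrelated, louder features

loss_shifted = style_loss(shifted, style)
loss_other = style_loss(other, style)
print(loss_shifted, loss_other)
```

Because the Gram matrix sums over the time axis, shifting the features in time leaves it essentially unchanged, while features with different correlations between maps give a clearly larger loss.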



Most of this code came from or was inspired by the example program from [this repository](https://github.com/alishdipani/Neural-Style-Transfer-Audio).



In [4]:
#@title Neural Style Transfer



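class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()
        # 1025 input channels = frequency bins of an STFT with N_FFT = 2048
        self.cnn1 = nn.Conv1d(in_channels=1025, out_channels=4096, kernel_size=3, stride=1, padding=1)
        self.nl1 = nn.ReLU()
        self.pool1 = nn.AvgPool1d(kernel_size=5)
        # Only cnn1 is used by the style transfer below; the remaining layers
        # are reached only if forward() is called directly.
        self.fc1 = nn.Linear(4096*2500, 2**5)
        self.nl3 = nn.ReLU()
        self.fc2 = nn.Linear(2**5, 2**5)  # input size must match fc1's output size

    def forward(self, x):
        out = self.cnn1(x)
        out = self.nl1(out)
        out = self.pool1(out)
        out = out.view(out.size(0), -1)  # flatten everything except the batch dimension
        out = self.fc1(out)
        out = self.nl3(out)
        out = self.fc2(out)
        return out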


class GramMatrix(nn.Module):

    def forward(self, input):
        # a = batch size (= 1), b = number of feature maps,
        # c = number of time frames in each feature map
        a, b, c = input.size()
        features = input.view(a * b, c)  # one row per feature map
        G = torch.mm(features, features.t())  # compute the Gram product
        # 'normalize' the values of the Gram matrix by dividing
        # by the total number of elements in the feature maps
        return G.div(a * b * c)


class StyleLoss(nn.Module):

	def __init__(self, target, weight):
		super(StyleLoss, self).__init__()
		self.target = target.detach() * weight
		self.weight = weight
		self.gram = GramMatrix()
		self.criterion = nn.MSELoss()

	def forward(self, input):
		self.output = input.clone()
		self.G = self.gram(input)
		self.G.mul_(self.weight)
		self.loss = self.criterion(self.G, self.target)
		return self.output

	def backward(self,retain_graph=True):
		self.loss.backward(retain_graph=retain_graph)
		return self.loss


#print('Enter the names of SCRIPT, Content audio, Style audio')
# script, content_audio_name , style_audio_name = argv

content_audio_name = "/content/drive/MyDrive/kaggle/gtzan-dataset-music-genre-classification/Data/genres_original/blues/blues.00003.wav"
style_audio_name = "/content/drive/MyDrive/kaggle/gtzan-dataset-music-genre-classification/Data/genres_original/country/country.00084.wav"
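# USING LIBROSA
N_FFT = 2048
def read_audio_spectum(filename):
  # Load at most 58.04 s of audio so that the spectrogram sizes are convenient
  x, fs = librosa.load(filename, duration=58.04)
  S = librosa.stft(x, n_fft=N_FFT)  # complex STFT with N_FFT // 2 + 1 = 1025 frequency bins
  S = np.log1p(np.abs(S))  # log-magnitude spectrogram; the phase is discarded
  return S, fs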

style_audio, style_sr = read_audio_spectum(style_audio_name)
content_audio, content_sr = read_audio_spectum(content_audio_name)

if(content_sr == style_sr):
  print('Sampling Rates are same')
else:
  print('Sampling rates are not same')
  exit()

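# Trim both spectrograms to the same number of time frames so that the reshapes cannot fail
num_samples = min(style_audio.shape[1], content_audio.shape[1])

style_audio = style_audio[:, :num_samples].reshape([1, 1025, num_samples])
content_audio = content_audio[:, :num_samples].reshape([1, 1025, num_samples])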



style_float = Variable(torch.from_numpy(style_audio))
content_float = Variable(torch.from_numpy(content_audio))


cnn = CNNModel()
#if torch.cuda.is_available():
  #cnn = cnn.cuda()
style_layers_default = ['conv_1']

style_weight=2500

def get_style_model_and_losses(cnn, style_float,style_weight=style_weight, style_layers=style_layers_default): #STYLE WEIGHT
  
  cnn = copy.deepcopy(cnn)
  style_losses = []
  model = nn.Sequential()  # the new Sequential module network
  gram = GramMatrix()  # we need a gram module in order to compute style targets
  if torch.cuda.is_available():
    model = model.cuda()
    gram = gram.cuda()

  name = 'conv_1'
  model.add_module(name, cnn.cnn1)
  if name in style_layers:
    target_feature = model(style_float).clone()
    target_feature_gram = gram(target_feature)
    style_loss = StyleLoss(target_feature_gram, style_weight)
    model.add_module("style_loss_1", style_loss)
    style_losses.append(style_loss)

 
  return model, style_losses


input_float = content_float.clone()
#input_float = Variable(torch.randn(content_float.size())).type(torch.FloatTensor)

learning_rate_initial = 0.03

def get_input_param_optimizer(input_float):
  input_param = nn.Parameter(input_float.data)
  #optimizer = optim.Adagrad([input_param], lr=learning_rate_initial, lr_decay=0.0001,weight_decay=0)
  optimizer = optim.Adam([input_param], lr=learning_rate_initial, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
  return input_param, optimizer

num_steps= 1000

def run_style_transfer(cnn, style_float, input_float, num_steps=num_steps, style_weight=style_weight): #STYLE WEIGHT, NUM_STEPS
  print('Building the style transfer model..')
  model, style_losses= get_style_model_and_losses(cnn, style_float, style_weight)
  input_param, optimizer = get_input_param_optimizer(input_float)
  print('Optimizing..')
  run = [0]

  while run[0] <= num_steps:
    def closure():
      # clamp the input to [0, 1] (a step carried over from image style transfer;
      # note that log-magnitudes above 1 are clipped)
      input_param.data.clamp_(0, 1)

      optimizer.zero_grad()
      model(input_param)
      style_score = 0

      for sl in style_losses:
        #print('sl is ',sl,' style loss is ',style_score)
        style_score += sl.backward()

      run[0] += 1
      if run[0] % 100 == 0:
        print("run {}:".format(run))
        print('Style Loss : {:8f}'.format(style_score.data)) #CHANGE 4->8 
        print()

      return style_score


    optimizer.step(closure)
  input_param.data.clamp_(0, 1)
  return input_param.data
  
output = run_style_transfer(cnn, style_float, input_float)


output = output.squeeze(0)
output = output.numpy()

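N_FFT = 2048
a = np.expm1(output)  # undo log1p to recover the magnitude spectrogram

# Phase reconstruction (Griffin-Lim style iteration): start from random phase, then
# repeatedly resynthesize the audio and keep only the phase of its STFT
p = 2 * np.pi * np.random.random_sample(a.shape) - np.pi
for i in range(500):
  S = a * np.exp(1j*p)
  x = librosa.istft(S)
  p = np.angle(librosa.stft(x, n_fft=N_FFT))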

OUTPUT_FILENAME = 'test_output2_2500.wav'
sf.write(OUTPUT_FILENAME, x, style_sr, 'PCM_24')
print('DONE...')


Sampling Rates are same
Building the style transfer model..
Optimizing..
run [100]:
Style Loss : 0.002846

run [200]:
Style Loss : 0.002839

run [300]:
Style Loss : 0.002835

run [400]:
Style Loss : 0.002822

run [500]:
Style Loss : 0.002796

run [600]:
Style Loss : 0.002771

run [700]:
Style Loss : 0.002756

run [800]:
Style Loss : 0.002748

run [900]:
Style Loss : 0.002744

run [1000]:
Style Loss : 0.002741

DONE...
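
The phase reconstruction loop at the end of the cell follows the idea of the Griffin-Lim algorithm: keep the magnitudes fixed and repeatedly re-estimate the phase from resynthesized audio. The sketch below illustrates that loop with plain NumPy on a toy signal. It is a minimal illustration under stated assumptions, not the notebook's implementation: the `stft`, `istft`, `griffin_lim`, and `spectral_error` helpers are hypothetical, and the frame size, hop, and sample rate are small illustrative values.

```python
import numpy as np

N_FFT, HOP = 256, 64            # small illustrative sizes, not the notebook's settings
WINDOW = np.hanning(N_FFT)

def stft(x):
    """Windowed short-time Fourier transform, shape (N_FFT // 2 + 1, n_frames)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

def istft(S):
    """Least-squares inverse: windowed overlap-add divided by the summed squared window."""
    frames = np.fft.irfft(S, n=N_FFT, axis=0)
    length = N_FFT + HOP * (S.shape[1] - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    for i in range(S.shape[1]):
        x[i * HOP : i * HOP + N_FFT] += frames[:, i] * WINDOW
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter, seed=0):
    """Estimate audio whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)  # random starting phase
    for _ in range(n_iter):
        audio = istft(magnitude * np.exp(1j * phase))
        phase = np.angle(stft(audio))  # keep the phase, discard the magnitude
    return istft(magnitude * np.exp(1j * phase))

def spectral_error(audio, magnitude):
    """Relative mismatch between the audio's STFT magnitude and the target."""
    return float(np.linalg.norm(np.abs(stft(audio)) - magnitude) / np.linalg.norm(magnitude))

t = np.arange(4096) / 8000.0
signal = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1320.0 * t)
target = np.abs(stft(signal))   # magnitudes only, as after style transfer

error_random_phase = spectral_error(griffin_lim(target, n_iter=0), target)
error_after_iterations = spectral_error(griffin_lim(target, n_iter=50), target)
print(error_random_phase, error_after_iterations)
```

The iterations make the estimated audio more consistent with the target magnitudes, but the phase is still only an estimate, so some loss in audio quality is expected even when the magnitudes come from a real recording.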


Few songs were generated with our Neural Style Transfer because of extreme time constraints (the program took hours to run) and storage constraints (free Google Colab is very limiting). The main example I will focus on is the "successful" (the code *technically* works) conversion of a 30-second blues song into a 30-second country song.



This is the blues [song](https://drive.google.com/file/d/10ascZb-V5F3zHC8H83mWQp70orgEB420/view?usp=sharing). Linked for your convenience.

In [None]:
audio_path = "/content/drive/MyDrive/kaggle/gtzan-dataset-music-genre-classification/Data/genres_original/blues/blues.00003.wav"
x, sr = librosa.load(audio_path)

# waveshow replaces waveplot, which was removed in newer librosa releases
plt.figure(figsize=(16, 5))
librosa.display.waveshow(x, sr=sr)

# Keep the player as the last expression so that the notebook displays it
ipd.Audio(audio_path)

This is the country [song](https://drive.google.com/file/d/1MhJdSB0g4K3FacuZiWmdbpb4AjrbKX0w/view?usp=sharing). Linked for your convenience.

In [None]:
audio_path = "/content/drive/MyDrive/kaggle/gtzan-dataset-music-genre-classification/Data/genres_original/country/country.00084.wav"
x, sr = librosa.load(audio_path)

# waveshow replaces waveplot, which was removed in newer librosa releases
plt.figure(figsize=(16, 5))
librosa.display.waveshow(x, sr=sr)

# Keep the player as the last expression so that the notebook displays it
ipd.Audio(audio_path)

And this was the resulting [song](https://drive.google.com/file/d/1BbuPACdp8MPgXV9Txl2Q1HzruXOEahVh/view?usp=sharing), with the Neural Style Transfer run with 2500 steps. Linked for your convenience. As you can hear, it sounds terrible, to be frank. I was surprised that neither of the voices from the two songs could be found in the generated audio and that the beat was so choppy. The few iterations I was able to make were to increase the number of steps to 3000 and to add an additional convolutional layer to the CNN; however, these produced minimal improvements.



## Conclusion


I believe the poor quality of the generated audio comes from the reconstruction of the modified spectrogram back into a wave file. I knew that the spectrogram did not contain all the information necessary to reconstruct a decent song; however, I had researched this issue and believed that the code for phase reconstruction in the "Neural Style Transfer" code block would fix the problem. Evidently this is not the case, but one thing I did learn is that

It was extremely difficult to make iterations and try to improve this model, not because it was too complicated or finicky, but because of time. On both the CPU and GPU, this algorithm took a very, very long time to run: about 2.5 to 3 hrs each time with num_steps = 2500, the number of steps recommended by the code's source from [Intel](https://www.intel.com/content/www/us/en/developer/articles/technical/neural-style-transfer-on-audio-signals.html). This was coupled with the extremely limited usage allowed by Google Colab (limited GPU and runtime in particular). Having a limit on the maximum runtime before Colab forcefully stopped running made it difficult to do large chunks of data processing conveniently, overnight for example, as even if the screen was kept on