<a href="https://www.kaggle.com/fred913/voice-style-transfering-based-on-random-cnn?scriptVersionId=87020649" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This project is based on [randomCNN-voice-transfer](https://github.com/mazzzystar/randomCNN-voice-transfer).

`input1.wav` contains voice lines spoken by *Viator*, and `input2.wav` contains voice lines spoken by *Viatrix*.

*Viator* CV: [鹿喑](https://space.bilibili.com/302145)

*Viatrix* CV: [宴宁](https://space.bilibili.com/14897804)

`mv1.wav` and `mv2.wav` are the vocal tracks of my cover version of the song *Let The Wind Tell You*.

**NOTICE**: I do NOT have authorization to use `input1.wav` and `input2.wav` in this project. I'm just a fan of these characters.


- Lyricist / Composer / Producer: ChiliChill
- Arrangement: ChiliChill
- Bass: 冯子明, 山口進也
- Flute: Salit Lahav
- String arrangement: 胡静成
- Violin: 庞阔 / 张浩
- Viola: 毕芳
- Violoncello: 郎莹
- String recording: 李昕达@九紫天诚
- ChiliChill is a two-person independent music group made up of Yu H. and CuSummer.
- Vocal / Mixing: 剩饭不吃剩饭/fred913 (me)

**NOTICE**: The official release is mixed by ChiliChill. This project uses my cover version of the song.

## Preparations
 - Get voice recordings of one character (the style).
 - Get a recording of the speech or singing you want to convert (the content).
 - Upload a dataset that contains both the content file and the style file.
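
Before uploading, it may help to check the sample rate, channel count, and length of each clip. The sketch below uses only the standard-library `wave` module, so it assumes uncompressed PCM WAV files; `describe_wav` is a hypothetical helper name and is not part of this project. Note that `librosa.load`, as called later in this notebook, resamples to 22050 Hz mono by default, so the files do not need to match each other exactly.

```python
import wave


def describe_wav(path):
    """Hypothetical helper: return (sample_rate_hz, channels, duration_seconds).

    Works for uncompressed PCM WAV files only.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return rate, wav.getnchannels(), frames / float(rate)


# Example call (the file name is taken from this notebook's configuration):
# print(describe_wav("input2.wav"))
```

Longer clips produce wider spectrograms, which is likely to make each optimization step slower and use more memory.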

## Recommended Voice Sources
 - Games (e.g. recordings from the `voices` page in Genshin Impact)

# Risks
 - Please take the **legal** risks of this project seriously. I **do not** advocate or support any illegal impersonation. This project has better uses, and it **should not be ruined** by a few people to the point that everyone comes to **despise** it.
 - To scammers: do not use this project to deceive people. Get lost. (This is aimed at scammers only; if that is not you, please do not take it personally.)

# How to run
1. First, configure the project in the cell below.

In [None]:
DATASET_NAME = "rcnntmp1"  # Name of the dataset (or folder) that holds the audio files.
CONTENT_FILENAME = "mv2.wav"  # Audio whose content is kept.
STYLE_FILENAME = "input2.wav"  # Audio whose voice style is applied.

## If you're running this project somewhere else, change this option:
DATASET_ACCESS = "/kaggle/input/%s/%s"  # Kaggle: upload a dataset with the audio files inside
# DATASET_ACCESS = "./%s/%s"  # Local: put the audio files in a folder next to this notebook


2. Then click `Save Version` in the top right of the screen. Don't forget to turn on the *GPU*!
3. Wait 10-20 hours for the run to finish.
4. Done! Download `output.wav` from the *results*.
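
Since a run takes many hours, it can be worth confirming that the configured paths resolve to real files before saving a version. The following is a minimal sketch of how the `DATASET_ACCESS` template is filled in; `resolve_inputs` and `missing_inputs` are hypothetical helpers, not part of the notebook.

```python
import os

# Template and names copied from the configuration cell; change them to match your setup.
DATASET_ACCESS = "/kaggle/input/%s/%s"
DATASET_NAME = "rcnntmp1"
CONTENT_FILENAME = "mv2.wav"
STYLE_FILENAME = "input2.wav"


def resolve_inputs(template, dataset, filenames):
    """Hypothetical helper: fill the path template for each file name."""
    return [template % (dataset, name) for name in filenames]


def missing_inputs(paths):
    """Hypothetical helper: return the paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]


paths = resolve_inputs(DATASET_ACCESS, DATASET_NAME, [CONTENT_FILENAME, STYLE_FILENAME])
for p in missing_inputs(paths):
    print("missing:", p)
```

If anything is printed, the dataset name or a file name probably does not match what was uploaded.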

In [None]:
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
from torch.autograd import Variable
import time
import math
import argparse
import gc
import tqdm
import os
import torch
import torch.nn as nn
import librosa
import numpy as np
import soundfile
from packaging import version

In [None]:
cuda = torch.cuda.is_available()

N_FFT = 512
N_CHANNELS = 1 + N_FFT // 2  # number of frequency bins produced by the STFT
OUT_CHANNELS = 32


class RandomCNN(nn.Module):
    def __init__(self):
        super(RandomCNN, self).__init__()

        # A single 2-D convolution over (frequency, time). The (3, 1) kernel
        # spans 3 frequency bins and 1 time frame.
        self.conv1 = nn.Conv2d(1, OUT_CHANNELS, kernel_size=(3, 1), stride=1, padding=0)
        self.LeakyReLU = nn.LeakyReLU(0.2)

        # Freeze the randomly initialized parameters; the network is never trained.
        weight = torch.randn(self.conv1.weight.data.shape)
        self.conv1.weight = torch.nn.Parameter(weight, requires_grad=False)
        bias = torch.zeros(self.conv1.bias.data.shape)
        self.conv1.bias = torch.nn.Parameter(bias, requires_grad=False)

    def forward(self, x_delta):
        out = self.LeakyReLU(self.conv1(x_delta))
        return out
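
To see what the convolution does to the tensor shape, the standard output-size formula for a convolution without dilation can be applied to each axis. This is a small sketch in plain Python; `conv_output_size` is a hypothetical helper and `n_frames` is an arbitrary example value, not something taken from the audio files.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length of one convolution axis (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


N_FFT = 512
N_CHANNELS = 1 + N_FFT // 2   # 257 frequency bins
OUT_CHANNELS = 32
n_frames = 1000               # hypothetical number of STFT frames

out_freq = conv_output_size(N_CHANNELS, kernel=3)   # kernel height 3 along frequency
out_time = conv_output_size(n_frames, kernel=1)     # kernel width 1 along time
print((1, OUT_CHANNELS, out_freq, out_time))        # (batch, channels, frequency, time)
```

With no padding, the frequency axis shrinks by two bins while the time axis keeps its length, so the feature map has as many frames as the input spectrogram.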


In [None]:
def librosa_write(outfile, x, sr):
    if version.parse(librosa.__version__) < version.parse('0.8.0'):
        librosa.output.write_wav(outfile, x, sr)
    else:
        soundfile.write(outfile, x, sr)

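def wav2spectrum(filename):
    # librosa.load resamples to 22050 Hz mono by default.
    x, sr = librosa.load(filename)
    # n_fft must be passed by keyword in recent librosa releases.
    S = librosa.stft(x, n_fft=N_FFT)

    # Keep only the log-compressed magnitude; the phase is discarded.
    S = np.log1p(np.abs(S))
    return S, sr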


def spectrum2wav(spectrum, sr, outfile):
    # Undo the log1p compression to recover the magnitude spectrum.
    a = np.exp(spectrum) - 1
    # Start from random phase and refine it over 50 Griffin-Lim style iterations.
    p = 2 * np.pi * np.random.random_sample(spectrum.shape) - np.pi
    for i in range(50):
        S = a * np.exp(1j * p)
        x = librosa.istft(S)
        p = np.angle(librosa.stft(x, n_fft=N_FFT))
    librosa_write(outfile, x, sr)


def wav2spectrum_keep_phase(filename):
    x, sr = librosa.load(filename)
    S = librosa.stft(x, n_fft=N_FFT)
    p = np.angle(S)

    S = np.log1p(np.abs(S))
    return S, p, sr


def spectrum2wav_keep_phase(spectrum, p, sr, outfile):
    # Undo the log1p compression, then refine the supplied phase `p`
    # over 50 Griffin-Lim style iterations.
    a = np.exp(spectrum) - 1
    for i in range(50):
        S = a * np.exp(1j * p)
        x = librosa.istft(S)
        p = np.angle(librosa.stft(x, n_fft=N_FFT))
    librosa_write(outfile, x, sr)


def compute_content_loss(a_C, a_G):
    """
    Compute the content cost

    Arguments:
    a_C -- tensor of dimension (1, n_C, n_H, n_W)
    a_G -- tensor of dimension (1, n_C, n_H, n_W)

    Returns:
    J_content -- scalar content cost: the sum of squared differences between
                 a_C and a_G, scaled by 1 / (4 * m * n_C * n_H * n_W)
    """
    m, n_C, n_H, n_W = a_G.shape

    # Reshape a_C and a_G to the (m * n_C, n_H * n_W)
    a_C_unrolled = a_C.view(m * n_C, n_H * n_W)
    a_G_unrolled = a_G.view(m * n_C, n_H * n_W)

    # Compute the cost
    J_content = 1.0 / (4 * m * n_C * n_H * n_W) * torch.sum((a_C_unrolled - a_G_unrolled) ** 2)

    return J_content


def gram(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_L)

    Returns:
    GA -- Gram matrix of shape (n_C, n_C)
    """
    GA = torch.matmul(A, A.t())

    return GA


def gram_over_time_axis(A):
    """
    Argument:
    A -- tensor of shape (m, n_C, n_H, n_W), where n_W is the time axis

    Returns:
    GA -- Gram matrix of A with the time axis summed out,
          of shape (m * n_C * n_H, m * n_C * n_H)
    """
    m, n_C, n_H, n_W = A.shape

    # Flatten batch, channel, and frequency into rows; keep time (n_W) as columns.
    A_unrolled = A.view(m * n_C * n_H, n_W)
    GA = torch.matmul(A_unrolled, A_unrolled.t())

    return GA


def compute_layer_style_loss(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_C, n_H, n_W)
    a_G -- tensor of dimension (1, n_C, n_H, n_W)

    Returns:
    J_style_layer -- tensor representing a scalar style cost.
    """
    m, n_C, n_H, n_W = a_G.shape

    # Compute the Gram matrices with the time axis (n_W) summed out, so each row
    # is one (channel, frequency) pair. This is NOT equivalent to the usual
    # channel-only Gram matrix of shape (n_C, n_C).
    GS = gram_over_time_axis(a_S)
    GG = gram_over_time_axis(a_G)

    # Computing the loss
    J_style_layer = 1.0 / (4 * (n_C ** 2) * (n_H * n_W)) * torch.sum((GS - GG) ** 2)

    return J_style_layer
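
One consequence of summing out the time axis is that the style clip and the generated clip may have different lengths and still yield Gram matrices of the same size, which is what allows the style loss to subtract them. The sketch below illustrates this in NumPy with small made-up sizes; `gram_over_time_axis_np` is a hypothetical NumPy counterpart of the function above, written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_C, n_H = 4, 5                                       # small hypothetical sizes
style = rng.standard_normal((1, n_C, n_H, 30))        # 30 time frames
generated = rng.standard_normal((1, n_C, n_H, 80))    # 80 time frames


def gram_over_time_axis_np(A):
    """NumPy counterpart of gram_over_time_axis: rows are (channel, frequency) pairs."""
    m, c, h, w = A.shape
    unrolled = A.reshape(m * c * h, w)
    return unrolled @ unrolled.T


GS = gram_over_time_axis_np(style)
GG = gram_over_time_axis_np(generated)
print(GS.shape, GG.shape)  # both are (20, 20) despite the different clip lengths
```

The matrix side is `n_C * n_H`, so with the notebook's settings it grows with both `OUT_CHANNELS` and the number of frequency bins.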



In [None]:


cuda = torch.cuda.is_available()
if not cuda:
    print(
        "NOTICE: CUDA is NOT available. Training on the CPU will be MUCH slower."
    )

parser = argparse.ArgumentParser()
parser.add_argument('-content', help='Content input', default=DATASET_ACCESS % (DATASET_NAME, CONTENT_FILENAME))
parser.add_argument('-content_weight',
                    type=float,
                    help='Content weight. Default is 1e2',
                    default=1e2)
parser.add_argument('-style', help='Style input', default=DATASET_ACCESS % (DATASET_NAME, STYLE_FILENAME))
parser.add_argument('-style_weight',
                    type=float,
                    help='Style weight. Default is 1',
                    default=1)
parser.add_argument('-epochs',
                    type=int,
                    help='Number of epoch iterations. Default is 20000',
                    default=20000)
parser.add_argument('-print_interval',
                    type=int,
                    help='Number of epoch iterations between printing losses',
                    default=1000)
parser.add_argument('-plot_interval',
                    type=int,
                    help='Number of epoch iterations between plot points',
                    default=1000)
parser.add_argument('-learning_rate', type=float, default=0.002)
parser.add_argument('-output',
                    help='Output file name. Default is "output"',
                    default='output')
args = parser.parse_args(args=())

CONTENT_FILENAME = args.content
STYLE_FILENAME = args.style

a_content, sr = wav2spectrum(CONTENT_FILENAME)
a_style, sr = wav2spectrum(STYLE_FILENAME)

a_content_torch = torch.from_numpy(a_content)[None, None, :, :]
if cuda:
    a_content_torch = a_content_torch.cuda()
print(a_content_torch.shape)
a_style_torch = torch.from_numpy(a_style)[None, None, :, :]
if cuda:
    a_style_torch = a_style_torch.cuda()
print(a_style_torch.shape)

model = RandomCNN()
model.eval()

a_C_var = Variable(a_content_torch, requires_grad=False).float()
a_S_var = Variable(a_style_torch, requires_grad=False).float()
if cuda:
    model = model.cuda()
    a_C_var = a_C_var.cuda()
    a_S_var = a_S_var.cuda()

a_C = model(a_C_var)
a_S = model(a_S_var)

# Optimizer
learning_rate = args.learning_rate
a_G_var = Variable(torch.randn(a_content_torch.shape) * 1e-3)
if cuda:
    a_G_var = a_G_var.cuda()
a_G_var.requires_grad = True
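# Optimize the generated spectrogram directly, using the configured learning rate.
optimizer = torch.optim.Adam([a_G_var], lr=learning_rate)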

# coefficient of content and style
style_param = args.style_weight
content_param = args.content_weight

num_epochs = args.epochs
print_every = args.print_interval
plot_every = args.plot_interval

# Keep track of losses for plotting
current_loss = 0
all_losses = []


def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


start = time.time()
# Train the Model
bar = tqdm.tqdm(range(1, num_epochs + 1))
for epoch in bar:
    gc.collect()
    torch.cuda.empty_cache()
    optimizer.zero_grad()
    a_G = model(a_G_var)

    content_loss = content_param * compute_content_loss(a_C, a_G)
    style_loss = style_param * compute_layer_style_loss(a_S, a_G)
    loss = content_loss + style_loss
    loss.backward()
    optimizer.step()

    bar.set_description_str(
        "content_loss={:.4f}, style_loss={:.4f}, total_loss={:.4f}".format(
            content_loss.item(), style_loss.item(), loss.item()))
    current_loss += loss.item()

    # Add current loss avg to list of losses
    if epoch % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

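    # Creating a file named get_temp_file.flag in the working directory makes the
    # loop write the current result to temp_opt.wav and then delete the flag.
    if os.path.isfile("get_temp_file.flag"):
        print("writing intermediate result to temp_opt.wav")
        gen_spectrum = a_G_var.cpu().data.numpy().squeeze()
        gen_audio_C = "temp_opt.wav"
        spectrum2wav(gen_spectrum, sr, gen_audio_C)
        os.remove("get_temp_file.flag")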

gen_spectrum = a_G_var.cpu().data.numpy().squeeze()
gen_audio_C = args.output + ".wav"
spectrum2wav(gen_spectrum, sr, gen_audio_C)

plt.figure()
plt.plot(all_losses)
plt.savefig('loss_curve.png')

# plt.imsave writes an array straight to an image file, so no figure or title is needed.
# Rows are frequency bins; [:400] keeps at most the first 400 of them.
plt.imsave('Content_Spectrum.png', a_content[:400, :])
plt.imsave('Style_Spectrum.png', a_style[:400, :])
plt.imsave('Gen_Spectrum.png', gen_spectrum[:400, :])
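
The training loop checks for a file named `get_temp_file.flag` on every iteration and, when it finds one, writes the current result to `temp_opt.wav` and deletes the flag. In an interactive session, one way to trigger this is to create the file from a second process or terminal in the same working directory. The sketch below assumes such a setup; `request_snapshot` is a hypothetical helper and is not part of the notebook.

```python
from pathlib import Path

FLAG_NAME = "get_temp_file.flag"  # file name checked by the training loop


def request_snapshot(workdir="."):
    """Hypothetical helper: create the flag file in the given working directory."""
    flag = Path(workdir) / FLAG_NAME
    flag.touch()
    return flag


# Example call, run from the directory where the notebook is executing:
# request_snapshot()
```

This is probably of little use in a non-interactive `Save Version` run, where there is no second process to create the file.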
