<a href="https://colab.research.google.com/github/arpanpathak/DataScienceNotebooks/blob/main/Music_Melody_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install tensorflow transformers librosa SpeechRecognition pandas


# Abstract
In this paper, we present a framework for automating music composition and lyric transcription using deep learning and unsupervised learning. We propose a set of Convolutional Neural Network (CNN) filters for source separation, which extract the vocal and instrumental components from raw audio. To bridge audio and text, we use transfer learning to model the joint probability distribution of mel spectrograms and word embeddings, with the goal of accurate lyric transcription. We also introduce a generative model based on an unsupervised encoder-decoder (autoencoder) architecture that transforms textual input into musical compositions, simulating the creative behavior of a human composer. This model is designed to capture the relationship between lyrics and melody so that it can generate coherent and expressive songs. The work extends natural language processing (NLP) techniques to music and is intended as a tool for lyricists, poets, and music labels to rapidly prototype and produce creative works. By combining unsupervised learning with generative modeling, it opens new possibilities for AI-assisted music production, in both efficiency and artistic potential.


In [4]:
!pip install pydub



# Data Preparation

We collected short clips of human singing and song covers of variable length, together with hand-crafted lyrics, and stored them as file pairs `<"audio_<no>.m4a", "lyrics_<no>.txt">`. Each pair is converted into two fixed-size matrices:
```
Audio (MFCC vector) shape: (100, 13)
Lyrics (tokenized vector) shape: (1, 50)
```
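
To make the fixed-size step concrete, here is a minimal sketch in plain Python of the pad-or-truncate rule that yields the same number of rows regardless of clip length. The helper names `pad_or_truncate` and `frames_to_seconds` are illustrative and not part of the notebook's code, and the duration estimate assumes librosa's default hop length of 512 samples.

```python
from typing import List


def pad_or_truncate(frames: List[List[float]], max_length: int) -> List[List[float]]:
    """Return exactly max_length rows: drop extra rows or append all-zero rows."""
    if not frames:
        raise ValueError("need at least one frame to infer the feature width")
    width = len(frames[0])
    # Keep at most max_length rows (copying so the input is not modified).
    fixed = [list(row) for row in frames[:max_length]]
    # Append zero rows until the target length is reached (no-op if already full).
    fixed.extend([0.0] * width for _ in range(max_length - len(fixed)))
    return fixed


def frames_to_seconds(n_frames: int, sr: int, hop_length: int = 512) -> float:
    """Approximate audio duration covered by n_frames analysis frames."""
    return n_frames * hop_length / sr


if __name__ == "__main__":
    short_clip = [[1.0, 2.0], [3.0, 4.0]]
    print(pad_or_truncate(short_clip, 4))  # two real rows, then two zero rows
    print(round(frames_to_seconds(100, sr=44100), 2))
```

Under these assumptions, 100 frames of a clip sampled at 44.1 kHz would cover only a little over one second of audio, so the fixed-size matrix may represent just the opening of a longer clip.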

This dataset could be extended through crowdsourcing.
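
The `audio_<no>` / `lyrics_<no>` naming convention can also be used to pair files explicitly by their number. The sketch below is one possible approach, not the notebook's own code: `pair_by_index` is a hypothetical helper that matches files on `<no>` and fails loudly when a clip has no partner, which avoids silent misalignment when two independently sorted lists are zipped together.

```python
import re
from typing import Dict, Iterable, List, Tuple

AUDIO_RE = re.compile(r"^audio_(\d+)\.m4a$")
LYRICS_RE = re.compile(r"^lyrics_(\d+)\.txt$")


def pair_by_index(filenames: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair audio_<no>.m4a with lyrics_<no>.txt files that share the same <no>."""
    audio: Dict[int, str] = {}
    lyrics: Dict[int, str] = {}
    for name in filenames:
        audio_match = AUDIO_RE.match(name)
        lyrics_match = LYRICS_RE.match(name)
        if audio_match:
            audio[int(audio_match.group(1))] = name
        elif lyrics_match:
            lyrics[int(lyrics_match.group(1))] = name
        # Any other file name is ignored.
    # Indices present on only one side indicate a missing partner file.
    unmatched = sorted(set(audio) ^ set(lyrics))
    if unmatched:
        raise ValueError(f"clips without a partner file: {unmatched}")
    # Sort numerically so that audio_10 comes after audio_2.
    return [(audio[i], lyrics[i]) for i in sorted(audio)]


if __name__ == "__main__":
    names = ["lyrics_10.txt", "audio_2.m4a", "audio_10.m4a", "lyrics_2.txt"]
    for audio_name, lyrics_name in pair_by_index(names):
        print(audio_name, "<->", lyrics_name)
```

In a notebook, the input list would typically come from `os.listdir` on the data directory.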

In [48]:
import os
import librosa
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from IPython.display import Audio, display

# Function to load audio
def load_audio(audio_file, sr=None, max_duration=10):
    # Load audio using librosa
    audio, sr = librosa.load(audio_file, sr=sr, duration=max_duration)
    return audio, sr

# Function to read lyrics from the text file
def load_lyrics(lyrics_file):
    with open(lyrics_file, 'r') as f:
        lyrics = f.read().strip()
    return lyrics

# Function to convert audio to MFCCs (Mel-frequency cepstral coefficients)
def audio_to_mfcc(audio, sr, max_length=100):
    # Extract MFCC features
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

    # Transpose to shape (time, features)
    mfcc = mfcc.T  # Time frames as rows, MFCCs as columns

    # Padding or truncating to a fixed length
    if len(mfcc) < max_length:
        pad_width = max_length - len(mfcc)
        mfcc = np.pad(mfcc, ((0, pad_width), (0, 0)), mode='constant')
    else:
        mfcc = mfcc[:max_length, :]

    return mfcc

# Path to the directories containing audio and lyrics files
audio_dir = './'  # Replace with the path to your audio directory
lyrics_dir = './'  # Replace with the path to your lyrics directory

# Get all audio and lyrics files
audio_files = [f for f in os.listdir(audio_dir) if f.startswith('audio_') and f.endswith('.m4a')]
lyrics_files = [f for f in os.listdir(lyrics_dir) if f.startswith('lyrics_') and f.endswith('.txt')]

# Sort files based on the number part of the filename (e.g., audio_<no>.m4a -> <no>)
audio_files.sort(key=lambda x: int(x.split('_')[1].split('.')[0]))
lyrics_files.sort(key=lambda x: int(x.split('_')[1].split('.')[0]))

# Initialize tokenizer for lyrics (you can define max_words and other parameters as per your need)
tokenizer = Tokenizer(num_words=5000)  # Limit to 5000 words for the vocabulary size

# Iterate through each pair of audio and lyrics files
for audio_file, lyrics_file in zip(audio_files, lyrics_files):
    # Load audio and lyrics
    audio_path = os.path.join(audio_dir, audio_file)
    lyrics_path = os.path.join(lyrics_dir, lyrics_file)

    # Load audio
    audio, sr = load_audio(audio_path, sr=None)

    # Convert audio to MFCC
    audio_vector = audio_to_mfcc(audio, sr, max_length=100)  # 100 time frames

    # Load lyrics
    lyrics = load_lyrics(lyrics_path)

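    # Tokenize the lyrics and convert them to sequences.
    # Note: fit_on_texts is called once per file here, so the word -> id mapping is
    # rebuilt on every iteration and the ids printed for different files are not
    # comparable. For training, fit the tokenizer once on all lyrics instead.
    tokenizer.fit_on_texts([lyrics])  # Update the vocabulary with this file's lyrics
    lyrics_seq = tokenizer.texts_to_sequences([lyrics])  # Convert lyrics to sequences
    lyrics_vector = pad_sequences(lyrics_seq, padding='post', maxlen=50)  # Pad to a fixed length of 50 tokens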

    # Display the audio and lyrics
    print(f"Lyrics for {audio_file}:")
    print(lyrics)
    print(f"Audio (MFCC vector) shape: {audio_vector.shape}")
    print(f"Lyrics (tokenized vector) shape: {lyrics_vector.shape}")
    print(lyrics_vector)
    print(audio_vector)
    print("-----------------------------------------\n")

    # Play the audio
    display(Audio(audio_path))  # Display an audio player for the audio file


  audio, sr = librosa.load(audio_file, sr=sr, duration=max_duration)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


Lyrics for audio_0.m4a:
My girl, my girl, don't lie to me. Tell me, where did you sleep last night?
Audio (MFCC vector) shape: (100, 13)
Lyrics (tokenized vector) shape: (1, 50)
[[ 1  2  1  2  4  5  6  3  7  3  8  9 10 11 12 13  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]]
[[-459.66135     42.849876    41.185257  ...   14.314046    12.254946
    10.681534 ]
 [-458.02686     44.237488    40.34401   ...   20.050821    17.534748
    14.266394 ]
 [-442.69406     54.83538     34.011932  ...   14.006481    10.447466
     9.18412  ]
 ...
 [-171.24693    190.78427    -50.57952   ...  -14.74697      6.0627055
   -11.6960945]
 [-170.4166     181.98541    -50.74931   ...  -15.71911      5.7473536
   -12.173477 ]
 [-159.97144    191.07425    -49.995872  ...  -13.114267     3.6838915
   -11.520643 ]]
-----------------------------------------





Lyrics for audio_1.m4a:
s this the real life? Is this just fantasy?
Caught in a landslide, no escape from reality
Audio (MFCC vector) shape: (100, 13)
Lyrics (tokenized vector) shape: (1, 50)
[[15  4 16 17 18 19  4 20 21 22 23 24 25 26 27 28 29  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]]
[[-468.89682     76.006195    62.25554   ...    6.0164337    4.9514656
     4.119427 ]
 [-417.33618    111.50548     46.75373   ...   12.577353     1.0448133
    -6.1962533]
 [-355.68085    138.78735     27.84312   ...   12.585001    -4.050495
   -19.559685 ]
 ...
 [-166.03395    187.10977    -66.30272   ...  -10.811323    -6.2263346
   -11.908101 ]
 [-165.18587    192.86693    -66.157745  ...  -13.921744    -5.1794233
   -10.290825 ]
 [-162.33546    196.69101    -67.56645   ...  -13.515344    -7.1259985
   -10.276342 ]]
-----------------------------------------





Lyrics for audio_2.m4a:
Is this the real life? Is this just fantasy?
Caught in a landslide, no escape from reality
Open your eyes, look up to the skies and see
I'm just a poor boy, I need no sympathy
Because I'm easy come, easy go
Little high, little low
Audio (MFCC vector) shape: (100, 13)
Lyrics (tokenized vector) shape: (1, 50)
[[ 3  1  2 11 12  3  1  4 13 14 15  5 16  6 17 18 19 33 34 35 36 37  9  2
  38 39 40 20  4  5 41 42 43 44  6 45 46 20 21 47 21 48 22 49 22 50  0  0
   0  0]]
[[-3.78730347e+02  8.63325500e+01  1.92247810e+01 ...  5.31439447e+00
  -5.10083103e+00 -3.72599721e-01]
 [-3.25552094e+02  1.02475769e+02 -8.36921406e+00 ...  6.25550604e+00
   3.33269215e+00 -9.18761635e+00]
 [-2.88411469e+02  1.12654449e+02 -2.68255157e+01 ... -3.18093777e+00
  -3.02138329e-01 -2.03366776e+01]
 ...
 [-2.52223007e+02  1.59019150e+02 -1.28763275e+01 ... -2.83601303e+01
  -9.75946331e+00 -1.74074306e+01]
 [-2.57148895e+02  1.64390930e+02 -8.44738960e+00 ... -2.96957474e+01
  -1.30024815e



Lyrics for audio_3.m4a:
It's amazing how you can speak right to my heart
Without saying a word, you can light up the dark
Try as I may, I can never explain
What I hear when you don't say a thing
Audio (MFCC vector) shape: (100, 13)
Lyrics (tokenized vector) shape: (1, 50)
[[52 53 54  2 11 55 56  7  6 57 58 59  1 60  2 11 61 24  4 62 63 64  5 65
   5 11 66 67 68  5 69 70  2 13 71  1 72  0  0  0  0  0  0  0  0  0  0  0
   0  0]]
[[-386.8978      97.22352     42.475708  ...    9.326134     5.3905587
     2.3178751]
 [-313.28085    140.10307     17.749624  ...   15.507418    -1.2242019
    -6.0810065]
 [-285.16702    157.31752     11.684967  ...   12.777823    -2.5179014
   -14.061092 ]
 ...
 [-190.10793    122.52476    -55.589767  ...  -17.20525    -22.152344
   -13.562916 ]
 [-180.0038     126.68646    -52.743103  ...  -15.774594   -18.739914
   -15.858721 ]
 [-181.74742    120.62075    -57.194633  ...  -14.559265   -17.85063
   -12.553361 ]]
-----------------------------------------



# Train (Model)
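
The training cell below treats transcription as frame-wise classification: the model emits one score vector over the vocabulary per MFCC frame, and cross-entropy is computed against the zero-padded lyric ids. Because padding positions are included, they contribute to the loss like any other token. As a rough illustration only, here is a plain-Python sketch of a cross-entropy that averages over non-padding positions. This is a common variant and an assumption on our part, not what the cell computes; in PyTorch, `nn.CrossEntropyLoss` offers an `ignore_index` argument for the same purpose. `PAD_ID` and `masked_cross_entropy` are illustrative names.

```python
import math
from typing import List, Sequence

PAD_ID = 0  # assumption: id 0 marks padding, as produced by zero-padding the lyric ids


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one score vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    pad_id: int = PAD_ID,
) -> float:
    """Mean negative log-likelihood over positions whose target is not pad_id."""
    losses = [
        -log_softmax(row)[target]
        for row, target in zip(logits, targets)
        if target != pad_id
    ]
    if not losses:
        raise ValueError("every target position is padding")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Three frames, vocabulary of four ids; the middle frame is padding and is skipped.
    scores = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    print(masked_cross_entropy(scores, [1, PAD_ID, 2]))
```

With uniform scores over four classes, each counted position contributes `log(4)`, so the padded frame has no effect on the result.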

In [44]:
import os
import librosa
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Function to load audio
def load_audio(audio_file, sr=None, max_duration=10):
    audio, sr = librosa.load(audio_file, sr=sr, duration=max_duration)
    return audio, sr

# Function to read lyrics from the text file
def load_lyrics(lyrics_file):
    with open(lyrics_file, 'r') as f:
        lyrics = f.read().strip()
    return lyrics

# Function to convert audio to MFCCs
def audio_to_mfcc(audio, sr, max_length=100):
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    mfcc = mfcc.T
    if len(mfcc) < max_length:
        pad_width = max_length - len(mfcc)
        mfcc = np.pad(mfcc, ((0, pad_width), (0, 0)), mode='constant')
    else:
        mfcc = mfcc[:max_length, :]
    return mfcc

# Custom Dataset class
class AudioLyricsDataset(Dataset):
    def __init__(self, audio_dir, lyrics_dir, tokenizer, max_length=100):
        self.audio_dir = audio_dir
        self.lyrics_dir = lyrics_dir
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.audio_files = [f for f in os.listdir(audio_dir) if f.startswith('audio_') and f.endswith('.m4a')]
        self.lyrics_files = [f for f in os.listdir(lyrics_dir) if f.startswith('lyrics_') and f.endswith('.txt')]
        self.audio_files.sort(key=lambda x: int(x.split('_')[1].split('.')[0]))
        self.lyrics_files.sort(key=lambda x: int(x.split('_')[1].split('.')[0]))

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        audio_path = os.path.join(self.audio_dir, self.audio_files[idx])
        lyrics_path = os.path.join(self.lyrics_dir, self.lyrics_files[idx])

        # Load audio and convert to MFCC
        audio, sr = load_audio(audio_path)
        mfcc = audio_to_mfcc(audio, sr, self.max_length)  # Shape: (max_length, 13)

        # Load lyrics and tokenize
        lyrics = load_lyrics(lyrics_path)
        lyrics_seq = self.tokenizer.texts_to_sequences([lyrics])
        lyrics_padded = pad_sequences(lyrics_seq, padding='post', maxlen=self.max_length)  # Shape: (1, max_length)

        return torch.tensor(mfcc, dtype=torch.float32), torch.tensor(lyrics_padded[0], dtype=torch.long)

# Define the model
class TranscriptionModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(TranscriptionModel, self).__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out)
        return out

# Hyperparameters
input_size = 13  # MFCC features
hidden_size = 128
output_size = 5000  # Vocabulary size
learning_rate = 0.001
num_epochs = 10
batch_size = 32
max_length = 100  # Same for both MFCC and lyrics

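# Initialize the tokenizer and fit it once on all lyrics files.
# Without fitting, texts_to_sequences returns empty sequences and every target becomes padding (0).
tokenizer = Tokenizer(num_words=5000)
lyrics_names = [f for f in os.listdir('./') if f.startswith('lyrics_') and f.endswith('.txt')]
tokenizer.fit_on_texts([load_lyrics(os.path.join('./', f)) for f in lyrics_names])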

# Create dataset and dataloader
dataset = AudioLyricsDataset('./', './', tokenizer, max_length=max_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Initialize model, loss function, and optimizer
model = TranscriptionModel(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    for i, (mfccs, lyrics) in enumerate(dataloader):
        # Forward pass
        outputs = model(mfccs)  # Shape: (batch_size, max_length, output_size)

        # Reshape outputs and lyrics for CrossEntropyLoss
        outputs = outputs.view(-1, output_size)  # Shape: (batch_size * max_length, output_size)
        lyrics = lyrics.view(-1)  # Shape: (batch_size * max_length)

        # Compute loss
        loss = criterion(outputs, lyrics)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(dataloader)}], Loss: {loss.item():.4f}')

# Save the model checkpoint
torch.save(model.state_dict(), 'transcription_model.pth')

# Sanity check: run the model on the training data (no held-out split is used here)
def test_model(model, dataloader):
    model.eval()
    with torch.no_grad():
        for mfccs, lyrics in dataloader:
            outputs = model(mfccs)
            _, predicted = torch.max(outputs, 2)
            print("Predicted:", predicted)
            print("Actual:", lyrics)

test_model(model, dataloader)


Predicted: tensor([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,  

# Test
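
The test cell below picks the highest-scoring vocabulary id at each frame and maps the ids back to words. A minimal plain-Python sketch of that decoding step follows, assuming id 0 is padding and should be dropped. `greedy_decode` is an illustrative name, and `index_word` here is an ordinary dictionary standing in for a fitted tokenizer's id-to-word mapping.

```python
from typing import Dict, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    index_word: Dict[int, str],
    pad_id: int = 0,
) -> str:
    """Take the best id per frame, drop padding and unknown ids, join the words."""
    ids = [argmax(scores) for scores in frame_scores]
    words = [index_word[i] for i in ids if i != pad_id and i in index_word]
    return " ".join(words)


if __name__ == "__main__":
    vocab = {1: "my", 2: "girl"}  # toy mapping for illustration only
    scores = [
        [0.1, 0.9, 0.0],  # best id: 1
        [0.8, 0.1, 0.1],  # best id: 0 (padding, skipped)
        [0.0, 0.2, 0.7],  # best id: 2
    ]
    print(greedy_decode(scores, vocab))
```

Because one id is emitted per frame, consecutive frames can repeat the same word; this sketch does not try to merge such repeats.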

In [47]:
# Testing with audio_3.m4a
audio_test_path = './audio_3.m4a'  # Replace with path to your audio_3.m4a file
audio, sr = librosa.load(audio_test_path, sr=None, duration=10)

# Convert audio to MFCC using the standalone function
audio_mfcc = audio_to_mfcc(audio, sr, max_length=100)  # Use the same max_length as during training
print("MFCC Shape:", audio_mfcc.shape)
print("MFCC Values (first 5 frames):", audio_mfcc[:5])

# Prepare the input for the model
audio_input = torch.tensor(audio_mfcc, dtype=torch.float32).unsqueeze(0)  # Add batch dimension (batch_size=1)
print("Model Input Shape:", audio_input.shape)

# Run the model
model.eval()
with torch.no_grad():
    output = model(audio_input)  # Shape: (1, max_length, output_size)
print("Model Output Shape:", output.shape)

# Convert the per-frame scores to token ids, then back to text
predicted = output.argmax(dim=-1)  # Shape: (1, max_length)
predicted_lyrics = tokenizer.sequences_to_texts(predicted.tolist())

print("Predicted Transcription:", predicted_lyrics)



MFCC Shape: (100, 13)
MFCC Values (first 5 frames): [[-386.8978      97.22352     42.475708    29.043026    33.694813
    26.038189    12.401115    -4.212136   -15.644012    -5.3663893
     9.326134     5.3905587    2.3178751]
 [-313.28085    140.10307     17.749624    19.974691    37.343727
    19.616566    15.570862    -0.8970549  -26.961031    -8.498768
    15.507418    -1.2242019   -6.0810065]
 [-285.16702    157.31752     11.684967    13.388584    35.08818
    25.53927     15.387703    -7.869223   -29.018291   -13.137954
    12.777823    -2.5179014  -14.061092 ]
 [-263.86633    158.23773      1.9803222   14.58141     30.8696
    24.64959     11.565348    -9.543816   -29.532768   -15.106099
    14.302038    -2.7773087  -15.004125 ]
 [-239.37564    158.96387     -9.361126    15.149033    29.11945
    24.324772     4.92731    -14.390648   -30.101164   -18.745502
    13.18357     -3.845552   -23.053017 ]]
Model Input Shape: torch.Size([1, 100, 13])
Model Output Shape: torch.Size([1, 1