# Handwritten Music Symbol Recognition with Deep Ensemble

In ancient times, there was no system for recording or documenting music. Later, musical pieces from the classical and post-classical periods of European music were documented as scores using Western staff notation. This notation is used by most modern genres of music because of its versatility. Hence, it is important to develop a method that can store handwritten music scores digitally. Optical Music Recognition (OMR) is a system that automatically interprets scanned handwritten music scores. In this work, we propose a classifier ensemble of deep transfer learning models with a Support Vector Machine (SVM) as the aggregation function for handwritten music symbol recognition. We use three deep learning models, namely ResNet50, GoogleNet and DenseNet161, each pre-trained on ImageNet and fine-tuned on our target datasets, i.e., music symbol image datasets. The proposed ensemble is able to capture a more complex association of the base learners, thus improving the overall performance. We have evaluated the proposed model on three publicly available standard datasets, namely Handwritten Online Music Symbols (HOMUS), Capitan_Score_Non-uniform and Rebelo_real, and achieved state-of-the-art results on all three datasets.

## Hyperparameter Initialization

In [None]:
#hyper params
lr = 1e-4
bs = 32
val_split = 0.85  # fraction of the train folder used for training; the rest is held out for validation
num_epoch = 20
num_classes = 32

We use PyTorch to implement the project. Here we import the relevant modules and check for a GPU.

In [None]:
#imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils import data
import numpy as np
import torchvision
from  numpy import exp,absolute
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
import math
from sklearn import svm
import sklearn.model_selection as model_selection
from sklearn.metrics import accuracy_score,f1_score,precision_score ,recall_score 

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

This function takes the path to a dataset folder and returns the training, validation and test sets. The folder must contain `train/` and `test/` subfolders, each arranged as per the `torchvision.datasets.ImageFolder` specification.

In [None]:
def get_TVT(path):
    data_transforms = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

    dataset = datasets.ImageFolder(path+'train/',transform=data_transforms)

    train_size = math.floor(len(dataset)*val_split)
    val_size = len(dataset) - train_size
    trainset, valset = data.random_split(dataset,lengths=[train_size,val_size])
    testset = datasets.ImageFolder(path+'test/',transform=data_transforms)
    return trainset,valset,testset
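
To make the split concrete, here is a minimal sketch of the size arithmetic used in `get_TVT`. The dataset size of 1000 images is a made-up figure for illustration, and `split_sizes` is a hypothetical helper that is not part of the notebook.

```python
import math

def split_sizes(n_samples, train_fraction):
    """Return (train_size, val_size) the way get_TVT computes them."""
    train_size = math.floor(n_samples * train_fraction)
    val_size = n_samples - train_size  # the remainder goes to validation
    return train_size, val_size

# Hypothetical dataset of 1000 images with val_split = 0.85
sizes = split_sizes(1000, 0.85)
print(sizes)
```

With `val_split = 0.85`, roughly 85% of the images under `train/` are used for training and the remaining images are held out for validation.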

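This function fine-tunes a model. It evaluates the model on the validation set after every epoch and, before returning, restores the weights of the epoch with the best validation accuracy.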

In [None]:
def train_model(trainset, valset, model, criterion, optimizer, scheduler, num_epochs):
    dataloaders = {
        'train': data.DataLoader(trainset,batch_size=bs,shuffle=True),
        'val' : data.DataLoader(valset,batch_size=bs,shuffle=True)
    }
    dataset_sizes = {'train':len(trainset),'val':len(valset)}
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1) 
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]
            
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

This function calculates the model accuracy on test set.

In [None]:
def test_acc(model, testset):
    model.eval()  # disable dropout and use running batch-norm statistics
    running_corrects = 0
    testloader = data.DataLoader(testset,batch_size=bs,shuffle=False)
    for inputs, labels in testloader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        with torch.no_grad():
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)

        running_corrects += torch.sum(preds == labels).item()
    # fraction of correctly classified test samples, as a Python float
    return running_corrects/len(testset)

The first function returns a pair (X, Y) of features and labels. Each row of X is the concatenation of the score vectors produced by the models for one sample. If the dataset has N samples, there are c classes and k trained models, then X has shape (N, ck). Samples are also weighted according to the number of unique predictions made on them by the models (explained later). The second function, `get_score_ft`, builds the same features without any weighting.

In [None]:
def get_weighted_score_ft(models,dataset):
    num_models = len(models)
    X = np.empty((0,num_models*num_classes))
    Y = np.empty((0),dtype=int)
    # batch_size=1 so that the weight can be decided per sample
    dataloader = data.DataLoader(dataset,batch_size=1,shuffle=True)
    for inputs,labels in dataloader:
        inputs,labels = inputs.to(device),labels.to(device)
        predictions = set()  # distinct class indices predicted by the models
        with torch.set_grad_enabled(False):
            x = models[0](inputs)
            _, preds = torch.max(x, 1)
            # store the Python int: tensors hash by identity, so a set of tensors never deduplicates
            predictions.add(preds.item())
            for i in range(1,num_models):
                x1 = models[i](inputs)
                _, preds = torch.max(x1, 1)
                predictions.add(preds.item())
                x = torch.cat((x,x1),dim=1)
            if len(predictions) > 1:
                # the models disagree on this sample: scale its score vector by 3
                X = np.append(X,x.cpu().numpy()*3,axis=0)
            else:
                X = np.append(X,x.cpu().numpy(),axis=0)
            Y = np.append(Y,labels.cpu().numpy(),axis=0)
    return X,Y
def get_score_ft(models,dataset):
    num_models = len(models)
    # one row per sample: the concatenated scores of all models
    X = np.empty((0,num_models*num_classes))
    Y = np.empty((0),dtype=int)
    dataloader = data.DataLoader(dataset,batch_size=bs,shuffle=False)
    for inputs,labels in dataloader:
        inputs,labels = inputs.to(device),labels.to(device)
        with torch.set_grad_enabled(False):
            x = models[0](inputs)
            for i in range(1,num_models):
                x = torch.cat((x,models[i](inputs)),dim=1)

            # keep the last (possibly smaller) batch so that every sample is scored
            X = np.append(X,x.cpu().numpy(),axis=0)
            Y = np.append(Y,labels.cpu().numpy(),axis=0)
    return X,Y
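
The (N, ck) layout of the feature matrix can be illustrated with a small NumPy sketch. The sizes and the random score matrices below are stand-ins chosen for illustration only; in the notebook the scores come from the fine-tuned networks.

```python
import numpy as np

N, c, k = 5, 4, 3  # hypothetical: 5 samples, 4 classes, 3 models

rng = np.random.default_rng(0)
# Stand-ins for the (N, c) score matrices that each model would produce
model_scores = [rng.normal(size=(N, c)) for _ in range(k)]

# Concatenate along the class axis, keeping the model order fixed
X = np.concatenate(model_scores, axis=1)
print(X.shape)

# Columns [i*c, (i+1)*c) hold the scores of model i
assert np.array_equal(X[:, c:2 * c], model_scores[1])
```

Because each model occupies a fixed block of columns, the same model order has to be used when building training and test features.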

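We load the three models with ImageNet-pretrained weights and replace the final classification layer of each with a new linear layer that has `num_classes` outputs.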

In [None]:
def get_models():
    googlenet = torchvision.models.googlenet(pretrained=True)
    resnet = torchvision.models.resnet50(pretrained=True)
    densenet = torchvision.models.densenet161(pretrained=True)

    densenet.classifier = nn.Linear(2208,num_classes)
    resnet.fc = nn.Linear(2048,num_classes)
    googlenet.fc = nn.Linear(1024,num_classes)
    densenet = densenet.to(device)
    resnet = resnet.to(device)
    googlenet = googlenet.to(device)

    return [densenet,googlenet,resnet]

This is the main code cell, where all the functions are used together. Consider $K$ base classifiers $\{CF_1, CF_2, \dots, CF_K\}$ applied to an $n$-class classification problem. The output of any classifier (say, $CF_i$) is an $n$-dimensional vector $O_i = \{s_1^i, s_2^i, \dots, s_n^i\}$, where $s_j^i$ is the confidence score produced by the $i^{th}$ classifier for the $j^{th}$ class. We concatenate the output vectors produced by the classifiers $\{CF_1, CF_2, \dots, CF_K\}$ to get a vector $S$ of length $nK$. $S$ is represented by

\begin{equation}
    \label{equ:final_vector}
    S = \{s_1^1, s_2^1, \dots, s_n^1, s_1^2, s_2^2, \dots, s_n^K\}
\end{equation}

One such vector $S$ is produced for every sample of the dataset. Suppose the training set has $N$ such samples with corresponding labels $y_i$. We thus obtain the set $\{(S_1, y_1), (S_2, y_2), \dots, (S_N, y_N)\}$, on which we train the SVM model. To introduce weights on samples, we consider the total number of unique predictions made on a sample by the base classifiers. For example, if there are three base classifiers and, for some sample, two of them predict the label 'class-x' and the remaining one predicts the label 'class-y', then the total number of unique predictions for that sample is $2$. If the total number of unique predictions is greater than $1$, the classifiers disagree on the correct class. So we propose that the SVM put more emphasis on these samples in order to approximate a better decision boundary or support vectors.

A sample is assigned $\mathcal{W}$ times more weight if the number of unique predictions for the corresponding image is greater than a threshold $\lambda$, an integer in the range $[1, K]$. In this work, we choose the values of both $K$ and $\lambda$ to be 3. The value of $\mathcal{W}$ is taken as 3, which was decided experimentally. Note that `get_weighted_score_ft` above applies the weight by multiplying the concatenated score vector by 3 whenever the number of unique predictions exceeds 1.
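
The weighting rule can be sketched for a single sample as follows. This is an illustration under assumptions rather than the notebook's implementation: `weight_sample` is a hypothetical helper, the score vectors are made up, and the disagreement threshold of 1 follows the condition used in `get_weighted_score_ft`.

```python
import numpy as np

W = 3  # weight factor from the text

def weight_sample(score_vectors, weight=W):
    """Concatenate per-model scores; scale by `weight` if the models disagree."""
    predicted = {int(np.argmax(s)) for s in score_vectors}  # unique predicted classes
    s = np.concatenate(score_vectors)
    return (s * weight if len(predicted) > 1 else s), len(predicted)

# Hypothetical 3-class scores: two models pick class 0, one picks class 1
disagree = [np.array([0.7, 0.2, 0.1]),
            np.array([0.6, 0.3, 0.1]),
            np.array([0.2, 0.5, 0.3])]
# All three models pick class 0
agree = [np.array([0.7, 0.2, 0.1])] * 3

s_dis, n_dis = weight_sample(disagree)
s_agr, n_agr = weight_sample(agree)
print(n_dis, n_agr)
```

The first sample has two unique predictions, so its vector is scaled by `W`; the second has one and is left unchanged.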

When testing an image, it is first passed through all three DL models to obtain three output vectors. These output vectors are then concatenated in the same model order that was used during training. We pass the resulting vector through the trained SVM classifier, which predicts the final class of the test image.
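
The test-time path can be sketched end to end with synthetic scores. Everything below is an assumption made for illustration: `fake_scores` is a hypothetical stand-in for the three networks, and the sizes are arbitrary. Only the `svm.SVC(kernel='rbf', break_ties=True)` call mirrors the main cell.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
n_classes, n_models, n_train = 3, 3, 60

def fake_scores(label):
    """Stand-in for the K model outputs: one noisy one-hot vector per model, concatenated."""
    parts = []
    for _ in range(n_models):
        s = rng.normal(scale=0.1, size=n_classes)
        s[label] += 1.0
        parts.append(s)
    return np.concatenate(parts)  # same model order at train and test time

# Cycle through the labels so that every class is present in the training set
train_Y = np.arange(n_train) % n_classes
train_X = np.stack([fake_scores(y) for y in train_Y])

clf = svm.SVC(kernel='rbf', break_ties=True).fit(train_X, train_Y)

# One test image: a single concatenated vector of shape (1, n*K)
test_vector = fake_scores(2).reshape(1, -1)
pred = clf.predict(test_vector)
print(pred.shape)
```

The SVM sees only the concatenated scores, so the base networks can be swapped or retrained without changing this step, as long as the column order stays the same.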

![pipeline](https://user-images.githubusercontent.com/31564734/121711632-71825880-caf8-11eb-912b-1d6a147e81a1.jpg)


In [None]:
criterion = nn.CrossEntropyLoss()
ensemble_accuracy=[]
for fold in ['Fold_1','Fold_2','Fold_3','Fold_4','Fold_5']:
    for folder in ['HOMUS']:  # other dataset folders: 'Capitan_Score_Non-uniform', 'Capitan_Score_Uniform', 'Fornes_Music_Symbols_labelled', 'Rebelo_Syn_labelled'
        # get_TVT takes a single path that must contain train/ and test/ subfolders
        trainset,valset,testset = get_TVT('/content/homus/'+fold+'/')
        models = get_models()
        for model in models:
            optimizer = optim.Adam(model.parameters(),lr=lr)
            exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=9, gamma=0.1)
            model = train_model(trainset, valset, model, criterion, optimizer, exp_lr_scheduler,num_epoch)

            print(test_acc(model,testset))
    train_X, train_Y = get_weighted_score_ft(models,trainset)
    test_X, test_Y = get_score_ft(models,testset)
    clf = svm.SVC(kernel='rbf',break_ties=True).fit(train_X, train_Y)
    pred = clf.predict(test_X)
    acc = accuracy_score(test_Y, pred)
    ensemble_accuracy.append(acc)
    print('Ensemble on '+fold+': '+str(acc))
print("Average Ensemble Accuracy:",sum(ensemble_accuracy)/len(ensemble_accuracy))