---

# InstRecognitionTool

An instrument recognition tool.

---

## Concept

A brief summary of the tool, the datasets used, how features were extracted from each dataset and how they were classified.

### Dataset

The dataset is called IRMAS, and can be downloaded from: https://www.upf.edu/web/mtg/irmas

It is split into training data and three separate testing data groups.

All audio files are in 16-bit stereo format, sampled at 44.1 kHz.

##### Instruments

Below are the 11 instruments, their corresponding labels, and the number of audio files for each in the training data.

- cello,                 `cel`,   388
- clarinet,              `cla`,   505
- flute,                 `flu`,   451
- acoustic guitar,       `gac`,   637
- electric guitar,       `gel`,   760
- organ,                 `org`,   682
- piano,                 `pia`,   721
- saxophone,             `sax`,   626
- trumpet,               `tru`,   577
- violin,                `vio`,   580
- human singing voice,   `voi`,   778

##### Training data

- 6705 .wav files
- spread over 11 folders
- folders represent instruments and are labeled as such
- each file name contains an instrument label, a genre label, and in some cases a drums presence label
- each audio file is 3 seconds long
- audio files are excerpts of different recordings from the past century
- audio files vary in quality

##### Testing data

- 3 testing groups
- test group 1 contains 807 .wav files
- test group 2 contains 1301 .wav files
- test group 3 contains 766 .wav files
- file names do not contain instrument labels
- audio file lengths vary from around 5 to 20 seconds
- audio files are excerpts from existing songs

##### Validation data

- each .wav file in the testing data groups is accompanied by an annotation .txt file with the same name
- every annotation .txt file contains one or two instrument labels, corresponding to the instruments present in the .wav file
- the annotation files were created manually
- they are used to validate the predictions
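
A minimal sketch of how such an annotation file could be read. It assumes one instrument label per line, possibly surrounded by whitespace; that layout and the helper names `parse_annotation` and `annotation_path` are assumptions for illustration, not taken from the project code.

```python
from pathlib import PurePath

# The 11 instrument labels listed in the Instruments section.
LABELS = {"cel", "cla", "flu", "gac", "gel", "org", "pia", "sax", "tru", "vio", "voi"}

def parse_annotation(text):
    """Return the known instrument labels found in the annotation text.

    Assumption: one label per line, possibly padded with tabs or spaces.
    """
    labels = []
    for line in text.splitlines():
        label = line.strip()
        if label in LABELS:
            labels.append(label)
    return labels

def annotation_path(wav_path):
    """The annotation file has the same name as the .wav file, with a .txt extension."""
    return PurePath(wav_path).with_suffix(".txt")

# Example with a hypothetical two-instrument annotation.
print(parse_annotation("gel\t\nvoi\t\n"))
print(annotation_path("Part1/song.wav"))
```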

### Machine learning

A brief summary of the feature extraction, cross-validation and classification algorithms used.

##### Feature extraction

20 _MFCCs_ (_Mel-Frequency Cepstral Coefficients_) were used as features; they were obtained using the `librosa` library for Python.

MFCCs are derived as follows:

1. Take the _Fourier Transform_ of (a windowed excerpt of) a signal
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows
3. Take the logs of the powers at each of the mel frequencies
4. Take the _Discrete Cosine Transform_ (_DCT_) of the list of mel log powers, as if it were a signal
5. The _MFCCs_ are the amplitudes of the resulting spectrum
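
The five steps above can be sketched for a single frame in plain Python. This is an illustration only, not the `librosa` implementation the project relies on: the naive DFT, the filter count of 40, the frame length of 512 and all function names are assumptions chosen to keep the example self-contained.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Step 1: naive DFT of one frame; returns the power of bins 0..N/2."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        powers.append(abs(s) ** 2)
    return powers

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Step 2: triangular, overlapping filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, evenly spaced in mel.
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [b * (sample_rate / 2.0) / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def mfcc_frame(frame, sample_rate, n_filters=40, n_coeffs=20):
    power = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(power), sample_rate)
    # Step 3: log of the power collected by each mel filter (offset avoids log(0)).
    log_mel = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10) for row in bank]
    # Steps 4-5: DCT-II of the log mel powers; its amplitudes are the MFCCs.
    return [
        sum(v * math.cos(math.pi * c * (i + 0.5) / n_filters) for i, v in enumerate(log_mel))
        for c in range(n_coeffs)
    ]

# One 512-sample frame of a 440 Hz sine sampled at 44.1 kHz.
frame = [math.sin(2 * math.pi * 440 * t / 44100) for t in range(512)]
coeffs = mfcc_frame(frame, 44100)
print(len(coeffs))
```

In practice a library such as `librosa` computes these coefficients for every frame of a file; the sketch only shows where each step of the list fits.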

##### Cross-validation

In total, three cross-validation methods were used: Leave One Out, K-Fold and Holdout.

Cross-validation is used on training data. Its goal is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalize to an independent dataset.

It does so by splitting training data into subsets, some of which are used for training the model, and the others for validation. There are exhaustive CV methods, which learn and test on all possible ways to divide the original sample into a training and a validation set. There are also non-exhaustive methods, which do not compute all ways of splitting the original sample.

Leave One Out CV is an exhaustive method, and it uses exactly one sample as the validation set, and all the others for training. It then repeats the process until every sample has been selected for validation (and others for training).
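
A minimal sketch of how Leave One Out generates its splits, working on sample indices only. The function name is hypothetical and this is not the project's `crossvalidation` module.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, validation_index) once for every sample."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out

for train, held_out in leave_one_out_splits(4):
    print("train:", train, "validate:", held_out)
```

With 6705 training files this means 6705 separate train-and-validate rounds, which is why the method is thorough but slow.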

K-Fold CV is a non-exhaustive method, and it works by randomly partitioning input samples into k equal-sized subsamples. Of the k subsamples, exactly one subsample is selected for validation, and k-1 for training. The process is repeated k times, with each of the k subsamples being used only once as validation data. The results can be averaged to produce a single estimation.
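
The partitioning described above could look roughly like this. The function name, the fixed seed and the way uneven remainders are handled are illustrative assumptions.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for train, validation in k_fold_splits(10, 5):
    print("validate on", sorted(validation), "train on", len(train), "samples")
```

Each sample lands in exactly one validation fold, so the k accuracy values can be averaged into a single estimate.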

Holdout is also a non-exhaustive method. The input samples are randomly assigned to two sets, d0 and d1, of which d0 is used for training, and d1 for testing. The size of each set is arbitrary, although the training set should be larger than the testing set. The model is trained on d0 and tested on d1. There is only a single run, so its results can be misleading.
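
A sketch of a holdout split, using the d0/d1 naming from the paragraph above. The 80/20 ratio and the function name are arbitrary illustrative choices.

```python
import random

def holdout_split(samples, train_fraction=0.8, seed=0):
    """Randomly assign samples to a training set d0 and a testing set d1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

d0, d1 = holdout_split(range(10))
print(len(d0), "training samples,", len(d1), "testing samples")
```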

##### Classification

###### K-Nearest Neighbours

The _KNN_ (_K-Nearest Neighbours_) algorithm was used for instrument classification, via the `neighbors` module of the `sklearn` library.

The input for training consisted of _MFCCs_ and the corresponding classes (instruments), represented by numbers ranging from 1 to 11. Each class was obtained by parsing the path of the audio file, taking the name of its parent folder, and converting that name to a number.
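
A sketch of that path-to-class step. The alphabetical numbering of the labels and the example file names are assumptions; the project may assign the numbers 1 to 11 differently.

```python
from pathlib import PureWindowsPath

# Assumption: classes are numbered 1..11 in alphabetical order of the labels.
INSTRUMENTS = ["cel", "cla", "flu", "gac", "gel", "org", "pia", "sax", "tru", "vio", "voi"]
CLASS_OF = {label: number for number, label in enumerate(INSTRUMENTS, start=1)}

def class_from_path(wav_path):
    """Map an audio file to its class number via the name of its parent folder."""
    folder = PureWindowsPath(wav_path).parent.name
    return CLASS_OF[folder]

# Hypothetical file name inside the training data folder.
print(class_from_path(r"D:\College\Soft Computing\Data\IRMAS-TrainingData\gel\example.wav"))
```

`PureWindowsPath` accepts both backslashes and forward slashes, which suits the Windows paths used in this project.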

The input for testing consisted of _MFCCs_ only, and it was used for prediction. The output was the predicted class, represented by a number, ranging from 1 to 11.

During testing, the predictions were also validated against the validation data described above.

Accuracy was calculated as the percentage of correct predictions within each testing group (and later across all testing groups). A prediction counted as correct if the predicted class was the same as the class (or one of the two classes) provided in the validation data.
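
The accuracy rule can be sketched as follows. The function name and the toy values are hypothetical; the project's own calculation is done inside `classi.testandverify`.

```python
def accuracy_percent(predictions, annotations):
    """Percentage of predictions that match any of the annotated classes.

    predictions: one predicted class number per audio file.
    annotations: for each file, the set of one or two annotated class numbers.
    """
    correct = sum(1 for p, labels in zip(predictions, annotations) if p in labels)
    return 100.0 * correct / len(predictions)

# Four files: three predictions hit one of the annotated classes.
print(accuracy_percent([5, 11, 3, 7], [{5}, {4, 11}, {1}, {7, 5}]))
```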

K was set to 97, which is roughly the square root of the total number of audio files in the dataset (6705 training files plus 2874 testing files; √9579 ≈ 97.9). It is also an odd, prime number, which is commonly recommended, and it produced the best results of the values tested.

That means that for every element within the input test data, 97 nearest neighbours were taken into account when determining the class it belongs to. The output class is determined by the plurality vote; that is, the most common class out of the 97 neighbours is the output class.
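
A minimal sketch of the plurality vote, using Euclidean distance on toy two-dimensional features. The project itself uses `sklearn`; the function name and data here are illustrative only.

```python
import math
from collections import Counter

def knn_predict(train_features, train_classes, query, k):
    """Return the most common class among the k training samples nearest to query."""
    distances = sorted(
        (math.dist(features, query), cls)
        for features, cls in zip(train_features, train_classes)
    )
    votes = Counter(cls for _, cls in distances[:k])
    return votes.most_common(1)[0][0]

features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
classes = [1, 1, 1, 2, 2]
print(knn_predict(features, classes, (0.2, 0.2), k=3))
print(knn_predict(features, classes, (5, 5.5), k=3))
```

With 11 classes a plurality does not need to be a majority: the winning class only has to collect more votes than any other single class.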

The algorithm was tested for K = {5, 11, 31, 97}.

- For K = 5, overall accuracy was around 31% for the entire testing set.
- For K = 11, overall accuracy was around 34% for the entire testing set.
- For K = 31, overall accuracy was around 38% for the entire testing set.
- For K = 97, overall accuracy was around 40% for the entire testing set.

Accuracy improved as K increased, but the gains were small and diminishing: raising K from 31 to 97 added only about 2 percentage points. Increasing K did not negatively affect performance.

###### Random Forest Classification

Random Forest uses decision trees. Each tree consists of internal nodes, each of which tests an input variable, and leaves, which correspond to classes. Random Forest works by constructing multiple decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees.
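
The voting step can be illustrated with three hand-written stand-ins for trained trees. Real trees are learned from bootstrapped samples of the training data; the thresholds and names below are made up for the example.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Return the mode of the classes predicted by the individual trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical one-node "trees": each tests one condition and returns a class.
def tree_a(x):
    return 1 if x[0] < 0.5 else 2

def tree_b(x):
    return 1 if x[1] < 0.3 else 2

def tree_c(x):
    return 1 if x[0] + x[1] < 1.0 else 2

trees = [tree_a, tree_b, tree_c]
print(forest_predict(trees, (0.4, 0.4)))  # trees vote 1, 2, 1
print(forest_predict(trees, (0.9, 0.9)))  # trees vote 2, 2, 2
```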

The same dataset was used in the same way for training and testing the model.

Precision, recall and F1 score were used to evaluate both classification algorithms.
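
These three metrics can be derived from a confusion matrix, as in the sketch below. It assumes rows are true classes and columns are predicted classes, and that the summary figure is a support-weighted average; the function names and the 2x2 matrix are illustrative.

```python
def per_class_scores(matrix):
    """Return (precision, recall, f1, support) for each class.

    matrix[t][p] counts samples of true class t that were predicted as class p.
    """
    n = len(matrix)
    scores = []
    for c in range(n):
        true_positive = matrix[c][c]
        predicted = sum(matrix[t][c] for t in range(n))  # column sum
        support = sum(matrix[c])                         # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1, support))
    return scores

def weighted_average(scores):
    """Average precision, recall and F1 over classes, weighted by support."""
    total = sum(s[3] for s in scores)
    return tuple(sum(s[i] * s[3] for s in scores) / total for i in range(3))

scores = per_class_scores([[3, 1], [2, 4]])
print(scores)
print(weighted_average(scores))
```

Note that the support-weighted recall equals the overall accuracy, because each class's correct predictions are divided by its support and then multiplied by it again.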

Cross-validation reported higher accuracy for RFC than for KNN, by around 10 to 30 percentage points.

However, the Random Forest Classifier did not improve overall precision on the testing data compared to K-Nearest Neighbours: it is around 46% for both KNN and RFC. Recall and F1 score decreased.

This suggests that the feature selection process could be improved.

### Conclusion

While a precision of around 46% (and an overall accuracy of around 40%) is low, better results might be achieved.

That can be done in the following ways:

- increasing the number of features
- making the features more unique, perhaps by extracting features differently
- limiting the dataset to only a few instruments
- using different ML algorithms (such as Bagging)
- etc.

---

## Implementation

### Instructions and prerequisites

##### Prerequisites

- `Anaconda` 5.2.0 for Python 3.6 (py36_3)
- `conda` 4.6.3 (py36_0)
- `Jupyter notebook` 5.5.0
- `numpy` 1.14.3
- `scikit-learn` 0.19.1
- `glob2` 0.6
- `librosa` 0.6.3 from `conda-forge`
- `IRMAS dataset` from https://www.upf.edu/web/mtg/irmas

Additionally, it is required that the extracted IRMAS datasets are placed in the following folder:

- D:\College\Soft Computing\Data

The folder structure of the Data folder should then look like this:

- \ IRMAS-TestingData-Part1 \ Part1 \ {.wav & .txt files}
- \ IRMAS-TestingData-Part2 \ IRTestingData-Part2 \ {.wav & .txt files}
- \ IRMAS-TestingData-Part3 \ Part3 \ {.wav & .txt files}
- \ IRMAS-TrainingData \ {instrument label} \ {.wav files}
- \ DEFENSE \ {.wav & .txt files}

##### Instructions

1. Clone the project repository.
2. Open the `IRT.ipynb` file in a `Jupyter Notebook`.
3. From the toolbar, click on `Cell`, then on `Run All`.
4. Enjoy the awesomeness that is this project.
5. Grade generously.

### Code

Necessary libraries:

In [1]:
import time
import importlib
import read
import classi
import randomforest as rf
import crossvalidation as cv



#### TRAINING:

- data reading, feature extraction, time spent and sample size

In [2]:
print("--- TRAINING DATA ---")
time_start = time.perf_counter()
data = read.fext()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data), "audio files")

--- TRAINING DATA ---
Feature extraction time: 944.4008223000001 seconds
Data size: 6705 audio files


- fitting the data using KNN

In [3]:
knn = classi.train(data)

- fitting the data using Random Forest Classifier

In [4]:
rfc = rf.train(data)

#### TESTING ON ONE SAMPLE:

- sample reading and feature extraction

In [5]:
dataone = read.fextonetest()
#print(dataone)

- prediction and validation (using an annotated text file that contains the names of either one or two instruments that are present in the audio file of the same name)

In [6]:
predicted = classi.testone(dataone, knn)
if dataone[3] == predicted[0]:
    print("Correct!", dataone[3], "==", predicted[0])
elif dataone[2] == predicted[0]:
    print("Correct!", dataone[2], "==", predicted[0])
else:
    print("Incorrect!", dataone[3], "!=", predicted[0])

Correct! 5 == 5


#### TESTING (DATA PART 1):

- data reading, feature extraction, time spent and sample size

In [7]:
print("--- TESTING DATA PART 1 ---")
time_start = time.perf_counter()
data1 = read.fexttest1()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data1), "audio files")

--- TESTING DATA PART 1 ---
Feature extraction time: 459.9897995 seconds
Data size: 807 audio files


- prediction on input data, including validation and accuracy calculation

In [8]:
acc1 = classi.testandverify(data1, knn)

- accuracy

In [9]:
print("Accuracy:", acc1, "%")

Accuracy: 38.166047087980175 %


#### TESTING (DATA PART 2):

- data reading, feature extraction, time spent and sample size

In [10]:
print("--- TESTING DATA PART 2 ---")
time_start = time.perf_counter()
data2 = read.fexttest2()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data2), "audio files")

--- TESTING DATA PART 2 ---
Feature extraction time: 691.2876760999998 seconds
Data size: 1301 audio files


- prediction on input data, including validation and accuracy calculation

In [11]:
acc2 = classi.testandverify(data2, knn)

- accuracy

In [12]:
print("Accuracy:", acc2, "%")

Accuracy: 45.04227517294389 %


#### TESTING (DATA PART 3):

- data reading, feature extraction, time spent and sample size

In [13]:
print("--- TESTING DATA PART 3 ---")
time_start = time.perf_counter()
data3 = read.fexttest3()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data3), "audio files")

--- TESTING DATA PART 3 ---
Feature extraction time: 432.1010712999996 seconds
Data size: 766 audio files


- prediction on input data, including validation and accuracy calculation

In [14]:
acc3 = classi.testandverify(data3, knn)

- accuracy

In [15]:
print("Accuracy:", acc3, "%")

Accuracy: 36.0313315926893 %


##### Average accuracy:

In [16]:
print("Average accuracy:", (acc1 + acc2 + acc3)/3, "%")

Average accuracy: 39.74655128453779 %
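
The figure above is an unweighted mean of the three group accuracies. Because the groups differ in size (807, 1301 and 766 files), a size-weighted figure may describe the whole testing set better. The sketch below recomputes it from the values printed above; the function name is hypothetical.

```python
group_sizes = [807, 1301, 766]
group_accuracies = [38.166047087980175, 45.04227517294389, 36.0313315926893]

def pooled_accuracy(accuracies, sizes):
    """Accuracy over all files, weighting each group by its number of files."""
    correct = sum(a / 100.0 * n for a, n in zip(accuracies, sizes))
    return 100.0 * correct / sum(sizes)

overall = pooled_accuracy(group_accuracies, group_sizes)
print("Pooled accuracy:", round(overall, 2), "%")
```

This gives about 40.71%, slightly above the unweighted mean, because the largest group is also the one with the highest accuracy.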


#### TESTING (CUSTOM DATASET)

- data reading, feature extraction, time spent and sample size

In [17]:
print("--- TESTING CUSTOM DATASET ---")
time_start = time.perf_counter()
data4 = read.fextcustom()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data4), "audio files")

--- TESTING CUSTOM DATASET ---
Feature extraction time: 16.964201700000103 seconds
Data size: 30 audio files


- prediction on input data, including validation and accuracy calculation

In [18]:
acc4 = classi.testandverify(data4, knn)

- accuracy

In [19]:
print("Accuracy:", acc4, "%")

Accuracy: 40.0 %


- combining the three testing groups into a single list

In [20]:
alldata = []
alldata.extend(data1)
alldata.extend(data2)
alldata.extend(data3)

#### Cross-validation

- K-Nearest Neighbours

In [21]:
cv.leave_one_out([i[1] for i in data], [i[2] for i in data], knn)
cv.k_fold([i[1] for i in data], [i[2] for i in data], knn)
cv.holdout([i[1] for i in data], [i[2] for i in data], knn)

Leave One Out mean accuracy: 37.53914988814317 %
K-Fold mean accuracy: 36.689458816202155 %
Holdout mean accuracy: 33.33333333333333 %


- Random Forest Classification

In [34]:
cv.leave_one_out([i[1] for i in data], [i[2] for i in data], rfc)
cv.k_fold([i[1] for i in data], [i[2] for i in data], rfc)
cv.holdout([i[1] for i in data], [i[2] for i in data], rfc)

Leave One Out mean accuracy: 50.246085011185684 %
K-Fold mean accuracy: 47.72549324910471 %
Holdout mean accuracy: 66.66666666666666 %


#### Predictions, confusion matrices and classification reports

- K-Nearest Neighbours

In [23]:
knnpred = classi.test(alldata, knn)

In [24]:
for i in range(len(alldata)):
    if knnpred[i] == alldata[i][3]:
        a = alldata[i][2]
        alldata[i][2] = knnpred[i]
        alldata[i][3] = a

In [25]:
rf.print_confusion_matrix([i[2] for i in alldata], knnpred)

[[  5   4   1  30   2   9   6   5   0   7  10]
 [ 16   5   0   9   2   1   9  12   3   2   0]
 [  2   5   0   7  12  22  47   3   1   5  30]
 [  7   3   4 135  21  56  58  18   2   8  82]
 [ 10   4   0  56 305  65  24  17  17   6 178]
 [  2   2   0  12  19  57  16   4   2  14  29]
 [ 19  14   3 123  32 115 220  16   3  11 169]
 [  2   0   0   3   0   2   3  18   1   0   1]
 [  0   0   0   0   4   1   0   6   1   2   0]
 [ 23   1   3   6   0   0   3   0   0  34   0]
 [  8  10   3  23  22  28  11  27   1   7 390]]


In [26]:
rf.print_classification_report([i[2] for i in alldata], knnpred)

             precision    recall  f1-score   support

          1       0.05      0.06      0.06        79
          2       0.10      0.08      0.09        59
          3       0.00      0.00      0.00       134
          4       0.33      0.34      0.34       394
          5       0.73      0.45      0.55       682
          6       0.16      0.36      0.22       157
          7       0.55      0.30      0.39       725
          8       0.14      0.60      0.23        30
          9       0.03      0.07      0.04        14
         10       0.35      0.49      0.41        70
         11       0.44      0.74      0.55       530

avg / total       0.46      0.41      0.41      2874



- Random Forest Classification

In [35]:
prediction = rf.test(alldata, rfc)

In [36]:
for i in range(len(alldata)):
    if prediction[i] == alldata[i][3]:
        a = alldata[i][2]
        alldata[i][2] = prediction[i]
        alldata[i][3] = a

In [37]:
rf.print_confusion_matrix([i[2] for i in alldata], prediction)

[[ 22   5   4  10   6   5   2   9   1   9   1]
 [  5   8   4   8   2   1  12   7   5   2   2]
 [  4  15  14  10   8  14  26  17   4   2   9]
 [ 18  16  18 132  39  38  37  45   3   9  37]
 [ 40  18  17  48 285  69  18  30  23  20  91]
 [  9   5  16  18  24  41  12   7   9  14   8]
 [ 41  26  35  76  48  70 246  50  23  17  68]
 [  3   6   4   2   6   7   2  20   5   2   7]
 [  1   1   0   1   1   1   2   1   4   5   2]
 [  9   5   3   3   1   0   7   6   3  38   1]
 [ 20  17  23  34  73  72  38  45   9  17 200]]


In [38]:
rf.print_classification_report([i[2] for i in alldata], prediction)

             precision    recall  f1-score   support

          1       0.13      0.30      0.18        74
          2       0.07      0.14      0.09        56
          3       0.10      0.11      0.11       123
          4       0.39      0.34      0.36       392
          5       0.58      0.43      0.49       659
          6       0.13      0.25      0.17       163
          7       0.61      0.35      0.45       700
          8       0.08      0.31      0.13        64
          9       0.04      0.21      0.07        19
         10       0.28      0.50      0.36        76
         11       0.47      0.36      0.41       548

avg / total       0.45      0.35      0.38      2874



---

###### Reload .py files (avoids the need to restart the kernel every time the files are changed)

In [31]:
importlib.reload(classi)

<module 'classi' from 'D:\\College\\Soft Computing\\instrecognition-tool\\classi.py'>

In [32]:
importlib.reload(rf)

<module 'randomforest' from 'D:\\College\\Soft Computing\\instrecognition-tool\\randomforest.py'>

In [33]:
importlib.reload(cv)

<module 'crossvalidation' from 'D:\\College\\Soft Computing\\instrecognition-tool\\crossvalidation.py'>

---

### THE END

---