---

# InstRecognitionTool

An instrument recognition tool.

---

## Concept

A brief summary of the tool, the datasets used, how features were extracted from each dataset and how they were classified.

### Dataset

The dataset is called IRMAS, and can be downloaded from: https://www.upf.edu/web/mtg/irmas

It is split into training data and three separate groups of testing data.

All audio files are in 16-bit stereo format, sampled at 44.1 kHz.

##### Instruments

Below are the 11 instruments, their corresponding labels, and the number of audio files for each in the training data.

- cello,                 `cel`,   388
- clarinet,              `cla`,   505
- flute,                 `flu`,   451
- acoustic guitar,       `gac`,   637
- electric guitar,       `gel`,   760
- organ,                 `org`,   682
- piano,                 `pia`,   721
- saxophone,             `sax`,   626
- trumpet,               `tru`,   577
- violin,                `vio`,   580
- human singing voice,   `voi`,   778

##### Training data

- 6705 .wav files
- spread over 11 folders
- folders represent instruments and are labeled as such
- each file name contains an instrument label, a genre label, and in some cases a drums presence label
- each audio file is 3 seconds long
- audio files are excerpts of different recordings from the past century
- audio files vary in quality
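
As an illustration of the naming scheme, the bracketed tags can be pulled out of a file name with a regular expression. This is only a sketch: the example file name and the idea that every label sits in square brackets are assumptions about the IRMAS naming convention, and `parse_tags` is a hypothetical helper that is not part of this project.

```python
import re

# Assumption: labels appear as bracketed tags at the start of the file name.
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")


def parse_tags(file_name):
    """Return every bracketed tag found in a file name, in order of appearance."""
    return TAG_PATTERN.findall(file_name)


example = "[gel][dru][pop_roc]0829__1.wav"  # hypothetical file name
tags = parse_tags(example)
print(tags)
```

The first tag would then be the instrument label, with any remaining tags describing drums presence and genre.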

##### Testing data

- 3 testing groups
- test group 1 contains 807 .wav files
- test group 2 contains 1301 .wav files
- test group 3 contains 766 .wav files
- file names do not contain instrument labels
- audio file lengths vary from around 5 to 20 seconds
- audio files are excerpts from existing songs

##### Validation data

- each .wav file in the testing data groups is accompanied by an annotation .txt file with the same name
- each annotation file contains one or two instrument labels, corresponding to the instruments present in the .wav file
- the annotation files were created manually
- they are used to validate the predictions
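
Reading such an annotation could look roughly like the sketch below. It assumes one label per line, possibly padded with whitespace; the exact file layout is not described here, and `parse_annotation` is a hypothetical helper.

```python
def parse_annotation(text):
    """Return the instrument labels listed in the text of an annotation file.

    Assumption: one label per line, with optional surrounding whitespace.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical content of an annotation file with two labels.
sample_text = "gel\t\nvoi\t\n"
labels = parse_annotation(sample_text)
print(labels)
```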

### Machine learning

A brief summary of the feature extraction and classification algorithms used.

##### Feature extraction

20 _MFCCs_ (_Mel-Frequency Cepstral Coefficients_) were used as features. They were obtained using the `librosa` library for Python.

MFCCs are derived as follows:

1. Take the _Fourier Transform_ of (a windowed excerpt of) a signal
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows
3. Take the logs of the powers at each of the mel frequencies
4. Take the _Discrete Cosine Transform_ (_DCT_) of the list of mel log powers, as if it were a signal
5. The _MFCCs_ are the amplitudes of the resulting spectrum
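
The five steps above can be sketched for a single frame using only `numpy`. This is an illustration of the general recipe, not the code `librosa` runs internally: the frame length, the number of mel filters, the mel formula and the unnormalised DCT are all assumptions chosen for brevity.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular, overlapping filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc_frame(frame, sample_rate, n_mfcc=20, n_filters=40):
    n_fft = frame.size
    # 1. Fourier transform of the windowed frame, turned into a power spectrum
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    # 2. Map the powers onto the mel scale with triangular windows
    mel_power = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    # 3. Take the log of each mel-band power (small offset avoids log(0))
    log_mel = np.log(mel_power + 1e-10)
    # 4. Discrete cosine transform (type II) of the log powers
    n = np.arange(n_filters)
    k = np.arange(n_mfcc)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    # 5. The MFCCs are the amplitudes of the resulting spectrum
    return dct_basis @ log_mel


sample_rate = 44100
t = np.arange(2048) / sample_rate
frame = np.sin(2 * np.pi * 440.0 * t)  # synthetic 440 Hz tone as test input
coeffs = mfcc_frame(frame, sample_rate)
print(coeffs.shape)
```

In practice a whole recording is cut into many overlapping frames, and the per-frame coefficients are then summarised (for example by averaging) into one feature vector per file.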

##### Classification

###### K-Nearest Neighbours

The _KNN_ (_K-Nearest Neighbours_) algorithm was used for instrument classification. It was taken from the `neighbors` module of the `sklearn` library.

The input for training consisted of _MFCCs_ and the corresponding classes (instruments), represented by numbers ranging from 1 to 11. Each class was obtained by parsing the path of the audio file and taking the name of its parent folder, which was then converted to a number.
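
A minimal sketch of that path-to-class step is shown below. The alphabetical ordering of the labels and the helper name `class_from_path` are assumptions made for illustration; the project may number the folders differently.

```python
from pathlib import PurePath

# Assumption: labels are numbered 1..11 in alphabetical order.
LABELS = ["cel", "cla", "flu", "gac", "gel", "org", "pia", "sax", "tru", "vio", "voi"]
CLASS_OF = {label: number for number, label in enumerate(LABELS, start=1)}


def class_from_path(path):
    """Return the class number of an audio file, based on its parent folder name."""
    folder = PurePath(path).parent.name
    return CLASS_OF[folder]


# Hypothetical file path inside the training data folder.
print(class_from_path("IRMAS-TrainingData/gel/example.wav"))
```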

The input for testing consisted of _MFCCs_ only, and it was used for prediction. The output was the predicted class, represented by a number, ranging from 1 to 11.

During testing, the predictions were also validated against the validation data described above.

Accuracy was calculated as the percentage of correct predictions within a testing group (and later across all testing groups). A prediction counted as correct if the predicted class matched the class (or either of the two classes) given in the validation data.
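
The accuracy rule described above can be sketched as follows. The function name and the toy inputs are hypothetical; only the "match any annotated class" rule comes from the description.

```python
def accuracy_percent(predictions, annotations):
    """Percentage of predictions that match any of the annotated classes."""
    hits = sum(
        1 for predicted, allowed in zip(predictions, annotations) if predicted in allowed
    )
    return 100.0 * hits / len(predictions)


# Toy example: four files, each annotated with one or two classes.
predictions = [5, 7, 11, 4]
annotations = [(5,), (11, 7), (1,), (4, 5)]
print(accuracy_percent(predictions, annotations))  # 75.0
```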

K was set to 97, which is approximately the square root of the number of elements in the dataset. It is also an odd prime, which is commonly recommended, and it produced the best results of the values tested.
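
The square-root rule of thumb can be checked against the file counts listed earlier. Which files count as "the dataset" is an assumption here: the training files alone give a value near 82, while the training and testing files together give a value close to 97.

```python
import math

training_files = 6705
testing_files = 807 + 1301 + 766  # the three testing groups

sqrt_training = math.sqrt(training_files)
sqrt_all = math.sqrt(training_files + testing_files)

print(round(sqrt_training, 1))
print(round(sqrt_all, 1))
```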

That means that for every element within the input test data, 97 nearest neighbours were taken into account when determining the class it belongs to. The output class is determined by the plurality vote; that is, the most common class out of the 97 neighbours is the output class.
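
The voting rule can be sketched in a few lines of plain Python. This is not the `sklearn` implementation, only an illustration of plurality voting with Euclidean distance; the toy data and the name `knn_predict` are made up for the example.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_features, train_classes, query, k):
    """Classify `query` by plurality vote among its k nearest training points."""
    ranked = sorted(
        zip(train_features, train_classes),
        key=lambda pair: squared_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D data: three points of class 1 near the origin, two of class 2 further away.
train_features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_classes = [1, 1, 1, 2, 2]

print(knn_predict(train_features, train_classes, (4, 4), k=3))
```

With K = 3, the query point `(4, 4)` has two neighbours of class 2 and one of class 1, so class 2 wins the vote.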

The algorithm was tested for K = {5, 11, 31, 97}.

- For K = 5, overall accuracy was around 31% for the entire testing set.
- For K = 11, overall accuracy was around 34% for the entire testing set.
- For K = 31, overall accuracy was around 38% for the entire testing set.
- For K = 97, overall accuracy was around 40% for the entire testing set.

Accuracy improved as K increased, but with diminishing returns: raising K from 31 to 97 added only about two percentage points. Increasing K did not negatively affect performance.

###### Random Forest Classification

### Conclusion

An accuracy of 40% is low, but a higher accuracy might be achievable.

Possible ways to improve it:

- using some cross-validation method, such as K-Fold
- increasing the number of features
- making the features more unique, perhaps by extracting features differently
- limiting the dataset to only a few instruments
- using different ML algorithms (such as Random Forest)
- etc.

#### Update

Random Forest Classifier increased the overall accuracy from 40% to 46% (as reported by precision).

---

## Implementation

### Instructions and prerequisites

##### Prerequisites

- `Anaconda` 5.2.0 for Python 3.6 (py36_3)
- `conda` 4.6.3 (py36_0)
- `Jupyter notebook` 5.5.0
- `numpy` 1.14.3
- `scikit-learn` 0.19.1
- `glob2` 0.6
- `librosa` 0.6.3 from `conda-forge`
- `IRMAS dataset` from https://www.upf.edu/web/mtg/irmas

Additionally, the extracted IRMAS datasets must be placed in the following folder:

- `D:\College\Soft Computing\Data`

The `Data` folder should then have the following structure:

- \ IRMAS-TestingData-Part1 \ Part1 \ {.wav & .txt files}
- \ IRMAS-TestingData-Part2 \ IRTestingData-Part2 \ {.wav & .txt files}
- \ IRMAS-TestingData-Part3 \ Part3 \ {.wav & .txt files}
- \ IRMAS-TrainingData \ {instrument label} \ {.wav files}
- \ DEFENSE \ {.wav & .txt files}
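
Before running the notebook, the layout above could be checked with a short script such as the sketch below. The helper `missing_folders` is hypothetical and not part of the project; the `{instrument label}` sub-folders are not checked.

```python
from pathlib import Path

# Sub-folders listed above, relative to the Data folder.
EXPECTED = [
    Path("IRMAS-TestingData-Part1") / "Part1",
    Path("IRMAS-TestingData-Part2") / "IRTestingData-Part2",
    Path("IRMAS-TestingData-Part3") / "Part3",
    Path("IRMAS-TrainingData"),
    Path("DEFENSE"),
]


def missing_folders(data_root):
    """Return the expected sub-folders that do not exist under `data_root`."""
    root = Path(data_root)
    return [str(sub) for sub in EXPECTED if not (root / sub).is_dir()]


# An empty list means every expected folder was found.
print(missing_folders(r"D:\College\Soft Computing\Data"))
```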

##### Instructions

1. Clone the project repository.
2. Open the `IRT.ipynb` file in a `Jupyter Notebook`.
3. From the toolbar, click on `Cell`, then on `Run All`.
4. Enjoy the awesomeness that is this project.
5. Grade generously.

### Code

Necessary libraries:

In [1]:
import time
import importlib
import read
import classi
import randomforest as rf
import crossvalidation as cv



#### TRAINING:

- data reading, feature extraction, time spent and sample size

In [2]:
print("--- TRAINING DATA ---")
time_start = time.perf_counter()
data = read.fext()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data), "audio files")

--- TRAINING DATA ---
Feature extraction time: 1050.2539766 seconds
Data size: 6705 audio files


- fitting the data using KNN

In [3]:
knn = classi.train(data)

- fitting the data using Random Forest Classifier

In [4]:
rfc = rf.train(data)

#### TESTING ON ONE SAMPLE:

- sample reading and feature extraction

In [5]:
dataone = read.fextonetest()
#print(dataone)

- prediction and validation (using an annotated text file that contains the names of either one or two instruments that are present in the audio file of the same name)

In [6]:
predicted = classi.testone(dataone, knn)
# dataone[2] and dataone[3] hold the annotated classes; matching either one counts as correct
if dataone[3] == predicted[0]:
    print("Correct!", dataone[3], "==", predicted[0])
elif dataone[2] == predicted[0]:
    print("Correct!", dataone[2], "==", predicted[0])
else:
    print("Incorrect!", dataone[3], "!=", predicted[0])

Correct! 5 == 5


#### TESTING (DATA PART 1):

- data reading, feature extraction, time spent and sample size

In [7]:
print("--- TESTING DATA PART 1 ---")
time_start = time.perf_counter()
data1 = read.fexttest1()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data1), "audio files")

--- TESTING DATA PART 1 ---
Feature extraction time: 545.0384208999999 seconds
Data size: 807 audio files


- prediction on input data, including validation and accuracy calculation

In [8]:
acc1 = classi.testandverify(data1, knn)

- accuracy

In [9]:
print("Accuracy:", acc1, "%")

Accuracy: 38.166047087980175 %


#### TESTING (DATA PART 2):

- data reading, feature extraction, time spent and sample size

In [10]:
print("--- TESTING DATA PART 2 ---")
time_start = time.perf_counter()
data2 = read.fexttest2()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data2), "audio files")

--- TESTING DATA PART 2 ---
Feature extraction time: 737.5692674000002 seconds
Data size: 1301 audio files


- prediction on input data, including validation and accuracy calculation

In [11]:
acc2 = classi.testandverify(data2, knn)

- accuracy

In [12]:
print("Accuracy:", acc2, "%")

Accuracy: 45.04227517294389 %


#### TESTING (DATA PART 3):

- data reading, feature extraction, time spent and sample size

In [13]:
print("--- TESTING DATA PART 3 ---")
time_start = time.perf_counter()
data3 = read.fexttest3()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data3), "audio files")

--- TESTING DATA PART 3 ---
Feature extraction time: 461.4000931999999 seconds
Data size: 766 audio files


- prediction on input data, including validation and accuracy calculation

In [14]:
acc3 = classi.testandverify(data3, knn)

- accuracy

In [15]:
print("Accuracy:", acc3, "%")

Accuracy: 36.0313315926893 %


##### Average accuracy:

In [16]:
print("Average accuracy:", (acc1 + acc2 + acc3)/3, "%")

Average accuracy: 39.74655128453779 %
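
The figure above is the unweighted mean of the three group accuracies. Because the groups differ in size, pooling all test files gives a slightly different number. The sketch below compares the two; the counts of correct predictions are inferred from the printed percentages and group sizes, so treat them as an assumption.

```python
# (correct predictions, total files) per testing group.
# The "correct" counts are inferred from the printed accuracies, not read from the code.
groups = {"part1": (308, 807), "part2": (586, 1301), "part3": (276, 766)}

per_group = [100.0 * correct / total for correct, total in groups.values()]
mean_of_groups = sum(per_group) / len(per_group)

total_correct = sum(correct for correct, _ in groups.values())
total_files = sum(total for _, total in groups.values())
pooled = 100.0 * total_correct / total_files

print(round(mean_of_groups, 2))
print(round(pooled, 2))
```

The pooled value weights each file equally, while the mean of groups weights each group equally.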


#### TESTING (CUSTOM DATASET)

- data reading, feature extraction, time spent and sample size

In [17]:
print("--- TESTING CUSTOM DATASET ---")
time_start = time.perf_counter()
data4 = read.fextcustom()
elapsed_time = time.perf_counter() - time_start
print("Feature extraction time:", elapsed_time, "seconds")
print("Data size:", len(data4), "audio files")

--- TESTING CUSTOM DATASET ---
Feature extraction time: 18.405843999999888 seconds
Data size: 30 audio files


- prediction on input data, including validation and accuracy calculation

In [18]:
acc4 = classi.testandverify(data4, knn)

- accuracy

In [19]:
print("Accuracy:", acc4, "%")

Accuracy: 40.0 %


In [20]:
alldata = []
alldata.extend(data1)
alldata.extend(data2)
alldata.extend(data3)

#### Cross-validation

- K-Nearest Neighbours

In [53]:
features = [i[1] for i in data]  # second element of each training item
labels = [i[2] for i in data]    # third element of each training item
cv.leave_one_out(features, labels, knn)
cv.k_fold(features, labels, knn)
cv.holdout(features, labels, knn)

Leave One Out mean accuracy: 37.55406413124534 %
K-Fold mean accuracy: 37.03214182440999 %
Holdout mean accuracy: 66.66666666666666 %
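
The internals of the `crossvalidation` module are not shown here. As a rough illustration of the K-Fold idea only, the sketch below splits sample indices into contiguous folds, using each fold once for testing and the rest for training. The function name is hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(test)
```

If the samples are stored class by class, they would need to be shuffled before splitting; otherwise some folds would contain only a few classes.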


- Random Forest Classification

In [None]:
features = [i[1] for i in data]  # second element of each training item
labels = [i[2] for i in data]    # third element of each training item
cv.leave_one_out(features, labels, rfc)
cv.k_fold(features, labels, rfc)
cv.holdout(features, labels, rfc)

#### Predictions, confusion matrices and classification reports

- K-Nearest Neighbours

In [34]:
knnpred = classi.test(alldata, knn)

In [37]:
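# If the prediction matches the second annotated class, swap the two annotations
# so that index 2 holds the matched class for the confusion matrix and report.
for i in range(len(alldata)):
    if knnpred[i] == alldata[i][3]:
        alldata[i][2], alldata[i][3] = knnpred[i], alldata[i][2]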

In [38]:
rf.print_confusion_matrix([i[2] for i in alldata], knnpred)

[[  5   4   1  28   2   9   6   2   0   7  10]
 [ 14   5   0   9   2   1   9  12   3   2   0]
 [  3   5   0   6  12  20  46   3   1   5  21]
 [  7   3   4 135  21  55  59  16   2   8  88]
 [ 10   4   0  46 305  63  23  18  17   6 176]
 [  2   2   0  12  15  57  16   3   2  14  33]
 [ 19  12   3 135  29 113 220  15   2  11 148]
 [  2   2   0   4   0   6   3  18   2   0  21]
 [  0   0   0   0   4   1   0   7   1   2   1]
 [ 24   1   3   6   0   0   3   3   0  34   0]
 [  8  10   3  23  29  32  12  29   1   7 390]]


In [40]:
rf.print_classification_report([i[2] for i in alldata], knnpred)

             precision    recall  f1-score   support

          1       0.05      0.07      0.06        74
          2       0.10      0.09      0.10        57
          3       0.00      0.00      0.00       122
          4       0.33      0.34      0.34       398
          5       0.73      0.46      0.56       668
          6       0.16      0.37      0.22       156
          7       0.55      0.31      0.40       707
          8       0.14      0.31      0.20        58
          9       0.03      0.06      0.04        16
         10       0.35      0.46      0.40        74
         11       0.44      0.72      0.54       544

avg / total       0.46      0.41      0.41      2874
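
The per-class figures in the report can be recomputed from the KNN confusion matrix printed earlier. The sketch below does this for class 5 and for the overall accuracy, treating rows as annotated classes and columns as predicted classes (each row sum equals the `support` column, which is consistent with that reading).

```python
# KNN confusion matrix copied from the output above.
matrix = [
    [5, 4, 1, 28, 2, 9, 6, 2, 0, 7, 10],
    [14, 5, 0, 9, 2, 1, 9, 12, 3, 2, 0],
    [3, 5, 0, 6, 12, 20, 46, 3, 1, 5, 21],
    [7, 3, 4, 135, 21, 55, 59, 16, 2, 8, 88],
    [10, 4, 0, 46, 305, 63, 23, 18, 17, 6, 176],
    [2, 2, 0, 12, 15, 57, 16, 3, 2, 14, 33],
    [19, 12, 3, 135, 29, 113, 220, 15, 2, 11, 148],
    [2, 2, 0, 4, 0, 6, 3, 18, 2, 0, 21],
    [0, 0, 0, 0, 4, 1, 0, 7, 1, 2, 1],
    [24, 1, 3, 6, 0, 0, 3, 3, 0, 34, 0],
    [8, 10, 3, 23, 29, 32, 12, 29, 1, 7, 390],
]


def recall(matrix, cls):
    """Correct predictions for a class divided by the number of files of that class."""
    row = matrix[cls - 1]
    return row[cls - 1] / sum(row)


def precision(matrix, cls):
    """Correct predictions for a class divided by all predictions of that class."""
    column = [row[cls - 1] for row in matrix]
    return matrix[cls - 1][cls - 1] / sum(column)


def accuracy(matrix):
    """Diagonal (correct predictions) divided by the total number of files."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


print(round(precision(matrix, 5), 2), round(recall(matrix, 5), 2))  # 0.73 0.46
print(round(accuracy(matrix), 2))  # 0.41
```

The overall accuracy equals the support-weighted average recall, which is why it matches the `recall` value on the `avg / total` line.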



- Random Forest Classification

In [35]:
prediction = rf.test(alldata, rfc)

In [36]:
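# If the prediction matches the second annotated class, swap the two annotations
# so that index 2 holds the matched class for the confusion matrix and report.
for i in range(len(alldata)):
    if prediction[i] == alldata[i][3]:
        alldata[i][2], alldata[i][3] = prediction[i], alldata[i][2]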

In [39]:
rf.print_confusion_matrix([i[2] for i in alldata], prediction)

[[ 13   7   7  21   3   5   1   2   2   5   8]
 [ 11   4   3   5   1   2  16   8   1   3   3]
 [  1  16   8  11   9  20  28  14   5   2   8]
 [ 28  17  23 136  29  39  38  18  12   9  49]
 [ 25  15  14  56 240  83  23  25  35  17 135]
 [  4  10  11   9  36  30  24   9   7   8   8]
 [ 28  44  39  87  38  90 239  37  24  20  61]
 [  1   2   2   4   1   5   5  33   2   2   1]
 [  3   1   0   0   2   1   0   4   3   2   0]
 [  5   7   6   6   1   0   2  10   2  35   0]
 [ 19  17  25  34  88  62  18  56  14  27 184]]


In [41]:
rf.print_classification_report([i[2] for i in alldata], prediction)

             precision    recall  f1-score   support

          1       0.09      0.18      0.12        74
          2       0.03      0.07      0.04        57
          3       0.06      0.07      0.06       122
          4       0.37      0.34      0.35       398
          5       0.54      0.36      0.43       668
          6       0.09      0.19      0.12       156
          7       0.61      0.34      0.43       707
          8       0.15      0.57      0.24        58
          9       0.03      0.19      0.05        16
         10       0.27      0.47      0.34        74
         11       0.40      0.34      0.37       544

avg / total       0.42      0.32      0.35      2874



---

###### Reload .py files (avoids the need to restart the kernel every time the files are changed)

In [None]:
importlib.reload(classi)

In [None]:
importlib.reload(rf)

In [50]:
importlib.reload(cv)

<module 'crossvalidation' from 'D:\\College\\Soft Computing\\instrecognition-tool\\crossvalidation.py'>

---

### THE END

---