<a href="https://colab.research.google.com/github/agimu/SpokenNumerals/blob/main/ECS708P_miniproject_submission_Ajay_Girish_Munjamani_200611136.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name : **Ajay Girish Munjamani**

# 1 Basic solution

Please do the following tasks first:

1. Go to https://drive.google.com/
2. Create a folder named 'Data' in 'MyDrive'. On the left, click 'New' > 'Folder', enter the name 'Data', and click 'create'
3. Open the 'Data' folder and create a folder named 'MLEnd'.
4. Move the file 'trainingMLEnd.csv' to the newly created folder 'MyDrive/Data/MLEnd'.
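
Steps 2 and 3 can also be done from the notebook once Drive has been mounted. This is a minimal sketch, assuming the standard Colab mount point `/content/drive`; the helper name `ensure_data_dir` is hypothetical and not part of the original notebook.

```python
import os

def ensure_data_dir(base="/content/drive/MyDrive"):
    """Create <base>/Data/MLEnd if it does not exist and return its path."""
    path = os.path.join(base, "Data", "MLEnd")
    os.makedirs(path, exist_ok=True)  # no error if the folders already exist
    return path
```

The CSV file still has to be moved into the returned folder by hand, as in step 4.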



In [None]:
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, sys, re, pickle, glob
import urllib.request
import zipfile

#from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm
import librosa
drive.mount('/content/drive')

Mounted at /content/drive


Check whether MLEnd folder exists.

In [None]:
path = '/content/drive/MyDrive/Data/MLEnd'
os.listdir(path)

['trainingMLEnd.csv', 'training', 'training.zip']

Downloading the Data

In [None]:
def download_url(url, save_path):
    with urllib.request.urlopen(url) as dl_file:
        with open(save_path, 'wb') as out_file:
            out_file.write(dl_file.read())

url  = "https://collect.qmul.ac.uk/down?t=6H8231DQL1NGDI9A/613DLM2R3OFV5EEH9INK2OG"
save_path = '/content/drive/MyDrive/Data/MLEnd/training.zip'
download_url(url, save_path)
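
download_url above reads the whole archive into memory before writing it to disk. A minimal sketch of a streaming variant using only the standard library follows; the name `download_url_streaming` and the chunk size are assumptions, not part of the original notebook.

```python
import shutil
import urllib.request

def download_url_streaming(url, save_path, chunk_size=1024 * 1024):
    """Copy the response to disk in chunks instead of loading it all at once."""
    with urllib.request.urlopen(url) as response, open(save_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file, length=chunk_size)
    return save_path
```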

Unzipping the file

In [None]:
directory_to_extract_to = '/content/drive/MyDrive/Data/MLEnd/training/'
with zipfile.ZipFile(save_path, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

Let's now load the contents of 'trainingMLEnd.csv' into a pandas DataFrame called 'labels' and collect the paths of the audio files in 'files'.

In [None]:
files = glob.glob('/content/drive/MyDrive/Data/MLEnd/training/*/*.wav')
labels = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/trainingMLEnd.csv')
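
Each audio file is matched to its row in `labels` by file name. A minimal sketch of building that lookup once and checking for files without a label, assuming the CSV has the 'File ID' and 'intonation' columns that getXy reads; the helper names `build_label_lookup` and `split_labelled` are hypothetical.

```python
import os
import pandas as pd

def build_label_lookup(labels, column="intonation"):
    """Map each 'File ID' to its value in `column`."""
    return labels.set_index("File ID")[column].to_dict()

def split_labelled(files, lookup):
    """Separate paths that have a label from those that do not."""
    labelled = [f for f in files if os.path.basename(f) in lookup]
    missing = [f for f in files if os.path.basename(f) not in lookup]
    return labelled, missing
```

A dictionary lookup avoids filtering the whole DataFrame once per file, which matters when looping over 20000 recordings.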

# Extracting Features

For feature extraction, I use the pyAudioAnalysis library. We first need to install it together with its requirements and dependencies.

In [None]:
!pip install pyAudioAnalysis
!pip install matplotlib==3.1.2
!pip install simplejson==3.16.0
!pip install scipy==1.4.1
!pip install numpy==1.18.1
!pip install hmmlearn==0.2.2
!pip install eyeD3==0.9.5
!pip install pydub==0.24.0
!pip install scikit_learn
!pip install tqdm==4.46.0
!pip install plotly==4.1.1
from pyAudioAnalysis import ShortTermFeatures as STF


Collecting pyAudioAnalysis
  Downloading https://files.pythonhosted.org/packages/71/42/09adc0229b78dc514004ecf83508afa36a998502a36a4ebdacc14ae55fcf/pyAudioAnalysis-0.3.6.tar.gz (52.4MB)
     |████████████████████████████████| 52.4MB 76kB/s 
Building wheels for collected packages: pyAudioAnalysis
  Building wheel for pyAudioAnalysis (setup.py) ... done
  Created wheel for pyAudioAnalysis: filename=pyAudioAnalysis-0.3.6-cp37-none-any.whl size=52589856 sha256=de70228744b3be625b93ea20ddb4d11585d61e1d0550b6a37857570a3761f1a8
  Stored in directory: /root/.cache/pip/wheels/fd/74/c2/361da76b03ed9d45c1b606d8fd25ac53ab965f754061fc4805
Successfully built pyAudioAnalysis
Installing collected packages: pyAudioAnalysis
Successfully installed pyAudioAnalysis-0.3.6


# Preprocessing
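Creating a NumPy array with 6 features per audio file. getXy takes the list of files to process, so the number of files considered for the model can be controlled through that argument. Two of the features (zero crossing rate and energy entropy) come from the short-term features module of the pyAudioAnalysis library.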

In [None]:
def getPitch(x,fs,winLen=0.02):
  # Frame length: the window in samples (winLen*fs), rounded up to a power of two
  p = winLen*fs
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  # pYIN estimates the fundamental frequency; f0 is NaN for unvoiced frames
  f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length,hop_length=hop_length)
  return f0,voiced_flag

def getXy(files,labels_file,scale_audio=False, onlySingleDigit=False):
  # Note: onlySingleDigit is accepted but not used in this function
  X,y =[],[]
  for file in tqdm(files):
    fileID = file.split('/')[-1]
    yi = list(labels_file[labels_file['File ID']==fileID]['intonation'])[0]
    fs = None # sr=None keeps the file's native sampling rate (librosa's default would resample to 22050 Hz)
    x, fs = librosa.load(file,sr=fs)
    if scale_audio: x = x/np.max(np.abs(x))
    f0, voiced_flag = getPitch(x,fs,winLen=0.02)
    # Short-term features from pyAudioAnalysis, computed over the whole signal
    ZCR = STF.zero_crossing_rate(x)
    energy_entropy = STF.energy_entropy(x)
    # Computed but not included in the feature vector below
    mfcc = librosa.feature.mfcc(y=x, sr=fs)
    rolloff = librosa.feature.spectral_rolloff(y=x, sr=fs)
    spec_cent = librosa.feature.spectral_centroid(y=x, sr=fs)
    #frqs = STF.phormants(x,44100)
    #centroid, spread = STF.spectral_centroid_spread(fourier, sampling_rate=fs)
      
    power = np.sum(x**2)/len(x)
    # Pitch statistics ignore unvoiced (NaN) frames; 0 if no frame is voiced
    pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0))<1 else 0
    pitch_std  = np.nanstd(f0) if np.mean(np.isnan(f0))<1 else 0
    voiced_fr = np.mean(voiced_flag)

    xi = [power,pitch_mean,pitch_std,voiced_fr,ZCR,energy_entropy]
    X.append(xi)
    y.append(yi)
  return np.array(X),np.array(y)
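
The frame length used in getPitch and two of the scalar features can be checked on a synthetic signal. This is a minimal sketch using NumPy only; `next_pow2_frame`, `power` and `zero_crossing_rate` are illustrative re-implementations, and the zero crossing rate here is a simple definition assumed for illustration, not pyAudioAnalysis's exact code.

```python
import numpy as np

def next_pow2_frame(win_len, fs):
    """Window length in samples rounded up to a power of two, as in getPitch."""
    p = win_len * fs
    return int(2 ** int(p - 1).bit_length())

def power(x):
    """Mean squared amplitude, the `power` feature in getXy."""
    return np.sum(x ** 2) / len(x)

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.sign(x)
    return np.mean(signs[1:] != signs[:-1])

fs = 22050
t = np.arange(fs) / fs               # one second of samples
x = np.sin(2 * np.pi * 220 * t)      # 220 Hz sine with amplitude 1

print(next_pow2_frame(0.02, fs))     # 512
print(round(power(x), 3))            # 0.5
print(zero_crossing_rate(x))         # close to 2 * 220 / fs
```

A pure tone crosses zero twice per period, which is why the rate is close to twice the frequency divided by the sampling rate.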

Applying getXy to all the files in the dataset to train and test the model

In [None]:
X,y = getXy(files,labels_file=labels,scale_audio=True, onlySingleDigit=True)

Streaming output truncated to the last 5000 lines.
 74%|███████▍  | 14892/20000 [1:56:05<38:50,  2.19it/s]
 75%|███████▍  | 14906/20000 [1:56:12<39:13,  2.16it/s]

The shapes of X and y are:

In [None]:
print('The shape of X is', X.shape) 
print('The shape of y is', y.shape)

The shape of X is (20000, 6)
The shape of y is (20000,)


# Saving the Array for future use

In [None]:
np.save('features.npy',X)
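
np.save above stores only X, so y has to stay in memory or be recomputed after a restart. A minimal sketch that stores both arrays in a single file with np.savez; the helper names and the file name in the usage line are assumptions.

```python
import numpy as np

def save_dataset(path, X, y):
    """Write features and labels to a single .npz archive."""
    np.savez(path, X=X, y=y)

def load_dataset(path):
    """Read back the arrays written by save_dataset."""
    with np.load(path) as data:
        return data["X"], data["y"]
```

For example, `save_dataset('features_labels.npz', X, y)` followed later by `X, y = load_dataset('features_labels.npz')`.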

# Splitting the data set

We split the dataset into a training set (70%) and a validation set (30%) using the train_test_split function from sklearn.model_selection.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((14000, 6), (6000, 6), (14000,), (6000,))
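
The split above is drawn at random on each run, so the accuracies can change slightly between runs. A minimal sketch of a reproducible split that also keeps the class proportions equal in both parts, using the `random_state` and `stratify` parameters of train_test_split; the helper name and the seed value 0 are arbitrary choices.

```python
from sklearn.model_selection import train_test_split

def split_reproducibly(X, y, test_size=0.3, seed=0):
    """70/30 split with a fixed seed and the same class balance in both parts."""
    return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)
```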

# Modeling

We use the random forest classifier (RandomForestClassifier) from sklearn to predict the intonation label of each recording.

In [None]:
from sklearn.ensemble import RandomForestClassifier

Classifier = RandomForestClassifier()
Classifier.fit(X_train, y_train)
yt_p = Classifier.predict(X_train)
yv_p = Classifier.predict(X_val)


# Performance
For classification problems, the metrics commonly used to evaluate an algorithm are accuracy, the confusion matrix, precision, recall, and F1 score. We will be using functions from sklearn.metrics to evaluate the performance.

In [None]:
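from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# sklearn expects (y_true, y_pred). Predictions are passed first here, so the rows
# of the matrix are predicted classes and the precision/recall columns are swapped.
print(confusion_matrix(yv_p,y_val))
print(classification_report(yv_p,y_val))
print(accuracy_score(yv_p, y_val))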


[[958  71 450  58]
 [ 96 851 192 349]
 [364 151 782 105]
 [ 64 422 134 953]]
              precision    recall  f1-score   support

       bored       0.65      0.62      0.63      1537
     excited       0.57      0.57      0.57      1488
     neutral       0.50      0.56      0.53      1402
    question       0.65      0.61      0.63      1573

    accuracy                           0.59      6000
   macro avg       0.59      0.59      0.59      6000
weighted avg       0.59      0.59      0.59      6000

0.5906666666666667
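
The metrics named above can all be derived from a confusion matrix. This is a minimal NumPy sketch, assuming the scikit-learn convention of rows as true classes and columns as predicted classes; the function names and the toy labels are illustrative only.

```python
import numpy as np

def confusion(y_true, y_pred, classes):
    """Count matrix with rows = true class, columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

def precision_recall_f1(cm):
    """Per-class precision, recall and F1 (assumes every class is predicted at least once)."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # correct / all predicted as the class
    recall = tp / cm.sum(axis=1)      # correct / all truly in the class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

classes = ["bored", "excited"]
y_true = ["bored", "bored", "bored", "excited"]
y_pred = ["bored", "bored", "excited", "excited"]
cm = confusion(y_true, y_pred, classes)
precision, recall, f1 = precision_recall_f1(cm)
accuracy = np.trace(cm) / cm.sum()    # correct predictions / all predictions
```

Transposing the matrix swaps precision and recall but leaves accuracy and F1 unchanged.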


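We will also compare the training and validation accuracy. A training accuracy of 1.0 against a validation accuracy of about 0.59 indicates that the random forest overfits the training set.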

In [None]:
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))

Training Accuracy 1.0
Validation  Accuracy 0.5906666666666667


# 2 Advanced solution

For the advanced solution, the aim is to predict the digit from the 'digit_label' attribute in the label file, using a few additional sound features. To do that, we make small changes to the getXy function and add a few more features to improve the model.

To save time, we reuse the NumPy array saved earlier instead of recomputing the features. Only X was saved, so y still holds the intonation labels from the basic solution.


In [None]:
X = np.load('features.npy')

The shapes of X and y are:

In [None]:
print('The shape of X is', X.shape) 
print('The shape of y is', y.shape)

The shape of X is (20000, 6)
The shape of y is (20000,)


Splitting the dataset into training and validation sets

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((14000, 6), (6000, 6), (14000,), (6000,))

# Modeling

To identify the digits based on 'digit_label', I follow my intuition and use an SVM from sklearn. To be sure, we also compare it with the random forest model. Note that the predictors are normalised with the training-set mean and standard deviation to improve performance.

In [None]:
from sklearn import svm
mean = X_train.mean(0)
sd =  X_train.std(0)

X_train = (X_train-mean)/sd
X_val  = (X_val-mean)/sd

model  = svm.SVC(C=1,gamma=2)
model.fit(X_train,y_train)

yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))

Training Accuracy 0.7382857142857143
Validation  Accuracy 0.569
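
C=1 and gamma=2 above are fixed by hand. A minimal sketch of choosing them by cross-validation on the training set, with the scaling done inside a Pipeline so that each fold is normalised using only its own training part; the grid values and the helper name `tune_svm` are assumptions, not settings from the original experiment.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tune_svm(X_train, y_train, cv=3):
    """Grid-search C and gamma for an RBF SVM on standardised features."""
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {
        "svc__C": [0.1, 1, 10],          # illustrative values only
        "svc__gamma": [0.01, 0.1, 1],    # illustrative values only
    }
    search = GridSearchCV(pipeline, grid, cv=cv)
    search.fit(X_train, y_train)
    return search
```

After fitting, `search.best_params_` holds the selected values and `search.score(X_val, y_val)` gives the validation accuracy. The unnormalised features should be passed in, since the pipeline does its own scaling.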


# Performance


In [None]:
# Predictions are passed first, so rows are predicted classes and precision/recall are swapped
print(confusion_matrix(yv_p,y_val))
print(classification_report(yv_p,y_val))
print(accuracy_score(yv_p, y_val))

[[920  86 454  46]
 [132 803 183 407]
 [380 169 731 104]
 [ 68 402 155 960]]
              precision    recall  f1-score   support

       bored       0.61      0.61      0.61      1506
     excited       0.55      0.53      0.54      1525
     neutral       0.48      0.53      0.50      1384
    question       0.63      0.61      0.62      1585

    accuracy                           0.57      6000
   macro avg       0.57      0.57      0.57      6000
weighted avg       0.57      0.57      0.57      6000

0.569


Let us also try a random forest on the same data and see if the accuracy improves.

In [None]:
Classifier = RandomForestClassifier()
Classifier.fit(X_train, y_train)
yt_p = Classifier.predict(X_train)
yv_p = Classifier.predict(X_val)

Performance for RandomForests

In [None]:
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
# Predictions are passed first, so rows are predicted classes and precision/recall are swapped
print(confusion_matrix(yv_p,y_val))
print(classification_report(yv_p,y_val))
print(accuracy_score(yv_p, y_val))

Training Accuracy 1.0
Validation  Accuracy 0.5813333333333334
[[951  62 415  43]
 [110 789 189 372]
 [365 180 769 123]
 [ 74 429 150 979]]
              precision    recall  f1-score   support

       bored       0.63      0.65      0.64      1471
     excited       0.54      0.54      0.54      1460
     neutral       0.50      0.54      0.52      1437
    question       0.65      0.60      0.62      1632

    accuracy                           0.58      6000
   macro avg       0.58      0.58      0.58      6000
weighted avg       0.58      0.58      0.58      6000

0.5813333333333334


# Conclusion
For the basic solution, we achieved a validation accuracy of 59% by modeling our dataset with the RandomForests algorithm. For the advanced solution, I expected the support vector machine to give better results than RandomForests, but this was not the case: the SVM reached 56.9% validation accuracy against 58.1% for RandomForests, so RandomForests was better for the advanced solution as well.
An accuracy of about 60% may not seem high, but considering the limitations of the dataset and of my knowledge of audio and its features, I conclude that the RandomForests algorithm works comparatively well in classifying the audio samples based on their intonations and their digit labels.