# ASSIGNMENT 1 - Speech recognition

what I will do:
* load all the audio files
* extract from the filenames the digit labels (to use as GT)
* compute all the MFCCs (to use as audio feature)
* split the data into train and test sets
* use a Random Forest algorithm
* test the model
* compute a confusion matrix

In [1]:
import numpy as np
import pandas as pd
import librosa.display
import librosa
import IPython.display as ipd
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
import tqdm

## Create the dictionary

I create a dictionary that I will use both to create the Pandas data frame (which I will use only to split train and test sets) and to store the MFCCs values before converting them to matrices.

In [2]:
data = {'mfcc':[], 'labels':[], 'speaker':[], 'train_set':[], 'train_target_set':[], 'test_set':[], 'test_target_set':[]}

## Compute the data


In [3]:
#some preliminary steps:

data_path = "recordings"  
n_mfcc = 20 
elements = os.listdir(data_path)  

#number_of_elements = len(elements)

### Some notes:

* each audio file generates different numbers of samples --> to make our data homogeneous we'll take the mean value of all the MFCCs we're collecting from all samples. This way we get only 13 coefficients for each audio file --> 13 rows vector for each audio file


* the MFCCs get stored in the dictionary key "mfcc" which is a list of 18 vectors with 13 rows, where 18 is the total number of audio files we're working with and 13 is the number of MFCCs for each of them


* I will use tqdm to see how long it takes for the for loop to finish (took 56 sec on my computer)

In [13]:
#extract the digit labels from each wav file
for i, filename in enumerate(tqdm.tqdm(elements)):
    file_name_components = filename.split("_")
    digit_label = file_name_components[0]
    data["labels"].append(digit_label)
    #data["speaker"].append(speaker_name)  #this is not really a necessary data to collect
        
    #load the audio files
    audio, sr = librosa.load(os.path.join(data_path, filename))
        
    #compute the MFCCs for each audio file
    mfcc = librosa.feature.mfcc(audio, n_fft=2048, hop_length=512, n_mfcc=n_mfcc)
    feature_vector =np.mean(mfcc, axis=1)
    
    #features_matrix[i,:] = feature_vector
    data["mfcc"].append(feature_vector)
    
    

100%|██████████████████████████████████████████████████████████████████████████████| 3000/3000 [01:46<00:00, 28.06it/s]


Next there are some checks that I did to see if everything was working properly...

In [None]:
# len(data["mfcc"])

In [None]:
# data["mfcc"][0].size

In [None]:
# data["mfcc"]

### Listen to the last audio loaded

In [None]:
ipd.Audio(audio, rate=sr)

### Visualize the last MFCCs computed (no mean value yet)

In [None]:
librosa.display.specshow(mfcc,
                        x_axis = 'time',
                        sr=sr)

plt.xlabel('Time')
plt.ylabel('MFCCs')
plt.show() 

## Creating the Pandas Data Frame

I create this data frame for the sole purpose of splitting the data into train and test sets in a non-biased way FOR EACH class of digits. The train_test_split method from scikit learn will provide the randomness and it doesn't require data frames per se... BUT I need the data frame to apply the method on each different class of digits!

In [5]:
df = pd.DataFrame(data["mfcc"])   
df['digit_label']=data["labels"]
#df['speaker_name']=data["speaker"]  #not needed

# call df below if you want to visualize the data frame
#df.head()

## Split test and training sets

NOTE : This is meant to work with whatever group of digits (may be 0-3, 4-7, 0-9...).

* In the first part of the for loop I pick iteratively only one class of digits (first the 0s, then the 1s, the 2s...) and divide the data frame into data and targets.

* In the second part of the loop I actually split the data and the targets into train and test so that the test data will be around 20% of the total (see the test_size argument)

* In the third part I return to numpy arrays, so that I can merge together the train and test data I got from each different digit class (because the classifier works with 2d matrices).

This is very fast for loop.

In [10]:
digits=np.unique(data["labels"])  #understand what digits we are working with

for i in tqdm.tqdm(digits):
    df_same_digit = df[df.digit_label == i]
    X = df_same_digit.iloc[:, 0:13]
    y = df_same_digit.iloc[:, 13] 
    
    #splitting 
    X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.2)
    
    
    X_train_np = (X_train).to_numpy()    #using Pandas' function to transform data frame into numpy array
    data["train_set"].append(X_train_np)
    TRAIN = np.concatenate(data["train_set"], axis=0)
    
    X_test_np = (X_test).to_numpy()   
    data["test_set"].append(X_test_np)
    TEST = np.concatenate(data["test_set"], axis=0)
    
    y_train_np = (y_train).to_numpy()    
    data["train_target_set"].append(y_train_np)
    TRAIN_TARGET = np.concatenate(data["train_target_set"], axis=0)   
    
    y_test_np = (y_test).to_numpy()    
    data["test_target_set"].append(y_test_np)
    TEST_TARGET = np.concatenate(data["test_target_set"], axis=0)

100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 169.93it/s]


Here I run some codes to check how my data now looks.

First I take a look at the results I get immediately after the splitting. You can notice that the splitting produced pandas data frames in which is very easy to see how the splitting algorithm worked.

In [11]:
 print(X_train)  #this is just for the digit "2"
print(type(X_train))

              0           1           2          3          4          5   \
2834 -420.241547  190.577744  -48.976040  18.850777  42.124546 -32.847275   
2845 -404.643829  185.162552  -43.370239  12.719451  35.329224 -33.192661   
2763 -353.636444  223.851776  -30.640776  23.995687  15.994088 -49.407070   
2993 -517.000732  232.470825  -73.659012  14.383049  22.255398 -45.507545   
2985 -499.941162  227.532166  -65.901955   6.783215  25.549255 -38.577709   
...          ...         ...         ...        ...        ...        ...   
2951 -507.667114  246.612625  -82.658821   0.318944   9.632804 -52.175011   
2892 -357.355865  224.650391  -77.109909  30.480385  39.189312 -32.906834   
2817 -464.087219  162.275909  -29.771009  23.153200  39.437393 -22.088482   
2747 -385.234375  188.604111 -101.337753  13.828812  34.147106 -47.880974   
2872 -364.017548  228.201538  -78.769882  25.889086  33.843235 -33.839317   

             6          7          8          9          10         11  \
2

Once you transform those back to numpy arrays you can get this:

In [None]:
# print(X_train_np)          #notice that it's the same as above, obviously
# print((X_train).shape)
# print(type(X_train_np))

It may be useful to check also the concatenated matrices:

In [12]:
 print(TRAIN.shape)
# print(type(TRAIN))
# print(TRAIN)

(4800, 13)


In [None]:
#let's take a look at the target
# print(TRAIN_TARGET.shape)
# print(TRAIN_TARGET)



Let's recap and see if everything is ready to be used with the classifier:


In [7]:
print(TRAIN.shape)
print(TRAIN_TARGET.shape)
print(TEST.shape)
print(TEST_TARGET.shape)


(2400, 13)
(2400,)
(600, 13)
(600,)


## Random Forest Classifier

In [9]:
#create the random forest classifier
clf = RandomForestClassifier(n_jobs=2, random_state=0)

#training
clf.fit(TRAIN, TRAIN_TARGET)

ValueError: Unknown label type: 'continuous'

In [None]:
a = clf.predict(TEST)

# call the following to actually see the prediction results:
#print(a)

To get the confusion matrix I use Pandas function "crosstab".

In [None]:
pd.crosstab(TEST_TARGET, a,  rownames=['Actual digits'], colnames=['Predicted digits'])