### Classification of spoken Arabic digits from sound

##### Data information:

We have a dataset of `8800` spoken digits, encoded as a time series of `13` Mel-frequency cepstral coefficients. 


###### Train Set:

Filename: `Train_Arabic_Digits.txt`

Each block, delimited by an empty line, corresponds to one spoken digit (one utterance) and consists of `4-93` lines. Each line is one analysis frame holding `13` Mel-frequency cepstral coefficients.

For each digit, the first `330` blocks are spoken by male speakers and the next `330` by female speakers.

Blocks `1-660` correspond to the digit `0`, blocks `661-1320` to the digit `1`, etc.

In summary we should have `6600` training data samples.
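The block layout described above fixes the digit and the speaker gender of every sample by position alone. A minimal sketch of that index arithmetic, assuming 1-based block numbers as in the description; `block_to_label` and its parameter `blocks_per_digit` are hypothetical names used only for illustration:

```python
def block_to_label(block_no, blocks_per_digit=660):
    """Map a 1-based block number to a (digit, gender) pair."""
    if block_no < 1:
        raise ValueError("block numbers start at 1")
    zero_based = block_no - 1
    digit = zero_based // blocks_per_digit
    # within each digit, the first half of the blocks is male, the second half female
    position_in_digit = zero_based % blocks_per_digit
    gender = 'male' if position_in_digit < blocks_per_digit // 2 else 'female'
    return digit, gender

print(block_to_label(1))    # (0, 'male')
print(block_to_label(331))  # (0, 'female')
print(block_to_label(661))  # (1, 'male')
```

Calling it with `blocks_per_digit=220` should give the same mapping for a set with `220` blocks per digit.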

###### Test Set:

Filename: `Test_Arabic_Digits.txt`

Same layout as the training set, except that there are only `220` blocks per digit: the first `110` are spoken by males and the other `110` by females.

In summary we should have `2200` test data samples.


Link to the [data source](http://archive.ics.uci.edu/ml/datasets/Spoken+Arabic+Digit).

In [159]:
%matplotlib inline

import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import operator
import functools
from sklearn.neighbors import KNeighborsClassifier


In [193]:
# read a data file into a list of blocks; every block is a list of frames,
# and every frame is a list of 13 coefficients
def load_raw_set(filename):
    # blocks[0] is the block before the first blank line;
    # the cells below start iterating at index 1
    blocks = [[]]

    with open(filename, 'r') as file:
        for line in file:
            if line.isspace():
                # a blank line starts a new block
                blocks.append([])
            else:
                coeffs = [float(s) for s in line.split()]
                blocks[-1].append(coeffs)

    return blocks

trainings_set_raw = load_raw_set('Train_Arabic_Digits.txt')
test_set_raw      = load_raw_set('Test_Arabic_Digits.txt')
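To make the block format concrete, here is a small sketch that parses an in-memory string instead of the real files. It is a variant, not the loader used in this notebook: it collects only non-empty blocks, so it produces no placeholder entry at index 0 and is not a drop-in replacement for cells that start iterating at index 1. The name `parse_blocks` is hypothetical, and the sample text is made up (3 coefficients per line instead of 13).

```python
import io

def parse_blocks(lines):
    """Group whitespace-separated float lines into blocks split by blank lines."""
    blocks = []
    current = []
    for line in lines:
        if not line.strip():
            # a blank line closes the current block, if it has any frames
            if current:
                blocks.append(current)
                current = []
        else:
            current.append([float(s) for s in line.split()])
    # the last block is not followed by a blank line
    if current:
        blocks.append(current)
    return blocks

# made-up stand-in for a data file: a leading blank line, then two blocks
sample_text = "\n1.0 2.0 3.0\n4.0 5.0 6.0\n\n7.0 8.0 9.0\n"
blocks = parse_blocks(io.StringIO(sample_text))
print(len(blocks), [len(b) for b in blocks])  # 2 [2, 1]
```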


In [194]:
# as the samples have different lengths, we first check the smallest and greatest number of timesteps

def check_timestep_size(dataset):
    min_len = sys.maxsize
    max_len = 0
    # index 0 is skipped (the initial block created before the first blank line)
    for sample in range(1, len(dataset)):
        cur_len = len(dataset[sample])
        if cur_len < min_len:
            min_len = cur_len
        if cur_len > max_len:
            max_len = cur_len

    # number of samples actually inspected in this dataset
    n_samples = len(dataset) - 1
    print('Smallest timestep size: {0} for {1} samples'.format(min_len, n_samples))
    print('Greatest timestep size: {0} for {1} samples'.format(max_len, n_samples))

check_timestep_size(trainings_set_raw)
check_timestep_size(test_set_raw)

Smallest timestep size: 4 for 6600 samples
Greatest timestep size: 93 for 6600 samples
Smallest timestep size: 7 for 2200 samples
Greatest timestep size: 83 for 2200 samples


In [195]:
# we will append zeros to every sample until each one reaches 93 timesteps

def add_zero_padding(dataset):

    for ix in range(1, len(dataset)):
        while len(dataset[ix]) < 93:
            # append a fresh all-zero frame of 13 coefficients
            dataset[ix].append([0.] * 13)

add_zero_padding(trainings_set_raw)
add_zero_padding(test_set_raw)

# let's check the sizes again
check_timestep_size(trainings_set_raw)
check_timestep_size(test_set_raw)

Smallest timestep size: 93 for 6600 samples
Greatest timestep size: 93 for 6600 samples
Smallest timestep size: 93 for 2200 samples
Greatest timestep size: 93 for 2200 samples
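The padding step can also be written without modifying the input and without hard-coding the target length. The following is only a sketch under those assumptions; `pad_samples`, `n_coeffs` and `target_len` are hypothetical names, and the toy data uses 2 coefficients per frame instead of 13.

```python
def pad_samples(samples, n_coeffs=13, target_len=None):
    """Return copies of the samples, zero-padded to a common number of timesteps."""
    if target_len is None:
        # default: pad up to the longest sample in this collection
        target_len = max(len(sample) for sample in samples)
    padded = []
    for sample in samples:
        missing = target_len - len(sample)
        # build a new list so the input sample is left untouched
        padded.append(list(sample) + [[0.0] * n_coeffs for _ in range(missing)])
    return padded

# toy data: one sample with 1 timestep, one with 3 timesteps
toy = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]]
padded = pad_samples(toy, n_coeffs=2)
print([len(sample) for sample in padded])  # [3, 3]
```

Because the longest sample differs between the two sets (`93` timesteps in the training set, `83` in the test set), both would need the same explicit `target_len=93` to end up with matching dimensions.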


In [196]:
# now we will flatten the timesteps so every sample will have a dimension of 93*13

def flatten_set(dataset):
    is_flattened = True
    new_dataset  = []

    # index 0 is skipped (the initial block created before the first blank line)
    for ix in range(1, len(dataset)):
        # concatenate the 93 frames of 13 coefficients into one flat list
        flat = functools.reduce(operator.add, dataset[ix])
        is_flattened &= len(flat) == 93 * 13
        new_dataset.append(flat)

    print('Everything is flattened: {0}'.format(is_flattened))
    return new_dataset

trainings_set = flatten_set(trainings_set_raw)
test_set      = flatten_set(test_set_raw)

Everything is flattened: True
Everything is flattened: True
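Concatenating the frames with `functools.reduce(operator.add, ...)` builds a new intermediate list at every step. As a possible alternative, `itertools.chain.from_iterable` flattens a sample in a single pass. A minimal sketch with a made-up sample of 3 frames and 2 coefficients each; `flatten_sample` is a hypothetical helper name:

```python
from itertools import chain

def flatten_sample(sample):
    """Flatten a list of frames (lists of coefficients) into one flat list."""
    return list(chain.from_iterable(sample))

# toy sample: 3 timesteps with 2 coefficients each, the last one zero padding
toy_sample = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
flat = flatten_sample(toy_sample)
print(flat)  # [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
```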


In [197]:
# now we have to create the target labels

trainings_set_targets = [ [0] * 660
                        , [1] * 660
                        , [2] * 660
                        , [3] * 660
                        , [4] * 660
                        , [5] * 660
                        , [6] * 660
                        , [7] * 660
                        , [8] * 660
                        , [9] * 660]

test_set_targets = [ [0] * 220
                   , [1] * 220
                   , [2] * 220
                   , [3] * 220
                   , [4] * 220
                   , [5] * 220
                   , [6] * 220
                   , [7] * 220
                   , [8] * 220
                   , [9] * 220]
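The target lists above are nested: ten inner lists, one per digit. scikit-learn estimators such as the imported `KNeighborsClassifier` expect one label per sample, so a flat sequence is likely what is needed before fitting. A minimal sketch, assuming the samples are ordered digit by digit as described in the data information; `make_targets` and the `*_flat` variable names are hypothetical:

```python
def make_targets(samples_per_digit, n_digits=10):
    """Return a flat label list: samples_per_digit zeros, then ones, and so on."""
    return [digit for digit in range(n_digits) for _ in range(samples_per_digit)]

trainings_targets_flat = make_targets(660)
test_targets_flat      = make_targets(220)

print(len(trainings_targets_flat), len(test_targets_flat))  # 6600 2200
```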


