# **Homework 2-1 Phoneme Classification**

## The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT)
The TIMIT corpus of read speech was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.

This homework is a multiclass classification task:
we are going to train a deep neural network classifier to predict the phoneme for each frame of the TIMIT speech corpus.

link: https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3

## Download Data
Download the data from Google Drive, then unzip it.

You should have `timit_11/train_11.npy`, `timit_11/train_label_11.npy`, and `timit_11/test_11.npy` after running this block.<br><br>
`timit_11/`
- `train_11.npy`: training data<br>
- `train_label_11.npy`: training labels<br>
- `test_11.npy`: testing data<br><br>

**Note: if the Google Drive link is dead, you can download the data directly from Kaggle and upload it to the workspace.**




## Preparing Data
Load the training and testing data from the `.npy` files (NumPy arrays).

In [1]:
import numpy as np

print('Loading data ...')

data_root='./timit_11/'
train = np.load(data_root + 'train_11.npy')
train_label = np.load(data_root + 'train_label_11.npy')
test = np.load(data_root + 'test_11.npy')

print('Size of training data: {}'.format(train.shape))
print('Size of testing data: {}'.format(test.shape))

Loading data ...
Size of training data: (1229932, 429)
Size of testing data: (451552, 429)
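
Each row has 429 features. Assuming, as the `timit_11` name suggests, that a row concatenates 11 consecutive frames of 39-dimensional MFCC features (11 × 39 = 429), each row can be viewed as a small window of frames. The sketch below uses a zero-filled stand-in array rather than the real data; `N_FRAMES`, `N_MFCC`, and the frame-major layout are assumptions, not something the notebook states.

```python
import numpy as np

N_FRAMES = 11   # assumed context window size (suggested by the "timit_11" name)
N_MFCC = 39     # assumed per-frame feature dimension; 11 * 39 = 429

# Stand-in for `train`: 4 rows with 429 features each
demo = np.zeros((4, N_FRAMES * N_MFCC), dtype=np.float32)

# View each row as an (11, 39) window, assuming frames are stored one after another
windows = demo.reshape(-1, N_FRAMES, N_MFCC)

# Middle frame of each window (assumed to be the frame the label refers to)
center = windows[:, N_FRAMES // 2, :]

print(windows.shape)  # (4, 11, 39)
print(center.shape)   # (4, 39)
```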


## Create Dataset

In [2]:
import torch
from torch.utils.data import Dataset

class TIMITDataset(Dataset):
    def __init__(self, X, y=None):
        self.data = torch.from_numpy(X).float()
        if y is not None:
            y = y.astype(np.int64)  # np.int was removed in NumPy 1.24
            self.label = torch.LongTensor(y)
        else:
            self.label = None

    def __getitem__(self, idx):
        if self.label is not None:
            return self.data[idx], self.label[idx]
        else:
            return self.data[idx]

    def __len__(self):
        return len(self.data)


Split the labeled data into a training set and a validation set. You can modify the variable `VAL_RATIO` to change the share of data held out for validation.

In [3]:
VAL_RATIO = 0.2

percent = int(train.shape[0] * (1 - VAL_RATIO))
train_x, train_y, val_x, val_y = train[:percent], train_label[:percent], train[percent:], train_label[percent:]
print('Size of training set: {}'.format(train_x.shape))
print('Size of validation set: {}'.format(val_x.shape))

Size of training set: (983945, 429)
Size of validation set: (245987, 429)
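
The split above is contiguous: the first 80% of rows go to training and the last 20% to validation, with no shuffling. A minimal sketch of the same arithmetic as a reusable function, run on toy arrays; `split_by_ratio` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def split_by_ratio(x, y, val_ratio):
    """Split x and y in order: the first (1 - val_ratio) share trains, the rest validates."""
    cut = int(x.shape[0] * (1 - val_ratio))
    return x[:cut], y[:cut], x[cut:], y[cut:]

# Toy data: 10 rows, 3 features
x = np.arange(30).reshape(10, 3)
y = np.arange(10)

tx, ty, vx, vy = split_by_ratio(x, y, 0.2)
print(tx.shape, vx.shape)  # (8, 3) (2, 3)
```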


## Create Model

Define the model architecture. You are encouraged to change and experiment with it.

In [5]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
rf.fit(train_x, train_y)

train_pred = rf.predict(train_x)
print('Train accuracy', (train_pred == train_y).sum() / train_y.shape[0])
val_pred = rf.predict(val_x)
print('Valid accuracy', (val_pred == val_y).sum() / val_y.shape[0])

KeyboardInterrupt: 

In [14]:
train_x.shape
train_x2, train_y2 = train_x[:80000], train_y[:80000]

In [9]:
# Needed only on scikit-learn < 1.0, where this estimator was still experimental
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
est = HistGradientBoostingClassifier(learning_rate=0.1, max_iter=100, l2_regularization=1e-4,
                                     min_samples_leaf=10, random_state=0)
est.fit(train_x, train_y)

train_pred = est.predict(train_x)
print('Train accuracy', (train_pred == train_y).sum() / train_y.shape[0])
val_pred = est.predict(val_x)
print('Valid accuracy', (val_pred == val_y).sum() / val_y.shape[0])

HistGradientBoostingClassifier(l2_regularization=0.0001, min_samples_leaf=10,
                               random_state=0)

In [10]:

train_pred

array(['36', '36', '36', ..., '0', '0', '0'], dtype='<U2')

0.653822114040927
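
The predictions shown above are strings (`dtype='<U2'`), which suggests the label array stores class IDs as strings. An element-wise comparison is only meaningful when both sides use the same type, so one option is to cast both to integers first. A sketch on toy values; `accuracy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def accuracy(pred, target):
    """Fraction of positions where pred equals target, after casting both to int."""
    pred = np.asarray(pred).astype(np.int64)
    target = np.asarray(target).astype(np.int64)
    return float((pred == target).mean())

# Toy values: string predictions compared against integer labels
pred = np.array(['36', '36', '0', '5'])
target = np.array([36, 35, 0, 5])

print(accuracy(pred, target))  # 0.75
```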

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'learning_rate': [0.1, 0.001, 0.0001], 'max_iter': [100],
              'l2_regularization': [1e-3], 'min_samples_leaf': [20], 'max_depth': [None]}

est = HistGradientBoostingClassifier(random_state=0)
est = GridSearchCV(est, parameters, scoring='accuracy', cv=5, verbose=0, n_jobs=10)
est = est.fit(train_x2, train_y2)

print(est.cv_results_['params'])
print(est.cv_results_['mean_test_score'])
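
`cv_results_` holds one entry per parameter combination, with `params` and `mean_test_score` aligned by index. A sketch of pairing them and ranking the combinations, using a hand-written dictionary in place of a fitted `GridSearchCV`. The scores are placeholders made up for illustration, not measured results.

```python
import numpy as np

# Placeholder stand-in for est.cv_results_; the scores are made up for illustration
cv_results = {
    'params': [
        {'learning_rate': 0.1},
        {'learning_rate': 0.001},
        {'learning_rate': 0.0001},
    ],
    'mean_test_score': np.array([0.60, 0.40, 0.20]),
}

# Pair each parameter set with its mean cross-validated score, best first
ranked = sorted(
    zip(cv_results['params'], cv_results['mean_test_score']),
    key=lambda pair: pair[1],
    reverse=True,
)

best_params, best_score = ranked[0]
print(best_params)
```

On a fitted search object, `est.best_params_` and `est.best_score_` expose the top-ranked combination directly.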

## Training

## Testing

Create a testing dataset and load the model from the saved checkpoint.

In [None]:
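from torch.utils.data import DataLoader

BATCH_SIZE = 64  # assumed value; BATCH_SIZE is not defined earlier in this notebook
# create testing dataset
test_set = TIMITDataset(test, None)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)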

Make predictions.

In [None]:
predict = []
model.eval() # set the model to evaluation mode
with torch.no_grad():
    for i, data in enumerate(test_loader):
        inputs = data
        inputs = inputs.to(device)
        outputs = model(inputs)
        _, test_pred = torch.max(outputs, 1) # get the index of the class with the highest probability

        for y in test_pred.cpu().numpy():
            predict.append(y)
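
The classifiers trained earlier in this notebook are scikit-learn estimators rather than a PyTorch module, so predictions can also come from a `predict` call. A sketch that predicts in consecutive chunks to bound memory use; `ConstantModel` and `predict_in_chunks` are hypothetical names, and the stand-in model always returns class 0.

```python
import numpy as np

class ConstantModel:
    """Hypothetical stand-in for a fitted estimator exposing predict()."""
    def predict(self, X):
        return np.zeros(len(X), dtype=np.int64)

def predict_in_chunks(model, X, chunk_size):
    """Call model.predict on consecutive slices of X and join the results."""
    parts = [model.predict(X[start:start + chunk_size])
             for start in range(0, len(X), chunk_size)]
    return np.concatenate(parts)

X = np.zeros((10, 429), dtype=np.float32)  # toy test matrix
predict = predict_in_chunks(ConstantModel(), X, chunk_size=4)
print(predict.shape)  # (10,)
```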

Write the predictions to a CSV file.

After this block finishes running, download the file `prediction.csv` from the files section on the left-hand side and submit it to Kaggle.

In [None]:
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(predict):
        f.write('{},{}\n'.format(i, y))
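
The same `Id,Class` layout can be produced with the standard `csv` module. A sketch that writes to an in-memory buffer so it runs without touching disk; `predict` here is a toy list standing in for the real predictions.

```python
import csv
import io

predict = [36, 36, 0]  # toy predictions standing in for the real list

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Id', 'Class'])      # header row
for i, y in enumerate(predict):
    writer.writerow([i, y])           # one row per frame: index, predicted class

print(buffer.getvalue())
```

To write the real submission, the buffer can be swapped for `open('prediction.csv', 'w', newline='')`.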