# Intent prediction:

`This portion, as well as the other markdown cell, was written outside of the 2-hour time limit. Some extraneous code was deleted from the final result. All code, including comments, was present before the execution of the training cell.`

### Tell us about your favorite machine learning project:

One of my favorite projects was a speech recognition neural net I worked on. I had been programming and learning data science for a little over two months at the time. [GitHub to Project](https://github.com/Zethtren/Speech_Recognition_Transcription_Translation)

#### What were the goals and focuses of this project?

The goal of this project was to build, from scratch, a series of models for predicting which language was being spoken, transcribing what was said in an audio file, and translating it to another language.

#### Describe the technical details related to the project, such as: the input features; model architectures; algorithms; metrics; optimizers; performance evaluation; etc… 

I collected data from a couple of sources (see the GitHub repository for details on this, or for any other questions you may have). For the first two models, I used short-time Fourier transformed audio segments (as 2D NumPy arrays) as inputs. The language-prediction model was a feed-forward CNN that took the 2D array as input and produced a softmax probability for each language (I achieved 98% accuracy here).
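To make the input format concrete, here is a minimal NumPy-only sketch of a short-time Fourier transform that turns a 1D audio signal into a 2D magnitude array. The function name, frame length, hop length, and test tone are all assumptions chosen for illustration; the project's actual preprocessing is in the linked repository and may differ.

```python
import numpy as np

def stft_magnitude(signal, frame_length=256, hop_length=128):
    """Return a (num_frames, frame_length // 2 + 1) magnitude spectrogram."""
    window = np.hanning(frame_length)
    num_frames = 1 + (len(signal) - frame_length) // hop_length
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    # One FFT per frame; keep only the magnitudes
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical test signal: one second of a 1 kHz tone sampled at 8 kHz
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spectrogram = stft_magnitude(tone)
print(spectrogram.shape)  # (61, 129): time frames by frequency bins
```

Each row is one time slice and each column one frequency bin, which is why the result can be fed to a convolutional model much like an image.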

The transcriber was an LSTM that took a very similar NumPy array as input and returned a sequence of single characters spelling out the content of the audio. I only achieved 40% accuracy here, but the model would have benefited greatly from longer training and more data samples.

I tried a combination of SGD and Adam optimizers and models were evaluated strictly on accuracy.

The baseline for the RNN was 1/28 ≈ 3.6%.

#### What were some novel approaches that you employed while solving the problem?

I didn't really take many novel approaches to the problem. It was difficult to find resources on building speech transcription models, so a lot of the model architecture was guess and check. I've had a number of ideas on how to improve it since, but have not had the time or finances to pursue them as a primary goal.

#### What kinds of results did you produce?

98% accuracy when detecting which language was spoken between English and German

40% accuracy on transcribing text. 

I did not attempt translation, as I was unsatisfied with the transcription result. (Many more resources exist on translating than on transcribing.)

#### What would you change about this project?

A lot. I was satisfied with the first model and would like to retrain it on a wider range of languages.

For the transcription model I've had a lot of ideas to explore.
First I would like to see if I could set a threshold value (for silence) to separate words across a speech file. This would allow me to build models that predict words instead of letters. I have a feeling this would be much easier.
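As a rough illustration of the silence-threshold idea, the sketch below splits a signal wherever the frame-level RMS energy stays under a threshold for a few consecutive frames. The function name and every parameter value are assumptions picked for the toy example, not settings from the project; real speech would need tuning and probably smoothing.

```python
import numpy as np

def split_on_silence(signal, frame_length=200, threshold=0.02, min_silent_frames=3):
    """Return (start, end) sample indices of segments separated by silence."""
    num_frames = len(signal) // frame_length
    frames = signal[: num_frames * frame_length].reshape(num_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))  # energy per frame
    is_loud = rms > threshold

    segments = []
    start = None      # frame index where the current segment began
    silent_run = 0    # consecutive quiet frames seen inside a segment
    for index, loud in enumerate(is_loud):
        if loud:
            if start is None:
                start = index
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                end = index - silent_run + 1  # first quiet frame after the segment
                segments.append((start * frame_length, end * frame_length))
                start = None
                silent_run = 0
    if start is not None:  # the signal ended while a segment was still open
        end = num_frames - silent_run
        segments.append((start * frame_length, end * frame_length))
    return segments

# Toy signal: two tone bursts ("words") separated by silence
def burst(n):
    return 0.5 * np.sin(2 * np.pi * 440 * np.arange(n) / 8000)

signal = np.concatenate([np.zeros(1000), burst(2000), np.zeros(1000), burst(1600), np.zeros(1000)])
segments = split_on_silence(signal)
print(segments)  # [(1000, 3000), (4000, 5600)]
```

Each returned span could then be transformed and classified as a whole word instead of character by character.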

If the audio has to come in as a stream, I would like to spend the time to make the samples smaller and train the model to predict just a handful of characters at a time. This would reduce dependencies on previously spoken letters and allow the model to train faster and better. It would also allow the model to be served in a way that feels more natural and reactive, instead of waiting until the sentence is finished.

I've discussed more about it in my project page as well.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import urllib.request
import numpy as np
import json
import random
import gensim
import gensim.downloader

In [2]:
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [3]:
# Download a pre-built word-vector map
vectors = gensim.downloader.load('glove-wiki-gigaword-50')

In [4]:
# Show use of the vectors
# Converts a word into a 50-dimensional embedding vector
# Smaller embeddings could be built around a specific vocabulary
vectors.get_vector('some')

array([ 9.2871e-01, -1.0834e-01,  2.1497e-01, -5.0237e-01,  1.0379e-01,
        2.2728e-01, -5.4198e-01, -2.9008e-01, -6.4607e-01,  1.2664e-01,
       -4.1487e-01, -2.9343e-01,  3.6855e-01, -4.1733e-01,  6.9116e-01,
        6.7341e-02,  1.9715e-01, -3.0465e-02, -2.1723e-01, -1.2238e+00,
        9.5469e-03,  1.9594e-01,  5.6595e-01, -6.7473e-02,  5.9208e-02,
       -1.3909e+00, -8.9275e-01, -1.3546e-01,  1.6200e-01, -4.0210e-01,
        4.1644e+00,  3.7816e-01,  1.5797e-01, -4.8892e-01,  2.3131e-01,
        2.3258e-01, -2.5314e-01, -1.9977e-01, -1.2258e-01,  1.5620e-01,
       -3.1995e-01,  3.8314e-01,  4.7266e-01,  8.7700e-01,  3.2223e-01,
        1.3292e-03, -4.9860e-01,  5.5580e-01, -7.0359e-01, -5.2693e-01],
      dtype=float32)

In [5]:
# Download and read in the data

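def get_response(url):
    # urlopen raises an HTTPError for most failed requests; the explicit status
    # check covers any other non-200 response instead of returning an
    # undefined value.
    with urllib.request.urlopen(url) as response:
        if response.getcode() != 200:
            raise RuntimeError("Error receiving data: HTTP " + str(response.getcode()))
        return response.read()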

data = get_response('https://raw.githubusercontent.com/clinc/oos-eval/master/data/data_full.json')

In [6]:
json_data = json.loads(data)

In [7]:
json_data.keys()

dict_keys(['oos_val', 'val', 'train', 'oos_test', 'test', 'oos_train'])

In [8]:
# Split the data into corresponding segments

val = json_data['val']
test = json_data['test']
train = json_data['train']

In [9]:
# Collect the distinct intent labels; 20 of them are chosen at random below
# Use random seed for reproducibility

pre_choice_set = list({label for _, label in train})

In [10]:
chosen_set = set()
random.seed(42)
while len(chosen_set) < 20:
    chosen_set.add(random.choice(pre_choice_set))
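
One caveat on reproducibility: Python randomizes string hashes per process by default, so the order of a `set` of strings (and therefore of `pre_choice_set`) can change between interpreter runs even with a fixed seed. A minimal sketch of a selection that does not depend on set order follows; `choose_intents` and the stand-in labels are hypothetical, and this would pick a different 20 intents than the run recorded here.

```python
import random

def choose_intents(labels, k=20, seed=42):
    """Pick k distinct labels the same way on every run."""
    unique_sorted = sorted(set(labels))  # sorting removes the dependence on set order
    return random.Random(seed).sample(unique_sorted, k)

# Hypothetical stand-in for the intent labels in the training split
example_labels = [f"intent_{n}" for n in range(150)] * 2

first = choose_intents(example_labels)
second = choose_intents(example_labels)
print(first == second)  # True
```

`random.sample` also guarantees distinct picks, so no `while` loop is needed to reach 20.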

In [11]:
chosen_set

{'bill_balance',
 'calendar_update',
 'change_user_name',
 'exchange_rate',
 'freeze_account',
 'goodbye',
 'ingredients_list',
 'lost_luggage',
 'meeting_schedule',
 'pto_used',
 'reminder_update',
 'report_lost_card',
 'routing',
 'schedule_maintenance',
 'share_location',
 'spelling',
 'timezone',
 'todo_list_update',
 'what_can_i_ask_you',
 'yes'}

In [12]:
# Keep only the examples whose intent is in the chosen set

train = [example for example in train if example[1] in chosen_set]
test  = [example for example in test if example[1] in chosen_set]
val   = [example for example in val if example[1] in chosen_set]

In [13]:
# I cannot shuffle from here on out, otherwise the inputs and answers will not line up
train_split_words = [text.split(" ") for text, _ in train]
val_split_words   = [text.split(" ") for text, _ in val]
test_split_words  = [text.split(" ") for text, _ in test]
train_answer      = [label for _, label in train]
val_answer        = [label for _, label in val]
test_answer       = [label for _, label in test]

In [14]:
# Define a fallback for missing words (mostly numbers in this sample set)

def get_vector(string):
    try:
        return vectors.get_vector(string)
    except KeyError:
        # Word not in the GloVe vocabulary: fall back to a zero vector
        return np.zeros(50, dtype=np.float32)

In [15]:
# Arbitrary value roughly double the actual max

pad_length = 50

In [16]:
pad_value = np.zeros(50)

# Look up one embedding per word, sentence by sentence
train_vectors = [[get_vector(word) for word in words] for words in train_split_words]
val_vectors   = [[get_vector(word) for word in words] for words in val_split_words]
test_vectors  = [[get_vector(word) for word in words] for words in test_split_words]

In [17]:
def adjust_padding(some_list, pad_length=pad_length, pad_value=pad_value):
    # Return a new list padded with pad_value up to pad_length,
    # leaving the input list unchanged
    missing = max(0, pad_length - len(some_list))
    return list(some_list) + [pad_value] * missing

In [18]:
# Create padded train, val, and test sets so they fit the fixed-size model

train_padded = [adjust_padding(sentence) for sentence in train_vectors]
val_padded   = [adjust_padding(sentence) for sentence in val_vectors]
test_padded  = [adjust_padding(sentence) for sentence in test_vectors]

In [19]:
np.array(train_padded).shape

(2000, 50, 50)

In [20]:
# Build a ConvNet for classifying the texts
# Could also use an RNN but due to two hour time-limit and data shaping issues
# with feeding a many-to-one model I opted for ConvNet

class ClassifierModel(nn.Module):

    def __init__(self):

        super(ClassifierModel, self).__init__()

        self.state_input = nn.Conv2d(1, 4, (3, 50))
        self.conv2 = nn.Conv1d(4, 8, (3))
        self.conv3 = nn.Conv1d(8, 16, (3))
        self.linear1 = nn.Linear(704, 64)
        self.linear2 = nn.Linear(64, 32)
        self.probs_pre = nn.Linear(32, 20)
        self.probabilities = nn.Softmax(dim=1)

    def forward(self, x):

        x = F.relu(self.state_input(x))
        x = x.view(x.shape[0:3])

        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))

        x = x.view(-1, 16*44)

        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = self.probabilities(self.probs_pre(x))
        
        return x
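
The `704` in `linear1` and the `16*44` in the final `view` follow from the layer shapes. The short sketch below redoes that arithmetic in plain Python; `conv_output_size` is a helper written for this note, using the standard output-size formula for a convolution without dilation.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length of a convolution along one axis (no dilation)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

words, embedding_dim = 50, 50  # each input is a 50 x 50 grid: padded words by GloVe dimensions

# Conv2d(1, 4, (3, 50)): slides over 3 words at a time and spans the full embedding width
after_conv2d = (conv_output_size(words, 3), conv_output_size(embedding_dim, 50))
# x.view(x.shape[0:3]) then drops the trailing axis of size 1, leaving (batch, 4, 48)

after_conv2 = conv_output_size(after_conv2d[0], 3)  # Conv1d(4, 8, 3)
after_conv3 = conv_output_size(after_conv2, 3)      # Conv1d(8, 16, 3)
flattened = 16 * after_conv3                        # 16 channels times remaining length

print(after_conv2d, after_conv2, after_conv3, flattened)  # (48, 1) 46 44 704
```

Changing `pad_length` or any kernel size would change `flattened`, so `linear1` and the `view` call would need to be updated together.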

In [21]:
# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [22]:
model = ClassifierModel().float().to(device)

In [23]:
# Map each chosen intent name to an integer class index
chosen_dict = {j: i for i, j in enumerate(chosen_set)}

# Replace the intent names with their class indices
train_answer = [chosen_dict[label] for label in train_answer]
val_answer   = [chosen_dict[label] for label in val_answer]
test_answer  = [chosen_dict[label] for label in test_answer]


In [25]:

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=0.0075)

# Build the input and target tensors once instead of on every epoch
train_inputs = torch.tensor(np.array(train_padded)).reshape((2000, 1, 50, 50)).float().to(device)
train_targets = torch.tensor(train_answer)
val_inputs = torch.tensor(np.array(val_padded)).reshape((400, 1, 50, 50)).float().to(device)
val_targets = torch.tensor(val_answer)

for epoch in range(5000):  # full-batch training: one optimizer step per epoch

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    outputs = model(train_inputs)
    loss = criterion(outputs.to('cpu'), train_targets)
    loss.backward()
    optimizer.step()

    val_outputs = model(val_inputs)
    val_loss = criterion(val_outputs.to('cpu'), val_targets)

    # print statistics
    if epoch % 50 == 0:
        print("Loss: " + str(loss))
        print("Val Loss: " + str(val_loss))

print('Finished Training')

Loss: tensor(2.9958, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.9957, grad_fn=<NllLossBackward>)
Loss: tensor(2.7369, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.7384, grad_fn=<NllLossBackward>)
Loss: tensor(2.6558, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.7014, grad_fn=<NllLossBackward>)
Loss: tensor(2.6323, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.6665, grad_fn=<NllLossBackward>)
Loss: tensor(2.6042, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.6444, grad_fn=<NllLossBackward>)
Loss: tensor(2.6012, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.6467, grad_fn=<NllLossBackward>)
Loss: tensor(2.6002, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.6479, grad_fn=<NllLossBackward>)
Loss: tensor(2.5989, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.6483, grad_fn=<NllLossBackward>)
Loss: tensor(2.5981, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.6469, grad_fn=<NllLossBackward>)
Loss: tensor(2.5976, grad_fn=<NllLossBackward>)
Val Loss: tensor(2.6418, grad_fn=<NllLossBackward>)


## End of time limit

I started the training on this model just before the 2 hour deadline. There were a couple of iterations on learning rates.

One learning rate using AdamW fared a little better initially but also seemed to cap off. I'd definitely like to modify the architecture and optimizer a little, because I feel the results could be better. Once it's done training as I set it up here, I will run an accuracy test on the test set, although I'm not expecting great results.
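A possible contributor to the plateau, offered as an observation about the code rather than something verified by rerunning the notebook: `nn.CrossEntropyLoss` applies log-softmax internally and expects raw scores, while `ClassifierModel.forward` already ends in `nn.Softmax`. The NumPy sketch below (the helper `cross_entropy_from_scores` is made up for illustration) reproduces that computation for 20 classes and shows how feeding probabilities instead of raw scores bounds the loss from below.

```python
import numpy as np

def cross_entropy_from_scores(scores, target):
    """Cross-entropy computed as log-softmax of the scores followed by negative log-likelihood."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

num_classes = 20

# Softmax output at initialization is roughly uniform: loss = ln(20) ≈ 2.9957
uniform_probs = np.full(num_classes, 1.0 / num_classes)
loss_at_start = cross_entropy_from_scores(uniform_probs, 0)

# Best case when the scores are probabilities: a perfect one-hot prediction
one_hot_probs = np.zeros(num_classes)
one_hot_probs[0] = 1.0
loss_floor = cross_entropy_from_scores(one_hot_probs, 0)  # ln(e + 19) - 1 ≈ 2.078

# Raw scores are unbounded, so the same loss can approach zero
confident_logits = np.zeros(num_classes)
confident_logits[0] = 20.0
loss_with_logits = cross_entropy_from_scores(confident_logits, 0)

print(loss_at_start, loss_floor, loss_with_logits)
```

The logged losses start at about 2.996 and level off near 2.6, which sits between these two bounds. If that is indeed the cause, returning `self.probs_pre(x)` from `forward` during training and applying softmax only at prediction time would be one thing to try.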

Given more time, I would have attempted an LSTM-based architecture that classified from the final hidden state. This would have reduced the amount of prep work needed.

I'm also going to look across the dataset and use this [TF-IDF](https://github.com/Zethtren/NLP_Exploration) class I built about a week ago. I opted against this since the challenge indicated the desire to make this function as a deployable model.

TF-IDF requires the entire dataset to compare against for document frequency. The implementation above only requires that each input have at most 50 words, and that limit could also be changed.

(TF-IDF would have given me the fastest and easiest results.)
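To show why TF-IDF depends on the whole dataset, here is a minimal pure-Python sketch. It is a generic illustration, not the class from the linked repository; the function names and the three-sentence corpus are invented for the example.

```python
import math
from collections import Counter

def fit_idf(tokenized_documents):
    """Inverse document frequency; needs every document up front."""
    num_documents = len(tokenized_documents)
    document_frequency = Counter()
    for tokens in tokenized_documents:
        document_frequency.update(set(tokens))  # count each term once per document
    return {term: math.log(num_documents / count)
            for term, count in document_frequency.items()}

def tf_idf(tokens, idf):
    """Weight each term by its in-sentence frequency times its corpus-level IDF."""
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * idf.get(term, 0.0)
            for term, count in counts.items()}

# Invented mini-corpus in the style of the intent data
corpus = [
    "what is my bill balance".split(),
    "what is the exchange rate".split(),
    "freeze my account".split(),
]
idf = fit_idf(corpus)
weights = tf_idf("what is my balance".split(), idf)
print(weights)
```

The `idf` table has to be computed over the full corpus and shipped with the model, and words that never appeared in the corpus get a weight of zero. That is the deployment cost mentioned above.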

As stated, an LSTM model would allow for more flexibility in ingestion, having no requirements on sentence length. However, the cost is a more complicated architecture with longer training and prediction times, although I know from past usage that it would likely be more accurate.
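A minimal sketch of what such a many-to-one LSTM classifier could look like in PyTorch, assuming the same 50-dimensional GloVe inputs and 20 intents. The class name and hidden size are hypothetical, and this was not trained or evaluated as part of this notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class LSTMClassifierSketch(nn.Module):
    def __init__(self, embedding_dim=50, hidden_dim=64, num_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, padded, lengths):
        # Packing tells the LSTM where each sentence really ends,
        # so the padding never influences the final hidden state
        packed = pack_padded_sequence(
            padded, lengths, batch_first=True, enforce_sorted=False
        )
        _, (hidden, _) = self.lstm(packed)
        # hidden[-1] is the last layer's final hidden state: one vector per sentence
        return self.classifier(hidden[-1])  # raw scores, suitable for CrossEntropyLoss

torch.manual_seed(0)
sketch_model = LSTMClassifierSketch().eval()

# Three sentences of different lengths, zero-padded to the longest (9 words)
batch = torch.randn(3, 9, 50)
lengths = [9, 4, 6]
with torch.no_grad():
    logits = sketch_model(batch, lengths)
print(logits.shape)  # torch.Size([3, 20])
```

Because the batch only needs padding to its own longest sentence, the fixed `pad_length` of 50 would no longer be a constraint on the input.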

In [26]:
outputs = model(torch.tensor(test_padded).reshape((600,1,50,50)).float().to(device))

In [34]:
maxes = np.argmax(outputs.to('cpu').detach().numpy(), axis = 1)

In [40]:
correct_guesses = [maxes[i] == test_answer[i] for i in range(len(maxes))]

In [41]:
correct_guesses.count(True)

338

In [42]:
correct_guesses.count(False)

262

In [44]:
print("Percent correct (Accuracy): " + str(correct_guesses.count(True)/len(correct_guesses)))

Percent correct (Accuracy): 0.5633333333333334


As stated above, I knew the result wouldn't be great, but 56.3% isn't terrible for a preliminary model, and a lot was clearly working.

A custom reduced-dimension embedding layer could improve accuracy.
More data and longer training would likely also have helped.