# Introduction

*Note: this notebook is still a work in progress.*

In early 2017, Quora released a really interesting [dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) of question pairs. Each example in the dataset consists of two questions that are somewhat related to each other, along with a binary label indicating whether or not the two questions are duplicates. For example, "How can I be a good geologist?" and "What should I do to be a great geologist?" have basically the same meaning. Determining whether two questions have the same intent is important for Quora because they don't want several copies of the same question, each with different answers and different links.

In this notebook-style blog post, we'll look at the dataset that Quora released and build a deep learning model that determines whether two questions can be considered duplicates or not.

# Data Loading

We'll start by loading the dataset into a pandas DataFrame. Pandas is the go-to Python library for working with tabular data. It helps with reading input data in CSV or TSV formats and with transforming that data into a form that can be fed into ML models. If you'd like more information on Pandas, check out this [tutorial](https://github.com/adeshpande3/Pandas-Tutorial/blob/master/Pandas%20Tutorial.ipynb), which goes over a lot of the functions and data structures that Pandas uses.

In [1]:
import pandas as pd
df = pd.read_csv('Data/Quora/quora_duplicate_questions.tsv', sep='\t')

Let's take a look at what the first couple rows of the data look like. We can see that each row of the dataset contains an ID for the question pair, IDs for both of the sentences, both of the sentences in text form, and finally a binary label of whether the pair is a duplicate. 

In [2]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


Next, let's take a look at some of the duplicate questions, just to get an idea of what Quora defines as a duplicate pair of questions.

In [3]:
df[df['is_duplicate'] == 1].head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 5 | 5 | 11 | 12 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... | 1 |
| 7 | 7 | 15 | 16 | How can I be a good geologist? | What should I do to be a great geologist? | 1 |
| 11 | 11 | 23 | 24 | How do I read and find my YouTube comments? | How can I see all my Youtube comments? | 1 |
| 12 | 12 | 25 | 26 | What can make Physics easy to learn? | How can you make physics easy to learn? | 1 |
| 13 | 13 | 27 | 28 | What was your first sexual experience like? | What was your first sexual experience? | 1 |


Now, let's see how many different examples we have to work with. 

In [4]:
df.shape

(404290, 6)

One of the most important parts of understanding datasets before applying machine learning models is to see how many examples we have of each class. In this case, we only have two classes (duplicate and not duplicate).

In [5]:
print 'Number of duplicate pairs:', len(df[df['is_duplicate'] == 1])
print 'Number of non-duplicate pairs:', len(df[df['is_duplicate'] == 0])

Number of duplicate pairs: 149263
Number of non-duplicate pairs: 255027


# Word Vectors

As with lots of deep learning approaches to NLP tasks, our first job is to create word vectors. (If you're unfamiliar with word vectors and why we use them, check out my Deep Learning for NLP [tutorial](https://adeshpande3.github.io/adeshpande3.github.io/Deep-Learning-Research-Review-Week-3-Natural-Language-Processing).) Word vectors can either be learned as part of the model itself, or you can use pretrained word vectors, which is the approach we'll be taking here. The vectors were downloaded from the following [website](https://nlp.stanford.edu/projects/glove/), and were trained using the GloVe model.

We'll load the vectors into two data structures, wordsList and wordVectors. wordsList is a Python list that includes all of the words for which we have vectors. wordVectors is a numpy array that contains a 50-dimensional vector for each of the words in wordsList.

In [6]:
import numpy as np

wordsList = np.load('Data/Quora/wordsList.npy').tolist()
wordVectors = np.load('Data/Quora/wordVectors.npy')
print len(wordsList) # Contains all of the words that we have vectors for
print wordVectors.shape # Contains all of the respective vectors
numDimensions = wordVectors.shape[1]

400000
(400000, 50)


Let's look at how we can use these two data structures. Let's say we want to get the word vector for the word 'baseball'. We first look for its index in wordsList and then access the corresponding vector in wordVectors. 

In [7]:
baseballIndex = wordsList.index('baseball')
wordVectors[baseballIndex]

array([-1.93270004,  1.04209995, -0.78514999,  0.91033   ,  0.22711   ,
       -0.62158   , -1.64929998,  0.07686   , -0.58679998,  0.058831  ,
        0.35628   ,  0.68915999, -0.50598001,  0.70472997,  1.26639998,
       -0.40031001, -0.020687  ,  0.80862999, -0.90565997, -0.074054  ,
       -0.87674999, -0.62910002, -0.12684999,  0.11524   , -0.55685002,
       -1.68260002, -0.26291001,  0.22632   ,  0.713     , -1.08280003,
        2.12310004,  0.49869001,  0.066711  , -0.48225999, -0.17896999,
        0.47699001,  0.16384   ,  0.16537   , -0.11506   , -0.15962   ,
       -0.94926   , -0.42833   , -0.59456998,  1.35660005, -0.27506   ,
        0.19918001, -0.36008   ,  0.55667001, -0.70314997,  0.17157   ], dtype=float32)
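
One thing worth noting is that `list.index` scans the list from the start, so each lookup can touch up to 400,000 entries. A common alternative is to build a dictionary from word to index once and reuse it. The snippet below is only a sketch of that idea on a toy vocabulary: `toyWordsList`, `toyWordVectors`, `wordToIndex`, and `lookupVector` are made-up names, and treating the last row as the "unknown word" vector is an assumption that mirrors how index 399999 is used later in this notebook.

```python
# Toy stand-ins for wordsList / wordVectors (hypothetical values, not real GloVe data)
toyWordsList = ['the', 'baseball', 'geologist', 'unk']
toyWordVectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]

# Build the word -> row index mapping once; each lookup is then a hash lookup
# instead of a linear scan through the list
wordToIndex = {word: i for i, word in enumerate(toyWordsList)}

# Assumption: the last row holds the vector used for unknown words
unknownIndex = len(toyWordsList) - 1

def lookupVector(word):
    # dict.get returns the fallback index when the word is not in the vocabulary
    return toyWordVectors[wordToIndex.get(word, unknownIndex)]

print(lookupVector('baseball'))
print(lookupVector('quidditch'))
```

With the real data, the same dictionary could presumably be built from wordsList, and it should matter most in the loops later on that look up every word of every question pair.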

# Model Inputs

Every machine learning and deep learning model needs to define inputs and outputs. Given our task of determining whether a question pair is a duplicate, we have two sentences as input, and a single binary label (1 - duplicate, 0 - not a duplicate). 

![](Data/QuoraModel.png)

Now, let's go through each of the 404,290 question pairs and turn each of the questions into an N x 50 dimensional matrix, where N is the number of words in the sentence. Each question pair will have two associated matrices (one for each question), and then we will concatenate them. The resulting matrix will be the input into our RNN.

![](Data/QuoraInputs.png)

(We'll be treating question marks as a separate word)

Let's see what that looks like for the question pair that we were talking about earlier.

In [8]:
firstQuestion = df.loc[7,'question1'] # Getting the first question of the pair at row index 7
print 'The first question:', firstQuestion
secondQuestion = df.loc[7,'question2'] # Getting the second question of the same pair
print 'The second question:', secondQuestion

The first question: How can I be a good geologist?
The second question: What should I do to be a great geologist?


Before turning the questions into matrices, we'll first need to clean the sentences to separate punctuation from the words, and check that the values passed in are indeed strings. We'll create a function called cleanSentences, do all of our preprocessing in the function, and then return the cleaned sentence. Data preprocessing is extremely important in the field of machine learning, especially so when dealing with natural language inputs. 

In [9]:
import re
def cleanSentences(string):
    if not isinstance(string, basestring):
        # The value passed in isn't really a string, so we'll just return a blank placeholder
        return " " # TODO: Find some better way to deal with these values
    string = string.lower()
    string = re.sub('([.,!?()])', r' \1 ', string) # Separates punctuation from the words
    return string
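
To see what this cleaning step produces, here is a small standalone sketch of the same logic. It is written for Python 3, so it checks against `str` rather than `basestring`; the name `cleanSentencePy3` is made up for this illustration and is not part of the notebook. The non-string case is exercised with a float NaN, which is presumably how missing questions show up after loading the TSV with pandas.

```python
import re

def cleanSentencePy3(text):
    # Assumption: running under Python 3, where `str` replaces Python 2's `basestring`
    if not isinstance(text, str):
        return " "  # same fallback as the notebook's cleanSentences
    text = text.lower()
    # Surround each punctuation mark with spaces so split() treats it as its own token
    return re.sub(r'([.,!?()])', r' \1 ', text)

tokens = cleanSentencePy3("How can I be a good geologist?").split()
print(tokens)

# A non-string value (here a float NaN) falls back to a blank string with no tokens
missingTokens = cleanSentencePy3(float('nan')).split()
print(missingTokens)
```

Note that the question mark comes out as its own token, which is why punctuation counts toward the sequence length.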

Our next job is to call the cleanSentences function on our two sentences, and then calculate the total number of words in those 2 sentences. The variable lenBothSentences should come out to 18 (8 tokens in the first question and 10 in the second, counting the question marks).

In [10]:
firstQuestion = cleanSentences(firstQuestion)
secondQuestion = cleanSentences(secondQuestion)
firstQuestionSplit = firstQuestion.split()
secondQuestionSplit = secondQuestion.split()
lenBothSentences = len(firstQuestionSplit) + len(secondQuestionSplit)
print 'The total number of words:', lenBothSentences

The total number of words: 18


Now that we know the number of words in the two sentences, let's turn the two sentences into a single 18 x 50 matrix. 

In [11]:
firstXInput = np.zeros((lenBothSentences, numDimensions), dtype='float32')
# Fill one row per word: the first question's words, followed by the second question's
for indexCounter, word in enumerate(firstQuestionSplit + secondQuestionSplit):
    try:
        firstXInput[indexCounter] = wordVectors[wordsList.index(word)]
    except ValueError:
        firstXInput[indexCounter] = wordVectors[399999] # Vector for unknown words, in case the word isn't found
firstXInput.shape

(18, 50)

Okay, so now that we know how to create an input matrix for a single sentence pair, let's do the same for every pair in the dataset and create one huge X matrix. This will be the X matrix that we'll feed into our model. This matrix will have the dimensionality S x N x D, where S is the number of examples that we have (350,000 pairs for training and the remaining 54,290 for testing), N is the sequence length, and D is the number of dimensions of the word vectors.

Let's examine N, the sequence length, a bit more carefully. Every sentence pair could have a different sequence length, right? In the question pair we saw earlier, there was a total of 18 tokens (words plus punctuation). Another question pair could have 20, 30, 40, or more. We don't know ahead of time. So, what we'll do is set the sequence length to the maximum length over all question pairs in the dataset. Let's compute that value.

In [None]:
maxSeqLength = 0
for index, row in df.iterrows():
    firstQuestion = cleanSentences(row['question1'])
    secondQuestion = cleanSentences(row['question2'])
    firstQuestionSplit = firstQuestion.split()
    secondQuestionSplit = secondQuestion.split()
    lenBothSentence = len(firstQuestionSplit) + len(secondQuestionSplit)
    if (lenBothSentence > maxSeqLength):
        maxSeqLength = lenBothSentence
print 'The maximum sequence length in the whole dataset is:', maxSeqLength

In [12]:
maxSeqLength = 296

We'll use the first 350,000 examples for training and the rest for testing.

In [13]:
numTrainExamples = 350000
numTestExamples = df.shape[0] - numTrainExamples

Now that we have the max sequence length and have defined our training and testing sets, let's create our X and Y matrices. Remember, the X matrix represents our inputs, and the Y matrix represents our labels. For the pairs that have a length < maxSeqLength, we will pad the rest with zeros.

Since materializing the whole 3-D matrix at once is very memory intensive (the whole X matrix ends up being about 23 GB!), we'll just store the index of each word's vector, and then in our actual Tensorflow graph, we will load in the vectors using the tf.nn.embedding_lookup function.
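
The size estimate can be checked with some quick arithmetic. The sketch below assumes 4 bytes per float32 value and 8 bytes per int64 index, and uses the dataset size, sequence length, and vector dimension from earlier in the notebook; the variable names ending in `Bytes` are just for this illustration.

```python
numPairs = 404290       # rows in the dataset (from df.shape)
maxSeqLength = 296      # longest cleaned question pair
numDimensions = 50      # size of each GloVe vector used here

bytesPerFloat32 = 4
bytesPerInt64 = 8

# Full tensor of word vectors versus a matrix that only stores word indices
denseBytes = numPairs * maxSeqLength * numDimensions * bytesPerFloat32
indexBytes = numPairs * maxSeqLength * bytesPerInt64

print('Dense float32 tensor: %.1f GB' % (denseBytes / 1e9))
print('int64 index matrix:   %.2f GB' % (indexBytes / 1e9))
```

Under these assumptions the index matrix comes in at a little under 1 GB, which is what makes the lookup approach practical.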

In [14]:
numClasses = 2
X = np.zeros((numTrainExamples + numTestExamples, maxSeqLength), dtype='int64')
Y = np.zeros((numTrainExamples + numTestExamples, numClasses), dtype='int32')
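
As a rough illustration of what a single row of X will hold, here is a toy version of the encoding step. The vocabulary `toyVocab`, the helper `encodePair`, and the length `toyMaxSeqLength` are all hypothetical and only meant to show the index-plus-padding idea, not the notebook's actual code.

```python
# Hypothetical miniature vocabulary; the last entry plays the role of the unknown-word index
toyVocab = {'how': 0, 'can': 1, 'i': 2, 'be': 3, 'a': 4, '?': 5, 'unk': 6}
unknownIndex = toyVocab['unk']
toyMaxSeqLength = 12

def encodePair(firstTokens, secondTokens, maxLen):
    # Map each token to its vocabulary index, falling back to the unknown index
    indices = [toyVocab.get(tok, unknownIndex) for tok in firstTokens + secondTokens]
    indices = indices[:maxLen]                      # guard against overly long pairs
    return indices + [0] * (maxLen - len(indices))  # pad the tail with zeros

row = encodePair(['how', 'can', 'i', 'be', 'a', 'geologist', '?'],
                 ['can', 'i', '?'], toyMaxSeqLength)
print(row)
```

One subtlety this makes visible: the padding value 0 is also a valid word index ('how' in this toy vocabulary). In the notebook, padded positions would therefore look up whatever word sits at index 0 of wordsList rather than an all-zero vector, which may or may not matter in practice.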

Now, let's fill in this X matrix. We'll go through every row of the Pandas dataframe using the function iterrows. We will extract the two questions in each pair, clean them, and then fill our X matrix with the index at which each word's vector is located. Then, we'll look at whether the pair was marked as a duplicate. If so, we'll add [0,1] to our Y matrix to represent the 'duplicate' label; otherwise, we'll add [1,0].

If you're running the notebook, the following piece of code is pretty compute intensive, so you can go ahead and comment it out and load in the precomputed X and Y matrices in the next block of code. 

In [None]:
exampleCounter = 0
for index, row in df.iterrows():
    firstQuestionSplit = cleanSentences(row['question1']).split()
    secondQuestionSplit = cleanSentences(row['question2']).split()
    # Store the vocabulary index of every word: first question, then second question
    for indexCounter, word in enumerate(firstQuestionSplit + secondQuestionSplit):
        try:
            X[exampleCounter][indexCounter] = wordsList.index(word)
        except ValueError:
            X[exampleCounter][indexCounter] = 399999 # Index of the vector for unknown words
    # One-hot label: [0,1] for duplicates, [1,0] otherwise
    if (row['is_duplicate'] == 1):
        Y[exampleCounter] = [0,1]
    else:
        Y[exampleCounter] = [1,0]
    exampleCounter = exampleCounter + 1
np.save('Data/xMatrix.npy', X)
np.save('Data/yMatrix.npy', Y)

In [15]:
X = np.load('Data/xMatrix.npy')
Y = np.load('Data/yMatrix.npy')

Now that we have all of the inputs and outputs in 2 large matrices, let's divide them up into training and testing sets.

In [16]:
xTrain = X[0:numTrainExamples]
yTrain = Y[0:numTrainExamples]
xTest = X[numTrainExamples:]
yTest = Y[numTrainExamples:]

# Helper Functions

Below you can find a couple of helper functions that will be useful when training the network in a later step.

In [21]:
from random import randint

def getBatch(isTest = False):
    # Pick a random contiguous window of batchSize examples from the chosen split
    if (isTest):
        num = randint(0,xTest.shape[0] - batchSize)
        arr = xTest[num:num+batchSize]
        labels = yTest[num:num+batchSize]
    else:
        num = randint(0,xTrain.shape[0] - batchSize)
        arr = xTrain[num:num+batchSize]
        labels = yTrain[num:num+batchSize]
    return arr, labels
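
getBatch returns a randomly positioned contiguous window, so the examples in a batch are always neighbors in the dataset. A different scheme, shown here only as a sketch and not used in this notebook, is to shuffle all example indices once per epoch and walk through them in chunks. The function name `epochBatches` and its `seed` parameter are hypothetical.

```python
import random

def epochBatches(numExamples, batchSize, seed=None):
    # Shuffle all example indices once, then yield them in batchSize-sized chunks.
    # Leftover examples that don't fill a complete batch are dropped.
    order = list(range(numExamples))
    random.Random(seed).shuffle(order)
    for start in range(0, numExamples - batchSize + 1, batchSize):
        yield order[start:start + batchSize]

batches = list(epochBatches(numExamples=10, batchSize=3, seed=0))
print(batches)
```

Each index list could then be used to select rows, for example `xTrain[indices]`, so that every training example is visited at most once per epoch.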

# RNN Model

Ah, now we get to the fun part. We're ready to start creating our Tensorflow graph. We'll first need to define some hyperparameters, such as batch size, number of LSTM units, and number of training iterations (the number of output classes was already set above).

For a quick refresher on LSTMs and RNNs, you can check out the following tutorials for help. 
* Chris Olah's phenomenal [post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) on LSTMs
* Denny Britz's [blog post series](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) on RNNs
* My [blog post](https://adeshpande3.github.io/adeshpande3.github.io/Deep-Learning-Research-Review-Week-3-Natural-Language-Processing) on deep learning for NLP

In [54]:
batchSize = 24
lstmUnits = 64
iterations = 10000

As with most Tensorflow graphs, we'll now need to specify a couple of placeholders: one for the inputs into the network and one for the labels. The purpose of a placeholder is basically to tell Tensorflow "We're going to feed in actual data later (during the training loop), but for now, we're going to define these placeholder variables instead". It lets Tensorflow know about the size of the inputs beforehand. The most important part about defining these placeholders is understanding each of their dimensionalities.

The labels placeholder represents a set of values, each either [1, 0] or [0, 1], depending on whether each training example is positive (a duplicate) or negative (not a duplicate). The input data placeholder represents the matrices of all the question pairs in the batch. 

Once we have our input data placeholder, we're going to call the tf.nn.embedding_lookup() function in order to get our word vectors. The call to that function will return a 3-D Tensor of dimensionality batch size by max sequence length by word vector dimensions.

In [63]:
import tensorflow as tf
tf.reset_default_graph()

labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])

# Look up the word vector for every index; result has shape [batchSize, maxSeqLength, numDimensions]
data = tf.nn.embedding_lookup(wordVectors,input_data)
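
Conceptually, the embedding lookup just replaces every index in the batch with the matching row of the vector table. The plain-Python sketch below mimics that on made-up data (`toyTable` and `toyBatch` are hypothetical) to show where the three dimensions come from. It is not how Tensorflow implements the operation.

```python
# Hypothetical 4-word embedding table with 3-dimensional vectors
toyTable = [[0.0, 0.0, 0.0],
            [0.1, 0.2, 0.3],
            [0.4, 0.5, 0.6],
            [0.7, 0.8, 0.9]]

# A batch of 2 examples, each a sequence of 3 word indices
toyBatch = [[1, 3, 0],
            [2, 2, 1]]

# Replace every index with its row from the table
looked = [[toyTable[idx] for idx in sequence] for sequence in toyBatch]

# (batch size, sequence length, vector dimensions)
shape = (len(looked), len(looked[0]), len(looked[0][0]))
print(shape)
```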

Now that we've defined the inputs and outputs, let's look at how we can feed this input into an LSTM network. We're going to create a BasicLSTMCell (the code below uses tf.contrib.rnn.BasicLSTMCell). Its constructor takes in an integer for the number of LSTM units that we want. This is one of the hyperparameters that will take some tuning to figure out the optimal value. Wrapping the LSTM cell in a dropout layer can help prevent the network from overfitting; this first model uses the plain cell, and the dropout wrapper shows up in the bidirectional model later in the notebook.

Finally, we’ll feed both the LSTM cell and the 3-D tensor full of input data into a function called tf.nn.dynamic_rnn. This function is in charge of unrolling the whole network and creating a pathway for the data to flow through the RNN graph.

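The first output of the dynamic RNN function contains the LSTM's output at every time step. We'll pull out the output at the last time step, which can be thought of as the last hidden state vector. This vector is then passed through a small fully connected layer, followed by a final weight matrix and a bias term, to obtain the final output values.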

In [64]:
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)

hiddenUnits = 32

weight = tf.Variable(tf.truncated_normal([lstmUnits, hiddenUnits]))
bias = tf.Variable(tf.constant(0.1, shape=[hiddenUnits]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
fc1 = (tf.matmul(last, weight) + bias)

weight2 = tf.Variable(tf.truncated_normal([hiddenUnits, numClasses]))
bias2 = tf.Variable(tf.constant(0.1, shape=[numClasses]))
prediction = (tf.matmul(fc1, weight2) + bias2)

Next, we’ll define correct prediction and accuracy metrics to track how the network is doing. The correct prediction formulation works by looking at the index of the maximum value of the 2 output values, and then seeing whether it matches with the training labels.

Then, we'll declare our loss function as well as our optimizer object. 

In [49]:
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
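
For readers who want to see what these two metrics compute, here is a small pure-Python sketch of softmax cross entropy and argmax accuracy on made-up numbers. The helper names (`softmaxCrossEntropy`, `argmax`) and the toy logits are assumptions for illustration only; the notebook itself relies on the Tensorflow ops above.

```python
import math

def softmaxCrossEntropy(logits, oneHotLabel):
    # Subtract the max logit for numerical stability before exponentiating
    m = max(logits)
    logSumExp = m + math.log(sum(math.exp(z - m) for z in logits))
    logProbs = [z - logSumExp for z in logits]
    # Cross entropy: negative log probability assigned to the true class
    return -sum(y * lp for y, lp in zip(oneHotLabel, logProbs))

def argmax(values):
    # Index of the largest value (first one wins on ties)
    return max(range(len(values)), key=lambda i: values[i])

batchLogits = [[2.0, 0.5], [0.0, 0.0], [-1.0, 3.0]]   # made-up network outputs
batchLabels = [[1, 0], [0, 1], [0, 1]]                # [1,0] = not duplicate, [0,1] = duplicate

losses = [softmaxCrossEntropy(l, y) for l, y in zip(batchLogits, batchLabels)]
meanLoss = sum(losses) / len(losses)

correct = [argmax(l) == argmax(y) for l, y in zip(batchLogits, batchLabels)]
batchAccuracy = sum(correct) / float(len(correct))

print('Mean loss: %.4f' % meanLoss)
print('Accuracy: %.4f' % batchAccuracy)
```

The second example has equal logits, so its loss is ln 2: the model is splitting its probability evenly between the two classes.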

The following code is for setting up Tensorboard, which is a neat tool that allows us to visualize metrics such as loss and accuracy as our network is training. 

In [50]:
import datetime

sess = tf.InteractiveSession()
tf.summary.scalar('Loss', loss)
tf.summary.scalar('Accuracy', accuracy)
merged = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, sess.graph)

# Training

The basic idea of the training loop is that we first define a Tensorflow session. Then, we load in a batch of question pairs and their associated labels using the getBatch function. Next, we call the session's run function. This function has two arguments. The first is called the "fetches" argument. It defines the value we're interested in computing. We want our optimizer to be computed since that is the component that minimizes our loss function. The second argument is where we input our feed_dict. This data structure is where we provide inputs to all of our placeholders. We need to feed our batch of question pairs and our batch of labels. This loop is then repeated for a set number of training iterations.

In [51]:
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

for i in range(iterations):
    #Next batch of question pairs
    nextBatch, nextBatchLabels = getBatch()
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
   
    #Write summary to Tensorboard
    if (i % 10 == 0):
        summary = sess.run(merged, {input_data: nextBatch, labels: nextBatchLabels})
        writer.add_summary(summary, i)
    
    if (i % 1000 == 0):
        trainingAccuracy = sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})
        print 'The training accuracy at iteration', i, 'is', trainingAccuracy
writer.close()

The training accuracy at iteration 0 is 0.791667
The training accuracy at iteration 1000 is 0.666667
The training accuracy at iteration 2000 is 0.583333
The training accuracy at iteration 3000 is 0.583333
The training accuracy at iteration 4000 is 0.625
The training accuracy at iteration 5000 is 0.666667
The training accuracy at iteration 6000 is 0.75
The training accuracy at iteration 7000 is 0.666667
The training accuracy at iteration 8000 is 0.666667


KeyboardInterrupt: 

Using Tensorboard, we can look at the loss and accuracy of the model as it trains. 

TODO: Include both of the images. 

# Testing

Now, let's test our trained network on question pairs it hasn't seen before.

In [53]:
numTestBatches = 10 # Separate name so the training `iterations` value isn't overwritten
for i in range(numTestBatches):
    nextBatch, nextBatchLabels = getBatch(True)
    print("Accuracy for this batch:", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)

('Accuracy for this batch:', 54.166668653488159)
('Accuracy for this batch:', 62.5)
('Accuracy for this batch:', 75.0)
('Accuracy for this batch:', 62.5)
('Accuracy for this batch:', 75.0)
('Accuracy for this batch:', 58.333331346511841)
('Accuracy for this batch:', 58.333331346511841)
('Accuracy for this batch:', 66.666668653488159)
('Accuracy for this batch:', 58.333331346511841)
('Accuracy for this batch:', 75.0)
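
To put these per-batch numbers in context, it helps to average them and compare against a trivial baseline that always predicts "not duplicate". The sketch below does that arithmetic using the ten accuracies printed above and the class counts from the start of the notebook. Two caveats: the baseline is computed over the whole dataset rather than the test split, and ten batches of 24 examples is a small, noisy sample.

```python
# The ten batch accuracies printed above (in percent)
batchAccuracies = [54.166668653488159, 62.5, 75.0, 62.5, 75.0,
                   58.333331346511841, 58.333331346511841,
                   66.666668653488159, 58.333331346511841, 75.0]
meanAccuracy = sum(batchAccuracies) / len(batchAccuracies)

# Class counts from the data loading section
numDuplicates = 149263
numNonDuplicates = 255027
majorityBaseline = 100.0 * numNonDuplicates / (numDuplicates + numNonDuplicates)

print('Mean accuracy over the 10 batches: %.2f%%' % meanAccuracy)
print('Always predicting "not duplicate": %.2f%%' % majorityBaseline)
```

The two figures land within a couple of percentage points of each other, which suggests the model has not yet learned much beyond the class prior.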


# How Can We Do Better?

Now that we've seen how an RNN/LSTM model performs, let's think about the different ways we can bump up the accuracy. 

1. Try to represent the inputs in a different way. Concatenating the two sentence vectors is one way of representing the question pair, but can we think of any others? We can also use different word vectors/embeddings. The Stanford GloVe website also has 100 and 300 dimensional vectors, which may provide better representations. 
2. Change the model architecture we use. RNNs and LSTMs normally perform the best on natural language inputs, but we've recently seen [convnets](https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/) and [LSTMs with attention mechanisms](http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/) used as well.  
3. Consider ways to reduce the effect of imbalanced data. As we saw in the beginning of the notebook, about 63% of the examples were negative (255,027 of 404,290) and the rest were positive. This is a decent ratio, but the closer we can get to 50/50, the better.
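
For the third idea, one common option (not something this notebook implements) is to weight each class inversely to its frequency, so that both classes contribute equally to the loss overall. The sketch below computes such weights from the class counts seen earlier; the `classWeights` dictionary and the weighting formula are an assumed choice for illustration.

```python
# Class counts from the data loading section
classCounts = {'not_duplicate': 255027, 'duplicate': 149263}

total = sum(classCounts.values())
numClasses = len(classCounts)

# Inverse-frequency weights: total / (number of classes * class count)
classWeights = {name: total / float(numClasses * count)
                for name, count in classCounts.items()}

for name in sorted(classWeights):
    print('%s: %.4f' % (name, classWeights[name]))
```

Each example's loss would then be multiplied by the weight of its true class before averaging. Downsampling the negative pairs is another route to a similar effect.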

# Bidirectional Model

In [79]:
tf.reset_default_graph()

labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])
keep_prob = tf.placeholder(tf.float32)
seq_len = tf.placeholder(tf.int32, [None])

# Look up the word vector for every index; result has shape [batchSize, maxSeqLength, numDimensions]
data = tf.nn.embedding_lookup(wordVectors,input_data)

lstm_fw_cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits, forget_bias=1.0, state_is_tuple=True)
lstm_bw_cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits, forget_bias=1.0, state_is_tuple=True)
lstm_fw_cell = tf.contrib.rnn.DropoutWrapper(lstm_fw_cell, keep_prob)
lstm_bw_cell = tf.contrib.rnn.DropoutWrapper(lstm_bw_cell, keep_prob)
value, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw_cell, cell_bw=lstm_bw_cell, inputs=data,sequence_length=seq_len, dtype=tf.float32)

value = tf.concat(value, 2)
hiddenUnits = 32

weight = tf.Variable(tf.truncated_normal([2*lstmUnits, hiddenUnits]))
bias = tf.Variable(tf.constant(0.1, shape=[hiddenUnits]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
fc1 = (tf.matmul(last, weight) + bias)

weight2 = tf.Variable(tf.truncated_normal([hiddenUnits, numClasses]))
bias2 = tf.Variable(tf.constant(0.1, shape=[numClasses]))
prediction = (tf.matmul(fc1, weight2) + bias2)

correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

for i in range(iterations):
    #Next batch of question pairs
    nextBatch, nextBatchLabels = getBatch()
    train_seq_len = np.full(batchSize, maxSeqLength, dtype='int32') # Every sequence is treated as full length
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels, keep_prob: 0.7, seq_len: train_seq_len})
    
    if (i % 100 == 0):
        trainingAccuracy = sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels, keep_prob: 1.0, seq_len: train_seq_len})
        print 'The training accuracy at iteration', i, 'is', trainingAccuracy
writer.close()

The training accuracy at iteration 0 is 0.708333
The training accuracy at iteration 100 is 0.666667
The training accuracy at iteration 200 is 0.75
The training accuracy at iteration 300 is 0.708333
The training accuracy at iteration 400 is 0.75
The training accuracy at iteration 500 is 0.416667
The training accuracy at iteration 600 is 0.75
The training accuracy at iteration 700 is 0.75
The training accuracy at iteration 800 is 0.666667
The training accuracy at iteration 900 is 0.833333
The training accuracy at iteration 1000 is 0.583333
The training accuracy at iteration 1100 is 0.75
The training accuracy at iteration 1200 is 0.666667
The training accuracy at iteration 1300 is 0.75
The training accuracy at iteration 1400 is 0.666667
The training accuracy at iteration 1500 is 0.75
The training accuracy at iteration 1600 is 0.708333
The training accuracy at iteration 1700 is 0.666667
The training accuracy at iteration 1800 is 0.541667
The training accuracy at iteration 1900 is 0.541667
The training accuracy at iteration 2000 is 0.541667
The training accuracy at iteration 2

KeyboardInterrupt: 