# Assessing the GloVe model (and predicting with it)

### First we import our packages and load all of the data for the GloVe embedding.

In [1]:
import pandas as pd
import numpy as np
from model_architecture import build_model
import json
import itertools
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from build_vocab import normalize_text
from process_and_train import text_to_sequence

In [2]:
data = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')

There are some NaN values in the downloaded dataset. In the code, these are dropped, but that will affect our indexing later, so I'm going to see exactly which rows have NaN values.

In [3]:
pd.isnull(data).any(axis=1).to_numpy().nonzero()[0]

array([105796, 201871, 363416], dtype=int64)

In [4]:
data.dropna(inplace=True)

In [6]:
with open('data/glove/300dim_glove_word_index.json','r') as read_file:
    word_to_int = json.load(read_file)
embedding_matrix = np.load('data/glove/300dim_glove_emb_matrix.npy')
# I'm setting input_size to 248 because it will be necessary later. It can be however many words you plan on using.
model = build_model(vocab_size = len(word_to_int) + 2, input_size = 248, embedding_dim = 300, lstm_units = 50, embedding_matrix = embedding_matrix, learning_rate = 0.001, clip_norm = None)

In [7]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
checkpoint_path = r"models\glove\cp.ckpt"
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x17c06dfff98>

### Now we can put in any two questions (or sentences) that we'd like and ask the model to predict whether they are the same or not.

In [8]:
def prep_for_input(pair):
    q1 = pair[0]
    q2 = pair[1]
    q1 = text_to_sequence(normalize_text(q1), word_to_int)
    q2 = text_to_sequence(normalize_text(q2), word_to_int)
    q1 = pad_sequences([q1], maxlen=248)
    q2 = pad_sequences([q2], maxlen=248)
    return [q1,q2]
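
For reference, `pad_sequences` with its default arguments pads (and truncates) on the left, so each question ends up right-aligned in a row of zeros of length 248. Below is a minimal pure-Python sketch of that behavior, using a hypothetical `pad_left` helper (not part of Keras) and a short `maxlen` for readability:

```python
def pad_left(seq, maxlen, value=0):
    """Mimic the Keras pad_sequences defaults (padding='pre', truncating='pre')
    for a single sequence. Hypothetical helper, for illustration only.
    Assumes maxlen >= 1."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

print(pad_left([7, 3, 12], maxlen=6))             # [0, 0, 0, 7, 3, 12]
print(pad_left([1, 2, 3, 4, 5, 6, 7], maxlen=6))  # [2, 3, 4, 5, 6, 7]
```

The token ids here are made up; in the notebook they come from `text_to_sequence`.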

In [9]:
q1 = 'When is Christmas?'
q2 = 'What day is Christmas?'
print('The probability that these are the same is {}'.format(model.predict(prep_for_input([q1, q2]))[0][0]))

The probability that these are the same is 0.4604267477989197


### To see where our model is lacking, let's see some of the decisions it made on the testing set.

First we process the data so that it can be fed to the model. The model is already trained; its weights were loaded from the checkpoint file above.

In [10]:
vec_data = data.copy()
for index, row in data.iterrows():
    question1_words = normalize_text(row['question1'])
    question2_words = normalize_text(row['question2'])
    vec_data.at[index,'question1'] = text_to_sequence(question1_words, word_to_int)
    vec_data.at[index,'question2'] = text_to_sequence(question2_words, word_to_int)

In [11]:
X = vec_data[['question1','question2']]
Y = vec_data['is_duplicate']

In [12]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=42)

To see which predictions correspond to which rows of the test set, we need to save the test rows' indices. As hinted at earlier, these are index labels from the original file, and `dropna` did not renumber them, so they no longer match row positions once the 3 NaN rows are gone. We'll have to deal with that later.

In [13]:
test_indices = list(X_test.index)

In [14]:
X_train = {'left': X_train.question1, 'right': X_train.question2}
X_test = {'left': X_test.question1, 'right': X_test.question2}

In [15]:
max_seq_length = max(vec_data['question1'].map(len).max(), vec_data['question2'].map(len).max())
print('Maximum length of preprocessed questions is {}'.format(max_seq_length))
for dataset, side in itertools.product([X_train, X_test], ['left', 'right']): 
    dataset[side] = pad_sequences(dataset[side], maxlen=max_seq_length)

Maximum length of preprocessed questions is 248


### Now we're ready to take a look at the predictions. First we save the predictions to an array, and we also round them to their nearest integer to see what the predicted class is.

In [16]:
predictions = model.predict([X_test['left'], X_test['right']])
predicted_classes = np.rint(predictions)
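
One detail worth knowing about `np.rint`: it rounds exact halves to the nearest even integer, so a prediction of exactly 0.5 becomes class 0. That rarely matters for floating-point outputs, but an explicit threshold makes the decision rule visible. A small sketch on made-up probabilities (the values and the `threshold` name are illustrative, not model output):

```python
import numpy as np

# Made-up probabilities shaped like model.predict output: (n_samples, 1)
probs = np.array([[0.12], [0.5], [0.5001], [0.93]])

rounded = np.rint(probs)                        # half-to-even: 0.5 rounds to 0.0
threshold = 0.5
thresholded = (probs > threshold).astype(int)   # the same rule, stated explicitly

print(rounded.ravel())
print(thresholded.ravel())
```

For probabilities between 0 and 1 the two approaches assign the same classes; the thresholded version is just easier to adjust if a cutoff other than 0.5 is wanted.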

In [17]:
from sklearn.metrics import confusion_matrix

The confusion matrix shows that we have about 1.7 times as many false negatives (9334) as false positives (5584). As we'll see below, the loss is also quite high on the test set, so in either case we'd like to examine predictions that are wildly off (e.g., the predicted probability is 0.8 when the correct label is 0).

In [18]:
confusion_matrix(Y_test, predicted_classes)

array([[45656,  5584],
       [ 9334, 20296]], dtype=int64)
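
The usual summary metrics follow directly from these four counts. The sketch below is plain arithmetic on the matrix printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative:

```python
# Counts copied from the confusion matrix above
# (scikit-learn layout: rows = true class, columns = predicted class).
tn, fp = 45656, 5584
fn, tp = 9334, 20296

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted duplicates that really are duplicates
recall = tp / (tp + fn)      # share of true duplicates that the model found
fn_to_fp = fn / fp           # how lopsided the two error types are

print(f"test pairs = {total}")
print(f"accuracy   = {accuracy:.3f}")
print(f"precision  = {precision:.3f}")
print(f"recall     = {recall:.3f}")
print(f"FN / FP    = {fn_to_fp:.2f}")
```

Recall comes out noticeably lower than precision, which is another way of saying that the model misses true duplicates more often than it invents them.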

In [19]:
loss, acc = model.evaluate([X_test['left'], X_test['right']], Y_test)



In [20]:
low_similarity_indices = []
for i in range(len(predictions)):
    if predictions[i][0] < 0.2:
        low_similarity_indices.append(i)

Here we convert the saved index labels into row positions for `iloc` by subtracting the number of dropped NaN rows that precede each label.

In [21]:
low_similarity_rows = []
for i in low_similarity_indices:
    index = test_indices[i]
    if index > 105796 and index < 201871:
        index = index - 1
    if index > 201871 and index < 363416:
        index = index - 2
    if index > 363416:
        index = index - 3
    low_similarity_rows.append(index)
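
The three `if` blocks above amount to subtracting, from each index label, the number of dropped rows that came before it. Here is a compact sketch of the same mapping using the standard-library `bisect` module; `label_to_position` is a hypothetical helper, not something defined in this notebook:

```python
from bisect import bisect_left

# Index labels of the rows removed by dropna (found earlier), in sorted order.
DROPPED_LABELS = [105796, 201871, 363416]

def label_to_position(label, dropped=DROPPED_LABELS):
    """Convert an original index label to its row position after the drop."""
    # bisect_left counts how many dropped labels are smaller than this label.
    return label - bisect_left(dropped, label)

for label in (5, 105797, 201872, 363417):
    print(label, "->", label_to_position(label))
```

Since `dropna` keeps the original labels, selecting with `data.loc[...]` on the unadjusted labels should be an equivalent way to fetch these rows without any shifting.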

You'll see below that, in addition to there being more false negatives overall, there are also more high-confidence false negatives than high-confidence false positives.

In [32]:
predict_dissimilar = data.iloc[low_similarity_rows]
false_negative = predict_dissimilar[predict_dissimilar['is_duplicate']==1]
print('The number of high confidence false negatives is {}'.format(false_negative.shape[0]))

The number of high confidence false negatives is 2712


Here we sample some of the poor predictions this model made. To be honest, I find some of them to be a toss-up.

In [54]:
# these questions are supposed to be the same but were predicted to be very dissimilar
# i = 2378, for example
i = np.random.choice(len(false_negative), 1)
print(i)
print(list(false_negative.iloc[i]['question1']))
print(list(false_negative.iloc[i]['question2']))

['How do you get rid of dog ticks?']
['How do you get rid of dead ticks on dogs?']


### Poorly predicted false positives

In [24]:
high_similarity_indices = []
for i in range(len(predictions)):
    if predictions[i][0] > 0.8:
        high_similarity_indices.append(i)

In [25]:
high_similarity_rows = []
for i in high_similarity_indices:
    index = test_indices[i]
    if index > 105796 and index < 201871:
        index = index - 1
    if index > 201871 and index < 363416:
        index = index - 2
    if index > 363416:
        index = index - 3
    high_similarity_rows.append(index)

In [34]:
predict_similar = data.iloc[high_similarity_rows]
false_positive = predict_similar[predict_similar['is_duplicate']==0]
print('The number of high confidence false positives is {}'.format(false_positive.shape[0]))

The number of high confidence false positives is 1264
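
To put the high-confidence counts on a common scale, one can express each as a share of all errors of its type, using the confusion matrix from earlier. This is plain arithmetic on numbers already reported in this notebook; the variable names are illustrative:

```python
# Counts reported earlier in this notebook.
false_negatives, false_positives = 9334, 5584   # from the confusion matrix
high_conf_fn, high_conf_fp = 2712, 1264         # predictions < 0.2 and > 0.8

fn_share = high_conf_fn / false_negatives
fp_share = high_conf_fp / false_positives

print(f"{fn_share:.1%} of false negatives are high confidence")
print(f"{fp_share:.1%} of false positives are high confidence")
```

By this measure the model is not only wrong more often on true duplicates, it is also somewhat more often confidently wrong on them.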


Again we see that some of the poorly made predictions are wildly wrong, but some could go either way. It seems that the biggest problem plaguing this dataset is defining what it means for two questions to be the same.

In [27]:
# these questions are supposed to be different but were predicted to be very similar
i = np.random.choice(len(false_positive), 1)
print(list(false_positive.iloc[i]['question1']))
print(list(false_positive.iloc[i]['question2']))

['How long do attack dogs live?']
['How long do dogs live?']


As an example, the questions 'How do you get rid of dog ticks?' and 'How do you get rid of dead ticks on dogs?' are marked as duplicates of each other, but the questions 'How long do attack dogs live?' and 'How long do dogs live?' are marked as different. One could argue that the distinction between "attack dogs" and "dogs" is more significant than the distinction between "dead ticks" and "ticks," but that person may never have had to remove a live tick from anything. One would hope that in ambiguous cases the model would assign a low level of confidence (a probability around 0.5), but that does not appear to be the case, at least at a glance.

### Accurately predicted positives (high confidence)

In [53]:
true_positive = predict_similar[predict_similar['is_duplicate']==1]
print("The number of high confidence true positives is {}".format(true_positive.shape[0]))

The number of high confidence true positives is 10527


In [52]:
# i = 947 (funny)
# i = 104, 306 (good example)
# sample from all high-confidence true positives, not just the first 1264
i = np.random.choice(len(true_positive), 1)
print(i)
print(list(true_positive.iloc[i]['question1']))
print(list(true_positive.iloc[i]['question2']))

[104]
['What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes?']
['How will the decision to illegalize the 500 and 1000 Rs notes help to get rid of black money in the Swiss bank or maybe in other foreign banks and currencies?']


In [56]:
print(true_positive.iloc[947]['question1'])
print(true_positive.iloc[947]['question2'])

Why do people try to ask silly questions on Quora rather than googling it?
Why do so many people ask things on Quora that they could just as easily Google?
