## Purpose:

This notebook builds a simple baseline for a retrieval model that selects a response to a query. It follows the recommendation of a popular paper: establish a baseline first, then compare a more advanced approach against it.

It replicates the TF-IDF Baseline Evaluation on my own data: https://github.com/dennybritz/chatbot-retrieval/blob/master/notebooks/TFIDF%20Baseline%20Evaluation.ipynb

In [87]:
import pandas as pd
import numpy as np
import time
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [121]:
%%time
# Load Data
train_df = pd.read_pickle("../../data/option3_data/train_df.pickle")
test_df = pd.read_csv("../../data/option3_data/test.csv")
val_df = pd.read_csv("../../data/option3_data/valid.csv")
#test_df = pd.read_pickle("../../data/option3_data/test.pickle")
y_test = np.zeros(len(test_df))

CPU times: user 1.15 s, sys: 265 ms, total: 1.41 s
Wall time: 1.41 s
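
Setting `y_test` to all zeros only makes sense if the correct response is always the first candidate in each test row, which is how the evaluation cells slice each row with `test_df.iloc[x, 1:]`. The sketch below illustrates that assumed layout on a made-up one-row frame. The column names `ground_truth` and `distractor_*` are hypothetical and may not match the real `test.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the test set; column names are assumptions.
toy_test_df = pd.DataFrame({
    "context": ["how do I treat a sprain?"],
    "ground_truth": ["rest, ice and elevate it"],
    "distractor_0": ["thanks for the reply"],
    "distractor_1": ["that movie was great"],
})

# Everything after the first column is treated as a candidate response.
candidates = toy_test_df.iloc[0, 1:].values

# The true response sits in the first candidate slot, hence a label of 0.
toy_y_test = np.zeros(len(toy_test_df))
print(candidates[int(toy_y_test[0])])
```

If the real file orders its columns differently, the labels would need to point at the actual position of the true response.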


In [126]:
train_df.head()

| Unnamed: 0 | context | label | utterance |
|---|---|---|---|
| 20 | That's a good idea, but I'm still too embarras... | 1 | Obgyns deal with WAY worse stuff than this on ... |
| 21 | That's a good idea, but I'm still too embarras... | 0 | You need to be evaluated by both mental health... |
| 22 | That's a good idea, but I'm still too embarras... | 0 | As everyone else has said, yes you are doing y... |
| 23 | That's a good idea, but I'm still too embarras... | 0 | Hypertrophic scarring. [More info and possible... |
| 24 | That's a good idea, but I'm still too embarras... | 0 | okay, good to know. Thank you for your response |


In [90]:
def evaluate_recall(y, y_test, k=1):
    num_examples = float(len(y))
    num_correct = 0
    for predictions, label in zip(y, y_test):
        if label in predictions[:k]:
            num_correct += 1
    return num_correct/num_examples
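
To make the metric concrete, here is a small worked example of recall@k on invented rankings. The function is restated so the snippet runs on its own. The toy rankings and the assumption that the correct candidate has index 0 are illustrative only.

```python
import numpy as np

def evaluate_recall(y, y_test, k=1):
    """Fraction of examples whose true label appears among the top-k predictions."""
    num_correct = sum(
        1 for predictions, label in zip(y, y_test) if label in predictions[:k]
    )
    return num_correct / float(len(y))

# Hypothetical rankings for three examples with four candidates each;
# the correct candidate is assumed to have index 0.
toy_rankings = [
    np.array([0, 2, 1, 3]),  # correct candidate ranked first
    np.array([2, 0, 3, 1]),  # correct candidate ranked second
    np.array([3, 1, 2, 0]),  # correct candidate ranked last
]
toy_labels = np.zeros(len(toy_rankings))

recall_at_1 = evaluate_recall(toy_rankings, toy_labels, k=1)
recall_at_2 = evaluate_recall(toy_rankings, toy_labels, k=2)
recall_at_4 = evaluate_recall(toy_rankings, toy_labels, k=4)
print(recall_at_1, recall_at_2, recall_at_4)
```

Recall can only grow with k, and it reaches 1 once k covers every candidate, provided each prediction ranks all candidates.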

In [91]:
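def predict_random(context, utterances):
    # Note: this draws only 4 candidate indices, so the correct response can
    # appear in at most 4 ranked positions no matter how large k is.
    return np.random.choice(len(utterances), 4, replace=False)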

In [92]:
%%time
# Evaluate Random predictor
y_random = [predict_random(test_df.context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))

Recall @ (1, 10): 0.100675
Recall @ (2, 10): 0.2
Recall @ (5, 10): 0.399273
Recall @ (10, 10): 0.399273
CPU times: user 2.63 s, sys: 70.7 ms, total: 2.7 s
Wall time: 2.75 s
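
Recall@5 and recall@10 above both come out near 0.4 because `predict_random` returns only 4 of the candidate indices. As a sanity check, the sketch below ranks every candidate instead. Under that assumption, recall@k for a uniform random ranking of 10 candidates should be close to k/10. The helper names, the seed, and the synthetic sizes are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

def predict_random_full(num_candidates):
    """Return a random ordering of all candidate indices (hypothetical helper)."""
    return rng.permutation(num_candidates)

def recall_at_k(rankings, k, true_index=0):
    """Share of rankings that place `true_index` within the first k positions."""
    return float(np.mean([true_index in ranking[:k] for ranking in rankings]))

# Synthetic sizes, not the notebook's data.
num_examples, num_candidates = 20000, 10
rankings = [predict_random_full(num_candidates) for _ in range(num_examples)]

recalls = {k: recall_at_k(rankings, k) for k in [1, 2, 5, 10]}
for k, value in recalls.items():
    print("Recall @ ({}, 10): {:g}".format(k, value))
```

With a full permutation the random baseline is a fairer comparison point for the TF-IDF recall at larger k.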


In [93]:
class TFIDFPredictor:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def train(self, data):
        self.vectorizer.fit(np.append(data.context.values,data.utterance.values))

    def predict(self, context, utterances):
        # Convert context and utterances into tfidf vector
        vector_context = self.vectorizer.transform([context])
        vector_doc = self.vectorizer.transform(utterances)
        # The dot product measures the similarity of the resulting vectors
        result = vector_doc.dot(vector_context.T).todense()
        result = np.asarray(result).flatten()
        # Sort by top results and return the indices in descending order
        return np.argsort(result, axis=0)[::-1]
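
The ranking step can be seen in isolation on a tiny invented corpus. This is a minimal sketch of the same idea (fit a `TfidfVectorizer`, transform the context and candidates, score by dot product, sort descending), not a replacement for the class above. All sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
corpus = [
    "my knee hurts after running",
    "rest the knee and ice it after running",
    "the weather is nice today",
    "try a different keyboard layout",
]
vectorizer = TfidfVectorizer().fit(corpus)

context = "my knee hurts after running"
candidates = [
    "the weather is nice today",
    "rest the knee and ice it after running",
    "try a different keyboard layout",
]

vector_context = vectorizer.transform([context])       # shape (1, vocab_size)
vector_candidates = vectorizer.transform(candidates)   # shape (3, vocab_size)

# Rows are L2-normalised by default (norm="l2"), so the dot product
# equals the cosine similarity between candidate and context.
scores = (vector_candidates @ vector_context.T).toarray().ravel()
ranking = np.argsort(scores)[::-1]  # candidate indices, best match first
print(scores, ranking)
```

Candidates that share no vocabulary with the context score exactly zero, so their relative order in the ranking is arbitrary.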

In [125]:
%%time
# Evaluate TFIDF predictor
pred = TFIDFPredictor()
pred.train(train_df)
y = [pred.predict(test_df.context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y, y_test, n)))

Recall @ (1, 10): 0.476141
Recall @ (2, 10): 0.570431
Recall @ (5, 10): 0.722859
Recall @ (10, 10): 1
CPU times: user 1min 37s, sys: 2.71 s, total: 1min 39s
Wall time: 1min 40s


**Very surprising: these metrics are very similar to those first reported in the paper, especially given the performance of random selection.**

In [128]:
import gensim
from gensim.models import Word2Vec

In [129]:
model = Word2Vec.load("../../data/reddit_may_2015_embeddings/model_full_reddit")

In [131]:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

cereal



In [133]:
model.wv.most_similar(positive=['reddit'])


[('reedit', 0.7281540036201477),
 ('fph', 0.7037860155105591),
 ('twitter', 0.7018535137176514),
 ('chan', 0.6998030543327332),
 ('facebook', 0.6992213726043701),
 ('sub', 0.6870619058609009),
 ('forum', 0.6837552189826965),
 ('neogaf', 0.6561684608459473),
 ('tia', 0.6557859182357788),
 ('srd', 0.6524701118469238)]
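
The scores in the list above are cosine similarities between word vectors. The sketch below reproduces that ranking logic with plain NumPy on made-up 3-dimensional vectors. The vectors and the helper `toy_most_similar` are hypothetical and unrelated to the trained Reddit model.

```python
import numpy as np

# Made-up 3-dimensional embeddings, purely illustrative.
toy_vectors = {
    "reddit": np.array([0.9, 0.1, 0.0]),
    "forum": np.array([0.8, 0.3, 0.1]),
    "twitter": np.array([0.7, 0.0, 0.6]),
    "cereal": np.array([0.0, 1.0, 0.2]),
}

def toy_most_similar(word, vectors, topn=2):
    """Rank the other words by cosine similarity to `word` (hypothetical helper)."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scored = [
        (other, float(np.dot(query, vec / np.linalg.norm(vec))))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = toy_most_similar("reddit", toy_vectors)
print(neighbours)
```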