## Purpose:

Replicate the TF-IDF Baseline Evaluation but on my data: https://github.com/dennybritz/chatbot-retrieval/blob/master/notebooks/TFIDF%20Baseline%20Evaluation.ipynb

In [68]:
import pandas as pd
import numpy as np
import random
import time
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [184]:
%%time
# Load Data
train_df = pd.read_csv("../../data/option3_data/train.csv")
test_df = pd.read_csv("../../data/option3_data/test.csv")
val_df = pd.read_csv("../../data/option3_data/valid.csv")
y_test = np.zeros(len(test_df))

CPU times: user 5.98 s, sys: 1 s, total: 6.98 s
Wall time: 9.28 s


In [189]:
# Force any times to string that may be numpy objects
def to_string(dataframe):
    for c in dataframe.columns:
        dataframe[c] = dataframe[c].apply(lambda r: str(r))
    return dataframe

test_df = to_string(test_df)
val_df = to_string(val_df)

In [91]:
test_df.head()

Unnamed: 0,Context,Ground Truth Utterance,Distractor_0,Distractor_1,Distractor_2,Distractor_3,Distractor_4,Distractor_5,Distractor_6,Distractor_7,Distractor_8
0,&gt; Cushing's syndrome Typically it is caused...,"Thanks for the information, I'll be sure to sc...",It is facing the opposite direction of the oth...,&gt;You can't say that trying to get pregnant ...,Call your pediatrician's office on your own an...,looks like either an allergic reaction to a de...,Any chance you are actually just really sore? ...,"I am not a doctor or expert of any kind, but y...",Yep. Only alcohol I can touch now is one brand...,Most often its not painful. I've only ever see...,"I get it. I've yet to meet someone who says ""Y..."
1,is there any chance it will die down given time?,if its thrombosed (clotted)? no although warm ...,"It sounds like when you hurt your finger, inst...",Thanks. That's good to know.,I don't growing pains that often anymore :/,"I don't know if you have one near you, but Dol...",Have you had a pap to check for cervical infec...,So that stuff about toxins in my body and seri...,You can't just get TPN at the ER. Though it ma...,"I understand both is not optimum, but of the t...",no. This is nit muscle pain
2,"It feels rather unpleasant all the time, but l...",I'm not shocked that you don't have a tongue p...,See your GP.,The brain damage was suggested by a medical st...,Male or female is relevant. And height for tha...,This looks like petechiae/purpura. Does it bla...,"Well, unless I contracted Aids from jacking of...","Nothing to be ashamed about, has nothing to do...",I don't think that's the issue. I also notice ...,&gt;I was wondering what this meant for her. O...,I would definitely talk to your doctor about g...
3,"I see, I apologize. Age 21 Sex Male Height 5'1...",[^(**Mouseover** to view the metric conversion...,I don't have an infection. a pharmacist said i...,The shellfish part seems like a case of food p...,The only idea I have is diet. I'm in Norway so...,"Well, you mention that taking off your coat he...",This EXACT thing happened to my boyfriend (21 ...,"Thanks, awesome explanation :)",I'm not a doctor but I did have my gallbladder...,"thanks for the correction, will change my post...",Ahh ok. I'll go ahead and do that. Your answer...
4,28yo Male 6'1 175lbs Caucasian I'm a healthy g...,No. You cannot sweat it out. The vaccine is in...,"Im 13, I get them all the time. Honestly, they...",Provigil is a huge wild card and likely respon...,"TC is made up of LDL, HDL, and VLDL. I would a...","Hi, I think I may be of some help here. One of...",VARICOCELES. Ugh. Better than something wrong ...,OH one more thing. the clear nails pro stuff I...,&gt;As my period is due in 3 days If your peri...,What about niacin it raises HDLs mainly and it...,kidney stone


In [112]:
test_df.iloc[0,[0,1]+random.sample(range(2,11),k)]

Context                   &gt; Cushing's syndrome Typically it is caused...
Ground Truth Utterance    Thanks for the information, I'll be sure to sc...
Distractor_6              Yep. Only alcohol I can touch now is one brand...
Distractor_2              Call your pediatrician's office on your own an...
Name: 0, dtype: object

In [163]:
k = 5
dataframe_new = pd.DataFrame(columns=['Context','Ground Truth Utterance']+['Distractor_'+str(i) for i in range(0,k)])
distractor_row = test_df.iloc[1,[0,1]+random.sample(range(2,11),k)]
distractor_row.columns = dataframe_new.columns
distractor_row.index = dataframe_new.columns
dataframe_new = dataframe_new.append(distractor_row,ignore_index=True)



Unnamed: 0,Context,Ground Truth Utterance,Distractor_0,Distractor_1,Distractor_2,Distractor_3,Distractor_4
0,is there any chance it will die down given time?,if its thrombosed (clotted)? no although warm ...,"I don't know if you have one near you, but Dol...",I don't growing pains that often anymore :/,"I understand both is not optimum, but of the t...","It sounds like when you hurt your finger, inst...",So that stuff about toxins in my body and seri...
1,is there any chance it will die down given time?,if its thrombosed (clotted)? no although warm ...,"I don't know if you have one near you, but Dol...",I don't growing pains that often anymore :/,"I understand both is not optimum, but of the t...","It sounds like when you hurt your finger, inst...",So that stuff about toxins in my body and seri...


In [179]:
%%time


def sample_k_utterances(dataframe,k):
    """
    Input: test_df or valid_df with Context	Ground Truth Utterance	Distractor_0	Distractor_1	Distractor_2	Distractor_3	Distractor_4	Distractor_5	Distractor_6	Distractor_7	Distractor_8
        columns.
    Output:
        A new dataframe with k randomly selected distractor columns
    """
    
    dataframe_new = pd.DataFrame(columns=['Context','Ground Truth Utterance']+['Distractor_'+str(i) for i in range(0,k)])
    # Create a test_df with only 2, 5 possible answers
    for row in range(0,len(dataframe)):
        # random sample k columns from distractor utterances
        distractors = random.sample(range(2,11),k)
        new_row = dataframe.iloc[row,[0,1]+random.sample(range(2,11),k)]
        new_row.columns = dataframe_new.columns
        new_row.index = dataframe_new.columns

        dataframe_new = dataframe_new.append(new_row,ignore_index=True)

    return dataframe_new


# 1 in 2 R@1
df_2_test = sample_k_utterances(test_df,2)
df_2_test  = to_string(df_2_test )
# 1 in 5 R@1
df_5_test = sample_k_utterances(test_df,5)
df_5_test  = to_string(df_5_test )

CPU times: user 1min 4s, sys: 564 ms, total: 1min 5s
Wall time: 1min 35s


In [12]:
def evaluate_recall(y, y_test, k=1):
    """
    Compute the Recall @k. That is given a set of n possible answers, compute the percent correct when selecting k values.
    
    Input:
        y: Index set of answers.
        y_test: The correct index.
    Output:
    """
    num_examples = float(len(y))
    num_correct = 0
    for predictions, label in zip(y, y_test):
        if label in predictions[:k]:
            num_correct += 1
    return num_correct/num_examples

In [34]:
def predict_random(context, utterances):
    return np.random.choice(len(utterances), 10, replace=False)

Expected values are of course $k*E(x)$. Below function demonstrates that.

In [30]:
%%time
# Evaluate Random predictor
# For every context in the test dataframe, 
y_random = [predict_random(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))

Recall @ (1, 10): 0.09752
Recall @ (2, 10): 0.202606
Recall @ (5, 10): 0.492959
Recall @ (10, 10): 1
CPU times: user 4.22 s, sys: 40.9 ms, total: 4.26 s
Wall time: 6.04 s


In [60]:
class TFIDFPredictor:
    """
    x
    """
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def train(self, data):
        data = data.replace(np.nan, 'missing', regex=True)
        self.vectorizer.fit(np.append(data.Context.values,data.Utterance.values))

    def predict(self, context, utterances):
        # Convert context and utterances into tfidf vector
        vector_context = self.vectorizer.transform([context])
        vector_doc = self.vectorizer.transform(utterances)
        # The dot product measures the similarity of the resulting vectors
        result = np.dot(vector_doc, vector_context.T).todense()
        result = np.asarray(result).flatten()
        # Sort by top results and return the indices in descending order
        return np.argsort(result, axis=0)[::-1]

In [190]:
%%time
#pred = TFIDFPredictor()
#print("Training...")
#pred.train(train_df)
#print('Finished training.')
batches = [2,5,10]
for batch in batches:
    print('Batch size: ',batch)
    # Evaluate TFIDF predictor
    if batch == 2:
        dataframe = df_2_test
    elif batch == 5:
        dataframe = df_5_test
    elif batch == 10:
        dataframe = test_df
    y_test = np.zeros(len(dataframe))
    y = [pred.predict(dataframe.Context[x], dataframe.iloc[x,1:].values) for x in range(len(dataframe))]
    for n in [1, 2, 5, 10]:
        if n < batch:
            print("Recall @ ({}, {}): {:g}".format(n,batch, evaluate_recall(y, y_test, n)))

Batch size:  2
Recall @ (1, 2): 0.640185
Batch size:  5
Recall @ (1, 5): 0.544872
Recall @ (2, 5): 0.657629
Batch size:  10
Recall @ (1, 10): 0.48739
Recall @ (2, 10): 0.584384
Recall @ (5, 10): 0.728457
CPU times: user 1min 23s, sys: 889 ms, total: 1min 24s
Wall time: 2min 2s


**Very surprising, because these are very similar metrics to what was first reported in paper. Especially given the performance of random selection.**