# Largest Number of Similar Words Algorithm

In this notebook we try a simple algorithm: given a new question, find the most similar question in the dataset by looking for the one that shares the largest number of words with it. The final answer is the predefined answer for that question.

In [1]:
import pandas as pd
from collections import Counter
import re

## Load data

In [2]:
# Load training data
train_df = pd.read_csv("../data/en.csv", sep=";")
train_df.head()

Unnamed: 0,question,answer
0,Hi,Hello
1,Hello,Hi
2,Who are you?,I'm the person you are looking for. What for? ...
3,What's your name?,"Antonio, or maybe not..."
4,What's my name?,The most beautiful name I've ever heard: Mufasa.


In [3]:
# Load variation data
var_df = pd.read_csv("../data/en-var.csv", sep=";")
var_df.head()

Unnamed: 0,question,variation
0,Hi,Hiii
1,Hello,Helloooo!
2,What's your name?,Tell me your name.
3,What's your name?,What is your name?
4,What's my name?,What is my name?


## Tokenizer

The first step of the algorithm is to tokenize the questions. We could use tokenizers from libraries like NLTK or spaCy but, for now, we'll write a simple tokenizer ourselves.

In [4]:
# Let's define some contractions to replace them later,
# without being too precise (there are contractions that depend on the context)
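CONTRACTIONS = {
    "n't": "not",
    "'re": "are",
    "'ll": "will",
    "'s": "is",
    "'ve": "have",
    "'d": "had",
    # Trailing space in the replacement keeps the next word separate,
    # because the tokenizer only adds a space before each replacement
    " u ": "you ",
}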

In [5]:
class Tokenizer:
    """A super simple class to tokenize phrases"""

    def __init__(self, replace_contractions: bool = True):
        self.replace_contractions = replace_contractions

    def tokenize(self, phrase: str):
        """Tokenize phrase"""
        
        # Replace contractions
        if self.replace_contractions:
            for contraction, replacement in CONTRACTIONS.items():
                phrase = phrase.replace(contraction, " " + replacement)
        
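        # Remove marks (TODO: use regex)
        # Raw string, so the backslash is not read as an escape sequence
        marks = r"""!()-[]{};:'"\,<>./?@#$%^&*_~"""
        for m in marks:
            phrase = phrase.replace(m, "")

        # Collapse a run of repeated letters at the end ("Hiii" -> "Hi").
        # Still rough: it also shortens words such as "too".
        # The length check avoids an IndexError on empty or one-letter phrases.
        while len(phrase) >= 2 and phrase[-2] == phrase[-1]:
            phrase = phrase[:-1]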

        # Tokenize phrase
        words = [word.lower() for word in phrase.split()]

        return words
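
The TODO in the tokenizer mentions replacing the character loop with a regex. A minimal sketch of how that could look, assuming the same set of marks; `strip_marks` and `collapse_trailing_repeats` are hypothetical helper names, not part of the notebook:

```python
import re

# Same characters as the `marks` string used in the tokenizer
MARKS = r"""!()-[]{};:'"\,<>./?@#$%^&*_~"""
# re.escape makes characters such as ']' and '-' safe inside a character class
MARKS_RE = re.compile("[" + re.escape(MARKS) + "]")
# A run of the same character at the very end of the phrase
TRAILING_REPEAT_RE = re.compile(r"(.)\1+$")


def strip_marks(phrase: str) -> str:
    """Remove every mark from the phrase in a single pass."""
    return MARKS_RE.sub("", phrase)


def collapse_trailing_repeats(phrase: str) -> str:
    """Reduce a trailing run of one repeated character to a single character."""
    return TRAILING_REPEAT_RE.sub(r"\1", phrase)


cleaned = collapse_trailing_repeats(strip_marks("Helloooo!"))
print(cleaned)
```

Unlike indexing into the string, the regex version leaves empty and one-character phrases untouched instead of raising an error.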

## Model

In [6]:
class MostTimes:
    """Get the question with the most similar words"""

    def __init__(self, tokenizer: Tokenizer = Tokenizer()):
        self.tokenizer = tokenizer
        self.words = {}
        self.answers = {}

    def train(self, df: pd.DataFrame):
        """
        Train the model.
        
        Parameters
        ----------
        df : DataFrame
            Pandas DataFrame with a `question` column and an `answer`
            column holding the corresponding answers.

        """
        # Read the columns by name so any extra column (e.g. an index) is ignored
        for question, answer in zip(df["question"], df["answer"]):
            # Store the lowercased question for exact-match lookups
            self.answers[question.lower()] = answer
            # Index every word of the question by the answers it appears with
            for word in self.tokenizer.tokenize(question):
                self.words.setdefault(word, set()).add(answer)

    def get_answer(self, question: str):
        """
        Get most probable answer.
        
        Parameters
        ----------
        question : str
            Question to ask the bot model.
            
        Returns
        -------
        A string containing the answer.

        """
        # Keys in self.answers are lowercased in train(), so lowercase here too
        key = question.lower()
        if key in self.answers:
            return self.answers[key]
        words = self.tokenizer.tokenize(question)
        answers = []
        for w in words:
            if w in self.words:
                answers.extend(self.words[w])
        if len(answers) == 0:
            return "Mmmm, I don't know."
        c = Counter(answers)
        common_answers = c.most_common()
        top = common_answers[0][0]
        return top

It could be optimized, but for now let's keep it simple.
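
To make the voting step in `get_answer` concrete, here is a small standalone trace. It is only a sketch: the `word_to_answers` index is written by hand for illustration (in the notebook it is built by `train`), and `vote` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written toy index: each word maps to the answers of the questions
# that contain it (an assumption for illustration, not the real dataset)
word_to_answers = {
    "what": {"Antonio", "Mufasa"},
    "is": {"Antonio", "Mufasa"},
    "your": {"Antonio"},
    "my": {"Mufasa"},
    "name": {"Antonio", "Mufasa"},
}


def vote(tokens):
    """Count one vote per answer for every known word in the question."""
    votes = Counter()
    for token in tokens:
        # Unknown words contribute no votes
        votes.update(word_to_answers.get(token, ()))
    return votes


# Tokens of "Tell me your name.": only "your" and "name" are known
votes = vote(["tell", "me", "your", "name"])
print(votes.most_common(1)[0][0])
```

Here "Antonio" gets two votes ("your" and "name") and "Mufasa" gets one ("name"), so "Antonio" wins. Common words such as "what" or "is" vote for many answers at once, which is one reason the model can pick the wrong one.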

## Train the model

In [7]:
m = MostTimes()
m.train(train_df)

## Test the model

In [8]:
# Performance test
%timeit m.get_answer("How many years old are you?")

8.28 µs ± 18 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [9]:
# Accuracy test
n_correct = 0
test_df = pd.merge(var_df, train_df, on="question", how="left")
for _, row in test_df.iterrows():
    answer = m.get_answer(row.variation)
    if answer == row.answer:
        n_correct += 1
acc = n_correct / test_df.shape[0] * 100
print("accuracy: {:.2f}%".format(acc))

accuracy: 62.50%


The model is not very reliable yet: it answers only 62.50% of the variations correctly. To improve it, we would need more test data and further debugging of both the tokenizer and the model.
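
One way to start debugging is to list the variations that get a wrong answer. The sketch below is an assumption about how that could be done, not part of the notebook: `collect_errors` is a hypothetical helper, and a toy predictor stands in for the model so the block runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def collect_errors(
    pairs: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (variation, expected, predicted) for every wrong prediction."""
    errors = []
    for variation, expected in pairs:
        predicted = predict(variation)
        if predicted != expected:
            errors.append((variation, expected, predicted))
    return errors


# Toy stand-in for the model, only to make the sketch runnable by itself
toy_answers = {"hi": "Hello", "hello": "Hi"}


def toy_predict(question: str) -> str:
    return toy_answers.get(question.lower(), "Mmmm, I don't know.")


toy_pairs = [("Hi", "Hello"), ("Hiii", "Hello"), ("HELLO", "Hi")]
errors = collect_errors(toy_pairs, toy_predict)
for variation, expected, predicted in errors:
    print(f"{variation!r}: expected {expected!r}, got {predicted!r}")
```

In the notebook, the same helper could presumably be called as `collect_errors(zip(test_df.variation, test_df.answer), m.get_answer)` to see which variations fail and why.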

## Playground

In [10]:
print("Talk to Tony the bot (write ':q' to exit)")
print("-----------------------------------------")
exit_conditions = {":q", "quit", "exit"}
while True:
    question = input("> ")
    if question in exit_conditions:
        break
    else:
        print(f"🤖 {m.get_answer(question)}")

Talk to Tony the bot (write ':q' to exit)
-----------------------------------------
> Hellooo
🤖 Hi
> How old are you?
🤖 I'm 27.
> What did you study?
🤖 I studied Telecommunications Engineering at the UPM (Universidad Politéncica de Madrid)
> What's my name?
🤖 The most beautiful name I've ever heard: Mufasa.
> :q
