<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M2_chatbot_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple FAQ chatbot

![](https://source.unsplash.com/V5vqWC9gyEU)

This is a toy example, adapted from https://github.com/python-engineer/pytorch-chatbot.
For a more complex example, see e.g. https://github.com/MrJay10/banking-faq-bot. DataCamp also has some courses on the topic.

The data needed to build such a system is a collection of question-answer pairs, for instance from a company FAQ, with several variations of each question. For collections with many q-a pairs, like the banking FAQ, you can consider similarity-based approaches: the input text is matched against the most "similar" predefined questions, and the user then picks one when prompted ("did you mean xyz-question?").
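
A minimal sketch of such a similarity-based lookup is shown here. It uses plain token overlap (Jaccard similarity) as a stand-in for whatever similarity measure a real system would use, and the FAQ questions and the helper names (`tokenize`, `jaccard`, `did_you_mean`) are made up for illustration.

```python
def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    return {tok.strip("?.,!").lower() for tok in text.split() if tok.strip("?.,!")}


def jaccard(a, b):
    """Share of tokens two sets have in common (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def did_you_mean(query, faq_questions, top_n=2):
    """Return the top_n predefined questions most similar to the query."""
    q_tokens = tokenize(query)
    ranked = sorted(
        faq_questions,
        key=lambda q: jaccard(q_tokens, tokenize(q)),
        reverse=True,
    )
    return ranked[:top_n]


# hypothetical FAQ questions, for illustration only
faq_questions = [
    "How can I open a new account?",
    "How do I reset my password?",
    "Which cards do you accept?",
]

suggestions = did_you_mean("i forgot my password", faq_questions)
print("Did you mean:", suggestions)
```

The user would then confirm one of the suggestions, and the bot returns the answer stored for that question.
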
Alternatively, one could create paraphrased versions of questions for each question-answer pair to have more training examples. This can be done manually or 

The overall architecture is as follows:

*   Given a free text/prompt, predict which question is asked (intent)
*   Pick corresponding answer (e.g. random out of 2-3) to simulate dialogue

In this notebook, we first use TF-IDF with logistic regression, then standard spaCy vectors, and finally floret (newer spaCy vectors that combine fastText with an efficient compression scheme; including subword elements in the model helps overcome typos).


In [None]:
import pandas as pd
import numpy as np

In [None]:
# utils...
import json
import requests
import random

We will be using the medium-sized spaCy model here.

In [None]:
! spacy download en_core_web_md --quiet

In [None]:
import spacy #spacy for quick language prepro
nlp = spacy.load('en_core_web_md') #instantiating English module

In [None]:
from sklearn.pipeline import make_pipeline #pipeline creation
from sklearn.feature_extraction.text import TfidfVectorizer #transforms text to sparse matrix
from sklearn.linear_model import LogisticRegression #Logit model

We use spaCy for preprocessing.

In [None]:
def text_prepro(texts):
  """
  takes in a list/iterable of texts
  lemmatizes and lowercases each token
  drops tokens that are not purely alphabetic (numbers, punctuation)
  returns a list of cleaned strings
  """

  clean_container = []

  for text in nlp.pipe(texts, disable=["parser", "ner"]):

    txt = [token.lemma_.lower() for token in text # lemmatize and lower
          if token.is_alpha] # keep alphabetic tokens only (drops numbers and punctuation)

    clean_container.append(" ".join(txt))

  return clean_container

The data is a JSON file with questions and answers as well as intent tags.
We can open it from a local copy or fetch it from the remote repository with requests (only one of the two options is needed).
The reason for using JSON here is that the variations of questions and answers are independent of each other (n-to-n): e.g. you can have 2 ways to ask and 4 ways to answer for a specific issue.
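
To make the structure concrete, here is a small sketch of what one entry might look like. The keys `intents`, `tag`, `patterns` and `responses` are the ones the code in this notebook reads; the texts themselves are invented for illustration, and the real file contains more intents.

```python
import json
import random

# hypothetical miniature of the intents file (made-up texts)
raw = """
{
  "intents": [
    {
      "tag": "payments",
      "patterns": ["How can I pay?", "Do you take credit cards?"],
      "responses": [
        "We accept most major credit cards.",
        "You can pay by card or bank transfer.",
        "Card and bank transfer both work.",
        "Most major cards are fine."
      ]
    }
  ]
}
"""

sample = json.loads(raw)
intent = sample["intents"][0]

# 2 ways to ask, 4 ways to answer, linked only through the tag
print(intent["tag"], len(intent["patterns"]), len(intent["responses"]))
print(random.choice(intent["responses"]))
```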

In [None]:
# open local file
# data = json.load(open('chatbot-convo.json','r'))

In [None]:
# stream file from remote online
r = requests.get('https://github.com/aaubs/ds-master/raw/main/data/chatbot-convo.json')
data = json.loads(r.text)

For training, we need all questions and their labels in tabular format. We iterate over all intents in the data and zip each intent's questions with its label into a list of tuples. `zip` binds two lists into a list of tuples: `[a,b,c], [1,2,3] → [(a,1), (b,2), (c,3)]`.
We repeat the label, `l*[i['tag']]`, so that `len(i['patterns']) == len(l*[i['tag']])`.
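
A quick standalone illustration of the two building blocks, with made-up questions and a made-up tag:

```python
patterns = ["Hi", "Hello", "Good day"]   # hypothetical questions
tag = "greeting"                          # hypothetical intent label

labels = len(patterns) * [tag]            # repeat the label once per question
pairs = list(zip(patterns, labels))
print(pairs)
# [('Hi', 'greeting'), ('Hello', 'greeting'), ('Good day', 'greeting')]

# zip stops at the shorter input, so a label list that is too short drops questions
short = list(zip(patterns, 2 * [tag]))
print(short)
# [('Hi', 'greeting'), ('Hello', 'greeting')]
```

This is why the number of repeated labels has to match the number of questions of the same intent.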

In [None]:
training = []

for i in data['intents']:
  l = len(i['patterns']) # number of questions for this intent
  tuples = list(zip(i['patterns'], l*[i['tag']])) # pair each question with the intent label
  training.extend(tuples)


In [None]:
training_df = pd.DataFrame(training, columns=['txt','label'])

In [None]:
training_df['txt_p'] = text_prepro(training_df.txt) # bracket assignment creates a real column

In [None]:
#instantiate models and "bundle up as pipeline"

tfidf = TfidfVectorizer()
cls = LogisticRegression()

pipe = make_pipeline(tfidf, cls)

In [None]:
pipe.fit(training_df.txt_p, training_df.label)

In [None]:
pipe.predict(['tell me something funny'])

In [None]:
# let's give our bot a name
bot_name = '💬 Hugo'

User input triggers `reply`, which preprocesses the text, asks the model for a prediction (which class of question?), then iterates over all intents and, where it finds a matching tag, prints a random response from that class.

In [None]:
def reply(txt):
  clean_text = text_prepro([txt])
  tag = pipe.predict(clean_text)[0]
  for intent in data['intents']:
    if tag == intent["tag"]:
      print(f"{bot_name}: {random.choice(intent['responses'])}")


In [None]:
reply('Byes')

### Using SpaCy for vectorization
Using "traditional" spaCy embeddings from the medium web model.
When using spaCy, we rely on pretrained vectors, i.e. spaCy takes care of tokenization and vectorization internally. Thus, we can skip TF-IDF and go directly to model training.
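
By default, the vector spaCy returns for a whole text (`Doc.vector`) is the average of its token vectors. The sketch below mimics that averaging with made-up 3-dimensional vectors; `toy_vectors` and `text_vector` are illustrative names, not spaCy objects.

```python
# hypothetical 3-dimensional word vectors; the spaCy vectors used here are much wider
toy_vectors = {
    "how": [0.2, 0.0, 0.4],
    "can": [0.0, 0.2, 0.2],
    "pay": [0.4, 0.4, 0.0],
}


def text_vector(tokens, vectors):
    """Average the vectors of all tokens, dimension by dimension."""
    rows = [vectors[tok] for tok in tokens]
    return [sum(dim) / len(rows) for dim in zip(*rows)]


vec = text_vector(["how", "can", "pay"], toy_vectors)
print(vec)
```

Each text thus becomes one fixed-length row, which is what the classifier needs as input.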

In [None]:
# we use again logistic regression
model_spacy = LogisticRegression()

In [None]:
training_df.head(10)

In [None]:
# we grab the vectors for all texts and stack them into a matrix
X_train = np.vstack([txt.vector for txt in nlp.pipe(training_df.txt, disable=["parser", "ner"])])

In [None]:
# quick illustration of the vectors (not really part of the pipeline)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(X_train)[0]
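
For reference, cosine similarity compares the direction of two vectors: the dot product divided by the product of their lengths. A plain-Python sketch of the formula for a single pair of vectors (the function name `cosine` is just for this example):

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
# 1.0 0.0
```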

In [None]:
# training
model_spacy.fit(X_train, training_df.label)

In [None]:
bot_name = '💬 SpaCy Hugo'

In [None]:
def reply_spacy(txt):
  tag = model_spacy.predict([nlp(txt).vector])[0]
  for intent in data['intents']:
    if tag == intent["tag"]:
      print(f"{bot_name}: {random.choice(intent['responses'])}")

In [None]:
reply_spacy('how can I pay?')

### Trying out Floret 🌸
In 2017 Facebook introduced [fastText](https://fasttext.cc/), which includes subword elements in the model. In April 2022 Explosion (explosion.ai) presented Bloom embeddings, an elegant way to reduce embedding size. Later, in August 2022, they introduced [floret](https://explosion.ai/blog/floret-vectors), an approach that joins fastText with Bloom embeddings (not related to https://huggingface.co/bigscience/bloom).
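
The size reduction comes from hashing: instead of storing one row per word or subword, every key is hashed to a few rows of a small shared table and those rows are combined. The following is a toy illustration of that idea only. The table values, the MD5-based hashing and all names (`N_ROWS`, `rows_for`, `bloom_vector`) are assumptions for this sketch; floret's real hash function, table size and training procedure differ.

```python
import hashlib

N_ROWS = 8      # toy table size; real tables are far larger
DIM = 3         # toy vector width
N_HASHES = 2    # each key is mapped to this many rows

# toy table with made-up values: row r holds [r, r + 0.5, r + 1.0]
table = [[float(r), r + 0.5, r + 1.0] for r in range(N_ROWS)]


def rows_for(key):
    """Pick N_HASHES row indices for a key using salted MD5 hashes."""
    return [
        int(hashlib.md5(f"{seed}:{key}".encode("utf-8")).hexdigest(), 16) % N_ROWS
        for seed in range(N_HASHES)
    ]


def bloom_vector(key):
    """Combine (here: sum) the rows a key hashes to; no per-key row is stored."""
    picked = [table[r] for r in rows_for(key)]
    return [sum(dim) for dim in zip(*picked)]


for key in ["<univ", "unive", "ties>"]:
    print(key, rows_for(key), bloom_vector(key))
```

Any string gets a vector this way, including subword pieces of words that were never seen during training.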

In [None]:
# get floret installed
! python -m pip install floret 'spacy~=3.4.0' --quiet

In [None]:
# download floret vectors
! wget -nc https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_md.floret.gz

In [None]:
# let spaCy initialize a pipeline from the downloaded vectors
! spacy init vectors en en_vectors_floret_md.floret.gz en_vectors_floret_md --mode floret

In [None]:
# This is the spaCy pipeline with floret vectors
nlp_fl = spacy.load("en_vectors_floret_md")

How are these vectors different from standard spaCy vectors?
The difference lies in OOV (out-of-vocabulary) words. Such words occur with jargon or typos. Standard spaCy vectors cover a vocabulary of tens of thousands of words. However, not everything can be covered.

In [None]:
word_1 = nlp.vocab["univercities"]
word_2 = nlp.vocab["universities"]

word_1.similarity(word_2)

In [None]:
# typo --> 0-vector
word_1.vector

Floret vectors also rely on subword pieces such as
`'<univ', 'unive', 'niver', 'iverc', 'verci', 'ercit', 'rciti', 'citie', 'ities', 'ties>'` and so get the typo closer to the right word...
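
The pieces listed above are simply all character 5-grams of the word once it is wrapped in `<` and `>` boundary markers. A small sketch that reproduces them (`char_ngrams` is a helper written for this example; which n-gram lengths the pretrained vectors actually use is a training setting and is assumed to be 5 here):

```python
def char_ngrams(word, n=5):
    """All character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


typo = char_ngrams("univercities")
correct = char_ngrams("universities")

print(typo)
# ['<univ', 'unive', 'niver', 'iverc', 'verci', 'ercit', 'rciti', 'citie', 'ities', 'ties>']

# subword pieces shared by the typo and the correct spelling
shared = [g for g in typo if g in correct]
print(shared)
# ['<univ', 'unive', 'niver', 'ities', 'ties>']
```

Half of the typo's pieces also occur in the correct spelling, which is why the two vectors can end up close to each other.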

In [None]:
word_1 = nlp_fl.vocab["univercities"]
word_2 = nlp_fl.vocab["universities"]

word_1.similarity(word_2)

In [None]:
word_1.vector

In [None]:
X_train = np.vstack([txt.vector for txt in nlp_fl.pipe(training_df.txt, disable=["parser", "ner"])])

In [None]:
model_fl = LogisticRegression()

In [None]:
model_fl.fit(X_train, training_df.label)

In [None]:
bot_name = '🌸 Floret Hugo'

In [None]:
def reply_fl(txt):
  tag = model_fl.predict([nlp_fl(txt).vector])[0]
  for intent in data['intents']:
    if tag == intent["tag"]:
      print(f"{bot_name}: {random.choice(intent['responses'])}")

In [None]:
reply_fl('when do I get stuff?')

## Run Chatbot command-line style

In [None]:
bot_name = "🌸 Floret Hugo"
your_name = "🎃 RJ"


print("Let's chat! (type 'quit' to exit)")
while True:
    # sentence = "do you use credit cards?"
    sentence = input(f"{your_name}: ")
    if sentence == "quit":
        break
    reply_fl(sentence)

### Chatbot app with streamlit:

There is an extension that makes it possible to implement a chatbot in Streamlit:
https://ai-yash-st-chat-exampleschatbot-fkuecs.streamlitapp.com/

Example code can be found here:
https://github.com/AI-Yash/st-chat/blob/main/examples/chatbot.py

The bot in the example runs a rather fancy, online-deployed model by Facebook that evaluates the whole dialogue rather than individual sentences. You would also need to understand session state and a few other things to run it.

## On vectors

Wait, hold on, where do such vectors come from? They are pretrained from large text collections in a "self-supervised" fashion. The idea comes from [Word2Vec](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html) and related approaches, popular since 2013: take text, i.e. sentences, and challenge a model to learn context, e.g. predict a hidden ("masked") word from the words around it ⇒ Copenhagen is the [MASK] of Denmark.
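
"Self-supervised" means the training examples are generated from the raw text itself, with no manual labels. A minimal sketch of how such (context, target) examples could be cut out of a sentence, in the spirit of Word2Vec's continuous bag-of-words setup; the window size and the helper name `context_target_pairs` are choices made for this illustration:

```python
def context_target_pairs(tokens, window=2):
    """Return (context words, target word) pairs, one for every position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


sentence = "copenhagen is the capital of denmark".split()
pairs = context_target_pairs(sentence)

for context, target in pairs:
    print(context, "->", target)
```

A model trained to guess the target from its context has to place words that appear in similar contexts close together, and those learned positions are the word vectors.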