<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M2_chatbot_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple FAQ chatbot

![](https://source.unsplash.com/V5vqWC9gyEU)

This is a variation of https://github.com/python-engineer/pytorch-chatbot and is meant as a toy example.
For a more complex example, check out e.g. https://github.com/MrJay10/banking-faq-bot. DataCamp has some courses on this topic, too.

The data needed for building such a system is, for instance, a collection of a company's FAQ with variations for each question-answer pair. For cases with many question-answer pairs, like the banking FAQ, you can consider similarity-based approaches, where the input text is matched with the most "similar" predefined questions and the user then picks one on request ("did you mean xyz-question?").
Alternatively, one could create paraphrased versions of questions for each question-answer pair to have more training examples. This can be done manually or 
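
The similarity-based matching mentioned above could look roughly like the sketch below. It uses only the standard library; the FAQ questions, the `suggest` helper and the bag-of-words cosine score are illustrative assumptions and not part of this notebook. A real system would more likely rely on TF-IDF or embedding vectors.

```python
import math
import re
from collections import Counter

# Hypothetical predefined FAQ questions (not taken from the notebook's data)
faq_questions = [
    "How can I pay for my order?",
    "When will my order be delivered?",
    "How do I return an item?",
]

def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def suggest(user_text, questions, top_k=2):
    """Return the top_k predefined questions most similar to user_text."""
    query = bag_of_words(user_text)
    ranked = sorted(questions, key=lambda q: cosine(query, bag_of_words(q)), reverse=True)
    return ranked[:top_k]

for candidate in suggest("when is my order delivered", faq_questions):
    print(f"Did you mean: {candidate}")
```

The user would then confirm one of the suggested questions, and the bot returns the answer stored for it.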

The overall architecture is as follows:

*   Given a free text/prompt, predict which question is asked (intent)
*   Pick corresponding answer (e.g. random out of 2-3) to simulate dialogue

In this notebook, we first use TF-IDF with logistic regression, then standard spaCy vectors, and finally floret (newer spaCy vectors that combine fastText with an efficient compression scheme and help overcome typos by including subword elements in the model).


In [None]:
!pip install gradio -q

In [1]:
import pandas as pd
import numpy as np
import gradio as gr

In [2]:
# utils...
import json
import requests
import random
import time


We will be using the medium-sized spaCy model here.

In [3]:
! spacy download en_core_web_md --quiet

2022-11-01 13:50:30.589015: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [4]:
import spacy #spacy for quick language prepro
nlp = spacy.load('en_core_web_md') #instantiating English module

In [5]:
from sklearn.pipeline import make_pipeline #pipeline creation
from sklearn.feature_extraction.text import TfidfVectorizer #transforms text to sparse matrix
from sklearn.linear_model import LogisticRegression #Logit model

We use spaCy for preprocessing.

In [6]:
def text_prepro(texts):
  """
  Takes a list/iterable of texts,
  keeps alphabetic tokens only (drops numbers and punctuation),
  and returns lowercased lemmas joined into one string per text.
  """

  clean_container = []

  for text in nlp.pipe(texts, disable=["parser", "ner"]):

    txt = [token.lemma_.lower() for token in text # lemmatize and lowercase
          if token.is_alpha # keep alphabetic tokens only (drops numbers)
          and not token.is_punct] # remove punctuation

    clean_container.append(" ".join(txt))
  
  return clean_container

The data is a JSON file with questions and answers as well as "intent tags".
We can open it from a local file or grab it from a remote location with requests (only one of the two options is needed).
The reason for using JSON here is that the variations of questions and answers are independent in number and linking (n-n), e.g. you can have 2 ways to ask and 4 ways to answer for a specific issue.
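
To make the structure concrete, here is a small hypothetical excerpt. The key names (`intents`, `tag`, `patterns`, `responses`) follow the code used later in this notebook; the response texts are made up for illustration and are not the actual file contents.

```python
import json

# Hypothetical excerpt: key names follow the notebook's code,
# the response texts are invented for illustration.
raw = """
{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["Hi", "Hey", "Hello"],
      "responses": ["Hello!", "Hi there, how can I help?"]
    },
    {
      "tag": "goodbye",
      "patterns": ["Bye", "See you later"],
      "responses": ["Goodbye!", "See you soon", "Take care", "Bye, come back anytime"]
    }
  ]
}
"""

example = json.loads(raw)

# The number of patterns and the number of responses per intent are independent
for intent in example["intents"]:
    print(intent["tag"], len(intent["patterns"]), "patterns,", len(intent["responses"]), "responses")
```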

In [None]:
# open local file
# with open('chatbot-convo.json', 'r') as f:
#     data = json.load(f)

In [7]:
# fetch the file from the remote repository
r = requests.get('https://github.com/aaubs/ds-master/raw/main/data/chatbot-convo.json')
r.raise_for_status() # stop early if the download failed
data = json.loads(r.text)

For training, we need all questions and their labels in tabular format. We iterate over all intents in the data and zip each pattern with its intent label into a list of tuples. `zip` binds two lists into a list of pairs: `[a,b,c], [1,2,3] → [(a,1), (b,2), (c,3)]`.
Because an intent has several patterns but only one tag, we repeat the label with `l*[i['tag']]` so that `len(i['patterns']) == len(l*[i['tag']])`.
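
As a quick illustration of the two building blocks, here is a minimal sketch on toy data (the `patterns` and `tag` values are chosen for the example):

```python
patterns = ["Hi", "Hey", "Hello"]   # several ways of asking (toy data)
tag = "greeting"                    # one intent label for all of them

# Repeat the label so that both lists have the same length
labels = len(patterns) * [tag]

# zip pairs the elements position by position
pairs = list(zip(patterns, labels))
print(pairs)

# An equivalent list comprehension, if you prefer to avoid the repeated list
pairs_alt = [(p, tag) for p in patterns]
```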

In [18]:
training = []

for i in data['intents']:
  l = len(i['patterns'])
  tuples = list(zip(i['patterns'], l*[i['tag']])) 
  training.extend(tuples)


In [19]:
training_df = pd.DataFrame(training, columns=['txt','label'])

In [None]:
training_df

In [21]:
# create the column with [] so pandas actually adds it to the DataFrame
training_df['txt_p'] = text_prepro(training_df['txt'])


In [22]:
#instantiate models and "bundle up as pipeline"

tfidf = TfidfVectorizer()
cls = LogisticRegression()

pipe = make_pipeline(tfidf, cls)

In [23]:
pipe.fit(training_df.txt_p, training_df.label)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('logisticregression', LogisticRegression())])

In [24]:
pipe.predict(['tell me something funny'])

array(['funny'], dtype=object)

In [25]:
# let's give our bot a name
bot_name = '💬 Hugo'

An input triggers `reply`, which preprocesses the text and asks the model for a prediction (which class of question is this?). It then iterates over all intents and, where the tag matches, prints a random response from that class.

In [31]:
def reply(txt):
  clean_text = text_prepro([txt])
  tag = pipe.predict(clean_text)[0]
  for intent in data['intents']:
    if tag == intent["tag"]:
      print(f"{bot_name}: {random.choice(intent['responses'])}")


In [40]:
reply('Tell me a joke')

💬 Hugo: Why did the hipster burn his mouth? He drank the coffee before it was cool.


### Using SpaCy for vectorization
Here we use the "traditional" spaCy embeddings from the medium web model.
With spaCy we rely on pretrained vectors, i.e. spaCy handles tokenization and vectorization internally. Thus, we can skip TF-IDF and go directly to model training.

In [38]:
# we use again logistic regression
model_spacy = LogisticRegression()

In [41]:
training_df[['txt', 'label']].head(10)

| | txt | label |
|---|---|---|
| 0 | Hi | greeting |
| 1 | Hey | greeting |
| 2 | How are you | greeting |
| 3 | Is anyone there? | greeting |
| 4 | Hello | greeting |
| 5 | Good day | greeting |
| 6 | Bye | goodbye |
| 7 | See you later | goodbye |
| 8 | Goodbye | goodbye |
| 9 | Thanks | thanks |


In [42]:
# we grab the vectors for all texts and stack them into a matrix
X_train = np.vstack([txt.vector for txt in nlp.pipe(training_df.txt, disable=["parser", "ner"])])

In [None]:
# quick illustration of the vectors (not really part of the pipeline)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(X_train)[0]

In [49]:
# training
model_spacy.fit(X_train, training_df.label)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [50]:
bot_name = '💬 SpaCy Hugo'

In [51]:
def reply_spacy(txt):
  tag = model_spacy.predict([nlp(txt).vector])[0]
  for intent in data['intents']:
    if tag == intent["tag"]:
      print(f"{bot_name}: {random.choice(intent['responses'])}")

In [52]:
reply_spacy('how can I pay?')

💬 SpaCy Hugo: We accept VISA, Mastercard and Paypal


### Trying out Floret 🌸
In 2017, Facebook introduced [fastText](https://fasttext.cc/), which includes subword elements in the model creation. In April 2022, Explosion (explosion.ai) presented Bloom embeddings, an elegant way to reduce embedding size. Later, in August 2022, they introduced [floret](https://explosion.ai/blog/floret-vectors), an approach that joins fastText with Bloom embeddings (not related to https://huggingface.co/bigscience/bloom).

In [53]:
# get floret installed
! python -m pip install floret 'spacy~=3.4.0' --quiet

In [54]:
# download floret vectors
! wget -nc https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_md.floret.gz


In [55]:
# let spaCy convert the vectors into a loadable pipeline
! spacy init vectors en en_vectors_floret_md.floret.gz en_vectors_floret_md --mode floret

2022-11-01 14:23:41.056906: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
ℹ Creating blank nlp object for language 'en'
[2022-11-01 14:23:42,080] [INFO] Reading vectors from en_vectors_floret_md.floret.gz
50000it [00:03, 13114.75it/s]
[2022-11-01 14:23:45,982] [INFO] Loaded vectors from en_vectors_floret_md.floret.gz
✔ Successfully converted 50000 vectors
✔ Saved nlp object with vectors to output directory. You can now use
the path to it in your config as the 'vectors' setting in [initialize].
/content/en_vectors_floret_md


In [56]:
# This is the spaCy pipeline with floret vectors
nlp_fl = spacy.load("en_vectors_floret_md")

How are these vectors different from the standard spaCy ones?
The difference lies in OOV (out-of-vocabulary) words. Such words occur with jargon or typos. The standard spaCy vectors cover tens of thousands of words, but they cannot cover everything.

In [57]:
word_1 = nlp.vocab["univercities"]
word_2 = nlp.vocab["universities"]

word_1.similarity(word_2)



0.0

In [58]:
# typo --> 0-vector
word_1.vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

Floret vectors also rely on subword pieces such as
`'<univ', 'unive', 'niver', 'iverc', 'verci', 'ercit', 'rciti', 'citie', 'ities', 'ties>'`, so the misspelled word still gets a vector close to that of the correct one.
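
The list of subword pieces above can be reproduced with a few lines of plain Python. This is only a sketch of the idea: the `char_ngrams` helper is hypothetical, and it uses a single n-gram length of 5, whereas fastText and floret work with a range of lengths and hash the n-grams into a fixed-size table.

```python
def char_ngrams(word, n=5):
    """Return all character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

typo = char_ngrams("univercities")
correct = char_ngrams("universities")

print(typo)

# n-grams shared by the misspelled and the correctly spelled word
print(sorted(set(typo) & set(correct)))
```

Half of the 5-grams of the misspelled word also occur in the correct spelling, which is why a subword-based model can still place the two words close to each other.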

In [59]:
word_1 = nlp_fl.vocab["univercities"]
word_2 = nlp_fl.vocab["universities"]

word_1.similarity(word_2)

0.7986329197883606

In [None]:
word_1.vector

In [61]:
X_train = np.vstack([txt.vector for txt in nlp_fl.pipe(training_df.txt, disable=["parser", "ner"])])

In [62]:
model_fl = LogisticRegression()

In [63]:
model_fl.fit(X_train, training_df.label)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [65]:
bot_name = '🌸 Floret Jeppe'

In [66]:
def reply_fl(txt):
  tag = model_fl.predict([nlp_fl(txt).vector])[0]
  for intent in data['intents']:
    if tag == intent["tag"]:
      print(f"{bot_name}: {random.choice(intent['responses'])}")

In [69]:
reply_fl('which day does my stuff arrive?')

🌸 Floret Jeppe: Delivery takes 2-4 days


## Run Chatbot command-line style

In [None]:
bot_name = "🌸 Floret Hugo"
your_name = "🎃 RJ"


print("Let's chat! (type 'quit' to exit)")
while True:
    # sentence = "do you use credit cards?"
    sentence = input(f"{your_name}: ")
    if sentence == "quit":
        break
    reply_fl(sentence)

In [None]:
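bot_name = "🌸 Floret Hugo"

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.ClearButton([msg, chatbot])

    def respond(message, chat_history):
        # reply_fl only prints its answer and returns None,
        # so the response text is looked up here for the chat window
        tag = model_fl.predict([nlp_fl(message).vector])[0]
        responses = next(i["responses"] for i in data["intents"] if i["tag"] == tag)
        bot_message = random.choice(responses)
        chat_history.append((message, bot_message))
        time.sleep(2)  # short pause before the answer appears
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])

demo.launch()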

### Chatbot app with streamlit:

There is an extension that allows you to implement a chatbot in Streamlit:
https://ai-yash-st-chat-exampleschatbot-fkuecs.streamlitapp.com/

Example code can be found here:
https://github.com/AI-Yash/st-chat/blob/main/examples/chatbot.py

The bot in the example runs a rather fancy, online-deployed model by Facebook that evaluates the whole dialogue rather than individual sentences. You would also need to understand session state and a few other things to run it.

## On vectors

Wait, hold on, where do such vectors come from? They are pretrained from large text collections in a "self-supervised" fashion. The idea comes from [Word2Vec](https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html) and related approaches, popular since 2013: take text, i.e. sentences, and challenge a model to learn context, e.g. predict a hidden ("masked") word from the words around it ⇒ Copenhagen is the [MASK] of Denmark.
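
To see why this is called "self-supervised", note that the training examples can be generated from raw text without any manual labels. The sketch below builds (context, target) pairs in the spirit of Word2Vec's context-window idea; the `context_target_pairs` helper and the window size are assumptions made for illustration, not the actual Word2Vec training code.

```python
def context_target_pairs(tokens, window=2):
    """For each position, hide the token and keep up to `window` neighbours on each side as context."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

sentence = "Copenhagen is the capital of Denmark".split()

for context, target in context_target_pairs(sentence):
    print(context, "->", target)
```

A model trained to predict the target from its context has to place words that occur in similar contexts close to each other, and those learned representations are the word vectors.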