# Build a Question Answering App with Denser Retriever

<a target="_blank" href="https://colab.research.google.com/github/denser-org/denser-retriever/blob/develop/tutorials/question_answering/question_answering.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook illustrates how to build a question answering app from scratch using [DenserRetriever](https://github.com/denser-org/denser-retriever). DenserRetriever is an enterprise-grade AI retriever designed to streamline AI integration into your applications, ensuring cutting-edge accuracy.

## Preparations

### Install Dependencies

First, install the `denser-retriever` Python library.

In [None]:
%pip install denser-retriever
%pip install grpcio==1.60.1
%pip install grpcio-tools==1.60.1

### Prepare the Data

This demo uses a subset of the [InsuranceQA Corpus](https://github.com/shuzi/insuranceQA) containing 1,000 question-answer pairs. You can download it from [GitHub](https://github.com/towhee-io/examples/releases/download/data/question_answer.csv).

In [None]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/question_answer.csv -O

**question_answer.csv**: a file containing the questions and their answers.

Let's take a quick look:

In [None]:
import pandas as pd

df = pd.read_csv('question_answer.csv')
df.head()

| Unnamed: 0 | id | question | answer |
|---|---|---|---|
| 0 | 0 | Is Disability Insurance Required By Law? | Not generally. There are five states that requ... |
| 1 | 1 | Can Creditors Take Life Insurance After ... | If the person who passed away was the one with... |
| 2 | 2 | Does Travelers Insurance Have Renters Ins... | One of the insurance carriers I represent is T... |
| 3 | 3 | Can I Drive A New Car Home Without Ins... | Most auto dealers will not let you drive the c... |
| 4 | 4 | Is The Cash Surrender Value Of Life Ins... | Cash surrender value comes only with Whole Lif... |


Download the XGBoost model file:

In [None]:
!curl -L https://raw.githubusercontent.com/denser-org/denser-retriever/main/experiments/models/msmarco_xgb_es%2Bvs%2Brr_n.json -O

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  410k  100  410k    0     0   902k      0 --:--:-- --:--:-- --:--:--  900k


Download the config file:

In [None]:
!curl -L https://raw.githubusercontent.com/denser-org/denser-retriever/develop/tutorials/question_answering/config.yaml -O

### Start Elasticsearch instance

Elasticsearch and Milvus are required to run the Denser Retriever. They support keyword search and vector search, respectively.

#### Elasticsearch

Download Elasticsearch.

In [None]:
%%bash

rm -rf elasticsearch-8.14.*
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.3-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.3-linux-x86_64.tar.gz.sha512
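# Verify the download against its checksum before unpacking it
shasum -a 512 -c elasticsearch-8.14.3-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-8.14.3-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-8.14.3/
umount /sys/fs/cgroup
# -y answers the confirmation prompt, since the cell has no interactive input
apt install -y cgroup-tools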

Disable security settings:

In [None]:
!echo "xpack.security.enabled: false" >> elasticsearch-8.14.3/config/elasticsearch.yml

Start Elasticsearch in the background as the `daemon` user.

In [None]:
%%bash --bg

sudo -H -u daemon ./elasticsearch-8.14.3/bin/elasticsearch

Once the instance has started, grep the process list for `elastic` to confirm that it is running.

In [None]:
# This pause is important: the instance needs a little time to finish loading
import time
time.sleep(20)

In [None]:
%%bash

ps -ef | grep elastic

root       11590   11588  0 12:41 ?        00:00:00 sudo -H -u daemon ./elasticsearch-8.14.3/bin/ela
daemon     11591   11590 10 12:41 ?        00:00:05 /content/elasticsearch-8.14.3/jdk/bin/java -Xms4
daemon     11661   11591 99 12:41 ?        00:00:47 /content/elasticsearch-8.14.3/jdk/bin/java -Des.
daemon     11697   11661  0 12:41 ?        00:00:00 /content/elasticsearch-8.14.3/modules/x-pack-ml/
root       11929   11927  0 12:42 ?        00:00:00 grep elastic
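
A fixed sleep followed by a process check confirms that Elasticsearch was launched, but not that it is ready to serve requests. As a minimal sketch, the helper below polls an HTTP endpoint until it responds. The name `wait_for_http` is hypothetical and not part of Denser Retriever. The URL in the usage note assumes Elasticsearch listens on its default port 9200 over plain HTTP, which matches this setup because security is disabled.

```python
import http.client
import time
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timed out, or malformed reply: not ready yet
            pass
        time.sleep(interval)
    return False
```

For example, `wait_for_http("http://localhost:9200")` returns `True` as soon as the node answers, and `False` if it does not come up within a minute.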


## Build a Denser Retriever

In this section, we build the question answering engine with DenserRetriever. The basic idea is to split the question dataset into passages with LangChain text splitting, index those passages for keyword search in Elasticsearch and as embeddings for vector search in Milvus, and then retrieve the passages that best match the input question.
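
To make the "compare the input question with stored embeddings" step concrete, here is a toy sketch in plain Python. It is not how DenserRetriever works internally: a bag-of-words count vector stands in for a real embedding model, an in-memory list stands in for Milvus, and the two insurance passages other than the first are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, passages: list, k: int = 1) -> list:
    """Return the k (score, passage) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical mini-corpus; only the first entry comes from the dataset
passages = [
    "Is Disability Insurance Required By Law?",
    "How do I file a car insurance claim?",
    "What does renters insurance cover?",
]
best_score, best_passage = top_k("is disability insurance mandatory by law", passages)[0]
print(best_passage, round(best_score, 3))
```

A real retriever replaces `embed` with a learned model and the linear scan with an index, but the ranking-by-similarity idea is the same.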

### Generate passages

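We first generate passages from the question dataset: load the CSV (one document per row), split each document into chunks of at most 500 characters with a 100-character overlap, and save the chunks to `passages.jsonl`. DenserRetriever generates the embeddings for these passages and stores them in Milvus in the next step.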

In [None]:
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from denser_retriever.utils import save_HF_docs_as_denser_passages

# Generate text chunks
documents = CSVLoader("question_answer.csv").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
passage_file = "passages.jsonl"
save_HF_docs_as_denser_passages(texts, passage_file, 0)
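
Before ingesting, it can help to look at what was written. The sketch below assumes, based on the `.jsonl` extension, that `passages.jsonl` holds one JSON object per line; it makes no assumption about the field names. `peek_jsonl` is a hypothetical helper, not part of Denser Retriever.

```python
import json
import os
from itertools import islice


def peek_jsonl(path: str, n: int = 3) -> list:
    """Parse the first `n` lines of a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n) if line.strip()]


# Only runs once the passage file from the previous cell exists
if os.path.exists("passages.jsonl"):
    for record in peek_jsonl("passages.jsonl"):
        print(record)
```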

### Ingest by DenserRetriever

Now we can build a DenserRetriever with the generated `passages.jsonl`.



In [None]:
from denser_retriever.retriever_general import RetrieverGeneral


retriever_denser = RetrieverGeneral("question_answer", "config.yaml")
retriever_denser.ingest("passages.jsonl")



2024-07-11 12:43:10 INFO: ES analysis default
2024-07-11 12:43:13 INFO: ES ingesting passages.jsonl record 2087
2024-07-11 12:43:14 INFO: Done building ES index


### Query test

Now that the question dataset has been ingested by DenserRetriever, we can ask a question with the `retrieve` function.

In [None]:
import json
from denser_retriever.retriever_general import RetrieverGeneral

retriever_denser = RetrieverGeneral("question_answer", "config.yaml")

query = "Is Disability Insurance Required By Law?"
passages, docs = retriever_denser.retrieve(query, {})
print(json.dumps(passages[0], indent=4))



2024-07-11 13:04:56 INFO: ElasticSearch passages: 100 time: 0.082 sec. 
2024-07-11 13:04:56 INFO: Rerank time: 0.217 sec.
{
    "source": "question_answer.csv",
    "text": "id: 0\nquestion: Is  Disability  Insurance  Required  By  Law?\nanswer: Not generally. There are five states that require most all employers carry short term disability insurance on their employees. These states are: California, Hawaii, New Jersey, New York, and Rhode Island. Besides this mandatory short term disability law, there is no other legislative imperative for someone to purchase or be covered by disability insurance.",
    "title": "",
    "pid": 0,
    "score": 13.668355670776368
}
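
The `text` field above packs the CSV columns into `key: value` lines. As a minimal sketch, the hypothetical helper `parse_passage_text` below splits such a string back into its fields. It assumes the `id`, `question`, and `answer` labels seen in the output, and it omits any field that is missing, which can happen when a chunk starts partway through a long answer. The sample string is a shortened version of the passage shown above.

```python
def parse_passage_text(text: str) -> dict:
    """Split 'id: ...', 'question: ...', 'answer: ...' lines into a dict."""
    known_keys = ("id", "question", "answer")
    fields = {}
    current = None
    for line in text.split("\n"):
        key, sep, value = line.partition(": ")
        if sep and key in known_keys:
            # A new labelled field starts on this line
            current = key
            fields[key] = value.strip()
        elif current is not None:
            # Continuation of a multi-line value
            fields[current] += "\n" + line
    return fields


sample = "id: 0\nquestion: Is Disability Insurance Required By Law?\nanswer: Not generally."
parsed = parse_passage_text(sample)
print(parsed["answer"])
```

With the retriever from this section, `parse_passage_text(passages[0]["text"]).get("answer")` would return the answer text, or `None` when the chunk carries no `answer:` label.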


## Integrate in chat

Now we can add our retriever into a chatbot. We will use `towhee` and `gradio` to launch a simple chatbot.

In [None]:
!pip install towhee gradio

In [None]:
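import gradio as gr
from towhee import pipe

# Build the pipeline once: question -> (passages, docs) -> passages
ans_pipe = (
    pipe.input('question')
        .map('question', 'res', lambda x: retriever_denser.retrieve(x, {}))
        .map('res', 'answer', lambda x: x[0])
        .output('question', 'answer')
)

def chat(message, history):
    # get() returns [question, passages]; passages are ranked best-first
    passages = ans_pipe(message).get()[1]
    if not passages:
        yield "Sorry, I could not find an answer."
        return
    text = passages[0]['text']
    # Keep what follows 'answer:'; fall back to the whole chunk if the label is missing
    _, found, answer = text.partition('answer:')
    yield answer.strip() if found else text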

# chat("Is Disability Insurance Required By Law?", [])

chatbot = gr.ChatInterface(chat).queue()

if __name__ == "__main__":
    chatbot.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://91e8c262c01b49c6b1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
