# Build a Question Answering Engine in Minutes

This notebook shows how to build a question answering engine from scratch using [Milvus](https://milvus.io/) and [Towhee](https://towhee.io/). Milvus is an open-source vector database built for AI applications that supports nearest-neighbor search across tens of millions of entries, and Towhee is a framework that provides ETL for unstructured data using state-of-the-art (SoTA) machine learning models.

We will walk through the question answering procedure and evaluate its performance. With Towhee, the core functionality comes down to about 10 lines of code, so you can start hacking on your own question answering engine.

## Preparation

### Install dependencies

First we need to install the dependencies: towhee, towhee.models and gradio.

In [1]:
! python -m pip install -q towhee towhee.models gradio


### Prepare the data

This demo uses a subset of the [InsuranceQA Corpus](https://github.com/shuzi/insuranceQA) (1000 question-answer pairs), which anyone can download from [GitHub](https://github.com/towhee-io/examples/releases/download/data/question_answer.csv).

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/question_answer.csv -O

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  595k  100  595k    0     0   602k      0 --:--:-- --:--:-- --:--:--  602k


**question_answer.csv**: a file containing the questions and their answers.


Let's take a quick look:

In [3]:
import pandas as pd

df = pd.read_csv('question_answer.csv')
df.head()

|   | id | question | answer |
|---|----|----------|--------|
| 0 | 0 | Is Disability Insurance Required By Law? | Not generally. There are five states that requ... |
| 1 | 1 | Can Creditors Take Life Insurance After ... | If the person who passed away was the one with... |
| 2 | 2 | Does Travelers Insurance Have Renters Ins... | One of the insurance carriers I represent is T... |
| 3 | 3 | Can I Drive A New Car Home Without Ins... | Most auto dealers will not let you drive the c... |
| 4 | 4 | Is The Cash Surrender Value Of Life Ins... | Cash surrender value comes only with Whole Lif... |


To use the dataset to look up answers, let's first define a dictionary:

- `id_answer`: a dictionary that maps each id to its corresponding answer

In [4]:
id_answer = df.set_index('id')['answer'].to_dict()
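
As a rough illustration of what `id_answer` holds, the sketch below builds the same kind of mapping with only the standard library. The two rows are the truncated previews from the `df.head()` output, embedded as an in-memory CSV to keep the example self-contained; `SAMPLE_CSV`, `build_id_answer` and `id_answer_sample` are hypothetical names. Note that `csv` yields ids as strings while pandas parses them as integers, so the keys are converted with `int(...)` here.

```python
import csv
import io

# Two preview rows copied from the df.head() output (the answers are truncated there too).
SAMPLE_CSV = """id,question,answer
0,Is Disability Insurance Required By Law?,Not generally. There are five states that requ...
1,Can Creditors Take Life Insurance After ...,If the person who passed away was the one with...
"""

def build_id_answer(csv_text):
    """Map each integer id to its answer text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["id"]): row["answer"] for row in reader}

id_answer_sample = build_id_answer(SAMPLE_CSV)
print(sorted(id_answer_sample))   # the integer ids that can be looked up
print(id_answer_sample[0])        # answer text stored for id 0
```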

### Create a Milvus collection

Before getting started, please make sure you have a running [Milvus service](https://milvus.io/docs/install_standalone-docker.md). This notebook uses [milvus 2.2.10](https://milvus.io/docs/v2.2.x/install_standalone-docker.md) and [pymilvus 2.2.11](https://milvus.io/docs/release_notes.md#2210).

In [5]:
! python -m pip install -q pymilvus==2.2.11

Next, we define the function `create_milvus_collection`, which creates a collection in Milvus that uses the [L2 distance metric](https://milvus.io/docs/metric.md#Euclidean-distance-L2) and an [IVF_FLAT index](https://milvus.io/docs/index.md#IVF_FLAT).

In [26]:
import os
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility, MilvusClient

# For a local Milvus instance, connect by host and port instead:
# connections.connect(host='172.17.0.1', port='19530')

# connections.connect() returns None, so its result is not assigned.
# The API token is read from the MILVUS_TOKEN environment variable (an assumed name)
# instead of being hard-coded in the notebook.
connections.connect(
    uri='https://in03-244ceba7647c848.api.gcp-us-west1.zillizcloud.com',
    token=os.environ['MILVUS_TOKEN'],
)


In [27]:
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection, utility


def create_milvus_collection(collection_name, dim):
    # Start from a clean slate if the collection already exists.
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    fields = [
        FieldSchema(name='id', dtype=DataType.VARCHAR, description='ids', max_length=500, is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='question answering')
    collection = Collection(name=collection_name, schema=schema)

    # Create an IVF_FLAT index on the embedding field, using the L2 metric.
    index_params = {
        'metric_type': 'L2',
        'index_type': 'IVF_FLAT',
        'params': {'nlist': 2048}
    }
    collection.create_index(field_name='embedding', index_params=index_params)
    return collection

# 768 is the DPR embedding dimension; the name matches the one used by the insert pipeline.
collection = create_milvus_collection('question_answer', 768)
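
To give some intuition for what an IVF_FLAT index with an L2 metric does, here is a toy, pure-Python sketch. It is a conceptual illustration only, not how Milvus implements the index: `ToyIVFFlat`, its methods and the hand-picked centroids are all hypothetical, and a real index would learn its `nlist` centroids (typically with k-means) rather than receive them.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyIVFFlat:
    """Toy inverted-file index: vectors are bucketed by their nearest centroid."""

    def __init__(self, centroids):
        # Assumption: centroids are supplied directly instead of being learned.
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]

    def _nearest_centroids(self, vec, count):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2(vec, self.centroids[i]))
        return order[:count]

    def insert(self, item_id, vec):
        bucket = self._nearest_centroids(vec, 1)[0]
        self.buckets[bucket].append((item_id, vec))

    def search(self, query, limit=1, nprobe=1):
        # Scan only the `nprobe` closest buckets, comparing raw ("flat") vectors.
        candidates = []
        for bucket in self._nearest_centroids(query, nprobe):
            for item_id, vec in self.buckets[bucket]:
                candidates.append((item_id, l2(query, vec)))
        return sorted(candidates, key=lambda hit: hit[1])[:limit]

index = ToyIVFFlat(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.insert("a", [0.5, 0.2])
index.insert("b", [9.0, 9.5])
index.insert("c", [1.0, 1.0])

# Hits come back as (id, distance) pairs, closest first.
print(index.search([0.9, 0.9], limit=2))
```

Scanning fewer buckets makes a search cheaper but can miss neighbors that landed in an unscanned bucket, which is the usual speed versus recall trade-off of this index family.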

## Question Answering Engine

In this section, we show how to build our question answering engine using Milvus and Towhee. The basic idea behind the system is to use Towhee to generate embeddings from the question dataset and to compare the input question against the embeddings stored in Milvus.

[Towhee](https://towhee.io/) is a machine learning framework that lets you build data processing pipelines, and it also provides predefined operators that implement insert and query operations in Milvus.

<img src="./workflow.png" width = "60%" height = "60%" align=center />

### Load question embeddings into Milvus

First, we generate an embedding from the question text with the [dpr](https://towhee.io/text-embedding/dpr) operator and insert the embedding into Milvus. Towhee provides a [method-chaining style API](https://towhee.readthedocs.io/en/main/index.html) so that users can assemble a data processing pipeline from operators.

In [29]:
%%time
import csv
import os

import numpy as np
from towhee import pipe, ops
from towhee.datacollection import DataCollection

# Embed each question with DPR, L2-normalize the vector, then insert (id, vector) into Milvus.
insert_pipe = (
    pipe.input('id', 'question', 'answer')
        .map('question', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        # An empty uri makes the operator fall back to localhost:19530 and time out, so point it
        # at the same Zilliz Cloud endpoint used elsewhere in this notebook. This assumes the insert
        # operator accepts the same uri/token arguments as the search operator; MILVUS_TOKEN is an
        # assumed environment variable name.
        .map(('id', 'vec'), 'insert_status', ops.ann_insert.milvus_client(
            uri='https://in03-244ceba7647c848.api.gcp-us-west1.zillizcloud.com',
            token=os.environ['MILVUS_TOKEN'],
            collection_name='question_answer'))
        .output()
)

# Stream the CSV rows (id, question, answer) through the pipeline, skipping the header.
with open('question_answer.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        insert_pipe(*row)

In [6]:
print('Total number of inserted data is {}.'.format(collection.num_entities))

Total number of inserted data is 1000.


#### Explanation of Data Processing Pipeline

Here is a detailed explanation of each line of the code:

`pipe.input('id', 'question', 'answer')`: take three inputs, namely the question's id, the question's text, and the question's answer;

`map('question', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))`: use the `facebook/dpr-ctx_encoder-single-nq-base` model to generate the question embedding vector with the [dpr operator](https://towhee.io/text-embedding/dpr) from the Towhee hub;

`map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))`: normalize the embedding vector to unit L2 length;

`map(('id', 'vec'), 'insert_status', ops.ann_insert.milvus_client(...))`: insert the question's id and embedding vector into the `question_answer` collection in Milvus, using the connection arguments shown in the cell above.
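
The normalization step is worth a closer look: for unit-length vectors, the squared L2 distance equals `2 - 2 * cos(a, b)`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The sketch below checks this identity in plain Python; the helper names and the two small vectors are illustrative assumptions, not real DPR embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, like x / np.linalg.norm(x) in the pipeline."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Arbitrary illustrative vectors (not real embeddings).
a, b = [3.0, 4.0], [1.0, 0.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# For unit vectors: ||ua - ub||^2 == 2 - 2 * cos(a, b)
lhs = squared_l2(ua, ub)
rhs = 2 - 2 * cosine(a, b)
print(math.isclose(lhs, rhs))
```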

### Ask Question with Milvus and Towhee

Now that the embeddings for the question dataset have been inserted into Milvus, we can ask questions with Milvus and Towhee. Again, we use Towhee to load the input question, compute an embedding, and use it as a query in Milvus. Because Milvus only returns IDs and distance values, we use the `id_answer` dictionary to look up the answers by ID and display them.

In [32]:
%%time
import os

collection.load()
ans_pipe = (
    pipe.input('question')
        .map('question', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        # Search the collection filled by the insert pipeline; the token comes from the
        # MILVUS_TOKEN environment variable (an assumed name) rather than being hard-coded.
        .map('vec', 'res', ops.ann_search.milvus_client(uri='https://in03-244ceba7647c848.api.gcp-us-west1.zillizcloud.com', token=os.environ['MILVUS_TOKEN'], collection_name='question_answer', limit=1))
        # Each hit starts with the matched id; use it to look up the answer text.
        .map('res', 'answer', lambda x: [id_answer[int(i[0])] for i in x])
        .output('question', 'answer')
)


ans = ans_pipe('Is  Disability  Insurance  Required  By  Law?')
ans = DataCollection(ans)
ans.show()


question,answer
Is Disability Insurance Required By Law?,Not generally. There are five states that require most all employers carry short term disability insurance on their employees. T...


CPU times: user 1.9 s, sys: 1.64 s, total: 3.54 s
Wall time: 11.7 s


We can then read the full answer to 'Is  Disability  Insurance  Required  By  Law?' from the result.

In [33]:
ans[0]['answer']

['Not generally. There are five states that require most all employers carry short term disability insurance on their employees. These states are: California, Hawaii, New Jersey, New York, and Rhode Island. Besides this mandatory short term disability law, there is no other legislative imperative for someone to purchase or be covered by disability insurance.']
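
The lookup step at the end of the pipeline can be sketched on its own. The example below assumes hits arrive as `(id, distance)` pairs with string ids, which mirrors the `int(i[0])` conversion in the pipeline but is not a guaranteed description of the operator's result layout. The optional `max_distance` cutoff, the function name and the demo data are hypothetical additions that show one way to drop weak matches.

```python
def answers_from_hits(hits, id_answer, max_distance=None):
    """Turn (id, distance) hits into answer strings, optionally dropping distant matches."""
    answers = []
    for hit_id, distance in hits:
        if max_distance is not None and distance > max_distance:
            continue  # too far from the query to be a trustworthy match
        answers.append(id_answer[int(hit_id)])
    return answers

# Hypothetical data: ids arrive as strings, distances are L2 values.
id_answer_demo = {0: "answer zero", 1: "answer one"}
hits = [("0", 0.12), ("1", 1.35)]

print(answers_from_hits(hits, id_answer_demo))
print(answers_from_hits(hits, id_answer_demo, max_distance=0.5))
```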

## Release a Showcase

With the core functionality of the question answering engine in place, it's time to build a showcase with an interface. [Gradio](https://gradio.app/) is a great tool for building demos. With Gradio, we simply need to wrap the data processing pipeline in a `chat` function:

In [34]:
import os

# Build the pipeline once so the DPR model is not reloaded for every message.
chat_pipe = (
    pipe.input('question')
        .map('question', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        # MILVUS_TOKEN is an assumed environment variable name holding the API token.
        .map('vec', 'res', ops.ann_search.milvus_client(uri='https://in03-244ceba7647c848.api.gcp-us-west1.zillizcloud.com', token=os.environ['MILVUS_TOKEN'], collection_name='question_answer', limit=1))
        .map('res', 'answer', lambda x: [id_answer[int(i[0])] for i in x])
        .output('question', 'answer')
)

def chat(message, history):
    history = history or []
    # get() returns [question, answers]; take the first (and only) answer.
    response = chat_pipe(message).get()[1][0]
    history.append((message, response))
    return history, history
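
The `(message, history)` contract of `chat` can be tried without Milvus, a model, or Gradio. The sketch below is only an illustration: `make_chat`, `fake_answer` and `chat_demo` are hypothetical names, and `fake_answer` stands in for the Towhee pipeline.

```python
def make_chat(answer_fn):
    """Wrap any question -> answer callable in the (message, history) shape used by chat."""
    def chat(message, history):
        history = history or []      # the first call receives an empty state
        history.append((message, answer_fn(message)))
        return history, history      # one copy for the chatbot, one for the state
    return chat

# Stand-in for the Towhee pipeline so the sketch runs on its own.
def fake_answer(question):
    return f"(no model) you asked: {question}"

chat_demo = make_chat(fake_answer)
shown, state = chat_demo("Is Disability Insurance Required By Law?", None)
shown, state = chat_demo("Can Creditors Take Life Insurance?", state)
print(len(state))  # number of (question, answer) turns so far
```

Returning the history twice matches the two outputs declared for the interface: the chatbot display and the state that is passed back in on the next call.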

In [36]:
import gradio

collection.load()
# The installed Gradio version rejects `color_map` on Chatbot (it raised
# "TypeError: Chatbot.__init__() got an unexpected keyword argument 'color_map'"),
# so the argument is dropped, along with the deprecated `allow_screenshot` flag.
chatbot = gradio.Chatbot()
interface = gradio.Interface(
    chat,
    ["text", "state"],
    [chatbot, "state"],
    allow_flagging="never",
)
interface.launch(inline=True, share=True)