## Install dependencies

In [1]:
# # Install the latest release of Haystack in your own environment
# ! pip install farm-haystack

# # Install the latest master of Haystack
# !pip install --upgrade pip
# !pip install wget

# !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

In [2]:
# !pip install faiss
# !sudo apt-get install libomp-dev

## Imports

In [3]:
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.document_stores import InMemoryDocumentStore

from haystack.nodes import EmbeddingRetriever, BM25Retriever, ElasticsearchRetriever
import pandas as pd
import requests

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.


### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and run Elasticsearch from source.

In [5]:
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()

In [6]:
# In Colab / No Docker environments: Start Elasticsearch from source
# ! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
# ! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
# ! chown -R daemon:daemon elasticsearch-7.9.2

# import os
# from subprocess import Popen, PIPE, STDOUT

# es_server = Popen(
#     ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
# )
# # wait until ES has started
# ! sleep 30
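
The commented cell above waits a fixed 30 seconds for the server to come up. As a rough alternative, the sketch below polls a readiness check until it succeeds or a timeout expires. The helper `wait_until_ready` and the stand-in `fake_check` are hypothetical names introduced here for illustration; in a real setup the check would typically probe the server itself, for example with an HTTP request to the Elasticsearch port.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
) -> bool:
    """Call `check` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real health check (assumption): reports "ready" on the third call.
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until_ready(fake_check, timeout=5.0, interval=0.01)
print(ready, attempts["count"])
```

Polling returns as soon as the server responds and reports a failure explicitly instead of continuing with a server that never started.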

### Init the DocumentStore
In contrast to Tutorial 1 (extractive QA), we:

* specify the name of the `text_field` in Elasticsearch that we want to return as the answer
* specify the name of the `embedding_field` in Elasticsearch, where we store the embedding of each FAQ question; it is used later to compute the similarity to the incoming user question
* set `excluded_meta_data=["question_emb"]` so that the large embedding vectors are not returned in the search results
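
To make the effect of these settings concrete, here is a minimal sketch in plain Python that does not depend on Haystack or Elasticsearch. The record layout, the made-up sample values, and the helper `strip_excluded` are illustrative assumptions, not the library's internals; only the field name `question_emb` comes from the text above.

```python
from typing import Any, Dict, List

# Hypothetical stored record: the answer text, the FAQ question, and the
# question's embedding (shortened to four dimensions for readability).
record: Dict[str, Any] = {
    "answer": "Yes, most plans can be changed before departure.",
    "question": "Can I change my plan?",
    "question_emb": [0.12, -0.03, 0.88, 0.45],
}


def strip_excluded(rec: Dict[str, Any], excluded: List[str]) -> Dict[str, Any]:
    """Return a copy of `rec` without the fields listed in `excluded`."""
    return {key: value for key, value in rec.items() if key not in excluded}


# The embedding stays in storage for similarity search, but is dropped
# from what is handed back to the caller.
result = strip_excluded(record, excluded=["question_emb"])
print(result)
```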

In [7]:
from haystack.document_stores import ElasticsearchDocumentStore

### Create a Retriever using embeddings
Instead of retrieving via Elasticsearch's plain BM25, we want to use vector similarity of the questions (user question vs. FAQ ones).
We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings.
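
As a rough illustration of what "vector similarity of the questions" means, the sketch below ranks FAQ questions by cosine similarity to a query vector. The three-dimensional vectors are made-up toy values and the helpers `cosine_similarity` and `best_match` are hypothetical; a real retriever would obtain the vectors from an embedding model.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec: Sequence[float], candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate question whose vector is most similar to the query."""
    return max(candidates, key=lambda q: cosine_similarity(query_vec, candidates[q]))


# Toy embeddings (assumed values, for illustration only).
faq_embeddings = {
    "How do I file a claim?": [0.9, 0.1, 0.0],
    "Is COVID-19 covered?": [0.1, 0.9, 0.2],
}
query_embedding = [0.2, 0.8, 0.1]

best = best_match(query_embedding, faq_embeddings)
print(best)
```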

In [9]:
from haystack import Pipeline

### Prepare & Index FAQ data
We create a pandas DataFrame containing some FAQ data (i.e., curated question-answer pairs) and index it in Elasticsearch.
Here, we download some question-answer pairs related to COVID-19.
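
One possible way to turn question-answer pairs into indexable records is sketched below. It assumes each pair is a dict with `Question` and `Answer` keys and produces dicts with `content` and `meta` keys; the sample pairs and the helper `faq_to_documents` are made up for illustration.

```python
from typing import Dict, List


def faq_to_documents(pairs: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Convert question-answer pairs into dicts with `content` and `meta` keys.

    Pairs with an empty question or answer are skipped.
    """
    documents = []
    for pair in pairs:
        question = pair.get("Question", "").strip()
        answer = pair.get("Answer", "").strip()
        if not question or not answer:
            continue
        documents.append({"content": answer, "meta": {"question": question}})
    return documents


# Made-up sample data (assumption), standing in for the real FAQ file.
faq_pairs = [
    {"Question": "Is COVID-19 covered?", "Answer": "It depends on your plan."},
    {"Question": "Can I move my dates?", "Answer": "Yes, within the allowed period."},
    {"Question": "", "Answer": "Orphaned answer without a question."},
]

faq_docs = faq_to_documents(faq_pairs)
print(len(faq_docs), faq_docs[0])
```

Keeping the question in `meta` preserves it alongside the answer text, so it can be shown next to each search hit.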

### Ask questions
Initialize a Pipeline (this time without a reader) and ask questions.

In [10]:
# from google.colab import files
# uploaded = files.upload()

In [12]:
from typing import List
import requests
import pandas as pd
from haystack import Document
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import RAGenerator, DensePassageRetriever
from haystack.utils import fetch_archive_from_http
# import faiss

In [13]:
document_store = ElasticsearchDocumentStore()

INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


In [14]:
# Initialize DPR Retriever to encode documents, encode question and query documents
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)

# Initialize RAG Generator
generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    use_gpu=False,
    top_k=1,
    max_length=200,
    min_length=2,
    embed_title=True,
    num_beams=2,
)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1


INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-question_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-question_encoder-single-nq-base


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-ctx_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-ctx_encoder-single-nq-base
INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizerFast'.


Some weights of the model checkpoint at facebook/rag-token-nq were not used when initializing RagTokenForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagTokenForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagTokenForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RagTokenForGeneration were not initialized from the model checkpoint at facebook/rag-token-nq and are newly initialized: ['rag.generator.lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictio

In [16]:
import json

covid_json_filename = "dataset/COVID-19 FAQs | Allianz Global Assistance.json"

# Load the raw FAQ export and pull out the question-answer pairs
with open(covid_json_filename, encoding="utf-8") as json_file:
    data = json.load(json_file)

qas = data["FaqDocuments"]
questions = [sample["Question"] for sample in qas]
answers = [sample["Answer"] for sample in qas]

# Build one document per FAQ entry: the answer is the searchable content,
# and the question is kept as metadata
docs = [
    {"meta": {"question": question}, "content": answer}
    for question, answer in zip(questions, answers)
]

# Index the documents, then compute and store their embeddings with the retriever
document_store.write_documents(docs)
document_store.update_embeddings(retriever)

INFO - haystack.document_stores.elasticsearch -  Updating embeddings for all 31 docs ...


In [17]:
print(docs[0]["content"])



In [18]:
from haystack.utils import print_documents
from haystack.pipelines import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(retriever)
res = p_retrieval.run(query="I am worried about COVID-19 impacting a trip I have scheduled or plan to schedule. Should I buy an Allianz travel protection plan to cover me in case COVID-19 impacts my trip", params={"Retriever": {"top_k": 2}})
print_documents(res, max_text_len=512)


Query: I am worried about COVID-19 impacting a trip I have scheduled or plan to schedule. Should I buy an Allianz travel protection plan to cover me in case COVID-19 impacts my trip

{   'content': 'COVID-19 is a known and evolving epidemic that is impacting '
               'travel worldwide, with continued spread and impacts '
               'expected.\xa0 Our travel protection plans do not generally '
               'cover losses directly or indirectly related to known, '
               'foreseeable, or expected events, epidemics, government '
               'travel. However, we are pleased to announce the introduction '
               'of our Epidemic Coverage Endorsement to certain plans '
               'purchased on or after March 6, 2021.\xa0 This endorsement adds '
               'cer...',
    'name': None}

{   'content': 'No, canceling a trip because you’re afraid to travel due to '
               'COVID-19 is generally not covered by our travel protection '
               

In [19]:
docs = [doc.content for doc in res['documents']]
docs[0]



## GPT-3

In [20]:
# !pip install openai

In [21]:
import openai
openai_api_key_path = "opanai_api_key.txt"
openai.api_key_path = openai_api_key_path

In [22]:
INFO = "COVID-19 is a known and evolving epidemic that is impacting travel worldwide, with continued spread and impacts expected.  Our travel protection plans do not generally cover losses directly or indirectly related to known, foreseeable, or expected events, epidemics, government prohibitions, warnings, or travel advisories, or fear of travel. However, we are pleased to announce the introduction of our Epidemic Coverage Endorsement to certain plans purchased on or after March 6, 2021.  This endorsement adds certain new covered reasons related to epidemics (including COVID-19) to some of our most popular insurance plans.  Please see the below FAQ section on “Epidemic Coverage Endorsement” for more information.  Note, the Epidemic Coverage Endorsement may not be available for all plans or in all jurisdictions.  To see if your plan includes this endorsement, please look for “Epidemic Coverage Endorsement” on your Declarations of Coverage or Letter of Confirmation. Additionally, in response to the ongoing public health and travel crisis, we are temporarily extending certain claims accommodations as follows*: 1. For plans that do not include the Epidemic Coverage Endorsement, we are temporarily accommodating claims for the following:  Emergency medical care for an insured who becomes ill with COVID-19 while on their trip (if your plan includes the Emergency Medical Care benefit) Trip cancellation and trip interruption if an insured, or that insured’s traveling companion or family member, becomes ill with COVID-19 either before or during the insured’s trip (if your plan includes Trip Cancellation or Trip Interruption benefits, as applicable)  2. If an insured or their traveling companion become ill with COVID-19 while on their trip, that insured will not be subject to the Trip Interruption benefit’s five-day maximum limit for additional accommodation and transportation expenses (however, the maximum daily limit for such expenses and the maximum Trip Interruption benefit limit still apply). These temporary accommodations are strictly applicable to COVID-19 and are only available to customers whose plan includes the applicable benefit.  These accommodations apply to plans currently in effect but may not apply to plans purchased in the future, so please refer to our Coverage Alert for the most up to date information before purchasing."

In [23]:
TAG = "use the above information to answer the question"

In [24]:
QUESTION = "I am worried about COVID-19 impacting a trip I have scheduled or plan to schedule. Should I buy an Allianz travel protection plan to cover me in case COVID-19 impacts my trip"

In [25]:
response = openai.Completion.create(
  engine="text-davinci-002",
  prompt=f"information:\n{INFO}\n\n\n{TAG}: {QUESTION}",
  temperature=0.7,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
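
The prompt in the cell above is assembled inline from the context, the instruction tag, and the question. A small helper can make that layout reusable and keep the context within a size budget. The sketch below is one possible version; the function name `build_prompt`, the whitespace normalization, and the default limit of 6000 characters are assumptions chosen for illustration, not values taken from the API.

```python
def build_prompt(info: str, tag: str, question: str, max_info_chars: int = 6000) -> str:
    """Assemble a prompt of the form 'information: ... <tag>: <question>'.

    The context is whitespace-normalized and, if too long, cut at the last
    word boundary before `max_info_chars` (an arbitrary illustrative budget).
    """
    info = " ".join(info.split())
    if len(info) > max_info_chars:
        info = info[:max_info_chars].rsplit(" ", 1)[0]
    return f"information:\n{info}\n\n\n{tag}: {question.strip()}"


# Made-up inputs (assumption) that mirror the shape of INFO, TAG and QUESTION.
prompt = build_prompt(
    info="Plans bought after the cutoff date   include epidemic coverage.",
    tag="use the above information to answer the question",
    question="Does my plan include epidemic coverage? ",
)
print(prompt)
```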

In [26]:
response["choices"][0]["text"].strip()
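
Completion text often begins with blank lines, and a response may in principle contain no choices at all. The sketch below shows a defensive way to extract the first completion; it uses a plain dict (`fake_response`, with made-up content) that only imitates the `choices` / `text` shape seen above, and the helper `first_completion_text` is a hypothetical name.

```python
from typing import Any, Dict


def first_completion_text(response: Dict[str, Any]) -> str:
    """Return the stripped text of the first choice, or '' if there is none."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    return str(choices[0].get("text", "")).strip()


# Stand-in response (assumption): same nesting as the object used above.
fake_response = {
    "choices": [
        {"text": "\n\nIt depends on whether your plan includes the endorsement.", "index": 0}
    ]
}

answer_text = first_completion_text(fake_response)
print(answer_text)
```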



In [28]:
# opfile = openai.File.create(file=open("dataset/covid_exported_jl.json"), purpose='answers')
# opfilename = opfile.to_dict()['id']

In [35]:
opfilename = "file-M8vKsUy0q19BxdlywLLSIvxg"

'file-M8vKsUy0q19BxdlywLLSIvxg'

In [31]:
resp = openai.Answer.create(
    search_model="ada", 
    model="curie", 
    question=QUESTION, 
    file=opfilename, 
    examples_context="In 2017, U.S. life expectancy was 78.6 years.", 
    examples=[["What is human life expectancy in the United States?", "78 years."]], 
    max_rerank=10,
    max_tokens=10,
    stop=["\n", "<|endoftext|>"]
)

In [32]:
resp.to_dict()

{'answers': ['No, COVID-19 is not a covered'],
 'completion': 'cmpl-57wq120AnZw4PKjR6MYpgjF1dNLj2',
 'file': 'file-M8vKsUy0q19BxdlywLLSIvxg',
 'model': 'curie:2020-05-03',
 'object': 'answer',
 'search_model': 'ada:2020-05-03',
 'selected_documents': [<OpenAIObject search_result at 0x7f03f77abbd0> JSON: {
    "document": 9,
    "object": "search_result",
    "score": 108.16,
    "text": "If you or a traveling companion become ill due to an epidemic disease (such as COVID-19) or are individually-ordered to quarantine, these are covered reasons that could trigger Trip Cancellation or Trip interruption benefits for the insured.\u00a0 Note, the plan only covers expenses of the insured. Expenses of traveling companions are not covered unless they are also an insured under the plan.\u00a0 Benefits may not cover the full cost of your quarantine and are subject to applicable benefit limits.\u00a0 For information on what qualifies as an \u201cindividually-ordered quarantine,\u201d see the FAQ\u

In [33]:
resp.keys()

dict_keys(['answers', 'completion', 'file', 'model', 'object', 'search_model', 'selected_documents'])

In [34]:
resp['selected_documents'][-1]['text']

'No, canceling a trip because of an area being affected by COVID-19 is generally not covered by our travel protection plans. However, if you’re concerned about traveling during this time, many airlines and other travel suppliers are allowing their customers to change the dates of their travel without change fees. If you change your trip’s dates, we are happy to allow you to move your plan coverage dates to cover a new or rescheduled trip, so long as that trip is scheduled to be completed within 770 days from the plan’s original purchase date.* For terms and details, please see the below FAQ on changing your travel protection plan’s effective dates. This temporary accommodation is strictly applicable to COVID-19.\xa0 This accommodation applies to plans currently in effect but may not apply to plans purchased in the future, so please refer to our\xa0Coverage Alert\xa0for the most up to date information before purchasing.'