## Installing Haystack

To start, let's install the latest release of Haystack with `pip`:

In [1]:
# %%bash

# pip install --upgrade pip
# pip install farm-haystack[colab]

Set the default logging level to WARNING and the level for the `haystack` logger to INFO:

In [2]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the ElasticsearchDocumentStore


1. Download, extract, and set the permissions for the Elasticsearch installation image:

In [3]:
# %%bash

# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
# tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
# chown -R daemon:daemon elasticsearch-7.9.2

2. Start the server:

In [4]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

3. Wait 30 seconds for the server to fully start up:

In [5]:
import time
time.sleep(30)
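
A fixed 30-second sleep can be too short on a slow machine and needlessly long on a fast one. As a hedged alternative, the sketch below polls an HTTP endpoint until it responds. The helper names `wait_for_http` and `_default_probe` and the `probe` parameter are hypothetical, and the URL in the usage comment assumes Elasticsearch is listening on its default port 9200 on localhost.

```python
import time
import urllib.error
import urllib.request


def _default_probe(url):
    # Any successful HTTP response counts as "the server is up"
    with urllib.request.urlopen(url, timeout=5):
        return True


def wait_for_http(url, timeout=60.0, interval=2.0, probe=None):
    """Poll `url` until it answers or `timeout` seconds pass.

    Returns True if the endpoint responded in time, False otherwise.
    `probe` is a hypothetical hook so the check can be swapped out in tests.
    """
    if probe is None:
        probe = _default_probe

    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe(url):
                return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage (assumed host and port):
# wait_for_http("http://localhost:9200")
```

The function always makes at least one attempt, so a `timeout` of 0 behaves like a single check.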

4. Initialize the ElasticsearchDocumentStore:


In [5]:
from haystack.utils import launch_es
launch_es()

In [6]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(
    host=host,
    username="",
    password="",
    index="document"
)

INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable  HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://docs.haystack.deepset.ai/docs/telemetry


## Indexing Documents with a Pipeline

1. Load the question-and-answer data, merge each question and its answer into a single `text` column, strip the HTML tags, lowercase the text, and write it to `data.txt`:


In [39]:
import pandas as pd
import re

df = pd.read_csv("pt_question_answers.csv")

df.shape

df[["pt_title", "pt_body", "pt_answer"]]

df["text"] = "question: " + df["pt_title"] + "\n" + df["pt_body"] + "\n" + "answer: " + df["pt_answer"]

df = df[["text"]]

# Non-greedy pattern that matches any HTML tag
CLEANR = re.compile(r"<.*?>")


def cleanhtml(raw_html):
    """Remove HTML tags from a string."""
    return re.sub(CLEANR, "", raw_html)


df["text"] = df["text"].apply(cleanhtml)

df["text"] = df["text"].str.lower()

df

| | text |
|---|---|
| 0 | question: extracting the top-k value-indices from a 1-d tensor\ngiven a 1-d ... |
| 1 | question: how to display custom images in tensorboard (e.g. matplotlib plots... |
| 2 | question: python wheels: cp27mu not supported\ni'm trying to install pytorch... |
| 3 | question: loading torch7 trained models (.t7) in pytorch\ni am using torch7 ... |
| 4 | question: pytorch: how to use dataloaders for custom datasets\nhow to make u... |
| ... | ... |
| 14588 | question: how to disable neptune callback in transformers trainer runs?\naft... |
| 14589 | question: bgr to rgb for cub_200 images by image.split()\ni am creating a py... |
| 14590 | question: neural networks extending learning domain\ni have a simple functio... |
| 14591 | question: how do i multiply tensors like this?\ni am working on a project wh... |
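
The tag-stripping regex removes markup such as `<p>` but leaves HTML entities like `&gt;` and `&quot;` in the text, and these show up in the indexed documents and in some answers. A minimal sketch of a cleaner that also decodes entities, using the standard library's `html.unescape`, might look like this. The names `TAG_RE` and `clean_html` are hypothetical, and the choice to strip tags before decoding entities is an assumption, made so that escaped text such as `&lt;b&gt;` is not mistaken for a real tag.

```python
import html
import re

# Non-greedy match for anything that looks like an HTML tag
TAG_RE = re.compile(r"<.*?>", flags=re.DOTALL)


def clean_html(raw_html):
    """Strip HTML tags, then decode entities such as &gt; and &quot;."""
    without_tags = TAG_RE.sub("", raw_html)
    return html.unescape(without_tags)


print(clean_html("<p>&gt; res = t:topk(3)</p>"))
```

It could be applied in the same way as the original helper, for example with `df["text"].apply(clean_html)`.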


In [40]:
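texts = df["text"].tolist()

# writelines() adds no separators, so join the entries with a newline explicitly;
# the context manager closes the file even if writing fails
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))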


2. Initialize the pipeline, TextConverter, and PreProcessor:

In [41]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=1024,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)
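
To make `split_length` and `split_overlap` concrete, here is a simplified pure-Python sketch of word-based splitting with overlap. It is an illustration only, not Haystack's implementation: it ignores `split_respect_sentence_boundary` and the cleaning options, and `split_words` is a hypothetical helper.

```python
def split_words(text, split_length=1024, split_overlap=20):
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words.
    """
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once this chunk has reached the end of the text
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_words(sample, split_length=4, split_overlap=1):
    print(chunk)
```

With ten words, a length of 4, and an overlap of 1, each chunk starts on the last word of the previous one.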


In [42]:
import os

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])


3. Run the indexing pipeline to write the text data into the DocumentStore:

In [43]:
files_to_index = ['data.txt']
indexing_pipeline.run_batch(file_paths=files_to_index)

INFO - haystack.pipelines.base -  It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


{'documents': [<Document: {'content': 'question: extracting the top-k value-indices from a 1-d tensor\ngiven a 1-d tensor in torch (torch.tensor), containing values which can be compared (say floating point), how can we extract the indices of the top-k values in that tensor?\napart from the brute-force method, i am looking for some api call, that torch/lua provides, which can perform this task efficiently.\n\nanswer: as of pull request #496 torch now includes a built-in api named torch.topk. example:\n\n&gt; t = torch.tensor{9, 1, 8, 2, 7, 3, 6, 4, 5}\n\n-- obtain the 3 smallest elements\n&gt; res = t:topk(3)\n&gt; print(res)\n1\n2\n3\n[torch.doubletensor of size 3]\n\n-- you can also get the indices in addition\n&gt; res, ind = t:topk(3)\n&gt; print(ind)\n2\n4\n6\n[torch.longtensor of size 3]\n\n-- alternatively you can obtain the k largest elements as follow\n-- (see the api documentation for more details)\n&gt; res = t:topk(3, true)\n&gt; print(res)\n9\n8\n7\n[torch.doubletensor of 

## Initializing the Retriever


In [44]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
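
`BM25Retriever` leaves the actual scoring to Elasticsearch. For intuition only, the sketch below scores documents against a query with the textbook BM25 formula. The function `bm25_scores` is hypothetical, whitespace tokenization is an assumption, and `k1=1.2`, `b=0.75` are the commonly cited defaults; scores returned by Elasticsearch will differ because its analyzers and exact formula are not reproduced here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    if not docs:
        return []

    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Fall back to 1 so that a corpus of empty documents does not divide by zero
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1

    # Number of documents that contain each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "how do i save a trained model",
    "check if pytorch is using the gpu",
    "what does model eval do",
]
scores = bm25_scores("gpu", docs)
print(scores.index(max(scores)))  # index of the best-matching document
```

Rare terms get a higher inverse document frequency, and longer documents are penalized through the length normalization term.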

In [50]:
from haystack.nodes import OpenAIAnswerGenerator

In [51]:
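import os

# OpenAIAnswerGenerator is a generator, not a retriever: it takes an API key, not a document store.
# Assumption: the key is stored in the OPENAI_API_KEY environment variable.
generator = OpenAIAnswerGenerator(api_key=os.environ["OPENAI_API_KEY"])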

## Initializing the Reader


In [45]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1
INFO - haystack.modeling.model.language_model -   * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO - haystack.modeling.model.language_model -  Auto-detected model language: english
INFO - haystack.modeling.model.language_model -  Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO - haystack.modeling.utils -  Using devices: CUDA:0 - Number of GPUs: 1


## Creating the Retriever-Reader Pipeline


In [46]:
from haystack import Pipeline

querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])


## Asking a Question


In [47]:
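from pprint import pprint


def get_answer(query):
    """Run the query through the pipeline and return the top answer string."""
    prediction = querying_pipeline.run(
        query=query,
        params={
            "Retriever": {"top_k": 5},
            "Reader": {"top_k": 1},
        },
    )
    answers = prediction["answers"]
    # Return None instead of raising IndexError when no answer is found
    return answers[0].answer if answers else None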

In [48]:
top_10_questions = pd.read_csv("top100questions.csv").iloc[:10].question.tolist()

top_10_questions

['How do I check if PyTorch is using the GPU?\n',
 'How do I save a trained model in PyTorch?\n',
 'What does .view() do in PyTorch?\n',
 'Why do we need to call zero_grad() in PyTorch?\n',
 'How do I print the model summary in PyTorch?\n',
 'How do I initialize weights in PyTorch?\n',
 'What does model.eval() do in pytorch?\n',
 "What's the difference between reshape and view in pytorch?\n",
 'What does model.train() do in PyTorch?\n',
 'What does .contiguous() do in PyTorch?\n']

In [49]:
for query in top_10_questions:
    answer = get_answer(query)
    
    print("Query: ", query)
    print("Answer: ", answer)
    print("\n\n\n")

INFO - haystack.schema -  Setting the ID manually. This might cause a mismatch with the ID that would be generated from the document content and id_hash_keys value.

Query:  How do I check if PyTorch is using the GPU?

Answer:  identify the model of your graphics card

Query:  How do I save a trained model in PyTorch?

Answer:  however you want

Query:  What does .view() do in PyTorch?

Answer:  expects the new shape to be provided by individual int arguments

Query:  Why do we need to call zero_grad() in PyTorch?

Answer:  when we want to &quot;conserve&quot; ram with massive datasets

Query:  How do I print the model summary in PyTorch?

Answer:  forward_pass

Query:  How do I initialize weights in PyTorch?

Answer:  adjust
newval

Query:  What does model.eval() do in pytorch?

Answer:  fix the parameters of bn

Query:  What's the difference between reshape and view in pytorch?

Answer:  two different methods

Query:  What does model.train() do in PyTorch?

Answer:  run multiple times and print output

Query:  What does .contiguous() do in PyTorch?

Answer:  wraps the operation

In [17]:
prediction = querying_pipeline.run(
    query="How to get a uniform distribution in a range [r1,r2] in PyTorch?",
    params={
        "Retriever": {"top_k": 1},
        "Reader": {"top_k": 1}
    }
)


In [19]:
pprint(prediction["answers"])

[<Answer {'answer': '(r1 - r2) * u + r2', 'type': 'extractive', 'score': 0.29058438539505005, 'context': '\n\nif u is a random variable uniformly distributed on [0, 1], then (r1 - r2) * u + r2 is uniformly distributed on [r1, r2].\n\nthus, you just need:\n\n(r1 ', 'offsets_in_document': [{'start': 1364, 'end': 1382}], 'offsets_in_context': [{'start': 66, 'end': 84}], 'document_id': 'bad5c1fe8e74cca9ad00065539705f8e', 'meta': {'_split_id': 7}}>]
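
The retrieved context states that if `u` is uniformly distributed on [0, 1], then `(r1 - r2) * u + r2` is uniformly distributed on [r1, r2]. The sketch below checks the range part of that claim empirically with the standard library's `random` module, chosen here only to keep the example dependency-free. The helper `uniform_in_range` is hypothetical.

```python
import random


def uniform_in_range(r1, r2, rng=random):
    """Map a uniform sample u in [0, 1) onto the interval between r1 and r2."""
    u = rng.random()
    return (r1 - r2) * u + r2


random.seed(0)
samples = [uniform_in_range(2.0, 5.0) for _ in range(1000)]
print(all(2.0 <= s <= 5.0 for s in samples))
```

In PyTorch the same expression would presumably draw `u` with `torch.rand`, which also samples from [0, 1).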


In [21]:
prediction = querying_pipeline.run(
    query="Can I slice tensors with logical indexing or lists of indices?",
    params={
        "Retriever": {"top_k": 1},
        "Reader": {"top_k": 1}
    }
)
pprint(prediction["answers"])


[<Answer {'answer': 'slicing', 'type': 'extractive', 'score': 0.2083435207605362, 'context': 'c = b[:, a_idx] # throws error\n\nif i want contiguous ranges of indices, slicing works:\nc = b[:, 1:2]\n\nthis example of conv1d and pool1d layers into an', 'offsets_in_document': [{'start': 4116, 'end': 4123}], 'offsets_in_context': [{'start': 72, 'end': 79}], 'document_id': 'a7503e3ee8987303b7c3d73d79843819', 'meta': {'_split_id': 5}}>]
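
Both answers above come back with low scores (roughly 0.29 and 0.21), which suggests the reader is not very sure about them. One option is to discard answers below a threshold. The sketch below is a hedged illustration: `confident_answers` and the 0.5 cutoff are hypothetical, only the `answer` and `score` fields are taken from the output above, and `SimpleNamespace` objects stand in for Haystack `Answer` objects.

```python
from types import SimpleNamespace


def confident_answers(answers, min_score=0.5):
    """Return the answer strings whose score is at least `min_score`."""
    return [
        a.answer
        for a in answers
        if a.score is not None and a.score >= min_score
    ]


# Stand-ins for Answer objects, using the scores printed above
fake_answers = [
    SimpleNamespace(answer="(r1 - r2) * u + r2", score=0.29058438539505005),
    SimpleNamespace(answer="slicing", score=0.2083435207605362),
]
print(confident_answers(fake_answers))
print(confident_answers(fake_answers, min_score=0.25))
```

With the assumed 0.5 cutoff neither answer survives; lowering it to 0.25 keeps only the first one.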
