Start a Docker container with Elasticsearch:
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.12.1
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.12.1
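
The container needs a few seconds before it accepts connections, so a cell that connects right away may fail. A minimal sketch of a readiness check follows, assuming the node is exposed at http://localhost:9200 as in the `docker run` command above; the helper name `wait_for_elasticsearch` and its parameters are hypothetical and not part of Haystack or Elasticsearch.

```python
import time
import urllib.error
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200", timeout=60.0,
                           interval=2.0, opener=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    Hypothetical helper: `opener` is injectable so the function can be
    exercised without a running Elasticsearch node.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener(url) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            # Node not reachable yet; retry after a short pause.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_elasticsearch()` before creating the document store would return `True` once the node responds, or `False` if the timeout expires first.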

In [1]:
from TEI_Handling import TEIFile
import glob
from pathlib import Path
import multiprocessing
from multiprocessing.pool import Pool
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.preprocessor import PreProcessor

05/04/2021 21:37:19 - INFO - faiss.loader -   Loading faiss with AVX2 support.
05/04/2021 21:37:19 - INFO - faiss.loader -   Loading faiss.
  return torch._C._cuda_getDeviceCount() > 0
05/04/2021 21:37:20 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

05/04/2021 21:37:21 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.006s]
05/04/2021 21:37:21 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.002s]
05/04/2021 21:37:21 - INFO - elasticsearch -   GET http://localhost:9200/document [status:200 request:0.002s]
05/04/2021 21:37:21 - INFO - elasticsearch -   PUT http://localhost:9200/document/_mapping [status:200 request:0.008s]
05/04/2021 21:37:21 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.002s]


In [3]:
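def to_dict(paper_path):
    """Parse one TEI file and split it into passages; returns a list of dicts."""
    tei = TEIFile(paper_path)
    # Clean the text, then cut it into passages of about 250 words with a
    # 20-word overlap, keeping sentences intact where possible.
    processor = PreProcessor(
        clean_empty_lines=True,
        clean_whitespace=True,
        clean_header_footer=True,
        split_by="word",
        split_length=250,
        split_respect_sentence_boundary=True,
        split_overlap=20
    )
    return processor.process(tei.to_dict())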
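
To make the `split_length` and `split_overlap` settings concrete, here is a simplified sketch of word-based splitting with overlap. It is not Haystack's implementation: it ignores sentence boundaries and cleaning, and the function name `split_words` is made up for illustration.

```python
def split_words(text, split_length=250, split_overlap=20):
    """Split `text` into chunks of `split_length` words, each sharing
    `split_overlap` words with the previous chunk (illustrative only)."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and < split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks
```

With the values used above, each passage starts 230 words after the previous one, so consecutive passages share 20 words of context.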

In [4]:
papers = sorted(Path("../data/unpaywall-grobid-sample").glob('*.tei.xml'))
print(f"Processing {len(papers)} papers on {multiprocessing.cpu_count()} cores.")
pool = Pool()

Processing 20 papers on 12 cores.


In [5]:
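dicts = []
# Each worker receives the papers in chunks of 5; imap preserves input order.
dicts.extend(pool.imap(to_dict, papers, chunksize=5))
pool.close()
pool.join()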

In [6]:
documents = []
for inner_list in dicts:
    documents.extend(inner_list)
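
The flattening loop above can also be written with the standard library. A small sketch, where `flatten` is a hypothetical helper name:

```python
from itertools import chain


def flatten(nested):
    """Concatenate a list of lists into one flat list, preserving order."""
    return list(chain.from_iterable(nested))
```

In this notebook, `documents = flatten(dicts)` would produce the same list as the loop.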

In [None]:
print(documents[:2])

In [None]:
document_store.write_documents(documents)

In [7]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [8]:
from haystack.reader.farm import FARMReader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

05/04/2021 21:37:36 - INFO - farm.utils -   Using device: CPU 
05/04/2021 21:37:36 - INFO - farm.utils -   Number of GPUs: 0
05/04/2021 21:37:36 - INFO - farm.utils -   Distributed Training: False
05/04/2021 21:37:36 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
05/04/2021 21:39:00 - INFO - farm.utils -   Using device: CPU 
05/04/2021 21:39:00 - INFO - farm.utils -   Number of GPUs: 0
05/04/2021 21:39:00 - INFO - farm.utils -   Distributed Training: False
05/04/2021 21:39:00 - INFO - farm.utils -   Automatic Mixed Precision: None
05/04/2021 21:39:01 - INFO - farm.infer -   Got ya 11 parallel workers to do inference ...
05/04/2021 21:39:01 - INFO - farm.infer -    0    0    0    0    0    0    0    0    0    0    0 
05/04/2021 21:39:01 - INFO - farm.infer -   /w\  /w\  /w\  /w\  /w\  /w\  /w\  /|\  /w\  /w\  /w\
05/04/2021 21:39:01 - INFO - farm.infer -   /'\  / \  /'\  /'\  / \  / \  /'\  /'\  /'\  /'\  /'\
05/04/2021 21:39:01 - INFO - farm.infer -                       


In [9]:
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

In [14]:
prediction = pipe.run(query="What is the Internet Engineering Task Force?", top_k_retriever=10, top_k_reader=5)

05/04/2021 21:43:51 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.021s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.79 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.88 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.87 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.84 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.85 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.86 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.86 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.88 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.88 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.86 Batches/s]
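
For further processing it can be handy to pull the answer strings out of the result. The sketch below assumes that `prediction` is a dict with an "answers" list whose entries carry an 'answer' key, which matches the fields printed by `print_answers` in this notebook; that shape is an assumption and may differ between Haystack versions. The helper name `answer_texts` is hypothetical.

```python
def answer_texts(prediction):
    """Return the non-empty answer strings from a prediction-like dict.

    Assumes the shape {"answers": [{"answer": ..., "context": ...}, ...]}.
    """
    return [
        item["answer"]
        for item in prediction.get("answers", [])
        if item.get("answer")  # skip empty or missing answers
    ]
```

For example, `answer_texts(prediction)[:3]` would give the three top-ranked answer strings under that assumption.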


In [12]:
from haystack.utils import print_answers

In [15]:
print_answers(prediction, details="minimal")

[   {   'answer': 'IETF',
        'context': 'r Comment (RFC) documents [2] of the Internet Engineering '
                   'Task Force (or IETF) [3]. RFCs are numbered when they are '
                   'accepted to be published. There ar'},
    {   'answer': 'World Academy of Science, Engineering and Technology',
        'context': 'uble arc plans compared to Single arc plans with\n'
                   'World Academy of Science, Engineering and Technology '
                   'International Journal of Medical and Health Scie'},
    {   'answer': 'algorithm',
        'context': 'An algorithm can be described informally, with a basic set '
                   'of steps that must be performed to reach a predetermined '
                   'result, or with mathematical rigor'},
    {   'answer': '-Each layer should be chosen thinking on the standardized '
                  'protocols. -Layer boundaries must be chosen bearing in mind '
                  'to make low information flow betwee