Start a Docker container running Elasticsearch:
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.12.1
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.12.1
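
The container needs some time before Elasticsearch accepts connections on port 9200, and the document store created further down fails if the node is not up yet. A minimal sketch of a readiness check using only the standard library follows; the helper name `wait_for_http` and its default timings are assumptions made for illustration, not part of haystack or Elasticsearch.

```python
import http.client
import time
import urllib.request


def wait_for_http(url, timeout=60.0, interval=2.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed.

    Hypothetical helper for illustration; returns True once the server
    responds with status 200 and False if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, timed out, or an HTTP error status:
            # treat all of these as "not ready yet" and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_http("http://localhost:9200")` before creating the document store would block until the single-node cluster started above responds, assuming the port mapping from the `docker run` command.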

In [1]:
from pathlib import Path
import multiprocessing
from multiprocessing.pool import Pool

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.preprocessor import PreProcessor

from TEI_Handling import TEIFile

05/04/2021 22:08:25 - INFO - faiss.loader -   Loading faiss with AVX2 support.
05/04/2021 22:08:25 - INFO - faiss.loader -   Loading faiss.
05/04/2021 22:08:26 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

05/04/2021 22:08:28 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.007s]
05/04/2021 22:08:28 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.002s]
05/04/2021 22:08:28 - INFO - elasticsearch -   GET http://localhost:9200/document [status:200 request:0.001s]
05/04/2021 22:08:28 - INFO - elasticsearch -   PUT http://localhost:9200/document/_mapping [status:200 request:0.010s]
05/04/2021 22:08:28 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.002s]


In [None]:
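def to_dict(paper_path):
    """Parse one TEI file and split its text into overlapping chunks."""
    tei = TEIFile(paper_path)
    processor = PreProcessor(
        clean_empty_lines=True,
        clean_whitespace=True,
        clean_header_footer=True,
        split_by="word",
        split_length=250,  # target chunk size in words
        split_respect_sentence_boundary=True,
        split_overlap=20,  # words shared by consecutive chunks
    )
    # process() returns a list of chunk dicts for this paper
    return processor.process(tei.to_dict())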
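
To illustrate what `split_by="word"`, `split_length=250` and `split_overlap=20` mean, here is a minimal sliding-window sketch in plain Python. It is not haystack's implementation: the function name `split_words` is made up for this example, and unlike the `PreProcessor` configured above it ignores sentence boundaries.

```python
def split_words(text, split_length=250, split_overlap=20):
    """Split `text` into chunks of up to `split_length` words.

    Consecutive chunks share `split_overlap` words. Illustrative sketch
    only; sentence boundaries are not taken into account.
    """
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the default values, a 500-word text yields three chunks: words 0 to 249, words 230 to 479, and the remaining words 460 to 499. The overlap means a sentence cut at a chunk border still appears whole in at least one neighbouring chunk, at the cost of indexing some text twice.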

In [None]:
papers = sorted(Path("../data/unpaywall-grobid-sample").glob('*.tei.xml'))
print(f"Processing {len(papers)} papers on {multiprocessing.cpu_count()} cores.")
pool = Pool()

In [None]:
# imap yields results in input order; papers are handed to workers in chunks of 5
dicts = list(pool.imap(to_dict, papers, chunksize=5))
pool.close()
pool.join()

In [None]:
documents = []
for inner_list in dicts:
    documents.extend(inner_list)

In [None]:
print(documents[:2])

In [None]:
document_store.write_documents(documents)

In [3]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [4]:
from haystack.reader.farm import FARMReader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

05/04/2021 22:08:48 - INFO - farm.utils -   Using device: CPU 
05/04/2021 22:08:48 - INFO - farm.utils -   Number of GPUs: 0
05/04/2021 22:08:48 - INFO - farm.utils -   Distributed Training: False
05/04/2021 22:08:48 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
05/04/2021 22:08:58 - INFO - farm.utils -   Using device: CPU 
05/04/2021 22:08:58 - INFO - farm.utils -   Number of GPUs: 0
05/04/2021 22:08:58 - INFO - farm.utils -   Distributed Training: False
05/04/2021 22:08:58 - INFO - farm.utils -   Automatic Mixed Precision: None
05/04/2021 22:08:59 - INFO - farm.infer -   Got ya 11 parallel workers to do inference ...

In [5]:
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

In [9]:
prediction = pipe.run(query="What is the IETF?", top_k_retriever=10, top_k_reader=5)

05/04/2021 22:09:56 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.021s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.20 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.04 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.17 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.36 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.93 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.16 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.17 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.51 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.49 Batches/s]


In [10]:
from haystack.utils import print_answers
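
Because consecutive chunks overlap by 20 words (`split_overlap=20`), the same passage can end up in two indexed documents, and the reader may then return the same answer twice. The sketch below drops such repeats before printing. It assumes that the prediction is a dictionary with an `"answers"` list whose entries carry `"answer"` and `"context"` fields; the helper name `unique_answers` and the sample data are hand-written stand-ins, not real pipeline output.

```python
def unique_answers(prediction):
    """Return the answers in `prediction` without repeated (answer, context) pairs.

    Assumes `prediction["answers"]` is a list of dicts with "answer" and
    "context" keys; the original order is preserved.
    """
    seen = set()
    unique = []
    for answer in prediction.get("answers", []):
        key = (answer.get("answer"), answer.get("context"))
        if key not in seen:
            seen.add(key)
            unique.append(answer)
    return unique


# Hand-written stand-in for a pipeline result (illustrative, not real output).
sample_prediction = {
    "query": "What is the IETF?",
    "answers": [
        {"answer": "Internet Engineering Task Force",
         "context": "the Internet Engineering Task Force (or IETF)"},
        {"answer": "Σ", "context": "Σ is the set of all possible events"},
        {"answer": "Σ", "context": "Σ is the set of all possible events"},
    ],
}
deduplicated = unique_answers(sample_prediction)
```

Here `deduplicated` keeps the first and second entries and drops the third. Two answers with the same text but different contexts are both kept, since they come from different passages.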

In [11]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Internet Engineering Task Force',
        'context': 'ibed in the Requests For Comment (RFC) documents [2] of '
                   'the Internet Engineering Task Force (or IETF) [3]. RFCs '
                   'are numbered when they are accepted to'},
    {   'answer': '= ( , , Σ, , , , , ) in initial configuration . Input '
                  'stimuli move the APN',
        'context': 'scheme plus adaptive layer as follows:\n'
                   '= ( , , Σ, , , , , ) in initial configuration . Input '
                   'stimuli move the APN to the next configuration if, and on'},
    {   'answer': 'Σ',
        'context': 'al behavior in step k; ε ("empty string") denotes absence '
                   'of valid element;\n'
                   'Σ is the set of all possible events that make up the input '
                   'chain; A⊆C is t'},
    {   'answer': 'Σ',
        'context': 'al behavior in step k; ε ("empty string") denotes absence '
                   'of val