
Same answers are not generated from similar documents #1308

Closed
SAIVENKATARAJU opened this issue Jul 28, 2021 · 5 comments
@SAIVENKATARAJU

Hi,
I am trying to build an extractive QA system for water softeners. I have product manuals for 4 different models. Each manual contains some info about the water softener, plus do's and don'ts. Most of the info is the same across all the docs, but the specifications may vary from doc to doc. I was trying to pull the answer from every doc, but the exact answer is returned from only one or two docs, even though the same answer is present in all of them.

example:

# Assumed imports (Haystack 0.x-era paths; adjust to your version)
from haystack.pipeline import ExtractiveQAPipeline
from haystack.utils import print_answers

pipe = ExtractiveQAPipeline(reader_para, retriever)

question = 'can I install it outside'
for file_to_search in files:
    print(file_to_search)
    # Restrict the search to a single document via a metadata filter
    prediction = pipe.run(query=question, top_k_retriever=4, top_k_reader=2,
                          filters={'name': [file_to_search]})
    result = post_processing(prediction)
    print_answers(prediction, details="medium")
    print("--------------------------------")

Here I asked the model "can I install it outside" and explicitly searched each doc individually through the filter. The output is below.

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.42 Batches/s]
[   {   'answer': 'Always install the water softener between the water inlet '
                  'and water heater',
        'context': 'ce such as a circuit breaker or fuse. Always install the '
                   'water softener between the water inlet and water heater. '
                   'Any other installed water conditioni',
        'score': 8.972942352294922},
    {   'answer': 'except outside water pipes',
        'context': 'r in the home, install the water softener close to the '
                   'water supply inlet, and upstream of all other plumbing '
                   'connections, except outside water pipes.',
        'score': 8.767389297485352}]
--------------------------------
model1.pdf
07/28/2021 09:49:04 - WARNING - farm.data_handler.dataset -   Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.44 Batches/s]
[   {   'answer': 'If installing the water softener outdoors, do not locate '
                  'where it will be exposed to wet weather, direct sunlight, '
                  'extreme hot or cold temperatures, or other forms of abuse',
        'context': 'If installing the water softener outdoors, do not locate '
                   'where it will be exposed to wet weather, direct sunlight, '
                   'extreme hot or cold temperatures, or other forms of abuse',
        'score': 6.44699764251709},
    {   'answer': 'Always install the water softener between the water inlet '
                  'and water heater',
        'context': 'rly protected by an over current device such as a circuit '
                   'breaker or fuse. Always install the water softener between '
                   'the water inlet and water heater.',
        'score': 5.814720630645752}]
--------------------------------
model2.pdf
07/28/2021 09:49:04 - WARNING - farm.data_handler.dataset -   Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.45 Batches/s]
[   {   'answer': 'If installing the water softener outdoors, do not locate '
                  'where it will be exposed to wet weather, direct sunlight, '
                  'extreme hot or cold temperatures, or other forms of abuse',
        'context': 'If installing the water softener outdoors, do not locate '
                   'where it will be exposed to wet weather, direct sunlight, '
                   'extreme hot or cold temperatures, or other forms of abuse',
        'score': 9.138312339782715},
    {   'answer': 'If installing the water softener outdoors, do not locate '
                  'where it will be exposed to wet weather, direct sunlight, '
                  'extreme hot or cold temperatures, or other forms of abuse',
        'context': 'If installing the water softener outdoors, do not locate '
                   'where it will be exposed to wet weather, direct sunlight, '
                   'extreme hot or cold temperatures, or other forms of abuse',
        'score': 9.044705390930176}]
--------------------------------
model3.pdf
07/28/2021 09:49:04 - WARNING - farm.data_handler.dataset -   Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.42 Batches/s]
[   {   'answer': 'When installing in an outside location',
        'context': 'cause distortion or other damage to non-metallic parts. '
                   'When installing in an outside location you must take steps '
                   'necessary to assure the softener, i',
        'score': 7.140499591827393},
    {   'answer': 'install the water softener close to the water supply inlet, '
                  'and upstream of all other plumbing connections, except '
                  'outside water pipes',
        'context': 'e home, install the water softener close to the water '
                   'supply inlet, and upstream of all other plumbing '
                   'connections, except outside water pipes. Outsid',
        'score': 7.136640548706055}]

From the above answers, the required answer is "If installing the water softener..." but it only comes back from two of the docs. My pre-processing is below.


import uuid

# Assumed imports (Haystack 0.x-era paths; adjust to your version)
from haystack.preprocessor import PreProcessor
from haystack.file_converter import PDFToTextConverter

processor = PreProcessor(clean_empty_lines=True, clean_whitespace=True,
                         split_by='word',
                         split_respect_sentence_boundary=True,
                         split_length=156)
converter = PDFToTextConverter(remove_numeric_tables=True)

for file in files:
    doc = converter.convert(kbasepath + file, meta={'name': file})
    doc_clean = processor.process(doc)
    print(len(doc_clean))
    for passage in doc_clean:
        document_store.write_documents([{'text': passage['text'],
                                         'meta': {'name': file, 'author': "NSVR",
                                                  'id': uuid.uuid4().hex[:8]}}])

It would be really helpful to know how to debug this kind of issue.

@brandenchan
Contributor

Hi @SAIVENKATARAJU , my first suspicion would be that each of the product manuals might be split up differently by the PreProcessor. Maybe for one manual, the answer comes at the very beginning of a passage and in another, it is at the end. This can influence the model's predictions.

If you are using an ElasticsearchRetriever, one thing you could try to resolve this is to increase the PreProcessor's split_length (perhaps to 500) and set split_overlap to some fraction of that amount (maybe around 50). This will probably also require split_respect_sentence_boundary=False.
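
For reference, a minimal sketch of that configuration (the values are just the ones suggested above, and the import path may differ between Haystack versions):

from haystack.preprocessor import PreProcessor  # Haystack 0.x-era path

processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by='word',
    split_length=500,                       # larger passages than the current 156
    split_overlap=50,                       # overlap between consecutive passages
    split_respect_sentence_boundary=False,  # likely required when using split_overlap
)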

If this doesn't have the desired impact, we might have to dig deeper by looking at the Retriever's predictions using an EvalDocument Node. This might give us a better sense of which component exactly is causing the differing predictions across the different documents.

@brandenchan brandenchan self-assigned this Jul 29, 2021
@SAIVENKATARAJU
Author

Hey @brandenchan
Thanks for your suggestions. I will check that out.

@SAIVENKATARAJU
Author

Hey @brandenchan ,

No real improvement from parameter tuning. Is it a good idea to create a different index for each file?

@brandenchan
Contributor

Hi @SAIVENKATARAJU, what's your intention in creating an index for each file? Is it that you would like to perform your query on just one of your files? (In that case I would recommend using metadata filtering.)

The next thing I would recommend is to actually look at the output of the retriever. To do this, initialize an EvalDocuments node with debug=True and place it in the pipeline right after the retriever. After running your pipeline, EvalDocuments.log should show you the retrieved documents for each query. It is worth checking whether the retrieved documents are what you would expect.
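
A rough sketch of that wiring, assuming the retriever and reader objects from the snippets above (node and parameter names follow Haystack ~0.10 and may differ in other versions; depending on the version, EvalDocuments may also expect gold labels when run):

from haystack import Pipeline
from haystack.eval import EvalDocuments  # Haystack 0.x-era path

eval_docs = EvalDocuments(debug=True)  # keeps a per-query log of retrieved documents

p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=eval_docs, name="EvalDocuments", inputs=["Retriever"])
p.add_node(component=reader_para, name="Reader", inputs=["EvalDocuments"])

prediction = p.run(query=question, top_k_retriever=4, top_k_reader=2,
                   filters={'name': [file_to_search]})
print(eval_docs.log)  # inspect which documents the retriever returned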

@ZanSara
Contributor

ZanSara commented Oct 13, 2021

Hi @SAIVENKATARAJU, since your last post we have implemented new debugging features (see this PR: #1558). Please check out the new master, test again, and if you still face the same problem, let us know by opening a new issue. Closing this for now.
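
With those features, per-node debug output can be requested at query time, roughly like this (a sketch against the Haystack 1.x-style interface; see the PR and the docs of your installed version for the exact details):

# Request debug output from a node at query time and inspect what it
# received and returned during the run
prediction = pipe.run(
    query=question,
    params={"Retriever": {"top_k": 4, "debug": True},
            "Reader": {"top_k": 2}},
)
print(prediction["_debug"])  # per-node input/output captured during the run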

@ZanSara ZanSara closed this as completed Oct 13, 2021