
Same answers are not generated from similar documents #1308

Closed
SAIVENKATARAJU opened this issue Jul 28, 2021 · 5 comments
@SAIVENKATARAJU

Hi,
I am trying to build an extractive QA system for water softeners. I have product manuals for 4 different models. Each manual contains some info about the water softener, plus do's and don'ts. Most of the info is the same across all the docs, but the specifications may vary from doc to doc. I was trying to pull the answer from every doc, but the exact answer is returned from only one or two docs, even though the same answer is present in all of them.

example:

# Assumed imports (Haystack 0.x-era paths; adjust to your version)
from haystack.pipeline import ExtractiveQAPipeline
from haystack.utils import print_answers

pipe = ExtractiveQAPipeline(reader_para, retriever)

question = 'can I install it outside'
for file_to_search in files:
    print(file_to_search)
    # Restrict the search to a single document via a metadata filter
    prediction = pipe.run(query=question, top_k_retriever=4, top_k_reader=2,
                          filters={'name': [file_to_search]})
    result = post_processing(prediction)
    print_answers(prediction, details="medium")
    print("--------------------------------")

Here I asked the model "can I install it outside" and explicitly searched each doc individually through the filter. The output is below.

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.42 Batches/s]
[   {   'answer': 'Always install the water softener between the water inlet '
                  'and water heater',
        'context': 'ce such as a circuit breaker or fuse. Always install the '
                   'water softener between the water inlet and water heater. '
                   'Any other installed water conditioni',
        'score': 8.972942352294922},
    {   'answer': 'except outside water pipes',
        'context': 'r in the home, install the water softener close to the '
                   'water supply inlet, and upstream of all other plumbing '
                   'connections, except outside water pipes.',
        'score': 8.767389297485352}]
--------------------------------
model1.pdf
07/28/2021 09:49:04 - WARNING - farm.data_handler.dataset -   Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.44 Batches/s]
[   {   'answer': 'If installing the water softener outdoors, do not locate '
                  'where it will be exposed to wet weather, direct sunlight, '
                  'extreme hot or cold temperatures, or other forms of abuse',
        'context': 'If installing the water softener outdoors, do not locate '
                   'where it will be exposed to wet weather, direct sunlight, '
                   'extreme hot or cold temperatures, or other forms of abuse',
        'score': 6.44699764251709},
    {   'answer': 'Always install the water softener between the water inlet '
                  'and water heater',
        'context': 'rly protected by an over current device such as a circuit '
                   'breaker or fuse. Always install the water softener between '
                   'the water inlet and water heater.',
        'score': 5.814720630645752}]
--------------------------------
model2.pdf
07/28/2021 09:49:04 - WARNING - farm.data_handler.dataset -   Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.45 Batches/s]
[   {   'answer': 'If installing the water softener outdoors, do not locate '
                  'where it will be exposed to wet weather, direct sunlight, '
                  'extreme hot or cold temperatures, or other forms of abuse',
        'context': 'If installing the water softener outdoors, do not locate '
                   'where it will be exposed to wet weather, direct sunlight, '
                   'extreme hot or cold temperatures, or other forms of abuse',
        'score': 9.138312339782715},
    {   'answer': 'If installing the water softener outdoors, do not locate '
                  'where it will be exposed to wet weather, direct sunlight, '
                  'extreme hot or cold temperatures, or other forms of abuse',
        'context': 'If installing the water softener outdoors, do not locate '
                   'where it will be exposed to wet weather, direct sunlight, '
                   'extreme hot or cold temperatures, or other forms of abuse',
        'score': 9.044705390930176}]
--------------------------------
model3.pdf
07/28/2021 09:49:04 - WARNING - farm.data_handler.dataset -   Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.42 Batches/s]
[   {   'answer': 'When installing in an outside location',
        'context': 'cause distortion or other damage to non-metallic parts. '
                   'When installing in an outside location you must take steps '
                   'necessary to assure the softener, i',
        'score': 7.140499591827393},
    {   'answer': 'install the water softener close to the water supply inlet, '
                  'and upstream of all other plumbing connections, except '
                  'outside water pipes',
        'context': 'e home, install the water softener close to the water '
                   'supply inlet, and upstream of all other plumbing '
                   'connections, except outside water pipes. Outsid',
        'score': 7.136640548706055}]

From the above answers, the required answer is "If installing the water softener..." but it only comes back from two of the docs. My pre-processing is below.


import uuid

# Assumed imports (Haystack 0.x-era paths; adjust to your version)
from haystack.preprocessor import PreProcessor
from haystack.file_converter import PDFToTextConverter

processor = PreProcessor(clean_empty_lines=True, clean_whitespace=True,
                         split_by='word',
                         split_respect_sentence_boundary=True,
                         split_length=156)
converter = PDFToTextConverter(remove_numeric_tables=True)

for file in files:
    doc = converter.convert(kbasepath + file, meta={'name': file})
    doc_clean = processor.process(doc)
    print(len(doc_clean))
    for passage in doc_clean:
        document_store.write_documents([{'text': passage['text'],
                                         'meta': {'name': file, 'author': "NSVR",
                                                  'id': uuid.uuid4().hex[:8]}}])

It would be really helpful to know how to debug this kind of issue.

@brandenchan
Contributor

Hi @SAIVENKATARAJU , my first suspicion would be that each of the product manuals might be split up differently by the PreProcessor. Maybe for one manual, the answer comes at the very beginning of a passage and in another, it is at the end. This can influence the model's predictions.

If you are using an ElasticsearchRetriever, one thing you could try to resolve this is to increase the PreProcessor's split_length (perhaps to 500) and set split_overlap to some fraction of that amount (maybe around 50). This will probably also require split_respect_sentence_boundary=False.
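
For reference, a minimal sketch of that configuration (the values are just the ones suggested above, and the import path may differ between Haystack versions):

from haystack.preprocessor import PreProcessor  # Haystack 0.x-era path

processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by='word',
    split_length=500,                       # larger passages than the current 156
    split_overlap=50,                       # overlap between consecutive passages
    split_respect_sentence_boundary=False,  # likely required when using split_overlap
)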

If this doesn't have the desired impact, we might have to dig deeper by looking at the Retriever's predictions using an EvalDocument Node. This might give us a better sense of which component exactly is causing the differing predictions across the different documents.

@brandenchan brandenchan self-assigned this Jul 29, 2021
@SAIVENKATARAJU
Author

Hey @brandenchan
Thanks for your suggestions. I will check that out.

@SAIVENKATARAJU
Author

Hey @brandenchan ,

No real improvement from parameter tuning. Is it a good idea to create a different index for each file?

@brandenchan
Contributor

Hi @SAIVENKATARAJU, what's your intention in creating an index for each file? Is it that you would like to perform your query on just one of your files? (In that case I would recommend using metadata filtering.)

The next thing I would recommend is to actually look at the output of the retriever. To do this, initialize an EvalDocuments node with debug=True and place it in the pipeline right after the retriever. After running your pipeline, EvalDocuments.log should show you the retrieved documents for each query. It is worth checking whether the retrieved documents are what you would expect.
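
A rough sketch of that wiring, assuming the retriever and reader objects from the snippets above (node and parameter names follow Haystack ~0.10 and may differ in other versions; depending on the version, EvalDocuments may also expect gold labels when run):

from haystack import Pipeline
from haystack.eval import EvalDocuments  # Haystack 0.x-era path

eval_docs = EvalDocuments(debug=True)  # keeps a per-query log of retrieved documents

p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=eval_docs, name="EvalDocuments", inputs=["Retriever"])
p.add_node(component=reader_para, name="Reader", inputs=["EvalDocuments"])

prediction = p.run(query=question, top_k_retriever=4, top_k_reader=2,
                   filters={'name': [file_to_search]})
print(eval_docs.log)  # inspect which documents the retriever returned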

@ZanSara
Contributor

ZanSara commented Oct 13, 2021

Hi @SAIVENKATARAJU, since your last post we have implemented new debugging features (see this PR: #1558). Please check out the new master, test again, and if you still face the same problem, let us know by opening a new issue. Closing this for now.
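
With those features, per-node debug output can be requested at query time, roughly like this (a sketch against the Haystack 1.x-style interface; see the PR and the docs of your installed version for the exact details):

# Request debug output from a node at query time and inspect what it
# received and returned during the run
prediction = pipe.run(
    query=question,
    params={"Retriever": {"top_k": 4, "debug": True},
            "Reader": {"top_k": 2}},
)
print(prediction["_debug"])  # per-node input/output captured during the run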

@ZanSara ZanSara closed this as completed Oct 13, 2021