
Using PreProcessor functions on eval data #751

Merged
Timoeller merged 5 commits into master from preprocess_eval_data on Jan 20, 2021

Conversation

Timoeller
Contributor

@Timoeller Timoeller commented Jan 19, 2021

This PR adds preprocessing functionality to evaluation data.

That way we can more easily test different preprocessing configurations and their impact on performance.

Open Todos:

[x] real data testing (tested on COVIDQA.json, splitting by word and by passage)
[x] changed data (tiny.json); make sure other tests do not break
[x] additional test case

More improvements that won't be covered in this PR:

  • Split by sentence is not supported (sentence splitting uses nltk and removes symbols between sentences in non-obvious ways)
  • Cleaning functionality of PreProcessor not supported yet
  • Splitting with overlap functionality of PreProcessor not supported yet

Fixes #711
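
For reference, a rough sketch of how the new option can be used when loading SQuAD-style eval data (the split settings below are only an example, and exact signatures may differ slightly):

    from haystack.preprocessor.preprocessor import PreProcessor
    from haystack.preprocessor.utils import eval_data_from_json

    # Example split settings; any PreProcessor configuration can be plugged in here.
    preprocessor = PreProcessor(
        split_by="word",
        split_length=100,
        split_respect_sentence_boundary=True
    )

    # Load SQuAD-style eval data and apply the same splitting used at indexing time.
    docs, labels = eval_data_from_json("COVIDQA.json", preprocessor=preprocessor)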

@Timoeller Timoeller changed the title WIP: Using PreProcessor functions on eval data Using PreProcessor functions on eval data Jan 20, 2021
@Timoeller Timoeller requested a review from tholor January 20, 2021 10:42
Member

@tholor tholor left a comment


LGTM

@Timoeller Timoeller merged commit 4803da0 into master Jan 20, 2021
@Timoeller Timoeller deleted the preprocess_eval_data branch January 20, 2021 13:40
@Rob192
Contributor

Rob192 commented Jan 21, 2021

Great @Timoeller! Thank you very much for this work; I was looking into doing something similar on my side! Are you working towards an Optimizer class that would take an evaluation file as input and tweak several parameters (e.g. mainly document length, retriever type, top-k values for the retriever and reader, ...) to adjust the settings to the customer data and push something optimized to production? I would love to know more about this if that is the plan, or to help in developing this solution!

@Timoeller
Contributor Author

Hey @Rob192, thanks a lot. I am impressed by how fast new features are picked up by the community :D

About benchmarking: have you already seen https://haystack.deepset.ai/bm/benchmarks/? I could envision a small section about passage splitting there as well. @brandenchan, do you have thoughts on this?

If you want to create a more detailed analysis, I think it would be awesome if you could write a small blog post comparing different preprocessing steps in combination with different retrievers. (I think reader performance won't be so useful here.)
Would you be interested in that? We can surely guide you along the way : )

@Rob192
Contributor

Rob192 commented Jan 21, 2021

Yes, I have seen the benchmarks already; they are very useful.

Here at the Lab IA of Etalab, we think this analysis is not only something you should perform on SQuAD to get some general insights, but also something that should be performed with the final user's data. The performance of the stack really depends on the size of the knowledge base and on the typical questions the user gets from their customers.

We are using a grid search to find the best parameters for our dataset. @psorianom has written a report on some of our experiments here: https://etalab-ia.github.io/knowledge-base/piaf/retriever/Visualisation_csv_results.html For the moment, as you said, we are mainly testing retriever performance -> the main outcome is that filtering the knowledge base (based on folders and subfolders that the user can manually choose) has a big impact on performance.

Up to now this process is still artisanal, but we are working to standardize it a bit, so that when you start a new project with new data it is easy to generate these metrics and set up an optimized Haystack.
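
As an illustration of the kind of grid search we run (the helper functions below are hypothetical stand-ins for our own indexing and evaluation code, not Haystack APIs):

    from itertools import product

    # Hypothetical stand-ins for project-specific indexing and evaluation code.
    def build_stack(split_length, retriever_top_k):
        """Index the knowledge base with the given passage length and return a configured stack."""
        return {"split_length": split_length, "retriever_top_k": retriever_top_k}

    def evaluate_stack(stack, eval_file):
        """Run the user questions from eval_file against the stack and return metrics."""
        return {"recall": 0.0}  # placeholder metric

    results = []
    for split_length, top_k in product([100, 200, 500], [3, 5, 10]):
        stack = build_stack(split_length, top_k)
        metrics = evaluate_stack(stack, "eval_questions.json")
        results.append({"split_length": split_length, "top_k": top_k, **metrics})

    # Keep the configuration with the best metric of interest, e.g. retriever recall.
    best = max(results, key=lambda r: r["recall"])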

@Rob192
Contributor

Rob192 commented Jan 22, 2021

Hello @Timoeller !

I think there is a small bug in the _extract_docs_and_labels_from_dict function of preprocessor.utils: if you use a preprocessor set up with the following parameters (basically set up to create single sentences with a small split_length):

    from haystack.preprocessor.preprocessor import PreProcessor

    preprocessor = PreProcessor(
        clean_empty_lines=False,
        clean_whitespace=False,
        clean_header_footer=False,
        split_by="word",
        split_length=20,
        split_overlap=0,
        split_respect_sentence_boundary=True
    )

Then the splits variable in the _extract_docs_and_labels_from_dict function will contain a list with an empty doc at the beginning:

[{'text': '', 'id': 'be6520a0-90ec-4742-b3f9-2ca1eb29efd1-0', 'score': None, 'probability': None, 'question': None, 'meta': {'name': "Peut-on créer un syndicat secondaire dans un groupe d'immeubles en copropriété\xa0?", '_split_id': 0, '_split_offset': 0}, 'embedding': None},
 {'text': "Oui, lorsqu'une même copropriété comporte plusieurs bâtiments, les copropriétaires peuvent créer un ou plusieurs syndicats secondaires par un vote à la majorité absolue.", 'id': 'be6520a0-90ec-4742-b3f9-2ca1eb29efd1-1', 'score': None, 'probability': None, 'question': None, 'meta': {'name': "Peut-on créer un syndicat secondaire dans un groupe d'immeubles en copropriété\xa0?", '_split_id': 1, '_split_offset': 1}, 'embedding': None},
... ]

In this specific case, my answer is 'Oui' and the position of my answer is 0. However, since the _split_offset is 1, my answer will never be found, resulting in the following error:

    Traceback (most recent call last):
      File "c:\users\utilisateur\pythonprojects\haystack\haystack\preprocessor\utils.py", line 48, in eval_data_from_json
        cur_docs, cur_labels = _extract_docs_and_labels_from_dict(document, preprocessor)
      File "c:\users\utilisateur\pythonprojects\haystack\haystack\preprocessor\utils.py", line 163, in _extract_docs_and_labels_from_dict
        document_id=cur_id,
    UnboundLocalError: local variable 'cur_id' referenced before assignment

I am not sure what the best solution would be: either make sure that the first document is indeed filled, or declare cur_id = splits[0].id like you do when len(splits) == 1.

@Timoeller
Contributor Author

Thanks for pointing us to your evaluation of different retriever settings. You seem to have done quite a detailed analysis already. I will dig into it later on; for example, I do not understand why the scores are so low. On which data did you run this analysis?


About the bug: this is indeed problematic, thanks for bringing it up. I think it is best dealt with in the PreProcessor class, since that is where a "useless" empty document is created that then results in the mentioned bug. I created issue #763 and will work on it in the coming week.
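
Roughly, the idea is to make the splitting step stop emitting empty documents in the first place. A hypothetical sketch of that guard (not the actual patch for #763, where split ids and offsets also need to stay consistent):

    # Hypothetical post-processing of the split output: drop passages whose text is
    # empty, so downstream eval code never sees a "useless" document.
    def drop_empty_splits(documents):
        return [d for d in documents if d["text"].strip()]

    splits = [
        {"text": "", "meta": {"_split_id": 0}},
        {"text": "Oui, lorsqu'une même copropriété comporte plusieurs bâtiments ...", "meta": {"_split_id": 1}},
    ]
    print(drop_empty_splits(splits))  # keeps only the non-empty passage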

@Rob192
Contributor

Rob192 commented Jan 22, 2021

Hello, the dataset used for testing the retriever was composed of questions that were asked directly by citizens to service-public.fr. In the dataset you can also find the documents indicated by the civil agents as answers. We think this is very harsh testing, as the questions are sometimes unclear or the answer is not explicitly found in the document. Sometimes you need to extrapolate a bit to understand how the document helps you answer the question. It is much different from a SQuAD-type dataset, where the questions are tailor-made for the contexts!
