
Using PreProcessor functions on eval data #751

Merged
Timoeller merged 5 commits into master from preprocess_eval_data on Jan 20, 2021

Conversation

Timoeller
Contributor

@Timoeller Timoeller commented Jan 19, 2021

This PR adds preprocessing functionality to evaluation data.

That way we can more easily test different preprocessing configurations and their impact on performance.

Open Todos:

[x] real data testing (tested on COVIDQA.json, splitting by word and by passage)
[x] changed data (tiny.json); make sure other tests do not break
[x] additional test case

More improvements that won't be covered in this PR:

  • Split by sentence is not supported (sentence splitting uses nltk and removes symbols between sentences in non-obvious ways)
  • Cleaning functionality of PreProcessor not supported yet
  • Splitting with overlap functionality of PreProcessor not supported yet

Fixes #711
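
For reference, a rough sketch of how the new option can be used when loading SQuAD-style eval data (the split settings below are only an example, and exact signatures may differ slightly):

    from haystack.preprocessor.preprocessor import PreProcessor
    from haystack.preprocessor.utils import eval_data_from_json

    # Example split settings; any PreProcessor configuration can be plugged in here.
    preprocessor = PreProcessor(
        split_by="word",
        split_length=100,
        split_respect_sentence_boundary=True
    )

    # Load SQuAD-style eval data and apply the same splitting used at indexing time.
    docs, labels = eval_data_from_json("COVIDQA.json", preprocessor=preprocessor)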

@Timoeller Timoeller changed the title WIP: Using PreProcessor functions on eval data Using PreProcessor functions on eval data Jan 20, 2021
@Timoeller Timoeller requested a review from tholor January 20, 2021 10:42
Member

@tholor tholor left a comment


LGTM

@Timoeller Timoeller merged commit 4803da0 into master Jan 20, 2021
@Timoeller Timoeller deleted the preprocess_eval_data branch January 20, 2021 13:40
@Rob192
Contributor

Rob192 commented Jan 21, 2021

Great @Timoeller! Thank you very much for this work; I was looking into doing something similar on my side! Are you working towards an Optimizer class that would take an evaluation file as input and tweak several parameters (e.g. mainly document length, retriever type, top-k values for the retriever and reader, ...) to adjust the settings to the customer data and push something optimized to production? I would love to know more about this if that is the plan, or to help in developing this solution!

@Timoeller
Contributor Author

Hey @Rob192, thanks a lot. I am impressed by how fast new features are picked up by the community :D

About benchmarking: have you already seen https://haystack.deepset.ai/bm/benchmarks/? I could envision a small section about passage splitting there as well. @brandenchan, do you have thoughts on this?

If you want to create a more detailed analysis, I think it would be awesome if you could write a small blog post comparing different preprocessing steps in combination with different retrievers. (I think reader performance won't be so useful here.)
Would you be interested in that? We can surely guide you along the way : )

@Rob192
Contributor

Rob192 commented Jan 21, 2021

Yes, I have seen the benchmarks already; they are very useful.

Here at the Lab IA of Etalab, we think this analysis is not only something you should perform on SQuAD to get some general insights, but also something that should be performed with the final user's data. The performance of the stack really depends on the size of the knowledge base and on the typical questions the user gets from their customers.

We are using a grid search to find the best parameters for our dataset. @psorianom has written a report on some of our experiments here: https://etalab-ia.github.io/knowledge-base/piaf/retriever/Visualisation_csv_results.html For the moment, as you said, we are mainly testing retriever performance -> the main outcome is that filtering the knowledge base (based on folders and subfolders that the user can manually choose) has a big impact on performance.

Up to now this process is still artisanal, but we are working to standardize it a bit, so that when you start a new project with new data it is easy to generate these metrics and set up an optimized Haystack.
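
As an illustration of the kind of grid search we run (the helper functions below are hypothetical stand-ins for our own indexing and evaluation code, not Haystack APIs):

    from itertools import product

    # Hypothetical stand-ins for project-specific indexing and evaluation code.
    def build_stack(split_length, retriever_top_k):
        """Index the knowledge base with the given passage length and return a configured stack."""
        return {"split_length": split_length, "retriever_top_k": retriever_top_k}

    def evaluate_stack(stack, eval_file):
        """Run the user questions from eval_file against the stack and return metrics."""
        return {"recall": 0.0}  # placeholder metric

    results = []
    for split_length, top_k in product([100, 200, 500], [3, 5, 10]):
        stack = build_stack(split_length, top_k)
        metrics = evaluate_stack(stack, "eval_questions.json")
        results.append({"split_length": split_length, "top_k": top_k, **metrics})

    # Keep the configuration with the best metric of interest, e.g. retriever recall.
    best = max(results, key=lambda r: r["recall"])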

@Rob192
Contributor

Rob192 commented Jan 22, 2021

Hello @Timoeller !

I think there is a small bug in the _extract_docs_and_labels_from_dict function of preprocessor.utils: if you use a preprocessor set up with the following parameters (basically set up to create single sentences with a small split_length):

    from haystack.preprocessor.preprocessor import PreProcessor

    preprocessor = PreProcessor(
        clean_empty_lines=False,
        clean_whitespace=False,
        clean_header_footer=False,
        split_by="word",
        split_length=20,
        split_overlap=0,
        split_respect_sentence_boundary=True
    )

Then the splits variable in the _extract_docs_and_labels_from_dict function will contain a list with an empty doc at the beginning:

[{'text': '', 'id': 'be6520a0-90ec-4742-b3f9-2ca1eb29efd1-0', 'score': None, 'probability': None, 'question': None, 'meta': {'name': "Peut-on créer un syndicat secondaire dans un groupe d'immeubles en copropriété\xa0?", '_split_id': 0, '_split_offset': 0}, 'embedding': None},
 {'text': "Oui, lorsqu'une même copropriété comporte plusieurs bâtiments, les copropriétaires peuvent créer un ou plusieurs syndicats secondaires par un vote à la majorité absolue.", 'id': 'be6520a0-90ec-4742-b3f9-2ca1eb29efd1-1', 'score': None, 'probability': None, 'question': None, 'meta': {'name': "Peut-on créer un syndicat secondaire dans un groupe d'immeubles en copropriété\xa0?", '_split_id': 1, '_split_offset': 1}, 'embedding': None},
... ]

In this specific case, my answer is 'Oui' and the position of my answer is 0. However, since the _split_offset is 1, my answer will never be found, resulting in the following error:

    Traceback (most recent call last):
      File "c:\users\utilisateur\pythonprojects\haystack\haystack\preprocessor\utils.py", line 48, in eval_data_from_json
        cur_docs, cur_labels = _extract_docs_and_labels_from_dict(document, preprocessor)
      File "c:\users\utilisateur\pythonprojects\haystack\haystack\preprocessor\utils.py", line 163, in _extract_docs_and_labels_from_dict
        document_id=cur_id,
    UnboundLocalError: local variable 'cur_id' referenced before assignment

I am not sure what the best solution would be: either make sure that the first document is indeed filled, or declare cur_id = splits[0].id like you do when len(splits) == 1.

@Timoeller
Contributor Author

Thanks for pointing us to your evaluation of different retriever settings. You seem to have done quite a detailed analysis already. I will dig into it later on; for example, I do not understand why the scores are so low. On which data did you run this analysis?


About the bug: this is indeed problematic, thanks for bringing it up. I think it is best dealt with in the PreProcessor class, since that is where a "useless" empty document is created that then results in the mentioned bug. I created issue #763 and will work on it in the coming week.
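
Roughly, the idea is to make the splitting step stop emitting empty documents in the first place. A hypothetical sketch of that guard (not the actual patch for #763, where split ids and offsets also need to stay consistent):

    # Hypothetical post-processing of the split output: drop passages whose text is
    # empty, so downstream eval code never sees a "useless" document.
    def drop_empty_splits(documents):
        return [d for d in documents if d["text"].strip()]

    splits = [
        {"text": "", "meta": {"_split_id": 0}},
        {"text": "Oui, lorsqu'une même copropriété comporte plusieurs bâtiments ...", "meta": {"_split_id": 1}},
    ]
    print(drop_empty_splits(splits))  # keeps only the non-empty passage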

@Rob192
Contributor

Rob192 commented Jan 22, 2021

Hello, the dataset used for testing the retriever was composed of questions that were asked directly by citizens to service-public.fr. In the dataset you can also find the documents indicated by the civil agents as answers. We think this is very harsh testing, as the questions are sometimes unclear or the answer is not explicitly found in the document. Sometimes you need to extrapolate a bit to understand how the document helps you answer the question. It is much different from a SQuAD-type dataset, where the questions are tailor-made for the contexts!
