
Document Retrieval for extractive QA with COVID-QA #108

Closed
aaronbriel opened this issue Oct 15, 2020 · 6 comments


aaronbriel commented Oct 15, 2020

Thank you so much for sharing your data and tools! I am working with the question-answering dataset for an experiment of my own.

@Timoeller mentioned in #103 that the documents used in the annotation tool to create the COVID-QA.json dataset "are a subset of CORD-19 papers that annotators deemed related to Covid." I was wondering if these are the same documents as listed in faq_covidbert.csv.

The reason I ask is that, as a workaround, I've created my own retrieval txt file(s) by extracting the answers from COVID-QA.json, but the results are hit or miss. They are particularly off if I break the file up into chunks to improve performance, for instance into a separate txt file per answer. I'm assuming this is due to lost context. I'm wondering if I should simply be using faq_covidbert as illustrated here, even though I am doing extractive QA.

The reason I took this approach is that I was trying to follow the extractive QA tutorial as closely as possible.

My ultimate objective is to compare the experience of using extractive QA vs. FAQ-style QA, so I presumed it would be appropriate to have a bit of separation in the document-store dataset.

Thank you!

aaronbriel changed the title from "Subset of papers used to create COVID-QA.json Question Answering dataset" to "Document Retrieval for extractive QA with COVID-QA.json" on Oct 15, 2020
aaronbriel changed the title from "Document Retrieval for extractive QA with COVID-QA.json" to "Document Retrieval for extractive QA with COVID-QA" on Oct 15, 2020
@Timoeller (Contributor)

Hey @aaronbriel, cool that you like the dataset. We also have an accompanying ACL workshop paper.

index docs (and labels) into haystack

So you want to index the documents into haystack, and you used the "context" of each "paragraph" in the COVID-QA.json file. That is correct, since these texts are exactly the associated papers.
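
In case it helps anyone reading along, here is a minimal sketch of that indexing step, assuming an Elasticsearch document store and the module layout of the haystack tutorials from around this time (import paths have moved between versions):

```python
import json

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

# COVID-QA.json is in SQuAD format; the "context" of each "paragraph"
# is the full text of the associated CORD-19 paper.
with open("COVID-QA.json") as f:
    covid_qa = json.load(f)

docs = []
for article in covid_qa["data"]:
    for paragraph in article["paragraphs"]:
        docs.append({"text": paragraph["context"],
                     "meta": {"name": article.get("title", "")}})

# Index the papers so the retriever can search over them.
document_store = ElasticsearchDocumentStore(host="localhost", index="covid_qa_docs")
document_store.write_documents(docs)
```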

You can directly use the COVID-QA.json file in haystack's Tutorial5_Evaluation.py once the PR deepset-ai/haystack/pull/494 is merged.
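
For reference, the evaluation tutorial loads such SQuAD-style files with add_eval_data; a sketch, assuming that helper and its default index names (check the merged PR for the exact signature):

```python
# Load gold documents and labels from COVID-QA.json into dedicated indices,
# as Tutorial5_Evaluation.py does for its SQuAD-style dev set.
document_store.add_eval_data(filename="COVID-QA.json",
                             doc_index="eval_document",
                             label_index="label")
```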

use the finder for asking single questions on the docs

If you have indexed the JSON file in haystack's document store, you can then use the finder with any question, like:

prediction = finder.get_answers(question="Is this really working?", top_k_retriever=10, top_k_reader=5, index=doc_index)
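
For completeness, the surrounding wiring looks roughly like this (a sketch; class and module names vary by haystack version):

```python
from haystack import Finder
from haystack.reader.farm import FARMReader
from haystack.retriever.sparse import ElasticsearchRetriever

# Sparse (BM25-style) retrieval over the indexed papers, plus an extractive reader.
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

finder = Finder(reader=reader, retriever=retriever)
prediction = finder.get_answers(question="Is this really working?",
                                top_k_retriever=10, top_k_reader=5)
```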

So you said you want to compare extractive QA with FAQ-based QA. That's pretty cool. I presume that FAQ-based QA is rather simple, since you only match the incoming question against the questions from your database. Looking forward to seeing your results there!
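
If it is useful as a reference point, FAQ-style QA roughly along the lines of haystack's FAQ tutorial could look like this (a sketch; the embedding model, field names, and exact signatures are assumptions and vary by version):

```python
from haystack.retriever.dense import EmbeddingRetriever

# FAQ store: each document's text is a known question; the curated answer lives in meta.
faq_store = ElasticsearchDocumentStore(index="faq_docs",
                                       embedding_field="question_emb",
                                       embedding_dim=768,
                                       excluded_meta_data=["question_emb"])
retriever = EmbeddingRetriever(document_store=faq_store,
                               embedding_model="deepset/sentence_bert")

faq_docs = [{"text": "Where was COVID-19 first discovered?",
             "meta": {"answer": "Wuhan City, Hubei Province, China"}}]
# Embed the stored questions so incoming questions can be matched by similarity.
for d in faq_docs:
    d["question_emb"] = retriever.embed_queries(texts=[d["text"]])[0]
faq_store.write_documents(faq_docs)

# "Answering" is just returning the stored answer of the most similar stored question.
matches = retriever.retrieve(query="Where was COVID-19 first found?", top_k=1)
print(matches[0].meta["answer"])
```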

@aaronbriel (Author)

@Timoeller Ah, I'm embarrassed - I forgot about the context entries! I've been doing too many things at once, apparently. Thank you so much for the quick response!


aaronbriel commented Oct 15, 2020

@Timoeller I'm seeing less-than-stellar results (at least for the few tests I've done) when I break the contexts up into separate files, as opposed to my prior approach of concatenating all questions and answers sequentially into a single document. For example, with the question "Where was COVID19 first discovered?" (a question pulled directly from COVID-QA.json), the latter correctly returns "Wuhan City, Hubei Province, China" as the result with the highest probability (.78). However, with the contexts approach the highest-probability result is "1971 by Theodor Diener" (.57).

This is using deepset/roberta-base-squad2 fine-tuned with the COVID-QA.json dataset. I would have expected this to improve performance, so it seems retrieval is the weak link here. Am I missing something?

aaronbriel reopened this Oct 15, 2020

aaronbriel commented Oct 16, 2020

After further investigation into open haystack issues, I'm wondering if I'm running into the problem with roberta-base-squad2 described here and here. Interestingly, however, the question above is towards the end of the document rather than the beginning. Either way, this is outside the scope of the original issue, so I will close it and investigate further in the context of FARMReader and the model. Thanks!

@Timoeller (Contributor)

Exactly, the bugs you linked are just minor things that shouldn't affect your use case.

Btw, it is this model: https://huggingface.co/deepset/roberta-base-squad2-covid that was fine-tuned on COVID-QA, not the plain "deepset/roberta-base-squad2" (note the missing "-covid").
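
For anyone following along, pointing the reader at that checkpoint is just (a sketch):

```python
from haystack.reader.farm import FARMReader

# Note the "-covid" suffix: this is the checkpoint fine-tuned on COVID-QA.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-covid")
```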

@aaronbriel (Author)

Right - I actually fine-tuned my own model with that same COVID-QA data, as I am creating different models trained on different sets of COVID data for experimental purposes.
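
For reference, fine-tuning a reader on the SQuAD-style COVID-QA.json can be done with FARMReader.train; a rough sketch (the directory, hyperparameters, and save path here are placeholders):

```python
from haystack.reader.farm import FARMReader

# Start from the plain SQuAD 2.0 checkpoint and fine-tune on COVID-QA.json,
# which is already in SQuAD format.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
reader.train(data_dir="data/covid_qa",        # folder containing COVID-QA.json
             train_filename="COVID-QA.json",
             n_epochs=1,
             save_dir="roberta-base-squad2-covid-custom")
```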
