-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Extract link retrieval from WebRetriever, introduce LinkContentFetcher #5227
Conversation
Pull Request Test Coverage Report for Build 5541969811
💛 - Coveralls |
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @vblagoje!
I really like the purpose of this node and the examples you provided. It unlocks several nice features for demos. ✨
I feel that something can be improved in node design and error handling. Let's see what you think...
return response | ||
except requests.RequestException as e: | ||
logger.debug("Error retrieving URL %s: %s", url, e) | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have doubts about exception handling (similar to those of html_content_handler
).
@anakin87 I thought about raising/not raising exceptions and the use cases we are likely to face in the LinkContentRetriever as a standalone component in the pipeline and also as a building block of WebRetriever and perhaps some other components in the near future. I've concluded that it is best to allow flexibility for raising and not raising exceptions. I introduced the raise_on_failure boolean flag as an (init) parameter. This parameter will give us increased robustness/fault tolerance but also flexibility. Here is what I mean more concretely. Robustness: When running a pipeline that use LinkContentRetriever directly and processes several URLs, robustness is crucial. The pipeline should be capable of running to completion even if one or a few URLs fails to be processed. Perhaps other fetched URLs provided enough document context - for example. If raise_on_failure is set to True, the pipeline will blow up once a single exception is encountered. This could lead to incomplete results, with many URLs potentially left unprocessed. By having the option to set raise_on_failure to False, the pipeline can continue to process the remaining URLs even if one URL fails, increasing the pipeline's robustness. Robustness here goes nicely with the adjacent added value of fault tolerance: raise_on_failure can introduce fault tolerance into your pipeline. When set to False, the pipeline can tolerate faults (in this case, failures to retrieve content from URLs) and continue execution. And finally, as I mentioned at the beginning - flexibility. Having the raise_on_failure flag allows users to choose how they want their pipeline to behave. Some components using LinkContentRetriever might want to have the normal exception raising to implement whatever suitable exception-handling logic. Be it for content fetching, be it for content parsing. This flexibility can benefit a wide range of use cases in the future. Let me know if you agree with this approach, and I'll strengthen the test suit to reflect both use cases. |
Ok then @anakin87 - see the last few commits that I think address your concerns. Don't be hesitant to raise more comments now, let's get this one as perfect as possible! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vblagoje, I appreciate the improvements and really like your work in this node!
I added some comments but I feel we are close to the finish line...
I still have a doubt about the nomenclature:
this node is not a real Retriever. Does it make sense to call it LinkContentRetriever
? It can be a bit misleading, IMHO.
I'd also like to hear the opinion of @dfokina. --> As indicated in the docstrings,
LinkContentRetriever fetches content from a URL and converts it into a list of Document objects.
except Exception as e: | ||
if raise_on_failure: | ||
raise e | ||
logger.debug("Couldn't extract content from %s", response.url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new approach with raise_on_failure
, but I would also like to have a higher logging level (info? warn?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this extensively. It will create panic for users with a default logging value set to a warning. Ask Tuana for stories. I highly respect your feedback here, and I would have commented the same, but I am convinced these should be on debug because a) they fail a lot, and I mean a lot, and b) expert users will know how to set the logger to understand and see failures if needed. Users who set raise_on_failure
to True will get exceptions as they wished. The best-balanced solution IMHO. Let's ask @silvanocerza and @ZanSara for opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of background: I selected debug level instead of warning because when we retrieve many links in WebRetriever, quite a few fail. They fail because websites increasingly block non-human traffic. Some parsing also fails because we use licence friendly HTML parser, which could be better, but it works. Having these failed links and parsing logged at warn looks scary to users. Tuana told me she's seeing many users complaining about these warning logs. That's why. Does it make better sense with this background info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer it to be a warning or info to be fair. If users don't want to see logging they can disable it. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the logic is wrong and the log won't be ever printed as of now.
If raise_on_failure
is False
the extractor won't even raise any exception.
This would both log and reraise the exception, but at that point might as well just let the extractor raise and remove the try
.
logger.debug("Couldn't extract content from %s", response.url)
raise e
extracted_doc["content"] = extracted_content | ||
logger.debug("%s handler extracted content from %s", handler, url) | ||
else: | ||
text = extracted_doc.get("text", "") # we might have snippet under text, use it as fallback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't understand properly what happens in this else
branch.
Please explain...
try: | ||
response = requests.get(url, headers=LinkContentRetriever.REQUEST_HEADERS, timeout=timeout) | ||
if response.status_code != HTTPStatus.OK: | ||
logger.debug("Couldn't retrieve content from %s, status code: %s", url, response.status_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Higher logging level (info? warn?)?
|
||
|
||
@pytest.mark.unit | ||
def test_call(mocked_requests, mocked_article_extractor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_call(mocked_requests, mocked_article_extractor): | |
def test_fetch(mocked_requests, mocked_article_extractor): |
This and the following tests could be renamed with fetch
instead of call
.
@pytest.mark.integration | ||
def test_retrieve_with_valid_url_on_live_web(): | ||
""" | ||
Test that LinkContentRetriever can fetch content from a valid URL using the retrieve method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test that LinkContentRetriever can fetch content from a valid URL using the retrieve method | |
Test that LinkContentRetriever can fetch content from a valid URL using the run method |
I was thinking about the naming but got some extra questions: |
There is no summarization going on. It fetches content from the URL and converts it to a list of documents. |
In this PR, I see two points still open:
|
@anakin87 is it less of a retriever than WebRetriever? They both fetch web data (just different kinds) and convert it into documents, right? |
Yeah, LinkFetcher works; where do we put it in the source tree is another question now that the component is not the retriever. We need input from @TuanaCelik here. She and I talked extensively about logs from the WebRetriever. Let's wait for her to come back. If she says these should be info/warning I'll accept your suggestions @silvanocerza @anakin87 |
You mean when |
LinkContentExtractor, LinkContentFetcher, all good. I thinking about where to put this class in the source tree :-) |
Updated @anakin87 , please have another look and the three last commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost good to go...
Left two comments.
""" | ||
Checks the behavior when there's an exception during content extraction, and raise_on_failure is set to False. | ||
""" | ||
caplog.set_level(logging.DEBUG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
caplog.set_level(logging.DEBUG) | |
caplog.set_level(logging.WARNING) |
Perhaps this and the following should be changed?
else: | ||
logger.warning("%s handler failed to extract content from %s", handler, url) | ||
text = extracted_doc.get("text", "") | ||
extracted_doc["content"] = text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand the meaning of this part.
Can you please explain? Why do we want to provide a default text for the document?
If really useful, it should be documented in some way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid question. See https://github.com/deepset-ai/haystack/pull/5229/files#r1261301963 for details. Whatever we agree on here as a fallback key - I'm fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see...
What puzzles me about the current strategy is that you can end up having both content
and text
(inside meta
) for the same document.
How about explicitly naming it snippet_text
and adding a short comment explaining this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🎆
Let's also make sure this new component is added to the documentation! (we can add it here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml) |
Let's wait for it a bit - IMHO. See how it works standalone and within WebRetriever. Perhaps after we add pdf support and see that it could be really useful to users - we can add it to docs and promote it then. Thoughts @dfokina @julian-risch? |
I am for adding documentation from the start. How do we expect people to start using the new feature if we don't document it? We don't need to have tutorials or example scripts and we don't need to promote it now but a documentation page should be there. It can also mention something like that this is a new feature in a preview mode, expect changes etc. |
I'm also for adding it to the docs (API ref & Guides), unless we specifically don't want anyone to be using it for now. |
Ok that's great, I agree. Will it be automatically added to API ref? Do we need do add anything to this PR at this time? |
It should be added here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml |
After a bit of back and forth debate we have settled on the LinkContentFetcher name for this component. |
Sorry for the naming mismatch! |
What?
Introduces a new component,
LinkContentRetriever
. This component already implicitly exists withinWebRetriever
but has never been explicitly exposed until now.Why?
There is a growing necessity to separate
LinkContentRetriever
from the existingWebRetriever
code base and expose it as an independent component yet allow it to be reused within WebRetriever. The standaloneLinkContentRetriever
will be particularly beneficial in pipeline and agent scenarios where content needs to be fetched from a specific URL link. This will be useful in scenarios such as, "Read https://pythonspeed.com/articles/base-image-python-docker-images/ and extract the main conclusion from the blog post, be brief."LinkContentRetriever
will be particularly beneficial for Haystack V2, as it will enable the creation of more meaningful and engaging demos.How can it be used?
LinkContentRetriever
can be used as a standalone component for link retrieval. We currently support HTML-to-text conversion, but the design allows a clear extension point for other content types (e.g. pdf). A potential use-case could be, "Read https://arxiv.org/pdf/2305.06983.pdf and provide a summary and main learnings using the Feynman technique. The summary should be at most a few paragraphs long."Here is a simple example of how it may be used. The upcoming V2 will expand these possibilities.
The results are:
How did you test it?
Added comprehensive unit tests to ensure the functionality and reliability of
LinkContentRetriever
.Added an example from the above as a new example
examples/link_content_blog_post_summary.py
Notes for the reviewer:
Please review the new
LinkContentRetriever
component and its corresponding unit tests. Consider its potential use cases and extension possibilities for other content types. Are those extensions well thought out? I would greatly appreciate your insights and suggestions.