refactor: Extract link retrieval from WebRetriever, introduce LinkContentFetcher #5227

Merged
merged 18 commits into from
Jul 13, 2023

Member

@vblagoje vblagoje commented Jun 29, 2023

What?

Introduces a new component, LinkContentRetriever. This component already implicitly exists within WebRetriever but has never been explicitly exposed until now.

Why?

There is a growing necessity to separate LinkContentRetriever from the existing WebRetriever code base and expose it as an independent component yet allow it to be reused within WebRetriever. The standalone LinkContentRetriever will be particularly beneficial in pipeline and agent scenarios where content needs to be fetched from a specific URL link. This will be useful in scenarios such as, "Read https://pythonspeed.com/articles/base-image-python-docker-images/ and extract the main conclusion from the blog post, be brief." LinkContentRetriever will be particularly beneficial for Haystack V2, as it will enable the creation of more meaningful and engaging demos.

How can it be used?

LinkContentRetriever can be used as a standalone component for link retrieval. We currently support HTML-to-text conversion, but the design allows a clear extension point for other content types (e.g. pdf). A potential use-case could be, "Read https://arxiv.org/pdf/2305.06983.pdf and provide a summary and main learnings using the Feynman technique. The summary should be at most a few paragraphs long."
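
To make the extension point concrete, here is a minimal sketch of how dispatch by content type could work. It is an illustration only, not the code from this PR: the names ContentHandler, HANDLERS, register_handler and extract are hypothetical, and the HTML handler is a toy regex-based stand-in for a real HTML-to-text converter.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical handler signature: raw response body in, extracted text out (None on failure).
ContentHandler = Callable[[bytes], Optional[str]]


def html_handler(body: bytes) -> Optional[str]:
    """Toy HTML handler: strips tags with a regex; a real one would use an HTML parser."""
    text = re.sub(r"<[^>]+>", " ", body.decode("utf-8", errors="ignore"))
    return " ".join(text.split()) or None


# Registry keyed by media type; only HTML is supported out of the box in this sketch.
HANDLERS: Dict[str, ContentHandler] = {"text/html": html_handler}


def register_handler(content_type: str, handler: ContentHandler) -> None:
    """Extension point: plug in a handler for another content type."""
    HANDLERS[content_type.lower()] = handler


def extract(content_type_header: str, body: bytes) -> Optional[str]:
    """Pick a handler from the Content-Type header; return None if unsupported."""
    # "text/html; charset=utf-8" -> "text/html"
    media_type = content_type_header.split(";")[0].strip().lower()
    handler = HANDLERS.get(media_type)
    return handler(body) if handler else None
```

Under this arrangement, adding pdf support would amount to one register_handler("application/pdf", ...) call, with no change to the fetching code.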

Here is a simple example of how it may be used. The upcoming V2 will expand these possibilities.

import os
from haystack.nodes import PromptNode, LinkContentRetriever, PromptTemplate
from haystack import Pipeline

openai_key = os.environ.get("OPENAI_API_KEY")
if not openai_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable")

retriever = LinkContentRetriever()
pt = PromptTemplate(
    "Given the paragraphs of the blog post, "
    "provide the main learnings and the final conclusion using short bullet points format."
    "\n\nParagraphs: {documents}"
)

prompt_node = PromptNode(
    "gpt-3.5-turbo-16k-0613",
    api_key=openai_key,
    max_length=512,
    default_prompt_template=pt,
    model_kwargs={"stream": True},
)

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

blog_posts = [
    "https://pythonspeed.com/articles/base-image-python-docker-images/",
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
]

for blog_post in blog_posts:
    print(f"Blog post summary: {blog_post}")
    pipeline.run(blog_post)
    print("\n\n\n")

The results are:

Blog post summary: https://pythonspeed.com/articles/base-image-python-docker-images/
Main Learnings:
- When building a Docker image for a Python application, the choice of base image is important.
- Criteria for choosing a base image include stability, security updates, up-to-date dependencies, extensive dependencies, up-to-date Python, and small image size.
- Alpine Linux is not recommended for use as a base image due to potential issues.
- The three major operating systems that meet the criteria are Debian Stable, Ubuntu LTS, and RedHat Enterprise Linux and clones.

Final Conclusion:
- When choosing a base image for a Python application, it is recommended to use Ubuntu LTS, RedHat Universal Base Image, or Debian Stable.
- Avoid using Alpine Linux as a base image.
- Consider the specific criteria and needs of your application when making the decision.



Blog post summary: https://lilianweng.github.io/posts/2023-06-23-agent/
Main Learnings:

- LLM (large language model) can be used as the core controller for building autonomous agents.
- LLM-powered autonomous agents consist of LLM as the brain, complemented by components such as planning, memory, and tool use.
- Planning involves task decomposition and can be guided by techniques like Chain of Thought and Tree of Thoughts.
- Self-reflection allows agents to improve iteratively by refining past action decisions and correcting mistakes.
- Memory includes sensory memory, short-term memory, and long-term memory. External vector stores and fast retrieval can enhance memory capabilities.
- Tool use extends the capabilities of LLM-powered agents by leveraging external APIs and expert modules.
- There are various algorithms, such as LSH, ANNOY, HNSW, FAISS, and ScaNN, that enable fast maximum inner-product search (MIPS) in memory retrieval.
- Autonomy in scientific discovery and generative agents simulations showcases the potential of LLM-powered agents in complex tasks.
- Proof-of-concept examples like AutoGPT and GPT-Engineer demonstrate the use of LLM in autonomous agents but also highlight challenges such as reliability of natural language interfaces and the need for long-term planning.

Final Conclusion:

LLM-powered autonomous agents have the potential to tackle complex tasks and improve iteratively through self-reflection. These agents can benefit from task decomposition, memory retrieval, and tool use. However, challenges such as finite context length, reliability of natural language interfaces, and long-term planning need to be addressed to fully harness the capabilities of LLM-powered agents.

How did you test it?

Added comprehensive unit tests to ensure the functionality and reliability of LinkContentRetriever.
Added the example above as a new example script, examples/link_content_blog_post_summary.py

Notes for the reviewer:

Please review the new LinkContentRetriever component and its corresponding unit tests. Consider its potential use cases and extension possibilities for other content types. Are those extensions well thought out? I would greatly appreciate your insights and suggestions.

@vblagoje vblagoje requested a review from a team as a code owner June 29, 2023 09:36
@vblagoje vblagoje requested review from masci and removed request for a team June 29, 2023 09:36
@coveralls
Collaborator

coveralls commented Jun 29, 2023

Pull Request Test Coverage Report for Build 5541969811

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 827 unchanged lines in 21 files lost coverage.
  • Overall coverage increased (+1.4%) to 45.45%

Files with Coverage Reduction | New Missed Lines | %
document_stores/elasticsearch/__init__.py | 2 | 71.43%
nodes/prompt/invocation_layer/azure_chatgpt.py | 2 | 89.47%
nodes/prompt/invocation_layer/azure_open_ai.py | 2 | 89.47%
preview/pipeline.py | 3 | 91.18%
nodes/prompt/invocation_layer/hugging_face.py | 5 | 91.73%
nodes/prompt/invocation_layer/chatgpt.py | 7 | 75.0%
nodes/prompt/prompt_model.py | 7 | 86.0%
nodes/retriever/_openai_encoder.py | 7 | 81.32%
document_stores/elasticsearch/es7.py | 8 | 80.43%
nodes/prompt/invocation_layer/open_ai.py | 11 | 80.85%
Totals Coverage Status
Change from base Build 5410444095: 1.4%
Covered Lines: 10523
Relevant Lines: 23153

💛 - Coveralls

@vblagoje vblagoje requested review from anakin87 and removed request for masci July 4, 2023 16:06
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
Member

@anakin87 anakin87 left a comment

Hey @vblagoje!
I really like the purpose of this node and the examples you provided. It unlocks several nice features for demos. ✨

I feel that something can be improved in node design and error handling. Let's see what you think...

haystack/nodes/retriever/link_content.py
        return response
    except requests.RequestException as e:
        logger.debug("Error retrieving URL %s: %s", url, e)
        return None
Member

I have doubts about exception handling (similar to those of html_content_handler).

@vblagoje
Member Author

vblagoje commented Jul 10, 2023

@anakin87 I thought about raising/not raising exceptions and the use cases we are likely to face in the LinkContentRetriever as a standalone component in the pipeline and also as a building block of WebRetriever and perhaps some other components in the near future. I've concluded that it is best to allow flexibility for raising and not raising exceptions. I introduced the raise_on_failure boolean flag as an (init) parameter. This parameter will give us increased robustness/fault tolerance but also flexibility. Here is what I mean more concretely.

Robustness: When running a pipeline that uses LinkContentRetriever directly and processes several URLs, robustness is crucial. The pipeline should be able to run to completion even if one or a few URLs fail to be processed; the other fetched URLs may well provide enough document context. If raise_on_failure is set to True, the pipeline will blow up as soon as a single exception is encountered. This could lead to incomplete results, with many URLs potentially left unprocessed. With raise_on_failure set to False, the pipeline can continue to process the remaining URLs even if one URL fails, increasing the pipeline's robustness.

Robustness here goes nicely with the adjacent added value of fault tolerance: raise_on_failure can introduce fault tolerance into your pipeline. When set to False, the pipeline can tolerate faults (in this case, failures to retrieve content from URLs) and continue execution.

And finally, as I mentioned at the beginning - flexibility. Having the raise_on_failure flag allows users to choose how they want their pipeline to behave. Some components using LinkContentRetriever might want to have the normal exception raising to implement whatever suitable exception-handling logic. Be it for content fetching, be it for content parsing. This flexibility can benefit a wide range of use cases in the future.

Let me know if you agree with this approach, and I'll strengthen the test suite to reflect both use cases.
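
A minimal sketch of the two behaviours described above. It assumes a hypothetical fetch_one/fetch_many pair and an injected get callable standing in for the HTTP request; none of these names come from the PR.

```python
import logging
from typing import Callable, Dict, Iterable, Optional

logger = logging.getLogger(__name__)


def fetch_one(url: str, get: Callable[[str], str], raise_on_failure: bool = False) -> Optional[str]:
    """Fetch a single URL; either propagate the error or log it and return None."""
    try:
        return get(url)
    except Exception as e:
        if raise_on_failure:
            raise  # caller wants to implement its own exception handling
        logger.debug("Couldn't retrieve content from %s: %s", url, e)
        return None


def fetch_many(urls: Iterable[str], get: Callable[[str], str], raise_on_failure: bool = False) -> Dict[str, str]:
    """Fetch several URLs; with raise_on_failure=False, failed URLs are skipped."""
    results: Dict[str, str] = {}
    for url in urls:
        content = fetch_one(url, get, raise_on_failure)
        if content is not None:
            results[url] = content
    return results
```

With raise_on_failure=False a blocked URL simply drops out of the result, while raise_on_failure=True stops at the first failure and hands the exception to the caller.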

@anakin87 anakin87 self-assigned this Jul 10, 2023
@vblagoje
Member Author

Ok then @anakin87 - see the last few commits that I think address your concerns. Don't be hesitant to raise more comments now, let's get this one as perfect as possible!

Member

@anakin87 anakin87 left a comment

@vblagoje, I appreciate the improvements and really like your work in this node!
I added some comments but I feel we are close to the finish line...

I still have a doubt about the nomenclature:
this node is not a real Retriever. Does it make sense to call it LinkContentRetriever? It can be a bit misleading, IMHO.
I'd also like to hear the opinion of @dfokina. --> As indicated in the docstrings,

LinkContentRetriever fetches content from a URL and converts it into a list of Document objects.

haystack/nodes/retriever/link_content.py
    except Exception as e:
        if raise_on_failure:
            raise e
        logger.debug("Couldn't extract content from %s", response.url)
Member

I like the new approach with raise_on_failure, but I would also like to have a higher logging level (info? warn?).

Member Author

@vblagoje vblagoje Jul 11, 2023

I thought about this extensively. It will create panic for users with a default logging value set to a warning. Ask Tuana for stories. I highly respect your feedback here, and I would have commented the same, but I am convinced these should be on debug because a) they fail a lot, and I mean a lot, and b) expert users will know how to set the logger to understand and see failures if needed. Users who set raise_on_failure to True will get exceptions as they wished. The best-balanced solution IMHO. Let's ask @silvanocerza and @ZanSara for opinions.

Member Author

A bit of background: I selected debug level instead of warning because when we retrieve many links in WebRetriever, quite a few fail. They fail because websites increasingly block non-human traffic. Some parsing also fails because we use a license-friendly HTML parser, which could be better, but it works. Having these failed links and parsing errors logged at warn looks scary to users. Tuana told me she's seeing many users complaining about these warning logs. That's why. Does it make more sense with this background info?
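
For readers who do want to see these failures, here is a sketch of raising the verbosity for a single module with the standard logging library. The logger name below is an assumption based on the file path under review; check the module's own logging.getLogger(...) call for the real name. The enable_debug_logs helper is hypothetical.

```python
import logging

# Assumed logger name, derived from haystack/nodes/retriever/link_content.py.
LOGGER_NAME = "haystack.nodes.retriever.link_content"


def enable_debug_logs(logger_name: str = LOGGER_NAME) -> logging.Logger:
    """Turn on DEBUG output for one logger without touching the root level."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach a stream handler only once so repeated calls don't duplicate output.
        handler = logging.StreamHandler()
        handler.setLevel(logging.DEBUG)
        logger.addHandler(handler)
    return logger
```

Everything else keeps the default WARNING threshold, so only the fetch and parse failures of this module become visible.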

Contributor

@silvanocerza silvanocerza Jul 11, 2023

I'd prefer it to be a warning or info to be fair. If users don't want to see logging they can disable it. 🤷

Contributor

Also, the logic is wrong: as of now the log will never be printed.

If raise_on_failure is False, the extractor won't even raise an exception.
The following would both log and re-raise the exception, but at that point we might as well just let the extractor raise and remove the try:

        logger.debug("Couldn't extract content from %s", response.url)
        raise e
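
The point can be reproduced with a small self-contained sketch. The extractor function here is hypothetical and only mimics a handler that already honours raise_on_failure on its own; extract_as_reviewed mirrors the structure under review, and extract_simplified is one possible arrangement without the outer try, not necessarily the one that was merged.

```python
import logging

logger = logging.getLogger("link_content_sketch")


def extractor(text: str, raise_on_failure: bool) -> str:
    """Hypothetical extractor that already honours raise_on_failure itself."""
    try:
        if not text:
            raise ValueError("nothing to extract")
        return text.strip()
    except ValueError:
        if raise_on_failure:
            raise
        return ""  # failure swallowed here, so callers never see an exception


def extract_as_reviewed(text: str, raise_on_failure: bool) -> str:
    """Mirrors the reviewed structure: the debug line can never run."""
    try:
        return extractor(text, raise_on_failure)
    except Exception as e:
        if raise_on_failure:
            raise e
        # Unreachable: extractor only raises when raise_on_failure is True.
        logger.debug("Couldn't extract content")
        return ""


def extract_simplified(text: str, raise_on_failure: bool) -> str:
    """No outer try: exceptions propagate, empty results are logged."""
    content = extractor(text, raise_on_failure)
    if not content:
        logger.debug("Couldn't extract content")
    return content
```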

haystack/nodes/retriever/link_content.py
        extracted_doc["content"] = extracted_content
        logger.debug("%s handler extracted content from %s", handler, url)
    else:
        text = extracted_doc.get("text", "")  # we might have snippet under text, use it as fallback
Member

I can't understand properly what happens in this else branch.
Please explain...

try:
    response = requests.get(url, headers=LinkContentRetriever.REQUEST_HEADERS, timeout=timeout)
    if response.status_code != HTTPStatus.OK:
        logger.debug("Couldn't retrieve content from %s, status code: %s", url, response.status_code)
Member

Higher logging level (info? warn?)?



@pytest.mark.unit
def test_call(mocked_requests, mocked_article_extractor):
Member

Suggested change
- def test_call(mocked_requests, mocked_article_extractor):
+ def test_fetch(mocked_requests, mocked_article_extractor):

This and the following tests could be renamed with fetch instead of call.

@pytest.mark.integration
def test_retrieve_with_valid_url_on_live_web():
    """
    Test that LinkContentRetriever can fetch content from a valid URL using the retrieve method
Member

Suggested change
-     Test that LinkContentRetriever can fetch content from a valid URL using the retrieve method
+     Test that LinkContentRetriever can fetch content from a valid URL using the run method

@dfokina
Contributor

dfokina commented Jul 11, 2023

I was thinking about the naming but got some extra questions:
Looks like the LinkContentRetriever can retrieve content, summarize it and then put it into a document? So it's part retriever, part summarizer and part converter? Like an all-in-one.
But will be located under the Retriever nodes folder and it's not considered a Retriever? Do we maybe want to extract it into a separate category then?

@vblagoje
Member Author

> I was thinking about the naming but got some extra questions: Looks like the LinkContentRetriever can retrieve content, summarize it and then put it into a document? So it's part retriever, part summarizer and part converter? Like an all-in-one. But will be located under the Retriever nodes folder and it's not considered a Retriever? Do we maybe want to extract it into a separate category then?

There is no summarization going on. It fetches content from the URL and converts it to a list of documents.

@anakin87
Member

In this PR, I see two points still open:

  • naming: it's clear that this component is not a retriever.
    Is LinkContentRetriever a good name?
    Would something like LinkFetcher be better?
    @dfokina @vblagoje

  • raising/logging the errors:
    If we don't want a warning, I would still prefer an INFO level
    @silvanocerza @vblagoje

@dfokina
Contributor

dfokina commented Jul 11, 2023

@anakin87 is it less of a retriever than WebRetriever? They both fetch web data (just different kinds) and convert it into documents, right?
I do like "fetcher", though, but not LinkFetcher (sounds like it just fetches URLs). Some options: WebContentFetcher / PageContentFetcher / WebContentExtractor

@vblagoje
Member Author

Yeah, LinkFetcher works; where we put it in the source tree is another question now that the component is not a retriever. We need input from @TuanaCelik here. She and I talked extensively about logs from the WebRetriever. Let's wait for her to come back. If she says these should be info/warning I'll accept your suggestions @silvanocerza @anakin87

@silvanocerza
Contributor

@vblagoje @anakin87 there's no need to bother Tuana in this case I think.

As the code currently stands, the log will be printed only when raise_on_failure is True; if an exception is raised we don't really need the log, I think.

@anakin87
Member

@dfokina the WebRetriever will use the LinkContentRetriever in this refactoring/new design (see #5229).
Based on my understanding, the WebRetriever is a particular type of Retriever, while the LinkContentRetriever simply

fetches content from a URL and converts it into a list of Document objects.

@dfokina
Contributor

dfokina commented Jul 11, 2023

@anakin87 okay, thanks! Then maybe one of the options here, @vblagoje , if they make sense to you:

I do like "fetcher", but not LinkFetcher (sounds like it just fetches URLs). Some other options: WebContentFetcher / PageContentFetcher / WebContentExtractor

@vblagoje
Member Author

> @vblagoje @anakin87 there's no need to bother Tuana in this case I think.
>
> As the code currently stands, the log will be printed only when raise_on_failure is True; if an exception is raised we don't really need the log, I think.

You mean when raise_on_failure is False, right? When raise_on_failure is True no exceptions are logged, they are raised to the caller. Do you agree @silvanocerza ? @anakin87 that's what we agreed, correct?

@vblagoje
Member Author

vblagoje commented Jul 11, 2023

> @anakin87 okay, thanks! Then maybe one of the options here, @vblagoje , if they make sense to you:
>
> I do like "fetcher", but not LinkFetcher (sounds like it just fetches URLs). Some other options: WebContentFetcher / PageContentFetcher / WebContentExtractor

LinkContentExtractor, LinkContentFetcher, all good. I'm thinking about where to put this class in the source tree :-)

@vblagoje
Member Author

Updated @anakin87, please have another look at the last three commits.

Member

@anakin87 anakin87 left a comment

Almost good to go...
Left two comments.

    """
    Checks the behavior when there's an exception during content extraction, and raise_on_failure is set to False.
    """
    caplog.set_level(logging.DEBUG)
Member

Suggested change
-     caplog.set_level(logging.DEBUG)
+     caplog.set_level(logging.WARNING)

Perhaps this and the following should be changed?

Comment on lines 99 to 102
    else:
        logger.warning("%s handler failed to extract content from %s", handler, url)
        text = extracted_doc.get("text", "")
        extracted_doc["content"] = text
Member

I still don't understand the meaning of this part.
Can you please explain? Why do we want to provide a default text for the document?
If really useful, it should be documented in some way.

Member Author

Valid question. See https://github.com/deepset-ai/haystack/pull/5229/files#r1261301963 for details. Whatever we agree on here as a fallback key - I'm fine

Member

Now I see...

What puzzles me about the current strategy is that you can end up having both content and text (inside meta) for the same document.

How about explicitly naming it snippet_text and adding a short comment explaining this?
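
A sketch of what that fallback could look like with an explicit snippet_text key. The build_document helper and the dictionary layout are hypothetical, and dropping the snippet from the result once content is set is an assumption made here to avoid carrying two copies of the text; the PR may handle it differently.

```python
from typing import Dict, Optional


def build_document(search_result: Dict[str, str], extracted_content: Optional[str]) -> Dict[str, str]:
    """Use the extracted page content if available, else fall back to the search snippet."""
    doc = dict(search_result)  # copy, so the caller's dict is left untouched
    # Assumption: the snippet is removed so the document does not hold both fields.
    snippet = doc.pop("snippet_text", "")
    # The snippet is only a fallback for when the handler could not extract anything.
    doc["content"] = extracted_content if extracted_content else snippet
    return doc
```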

Member

@anakin87 anakin87 left a comment

LGTM! 🎆

@anakin87
Member

anakin87 commented Jul 13, 2023

Let's also make sure this new component is added to the documentation!

(we can add it here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml)

@vblagoje
Member Author

vblagoje commented Jul 13, 2023

> Let's also make sure this new component is added to the documentation!
>
> (we can add it here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml)

Let's wait for it a bit - IMHO. See how it works standalone and within WebRetriever. Perhaps after we add pdf support and see that it could be really useful to users - we can add it to docs and promote it then. Thoughts @dfokina @julian-risch?

@julian-risch
Member

> Let's wait for it a bit - IMHO. See how it works standalone and within WebRetriever. Perhaps after we add pdf support and see that it could be really useful to users - we can add it to docs and promote it then. Thoughts @dfokina @julian-risch?

I am for adding documentation from the start. How do we expect people to start using the new feature if we don't document it? We don't need to have tutorials or example scripts and we don't need to promote it now, but a documentation page should be there. It can also mention that this is a new feature in preview mode, so changes should be expected.
The documentation is valued by users in its current shape and I'd say they rightfully expect us to have documentation for a new component like LinkContentRetriever. Even if it's limited for the beginning.

@dfokina
Contributor

dfokina commented Jul 13, 2023

I'm also for adding it to the docs (API ref & Guides), unless we specifically don't want anyone to be using it for now.
As @julian-risch said, I can add a warning about it being a feature in development so that it doesn't raise too many complaints and we can expand it at our own pace.

@vblagoje
Member Author

vblagoje commented Jul 13, 2023

Ok that's great, I agree. Will it be automatically added to the API ref? Do we need to add anything to this PR at this time?

@anakin87
Member

@vblagoje

It should be added here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml
(in the list of modules).

@vblagoje vblagoje merged commit f21005f into main Jul 13, 2023
@vblagoje vblagoje deleted the link_content_retriever branch July 13, 2023 10:54
@vblagoje vblagoje changed the title refactor: Extract link retrieval from WebRetriever, introduce LinkContentRetriever refactor: Extract link retrieval from WebRetriever, introduce LinkContentFetcher Jul 13, 2023
@vblagoje
Member Author

After a bit of back and forth debate we have settled on the LinkContentFetcher name for this component.

@anakin87
Member

Sorry for the naming mismatch!
I just now realized that what I suggested to @vblagoje is the procedure to add the new component to the API reference, not to the documentation 😃

6 participants