refactor: Extract link retrieval from WebRetriever, introduce LinkContentFetcher #5227

vblagoje · 2023-06-29T09:36:53Z

What?

Introduces a new component, LinkContentRetriever. This component already implicitly exists within WebRetriever but has never been explicitly exposed until now.

Why?

There is a growing need to separate LinkContentRetriever from the existing WebRetriever code base and expose it as an independent component while still allowing WebRetriever to reuse it. The standalone LinkContentRetriever will be particularly useful in pipeline and agent scenarios where content needs to be fetched from a specific URL, for example: "Read https://pythonspeed.com/articles/base-image-python-docker-images/ and extract the main conclusion from the blog post, be brief." It will also benefit Haystack V2, as it will enable the creation of more meaningful and engaging demos.

How can it be used?

LinkContentRetriever can be used as a standalone component for link retrieval. We currently support HTML-to-text conversion, but the design allows a clear extension point for other content types (e.g. pdf). A potential use-case could be, "Read https://arxiv.org/pdf/2305.06983.pdf and provide a summary and main learnings using the Feynman technique. The summary should be at most a few paragraphs long."
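To make the extension point a bit more concrete, here is a rough, standard-library-only sketch of a content-type-to-handler lookup. It is not the code in this PR: HANDLERS, extract and html_to_text are hypothetical names chosen for illustration, and the HTML handler is a deliberately naive tag stripper.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(body: str) -> str:
    """Naive HTML-to-text conversion (illustrative only)."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical registry: content type -> handler.
# Supporting another type (e.g. pdf) would mean registering one more entry.
HANDLERS: Dict[str, Callable[[str], str]] = {"text/html": html_to_text}


def extract(content_type: str, body: str) -> Optional[str]:
    """Pick a handler by content type; return None if the type is unsupported."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    key = content_type.split(";")[0].strip().lower()
    handler = HANDLERS.get(key)
    return handler(body) if handler else None
```

With this shape, an unsupported content type simply yields None, and the caller decides whether that is an error.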

Here is a simple example of how it may be used. The upcoming V2 will expand these possibilities.

import os
from haystack.nodes import PromptNode, LinkContentRetriever, PromptTemplate
from haystack import Pipeline

openai_key = os.environ.get("OPENAI_API_KEY")
if not openai_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable")

retriever = LinkContentRetriever()
pt = PromptTemplate(
    "Given the paragraphs of the blog post, "
    "provide the main learnings and the final conclusion using short bullet points format."
    "\n\nParagraphs: {documents}"
)

prompt_node = PromptNode(
    "gpt-3.5-turbo-16k-0613",
    api_key=openai_key,
    max_length=512,
    default_prompt_template=pt,
    model_kwargs={"stream": True},
)

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

blog_posts = [
    "https://pythonspeed.com/articles/base-image-python-docker-images/",
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
]

for blog_post in blog_posts:
    print(f"Blog post summary: {blog_post}")
    pipeline.run(blog_post)
    print("\n\n\n")

The results are:

Blog post summary: https://pythonspeed.com/articles/base-image-python-docker-images/
Main Learnings:
- When building a Docker image for a Python application, the choice of base image is important.
- Criteria for choosing a base image include stability, security updates, up-to-date dependencies, extensive dependencies, up-to-date Python, and small image size.
- Alpine Linux is not recommended for use as a base image due to potential issues.
- The three major operating systems that meet the criteria are Debian Stable, Ubuntu LTS, and RedHat Enterprise Linux and clones.

Final Conclusion:
- When choosing a base image for a Python application, it is recommended to use Ubuntu LTS, RedHat Universal Base Image, or Debian Stable.
- Avoid using Alpine Linux as a base image.
- Consider the specific criteria and needs of your application when making the decision.



Blog post summary: https://lilianweng.github.io/posts/2023-06-23-agent/
Main Learnings:

- LLM (large language model) can be used as the core controller for building autonomous agents.
- LLM-powered autonomous agents consist of LLM as the brain, complemented by components such as planning, memory, and tool use.
- Planning involves task decomposition and can be guided by techniques like Chain of Thought and Tree of Thoughts.
- Self-reflection allows agents to improve iteratively by refining past action decisions and correcting mistakes.
- Memory includes sensory memory, short-term memory, and long-term memory. External vector stores and fast retrieval can enhance memory capabilities.
- Tool use extends the capabilities of LLM-powered agents by leveraging external APIs and expert modules.
- There are various algorithms, such as LSH, ANNOY, HNSW, FAISS, and ScaNN, that enable fast maximum inner-product search (MIPS) in memory retrieval.
- Autonomy in scientific discovery and generative agents simulations showcases the potential of LLM-powered agents in complex tasks.
- Proof-of-concept examples like AutoGPT and GPT-Engineer demonstrate the use of LLM in autonomous agents but also highlight challenges such as reliability of natural language interfaces and the need for long-term planning.

Final Conclusion:

LLM-powered autonomous agents have the potential to tackle complex tasks and improve iteratively through self-reflection. These agents can benefit from task decomposition, memory retrieval, and tool use. However, challenges such as finite context length, reliability of natural language interfaces, and long-term planning need to be addressed to fully harness the capabilities of LLM-powered agents.

How did you test it?

Added comprehensive unit tests to ensure the functionality and reliability of LinkContentRetriever.
Added the example above as examples/link_content_blog_post_summary.py

Notes for the reviewer:

Please review the new LinkContentRetriever component and its corresponding unit tests. Consider its potential use cases and extension possibilities for other content types. Are those extensions well thought out? I would greatly appreciate your insights and suggestions.

coveralls · 2023-06-29T09:52:05Z

Pull Request Test Coverage Report for Build 5541969811

0 of 0 changed or added relevant lines in 0 files are covered.
827 unchanged lines in 21 files lost coverage.
Overall coverage increased (+1.4%) to 45.45%

Files with Coverage Reduction	New Missed Lines	%
document_stores/elasticsearch/__init__.py	2	71.43%
nodes/prompt/invocation_layer/azure_chatgpt.py	2	89.47%
nodes/prompt/invocation_layer/azure_open_ai.py	2	89.47%
preview/pipeline.py	3	91.18%
nodes/prompt/invocation_layer/hugging_face.py	5	91.73%
nodes/prompt/invocation_layer/chatgpt.py	7	75.0%
nodes/prompt/prompt_model.py	7	86.0%
nodes/retriever/_openai_encoder.py	7	81.32%
document_stores/elasticsearch/es7.py	8	80.43%
nodes/prompt/invocation_layer/open_ai.py	11	80.85%

Totals
Change from base Build 5410444095:	1.4%
Covered Lines:	10523
Relevant Lines:	23153

💛 - Coveralls

haystack/nodes/retriever/link_content.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

anakin87

Hey @vblagoje!
I really like the purpose of this node and the examples you provided. It unlocks several nice features for demos. ✨

I feel that something can be improved in node design and error handling. Let's see what you think...

anakin87 · 2023-07-06T08:33:19Z

haystack/nodes/retriever/link_content.py

+            return response
+        except requests.RequestException as e:
+            logger.debug("Error retrieving URL %s: %s", url, e)
+            return None


I have doubts about exception handling (similar to those of html_content_handler).

vblagoje · 2023-07-10T05:43:03Z

@anakin87 I thought about raising/not raising exceptions and the use cases we are likely to face in the LinkContentRetriever as a standalone component in the pipeline and also as a building block of WebRetriever and perhaps some other components in the near future. I've concluded that it is best to allow flexibility for raising and not raising exceptions. I introduced the raise_on_failure boolean flag as an (init) parameter. This parameter will give us increased robustness/fault tolerance but also flexibility. Here is what I mean more concretely.

Robustness: When running a pipeline that uses LinkContentRetriever directly and processes several URLs, robustness is crucial. The pipeline should be able to run to completion even if one or a few URLs fail to be processed; the other fetched URLs may, for example, already provide enough document context. If raise_on_failure is set to True, the pipeline will blow up as soon as a single exception is encountered. This could lead to incomplete results, with many URLs potentially left unprocessed. With raise_on_failure set to False, the pipeline continues to process the remaining URLs even if one URL fails, which increases the pipeline's robustness.

Robustness here goes nicely with the adjacent added value of fault tolerance: raise_on_failure can introduce fault tolerance into your pipeline. When set to False, the pipeline can tolerate faults (in this case, failures to retrieve content from URLs) and continue execution.

And finally, as I mentioned at the beginning - flexibility. Having the raise_on_failure flag allows users to choose how they want their pipeline to behave. Some components using LinkContentRetriever might want normal exception raising so that they can implement their own exception-handling logic, be it for content fetching or for content parsing. This flexibility can benefit a wide range of use cases in the future.

Let me know if you agree with this approach, and I'll strengthen the test suite to reflect both use cases.
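A minimal sketch of the two behaviours described above, assuming a flag that works the way raise_on_failure is described here. fetch_all and fake_fetch are hypothetical helpers (not part of Haystack); the fake fetcher stands in for a real HTTP call so the sketch runs offline.

```python
from typing import Callable, Dict, Iterable


def fetch_all(
    urls: Iterable[str],
    fetch_one: Callable[[str], str],
    raise_on_failure: bool = False,
) -> Dict[str, str]:
    """Fetch every URL; on error either re-raise or skip and keep going."""
    results: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = fetch_one(url)
        except Exception:
            if raise_on_failure:
                raise  # strict mode: the first failure stops the whole run
            continue  # tolerant mode: remaining URLs are still processed
    return results


def fake_fetch(url: str) -> str:
    # Stand-in for a real request; pretends some sites block non-human traffic.
    if "blocked" in url:
        raise ConnectionError("403 Forbidden")
    return f"content of {url}"


urls = ["https://a.example", "https://blocked.example", "https://b.example"]
tolerant = fetch_all(urls, fake_fetch)  # keeps the two URLs that worked
```

Calling fetch_all(urls, fake_fetch, raise_on_failure=True) instead raises ConnectionError at the second URL, so the third one is never attempted.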

vblagoje · 2023-07-10T08:40:54Z

Ok then @anakin87 - see the last few commits that I think address your concerns. Don't be hesitant to raise more comments now, let's get this one as perfect as possible!

anakin87

@vblagoje, I appreciate the improvements and really like your work in this node!
I added some comments but I feel we are close to the finish line...

I still have a doubt about the nomenclature:
this node is not a real Retriever. Does it make sense to call it LinkContentRetriever? It can be a bit misleading, IMHO.
I'd also like to hear the opinion of @dfokina. --> As indicated in the docstrings,

LinkContentRetriever fetches content from a URL and converts it into a list of Document objects.

anakin87 · 2023-07-11T08:26:54Z

haystack/nodes/retriever/link_content.py

+    except Exception as e:
+        if raise_on_failure:
+            raise e
+        logger.debug("Couldn't extract content from %s", response.url)


I like the new approach with raise_on_failure, but I would also like to have a higher logging level (info? warn?).

I thought about this extensively. It will create panic for users whose default logging level is set to warning. Ask Tuana for stories. I highly respect your feedback here, and I would have commented the same, but I am convinced these should be on debug because a) they fail a lot, and I mean a lot, and b) expert users will know how to set the logger to understand and see failures if needed. Users who set raise_on_failure to True will get exceptions as they wished. The best-balanced solution IMHO. Let's ask @silvanocerza and @ZanSara for opinions.

A bit of background: I selected debug level instead of warning because when we retrieve many links in WebRetriever, quite a few fail. They fail because websites increasingly block non-human traffic. Some parsing also fails because we use a license-friendly HTML parser, which could be better, but it works. Having these failed links and parsing errors logged at warn looks scary to users. Tuana told me she's seeing many users complaining about these warning logs. That's why. Does it make more sense with this background info?

I'd prefer it to be a warning or info to be fair. If users don't want to see logging they can disable it. 🤷

Also, the logic is wrong: as it stands, the log will never be printed.

If raise_on_failure is False the extractor won't even raise any exception.
This would both log and reraise the exception, but at that point might as well just let the extractor raise and remove the try.

logger.debug("Couldn't extract content from %s", response.url)
raise e
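One way to avoid both problems (a log line that is never reached, and logging an exception the caller receives anyway) is to let the extractor always raise and make a single wrapper own the flag. The sketch below only illustrates that idea; it is not the code that was merged. extract_content and its extractor argument are hypothetical names, and the warning level is one of the options debated in this thread.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("link_content_sketch")


def extract_content(
    extractor: Callable[[str], str],
    response_url: str,
    raise_on_failure: bool = False,
) -> Optional[str]:
    """Run the extractor; raise or log exactly once, depending on the flag."""
    try:
        return extractor(response_url)
    except Exception as exc:
        if raise_on_failure:
            # The caller receives the exception, so no log line is needed.
            raise
        # Only reached when failures are tolerated.
        logger.warning("Couldn't extract content from %s: %s", response_url, exc)
        return None
```

The key assumption is that the extractor itself does not swallow exceptions; otherwise the except branch would be dead code again.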

anakin87 · 2023-07-11T08:31:50Z

haystack/nodes/retriever/link_content.py

+                    extracted_doc["content"] = extracted_content
+                    logger.debug("%s handler extracted content from %s", handler, url)
+                else:
+                    text = extracted_doc.get("text", "")  # we might have snippet under text, use it as fallback


I can't understand properly what happens in this else branch.
Please explain...

anakin87 · 2023-07-11T08:35:03Z

haystack/nodes/retriever/link_content.py

+        try:
+            response = requests.get(url, headers=LinkContentRetriever.REQUEST_HEADERS, timeout=timeout)
+            if response.status_code != HTTPStatus.OK:
+                logger.debug("Couldn't retrieve content from %s, status code: %s", url, response.status_code)


Higher logging level (info? warn?)?

anakin87 · 2023-07-11T08:38:16Z

test/nodes/test_link_content_retriever.py

+
+
+@pytest.mark.unit
+def test_call(mocked_requests, mocked_article_extractor):


Suggested change

def test_call(mocked_requests, mocked_article_extractor):

def test_fetch(mocked_requests, mocked_article_extractor):

This and the following tests could be renamed with fetch instead of call.

anakin87 · 2023-07-11T08:42:24Z

test/nodes/test_link_content_retriever.py

+@pytest.mark.integration
+def test_retrieve_with_valid_url_on_live_web():
+    """
+    Test that LinkContentRetriever can fetch content from a valid URL using the retrieve method


Suggested change

Test that LinkContentRetriever can fetch content from a valid URL using the retrieve method

Test that LinkContentRetriever can fetch content from a valid URL using the run method

dfokina · 2023-07-11T10:39:43Z

I was thinking about the naming but got some extra questions:
Looks like the LinkContentRetriever can retrieve content, summarize it and then put it into a document? So it's part retriever, part summarizer and part converter? Like an all-in-one.
But will be located under the Retriever nodes folder and it's not considered a Retriever? Do we maybe want to extract it into a separate category then?

vblagoje · 2023-07-11T10:43:43Z

I was thinking about the naming but got some extra questions: Looks like the LinkContentRetriever can retrieve content, summarize it and then put it into a document? So it's part retriever, part summarizer and part converter? Like an all-in-one. But will be located under the Retriever nodes folder and it's not considered a Retriever? Do we maybe want to extract it into a separate category then?

There is no summarization going on. It fetches content from the URL and converts it to a list of documents.

anakin87 · 2023-07-11T14:43:01Z

In this PR, I see two points still open:

naming: it's clear that this component is not a retriever.
Is LinkContentRetriever a good name?
Is it something like LinkFetcher better?
@dfokina @vblagoje
raising/logging the errors:
If we don't want a warning, I would still prefer an INFO level
@silvanocerza @vblagoje

dfokina · 2023-07-11T15:28:06Z

@anakin87 is it less of a retriever than WebRetriever? They both fetch web data (just different kinds) and convert it into documents, right?
I do like "fetcher", though, but not LinkFetcher (sounds like it just fetches URLs). Some options: WebContentFetcher / PageContentFetcher / WebContentExtractor

vblagoje · 2023-07-11T15:30:11Z

Yeah, LinkFetcher works; where do we put it in the source tree is another question now that the component is not the retriever. We need input from @TuanaCelik here. She and I talked extensively about logs from the WebRetriever. Let's wait for her to come back. If she says these should be info/warning I'll accept your suggestions @silvanocerza @anakin87

silvanocerza · 2023-07-11T15:33:54Z

@vblagoje @anakin87 there's no need to bother Tuana in this case I think.

As the code currently stands the log will be printed only when raise_on_failure is True, if an exception is raised we don't really need the log I think.

anakin87 · 2023-07-11T15:36:42Z

@dfokina the WebRetriever will use the LinkContentRetriever in this refactoring/new design (see #5229).
Based on my understanding, the WebRetriever is a particular type of Retriever, while the LinkContentRetriever simply

fetches content from a URL and converts it into a list of Document objects.

dfokina · 2023-07-11T15:45:35Z

@anakin87 okay, thanks! Then maybe one of the options here, @vblagoje , if they make sense to you:

I do like "fetcher", but not LinkFetcher (sounds like it just fetches URLs). Some other options: WebContentFetcher / PageContentFetcher / WebContentExtractor

vblagoje · 2023-07-11T16:08:07Z

@vblagoje @anakin87 there's no need to bother Tuana in this case I think.

As the code currently stands the log will be printed only when raise_on_failure is True, if an exception is raised we don't really need the log I think.

You mean when raise_on_failure is False, right? When raise_on_failure is True no exceptions are logged, they are raised to the caller. Do you agree @silvanocerza ? @anakin87 that's what we agreed, correct?

vblagoje · 2023-07-11T16:11:14Z

@anakin87 okay, thanks! Then maybe one of the options here, @vblagoje , if they make sense to you:

I do like "fetcher", but not LinkFetcher (sounds like it just fetches URLs). Some other options: WebContentFetcher / PageContentFetcher / WebContentExtractor

LinkContentExtractor, LinkContentFetcher, all good. I'm thinking about where to put this class in the source tree :-)

vblagoje · 2023-07-12T14:13:29Z

Updated @anakin87, please have another look at the last three commits.

anakin87

Almost good to go...
Left two comments.

anakin87 · 2023-07-12T14:30:02Z

test/nodes/test_link_content_fetcher.py

+    """
+    Checks the behavior when there's an exception during content extraction, and raise_on_failure is set to False.
+    """
+    caplog.set_level(logging.DEBUG)


Suggested change

caplog.set_level(logging.DEBUG)

caplog.set_level(logging.WARNING)

Perhaps this and the following should be changed?
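Why the capture level matters can be sketched without pytest: if the threshold is DEBUG, a test that inspects captured messages passes regardless of the level the code logs at, while a WARNING threshold only lets the message through if it really is emitted at warning or above. ListHandler and captured_at below are hypothetical stand-ins that only roughly mimic what caplog does.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Keeps messages in memory, roughly like pytest's caplog fixture."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def captured_at(level: int) -> List[str]:
    """Return the messages that get through when the logger threshold is `level`."""
    log = logging.getLogger("caplog_sketch")
    log.propagate = False  # keep the sketch's output away from the root logger
    handler = ListHandler()
    log.addHandler(handler)
    log.setLevel(level)
    try:
        log.debug("debug-level failure note")
        log.warning("warning-level failure note")
    finally:
        log.removeHandler(handler)
    return handler.messages
```

At logging.DEBUG both messages are captured; at logging.WARNING only the warning-level one is, which is what makes the stricter test level meaningful.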

anakin87 · 2023-07-12T14:33:58Z

haystack/nodes/retriever/link_content.py

+                else:
+                    logger.warning("%s handler failed to extract content from %s", handler, url)
+                    text = extracted_doc.get("text", "")
+                    extracted_doc["content"] = text


I still don't understand the meaning of this part.
Can you please explain? Why do we want to provide a default text for the document?
If really useful, it should be documented in some way.

Valid question. See https://github.com/deepset-ai/haystack/pull/5229/files#r1261301963 for details. Whatever we agree on here as a fallback key - I'm fine

Now I see...

What puzzles me about the current strategy is that you can end up having both content and text (inside meta) for the same document.

How about explicitly naming it snippet_text and adding a short comment explaining this?
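A possible shape for that fallback, sketched with plain dicts instead of Document objects. build_documents is a hypothetical helper, and popping snippet_text is just one way to make sure a document never carries both the extracted content and the snippet; returning an empty list when there is nothing to use is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional


def build_documents(
    url: str,
    extracted_content: Optional[str],
    doc_kwargs: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Prefer extracted page content, fall back to a search snippet, else return []."""
    doc = dict(doc_kwargs or {})  # copy so the caller's dict is not mutated
    doc["url"] = url
    # snippet_text is the short text a search engine returned for this link.
    # It is removed from the dict so it is only ever used as a fallback.
    snippet = doc.pop("snippet_text", "")
    content = extracted_content or snippet
    if not content:
        return []  # nothing extracted and no snippet: no document at all
    doc["content"] = content
    return [doc]
```

For example, build_documents("https://x.example", None, {"snippet_text": "short preview"}) yields a single document whose content is the snippet, while a successful extraction ignores the snippet entirely.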

anakin87

LGTM! 🎆

anakin87 · 2023-07-13T07:34:56Z

Let's also make sure this new component is added to the documentation!

(we can add it here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml)

vblagoje · 2023-07-13T08:22:41Z

Let's also make sure this new component is added to the documentation!

(we can add it here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml)

Let's wait for it a bit - IMHO. See how it works standalone and within WebRetriever. Perhaps after we add pdf support and see that it could be really useful to users - we can add it to docs and promote it then. Thoughts @dfokina @julian-risch?

julian-risch · 2023-07-13T09:27:26Z

Let's wait for it a bit - IMHO. See how it works standalone and within WebRetriever. Perhaps after we add pdf support and see that it could be really useful to users - we can add it to docs and promote it then. Thoughts @dfokina @julian-risch?

I am for adding documentation from the start. How do we expect people to start using the new feature if we don't document it? We don't need to have tutorials or example scripts and we don't need to promote it now but a documentation page should be there. It can also mention something like that this is a new feature in a preview mode, expect changes etc.
The documentation is valued by users in its current shape and I'd say they rightfully expect us to have documentation for a new component like LinkContentRetriever. Even if it's limited for the beginning.

dfokina · 2023-07-13T09:38:46Z

I'm also for adding it to the docs (API ref & Guides), unless we specifically don't want anyone to be using it for now.
As @julian-risch said, I can add a warning about it being a feature in development so that it doesn't raise too many complaints and we can expand it at our own pace.

vblagoje · 2023-07-13T09:46:38Z

Ok that's great, I agree. Will it be automatically added to the API ref? Do we need to add anything to this PR at this time?

anakin87 · 2023-07-13T09:48:48Z

@vblagoje

It should be added here: https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml
(in the list of modules).

vblagoje · 2023-07-13T10:55:41Z

After a bit of back and forth debate we have settled on the LinkContentFetcher name for this component.

anakin87 · 2023-07-13T11:04:23Z

Sorry for the naming mismatch!
I just now realized that what I suggested to @vblagoje is the procedure to add the new component to the API reference, not to the documentation 😃

vblagoje added 2 commits June 29, 2023 10:45

Extract link retrieval from WebRetriever, introduce LinkContentRetriever

08dc67b

Add example

483afc8

vblagoje requested a review from a team as a code owner June 29, 2023 09:36

vblagoje requested review from masci and removed request for a team June 29, 2023 09:36

github-actions bot added topic:retriever topic:tests type:documentation Improvements on the docs labels Jun 29, 2023

vblagoje mentioned this pull request Jun 29, 2023

refactor: Update WebRetriever to use LinkContentFetcher #5229

Merged

vblagoje added 2 commits June 29, 2023 12:03

Fix pylint

6c661a1

Fine details, improve coverage

34b7552

dfokina reviewed Jul 4, 2023

vblagoje requested review from anakin87 and removed request for masci July 4, 2023 16:06

Apply suggestions from code review

d4b0540

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

anakin87 reviewed Jul 6, 2023

vblagoje added 3 commits July 7, 2023 17:47

Make LinkContentRetriever inherit BaseComponent, adjust methods and u…

cdf57b8

…nit tests

PR feedback

8eb6e97

Add raise_on_failure, PR feedback

4f8be06

anakin87 self-assigned this Jul 10, 2023

Add more unit tests for raise_on_failure, add/improve test comments

744a7da

anakin87 reviewed Jul 11, 2023

vblagoje added 3 commits July 11, 2023 14:08

PR feedback

80464cb

Fix mypy typing

4c5dcc7

Pydoc fixes

27984dd

vblagoje added 3 commits July 11, 2023 20:20

Raise exception on http errors

f004f0b

Rename to LinkContentFetcher

4b7e926

Change debug to warning statements

d0e5f4c

anakin87 reviewed Jul 12, 2023

vblagoje added 2 commits July 12, 2023 16:40

Set the right logging level in tests

8126b26

Clarify use of snippet_text, return [] if no content extracted

5b5140b

anakin87 approved these changes Jul 12, 2023

Add to API docs

1dd5b25

vblagoje merged commit f21005f into main Jul 13, 2023

vblagoje deleted the link_content_retriever branch July 13, 2023 10:54

vblagoje changed the title ~~refactor: Extract link retrieval from WebRetriever, introduce LinkContentRetriever~~ refactor: Extract link retrieval from WebRetriever, introduce LinkContentFetcher Jul 13, 2023

dfokina mentioned this pull request Jul 13, 2023

[Docs] Add LinkContentFetcher docs #5353

Closed


