# Crawl Website Content for Question Answering with Apify

Author: Jiri Spilka ([Apify](https://apify.com/jiri.spilka))

In this tutorial, we'll use the [apify-haystack](https://github.com/apify/apify-haystack/tree/main) integration to run the [Website Content Crawler](https://apify.com/apify/website-content-crawler), which crawls and scrapes text content from the [Haystack website](https://haystack.deepset.ai). Then, we'll use the [OpenAIDocumentEmbedder](https://docs.haystack.deepset.ai/docs/openaidocumentembedder) to compute text embeddings and the [InMemoryDocumentStore](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore) to store documents in a temporary in-memory database. The last step is a retrieval-augmented generation (RAG) pipeline that answers users' questions from the scraped data.


## Install dependencies

In [None]:
!pip install -q apify-haystack

## Set up the API keys

You need to have an Apify account and obtain [APIFY_API_TOKEN](https://docs.apify.com/platform/integrations/api).

You also need an OpenAI account and an [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart).


In [2]:
import os
from getpass import getpass

os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")

Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········
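
If you rerun the notebook often, you may prefer to be prompted only when a key is not already set. The sketch below shows one way to do that; `ensure_env` is a hypothetical helper name introduced here for illustration, not part of Apify or Haystack.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return an environment variable, prompting for it only if it is unset.

    Hypothetical helper for this tutorial, not a library function.
    """
    value = os.environ.get(name)
    if not value:
        # Ask interactively and remember the answer for the rest of the session
        value = prompt(f"Enter YOUR {name}")
        os.environ[name] = value
    return value
```

Calling `ensure_env("APIFY_API_TOKEN")` and `ensure_env("OPENAI_API_KEY")` would then replace the two `getpass` lines above.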


## Use the Website Content Crawler to scrape data from the Haystack website

Now, let us call the Website Content Crawler using the Haystack component `ApifyDatasetFromActorCall`. First, we define the crawler's input parameters; then we specify which fields of its output to save into the document store.

The `actor_id` and detailed description of input parameters (variable `run_input`) can be found on the [Website Content Crawler input page](https://apify.com/apify/website-content-crawler/input-schema).

For this example, we will define `startUrls` and limit the number of crawled pages to five.

In [3]:
actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 5,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}
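
If you plan to crawl several sites, it can help to build this dictionary in one place. The following is a minimal sketch that only covers the two fields used in this tutorial (`maxCrawlPages` and `startUrls`); `build_run_input` is a hypothetical helper name, and the full list of supported fields is on the input page linked above.

```python
from typing import Any, Dict, List


def build_run_input(start_urls: List[str], max_pages: int = 5) -> Dict[str, Any]:
    """Build a crawler input dict with the two fields used in this tutorial."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return {
        "maxCrawlPages": max_pages,  # limit the number of pages to crawl
        "startUrls": [{"url": url} for url in start_urls],
    }


# Produces the same dictionary as the cell above
run_input_example = build_run_input(["https://haystack.deepset.ai/"], max_pages=5)
```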

Next, we need to define a dataset mapping function. To write it, we need to know the shape of the Website Content Crawler's output. Typically, it is a JSON array of objects that looks like this (truncated for brevity):

```
[
  {
    "url": "https://haystack.deepset.ai/overview/quick-start",
    "text": "Haystack is an open-source AI framework to build custom production-grade LLM ..."
  },
  {
    "url": "https://haystack.deepset.ai/cookbook",
    "text": "You can use these examples as guidelines on how to make use of different mod... "
  }
]
```

We will convert this JSON to a Haystack `Document` using the `dataset_mapping_function` as follows:


In [4]:
from haystack import Document

def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})
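
The mapping above trusts every dataset item to contain a `text` field. As a hedged sketch, assuming some items might arrive without text, the variant below normalizes each item into keyword arguments that could be passed to `Document(**kwargs)` and filters out empty ones. The name `to_document_kwargs` and the second sample item are hypothetical; the sketch avoids importing Haystack so it can run on its own.

```python
from typing import Any, Dict

def to_document_kwargs(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one crawler item into keyword arguments for a Haystack Document."""
    # Treat a missing or None "text" as an empty string
    text = (dataset_item.get("text") or "").strip()
    return {"content": text, "meta": {"url": dataset_item.get("url")}}

sample_items = [
    {
        "url": "https://haystack.deepset.ai/overview/quick-start",
        "text": "Haystack is an open-source AI framework to build custom production-grade LLM ...",
    },
    {"url": "https://haystack.deepset.ai/cookbook"},  # hypothetical item with no text
]

mapped = [to_document_kwargs(item) for item in sample_items]
# Keep only items that actually carry text worth embedding
non_empty = [kwargs for kwargs in mapped if kwargs["content"]]
```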

Next, we create the `ApifyDatasetFromActorCall` component:

In [5]:
from apify_haystack import ApifyDatasetFromActorCall

apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
)

Before actually running the Website Content Crawler, we need to define the document embedder and the document store:

In [6]:
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()

After that, we can call the Website Content Crawler and print the scraped data:

In [None]:
# Crawl the website and load the scraped pages as Haystack documents
# Crawling takes some time (1-2 minutes); you can monitor progress at https://console.apify.com/actors/runs

docs = apify_dataset_loader.run()

In [12]:
print(docs)

{'documents': [Document(id=3650d4d2050c97d0b20d6bb9202eb72494e2dc6ad0222a7e4a7bad038780ab31, content: 'Haystack | Haystack
Multimodal
AI
Architect a next generation AI app around all modalities, not just...', meta: {'url': 'https://haystack.deepset.ai/'}, embedding: vector of size 1536), Document(id=a441728f7b8c8f7541304f23be229372f526306c6d39f634fecf245923d2f239, content: 'What is Haystack? | Haystack
Haystack is an open-source AI orchestration framework built by deepset ...', meta: {'url': 'https://haystack.deepset.ai/overview/intro'}, embedding: vector of size 1536), Document(id=82282e7eb3115bf0e8efbaaa4de70fd68bcd1bebf25218a68973c3441ff9638f, content: 'Demos | Haystack
Check out demos built with Haystack!
AutoQuizzer
Try out our AutoQuizzer demo built...', meta: {'url': 'https://haystack.deepset.ai/overview/demo'}, embedding: vector of size 1536), Document(id=55f775825a43a52c8f51f4ba08713389a652e05eb992ed15d7c18bbe68bbe38a, content: 'Get Started | Haystack
Haystack is an open-sourc
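
The raw printout is hard to scan. As a small sketch, the hypothetical helper `summarize_documents` below produces one line per document, relying only on the `content` and `meta` attributes visible in the output above. It is demonstrated on a stand-in object so that it runs without Haystack installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

def summarize_documents(result: Dict[str, List[Any]], preview_chars: int = 60) -> List[str]:
    """Return one summary line (URL, length, short preview) per document."""
    lines = []
    for doc in result.get("documents", []):
        content = doc.content or ""
        # Collapse newlines and repeated spaces so the preview fits on one line
        preview = " ".join(content.split())[:preview_chars]
        url = (doc.meta or {}).get("url", "<no url>")
        lines.append(f"{url} ({len(content)} chars): {preview}")
    return lines

# Stand-in for the loader result; real items would be Haystack Document objects
fake_result = {
    "documents": [
        SimpleNamespace(
            content="Haystack | Haystack\nMultimodal\nAI",
            meta={"url": "https://haystack.deepset.ai/"},
        )
    ]
}
for line in summarize_documents(fake_result):
    print(line)
```

With the real crawl result, the call would be `summarize_documents(docs)`.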

Compute the embeddings and store them in the database:

In [8]:
embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])

Calculating embeddings: 1it [00:01,  1.07s/it]


5

## Retrieval and LLM generative pipeline

Once we have the crawled data in the database, we can set up a classic retrieval-augmented generation (RAG) pipeline. Refer to the [RAG Haystack tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline) for details.
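
As a rough mental model of the retrieval step, the sketch below ranks a few toy vectors against a query vector by cosine similarity and keeps the best matches. This is an illustration only, not the retriever's actual implementation: the three-dimensional vectors, the document names, and the helper `top_k` are made up, and real embeddings have far more dimensions.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: List[float], doc_embeddings: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in doc_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy data: made-up embeddings for three hypothetical pages
toy_embeddings = {
    "intro": [0.9, 0.1, 0.0],
    "demo": [0.1, 0.9, 0.0],
    "cookbook": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]
ranked = top_k(toy_query, toy_embeddings, k=2)
```

The documents selected this way are what the prompt builder later receives as context.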


In [9]:
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.dataclasses import ChatMessage

text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIChatGenerator(model="gpt-4o-mini")

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)], required_variables="*")

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")



Initializing pipeline...


<haystack.core.pipeline.pipeline.Pipeline object at 0x79d0f361ea90>
🚅 Components
  - embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
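
To get a feel for what the LLM receives, the sketch below approximates the rendered prompt with plain Python string formatting. It is a hand-written stand-in for the Jinja template, so whitespace may differ from what `ChatPromptBuilder` produces; `render_prompt` is a hypothetical helper and the context string is sample data taken from the crawl output shown earlier.

```python
from typing import List

def render_prompt(question: str, contents: List[str]) -> str:
    """Approximate the tutorial's template: context documents, then the question."""
    context = "\n".join(f"    {content}" for content in contents)
    return (
        "Given the following information, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

toy_prompt = render_prompt(
    "What is haystack?",
    ["Haystack is an open-source AI orchestration framework built by deepset ..."],
)
print(toy_prompt)
```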

Now, you can ask questions about Haystack and get answers grounded in the crawled pages:

In [11]:
question = "What is haystack?"

response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0].text}")

question: What is haystack?
answer: Haystack is an open-source AI orchestration framework developed by deepset that enables Python developers to create real-world applications using large language models (LLMs). It provides tools for building various types of applications, including autonomous agents, multi-modal apps, and scalable retrieval-augmented generation (RAG) systems. Haystack's modular architecture allows users to customize components, experiment with state-of-the-art methods, and manage their technology stack effectively. It caters to developers at all levels, from prototyping to full-scale deployment, and is supported by a community that values open-source collaboration. Haystack can be utilized directly in Python or through a visual interface called deepset Studio.
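
To ask several questions, the `pipe.run` call and the reply extraction can be wrapped in a small function. The sketch below assumes the component names used in this tutorial (`embedder`, `prompt_builder`, `llm`); `ask` is a hypothetical helper, and `FakePipeline` is a stub that only mimics the response shape so the example runs without API keys.

```python
from types import SimpleNamespace
from typing import Any

def ask(pipeline: Any, question: str) -> str:
    """Send one question through the pipeline and return the first reply's text."""
    response = pipeline.run(
        {"embedder": {"text": question}, "prompt_builder": {"question": question}}
    )
    return response["llm"]["replies"][0].text

class FakePipeline:
    """Stub that mimics the response structure of the pipeline above."""

    def run(self, data: dict) -> dict:
        question = data["prompt_builder"]["question"]
        return {"llm": {"replies": [SimpleNamespace(text=f"(stub answer to: {question})")]}}

print(ask(FakePipeline(), "What is haystack?"))
```

With the real pipeline, the call would be `ask(pipe, "What is haystack?")`.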
