<a href="https://colab.research.google.com/github/baldpanda/advent-of-haystack-2023/blob/main/day_1/advent_of_haystack_day_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack: Day 1
_Make a copy of this Colab to start!_

In this first challenge, we are going to build a RAG pipeline that answers questions based on the contents of a URL. Most of the pipeline is ready, but your task is to complete the pipeline connections 👇

You should complete Step 5 of this Colab.

### Components to use:
1. `LinkContentFetcher`: Expects a list of URLs and returns a list of `ByteStream` objects.
2. `HTMLToDocument`: Expects a list of `ByteStream` objects and creates `Document` objects.
3. `DocumentSplitter`: Expects a list of `Document` objects and returns a list of split, preprocessed documents.
4. `PromptBuilder`: Defines how we want to interact with an LLM, using a Jinja template.
5. `GPTGenerator`: Expects a fully formed prompt and queries an OpenAI GPT model.
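
The hand-off between the first three components can be pictured as plain function composition. The sketch below is not Haystack code: `Doc`, `fetch`, `convert`, `split` and `run_chain` are hypothetical stand-ins that only mimic the input and output types listed above, and `fetch` returns fake HTML instead of calling the network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for Haystack's Document type."""
    content: str


def fetch(urls: List[str]) -> List[bytes]:
    # Stand-in for LinkContentFetcher: one fake HTML byte stream per URL, no network call.
    return [f"<p>Page at {url}</p>".encode("utf-8") for url in urls]


def convert(streams: List[bytes]) -> List[Doc]:
    # Stand-in for HTMLToDocument: strips the <p> tags by hand.
    return [Doc(s.decode("utf-8").replace("<p>", "").replace("</p>", "")) for s in streams]


def split(docs: List[Doc], split_length: int = 2) -> List[Doc]:
    # Stand-in for DocumentSplitter: consecutive chunks of `split_length` words.
    chunks = []
    for doc in docs:
        words = doc.content.split()
        for i in range(0, len(words), split_length):
            chunks.append(Doc(" ".join(words[i:i + split_length])))
    return chunks


def run_chain(urls: List[str]) -> List[Doc]:
    # Each function consumes exactly what the previous one returns.
    return split(convert(fetch(urls)))


chunks = run_chain(["https://example.com"])
print([c.content for c in chunks])
```

In the real pipeline, the resulting chunks are what `PromptBuilder` receives as `documents`.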

### 1) Installation

**Note:** Colab ships with `llmx`, which causes a known version conflict. If you get an `llmx` error, you can safely ignore it or run `pip uninstall -y llmx`.

In [None]:
!pip install haystack-ai
!pip install boilerpy3

Collecting haystack-ai
  Downloading haystack_ai-2.0.0b2-py3-none-any.whl (185 kB)
Collecting lazy-imports (from haystack-ai)
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting openai<1.0.0 (from haystack-ai)
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting posthog (from haystack-ai)
  Downloading posthog-3.1.0-py2.py3-none-any.whl (37 kB)
Collecting rank-bm25 (from haystack-ai)
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting monotonic>=1.5 (from posthog->haystack-ai)
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting backoff>=1.10.0 (from posthog->haystack-ai)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Installing collected packages: monotonic, rank-bm25, lazy-imports, 

### 2) Enter the API key for the LLM
The cell below reads your OpenAI API key from a Colab secret named `openai_api_key`. Add the secret in the "Secrets" panel (key icon in the left sidebar) and grant this notebook access to it. If you don’t have a key, [follow these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).

In [None]:
from google.colab import userdata
openai_api_key = userdata.get('openai_api_key')
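
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of a fallback follows, assuming you keep the key in an environment variable and otherwise want to be prompted for it. The helper name `load_openai_api_key` and the variable name `OPENAI_API_KEY` are choices made for this example, not part of the challenge.

```python
import os
from getpass import getpass


def load_openai_api_key(env_var: str = "OPENAI_API_KEY", prompt_fn=getpass) -> str:
    """Return the key from the environment, or ask for it if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed key so it does not end up in the cell output.
        key = prompt_fn("Enter your OpenAI API key: ")
    return key


# Usage (commented out so the block does not prompt when it is only imported):
# openai_api_key = load_openai_api_key()
```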

### 3) Create components

In [None]:
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import GPTGenerator

fetcher = LinkContentFetcher()
converter = HTMLToDocument()
splitter = DocumentSplitter(split_length=100, split_overlap=5)
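
To get a feel for what `split_length=100, split_overlap=5` means, here is a plain-Python sketch of word-based splitting with overlap. It assumes the splitter counts words, and `split_words` is a hypothetical helper written for illustration. Haystack's own implementation may treat edge cases differently.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into chunks of `split_length` words that share `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # each chunk starts this many words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(250))
chunks = split_words(text, split_length=100, split_overlap=5)
print(len(chunks), [len(c.split()) for c in chunks])
```

With these settings each chunk starts 95 words after the previous one, so neighbouring chunks share 5 words of context.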

In [None]:
template = """Given the information below: \n
            {% for document in documents %}
                {{ document.content }}
            {% endfor %}
            Question: {{ query }}. \n Answer:"""

prompt_builder = PromptBuilder(template=template)
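
To see the prompt text this template produces, you can render it directly with `jinja2`. This is a sketch under two assumptions: that `jinja2` is importable in your environment, and that rendering with its default settings is close enough to what `PromptBuilder` does. The two dictionaries are placeholder chunks, not real `Document` objects.

```python
from jinja2 import Template

template = """Given the information below: \n
            {% for document in documents %}
                {{ document.content }}
            {% endfor %}
            Question: {{ query }}. \n Answer:"""

# Placeholder chunks: Jinja resolves `document.content` on dicts as well as on objects.
documents = [{"content": "first chunk"}, {"content": "second chunk"}]

prompt = Template(template).render(documents=documents, query="What is in the chunks?")
print(prompt)
```

The rendered text keeps the template's indentation, and the template adds a period after the query, so a query that ends in a question mark renders as `?.`.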

In [None]:
llm = GPTGenerator(api_key=openai_api_key, model_name="gpt-4")

### 4) Add them to a Haystack 2.0 Pipeline

In [None]:
from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_component(name="fetcher", instance=fetcher)
pipeline.add_component(name="converter", instance=converter)
pipeline.add_component(name="splitter", instance=splitter)
pipeline.add_component(name="prompt_builder", instance=prompt_builder)
pipeline.add_component(name="llm", instance=llm)

### 5) Connect the components

Complete the pipeline connections to get a working pipeline that can be run.

**PSA:** If you are re-running this cell multiple times and you get a `PipelineConnectionError`, try restarting your Colab runtime.

In [None]:
# Chain the components: fetcher -> converter -> splitter -> prompt_builder -> llm
pipeline.connect("fetcher", "converter")
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "prompt_builder")
pipeline.connect("prompt_builder", "llm")

### 6) Run the Pipeline

In [None]:
query_dict = {
    "urls": ["https://haystack.deepset.ai/blog/customizing-rag-to-summarize-hacker-news-posts-with-haystack2"],
    "query": "How do you build a custom component?"
}

# Inputs are keyed by component name, then by that component's input name
result = pipeline.run(
    data={
        "fetcher": {"urls": query_dict["urls"]},
        "prompt_builder": {"query": query_dict["query"]},
    }
)
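
The `data` argument is a nested dictionary: the outer keys are the component names registered with `add_component`, and the inner keys are the inputs those components expect. If you plan to ask several questions, a small helper keeps that shape in one place. `build_run_inputs` is a hypothetical name used only for this sketch.

```python
from typing import Any, Dict, List


def build_run_inputs(urls: List[str], query: str) -> Dict[str, Dict[str, Any]]:
    """Build the `data` dict for the pipeline: component name -> input name -> value."""
    return {
        "fetcher": {"urls": urls},          # URLs go to the first component
        "prompt_builder": {"query": query},  # the question is injected at the prompt step
    }


inputs = build_run_inputs(["https://example.com"], "What is this page about?")
print(inputs)

# Usage with the pipeline built above:
# result = pipeline.run(data=build_run_inputs(query_dict["urls"], query_dict["query"]))
```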

In [None]:
print(result['llm']['replies'][0])

To build a custom component in Haystack 2.0, you need to create a class with a @component decorator on the class declaration and a run function with a decorator @component.output_types(my_output_name=my_output_type) that describes what output the pipeline should expect from this component. For example, if you are creating a custom component to fetch the latest posts from Hacker News, you need to create a class called HackernewsNewestFetcher with a run function that fetches the latest posts from Hacker News.


### 7) Draw the Pipeline 🎨
When you run this code block, it will create a new file that will appear in the "Files" section of Colab (see menu tab on the side).

In [None]:
pipeline.draw("/content/pipeline.png")