<a href="https://colab.research.google.com/github/baldpanda/advent-of-haystack-2023/blob/main/day_6/advent_of_haystack_day_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack - Day 6
_Make a copy of this Colab to start!_


In this challenge, you will help Elf Bilge preprocess the winter reports before indexing them into a DocumentStore for RAG applications.

Your task is to complete the code in **Section 1** using the following components:

- [`FileTypeRouter`](https://docs.haystack.deepset.ai/v2.0/docs/filetyperouter): This component will help you route files based on their corresponding MIME type to different components

- [`MarkdownToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/markdowntodocument): This component will help you convert markdown files into Haystack Documents

- [`PyPDFToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/pypdftodocument): This component will help you convert PDF files into Haystack Documents

- [`TextFileToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/textfiletodocument): This component will help you convert text files into Haystack Documents

- [`DocumentJoiner`](https://docs.haystack.deepset.ai/v2.0/docs/documentjoiner): This component will help you to join Documents coming from different branches of a pipeline

- [`DocumentCleaner`](https://docs.haystack.deepset.ai/v2.0/docs/documentcleaner) (optional): This component will help you make Documents more readable, for example by removing extra whitespace

- [`DocumentSplitter`](https://docs.haystack.deepset.ai/v2.0/docs/documentsplitter): This component will help you to split your Document into chunks

- [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformersdocumentembedder): This component will help you create embeddings for Documents.

- [`DocumentWriter`](https://docs.haystack.deepset.ai/v2.0/docs/documentwriter): This component will help you write Documents into the DocumentStore
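
To make the routing idea concrete, here is a minimal standard-library sketch of grouping files by MIME type. It is not how `FileTypeRouter` is implemented; the helper `route_by_mime_type` is a hypothetical name, and the `"unclassified"` bucket simply mirrors the output name used later in this notebook for files that match none of the configured types.

```python
import mimetypes
from collections import defaultdict


def route_by_mime_type(paths, mime_types):
    """Group file paths by guessed MIME type (hypothetical helper, not Haystack code).

    Paths whose guessed type is not in `mime_types` land in "unclassified".
    """
    routed = defaultdict(list)
    for path in paths:
        guessed, _encoding = mimetypes.guess_type(path)
        key = guessed if guessed in mime_types else "unclassified"
        routed[key].append(path)
    return dict(routed)


routed = route_by_mime_type(
    ["winter_report_one.txt", "winter_report_two.pdf", "winter_report_three.md"],
    mime_types=["text/plain", "application/pdf"],
)
print(routed)
```

With only `text/plain` and `application/pdf` configured, the Markdown report falls through to the catch-all bucket, which is why a Markdown converter can be attached to that output.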

## Installation
**Note:** Colab ships with `llmx`, which can cause a version conflict during installation. If you see an `llmx` error, you can safely ignore it or run `pip uninstall -y llmx`.

In [None]:
%%bash
pip install haystack-ai
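# Quote the requirement specifiers so the shell does not treat ">" as a redirect or "[...]" as a glob
pip install "transformers[torch,sentencepiece]==4.32.1" "sentence-transformers>=2.2.0"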
pip install markdown-it-py mdit_plain
pip install pypdf



### Enabling Telemetry

Telemetry tells us that you are running this challenge, which helps us understand whether Advent of Haystack is helping people learn about Haystack 2.0-Beta. You can opt out at any time by commenting out the following line.

In [None]:
from haystack.telemetry import tutorial_running

tutorial_running("challenge_6")

## Download All Winter Reports

All required files will be downloaded into this Colab notebook. You can see them in the "Files" tab on the left.

In [None]:
!gdown https://drive.google.com/drive/folders/1vNeCG0Vgnri9DvIr_MRURV0S8QNWs08r -O /content --folder

Retrieving folder list
Processing file 1_2qWYxIfDO-_eQLSJZq_RPwlA7MM46W0 winter_report_one.txt
Processing file 1MvI5ntTxHs1nYXRIFRMCba3uJh_ZYsOV winter_report_three.md
Processing file 1WFswkWuwzMgLs4DFEcfiLXuRy_g-TmRd winter_report_two.pdf
Retrieving folder list completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1_2qWYxIfDO-_eQLSJZq_RPwlA7MM46W0
To: /content/winter_report_one.txt
100% 2.39k/2.39k [00:00<00:00, 12.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1MvI5ntTxHs1nYXRIFRMCba3uJh_ZYsOV
To: /content/winter_report_three.md
100% 2.51k/2.51k [00:00<00:00, 10.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1WFswkWuwzMgLs4DFEcfiLXuRy_g-TmRd
To: /content/winter_report_two.pdf
100% 61.1k/61.1k [00:00<00:00, 2.40MB/s]
Download completed


## 1) Create a Pipeline to Index Documents

In [None]:
from haystack.components.writers import DocumentWriter
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter, DocumentJoiner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.pipeline import Pipeline
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
######## Initialize the necessary components with relevant parameters #############
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
document_splitter = DocumentSplitter()
document_joiner = DocumentJoiner()



####################################################################################
document_embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)
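
The splitter above is created with its default settings. As a rough illustration of what word-based chunking with overlap means, here is a plain-Python sketch. It is an assumption-laden toy, not the `DocumentSplitter` implementation: `split_by_words` is a hypothetical helper, and the parameter names `split_length` and `split_overlap` are borrowed for readability.

```python
def split_by_words(text, split_length=200, split_overlap=0):
    """Split text into chunks of at most `split_length` words (hypothetical helper).

    Consecutive chunks share `split_overlap` words.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk already reaches the end of the text
        if start + split_length >= len(words):
            break
    return chunks


chunks = split_by_words(
    "the snow fell early on the northern ridge this year",
    split_length=4,
    split_overlap=1,
)
print(chunks)
```

Smaller chunks tend to give the retriever more focused passages, while a little overlap helps avoid cutting a sentence's context in half; the right values depend on the reports being indexed.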

### Add components to the preprocessing pipeline

In [None]:
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
######## Add new components to the pipeline #############
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")




##########################################################

### Connect all components

In [None]:
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.unclassified", "markdown_converter.sources")
######## Complete this section with the rest of the connections #############
preprocessing_pipeline.connect("text_file_converter.documents", "document_joiner.documents")
preprocessing_pipeline.connect("pypdf_converter.documents", "document_joiner.documents")
preprocessing_pipeline.connect("markdown_converter.documents", "document_joiner.documents")
preprocessing_pipeline.connect("document_joiner.documents", "document_splitter.documents")
preprocessing_pipeline.connect("document_splitter.documents", "document_embedder.documents")
preprocessing_pipeline.connect("document_embedder.documents", "document_writer.documents")






#############################################################################
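
One way to sanity-check the wiring is to write the connections down as a plain dependency graph and ask for a valid execution order. The sketch below uses Python's standard `graphlib` (Python 3.9+) and the component names from this notebook; it only models the graph and makes no claim about how Haystack schedules components internally.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of components whose output it consumes.
predecessors = {
    "text_file_converter": {"file_type_router"},
    "pypdf_converter": {"file_type_router"},
    "markdown_converter": {"file_type_router"},
    "document_joiner": {"text_file_converter", "pypdf_converter", "markdown_converter"},
    "document_splitter": {"document_joiner"},
    "document_embedder": {"document_splitter"},
    "document_writer": {"document_embedder"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(predecessors).static_order())
print(order)
```

The router comes first, the three converters follow in some order, and the joiner, splitter, embedder, and writer form a single chain at the end. If a connection were missing, the affected component would float free in this ordering, which is a quick hint that a branch is not feeding the joiner.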

In [None]:
preprocessing_pipeline.draw("preprocessing_pipeline.png")

In [None]:
preprocessing_pipeline.run({
    "file_type_router": {"sources":["/content/winter_report_one.txt",
                                    "/content/winter_report_two.pdf",
                                    "/content/winter_report_three.md"]}
})

## 2) Test Your System

Run this code and you’ll be prompted to enter your OpenAI API key. If you don’t have a key, [follow these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).

In [None]:
from getpass import getpass

api_key = getpass("OpenAI API Key: ")

OpenAI API Key: ··········
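
If you rerun the notebook often, typing the key each time gets tedious. A small optional sketch, assuming the key may already be exported in an environment variable (`OPENAI_API_KEY` is the conventional name; `resolve_api_key` is a hypothetical helper), falls back to the interactive prompt only when the variable is unset:

```python
import os
from getpass import getpass


def resolve_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment if set, otherwise ask for it (hypothetical helper)."""
    key = os.environ.get(env_var)
    if key:
        return key
    # `prompt` defaults to getpass so the key is not echoed to the screen
    return prompt("OpenAI API Key: ")
```

Calling `api_key = resolve_api_key()` would then behave like the cell above whenever the variable is not set.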


In [None]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import GPTGenerator

template = """
You are a wise elf living in the forest with other elves.
You will be provided with some context from Elves' yearly winter reports.
Answer the questions from other elves based on the given context as if you are an elf as well.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", GPTGenerator(api_key=api_key))
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
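
Conceptually, the embedder and retriever steps turn the question into a vector and return the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity in plain Python. It is a toy under stated assumptions: the three-dimensional vectors and the document texts are invented for illustration, `top_k` is a hypothetical helper, and the similarity function the document store actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, documents, k=2):
    """Return the contents of the k documents most similar to the query (hypothetical helper)."""
    scored = [
        (cosine_similarity(query_embedding, embedding), content)
        for content, embedding in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [content for _score, content in scored[:k]]


# Invented toy data: (content, embedding) pairs
documents = [
    ("Store rainwater in barrels before the frost.", [0.9, 0.1, 0.0]),
    ("Collect nuts and berries in autumn.", [0.1, 0.9, 0.0]),
    ("Melt snow when streams freeze over.", [0.8, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], documents, k=2)
print(results)
```

The retrieved texts are what end up in the `{% for document in documents %}` loop of the prompt template, so the quality of the answer depends directly on which chunks rank highest here.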

In [None]:
query = "What should we do against water scarcity?"
# query = "Give me one example of a nice moment that we had in past winters"
# query = "Which foods should we collect?"

pipe.run({
    "embedder": {"text": query},
    "prompt_builder": {
        "question": query
    }
})

Batches:   0%|          | 0/1 [00:00<?, ?it/s]



{'llm': {'replies': ['During the winter months, the water sources in the forest tend to freeze, leading to water scarcity for us elves. To combat this issue, we should start by practicing water conservation techniques. We can collect and store rainwater during the warmer months in barrels or other containers. Additionally, we should limit excessive water usage and ensure that there are no leaks in our homes or other structures. We can also try to find alternative water sources, such as melting ice or snow, but precautions should be taken to ensure the water is safe for consumption. Finally, we should explore ways to preserve and protect our existing water sources, such as planting trees to maintain the water table and prevent erosion. By being mindful of our water usage and implementing these measures, we can alleviate the challenges of water scarcity during the winter.'],
  'metadata': [{'model': 'gpt-3.5-turbo-0613',
    'index': 0,
    'finish_reason': 'stop',
    'usage': {'prompt_