# File Type Preprocessing Index Pipeline Tutorial

- **Level**: Beginner
- **Time to complete**: 10 minutes
- **Prerequisites**: You must have [an API key from an active OpenAI account](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).

- **Goal**: After completing this tutorial, you'll have learned how to build an indexing pipeline that preprocesses files based on their file type, using the `FileTypeRouter`.

> This tutorial uses Haystack 2.0 Beta. To learn more, read the [Haystack 2.0 Beta announcement](https://haystack.deepset.ai/blog/introducing-haystack-2-beta-and-advent) or see [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/v2.0/docs).

## Components Used

- [`FileTypeRouter`](https://docs.haystack.deepset.ai/v2.0/docs/filetyperouter): This component will help you route files based on their corresponding MIME type to different components

- [`MarkdownToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/markdowntodocument): This component will help you convert Markdown files into Haystack Documents

- [`PyPDFToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/pypdftodocument): This component will help you convert PDF files into Haystack Documents

- [`TextFileToDocument`](https://docs.haystack.deepset.ai/v2.0/docs/textfiletodocument): This component will help you convert text files into Haystack Documents

- [`DocumentJoiner`](https://docs.haystack.deepset.ai/v2.0/docs/documentjoiner): This component will help you to join Documents coming from different branches of a pipeline

- [`DocumentCleaner`](https://docs.haystack.deepset.ai/v2.0/docs/documentcleaner) (optional): This component will help you make Documents more readable by removing, for example, extra whitespace

- [`DocumentSplitter`](https://docs.haystack.deepset.ai/v2.0/docs/documentsplitter): This component will help you to split your Document into chunks

- [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformersdocumentembedder): This component will help you create embeddings for Documents.

- [`DocumentWriter`](https://docs.haystack.deepset.ai/v2.0/docs/documentwriter): This component will help you write Documents into the DocumentStore

## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/log-level)

## Installing Dependencies


In [1]:
%%bash
pip install haystack-ai
pip install "transformers[torch,sentencepiece]==4.32.1" "sentence-transformers>=2.2.0"
pip install markdown-it-py mdit_plain
pip install pypdf

Collecting haystack-ai
  Downloading haystack_ai-2.0.0b5-py3-none-any.whl (233 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.5/233.5 kB 5.6 MB/s eta 0:00:00
Collecting boilerpy3 (from haystack-ai)
  Downloading boilerpy3-1.0.7-py3-none-any.whl (22 kB)
Collecting haystack-bm25 (from haystack-ai)
  Downloading haystack_bm25-1.0.2-py2.py3-none-any.whl (8.8 kB)
Collecting lazy-imports (from haystack-ai)
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting openai>=1.1.0 (from haystack-ai)
  Downloading openai-1.10.0-py3-none-any.whl (225 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 225.1/225.1 kB 9.1 MB/s eta 0:00:00
Collecting posthog (from haystack-ai)
  Downloading posthog-3.3.4-py2.py3-none-any.whl (40 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 kB 5.7 MB/s eta 0:00:00
Collecting httpx<1,>=0.23.0 (from openai>=1.1.0->haystack-ai)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75.9/75.9 kB 8.4

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
tensorflow-probability 0.22.0 requires typing-extensions<4.6.0, but you have typing-extensions 4.9.0 which is incompatible.


## Download All Files

All required files will be downloaded into this Colab notebook. You can see these files in the "Files" tab on the left. Just like most real-life data, these files are a mishmash of different types.

In [14]:
!gdown https://drive.google.com/drive/folders/1vNeCG0Vgnri9DvIr_MRURV0S8QNWs08r -O /content --folder

Retrieving folder contents
Processing file 1_2qWYxIfDO-_eQLSJZq_RPwlA7MM46W0 winter_report_one.txt
Processing file 1MvI5ntTxHs1nYXRIFRMCba3uJh_ZYsOV winter_report_three.md
Processing file 1WFswkWuwzMgLs4DFEcfiLXuRy_g-TmRd winter_report_two.pdf
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1_2qWYxIfDO-_eQLSJZq_RPwlA7MM46W0
To: /content/winter_report_one.txt
100% 2.39k/2.39k [00:00<00:00, 12.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1MvI5ntTxHs1nYXRIFRMCba3uJh_ZYsOV
To: /content/winter_report_three.md
100% 2.51k/2.51k [00:00<00:00, 12.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1WFswkWuwzMgLs4DFEcfiLXuRy_g-TmRd
To: /content/winter_report_two.pdf
100% 61.1k/61.1k [00:00<00:00, 85.9MB/s]
Download completed


## Create a Pipeline to Index Documents

Next, you'll create a pipeline to index documents. To keep things simple, you'll use an `InMemoryDocumentStore`, but this approach also works with any other flavor of `DocumentStore`.

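You'll need a different file converter class for each file type in our data sources: `.pdf`, `.txt`, and `.md` in this case. Our `FileTypeRouter` connects each file type to the proper converter. Here it is configured with only the `text/plain` and `application/pdf` MIME types, so the Markdown file leaves the router through its `unclassified` output, which is connected to the `MarkdownToDocument` converter.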
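
To make the routing idea concrete, here is a minimal standalone sketch that groups file paths by their guessed MIME type using only Python's standard library. It is not the `FileTypeRouter` implementation: the helper name `route_by_mime_type` and the example file names are hypothetical, and the grouping logic is only an assumption about the general approach.

```python
import mimetypes
from collections import defaultdict


def route_by_mime_type(sources, mime_types):
    """Group paths by guessed MIME type; anything not listed goes to 'unclassified'."""
    routed = defaultdict(list)
    for source in sources:
        guessed, _ = mimetypes.guess_type(source)
        # Only MIME types that were explicitly requested get their own group.
        key = guessed if guessed in mime_types else "unclassified"
        routed[key].append(source)
    return dict(routed)


# Hypothetical file names, one per type used in this tutorial.
routed = route_by_mime_type(
    ["report.txt", "report.pdf", "report.md"],
    mime_types=["text/plain", "application/pdf"],
)
print(routed)
```

Because the Markdown file's type is not in the requested list, it ends up in the `unclassified` group, which mirrors how the Markdown converter is wired in this tutorial.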

Once all our files have been converted to Haystack Documents, the branches of our pipeline will become one again via the `DocumentJoiner`.





In [15]:
from haystack.components.writers import DocumentWriter
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()

From there, the steps of this indexing pipeline are more standard. The `DocumentCleaner` removes extra whitespace. Then the `DocumentSplitter` breaks the documents into chunks of 150 words, with an overlap of 50 words between consecutive chunks so that context is not lost at chunk boundaries.

In [16]:
document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(
    split_by="word",
    split_length=150,
    split_overlap=50,
)
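
If you're wondering what a chunk length of 150 with an overlap of 50 means in practice, the following simplified sketch may help. It is not how `DocumentSplitter` is implemented; the function name `split_by_word` is hypothetical, and the handling of the final chunk is an assumption made for illustration.

```python
def split_by_word(text, split_length, split_overlap):
    """Split text into word chunks; consecutive chunks share `split_overlap` words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# 350 numbered placeholder "words", split with the settings used in this tutorial.
text = " ".join(f"w{i}" for i in range(350))
chunks = split_by_word(text, split_length=150, split_overlap=50)
print(len(chunks))
print(chunks[1].split()[0])
```

Each new chunk starts 100 words after the previous one, so the last 50 words of one chunk are repeated at the start of the next.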

Now you'll add a `SentenceTransformersDocumentEmbedder` to create embeddings from the documents. As the last step in this pipeline, the `DocumentWriter` will write them to the `InMemoryDocumentStore`.


In [17]:

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)


After creating all the components, add them to the indexing pipeline and connect them.

In [18]:
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.unclassified", "markdown_converter.sources")
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("markdown_converter", "document_joiner")
preprocessing_pipeline.connect("document_joiner", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")
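
The connections above form a graph that fans out after the router and merges again at the joiner. As a rough illustration of that shape only (not of how Haystack schedules components), the sketch below writes the same connections as a plain dictionary and asks Python's standard `graphlib` module for one valid processing order. The `predecessors` dictionary is a hand-written assumption that mirrors the component names used in this tutorial.

```python
from graphlib import TopologicalSorter

# Each key lists the components that feed into it (its predecessors).
predecessors = {
    "text_file_converter": {"file_type_router"},
    "pypdf_converter": {"file_type_router"},
    "markdown_converter": {"file_type_router"},
    "document_joiner": {"text_file_converter", "pypdf_converter", "markdown_converter"},
    "document_cleaner": {"document_joiner"},
    "document_splitter": {"document_cleaner"},
    "document_embedder": {"document_splitter"},
    "document_writer": {"document_embedder"},
}

# static_order() yields every component after all of its predecessors.
order = list(TopologicalSorter(predecessors).static_order())
print(order)
```

The router always comes first and the writer last; the three converters can appear in any order between them because they sit on independent branches.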

Let's test this pipeline with a few recipes I've written. Are you getting hungry yet?

In [20]:
preprocessing_pipeline.run({
    "file_type_router": {"sources":["/content/vegan_sunflower_hemp_cheese_recipe.txt",
                                    "/content/vegan_keto_eggplant_recipe.pdf",
                                    "/content/vegan_flan_recipe.md"]}
})

Converting markdown files to Documents: 100%|██████████| 1/1 [00:00<00:00, 206.46it/s]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'document_writer': {'documents_written': 7}}

## Build a Pipeline to Query Documents

At this point, you'll need an [OpenAI key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key). Add it to the "Secrets" pane of Colab by clicking the key icon in the left-hand menu.

In [21]:
from google.colab import userdata

api_key = userdata.get("OPENAI_API_KEY")

In this step you'll build a query pipeline to answer questions about the documents.

This pipeline embeds the query, searches the document store for the most relevant documents, and passes those documents along to the LLM to formulate an answer.
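
Before wiring up the real components, it can help to see the retrieve-then-prompt idea in miniature. The sketch below uses hand-made three-dimensional toy vectors instead of real embeddings, and the helper names (`cosine_similarity`, `retrieve`, `build_prompt`) are hypothetical. Ranking by cosine similarity is an assumption for illustration; the similarity function a real retriever uses depends on how the document store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, documents, top_k=2):
    """Return the top_k documents most similar to the query embedding."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents):
    """Put the retrieved contents and the question into one prompt string."""
    context = "\n".join(doc["content"] for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy documents with made-up embeddings (real ones have hundreds of dimensions).
documents = [
    {"content": "Flan needs persimmon pulp.", "embedding": [0.9, 0.1, 0.0]},
    {"content": "Hemp cheese needs sunflower seeds.", "embedding": [0.1, 0.9, 0.0]},
    {"content": "Unrelated note.", "embedding": [0.0, 0.0, 1.0]},
]
query_embedding = [0.8, 0.2, 0.0]  # stands in for the embedded question

top = retrieve(query_embedding, documents, top_k=2)
prompt = build_prompt("What goes into flan?", top)
print(prompt)
```

Only the two closest documents make it into the prompt, which is what keeps the context sent to the LLM focused on the question.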

In [22]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import GPTGenerator

template = """
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", GPTGenerator(api_key=api_key))
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")



Try it out yourself by running the code below. If all has gone well, you should have a complete shopping list from all the recipe sources. 🧂🥥🧄

In [24]:
query = "What ingredients would I need to make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese?"

pipe.run({
    "embedder": {"text": query},
    "prompt_builder": {
        "question": query
    }
})

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'llm': {'replies': ['To make vegan keto eggplant lasagna, you would need 2 large eggplants, salt, store-bought vegan mozzarella (optional), basil, almonds, nutritional yeast, olive oil, garlic powder, lemon juice, and macadamia nuts. You will also need firm or extra firm tofu, spinach, and seasonings like garlic powder and salt.\n\nTo make vegan persimmon flan, you would need persimmon pulp, cornstarch, agar agar, agave nectar, granulated sugar, coconut creme, almond milk, and vanilla.\n\nTo make vegan hemp cheese, you would need sunflower seeds, hemp hearts, miso paste, nutritional yeast, rejuvelac (or alternative like kombucha or sauerkraut liquid), and salt.'],
  'meta': [{'model': 'gpt-3.5-turbo-0613',
    'index': 0,
    'finish_reason': 'stop',
    'usage': {'completion_tokens': 159,
     'prompt_tokens': 3013,
     'total_tokens': 3172}}]}}
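
The result is a nested dictionary keyed by component name, so you'll usually want to pull out just the answer text. Here is a small sketch of that, using a hand-written stand-in for the result (`result` below is not produced by running the pipeline, and the reply is shortened with `...`), shaped like the output above.

```python
# A trimmed stand-in for the dictionary returned by pipe.run(...).
result = {
    "llm": {
        "replies": ["To make vegan keto eggplant lasagna, you would need 2 large eggplants, ..."],
        "meta": [
            {
                "model": "gpt-3.5-turbo-0613",
                "index": 0,
                "finish_reason": "stop",
                "usage": {"completion_tokens": 159, "prompt_tokens": 3013, "total_tokens": 3172},
            }
        ],
    }
}

answer = result["llm"]["replies"][0]       # first (and here only) generated reply
usage = result["llm"]["meta"][0]["usage"]  # token counts reported for that reply

print(answer)
print(f"Tokens used: {usage['total_tokens']}")
```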


## What's Next

Congratulations on building an indexing pipeline that can preprocess different file types. Go forth and ingest all the messy real-world data into your workflows. 💥

If you liked this tutorial, you may also enjoy:
- [Serializing Haystack Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)
- [Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates?utm_campaign=developer-relations&utm_source=tutorial&utm_medium=file_type_preprocessing_tutorial). Thanks for reading!