In [1]:
!pip install python-docx trafilatura markdown-it-py mdit_plain pypdf python-pptx openpyxl nltk



# SuperComponents

SuperComponents generally behave like any other component: they take init parameters and, as usual, can provide `to_dict()` and `from_dict()` methods. The init parameters typically determine how the internal pipeline is constructed (e.g. which components are used).
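To make the idea concrete, here is a minimal plain-Python sketch of the pattern described above. It is not Haystack's implementation: `ToySplitterSuperComponent`, its `pipeline` attribute, and the naive sentence splitting are all hypothetical stand-ins, chosen only to show init parameters shaping an internal pipeline and surviving a `to_dict()` / `from_dict()` round trip.

```python
from typing import Any, Dict, List


class ToySplitterSuperComponent:
    """Hypothetical stand-in for a SuperComponent; not part of Haystack."""

    def __init__(self, split_by: str = "word", split_length: int = 250) -> None:
        self.split_by = split_by
        self.split_length = split_length
        # The init parameters decide how the internal "pipeline" is built.
        # Here the pipeline is just an ordered list of step names.
        self.pipeline: List[str] = ["convert", f"split_by_{split_by}"]

    def run(self, text: str) -> Dict[str, List[str]]:
        # Naive splitting, for illustration only.
        if self.split_by == "sentence":
            units = [s.strip() for s in text.split(".") if s.strip()]
        else:
            units = text.split()
        chunks = [
            " ".join(units[i : i + self.split_length])
            for i in range(0, len(units), self.split_length)
        ]
        return {"documents": chunks}

    def to_dict(self) -> Dict[str, Any]:
        # Same overall shape as the serialized components in this notebook:
        # a type name plus the init parameters.
        return {
            "type": "ToySplitterSuperComponent",
            "init_parameters": {
                "split_by": self.split_by,
                "split_length": self.split_length,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToySplitterSuperComponent":
        # Rebuilding from the init parameters also rebuilds the pipeline.
        return cls(**data["init_parameters"])


component = ToySplitterSuperComponent(split_by="sentence", split_length=1)
restored = ToySplitterSuperComponent.from_dict(component.to_dict())
result = restored.run("First sentence. Second sentence.")
print(restored.pipeline)
print(result["documents"])
```

Because only the init parameters are serialized, the restored object rebuilds the same internal pipeline rather than storing it.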

In [2]:
from haystack_experimental.super_components.converters import AutoFileConverter

In [18]:
file_converter = AutoFileConverter(
    split_by="sentence",
    split_overlap=0,
    split_length=1,
    respect_sentence_boundary=False
)

In [19]:
example_files = [
    "example_files/react_paper.pdf",
    "example_files/sample_docx.docx",
    "example_files/sample_pptx.pptx",
    "example_files/sample.md",
    "example_files/sample_1.csv",
]

result = file_converter.run(sources=example_files)

No abbreviations file found for en. Using default abbreviations.
Converting markdown files to Documents: 100%|██████████| 1/1 [00:00<00:00, 357.78it/s]


In [22]:
result

{'documents': [Document(id=d48c3ffbe47725828883e1f86681fd2e8ccf87f383a46414243f6ca0f28e4d29, content: 'Sample Docx File
  
  The US has "passed the peak" on new coronavirus cases, President Donald Trump said...', meta: {'file_path': 'sample_docx.docx', 'docx': DOCXMetadata(author='Saha, Anirban', category='', comments='', content_status='', created='2020-07-14T08:14:00+00:00', identifier='', keywords='', language='', last_modified_by='Saha, Anirban', last_printed=None, modified='2020-07-14T08:16:00+00:00', revision=1, subject='', title='', version=''), 'source_id': '841f2916f4d4fe3612dac9490fc3d4ceb78ba76a2f78627413e0f5bcded1a206', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
  Document(id=8845820728d5e1eac08c6ea4ffc806ed15d4af9ebd962a9388a5e25d7bddce6d, content: 'The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country...', meta: {'file_path': 'sample_docx.docx', 'docx': DOCXMetadata(author='Saha, Anirban', category='', comments=

In [21]:
previous_name = None
# Print only the first split produced for each source file.
for document in result["documents"]:
    if document.meta["file_path"] != previous_name:
        doc = f"""
First lines from {document.meta["file_path"]}

{document.content}

-----------

"""
        print(doc)
        previous_name = document.meta["file_path"]


------------

First lines from sample_docx.docx

Sample Docx File

The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.


-----------



------------

First lines from sample.md

type: intro
date: 1.1.2023

pip install farm-haystack


What to build with Haystack

Ask questions in natural language and find granular answers in your own documents.


-----------



------------

First lines from react_paper.pdf

REAC T: S YNERGIZING REASONING AND ACTING IN
LANGUAGE MODELS
Shunyu Yao∗*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2
1Department of Computer Science, Princeton University
2Google Research, Brain team
1{shunyuy,karthikn}@princeton.edu
2{jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com
ABSTRACT
While large language models (LLMs) have demonstrated impressive performance
across tasks in language understanding and interactive decision making, their
abilities 

In [4]:
file_converter.to_dict()

{'type': 'haystack_experimental.super_components.converters.file_converter.AutoFileConverter',
 'init_parameters': {'split_by': 'word',
  'split_length': 250,
  'split_overlap': 30,
  'split_threshold': 0,
  'splitting_function': None,
  'respect_sentence_boundary': True,
  'language': 'en',
  'use_split_rules': True,
  'extend_abbreviations': True,
  'encoding': 'utf-8',
  'json_content_key': 'content'}}

## Expanding SuperComponents
What makes SuperComponents special is that they can be expanded by calling their `to_super_component_dict()` method. This serializes the component as a generic `SuperComponent` that contains the pipeline the original SuperComponent constructed. From there, the pipeline can be changed in any way.

In [5]:
file_converter.to_super_component_dict()

{'type': 'haystack_experimental.core.super_component.super_component.SuperComponent',
 'init_parameters': {'pipeline': {'metadata': {},
   'max_runs_per_component': 100,
   'components': {'router': {'type': 'haystack.components.routers.file_type_router.FileTypeRouter',
     'init_parameters': {'mime_types': [<ConverterMimeType.CSV: 'text/csv'>,
       <ConverterMimeType.DOCX: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'>,
       <ConverterMimeType.HTML: 'text/html'>,
       <ConverterMimeType.JSON: 'application/json'>,
       <ConverterMimeType.MD: 'text/markdown'>,
       <ConverterMimeType.TEXT: 'text/plain'>,
       <ConverterMimeType.PDF: 'application/pdf'>,
       <ConverterMimeType.PPTX: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'>,
       <ConverterMimeType.XLSX: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'>],
      'additional_mimetypes': None}},
    'csv': {'type': 'haystack.components.converte

In [8]:
from haystack_experimental.core.super_component import SuperComponent

super_file_converter = SuperComponent.from_dict(file_converter.to_super_component_dict())
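Since the expanded form is a plain nested dictionary, it can be inspected and edited with ordinary Python before it is handed to `SuperComponent.from_dict()`. The sketch below illustrates that on a hand-written stand-in that mirrors only the part of the `to_super_component_dict()` output visible above. The helpers `list_components` and `with_max_runs` are hypothetical, the MIME types are shortened to plain strings, and whether an edited dictionary still loads is up to `SuperComponent.from_dict()` to validate.

```python
import copy
from typing import Any, Dict

# Hand-written stand-in mirroring the shape of the truncated output above.
# Assumption: the real dictionary holds more components and uses enum members
# for the MIME types.
expanded: Dict[str, Any] = {
    "type": "haystack_experimental.core.super_component.super_component.SuperComponent",
    "init_parameters": {
        "pipeline": {
            "metadata": {},
            "max_runs_per_component": 100,
            "components": {
                "router": {
                    "type": "haystack.components.routers.file_type_router.FileTypeRouter",
                    "init_parameters": {
                        "mime_types": ["text/csv", "text/markdown", "application/pdf"],
                        "additional_mimetypes": None,
                    },
                },
            },
        },
    },
}


def list_components(component_dict: Dict[str, Any]) -> Dict[str, str]:
    """Map each pipeline component name to its type string."""
    components = component_dict["init_parameters"]["pipeline"]["components"]
    return {name: spec["type"] for name, spec in components.items()}


def with_max_runs(component_dict: Dict[str, Any], max_runs: int) -> Dict[str, Any]:
    """Return an edited deep copy; the input dictionary is left untouched."""
    edited = copy.deepcopy(component_dict)
    edited["init_parameters"]["pipeline"]["max_runs_per_component"] = max_runs
    return edited


edited = with_max_runs(expanded, max_runs=10)
print(list_components(expanded))
print(edited["init_parameters"]["pipeline"]["max_runs_per_component"])
```

Working on a deep copy keeps the original expanded dictionary available for comparison or for a second variant.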