<a href="https://colab.research.google.com/github/dmarchignoli/haystack-rag/blob/main/ingest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [1]:
!rm -rf /content/sample_data
!(cd /content/haystack-rag/ && git pull) || git clone https://github.com/dmarchignoli/haystack-rag.git /content/haystack-rag
%cd /content/haystack-rag/

/bin/bash: line 1: cd: /content/haystack-rag/: No such file or directory
Cloning into '/content/haystack-rag'...
remote: Enumerating objects: 163, done.
remote: Counting objects: 100% (163/163), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 163 (delta 75), reused 108 (delta 33), pack-reused 0 (from 0)
Receiving objects: 100% (163/163), 150.02 KiB | 771.00 KiB/s, done.
Resolving deltas: 100% (75/75), done.
/content/haystack-rag


In [2]:
%cd /content/haystack-rag/
import sys, os, importlib
if not sys.path[-1].startswith(os.getcwd()):
  sys.path.append(os.path.join(os.getcwd(), 'src'))
if 'haystack_rag' in sys.modules.keys():
  for m in [x for x in sys.modules.keys() if x.startswith('haystack_rag')]:
    del sys.modules[m]
  importlib.invalidate_caches()

os.environ.get("PYTHONPATH", "").split(":")
%env PYTHONPATH=/env/python:/content/haystack-rag/src

/content/haystack-rag
env: PYTHONPATH=/env/python:/content/haystack-rag/src


In [3]:
#!pip install 'haystack-ai>=2.7.0' 'sentence-transformers>=3.3.1' 'unstructured-fileconverter-haystack>=0.4.1' \
#  'google-cloud-storage>=2.19.0' 'pdfminer-six>=20240706'
!pip install -qe /content/haystack-rag

  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done

# Prepare pipeline

In [None]:
from haystack import Pipeline
from google.colab import userdata
from haystack.utils import Secret
from haystack.components.fetchers import LinkContentFetcher
from haystack_rag import GCSDocumentStore
from haystack_rag import DocIdIndexer, DocIdFilter
from haystack.components.caching import CacheChecker
from haystack.components.writers import DocumentWriter
from haystack.components.joiners.document_joiner import DocumentJoiner
from haystack.components.joiners import BranchJoiner
from haystack.components.converters import PDFMinerToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import NLTKDocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack.document_stores.types import DuplicatePolicy
#from haystack_rag import DocMetaFixer, ByteStreamMaterializer

# Google Cloud project used for the GCS storage cache
project_id = 'cellular-ring-443811-t2'
# Google cloud storage bucket
bucket_name = 'haystack-docs-826421350323'
docs_store = GCSDocumentStore(project_id=project_id, bucket_name=bucket_name)

chunks_store = QdrantDocumentStore(
    url="https://78684256-5f96-47e6-9691-c5a9efc8d97c.eu-central-1-0.aws.cloud.qdrant.io:6333",
    api_key = Secret.from_token(userdata.get('QDRANT_API_KEY')),
    embedding_dim=896,
    similarity="cosine",
    index="haystack-rag",
    recreate_index=False) # type: ignore

pipeline = Pipeline()

docs_cache_checker = CacheChecker(document_store=docs_store, cache_field="url") #type: ignore
pipeline.add_component(instance=docs_cache_checker, name="docs_cache_checker")

fetcher = LinkContentFetcher() # type: ignore
pipeline.add_component(instance=fetcher, name="fetcher")
pipeline.connect("docs_cache_checker.misses", "fetcher")

converter = PDFMinerToDocument() #type: ignore
pipeline.add_component(instance=converter, name="converter")
pipeline.connect("fetcher", "converter")

docs_writer = DocumentWriter(document_store=docs_store) #type: ignore
pipeline.add_component(instance=docs_writer, name="docs_writer")
pipeline.connect("converter", "docs_writer")

docs_joiner = DocumentJoiner(join_mode="concatenate") #type: ignore
pipeline.add_component(instance=docs_joiner, name="docs_joiner")
pipeline.connect("docs_writer", "docs_joiner")
pipeline.connect("docs_cache_checker.hits", "docs_joiner")

cleaner = DocumentCleaner(remove_regex=r"\A[0-9]+\Z") #type: ignore
pipeline.add_component(instance=cleaner, name="cleaner")
pipeline.connect("docs_joiner", "cleaner")

splitter = NLTKDocumentSplitter(split_by="sentence", split_length=5,
                                split_overlap=1, language="it") #type: ignore
pipeline.add_component(instance=splitter, name="splitter")
pipeline.connect("cleaner", "splitter")

chunks_id_extractor = DocIdIndexer() #type: ignore
pipeline.add_component(instance=chunks_id_extractor, name="chunks_id_extractor")
pipeline.connect("splitter", "chunks_id_extractor")

chunks_cache_checker = CacheChecker(document_store=chunks_store, cache_field="id") #type: ignore
pipeline.add_component(instance=chunks_cache_checker, name="chunks_cache_checker")
pipeline.connect("chunks_id_extractor", "chunks_cache_checker")

missing_chunks_filter = DocIdFilter() #type: ignore
pipeline.add_component(instance=missing_chunks_filter, name="missing_chunks_filter")
pipeline.connect("chunks_cache_checker.misses", "missing_chunks_filter.ids")
pipeline.connect("splitter", "missing_chunks_filter.documents")

model_name="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1"
embedder = SentenceTransformersDocumentEmbedder(model=model_name) #type: ignore
pipeline.add_component(instance=embedder, name="embedder")
pipeline.connect("missing_chunks_filter", "embedder")

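# Pass a policy member rather than the enum class itself. OVERWRITE is an
# assumption: only chunks missing from the store should reach this writer anyway.
writer = DocumentWriter(document_store=chunks_store,
                        policy=DuplicatePolicy.OVERWRITE) #type: ignore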
pipeline.add_component(instance=writer, name="writer")
pipeline.connect("embedder", "writer")

pipeline.warm_up()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/208 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/601k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/55.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/655 [00:00<?, ?B/s]
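
The `remove_regex` pattern given to `DocumentCleaner` above uses `\A` and `\Z`, which anchor to the start and end of the whole string rather than to individual lines. The following is a minimal sketch with the standard `re` module only (it does not reproduce the cleaner's internals, and the sample text is made up) showing how that differs from a line-anchored pattern. If bare page-number lines were the target, the line-anchored form may be closer to the intent, but that is a guess.

```python
import re

# Made-up sample: body text with a bare page number on its own line.
page = "Il conto corrente\n12\nLa scelta e i costi"

# \A and \Z match only at the very start and end of the string, so this
# pattern fires only when the whole string consists of digits.
whole_string = re.compile(r"\A[0-9]+\Z")

# ^ and $ with re.MULTILINE match at line boundaries, so a line that
# contains only digits is matched wherever it appears.
per_line = re.compile(r"^[0-9]+$", flags=re.MULTILINE)

unchanged = whole_string.sub("", page)   # no match: text is returned as-is
stripped = per_line.sub("", page)        # the "12" line is emptied
only_digits = whole_string.sub("", "12") # whole string is digits: removed
```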

# Run pipeline

In [None]:
from google.colab import auth
from google.auth import default

auth.authenticate_user()
try:
  # Attempt to get default credentials
  credentials, project = default()
  print("You are already authenticated as "+credentials.service_account_email)
except Exception as e:
  auth.authenticate_user()

You are already authenticated as default


In [None]:
from haystack_rag.utils import load_library_urls

lib_urls = load_library_urls()
#print(pipeline.inputs())
result = pipeline.run(data={"docs_cache_checker": {"items": lib_urls[:2]}},
                      include_outputs_from={"missing_chunks_filter", "embedder", "writer"})


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

100it [00:01, 85.95it/s]


# Retrieval

In [None]:
from google.colab import userdata
from haystack.utils import Secret
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant.filters import convert_filters_to_qdrant

# chunks_store = QdrantDocumentStore(
#     url="https://78684256-5f96-47e6-9691-c5a9efc8d97c.eu-central-1-0.aws.cloud.qdrant.io:6333",
#     api_key = Secret.from_token(userdata.get('QDRANT_API_KEY')),
#     embedding_dim=896,
#     similarity="cosine",
#     index="haystack-rag",
#     recreate_index=False) # type: ignore

from haystack import Pipeline

query_pipeline = Pipeline()

model_name="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1"
query_instruction = "Instruct: Given a query, retrieve documents that answer the query.\nQuery: "
text_embedder = SentenceTransformersTextEmbedder(model=model_name, prefix=query_instruction) #type: ignore
query_pipeline.add_component("embedder", text_embedder)

retriever = QdrantEmbeddingRetriever(
    document_store=chunks_store,
    top_k=10, score_threshold=0.75,
    return_embedding=True) #type: ignore
query_pipeline.add_component("retriever", retriever)
query_pipeline.connect("embedder", "retriever")

query_pipeline.warm_up()
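
As a rough illustration of what the query pipeline above does, the sketch below prepends the instruction prefix to the query text and then keeps the best-scoring chunks, applying `top_k` and `score_threshold` to cosine similarities. It is plain Python over toy two-dimensional vectors, not the Qdrant retriever itself; the helper names (`build_query_text`, `cosine`, `retrieve`) and the toy data are assumptions for illustration.

```python
import math

QUERY_INSTRUCTION = "Instruct: Given a query, retrieve documents that answer the query.\nQuery: "

def build_query_text(query, prefix=QUERY_INSTRUCTION):
    """Prepend the instruction prefix to the raw query text."""
    return prefix + query

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, top_k=10, score_threshold=0.75):
    """Score every chunk, drop those below the threshold, return the top_k best."""
    scored = [(cosine(query_vec, vec), name) for name, vec in chunks]
    kept = [item for item in scored if item[0] >= score_threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:top_k]

# Toy data (hypothetical): chunk "c" is orthogonal to the query and is filtered out.
toy_chunks = [("a", [1.0, 0.0]), ("b", [0.8, 0.6]), ("c", [0.0, 1.0])]
hits = retrieve([1.0, 0.0], toy_chunks, top_k=2)
full_query = build_query_text("Quali sono le operazioni possibili sul conto corrente?")
```

A threshold of 0.75 on cosine similarity is fairly strict, so an off-topic query may well return no documents at all.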

In [None]:
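# Alternative test queries (in Italian, like the corpus); uncomment one to try it.
#q='In cosa differisce il conto corrente dal conto deposito?'  # How does a current account differ from a deposit account?
#q='Chi ha vinto la coppa Davis nel 2022?'  # Who won the Davis Cup in 2022? (presumably an off-topic check)
q='Quali sono le operazioni possibili sul conto corrente?'  # Which operations are possible on a current account?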

result = query_pipeline.run({'embedder': {'text': q}}, include_outputs_from=['retriever'])
for d in result['retriever']['documents']:
  print(d.score, ' - ', d.meta['url'].split('/')[-1])
  print(d.content[:128])
  print()

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

0.8539894  -  Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf
Puoi depositare in banca il denaro, puoi versare e prelevare denaro dal conto in qualsiasi momento e disporre di servizi, come: 

0.85197246  -  Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf
Servizi: l’operatività del conto corrente bancario è molto ampia; al contrario, quella del deposito è limitata perché lo scopo p

0.8473186  -  Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf
Nel momento di apertura del conto, valuta anche se sia meglio aprire un conto corrente intestato solo a te, oppure intestato anc

0.8343934  -  Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf
Reclami? Ecco chi contattare _ _ _ _ 20
Come chiudere un conto corrente _ _ 21
dalla Aalla
Il conto corrente _ _ _ _ _ _ _ _ _ _

0.8316898  -  Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf
domiciliazione): esempi sono gli affitti, le utenze, l

# Eval steps

In [None]:
result.keys()
len(result['missing_chunks_filter']['documents'])
#result['chunks_cache_checker'].keys()

0
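
A count of 0 is what one would expect when every chunk ID is already present in the chunk store, so nothing is left to embed. The sketch below mimics that check-then-filter flow in plain Python. `DocIdIndexer` and `DocIdFilter` are project components whose source is not shown here, so the `Chunk` class and the three functions are hypothetical stand-ins, not their actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    content: str

def extract_ids(chunks):
    """Stand-in for the ID extraction step: one ID per chunk, in order."""
    return [chunk.id for chunk in chunks]

def split_hits_misses(ids, stored_ids):
    """Stand-in for the cache check: IDs already stored vs. IDs still missing."""
    hits = [i for i in ids if i in stored_ids]
    misses = [i for i in ids if i not in stored_ids]
    return hits, misses

def keep_missing(chunks, missing_ids):
    """Stand-in for the filter step: keep only chunks whose ID is missing."""
    wanted = set(missing_ids)
    return [chunk for chunk in chunks if chunk.id in wanted]

chunks = [Chunk("a1", "first"), Chunk("b2", "second"), Chunk("c3", "third")]

# Everything already stored: nothing reaches the embedder.
_, misses_all_cached = split_hits_misses(extract_ids(chunks), {"a1", "b2", "c3"})
to_embed_all_cached = keep_missing(chunks, misses_all_cached)

# Only "a1" stored: the other two chunks still need embedding.
_, misses_partial = split_hits_misses(extract_ids(chunks), {"a1"})
to_embed_partial = keep_missing(chunks, misses_partial)
```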

In [None]:
def stored_doc_urls(chunks_store):
  return set([doc.meta['url'] for doc in chunks_store.get_documents_generator()])

stored_doc_urls(chunks_store)


{'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-conto-corrente/Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf',
 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-mutuo/Le-guide-della-Banca-d-Italia_Comprare-una-casa_Il-mutuo-ipotecario-in-parole-semplici.pdf'}
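
A small variation on `stored_doc_urls` that may be handy when checking an ingestion run: counting chunks per source URL instead of only listing the URLs. This is a sketch over stand-in objects with a `meta` dict; the helper name `chunks_per_url` and the example URLs are placeholders.

```python
from collections import Counter
from types import SimpleNamespace

def chunks_per_url(docs):
    """Count how many chunks share each meta['url'] value."""
    return Counter(doc.meta["url"] for doc in docs)

# Stand-in documents (placeholder URLs) instead of a live document store.
fake_docs = [
    SimpleNamespace(meta={"url": "https://example.com/a.pdf"}),
    SimpleNamespace(meta={"url": "https://example.com/a.pdf"}),
    SimpleNamespace(meta={"url": "https://example.com/b.pdf"}),
]
counts = chunks_per_url(fake_docs)
```

Against the real store, the same function could be fed `chunks_store.get_documents_generator()`, as `stored_doc_urls` does.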

In [None]:
def delete_docs(chunks_store, urls=None):
  if urls is None:
    chunks_store.delete_documents()
  else:
    doc_ids = set([doc.id for doc in chunks_store.get_documents_generator() if doc.meta['url'] in urls])
    chunks_store.delete_documents(document_ids=list(doc_ids))

In [None]:
from haystack_rag import DocIdIndexer

result['embedder']['documents']

docIndexer = DocIdIndexer() #type: ignore
result = docIndexer.run(documents=result['embedder']['documents'])


In [None]:
result['chunks_cache_checker'].keys()

dict_keys(['hits'])

In [None]:
!pip install fastembed fastembed-haystack

Collecting fastembed
  Downloading fastembed-0.4.2-py3-none-any.whl.metadata (8.2 kB)
Collecting fastembed-haystack
  Downloading fastembed_haystack-1.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting loguru<0.8.0,>=0.7.2 (from fastembed)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting mmh3<5.0.0,>=4.1.0 (from fastembed)
  Downloading mmh3-4.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting onnx<2.0.0,>=1.15.0 (from fastembed)
  Downloading onnx-1.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxruntime<1.20.0,>=1.17.0 (from fastembed)
  Downloading onnxruntime-1.19.2-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting pillow<11.0.0,>=10.3.0 (from fastembed)
  Downloading pillow-10.4.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting py-rust-stemmers<0.2.0,>=0.1.0 (from fastembed)
  

In [None]:
from haystack_integrations.components.embedders.fastembed import FastembedDocumentEmbedder

model_name = "intfloat/multilingual-e5-large"

fast_embedder = FastembedDocumentEmbedder(
    model=model_name, batch_size=256
)
fast_embedder.warm_up()
#docs_w_embeddings = fast_embedder.run(documents=[documents[:1]])["documents"]


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

model.onnx:   0%|          | 0.00/546k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/716 [00:00<?, ?B/s]

model.onnx_data:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

In [None]:
result = fast_embedder.run(documents=doc_chunks[:6])


Calculating embeddings: 100%|██████████| 6/6 [00:28<00:00,  4.67s/it]


In [None]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

#embedder_model_name="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1"
embedder_model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct"
document_embedder = SentenceTransformersDocumentEmbedder(model=embedder_model_name) # type: ignore

document_embedder.warm_up()

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/284 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/145k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/55.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/879 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/27.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.11G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/1.31k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/80.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/370 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

In [None]:
result = document_embedder.run(documents=doc_chunks[:6])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
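
When trying a different embedding model, the vector size may differ from the `embedding_dim=896` configured on the Qdrant store above, in which case writes to that collection may be rejected. A small guard can catch this before writing. The sketch uses plain lists, and the helper name `check_embedding_dim` is hypothetical.

```python
def check_embedding_dim(embeddings, expected_dim):
    """Return the indices of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]

EXPECTED_DIM = 896  # value configured on the chunk store above

# Toy vectors: the second one has the wrong size on purpose.
vectors = [[0.0] * 896, [0.0] * 1024]
bad_indices = check_embedding_dim(vectors, EXPECTED_DIM)
```

With real embedder output, the input could be `[d.embedding for d in result["documents"]]`.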

In [None]:
import pandas as pd

from fastembed import (
    SparseTextEmbedding,
    TextEmbedding,
    LateInteractionTextEmbedding,
    ImageEmbedding,
)
from fastembed.rerank.cross_encoder import TextCrossEncoder

supported_models = (
    pd.DataFrame(TextEmbedding.list_supported_models())
    .sort_values("size_in_GB")
    .drop(columns=["additional_files"])
    .reset_index(drop=True)
)
supported_models

| | model | dim | description | license | size_in_GB | sources | model_file |
|---|---|---|---|---|---|---|---|
| 0 | BAAI/bge-small-en-v1.5 | 384 | Text embeddings, Unimodal (text), English, 512... | mit | 0.067 | {'hf': 'qdrant/bge-small-en-v1.5-onnx-q'} | model_optimized.onnx |
| 1 | BAAI/bge-small-zh-v1.5 | 512 | Text embeddings, Unimodal (text), Chinese, 512... | mit | 0.09 | {'url': 'https://storage.googleapis.com/qdrant... | model_optimized.onnx |
| 2 | snowflake/snowflake-arctic-embed-xs | 384 | Text embeddings, Unimodal (text), English, 512... | apache-2.0 | 0.09 | {'hf': 'snowflake/snowflake-arctic-embed-xs'} | onnx/model.onnx |
| 3 | sentence-transformers/all-MiniLM-L6-v2 | 384 | Text embeddings, Unimodal (text), English, 256... | apache-2.0 | 0.09 | {'url': 'https://storage.googleapis.com/qdrant... | model.onnx |
| 4 | jinaai/jina-embeddings-v2-small-en | 512 | Text embeddings, Unimodal (text), English, 819... | apache-2.0 | 0.12 | {'hf': 'xenova/jina-embeddings-v2-small-en'} | onnx/model.onnx |
| 5 | nomic-ai/nomic-embed-text-v1.5-Q | 768 | Text embeddings, Multimodal (text, image), Eng... | apache-2.0 | 0.13 | {'hf': 'nomic-ai/nomic-embed-text-v1.5'} | onnx/model_quantized.onnx |
| 6 | snowflake/snowflake-arctic-embed-s | 384 | Text embeddings, Unimodal (text), English, 512... | apache-2.0 | 0.13 | {'hf': 'snowflake/snowflake-arctic-embed-s'} | onnx/model.onnx |
| 7 | BAAI/bge-small-en | 384 | Text embeddings, Unimodal (text), English, 512... | mit | 0.13 | {'url': 'https://storage.googleapis.com/qdrant... | model_optimized.onnx |
| 8 | BAAI/bge-base-en-v1.5 | 768 | Text embeddings, Unimodal (text), English, 512... | mit | 0.21 | {'url': 'https://storage.googleapis.com/qdrant... | model_optimized.onnx |
| 9 | sentence-transformers/paraphrase-multilingual-... | 384 | Text embeddings, Unimodal (text), Multilingual... | apache-2.0 | 0.22 | {'hf': 'qdrant/paraphrase-multilingual-MiniLM-... | model_optimized.onnx |


In [None]:
(
    pd.DataFrame(LateInteractionTextEmbedding.list_supported_models())
    .sort_values("size_in_GB")
    .drop(columns=["sources", "model_file"])
    .reset_index(drop=True)
)


| | model | dim | description | license | size_in_GB | additional_files |
|---|---|---|---|---|---|---|
| 0 | answerdotai/answerai-colbert-small-v1 | 96 | Text embeddings, Unimodal (text), Multilingual... | apache-2.0 | 0.13 | |
| 1 | colbert-ir/colbertv2.0 | 128 | Late interaction model | mit | 0.44 | |
| 2 | jinaai/jina-colbert-v2 | 128 | New model that expands capabilities of colbert... | cc-by-nc-4.0 | 2.24 | [onnx/model.onnx_data] |


In [None]:
from haystack.components.fetchers import LinkContentFetcher

fetcher = LinkContentFetcher() # type: ignore

r = fetcher.run(urls=lib_urls)
docscontent = r['streams']
print(type(docscontent), len(docscontent), type(docscontent[0]))

docscontent[0].meta

<class 'list'> 5 <class 'haystack.dataclasses.byte_stream.ByteStream'>


{'content_type': 'application/pdf',
 'url': 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-conto-corrente/Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf'}

In [None]:
from haystack_rag import ByteStreamMaterializer

materializer = ByteStreamMaterializer()
r = materializer.run(docscontent)

docs_paths = r['paths']
print(docs_paths)


NameError: name 'docscontent' is not defined

In [None]:
import os
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
from google.colab import userdata
from haystack.utils import Secret

converter = UnstructuredFileConverter(
    api_url='https://unstructured-api-826421350323.us-central1.run.app/',
    api_key=Secret.from_token(userdata.get('UNSTRUCTURED_API_KEY'))
)
r = converter.run(docs_paths)
documents = r['documents']
print(type(documents), len(documents), type(documents[0]))

Converting files to Haystack Documents: 5it [10:14, 122.81s/it]


KeyError: 0

In [None]:
from haystack_rag import DocMetaFixer

docMetaFixer = DocMetaFixer()
r = docMetaFixer.run(documents=documents, origin_urls=lib_urls)
documents = r['documents']
[d.meta for d in documents]

[{'file_path': '/root/.cache/haystack/Guida-centrale-rischi.pdf',
  'url': 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-centrale/Guida-centrale-rischi.pdf'},
 {'file_path': '/root/.cache/haystack/Le-guide-della-Banca-d-Italia_Comprare-una-casa_Il-mutuo-ipotecario-in-parole-semplici.pdf',
  'url': 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-mutuo/Le-guide-della-Banca-d-Italia_Comprare-una-casa_Il-mutuo-ipotecario-in-parole-semplici.pdf'},
 {'file_path': '/root/.cache/haystack/Le-guide-della-Banca-d-Italia_Il-credito-ai-consumatori-in-parole-semplici.pdf',
  'url': 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-credito-consumatori/Le-guide-della-Banca-d-Italia_Il-credito-ai-consumatori-in-parole-semplici.pdf'},
 {'file_path': '/root/.cache/haystack/Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf',
  'url': 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-conto-corrente/Le_guide_della_Banca_d_Italia_Il_conto_corrente_in

In [None]:
from haystack.components.caching import CacheChecker

cache_checker = CacheChecker(document_store=docs_store, cache_field="url")  # docs_store: the GCSDocumentStore from "Prepare pipeline"

cache_checker.run(items=['https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-centrale/Guida-centrale-rischi.pdf', 'new'])

#json.loads(docs_store._bucket.blob('docs/Guida-centrale-rischi.pdf.json').download_as_string()).keys()

{'hits': [Document(id=99f3eebdbeb814c39cfcdec1eab8ace575fc77ba80d92e3ac81385e3a4e3f353, content: 'LECONOMIA "'I"ERTUTT'
  
  LE GUIDE DELLA BANCA D’ITALIA
  
  LA CENTRALE DEI RISCHI in parole semplici
  
  COS...', meta: {'file_path': '/root/.cache/haystack/Guida-centrale-rischi.pdf', 'url': 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-centrale/Guida-centrale-rischi.pdf'})],
 'misses': ['new']}
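
The output above shows the shape of a cache check: items found in the store come back as documents under `hits`, and items not found come back unchanged under `misses`. A minimal plain-Python sketch of that behaviour follows; it uses dicts as stand-in documents and a hypothetical `check_cache` helper, and is not the library's implementation.

```python
def check_cache(items, stored_docs, cache_field):
    """Split items into stored documents (hits) and items with no match (misses)."""
    hits, misses = [], []
    for item in items:
        found = [doc for doc in stored_docs if doc.get(cache_field) == item]
        if found:
            hits.extend(found)
        else:
            misses.append(item)
    return {"hits": hits, "misses": misses}

# Stand-in store content: one already-ingested PDF, keyed by its URL.
known_url = "https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-centrale/Guida-centrale-rischi.pdf"
stored = [{"url": known_url, "content": "..."}]

outcome = check_cache([known_url, "new"], stored, cache_field="url")
```

In the ingestion pipeline, only the `misses` branch is sent to the fetcher, which is what avoids downloading and converting the same PDF twice.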

In [None]:
from haystack.components.preprocessors import NLTKDocumentSplitter

splitter = NLTKDocumentSplitter(split_by="sentence", split_length=5, split_overlap=2)

xs = splitter.run(documents[:1])['documents']
for x in xs[:50]:
  print(x.id)
  print(x.meta)
  print(x.content)




c8f98666d072dd6d0c561b70308036ce1c5e34f6c5c78c7fb3f3c68b0009db38
{'url': 'https://www.bancaditalia.it/pubblicazioni/guide-bi/guida-conto-corrente/Le_guide_della_Banca_d_Italia_Il_conto_corrente_in_parole_semplici.pdf', 'source_id': '18a07679c19954343a8adad3b58fb315031864a17123d21d39b0d8aacd04c20d', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': '1d72d1a17e040fe1caa8dd28871e76f6c17b43ea0b3ecdfd9402c83917654156', 'range': (0, 139)}]}
LE GUIDE DELLA BANCA D’ITALIA
IL CONTO
CORRENTE in parole semplici
La La SCELTA e i COSTI
SCELTA e i COSTI
I DIRITTI del cliente
I DIRITTI del cliente
I CONTATTI utili
I CONTATTI utili
Il conto corrente dalla A alla Z
Il conto corrente dalla A alla ZBanca d’Italia Via Nazionale, 91 00184 Roma Tel. +39 06 47921 PEC: bancaditalia@pec.bancaditalia.it e-mail: email@bancaditalia.it
ISSN 2384-8871 (stampa)
ISSN 2283-5989 (online)
Grafica e stampa a cura della Divisione Editoria e stampa della Banca d’Italia
Versione aggiornat
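
To make the `split_length` and `split_overlap` settings concrete, here is a simplified sliding-window sketch over an already tokenized list of sentences. It ignores everything else the real splitter does (sentence detection, page tracking, metadata such as `_split_overlap`), and the function name `window_sentences` is hypothetical.

```python
def window_sentences(sentences, split_length, split_overlap):
    """Group sentences into windows of split_length that share split_overlap sentences."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + split_length])
        # Stop once a window has reached the last sentence, so no trailing
        # window is emitted that only repeats already covered sentences.
        if start + split_length >= len(sentences):
            break
    return windows

sentences = [f"s{i}" for i in range(10)]
windows = window_sentences(sentences, split_length=5, split_overlap=2)
```

With 10 sentences, a length of 5 and an overlap of 2, each window starts 3 sentences after the previous one, so consecutive chunks share their last and first two sentences.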

In [None]:
import textwrap

#print(textwrap.fill(documents[0].content, width=80))
print(documents[0].content)


LE GUIDE DELLA BANCA D’ITALIA
IL CONTO
CORRENTE in parole semplici
La La SCELTA e i COSTI
SCELTA e i COSTI
I DIRITTI del cliente
I DIRITTI del cliente
I CONTATTI utili
I CONTATTI utili
Il conto corrente dalla A alla Z
Il conto corrente dalla A alla ZBanca d’Italia Via Nazionale, 91 00184 Roma Tel. +39 06 47921 PEC: bancaditalia@pec.bancaditalia.it e-mail: email@bancaditalia.it
ISSN 2384-8871 (stampa)
ISSN 2283-5989 (online)
Grafica e stampa a cura della Divisione Editoria e stampa della Banca d’Italia
Versione aggiornata a settembre 2022Cos’è il conto corrente bancario
Il conto corrente bancario è uno strumento che ti consente di depositare il denaro presso una banca, di effettuare le principali operazioni di pagamento – versamento di fondi, prelievo di contanti, esecuzione e ricezione di pagamenti, utilizzo di carte di pagamento e di assegni – e di usufruire di servizi come l’accredito dello stipendio o la domiciliazione delle bollette.
I consumatori, per le operazioni di pagamento 

In [None]:
from haystack.components.converters import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=docscontent)
documents = results["documents"]



* compare unstructured vs pdfminer text extraction: equivalent
* replace unstructured with pdfminer in pipeline
* finish ingestion pipeline
* ***TODO*** implement query
* ***TODO*** add store utils (show list of docs, delete all, delete docs)
* ***TODO*** implement generation


