## Populate Pinecone Document Store with Test Case Documents

To delete all existing records from the index, first install the gRPC client with `pip install "pinecone[grpc]"`, then run the following code.

In [1]:
from pinecone.grpc import PineconeGRPC as Pinecone
from dotenv import load_dotenv
import os

load_dotenv()
pinecone_api_key = os.getenv("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)
index = pc.Index("default")

index.delete(delete_all=True, namespace='default')



Initialize Pinecone Document Store

In [2]:
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore
from dotenv import load_dotenv

load_dotenv()
document_store = PineconeDocumentStore(
    index="default",
    namespace="default",
    dimension=384,
    metric="cosine",
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}},
)

Prepare pipeline components

In [3]:
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import TextFileToDocument, MarkdownToDocument, PyPDFToDocument 
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()
document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)
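
To make the `DocumentSplitter` settings above concrete, here is a minimal sketch of word-based chunking with overlap. It is an illustration only, not Haystack's implementation: `split_words` is a hypothetical helper, and the real splitter may handle whitespace and trailing chunks differently.

```python
def split_words(text: str, split_length: int = 150, split_overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `split_length` words.

    Consecutive chunks share `split_overlap` words.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # each new chunk starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# 400 dummy words with the settings used in this notebook (150 words per chunk, 50 shared).
sample = " ".join(f"w{i}" for i in range(400))
chunks = split_words(sample)
print(len(chunks))
```

With these settings each chunk starts 100 words after the previous one, so neighbouring chunks repeat 50 words of context.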

Create a pipeline to populate the Pinecone Document Store with test case documents

In [4]:
from haystack import Pipeline

preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.text/markdown", "markdown_converter.sources")
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("markdown_converter", "document_joiner")
preprocessing_pipeline.connect("document_joiner", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x0000026CBEEE5EE0>
🚅 Components
  - file_type_router: FileTypeRouter
  - text_file_converter: TextFileToDocument
  - markdown_converter: MarkdownToDocument
  - pypdf_converter: PyPDFToDocument
  - document_joiner: DocumentJoiner
  - document_cleaner: DocumentCleaner
  - document_splitter: DocumentSplitter
  - document_embedder: SentenceTransformersDocumentEmbedder
  - document_writer: DocumentWriter
🛤️ Connections
  - file_type_router.text/plain -> text_file_converter.sources (List[Path])
  - file_type_router.application/pdf -> pypdf_converter.sources (List[Path])
  - file_type_router.text/markdown -> markdown_converter.sources (List[Path])
  - text_file_converter.documents -> document_joiner.documents (List[Document])
  - markdown_converter.documents -> document_joiner.documents (List[Document])
  - pypdf_converter.documents -> document_joiner.documents (List[Document])
  - document_joiner.documents -> document_cleaner.documents (Li
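
Before running the pipeline, it can help to preview which files are likely to reach each converter. The sketch below groups paths by the MIME type guessed from the file extension, using Python's standard `mimetypes` module. `preview_routing` is a hypothetical helper and the file names are made up; the assumption is that this approximates how `FileTypeRouter` classifies files, without being guaranteed to match it.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path

# Register .md explicitly; some platforms do not map it by default.
mimetypes.add_type("text/markdown", ".md")


def preview_routing(paths, mime_types):
    """Group paths by guessed MIME type; anything not listed goes to 'unclassified'."""
    groups = defaultdict(list)
    for path in paths:
        guessed, _ = mimetypes.guess_type(Path(path).name)
        key = guessed if guessed in mime_types else "unclassified"
        groups[key].append(Path(path))
    return dict(groups)


# Hypothetical file names, for illustration only.
paths = [Path("a.txt"), Path("b.pdf"), Path("c.md"), Path("d.png")]
routing = preview_routing(paths, ["text/plain", "application/pdf", "text/markdown"])
for mime, files in routing.items():
    print(mime, [f.name for f in files])
```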

Run the pipeline

In [5]:
from pathlib import Path

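# Collect every file under the test-case folder; directories are skipped.
sources = [path for path in Path("./Data/Test_Case").glob("**/*") if path.is_file()]
preprocessing_pipeline.run({"file_type_router": {"sources": sources}})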

Converting markdown files to Documents: 100%|██████████| 8/8 [00:00<00:00, 1599.81it/s]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Document 136bcc65efa842bf4dab45cc51c0b89df20d582eb2c965c08134364c020bc1ac has metadata fields with unsupported types: ['_split_overlap']. Only str, int, bool, and List[str] are supported. The values of these fields will be discarded.
Document a6c0e5304630bcd37e2dad1fff7ab046d0d9965e26384b08d134d08a3bf009cd has metadata fields with unsupported types: ['_split_overlap']. Only str, int, bool, and List[str] are supported. The values of these fields will be discarded.


Upserted vectors:   0%|          | 0/25 [00:00<?, ?it/s]

{'document_writer': {'documents_written': 25}}
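
The warnings above report that metadata values of unsupported types (here `_split_overlap`) are discarded on write. If silent discarding is undesirable, one option is to filter metadata explicitly beforehand. The sketch below keeps only the types named in the warning (str, int, bool, and lists of str). `split_supported_meta` is a hypothetical helper, not part of Haystack or the Pinecone integration, and the sample metadata values are invented for illustration.

```python
def is_supported(value) -> bool:
    """Return True for the value types listed in the warning message."""
    if isinstance(value, (str, int, bool)):
        return True
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def split_supported_meta(meta: dict) -> tuple[dict, dict]:
    """Return (kept, dropped) metadata so dropped fields can be logged or inspected."""
    kept = {key: value for key, value in meta.items() if is_supported(value)}
    dropped = {key: value for key, value in meta.items() if key not in kept}
    return kept, dropped


# Hypothetical metadata, shaped roughly like that of a split document.
meta = {
    "file_path": "Data/Test_Case/example.md",
    "split_id": 1,
    "_split_overlap": [{"doc_id": "abc", "range": [0, 50]}],
}
kept, dropped = split_supported_meta(meta)
print(kept)
print(list(dropped))
```

Applying such a filter to each document's `meta` before the writer step would make the dropped fields explicit rather than relying on the warning.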

## Test RAG with Pinecone Document Store

Restart the kernel and run the following code to test the RAG pipeline with the populated Pinecone Document Store.

In [1]:
import openai
from dotenv import load_dotenv
import os
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

document_store = PineconeDocumentStore(
    index="default",
    namespace="default",
    dimension=384,
    metric="cosine",
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}},
)
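
Because this section starts from a fresh kernel, a key missing from `.env` may only surface later, when a component first needs it. The optional check below reports the problem early. It assumes the two variable names that appear in this notebook, `PINECONE_API_KEY` and `OPENAI_API_KEY`; `missing_env_vars` is a hypothetical helper.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


required = ["PINECONE_API_KEY", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```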

Create pipeline to run a query

In [2]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.pinecone import PineconeEmbeddingRetriever
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack import Pipeline

template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
"""

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
retriever = PineconeEmbeddingRetriever(document_store=document_store)
generator = OpenAIGenerator()
answer_builder = AnswerBuilder()
prompt_builder = PromptBuilder(template=template)

rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component("answer_builder", answer_builder)

rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x0000012E37142A20>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: PineconeEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
  - llm.replies -> answer_builder.replies (List[str])

Run the pipeline with a query

In [4]:
query = "Generate full documentation of DataAnalyzer project"
result = rag_pipeline.run({
    "text_embedder": {"text": query},
    "prompt_builder": {"question": query},
    "answer_builder": {"query": query}
})

answer = result["answer_builder"]["answers"][0]
print(answer.query)
print(answer.data)
print(answer.documents)

# Write as UTF-8 so non-ASCII characters in the generated Markdown are saved reliably on any platform.
with open("./Data/Outputs/output.md", "w", encoding="utf-8") as f:
    f.write(answer.data)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Generate full documentation of DataAnalyzer project
# DataAnalyzer Project Documentation

## Module Name: DataAnalyzer
## Version: 2.5.0
## Author: Tech Solutions

## Overview:
The DataAnalyzer project provides a fast and optimized approach to data analysis, suitable for small datasets in personal and academic projects. It offers various functions for processing datasets and generating basic statistics and visualizations.

## Functions:
1. **process_data(dataset: list) -> dict**
   This function processes the dataset and returns basic statistics such as:
   - Mean
   - Mode
   - Variance
   Usage Example:
   ```python
   from DataAnalyzer import process_data
   dataset = [2, 8, 22, 18, 25]
   stats = process_data(dataset)
   ```

2. **graph_data(dataset: list, chart: str = 'scatter') -> None**
   This function generates a chart based on the dataset. The default chart type is 'scatter'.
   Example:
   ```python
   from DataAnalyzer import graph_data
   dataset = [2, 8, 22, 18, 25]
   gr