# Wire RAG <a href="https://colab.research.google.com/github/appunite/Wire-RAG/blob/main/main_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Install dependencies for colab

In [None]:
from importlib.metadata import metadata
from itertools import count
!pip install haystack-ai pinecone-haystack sentence-transformers pinecone transformers
!wget -P utils https://raw.githubusercontent.com/appunite/Wire-RAG/main/utils/url_scraper.py
!wget -P utils https://raw.githubusercontent.com/appunite/Wire-RAG/main/utils/github_scraper.py

Enter api keys

In [6]:
import os
import getpass
os.environ["PINECONE_API_KEY"] = getpass.getpass("pinecone api key")
os.environ["OPENAI_API_KEY"] = getpass.getpass("open ai api key")
os.environ["GITHUB_API_TOKEN"] = getpass.getpass("github api token (PAT)")

Or load keys from .env file

In [1]:
from dotenv import load_dotenv
import os
load_dotenv()

True

## Populate Pinecone Database

### Scrape URLs


Whitelist: Allow any URL that begins with any element from the white_list.\
Blacklist: Block any URL that begins with any element from the black_list.

In [2]:
import nest_asyncio
from utils.url_scraper import start_scraping

# Apply the nest_asyncio patch to allow nested event loops in Jupyter
nest_asyncio.apply()

starting_url = "https://docs.wire.com"
depth_limit = 2

filter_list = {"white_list": ["https://docs.wire.com"], "black_list": []}
scraped_urls = await start_scraping(starting_url, depth_limit, filter_list)

print(f"Total URLs found: {len(scraped_urls)}")

Total URLs found: 429


### Extract metadata and content

In [3]:
from utils.url_scraper import extract_content_and_metadata, DATE_FORMATS, DATE_PATTERNS

scraped_urls_dict = []
for u in scraped_urls:
    scraped_urls_dict += extract_content_and_metadata(u, DATE_FORMATS, DATE_PATTERNS)
print(len(scraped_urls_dict))

6679


### Scrape Github

In [None]:
from utils.github_scraper import scrape_md_files

md_dict = await scrape_md_files(org_name="wireapp", api_key=os.getenv("GITHUB_API_TOKEN"), repo_limit=None)
print(len(md_dict))

### Save / Load .json

In [2]:
import json

# with open("./github_docs.json", "w", encoding='utf-8') as json_file:
#     json.dump(md_dict, json_file, ensure_ascii=False, indent=4)
# 
# with open("./docs_wire.json", "w", encoding='utf-8') as json_file:
#     json.dump(scraped_urls_dict, json_file, ensure_ascii=False, indent=4)
    
with open("./github_docs.json", 'r', encoding='utf-8') as json_file:
    md_dict = json.load(json_file)
print(len(md_dict), md_dict[0]['metadata'], sep='\n')

with open("./docs_wire.json", 'r', encoding='utf-8') as json_file:
    scraped_urls_dict = json.load(json_file)
print(len(scraped_urls_dict), scraped_urls_dict[0]['metadata'], sep='\n')

1366
{'url': 'https://github.com/wireapp/libsodium.js/blob/master/README.md', 'title': 'libsodium.js/README.md', 'headline': '', 'date': '2015-10-07'}
6679
{'url': 'https://docs.wire.com', 'title': 'Welcome to Wire’s documentation! — Wire 0.0.4 documentation', 'headline': 'Welcome to Wire’s documentation!\uf0c1', 'date': 'Unknown'}


### Populate database

To delete all records u need to `pip install "pinecone[grpc]"` and run the following code.

In [4]:
# Uncomment to delete all db records
# import os
# from pinecone import Pinecone
# Pinecone(api_key=os.getenv("PINECONE_API_KEY")).Index("wire-rag").delete(delete_all=True, namespace='docs-wire')

{}

Initialize Pinecone Document Store

In [3]:
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack import Pipeline
from haystack import Document
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore

docs_wire_ds = PineconeDocumentStore(
    index="wire-rag",
    namespace="docs-wire",
    dimension=384,
    metric="cosine",
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}}
)

github_wireapp_ds = PineconeDocumentStore(
    index="wire-rag",
    namespace="github-wireapp",
    dimension=384,
    metric="cosine",
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}}
)

scraped_urls_documents = [Document(content=doc["content"], meta=doc["metadata"]) for doc in scraped_urls_dict]
print(f"Scraped URLs documents: {len(scraped_urls_documents)}")

github_documents = [Document(content=doc["content"], meta=doc["metadata"]) for doc in md_dict]
print(f"Github documents: {len(github_documents)}")

Scraped URLs documents: 6679
Github documents: 1366


Create a pipelines to populate the Pinecone Document Store with both github and docs.wire documetns

In [4]:
# For all-MiniLM-L6-v2 default input text is 256 word pieces.
splitter_gh = DocumentSplitter(split_by="word", split_length=256, split_overlap=20)
embedder_gh = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
writer_gh = DocumentWriter(github_wireapp_ds)

pipeline_github = Pipeline()
pipeline_github.add_component(instance=splitter_gh, name="splitter_gh")
pipeline_github.add_component(instance=embedder_gh, name="embedder_gh")
pipeline_github.add_component(instance=writer_gh, name="writer_gh")

pipeline_github.connect("splitter_gh", "embedder_gh")
pipeline_github.connect("embedder_gh", "writer_gh")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f5ab63a2c30>
🚅 Components
  - splitter_gh: DocumentSplitter
  - embedder_gh: SentenceTransformersDocumentEmbedder
  - writer_gh: DocumentWriter
🛤️ Connections
  - splitter_gh.documents -> embedder_gh.documents (List[Document])
  - embedder_gh.documents -> writer_gh.documents (List[Document])

In [5]:
cleaner_scraped = DocumentCleaner()
# For all-MiniLM-L6-v2 default input text is 256 word pieces.
splitter_scraped = DocumentSplitter(split_by="word", split_length=256, split_overlap=20)
embedder_scraped = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
writer_scraped = DocumentWriter(docs_wire_ds)

pipeline_scraped = Pipeline()
pipeline_scraped.add_component(instance=cleaner_scraped, name="cleaner_scraped")
pipeline_scraped.add_component(instance=splitter_scraped, name="splitter_scraped")
pipeline_scraped.add_component(instance=embedder_scraped, name="embedder_scraped")
pipeline_scraped.add_component(instance=writer_scraped, name="writer_scraped")

pipeline_scraped.connect("cleaner_scraped", "splitter_scraped")
pipeline_scraped.connect("splitter_scraped", "embedder_scraped")
pipeline_scraped.connect("embedder_scraped", "writer_scraped")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f5ab63a2f60>
🚅 Components
  - cleaner_scraped: DocumentCleaner
  - splitter_scraped: DocumentSplitter
  - embedder_scraped: SentenceTransformersDocumentEmbedder
  - writer_scraped: DocumentWriter
🛤️ Connections
  - cleaner_scraped.documents -> splitter_scraped.documents (List[Document])
  - splitter_scraped.documents -> embedder_scraped.documents (List[Document])
  - embedder_scraped.documents -> writer_scraped.documents (List[Document])

Run the pipeline

In [None]:
pipeline_github.run(data = {"splitter_gh": { "documents" : github_documents }})
pipeline_scraped.run(data = {"cleaner_scraped": { "documents" : scraped_urls_documents }})
# preprocessing_pipeline.show()

## Test RAG with Pinecone Document Store

Restart the kernel and run the following code to test the RAG pipeline with the populated Pinecone Document Store.\
Create pipeline to run a query

In [1]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.pinecone import PineconeEmbeddingRetriever
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack import Pipeline
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore
from dotenv import load_dotenv

template = """You are a knowledgeable assistant responsible for creating comprehensive documentation based on the following list of documents that refer to the user’s question. The content of these documents may contradict each other, so please prioritize the information from the documents with the most recent dates. If there are contradicting documents with dates labeled as 'None', provide all alternatives and explicitly indicate which parts contradict one another. However, if a document with date labeled as 'None' does not conflict with others, it should be included without special mention.

Instructions:
2. Analyze the Documents:
   - Review each document, noting any conflicting information.
   - Prioritize information from the most recent documents.
3. Handling Documents with 'None' Date:
   - If a document has a date marked as 'None':
     - Include all relevant alternatives and clearly indicate contradictions.
     - If it does not conflict with other documents, include it without special mention.

Output Format:
Your output should be structured using Markdown and include the following sections:
1. Summary:
   - Provide a brief overview of the key findings from all documents.
2. Detailed Analysis:
   - Present detailed descriptions of key points, prioritizing the latest information.
   - Preserve and format any code snippets from the documents appropriately.
   - Present full semantic context retrieved from given documents.
3. Contradictions:
   - For documents dated 'None', list all relevant alternatives and explicitly highlight any contradictions.
   - Do not generate this section if there are no contradictions.

General Guidelines:
- Ensure thoroughness by including all relevant information, aiming for completeness rather than brevity.
- Use headings, lists, and code blocks to enhance readability and organization.
- Given .md files should be the base structure of generated file. If .md files are poor, treat them as regular source.

User Question: {{question}}
Documents to Analyze:
{% for doc in documents %}
Date: {{doc.meta['date']}}
Title: {{doc.meta['title']}} - {{doc.meta['headline']}}
Content: 
{{doc.content}}
{% endfor %}"""

load_dotenv()

document_store = PineconeDocumentStore(
    index="default",
    namespace="default",
    dimension=384,
    metric="cosine",
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}}
)

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
retriever = PineconeEmbeddingRetriever(document_store=document_store, top_k=30)
prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-4o-mini") # "gpt-4o-mini" "gpt-4o" "gpt-3.5-turbo"
answer_builder = AnswerBuilder()

rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("generator", generator)
rag_pipeline.add_component("answer_builder", answer_builder)

rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("retriever", "answer_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
rag_pipeline.connect("generator.replies", "answer_builder.replies")

# with open("./pipeline.yml", "w") as file:
#   rag_pipeline.dump(file)

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1f34060950>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: PineconeEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
  - generator.replies -> answer_builder.replies (List[str])

Run the pipeline with a query

In [19]:
query = "Federation, how to make requests between two federated backends."
result = rag_pipeline.run({
    "text_embedder": {"text": query},
    "prompt_builder": {"question": query},
    "answer_builder": {"query": query}
})

print(result['answer_builder']['answers'][0].query)
print(result['answer_builder']['answers'][0].data)
for i, doc in enumerate(result['answer_builder']['answers'][0].documents):
    print(f"{i + 1}. {doc.meta['headline']} - {doc.to_dict()['url']}")

with open("./output.md", "w") as f:
    f.write(result['answer_builder']['answers'][0].data)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Federation, how to make requests between two federated backends.
# Summary

The documentation collectively outlines the framework and requirements for making requests between two federated backends using the Wire 0.0.4 API. Key points emphasize the roles of the *Federator* and *Federation Ingress* components in facilitating communication between backends, including authentication and authorization processes. The most recent updates highlight enhancements in processing federated requests, the ability to send requests to multiple backends in parallel, and the evolution of API conventions to improve functionality.

# Detailed Analysis

## Federated Requests Overview
- According to the **2023-01-10** document, every federated API request involves a service component (like brig or galley) in one backend, which communicates through the *Federator*. The response is relayed back via the *Federator Ingress* in the other backend.
  
## Backend to Backend Communication
- The document marked **Non