# Everything (from) Everywhere All at Once - RAG from Multiple Data Sources and Multiple File Types

This notebook is complementary to [this blog post](https://unstructured.io/blog/everything-from-everywhere-all-at-once-enterprise-rag-with-multiple-sources-and-filetypes). The blog walks through setting up Unstructured connectors, building the preprocessing workflow, and getting your data from Azure Blob Storage, OneDrive, and Outlook into Astra DB.

At this point, if you've followed the previous sections from the blog, you now have a fully functioning Unstructured pipeline that connects to your enterprise data — Outlook threads, OneDrive decks, Azure-stored PDFs — and processes them into clean, enriched, and embedded chunks inside AstraDB.

Now it’s time to switch gears and make that data useful.

We're going to set up a simple RAG pipeline using LangChain that can query the processed content stored in AstraDB. The goal is to retrieve relevant context across emails, slides, and contracts, then pass it to an LLM to generate grounded answers.

Let’s start with some lightweight setup:

In [2]:
!pip install -q astrapy openai langchain langchain-openai langchain-community


In [3]:
import os
from google.colab import userdata
from astrapy import DataAPIClient
from openai import OpenAI
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


### Connecting to AstraDB and Setting Up Your Models


With our Unstructured pipeline already pushing enriched content into AstraDB, the next step is to connect our notebook to that vector store and configure the models we’ll use for retrieval and generation.

First, load your credentials from Colab Secrets and establish the AstraDB and OpenAI clients:

In [4]:
os.environ['ASTRA_DB_API_ENDPOINT'] = userdata.get('ASTRA_DB_API_ENDPOINT')
os.environ['ASTRA_DB_APPLICATION_TOKEN'] = userdata.get('ASTRA_DB_APPLICATION_TOKEN')
os.environ['ASTRA_DB_COLLECTION_NAME'] = userdata.get('ASTRA_DB_COLLECTION_NAME')
os.environ['ASTRA_DB_KEYSPACE'] = userdata.get('ASTRA_DB_KEYSPACE')


In [5]:
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
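A missing or blank secret tends to surface later as a confusing client error, so it can help to check the settings up front. The sketch below is an optional helper, not part of the original notebook: `REQUIRED_SETTINGS` and `missing_settings` are hypothetical names, and a plain dict stands in for `os.environ` so the example runs anywhere.

```python
# Hypothetical helper: list the settings this notebook expects to find.
REQUIRED_SETTINGS = [
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_COLLECTION_NAME",
    "ASTRA_DB_KEYSPACE",
    "OPENAI_API_KEY",
]


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the names in `required` that are absent or blank in `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]


# A plain dict stands in for os.environ; the values are placeholders.
example_env = {
    "ASTRA_DB_API_ENDPOINT": "https://example.invalid",
    "OPENAI_API_KEY": "",
}
print(missing_settings(example_env))
```

In the notebook itself, you could call `missing_settings(os.environ)` after loading the secrets and stop early if the returned list is not empty.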

In [6]:
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

astradb_client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
database = astradb_client.get_database(os.environ["ASTRA_DB_API_ENDPOINT"])
COLLECTION = database.get_collection(
    name=os.environ["ASTRA_DB_COLLECTION_NAME"],
    keyspace=os.environ["ASTRA_DB_KEYSPACE"]
)

EMBEDDING_MODEL = "text-embedding-3-large"
LLM_MODEL = "gpt-4o"
TOP_K = 5


### Building a Simple Retriever

Let’s wire up a lightweight retrieval function that runs a semantic similarity search over your AstraDB collection using OpenAI’s `text-embedding-3-large` model.

We’ll define two things:

- A helper to embed queries
- An `enhanced_retriever` that pulls relevant content and shows which files it came from


In [7]:
def get_embedding(text: str):
    """Generate embedding using OpenAI's text-embedding-3-large model"""
    response = openai_client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=text
    )
    return response.data[0].embedding


def enhanced_retriever(query: str, n: int = TOP_K) -> str:
    """Retrieve the top-n chunks and return them as one source-tagged context string."""
    embedding = get_embedding(query)
    # Vector search: rank stored chunks by similarity to the query embedding.
    results = COLLECTION.find(sort={"$vector": embedding}, limit=n)

    retrieved_docs = []
    for doc in results:
        retrieved_docs.append({
            "content": doc.get("content", ""),
            # The filename is stored under a nested "metadata" key in this collection.
            "source": doc["metadata"]["metadata"]["filename"]
        })

    # Collapse duplicate filenames before printing the list of sources.
    sources = {d["source"] for d in retrieved_docs}
    print(f"Retrieved from: {', '.join(sources)}")

    # Prefix each chunk with its source so the LLM can see where it came from.
    context = "\n".join(f"[Source: {d['source']}]\n{d['content']}" for d in retrieved_docs)
    return context
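To make the similarity search a little less of a black box, here is a minimal offline sketch of ranking by cosine similarity, the kind of comparison a vector search performs. The three-dimensional vectors and filenames are made up for illustration, and the metric your own Astra DB collection uses depends on how it was configured, so treat cosine here as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=2):
    """Return the k docs whose vectors are most similar to query_vec."""
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(query_vec, d["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: real embeddings have thousands of dimensions.
toy_docs = [
    {"source": "report.pdf", "vector": [1.0, 0.0, 0.0]},
    {"source": "email.eml", "vector": [0.0, 1.0, 0.0]},
    {"source": "deck.ppt", "vector": [0.7, 0.7, 0.0]},
]
hits = top_k([0.9, 0.1, 0.0], toy_docs, k=2)
print([d["source"] for d in hits])  # ['report.pdf', 'deck.ppt']
```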

### Setting Up the LLM and Running the RAG Chain

With the retriever in place, we can now hook it up to a large language model using a simple LangChain prompt template. This is where the final response gets crafted while being grounded in the enterprise data we pulled from AstraDB.

We’ll use `ChatOpenAI` with a low temperature and a prompt that encourages synthesis across multiple sources:


In [None]:
llm = ChatOpenAI(
    model_name=LLM_MODEL,
    temperature=0.3,
    openai_api_key=os.environ["OPENAI_API_KEY"]
)

prompt_template = """You are an AI assistant with access to multiple enterprise documents including financial reports,
inventory data, business presentations, and customer communications.

Use the following context to answer the question. If the answer requires data from multiple sources,
synthesize the information appropriately.

Context:
{context}

Question: {question}

Please provide a comprehensive answer based on the available information. If specific numbers or data points
are mentioned in the context, include them in your response.

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
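To see what the model actually receives, it can help to render a filled-in prompt by hand. The sketch below uses plain `str.format` on a shortened copy of the template as a stand-in for `PromptTemplate`; the context string and question are invented placeholders, not data from the collection.

```python
# Shortened copy of the prompt, keeping the same two placeholders.
template = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

# Placeholder context in the "[Source: ...]" shape the retriever produces.
context = "[Source: report.pdf]\nRevenue grew in Q3."

rendered = template.format(
    context=context,
    question="How did revenue change in Q3?",
)
print(rendered)
```

Because each chunk arrives with its `[Source: ...]` tag inside the context, the model can attribute figures to specific files when it writes the answer.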



In [None]:
from langchain.chains import LLMChain

def create_rag_chain():
    return LLMChain(llm=llm, prompt=PROMPT)

rag_chain = create_rag_chain()

def ask_question(question: str):
    """Main function to ask questions against the RAG system"""
    print(f"\n{'='*80}")
    print(f" Question: {question}")
    print(f"{'='*80}")

    context = enhanced_retriever(question)

    response = rag_chain.invoke({
        "context": context,
        "question": question
    })

    print(f"\n Answer:\n{response['text']}")
    print(f"{'='*80}\n")



When you ask questions like the ones below, here’s what’s really happening behind the scenes:

- **Your question is embedded** using OpenAI’s `text-embedding-3-large` model.
- **AstraDB vector search retrieves the top chunks** of content most relevant to your query.
- Those chunks can come from **anywhere**: a PDF financial report, an Outlook email, or a PowerPoint slide deck. The retriever tags each one with its original source.
- **The LLM synthesizes the answer**, combining numbers, names, and context from multiple files into one grounded response.

This is why you see outputs such as:

- **Q3 Revenue** pulled directly from a PDF financial report  
- **Customer complaints** summarized from an Outlook email  
- **Hiring plans** extracted from both a PowerPoint deck and a PDF report  

All of this happens automatically, with no extra work on your part.
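The retrieval steps described above can be sketched end to end without any network calls. Everything in this example is a stand-in: `toy_embed` replaces the embedding model with simple word counts, `CHUNKS` replaces the AstraDB collection with three invented snippets, and the final LLM call is left out. It only illustrates the flow from question to source-tagged context.

```python
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def overlap(a, b):
    """Stand-in for vector similarity: number of shared word occurrences."""
    return sum((a & b).values())


# Invented chunks standing in for documents stored in the vector database.
CHUNKS = [
    {"source": "report.pdf", "content": "total revenue for the quarter"},
    {"source": "email.eml", "content": "support case response was delayed"},
    {"source": "deck.ppt", "content": "hiring plans for engineers"},
]


def retrieve(question, k=1):
    """Embed the question and return the k best-matching chunks."""
    q = toy_embed(question)
    ranked = sorted(
        CHUNKS,
        key=lambda c: overlap(q, toy_embed(c["content"])),
        reverse=True,
    )
    return ranked[:k]


def build_context(chunks):
    """Tag each chunk with its source file, as the notebook's retriever does."""
    return "\n".join(f"[Source: {c['source']}]\n{c['content']}" for c in chunks)


question = "what was the total revenue"
context = build_context(retrieve(question))
print(context)
```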

In [None]:
ask_question("What was TechVision's total revenue in Q3 2024?")



 Question: What was TechVision's total revenue in Q3 2024?
Retrieved from: Q3 2024 Financial Report - TechVision Industries.pdf

 Answer:
TechVision Industries' total revenue in Q3 2024 was $847.3 million. This figure represents a 23% year-over-year growth compared to the same period in 2023. The revenue growth was driven by strong performance across various business segments, including a notable 41% increase in recurring revenue from the cloud services division.



The magic here is that Unstructured doesn’t treat each file type differently at retrieval time.  
Instead, it converts every incoming file — PDF, PowerPoint, Excel, email — into a **unified document model** of structured elements enriched with metadata.

Each element carries:
- Its **type** (text, table, image),
- The **source filename**,
- Any **enrichments** like image captions or table descriptions,
- And its **position** within the original document.

Because of this standardization, your retrieval pipeline only has to work with one consistent format, even though the underlying data spans multiple systems and file types.
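As a rough illustration of why one consistent format simplifies retrieval, the sketch below models an element with a small dataclass. The `Element` class and its field names are hypothetical simplifications for this example, not Unstructured's actual schema, and the sample values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical, simplified stand-in for one processed element."""
    type: str        # e.g. "text", "table", "image"
    text: str
    filename: str
    position: int    # order within the original document
    enrichments: dict = field(default_factory=dict)


def to_context_line(el: Element) -> str:
    """Format any element the same way, whatever file type it came from."""
    extra = el.enrichments.get("description", "")
    body = f"{el.text} {extra}".strip()
    return f"[Source: {el.filename} | {el.type} #{el.position}] {body}"


elements = [
    Element("text", "Quarterly summary paragraph.", "report.pdf", 0),
    Element("table", "", "inventory.csv", 3, {"description": "Units sold per product."}),
    Element("image", "", "deck.ppt", 7, {"description": "Chart of hiring targets."}),
]
for el in elements:
    print(to_context_line(el))
```

The single `to_context_line` function handles a PDF paragraph, a CSV table, and a slide image alike, which is the practical payoff of normalizing everything before it reaches the vector store.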



In [None]:
ask_question("What are Jennifer Martinez's main complaints about support?")


 Question: What are Jennifer Martinez's main complaints about support?
Retrieved from: 98729709ec5679da.eml

 Answer:
Jennifer Martinez, the VP of Information Technology at Global Finance Corp, has several main complaints about the support received from TechVision for their CloudSync Pro Enterprise product:

1. **Lack of Communication**: Jennifer is extremely concerned about the lack of communication regarding known issues with version 3.2.1. She notes that similar complaints have been observed on community forums since September 28, yet no advisory was issued to enterprise customers, which she finds unacceptable.

2. **Delayed Response**: There has been a significant delay in response to their support case (#CS-2024-10-08-4573), with no response received for 72 hours after opening the case. This delay is critical given the severity of the issues they are experiencing.

3. **System Instability and Impact**: The issues with CloudSync Pro Enterprise are causing significant operational d

In [None]:
ask_question("How many new engineers does the company plan to hire?")


 Question: How many new engineers does the company plan to hire?
Retrieved from: enterprise-qbr-ppt.ppt, Q3 2024 Financial Report - TechVision Industries.pdf

 Answer:
The company plans to hire 500 new engineers as part of their efforts to accelerate hiring in Research and Development (R&D). This is one of the key focus areas outlined in the Q4 strategic initiatives, as mentioned in the enterprise quarterly business review presentation.



In [None]:
ask_question("What's the total units sold across all products?")


 Question: What's the total units sold across all products?
Retrieved from: Product_Inventory.csv, enterprise-qbr-ppt.ppt, Q3 2024 Financial Report - TechVision Industries.pdf

 Answer:
To determine the total units sold across all products, we need to sum up the "Units Sold" figures from the provided inventory data. Here's the breakdown:

1. CloudSync Pro: 2,495 units
2. DataShield Security Suite: 1,410 units
3. UltraBook Pro 15: 1,898 units
4. Analytics Dashboard Pro: 750 units
5. TeamCollab Suite: 4,240 units
6. NetworkHub Pro 48-Port: 284 units
7. SecureRouter Enterprise: 595 units
8. AI Assistant Professional: 3,130 units

Adding these figures together gives us the total units sold:

2,495 + 1,410 + 1,898 + 750 + 4,240 + 284 + 595 + 3,130 = 14,802 units

Therefore, the total units sold across all products is 14,802.



You ask one question.  
Behind the scenes, the system:
- Pulls the most relevant chunks across all sources,
- Shows you where they came from,
- And delivers a single, coherent answer grounded in your enterprise data.

In [None]:
ask_question("Based on the email and financial data, estimate the potential revenue risk from Global Finance Corp")


 Question: Based on the email and financial data, estimate the potential revenue risk from Global Finance Corp
Retrieved from: 98729709ec5679da.eml, Q3 2024 Financial Report - TechVision Industries.pdf

 Answer:
Based on the information provided, Global Finance Corp is a significant customer for TechVision Industries, contributing over $3.2 million annually in licensing revenue. The email from Jennifer Martinez indicates a severe issue that has escalated to a high level, suggesting potential dissatisfaction and risk of losing this customer if the issue is not resolved promptly.

The financial impact of the current issue is estimated at $4.7 million in lost productivity for Global Finance Corp this week, along with compliance risks and a significant increase in internal help desk tickets. These factors highlight the severity of the situation and the potential for long-term implications if not addressed.

Given that Global Finance Corp has been a loyal customer for seven years, the pote

### Conclusion

By now you’ve seen how Unstructured can turn a messy sprawl of enterprise content into a single, queryable knowledge base.  
Contracts from Azure Blob, decks from OneDrive, and email threads from Outlook are all normalized into a unified format, enriched with metadata, and stored as vector embeddings.  
The result is a RAG pipeline that not only answers questions from your data but also shows which files the information came from, regardless of file type or source.

This approach eliminates the headaches of building separate pipelines for every content type. Instead, you get one clean, consistent workflow that scales across your organization’s entire data landscape.

---

### Next Steps

- **Extend your pipeline**: Add the NER Node to enhance the value of your content.
- **Tweak retrieval logic**: Adjust chunk sizes, or add ranking models to fine‑tune what gets surfaced.
- **Experiment with LLMs**: Swap in different models for generation (Claude, GPT‑4o, etc.) to compare style, cost, and accuracy.
- **Add tooling**: Layer on dashboards, logs, or usage analytics to monitor how your RAG assistant performs.

---


Ready to unify your scattered enterprise knowledge?  
Sign up for a [free Unstructured account](https://unstructured.io/?modal=try-for-free), connect your first sources, and try building this workflow yourself.  
You’ll be able to go from unorganized files to a production‑ready, multi‑source RAG pipeline in an afternoon and finally get reliable, explainable answers out of your company’s data.
