# Beyond Retrieval: Adding a Memory Layer to RAG with Unstructured and Mem0

Your RAG system works. It retrieves the right documents, chunks them properly, and generates accurate answers. But here's the problem: it treats every user the same.

Ask it to explain a complex concept, and it gives you the same technical explanation whether you're an expert or a beginner. Query it again the next day about a related topic, and it's forgotten your knowledge level and preferences.

**RAG systems are great at retrieval, but they're terrible at personalization.**

The issue isn't with the documents or the vector search—it's that RAG has no memory of who's asking. Every query starts from scratch. Users have to re-establish their preferences every single time.

What if your RAG system could remember not just the documents it processes, but the users it serves? What if it could adapt its responses based on what it's learned about each user over time?

**That's what we're building in this notebook.**

This notebook implements the concepts from our companion blog post, [Beyond Retrieval: Adding a Memory Layer to RAG](link). If you haven't read it yet, we recommend starting there for a conceptual overview. Here, we'll focus on the hands-on implementation.

We'll take Unstructured's document processing capabilities and layer in Mem0's intelligent memory system. The result? A RAG application that doesn't just retrieve information—it personalizes how that information gets delivered to each user.

By the end of this walkthrough, you'll have built a system that:
- Remembers user preferences across sessions
- Adapts explanations to individual knowledge levels


Throughout this notebook, we'll use research papers on attention mechanisms as our example documents, building an AI assistant that learns how each user prefers to learn.

Let's dive in!

In [None]:
!pip install -U "unstructured-client" mem0ai openai weaviate-client

In [2]:
import os
import time
from google.colab import userdata
from unstructured_client import UnstructuredClient


def pretty_print_model(response_model):
    print(response_model.model_dump_json(indent=4))

Before we can start processing documents, we need to authenticate with Unstructured's API.

If you haven't already, sign up for a free [Unstructured account](https://unstructured.io/?modal=try-for-free). Once you're signed in, you can navigate to the API Keys section in the platform to generate your API key.

Store this key in your Colab secrets as `UNSTRUCTURED_API_KEY`, which we'll use below to authenticate:



In [3]:
os.environ["UNSTRUCTURED_API_KEY"] = userdata.get("UNSTRUCTURED_API_KEY")
client = UnstructuredClient(api_key_auth=os.environ["UNSTRUCTURED_API_KEY"])
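
If you run this notebook outside Colab, `google.colab.userdata` won't be importable. Here's a minimal sketch of a fallback, assuming you export the same secret names as environment variables. The `get_secret` helper is hypothetical and not part of any library used in this notebook:

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab if available, otherwise from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is not defined there.
        pass

    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab secrets or environment variables")
    return value
```

With this in place, each `userdata.get("...")` call in the notebook could be swapped for `get_secret("...")`.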

## Setting up the S3 Source Connector

Now that we have our Unstructured client ready, we need to tell it where to find our research paper.

Source connectors in Unstructured define where your documents live. In this example, we're using Amazon S3 to store the research paper PDF, but Unstructured supports [many other sources](https://docs.unstructured.io/api-reference/workflow/sources/overview) like Google Drive, Azure Blob Storage, and more.

**What you'll need:**

- `AWS_ACCESS`: Your AWS access key ID
- `AWS_SECRET`: Your AWS secret access key  
- `S3_REMOTE_URL`: The S3 URI to your bucket or folder, formatted as `s3://your-bucket-name/` or `s3://your-bucket-name/folder-path/`

Store these in your Colab secrets, and we'll use them to create a source connector that points to the research paper:

In [4]:
os.environ["AWS_ACCESS"] = userdata.get("AWS_ACCESS")
os.environ["AWS_SECRET"] = userdata.get("AWS_SECRET")
os.environ["S3_REMOTE_URL"] = userdata.get("S3_REMOTE_URL")
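
A malformed `S3_REMOTE_URL` is an easy mistake to make, so it can help to check the value before creating the connector. The sketch below is an assumption-laden sanity check, not something Unstructured provides: `is_valid_s3_remote_url` is a hypothetical helper, and it treats the trailing slash as required because both formats shown above end with one.

```python
import re

# Bucket names are 3-63 characters: lowercase letters, digits, dots, and hyphens,
# starting and ending with a letter or digit. Any folder path must end with "/".
S3_URL_PATTERN = re.compile(r"^s3://[a-z0-9][a-z0-9.\-]{1,61}[a-z0-9]/(?:[^/]+/)*$")


def is_valid_s3_remote_url(url: str) -> bool:
    """Return True if the URL looks like s3://bucket/ or s3://bucket/folder/."""
    return bool(S3_URL_PATTERN.match(url))


print(is_valid_s3_remote_url("s3://your-bucket-name/"))              # bucket root
print(is_valid_s3_remote_url("s3://your-bucket-name/folder-path/"))  # folder inside the bucket
print(is_valid_s3_remote_url("s3://your-bucket-name"))               # missing trailing slash
```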

In [None]:
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import CreateSourceConnector

source_connector_response = client.sources.create_source(
    request=CreateSourceRequest(
        create_source_connector=CreateSourceConnector(
            name="Memory Layer Demo - Source",
            type="s3",
            config={

                # For AWS access key ID with AWS secret access key authentication:
                "key": os.environ["AWS_ACCESS"],
                "secret": os.environ["AWS_SECRET"],

                "remote_url": os.environ["S3_REMOTE_URL"],
                "recursive": True
            }
        )
    )
)

pretty_print_model(source_connector_response.source_connector_information)

## Setting up the Weaviate Destination Connector

Now we need to configure where our processed data will go. We're using Weaviate as our destination because it's a vector database, which is perfect for storing our document chunks and their embeddings for semantic search.

For this notebook, we're using Weaviate Cloud (WCD), which offers a free tier and handles all the infrastructure for you.

If you don't have a Weaviate instance yet, [create a WCD account here](https://console.weaviate.cloud/) and set up a cluster. You can also check out [Unstructured's Weaviate destination documentation](https://docs.unstructured.io/api-reference/workflow/destinations/weaviate) for additional configuration options.

Once your cluster is ready, grab these credentials:

- `WEAVIATE_CLUSTER_URL`: Your Weaviate Cloud cluster URL
- `WEAVIATE_API_KEY`: The authentication key for your cluster  
- `WEAVIATE_COLLECTION_NAME`: The name of the collection where data will be stored

**Before creating the connector**, you need to set up your Weaviate collection with a minimum schema. The collection needs at least a `record_id` property before Unstructured can write data to it. In the Weaviate UI, add this basic schema to your collection. Weaviate will automatically generate additional properties based on the incoming data from Unstructured.

Store these credentials in your Colab secrets, and we'll use them to create the destination connector:

In [6]:
os.environ["WEAVIATE_CLUSTER_URL"] = userdata.get("WEAVIATE_CLUSTER_URL")
os.environ["WEAVIATE_API_KEY"] = userdata.get("WEAVIATE_API_KEY")
os.environ["WEAVIATE_COLLECTION_NAME"] = userdata.get("WEAVIATE_COLLECTION_NAME")
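
The minimum schema mentioned above can be sketched as a small JSON document in Weaviate's class-definition format. Treat this as an illustration: the collection name `MemoryLayerDemo` is a placeholder, the `text` data type for `record_id` is an assumption, and where you paste or submit the JSON depends on your Weaviate setup.

```python
import json


def minimal_collection_schema(collection_name: str) -> dict:
    """Build a class definition containing only the required record_id property."""
    return {
        "class": collection_name,
        "properties": [
            # Assumed to be a text property; other properties are generated
            # automatically when data arrives.
            {"name": "record_id", "dataType": ["text"]},
        ],
    }


# "MemoryLayerDemo" is a hypothetical name; use your own collection name here.
print(json.dumps(minimal_collection_schema("MemoryLayerDemo"), indent=2))
```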

In [None]:
from unstructured_client.models.operations import CreateDestinationRequest
from unstructured_client.models.shared import CreateDestinationConnector


destination_connector_response = client.destinations.create_destination(
        request=CreateDestinationRequest(
            create_destination_connector=CreateDestinationConnector(
                name="Memory Layer Demo - Destination",
                type="weaviate-cloud",
                config={
                    "cluster_url": os.environ["WEAVIATE_CLUSTER_URL"],
                    "collection": os.environ["WEAVIATE_COLLECTION_NAME"],
                    "api_key": os.environ["WEAVIATE_API_KEY"]
                }
            )
        )
    )

pretty_print_model(destination_connector_response.destination_connector_information)

## Creating the Document Processing Workflow

Now that we have our source (S3) and destination (Weaviate) configured, we need to define **how** the document should be processed. This is where Unstructured's workflow system shines.

A workflow is a pipeline of processing nodes, where each node performs a specific transformation on the document. The data flows through these nodes sequentially, with each step building on the previous one.

For our RAG system, we need to transform raw PDFs into structured, enriched, embedded chunks that can be semantically searched.

Here's the pipeline we'll build for our research papers:

**1. Partitioner** (`hi_res` strategy)
   - Extracts structured content from the PDF
   - Uses object detection models and OCR to improve extraction accuracy on PDFs with embedded text

**2. Image Summarizer** (OpenAI vision model)
   - Generates descriptions for figures and diagrams
   - Converts visual information into text descriptions that become part of the searchable content

**3. Table Summarizer** (Anthropic Claude)
   - Creates natural language summaries of data tables
   - Makes structured data queryable in natural language

**4. Chunker** (`chunk_by_title`)
   - Breaks documents into semantically meaningful pieces
   - Keeps related content together based on document structure (sections under the same heading stay in the same chunk)
   - This preserves context better than arbitrary character splits

**5. Embedder** (`text-embedding-3-large`, served through Azure OpenAI)
   - Generates vector representations of each chunk
   - These embeddings enable semantic search in Weaviate—finding relevant content based on meaning, not just keywords

Each node plays a critical role in making our documents retrieval-ready. Once this workflow completes, your vector database will be populated with structured, enriched and embedded chunks.
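
To make the chunking step more concrete, here is a toy illustration of the idea behind title-based chunking. It is a deliberately simplified sketch, not Unstructured's implementation: `toy_chunk_by_title` is a hypothetical function, and it ignores overlap and the other settings used in the workflow.

```python
from typing import Dict, List


def toy_chunk_by_title(elements: List[Dict[str, str]], max_characters: int = 2048) -> List[str]:
    """Group element texts into chunks.

    A new chunk starts at every Title element, or when adding the next
    element would push the current chunk past max_characters. A single
    element longer than max_characters still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []

    def flush() -> None:
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for element in elements:
        text = element["text"]
        projected_length = len("\n".join(current + [text]))
        if element["type"] == "Title" or projected_length > max_characters:
            flush()
        current.append(text)

    flush()
    return chunks


elements = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Attention lets a model weigh tokens."},
    {"type": "Title", "text": "Method"},
    {"type": "NarrativeText", "text": "We scale the dot products."},
]

for chunk in toy_chunk_by_title(elements):
    print(repr(chunk))
```

Each heading stays attached to the text beneath it, which is the property that makes title-based chunks more self-contained than fixed-size character splits.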

Let's define these nodes:

In [9]:
from unstructured_client.models.shared import (
    WorkflowNode,
    WorkflowType,
)

partition_node = WorkflowNode(
    name="Partitioner",
    subtype="unstructured_api",
    type="partition",
    settings={
        "strategy": "hi_res",
        "extract_image_block_types": ["Image", "Table"],
    }
)

image_summarizer_node = WorkflowNode(
    name="Image summarizer",
    subtype="openai_image_description",
    type="prompter",
    settings={}
)

table_summarizer_node = WorkflowNode(
    name="Table summarizer",
    subtype="anthropic_table_description",
    type="prompter",
    settings={}
)

chunk_node = WorkflowNode(
    name="Chunker",
    subtype="chunk_by_title",
    type="chunk",
    settings={
        "new_after_n_chars": 200,
        "max_characters": 2048,
        "overlap": 50,
        "combine_text_under_n_chars": 200,
        "multipage_sections": True,
        "include_orig_elements": True,
    }
)

embedder_node = WorkflowNode(
    name='Embedder',
    subtype='azure_openai',
    type="embed",
    settings={
        'model_name': 'text-embedding-3-large'
    }
)



response = client.workflows.create_workflow(
    request={
        "create_workflow": {
            "name": "RAG with Memory Layer Workflow",
            "source_id": source_connector_response.source_connector_information.id,
            "destination_id": destination_connector_response.destination_connector_information.id,
            "workflow_type": WorkflowType.CUSTOM,
            "workflow_nodes": [
                partition_node,
                image_summarizer_node,
                table_summarizer_node,
                chunk_node,
                embedder_node,

            ],
        }
    }
)

workflow_id = response.workflow_information.id
pretty_print_model(response.workflow_information)

{
    "created_at": "2025-10-22T13:35:55.149375Z",
    "destinations": [
        "1e5750a3-7990-4607-94cb-7b74b11e2010"
    ],
    "id": "40013e49-53d4-499f-bce0-583375b00ec4",
    "name": "RAG with Memory Layer Workflow",
    "sources": [
        "9e3be798-dbc2-465a-8794-1cb3957b99e9"
    ],
    "status": "active",
    "workflow_nodes": [
        {
            "name": "Table summarizer",
            "subtype": "anthropic_table_description",
            "type": "prompter",
            "id": "18d0aad4-ef9b-4fde-8059-358ec4c9b327",
            "settings": {
                "model": "claude-sonnet-4-20250514"
            }
        },
        {
            "name": "Image summarizer",
            "subtype": "openai_image_description",
            "type": "prompter",
            "id": "cce19b2e-f6b8-47fd-a455-1df715c43d00",
            "settings": {
                "model": "gpt-4o"
            }
        },
        {
            "name": "Embedder",
            "subtype": "azure_openai",
    

## Running the workflow

Now that we've defined how we want to process our research paper, let's start the workflow and wait for it to complete:

In [10]:
res = client.workflows.run_workflow(
    request={
        "workflow_id": workflow_id,
    }
)

pretty_print_model(res.job_information)

{
    "created_at": "2025-10-22T13:35:56.439961Z",
    "id": "fd00503c-1ff3-481e-ade2-0fcdfb99e569",
    "status": "SCHEDULED",
    "workflow_id": "40013e49-53d4-499f-bce0-583375b00ec4",
    "workflow_name": "RAG with Memory Layer Workflow",
    "job_type": "ephemeral"
}


In [11]:
response = client.jobs.list_jobs(
    request={
        "workflow_id": workflow_id
    }
)

last_job = response.response_list_jobs[0]
job_id = last_job.id
print(f"job_id: {job_id}")

job_id: fd00503c-1ff3-481e-ade2-0fcdfb99e569


Now that we've created and started a job, we can poll Unstructured's `get_job` endpoint and check its status every 30 seconds until it finishes:

In [12]:
def poll_job_status(job_id, wait_time=30):
    while True:
        response = client.jobs.get_job(
            request={
                "job_id": job_id
            }
        )

        job = response.job_information

        if job.status == "SCHEDULED":
            print(f"Job is scheduled, polling again in {wait_time} seconds...")
            time.sleep(wait_time)
        elif job.status == "IN_PROGRESS":
            print(f"Job is in progress, polling again in {wait_time} seconds...")
            time.sleep(wait_time)
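        else:
            # Any other status is terminal, so report it accurately instead of assuming success
            if job.status == "COMPLETED":
                print("Job is completed")
            else:
                print(f"Job finished with status: {job.status}")
            break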

    return job

job = poll_job_status(job_id)
pretty_print_model(job)

Job is scheduled, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is completed
{
    "created_at": "2025-10-22T13:35:56.439961",
    "id": "fd00503c-1ff3-481e-ade2-0fcdfb99e569",
    "status": "COMPLETED",
    "workflow_id": "40013e49-53d4-499f-bce0-583375b00ec4",
    "workflow_name": "RAG with Memory Layer Workflow",
    "job_type": "ephemeral",
    "runtime": "PT0S"
}
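
The polling loop above runs until the job leaves the `SCHEDULED` and `IN_PROGRESS` states. If you want an upper bound on how long to wait, one option is a generic helper with a timeout. The sketch below is an assumption: `wait_for_terminal_status` is a hypothetical function, it takes any callable that returns the current status, and it assumes the status compares equal to plain strings as in the cell above.

```python
import time
from typing import Callable, Sequence


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    wait_time: float = 30,
    timeout: float = 3600,
    pending: Sequence[str] = ("SCHEDULED", "IN_PROGRESS"),
) -> str:
    """Poll fetch_status until it returns a non-pending status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status not in pending:
            return status
        # Stop if the next sleep would take us past the deadline.
        if time.monotonic() + wait_time > deadline:
            raise TimeoutError(f"Job still {status} after {timeout} seconds")
        time.sleep(wait_time)


# In this notebook, the callable could wrap the get_job call used above:
# final_status = wait_for_terminal_status(
#     lambda: client.jobs.get_job(request={"job_id": job_id}).job_information.status
# )
```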


At this point, we've completed all the foundational steps:

- Extracted structured elements from raw documents using a **Partitioner**
- Generated descriptions for images and tables using **Enrichments**
- Organized the content into semantically meaningful chunks with a **Chunker**
- Generated vector embeddings for those chunks through an **Embedder**

Our processed data is now stored in Weaviate, ready for retrieval.

Next, we'll build the query system and add a memory layer on top. This is where we transform a standard RAG pipeline into one that remembers and adapts to each user.

## Querying with Memory

Your documents are now successfully processed and sitting in Weaviate. Now comes the interesting part: building a query system that doesn't just retrieve information, but learns about each user and adapts to them.

This is where we add [**Mem0**](https://mem0.ai/) to create a memory layer on top of our RAG system. You can follow along with the complete implementation in the companion Colab notebook.

**The architecture is straightforward.** When a user asks a question, we:

1. Generate an embedding for their query
2. Search Weaviate for relevant document chunks
3. Check Mem0 for what we know about this specific user
4. Generate a response that's grounded in the documents but personalized to the user's preferences

**The difference?** Without memory, every user gets the same explanation for the same question. With memory, a beginner gets a simple explanation with examples, while an expert gets a technical deep dive. A user who prefers bullet points gets bullet points. Someone who wants markdown formatting gets markdown. And none of them have to repeat these preferences.

Let's set up our connections and build the retrieval functions.

In [13]:
import weaviate
from openai import OpenAI
from mem0 import MemoryClient



We'll need a Mem0 account to add the memory layer to our RAG system.

If you don't already have one, [create a Mem0 account here](https://app.mem0.ai/). Once you're logged in, navigate to the **API Keys** section in the dashboard and generate a new key. You'll need it to authenticate with Mem0's platform.

Store this key in your Colab secrets as `MEM0_API_KEY`, which we'll use below:

In [14]:
os.environ["MEM0_API_KEY"] = userdata.get("MEM0_API_KEY")

We'll also need an OpenAI API key for generating query embeddings and powering the LLM responses. If you don't have one, you can get it from [OpenAI's platform](https://platform.openai.com/api-keys).


In [15]:
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

Now let's connect to all three clients (OpenAI, Weaviate, and Mem0):

In [21]:
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

weaviate_client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_CLUSTER_URL"],
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"])
)

collection = weaviate_client.collections.get(os.environ["WEAVIATE_COLLECTION_NAME"])

memory = MemoryClient(
    api_key=os.environ["MEM0_API_KEY"],
)

EMBEDDING_MODEL = "text-embedding-3-large"
LLM_MODEL = "gpt-4o"
TOP_K = 5
USER_ID = "tutorial_user"

Before we add memory, let's set up the core retrieval functions. These handle the standard RAG operations:
- embedding queries,
- searching Weaviate for similar chunks,
- formatting the retrieved context.


In [22]:
def get_embedding(text: str):
    """Generate embedding using OpenAI's text-embedding-3-large"""
    response = openai_client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=text
    )
    return response.data[0].embedding


def retrieve_from_weaviate(query: str, limit: int = TOP_K):
    """
    Retrieve relevant chunks from Weaviate using vector search
    Returns: list of dicts, each with a 'content' key holding the chunk text
    """
    # Generate query embedding
    query_vector = get_embedding(query)

    # Vector search in Weaviate
    results = collection.query.near_vector(
        near_vector=query_vector,
        limit=limit,
        return_metadata=weaviate.classes.query.MetadataQuery(distance=True)
    )

    # Extract relevant info
    retrieved_docs = []

    for item in results.objects:
        content = item.properties.get('text', '')

        retrieved_docs.append({
            'content': content,
        })

    return retrieved_docs


def format_context(retrieved_docs):
    """Format retrieved documents into context string"""
    context = "\n\n".join(
        f"\n{doc['content']}"
        for doc in retrieved_docs
    )
    return context
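
To build some intuition for what `near_vector` is doing, here is a toy version of vector search in plain Python. It is only a sketch: it assumes the collection uses cosine distance (Weaviate's default), it scans every vector instead of using an index, and `cosine_distance` and `top_k` are hypothetical helpers with made-up two-dimensional vectors.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query_vector: Sequence[float],
    documents: List[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank documents by distance to the query (smallest first) and keep the first k."""
    scored = [(text, cosine_distance(query_vector, vector)) for text, vector in documents]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up vectors standing in for real embeddings.
documents = [
    ("chunk about attention", [1.0, 0.0]),
    ("chunk about tokenizers", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]

for text, distance in top_k([1.0, 0.0], documents, k=2):
    print(f"{distance:.3f}  {text}")
```

A smaller distance means the chunk points in nearly the same direction as the query embedding, which is what "similar in meaning" translates to in vector space.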


Let's start by implementing a vanilla RAG query function consisting of just retrieval and generation. This serves as our baseline to demonstrate what changes when we add the memory layer.

This function retrieves relevant chunks from Weaviate and generates an answer based purely on the retrieved context. Every user gets the same response for the same question.

In [171]:
def query_without_memory(question: str):
    """
    Query the documents without using memory
    Returns generic response based only on retrieved context
    """
    print(f"\n{'='*80}")
    print(f"Question: {question}")
    print(f"{'='*80}\n")

    retrieved_docs = retrieve_from_weaviate(question)

    context = format_context(retrieved_docs)

    response = openai_client.chat.completions.create(
        model=LLM_MODEL,
        temperature=0.3,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful AI assistant. Answer the question based on the provided context."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
            }
        ]
    )

    answer = response.choices[0].message.content

    print(f"Answer:\n{answer}")
    print(f"\n{'='*80}\n")

    return answer




In [172]:
q1 = "What are the key innovations in the Attention Is All You Need paper?"
response_q1 = query_without_memory(q1)

q2 = "How does sparse attention work compared to standard attention?"
response_q2_no_memory = query_without_memory(q2)


Question: What are the key innovations in the Attention Is All You Need paper?

Answer:
The "Attention Is All You Need" paper introduces several key innovations:

1. **Transformer Architecture**: The paper proposes the Transformer, a novel model architecture that relies entirely on attention mechanisms, specifically self-attention, to compute representations of input and output sequences. This is a departure from traditional models that use recurrent or convolutional neural networks.

2. **Self-Attention Mechanism**: The Transformer uses self-attention to relate different positions of a single sequence to compute a representation of the sequence. This allows the model to draw global dependencies between input and output without regard to their distance in the sequence.

3. **Multi-Head Attention**: To counteract the potential loss of resolution due to averaging in self-attention, the paper introduces multi-head attention, which allows the model to focus on different parts of the input

### Adding the Memory Layer

Now let's implement the memory-enabled version. This function works identically to our vanilla RAG but with one critical addition: it checks Mem0 for stored information about the user.

**Here's how it works:**

When `use_memory=True`, the function:

1. **Retrieves from Weaviate** (standard RAG retrieval)
2. **Queries Mem0** for any stored memories about this user
3. **Injects user preferences** into the system prompt if memories exist
4. **Generates the response** with both document context and user context
5. **Updates memory** by sending the user's message to Mem0 with instructions to extract and store relevant preferences

The key difference is in the system prompt. Without memory, every user gets: `"You are a helpful AI assistant. Answer based on the provided context."`

With memory, users get: `"You are a helpful AI assistant. Answer based on the context and user preferences. User Preferences: [their specific preferences]"`

The `memory.add()` call at the end is what makes the system learn over time. We send the user's message to Mem0 with specific instructions about what to extract (format preferences, knowledge level, and learning style), and Mem0 uses its own LLM to parse and store that information for future queries.




In [162]:
def query_papers(question: str, use_memory: bool = False):
    """
    Query papers with automatic memory management
    """
    # Retrieve from Weaviate
    query_vector = get_embedding(question)
    results = collection.query.near_vector(
        near_vector=query_vector,
        limit=TOP_K
    )

    retrieved_docs = []
    for item in results.objects:
        retrieved_docs.append({
            'content': item.properties.get('text', ''),
        })


    # Join the retrieved chunks into a single context string
    context = "\n\n".join(d['content'] for d in retrieved_docs)

    # Build system prompt
    if use_memory:
        # Retrieve memories with filters
        filters = {"AND": [{"user_id": USER_ID}]}
        memory_response = memory.search(query=question, filters=filters, version="v2")

        memories = memory_response.get('results', [])

        if len(memories) > 0:
            memories_text = "\n".join([f"- {m.get('memory', '')}" for m in memories])
            system_prompt = f"""You are a helpful AI assistant. Answer based on the context and user preferences.

User Preferences:
{memories_text}

"""
        else:
            system_prompt = "You are a helpful AI assistant. Answer based on the provided context."
    else:
        system_prompt = "You are a helpful AI assistant. Answer based on the provided context."

    # Generate response
    response = openai_client.chat.completions.create(
        model=LLM_MODEL,
        temperature=0.3,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )

    answer = response.choices[0].message.content

    # Update memory, passing extraction instructions for Mem0's LLM as a system message.
    # Only the user's message is sent, so nothing from the generated answer is stored.
    if use_memory:
        memory_instructions = """Extract and store ONLY: response output format preferences (which formatting the model response should be in), knowledge baseline (what user knows), and learning style.
    Do NOT store anything from the questions user asks, store ONLY from preferences user EXPLICITLY states."""

        memory.add(
            messages=[
                {"role": "system", "content": memory_instructions},
                {"role": "user", "content": question},
            ],
            user_id=USER_ID,
            metadata={"category": "preferences"}
        )

    return answer
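
The prompt-building branch inside `query_papers` can also be pulled out into a small pure function, which makes the personalization logic easy to test without calling any API. This is a sketch of one possible refactor, not a change to the function above: `build_system_prompt` and `BASE_PROMPT` are hypothetical names, and the input is assumed to be a list of dicts with a `memory` key, matching how the search results are read above.

```python
from typing import Dict, List

BASE_PROMPT = "You are a helpful AI assistant. Answer based on the provided context."


def build_system_prompt(memories: List[Dict[str, str]]) -> str:
    """Return the base prompt, extended with user preferences when any exist."""
    # Skip entries with missing or empty memory text.
    memory_lines = [f"- {m['memory']}" for m in memories if m.get("memory")]
    if not memory_lines:
        return BASE_PROMPT
    return (
        "You are a helpful AI assistant. Answer based on the context and user preferences.\n\n"
        "User Preferences:\n" + "\n".join(memory_lines) + "\n"
    )


print(build_system_prompt([]))
print(build_system_prompt([{"memory": "User prefers responses in markdown format"}]))
```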

Let's see the difference in action. We'll start by querying without memory to establish our baseline, then demonstrate how memory changes the interaction.

In [163]:
print(f"Q1: {q1}\n")
response_q1 = query_papers(q1, use_memory=False)
print(f"\nAnswer:\n{response_q1}\n")

Q1: What are the key innovations in the Attention Is All You Need paper?


Answer:
The "Attention Is All You Need" paper introduces several key innovations:

1. **Transformer Architecture**: The paper proposes the Transformer model, which relies entirely on attention mechanisms, eliminating the need for recurrence and convolution. This allows for more efficient parallelization and faster training times.

2. **Self-Attention Mechanism**: The Transformer uses self-attention to compute representations of input and output sequences. This mechanism relates different positions of a single sequence to compute a representation of the sequence, enabling the model to capture dependencies regardless of their distance.

3. **Multi-Head Attention**: To counteract the reduced effective resolution due to averaging attention-weighted positions, the paper introduces Multi-Head Attention. This allows the model to focus on different parts of the sequence simultaneously, improving its ability to capture complex dep

In [164]:
print("Q2: How does sparse attention work compared to standard attention?\n")
response_q2_no_memory = query_papers(q2, use_memory=False)
print(f"\nAnswer:\n{response_q2_no_memory}\n")


Q2: How does sparse attention work compared to standard attention?


Answer:
Sparse attention works by selectively focusing on a subset of the input data, rather than attending to all possible pairs of positions as in standard attention. This approach reduces the computational complexity and memory usage, which are typically quadratic in the sequence length for standard attention.

In sparse attention, only certain positions in the input sequence are attended to, based on a predefined sparsity pattern. This can be achieved through various methods, such as block-sparse patterns, strided patterns, or fixed patterns, where specific cells summarize previous locations and propagate that information to future cells. These patterns allow for reduced HBM (High Bandwidth Memory) accesses, leading to faster execution and lower memory footprint.

Sparse attention can be implemented using factorized self-attention, where multiple attention heads attend to distinct subblocks of the input, rather th

Now let's test the memory system. This time, we'll include explicit preferences in the query—format preferences, knowledge level, and learning style.

When `use_memory=True`, Mem0 will automatically extract and store these preferences. Future queries won't need to repeat them.

In [165]:
preference_question = """I'd like my responses in markdown format. Explain concepts like I only know
about the attention mechanism and nothing else. I'm willing to learn as long as there
are comparisons and bridges to more complex topics from what I know already.

Now answer: How does sparse attention work compared to standard attention?"""

print(f"User: {preference_question}\n")
response_q2_with_memory = query_papers(preference_question, use_memory=True)
print(f"\nAnswer:\n{response_q2_with_memory}\n")


User: I'd like my responses in markdown format. Explain concepts like I only know
about the attention mechanism and nothing else. I'm willing to learn as long as there
are comparisons and bridges to more complex topics from what I know already.

Now answer: How does sparse attention work compared to standard attention?


Answer:
Certainly! Let's break down the concept of sparse attention compared to standard attention, using the attention mechanism as our starting point.

### Standard Attention

In a standard attention mechanism, every element in a sequence (like words in a sentence) attends to every other element. This means that for a sequence of length \( n \), the attention mechanism computes \( n \times n \) interactions. This is often visualized as a full grid or matrix where each element is connected to every other element. This approach can be computationally expensive and memory-intensive, especially for long sequences, because the operations scale quadratically with the seque

The response came back exactly as requested—markdown format, starting from basic attention concepts, with comparisons bridging to more complex ideas.

Now let's test with a completely different question, without repeating any of those preferences.

The queries below don't mention formatting preferences, knowledge level, or explanation styles. Let's see what happens:

In [170]:
print("Q3: What is Flash Attention?\n")
q3 = "What is Flash Attention?"
response_q3 = query_papers(q3, use_memory=True)
print(f"\nAnswer:\n{response_q3}\n")

Q3: What is Flash Attention?


Answer:
FlashAttention is an advanced attention algorithm designed to enhance the efficiency of the attention mechanism in Transformers, particularly by addressing the bottleneck caused by memory accesses. Here's a breakdown of its key features and how it compares to standard attention:

1. **Memory Access Optimization**: 
   - FlashAttention reduces the number of memory reads and writes, which are the primary factors affecting runtime in GPU computations. By minimizing these memory accesses, FlashAttention achieves faster performance compared to standard attention.

2. **Block Processing and Tiling**:
   - The algorithm splits the input into blocks and processes these blocks incrementally. This approach allows for the computation of the softmax reduction without needing access to the entire input at once, thus avoiding the storage of large intermediate attention matrices.

3. **Efficient Backward Pass**:
   - Unlike some other methods that recompute both

In [167]:
print("Q4: Compare standard, sparse, and Flash Attention. What are the trade-offs?\n")
q4 = "Compare standard, sparse, and Flash Attention. What are the trade-offs?"
response_q4 = query_papers(q4, use_memory=True)
print(f"\nAnswer:\n{response_q4}\n")

Q4: Compare standard, sparse, and Flash Attention. What are the trade-offs?


Answer:
When comparing standard, sparse, and FlashAttention, several trade-offs emerge in terms of speed, memory efficiency, and scalability:

1. **Standard Attention:**
   - **Speed:** Standard attention is generally slower due to higher memory access requirements. It requires Θ(𝑁𝑑 + 𝑁²) HBM accesses, which can be a bottleneck for runtime.
   - **Memory Efficiency:** It has a larger memory footprint compared to FlashAttention, making it less efficient for longer sequences.
   - **Scalability:** Standard attention struggles to scale efficiently to very long sequences due to its quadratic growth in both runtime and memory usage.

2. **Sparse Attention:**
   - **Speed:** Sparse attention mechanisms can be faster than standard attention for certain sequence lengths because they reduce the number of computations by focusing only on a subset of the attention matrix.
   - **Memory Efficiency:** Sparse attention can

The system remembered. Without repeating any preferences, the response automatically came back in markdown format, explained at a beginner level starting from attention mechanisms, and used comparisons throughout.

The memory layer is working: Mem0 retrieved the stored preferences and applied them to these questions.

Let's check what's actually stored in the memory layer:

In [169]:
print("Final Memory State - What Mem0 Learned About You")
print("="*80 + "\n")

filters = {"AND": [{"user_id": USER_ID}]}
all_memories = memory.get_all(filters=filters)['results']
print(f"Total memories stored: {len(all_memories)}\n")
for idx, m in enumerate(all_memories):
    print(f"{idx}. {m.get('memory', '')}")

Final Memory State - What Mem0 Learned About You

Total memories stored: 3

0. User is willing to learn as long as there are comparisons and bridges to more complex topics
1. User only knows about the attention mechanism
2. User prefers responses in markdown format


Notice that what's stored is exactly what we instructed Mem0 to extract. The individual questions you asked (about sparse attention, Flash Attention, trade-offs) aren't in there. Those were just queries, not preferences.

This is the key difference. The memory layer isn't recording your conversation history. It's distilling the patterns that matter for personalization.

Here's what happened across the demo:

**Q1 (baseline)**: Asked about the base paper without memory → generic response

**Q2 (set preferences)**: Stated your preferences once in the query itself → Mem0 extracted and stored them

**Q3, Q4**: Asked completely different questions without mentioning preferences → system automatically applied them every time

The result? You stated your preferences once, and the system remembered them across multiple questions. No repetition needed.

This is what makes memory-enabled RAG different from standard RAG. Same documents, same retrieval, same LLM, but the delivery adapts to each user.

**Ready to build your own memory-enabled RAG system?**

Swap in your own documents, customize what memories to extract, and see how personalization changes your AI's responses.

If you're working with enterprise data at scale, [sign up for Unstructured](https://unstructured.io/?modal=try-for-free) to access the full platform with workflow scheduling, multiple source connectors, and production-grade processing.
