Proposal: Add Retrieval Pipeline Abstractions (Microsoft.Extensions.DataRetrieval) #7507

@luisquintanilla

Description


Summary

This issue proposes adding retrieval pipeline abstractions to dotnet/extensions, enabling .NET developers to compose advanced Retrieval-Augmented Generation (RAG) retrieval pipelines using a consistent, pluggable model.

In a RAG application, retrieval is the step that finds relevant information to give to an LLM. These abstractions allow developers to add processing stages around vector search — transforming queries before searching and refining results after — without coupling to any specific vector store, LLM provider, or retrieval strategy.

These packages complement the existing Microsoft.Extensions.DataIngestion (for writing data into vector stores) by providing the symmetric read-side: pre-search query processing, vector search orchestration, and post-search result processing.

Motivation

What is a Retrieval Pipeline?

When a user asks a question in a RAG (Retrieval-Augmented Generation) application, the simplest approach is:

  1. Embed the question as a vector
  2. Find the closest document chunks in a vector store
  3. Pass those chunks to an LLM as context
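The three steps above amount to a single similarity search. A rough sketch in C#, with method names and signatures simplified for illustration (the exact Microsoft.Extensions.AI and Microsoft.Extensions.VectorData shapes may differ):

    // Naive RAG retrieval: embed → nearest-neighbor search → prompt context.
    // Assumes an embedding generator and a vector store collection are already
    // configured; SearchAsync usage is illustrative, not the exact MEVD signature.
    IEmbeddingGenerator<string, Embedding<float>> embedder = /* ... */;
    VectorStoreCollection<string, DocChunk> collection = /* ... */;

    var embeddings = await embedder.GenerateAsync(["How do I configure retention?"]);
    var queryVector = embeddings[0].Vector;

    await foreach (var match in collection.SearchAsync(queryVector, top: 5))
    {
        promptContext.AppendLine(match.Record.Content);   // feed top chunks to the LLM
    }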

This works for simple cases, but falls apart quickly:

  • Ambiguous queries — "How do I configure it?" matches poorly because embeddings don't know what "it" refers to
  • Vocabulary mismatch — The user says "set up auth" but the docs say "configure identity providers" — semantically similar but far apart in vector space
  • Noise in results — The top-5 nearest vectors often include irrelevant chunks that happen to share keywords
  • No quality signal — There's no way to know if the retrieved chunks actually answer the question before passing them to the LLM

A retrieval pipeline solves these by adding processing stages around the vector search:

[User Query] → [Query Processors] → [Vector Search] → [Result Processors] → [Final Results]
               ↑ expand, rewrite,    ↑ the actual      ↑ rerank, filter,
                 or augment the        database           validate relevance
                 query before          lookup
                 searching

Query processors (pre-search) transform the query to improve search quality. Result processors (post-search) refine what comes back. The pipeline orchestrates both around any vector store.

Why Abstractions?

Today, .NET developers implementing these patterns must:

  1. Write custom orchestration code for each retrieval strategy
  2. Tightly couple their logic to a specific vector store (Azure AI Search, Qdrant, Pinecone, etc.)
  3. Re-implement common patterns from research papers with no shared building blocks

The existing Microsoft.Extensions.DataIngestion packages solved the write-side (document → chunks → vector store). Microsoft.Extensions.DataRetrieval solves the read-side (query → search → process → results) with the same philosophy: thin abstractions that enable a rich ecosystem.

Symmetry with DataIngestion

Concern              | Write-Side                          | Read-Side
Package              | Microsoft.Extensions.DataIngestion  | Microsoft.Extensions.DataRetrieval
Pipeline             | IngestionPipeline<T>                | RetrievalPipeline
Pre-step processors  | IngestionDocumentProcessor          | RetrievalQueryProcessor
Post-step processors | IngestionChunkProcessor<T>          | RetrievalResultProcessor
Primary method       | ProcessAsync                        | ProcessAsync
Data flow            | Documents → Chunks → Vector Store   | Query → Search → Ranked Results

Developers familiar with one side immediately understand the other.

Proposed API Surface

Package: Microsoft.Extensions.DataRetrieval.Abstractions

namespace Microsoft.Extensions.DataRetrieval;

// Core data types
public sealed class RetrievalQuery
{
    public RetrievalQuery(string text);
    public string Text { get; }
    public IList<string> Variants { get; set; }
    public IDictionary<string, object?> Metadata { get; }
}

public sealed class RetrievalChunk
{
    public RetrievalChunk(string content, double score);
    public string Content { get; }
    public double Score { get; set; }
    public IDictionary<string, object?> Record { get; }
}

public sealed class RetrievalResults
{
    public IList<RetrievalChunk> Chunks { get; set; }
    public IDictionary<string, object?> Metadata { get; }
}

// Processor abstractions
public abstract class RetrievalQueryProcessor
{
    public abstract Task<RetrievalQuery> ProcessAsync(
        RetrievalQuery query, CancellationToken cancellationToken = default);
}

public abstract class RetrievalResultProcessor
{
    public abstract Task<RetrievalResults> ProcessAsync(
        RetrievalResults results, RetrievalQuery query,
        CancellationToken cancellationToken = default);
}

// Re-ranking interface
public interface IReranker
{
    Task<IReadOnlyList<RetrievalChunk>> RerankAsync(
        string query, IReadOnlyList<RetrievalChunk> chunks,
        CancellationToken cancellationToken = default);
}

// Retrieval interface for DI and testability
public interface IRetriever
{
    Task<RetrievalResults> RetrieveAsync(
        string query, int topK = 5,
        CancellationToken cancellationToken = default);
}
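To illustrate how the processor abstraction is meant to be extended, here is a hypothetical query processor that asks an IChatClient for paraphrases and stores them as Variants. LlmQueryExpander is an illustrative name, not part of the proposed surface:

    // Hypothetical processor: expands the query into paraphrased variants so the
    // search can match documents that use different vocabulary than the user.
    public sealed class LlmQueryExpander : RetrievalQueryProcessor
    {
        private readonly IChatClient _chatClient;

        public LlmQueryExpander(IChatClient chatClient) => _chatClient = chatClient;

        public override async Task<RetrievalQuery> ProcessAsync(
            RetrievalQuery query, CancellationToken cancellationToken = default)
        {
            var response = await _chatClient.GetResponseAsync(
                $"Rewrite this search query three different ways, one per line: {query.Text}",
                cancellationToken: cancellationToken);

            foreach (var variant in response.Text.Split('\n', StringSplitOptions.RemoveEmptyEntries))
            {
                query.Variants.Add(variant.Trim());
            }

            return query;   // same query object, now enriched with variants
        }
    }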

Package: Microsoft.Extensions.DataRetrieval

namespace Microsoft.Extensions.DataRetrieval;

public sealed class RetrievalPipeline : IDisposable
{
    public RetrievalPipeline(
        RetrievalPipelineOptions? options = null,
        ILoggerFactory? loggerFactory = null);

    public IList<RetrievalQueryProcessor> QueryProcessors { get; }
    public IList<RetrievalResultProcessor> ResultProcessors { get; }

    public Task<RetrievalResults> ProcessAsync<TKey, TRecord>(
        VectorStoreCollection<TKey, TRecord> collection,
        string query,
        int topK = 5,
        Func<TRecord, string>? contentSelector = null,
        CancellationToken cancellationToken = default)
        where TKey : notnull
        where TRecord : class;
}

public sealed class RetrievalPipelineOptions
{
    public string ActivitySourceName { get; set; }
}

public sealed class VectorStoreRetriever<TKey, TRecord> : IRetriever
    where TKey : notnull
    where TRecord : class
{
    public VectorStoreRetriever(
        RetrievalPipeline pipeline,
        VectorStoreCollection<TKey, TRecord> collection,
        Func<TRecord, string>? contentSelector = null);
}

// Extension method for discoverability
public static class RetrievalPipelineExtensions
{
    public static IRetriever AsRetriever<TKey, TRecord>(
        this RetrievalPipeline pipeline,
        VectorStoreCollection<TKey, TRecord> collection,
        Func<TRecord, string>? contentSelector = null)
        where TKey : notnull
        where TRecord : class;
}
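For symmetry, a minimal result processor can be sketched the same way. ScoreThresholdFilter is a hypothetical example built on the proposed types, not part of the proposal itself; note that score semantics (similarity vs. distance) vary by vector store provider:

    // Hypothetical post-search processor: drops chunks below a similarity threshold
    // so obviously irrelevant matches never reach the LLM. Assumes higher Score = better.
    public sealed class ScoreThresholdFilter : RetrievalResultProcessor
    {
        private readonly double _minScore;

        public ScoreThresholdFilter(double minScore) => _minScore = minScore;

        public override Task<RetrievalResults> ProcessAsync(
            RetrievalResults results, RetrievalQuery query,
            CancellationToken cancellationToken = default)
        {
            // Iterate backwards so RemoveAt does not shift unvisited indices.
            for (int i = results.Chunks.Count - 1; i >= 0; i--)
            {
                if (results.Chunks[i].Score < _minScore)
                {
                    results.Chunks.RemoveAt(i);
                }
            }

            return Task.FromResult(results);
        }
    }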

Design Principles

  1. Symmetry with DataIngestion. Query processors mirror chunk processors; RetrievalPipeline mirrors IngestionPipeline<T>. Developers familiar with one immediately understand the other.

  2. Composable pipelines. Zero processors = raw vector search. Add one processor = single enhancement. Stack many = advanced multi-stage retrieval. No dead weight.

  3. Vector store agnostic. ProcessAsync accepts any VectorStoreCollection<TKey, TRecord> (from Microsoft.Extensions.VectorData). Works with Azure AI Search, Qdrant, Pinecone, in-memory, or any MEVD provider.

  4. Observable. Built-in ActivitySource + ILogger support. Each processor invocation is traced with structured log entries.

Usage Example

var pipeline = new RetrievalPipeline(loggerFactory: loggerFactory);
pipeline.QueryProcessors.Add(new MultiQueryExpander(chatClient));
pipeline.ResultProcessors.Add(new LlmReranker(chatClient));

var results = await pipeline.ProcessAsync(
    collection,
    "What are the retention policies?",
    topK: 10,
    contentSelector: record => record.Content);

Relationship to Existing Packages

Package                             | Role
Microsoft.Extensions.AI             | LLM client abstractions (IChatClient, IEmbeddingGenerator)
Microsoft.Extensions.VectorData     | Vector store abstractions (VectorStoreCollection<TKey, TRecord>)
Microsoft.Extensions.DataIngestion  | Write-side pipeline (document → chunks → vector store)
Microsoft.Extensions.DataRetrieval  | Read-side pipeline (query → search → process → results)

Together these four packages provide a complete, composable RAG stack without coupling to any specific provider.

Implementation

A reference implementation exists on the feature/retrieval-abstractions branch of a fork of this repository, including:

  • Both packages with full XML documentation
  • OpenTelemetry tracing integration
  • Reciprocal Rank Fusion for multi-query deduplication
  • Tree-traversal search paradigm for hierarchical indices
  • README documentation for each package
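Reciprocal Rank Fusion merges the ranked lists produced by multiple query variants into one list, scoring each chunk by the sum of 1/(k + rank) over the lists it appears in, where k is conventionally 60. A simplified sketch against the proposed types (deduplicating by Content for illustration; the reference implementation may key on record identity instead):

    // Simplified Reciprocal Rank Fusion over the result lists of several query variants.
    static List<RetrievalChunk> FuseByRrf(
        IEnumerable<IReadOnlyList<RetrievalChunk>> rankedLists, int k = 60)
    {
        var fused = new Dictionary<string, (RetrievalChunk Chunk, double Score)>();

        foreach (var list in rankedLists)
        {
            for (int rank = 0; rank < list.Count; rank++)
            {
                var chunk = list[rank];
                double contribution = 1.0 / (k + rank + 1);   // ranks are 1-based in the RRF formula

                fused[chunk.Content] = fused.TryGetValue(chunk.Content, out var existing)
                    ? (existing.Chunk, existing.Score + contribution)
                    : (chunk, contribution);
            }
        }

        return fused.Values
            .OrderByDescending(e => e.Score)
            .Select(e => { e.Chunk.Score = e.Score; return e.Chunk; })
            .ToList();
    }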

Design Decisions (per Framework Design Guidelines audit)

Decision | Rationale
Abstract classes for processors (not interfaces) | Allows non-breaking addition of new virtual members in future versions. Follows the Framework Design Guidelines (FDG): "DO prefer classes over interfaces."
IReranker as interface (not abstract class) | Single-method, stateless contract. Types may implement both RetrievalResultProcessor AND IReranker (adapter pattern); single inheritance would prevent this if both were abstract classes.
Sealed data types (RetrievalQuery, RetrievalChunk, RetrievalResults) | Leaf DTOs not intended for inheritance. Prevents fragile-base-class problems.
Sealed RetrievalPipeline | Owns the ActivitySource lifetime; subclassing would create resource-management issues.
IList<T> for mutable collections | Pipeline processors need to add/remove/reorder entries. Exposing IList<T> (not List<T>) follows the FDG.
IDictionary<string, object?> for metadata | Established pattern (HttpContext.Items, Activity.Tags). Allows extensibility without breaking changes.
CancellationToken last, with = default | Standard .NET async method convention.
Abstractions package has zero non-polyfill dependencies | Consumers can reference the abstractions without pulling in heavy transitive dependencies.
IRetriever as interface in Abstractions | Data-source-agnostic retrieval contract. Enables DI, testability, and future implementations (web search, SQL, hybrid) without coupling to vector stores.
RetrievalPipeline does NOT implement IRetriever | The pipeline is a reusable processing engine that works with any collection per call. VectorStoreRetriever adapts pipeline + collection → IRetriever for single-endpoint DI scenarios.
ProcessAsync (pipeline) vs RetrieveAsync (IRetriever) | Establishes a clear vocabulary: pipelines process queries through stages; retrievers retrieve results. Symmetric with IngestionPipeline.ProcessAsync, so both pipelines use the same verb for their primary method.

Open Questions

  1. Should a fluent RetrievalPipelineBuilder be included in the core package, or ship separately?
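To make the question concrete, one possible shape for such a builder is sketched below. None of these names are proposed; this is purely illustrative:

    // Hypothetical fluent builder — illustrative only; not part of the proposed API.
    var pipeline = new RetrievalPipelineBuilder()
        .UseQueryProcessor(new MultiQueryExpander(chatClient))
        .UseResultProcessor(new LlmReranker(chatClient))
        .Build();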

Design Note: IRetriever and VectorStoreRetriever

RetrievalPipeline intentionally does NOT implement IRetriever. The pipeline is a processing engine — it defines what transformations occur (query expansion, reranking, validation) but requires a vector store collection per-call. This enables one pipeline to serve multiple collections:

// Same pipeline, different indices
var policyResults = await pipeline.ProcessAsync(policyCollection, query);
var faqResults = await pipeline.ProcessAsync(faqCollection, query);

VectorStoreRetriever<TKey, TRecord> bridges this gap. It captures a pipeline + collection + content selector, then exposes the simple IRetriever contract. This is the recommended DI registration pattern:

services.AddSingleton<IRetriever>(sp =>
    new VectorStoreRetriever<string, Article>(
        sp.GetRequiredService<RetrievalPipeline>(),
        sp.GetRequiredService<VectorStoreCollection<string, Article>>(),
        record => record.Content));

For discoverability, RetrievalPipeline also offers an AsRetriever() extension method:

// Direct usage — no DI needed
IRetriever retriever = pipeline.AsRetriever(collection, record => record.Content);
var results = await retriever.RetrieveAsync("What are the retention policies?");

The IRetriever abstraction is intentionally data-source-agnostic. Future implementations may include:

Implementation                      | Data Source
VectorStoreRetriever<TKey, TRecord> | Any MEVD vector store provider
WebSearchRetriever                  | Bing, Google, or other search APIs
DatabaseRetriever                   | SQL/NoSQL query-based retrieval
HybridRetriever                     | Combines multiple IRetriever instances
CachingRetriever                    | Wraps another IRetriever with response caching

This enables consumers to program against IRetriever regardless of backend, and swap implementations without changing calling code.
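As an example of that substitutability, a caching decorator needs nothing beyond the interface itself. CachingRetriever here is a hypothetical sketch of the idea from the table above, not the reference implementation:

    // Hypothetical decorator: serves repeated queries from an in-memory cache and
    // delegates misses to any inner IRetriever, regardless of backend.
    public sealed class CachingRetriever : IRetriever
    {
        private readonly IRetriever _inner;
        private readonly ConcurrentDictionary<(string Query, int TopK), RetrievalResults> _cache = new();

        public CachingRetriever(IRetriever inner) => _inner = inner;

        public async Task<RetrievalResults> RetrieveAsync(
            string query, int topK = 5, CancellationToken cancellationToken = default)
        {
            if (_cache.TryGetValue((query, topK), out var cached))
            {
                return cached;   // cache hit: skip the backend entirely
            }

            var results = await _inner.RetrieveAsync(query, topK, cancellationToken);
            _cache[(query, topK)] = results;
            return results;
        }
    }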
