Add Langchain hook to common ai provider #67192

Merged
kaxil merged 9 commits into main from aip99-langchain
May 19, 2026


@vikramkoka vikramkoka commented May 19, 2026

Summary

Adds LangChainHook to the common.ai provider, bridging an Airflow connection to LangChain chat and embedding models. The hook resolves credentials from a connection of type langchain and dispatches to the right vendor implementation via LangChain's universal initializers (init_chat_model and init_embeddings).

Design rationale

Own langchain connection type

Vendor names shouldn't be load-bearing across hook families. An earlier revision of this PR reused the existing pydanticai conn_type, but that made the UI misleading: a user configuring LangChain had to open a "Pydantic AI" connection form, which is a leaky abstraction. The four pydanticai-* connection shapes also have different field layouts (Azure endpoint+deployment, Bedrock IAM, Vertex GCP), so the "shared conn" approach silently misrouted three of them.

Each framework now owns its conn_type. Future LangChain cloud-auth variants (Bedrock, Vertex, Azure) will follow the per-vendor-subclass pattern already established by PydanticAIBedrockHook / PydanticAIVertexHook / PydanticAIAzureHook.

Vendor-agnostic dispatch via init_chat_model and init_embeddings

The hook uses langchain.chat_models.init_chat_model("provider:name", api_key=..., base_url=...) for chat and the parallel langchain.embeddings.init_embeddings("provider:name", ...) for embeddings. Dispatch covers any provider those initializers support that accepts the api_key + optional base_url credential shape: OpenAI itself, OpenAI-compatible endpoints (Ollama, vLLM, LM Studio) via the openai: prefix + custom host, Anthropic, Groq, Mistral AI chat, DeepSeek, and others.

Providers with bespoke auth (Bedrock, Vertex, Azure for chat; Cohere, HuggingFace, Mistral embeddings, Bedrock embeddings) reject the api_key/base_url kwarg shape and are deferred to per-vendor subclasses. The docs scope the listed providers honestly so users don't hit a ValidationError at runtime trying a provider that looked supported on paper.
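
To make the api_key + base_url credential shape concrete, here is a minimal sketch of how connection fields could be mapped onto the two LangChain entry points. It is not the hook's actual code: `ConnStub`, `credential_kwargs`, `build_chat_model`, and `build_embedding_model` are hypothetical names, and the mapping of `password` to `api_key` and `host` to `base_url` follows the Usage description in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class ConnStub:
    """Hypothetical stand-in for an Airflow connection (only the fields used here)."""

    password: str | None = None  # API key
    host: str | None = None  # optional custom base URL


def credential_kwargs(conn: ConnStub) -> dict[str, Any]:
    """Map connection fields onto the api_key + base_url kwarg shape."""
    kwargs: dict[str, Any] = {}
    if conn.password:
        kwargs["api_key"] = conn.password
    if conn.host:
        kwargs["base_url"] = conn.host
    return kwargs


def build_chat_model(model_id: str, conn: ConnStub):
    # Imported lazily so this module loads without langchain installed.
    from langchain.chat_models import init_chat_model

    # model_id uses the "provider:name" form, e.g. "openai:<model-name>".
    return init_chat_model(model_id, **credential_kwargs(conn))


def build_embedding_model(model_id: str, conn: ConnStub):
    from langchain.embeddings import init_embeddings

    return init_embeddings(model_id, **credential_kwargs(conn))
```

A provider whose class does not accept `api_key` / `base_url` would fail inside the initializer call, which is why such providers are left to per-vendor subclasses.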

Single hook serves chat + embeddings; optional embed_conn_id

get_chat_model() reads llm_conn_id + llm_model; get_embedding_model() reads embed_conn_id (falls back to llm_conn_id) + embed_model. The common one-provider case stays a single hook instance. When chat and embeddings live on different API keys (premium chat vs free-tier embeddings), pass an explicit embed_conn_id.
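
The fallback rule can be sketched in a few lines. This is an illustration of the behaviour described above, not the hook's implementation; `resolve_conn_ids` is a hypothetical helper, and falling back to `langchain_default` when no `llm_conn_id` is given is assumed from the `default_conn_name` mentioned elsewhere in this PR.

```python
from __future__ import annotations

DEFAULT_CONN_ID = "langchain_default"


def resolve_conn_ids(
    llm_conn_id: str | None = None,
    embed_conn_id: str | None = None,
) -> tuple[str, str]:
    """Return (chat connection id, embedding connection id)."""
    chat = llm_conn_id or DEFAULT_CONN_ID
    # Embeddings reuse the chat connection unless one is given explicitly.
    embed = embed_conn_id or chat
    return chat, embed
```

With one provider, a single connection id covers both models; passing `embed_conn_id` only changes the second element of the pair.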

conn.extra_dejson for parity with PydanticAIHook

The hook parses extra via conn.extra_dejson (matching PydanticAIHook), which swallows JSONDecodeError, returns {} for empty values, and applies secret masking. Sibling consistency matters: a user mis-keying their extra JSON gets the same behavior across both hooks.
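
As a rough illustration of the parsing behaviour relied on here, the sketch below imitates the "empty or malformed extra becomes {}" rule and then picks a model identifier. `parse_extra` and `resolve_model` are hypothetical helpers, secret masking is not modelled, and the precedence (explicit argument first, then the connection extra) is an assumption rather than something this PR states.

```python
from __future__ import annotations

import json
from typing import Any


def parse_extra(raw: str | None) -> dict[str, Any]:
    """Imitate the described behaviour: empty or invalid JSON yields {}."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def resolve_model(explicit: str | None, extra: dict[str, Any], key: str) -> str:
    """Prefer the explicit argument, else the default stored under `key`."""
    model = explicit or extra.get(key)
    if not model:
        raise ValueError(f"No model given and no {key!r} default in connection extra")
    return model
```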

[langchain] extra is framework-only

pip install apache-airflow-providers-common-ai[langchain] installs only langchain itself. Bundling langchain-openai (or any other vendor's integration package) under a framework-named extra would conflate the framework with a vendor choice -- the same kind of mistake as the conn_type. Users install their vendor's LangChain integration package separately (langchain-openai, langchain-anthropic, langchain-groq, etc.).

Usage

from airflow.providers.common.ai.hooks.langchain import LangChainHook
from airflow.sdk import task


@task
def summarize(text: str) -> str:
    hook = LangChainHook(
        llm_conn_id="langchain_default",
        llm_model="anthropic:claude-3-7-sonnet",
    )
    llm = hook.get_chat_model()
    # .content is typed str | list[str | dict]; str() keeps the declared -> str return valid
    return str(llm.invoke(f"Summarise: {text}").content)

Configure the langchain_default connection (type langchain) with the API key in password, optionally a custom base URL in host, and optionally extra={"model": "...", "embed_model": "..."} to set default model identifiers on the connection.
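
One way to define such a connection outside the UI is Airflow's JSON form for `AIRFLOW_CONN_*` environment variables. The sketch below only builds and exports that JSON; the API key, host, and model identifiers are placeholders, and storing the identifiers in `provider:name` form inside extra is an assumption.

```python
import json
import os

# Placeholder values; replace them with real credentials and model identifiers.
connection = {
    "conn_type": "langchain",
    "password": "<your-api-key>",  # API key
    "host": "https://llm.example.internal/v1",  # optional custom base URL
    "extra": {
        "model": "openai:<chat-model-name>",
        "embed_model": "openai:<embedding-model-name>",
    },
}

# Airflow reads connections from AIRFLOW_CONN_<CONN_ID in upper case>.
os.environ["AIRFLOW_CONN_LANGCHAIN_DEFAULT"] = json.dumps(connection)
```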

See example_langchain_hook.py for chat-only, embedding-only, dual-capability, and separate-conn patterns, and example_langchain_tool_agent.py for an end-to-end ReAct agent demo with HITL review.

Gotchas

  • Cloud-auth providers (Bedrock, Vertex, Azure) are not covered by the api_key + base_url surface. LangChainBedrockHook / LangChainVertexHook / LangChainAzureHook subclasses (mirroring the pydantic-ai pattern) are deferred to a follow-up.
  • Cohere, HuggingFace, Mistral embeddings, etc. require provider-specific credential kwargs (cohere_api_key, AWS auth chain, GCP service-account) that this hook does not forward. Same follow-up.
  • default_conn_name is langchain_default, not pydanticai_default. Users adopting this hook need to create a new langchain connection in the UI rather than reusing an existing pydanticai_default. The per-framework conn_type is the right tradeoff; a back-compat alias would carry the wrong abstraction forward.

Deferred follow-ups

  • BaseChatHook / BaseAgentHook / BaseEmbeddingHook contract extraction in common.ai. Once that lands, LangChainHook will inherit from BaseChatHook + BaseEmbeddingHook. Operators will dispatch via BaseHook.get_hook(conn_id) instead of hardcoded conn_type checks.
  • LangChain cloud-auth variants (Bedrock, Vertex, Azure).
  • @task.langchain decorator, consistent with the absence of @task.pydantic_ai today; will land alongside the BaseChatHook refactor.

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: [Claude] following the guidelines

- Adds LangChainHook to bridge Airflow connections to LangChain model constructors (ChatOpenAI, OpenAIEmbeddings), using constructor injection for credentials
- Reuses the existing pydanticai connection type so users configure one connection for PydanticAI, LlamaIndex, and LangChain
- Follows the same pattern as LlamaIndexHook: _resolve_connection_kwargs() extracts api_key and base_url from the Airflow connection and passes them directly to LangChain constructors
- Adds langchain optional dependency extra (langchain>=1.0.0, langchain-openai>=0.3.0)

What's included

- hooks/langchain.py: LangChainHook(BaseHook) with get_chat_model() and get_embedding_model()
- tests/unit/common/ai/hooks/test_langchain.py: full test coverage (init, connection resolution, chat model, embedding model)
- docs/hooks/langchain.rst: hook documentation with usage examples
- provider.yaml: LangChain integration and hook registration
- pyproject.toml: langchain optional dependency extra

Design decisions

- BaseHook, not BaseAIHook: BaseAIHook is still in development. Will migrate in a follow-up PR once it ships.
- Constructor injection: credentials passed as api_key=/base_url= kwargs to LangChain constructors. No environment variable mutation. Matches the LlamaIndexHook pattern.
- Shared connection type: reuses pydanticai connection type rather than introducing a new one. One connection works across all three frameworks.
- No @task.langchain yet: consistent with LlamaIndex (no @task.llamaindex). Deferred to the BaseAIHook migration PR.
@vikramkoka vikramkoka changed the title from "Aip99 langchain" to "Add Langchain hook to common ai provider" on May 19, 2026
Comment thread providers/common/ai/pyproject.toml
kaxil added 7 commits May 19, 2026 22:01
- Own `langchain` connection type instead of reusing `pydanticai`, so the
  UI is honest about which framework a connection configures. The four
  pydanticai-* conn shapes don't map uniformly to LangChain either, so the
  "shared conn" framing silently misrouted for three of them.
- Replace hardcoded `langchain_openai.ChatOpenAI` with
  `langchain.chat_models.init_chat_model("<provider>:<model>", api_key=...,
  base_url=...)`. Same parallel API for embeddings via
  `langchain.embeddings.init_embeddings`. Dispatch covers anything those
  initializers support that accepts the api_key + base_url credential
  shape (OpenAI, OpenAI-compatible endpoints, Ollama). Providers with
  bespoke auth (Bedrock, Vertex, Azure, Cohere, HuggingFace, Mistral
  embeddings) are deferred to per-vendor subclasses, mirroring the
  pydantic-ai pattern.
- `embed_conn_id` (optional, falls back to `llm_conn_id`) keeps the single
  hook instance ergonomic for the common case while supporting different
  API keys for chat and embeddings.
- Parse `conn.extra` via `conn.extra_dejson` for parity with PydanticAIHook
  (swallows JSONDecodeError, applies secret masking).
- `default_conn_name` resolves at runtime rather than at class-def time,
  so future per-vendor subclasses (Bedrock/Vertex/Azure) inherit it cleanly.
- Example DAG: build the FAISS vectorstore once in `_build_tools` and
  close over it (the search tool is invoked many times per agent run),
  drop the eval-based calculator tool, add a `get_current_utc_time`
  tool instead.
- Docs explicitly scope the supported provider list to ones whose
  embedding/chat classes accept the api_key + base_url surface.
- `default_conn_name` is now `langchain_default`.
- Add `langchain>=1.0.0` and `langchain-openai>=0.3.0` to the `dev`
  dependency group. The test suite uses `@patch("langchain.chat_models.
  init_chat_model")` and `@patch("langchain.embeddings.init_embeddings")`,
  which import the target modules at decorator-resolution time. Without
  langchain in the dev environment, the LangChainHook tests fail at
  collection. Mirrors the `pydantic-ai-slim[mcp]` line that's there for
  MCPHook tests.
- Drop the stale `# TODO: inherit from BaseChatHook ...` comment from
  the hook. A future contributor adding the `BaseChatHook` contract
  will refactor every framework hook in one pass; a per-hook TODO
  doesn't help and the parenthetical was PR-process commentary that
  shouldn't be in source.

The `[langchain]` extra previously installed `langchain-openai` alongside
`langchain` itself. That conflated the framework with a specific vendor's
integration package -- the same kind of vendor-specificity leak we fixed
in the conn_type. Users wanting Anthropic, Groq, Mistral AI, etc. would
get `langchain-openai` for no reason.

- `[langchain]` now installs only `langchain>=1.0.0`. Users install their
  vendor's LangChain integration package separately (langchain-openai,
  langchain-anthropic, langchain-groq, etc.).
- Drop `langchain-openai` from the `dev` group too. Hook tests mock
  `init_chat_model` / `init_embeddings`, neither of which imports
  vendor classes at decorator-resolution time. `langchain` alone is
  enough for unit tests to pass.
- Docs updated to list the per-vendor packages users should install
  alongside the extra.

Mirrors the pydantic_ai hook docs pattern (and the other operator docs in
common.ai): runnable snippets live in an example DAG with START/END
markers, and `docs/hooks/langchain.rst` `exampleinclude`s them. The doc
prose stops drifting from the code that has to actually work.

Adds `example_langchain_hook.py` with four minimal DAGs, one per pattern:

- `howto_hook_langchain_chat` -- get_chat_model() + invoke
- `howto_hook_langchain_embedding` -- get_embedding_model() + embed_documents
- `howto_hook_langchain_chat_and_embedding` -- single hook serves both
- `howto_hook_langchain_different_conns` -- explicit embed_conn_id

The richer ReAct agent demo stays in `example_langchain_tool_agent.py`.

`hooks/langchain.rst` was an orphan; Sphinx with `-W` would fail with
"document isn't included in any toctree". Adds `hooks/index.rst` with a
`:glob: *` toctree mirroring the `operators/index.rst` pattern, and
points the top-level `Hooks` toctree entry at it. Future hooks added to
`hooks/` auto-appear in nav with no top-level edits.

`hooks/index.rst` includes a small "Choosing a hook" table covering
PydanticAIHook and LangChainHook (MCPHook has no hook guide -- it's
documented from the connection page).

LangChain's `BaseMessage.content` is typed `str | list[str | dict]` to
support multi-modal responses (text + images + tool calls). The example
DAGs only exercise the text-only path, but `summarize() -> str` returning
`.content` directly fails mypy with "Incompatible return value type".

Wraps `.content` in `str(...)` at the three call sites. The other two
sites land inside untyped dicts that mypy doesn't flag, but the
consistency matters: the docs reference these snippets via
`exampleinclude`, so users copy this code into their own typed `@task`
functions. Better to show the pattern that works in both cases.
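
The pattern can be sketched without LangChain installed. `MessageStub` is a hypothetical stand-in whose `content` attribute mirrors the union type described above, and `summary_text` is an illustrative name, not a function from the example DAGs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MessageStub:
    """Hypothetical stand-in for a chat response message."""

    content: str | list  # mirrors the `str | list[str | dict]` typing


def summary_text(message: MessageStub) -> str:
    # Returning message.content directly would not satisfy `-> str` under mypy,
    # because the attribute may also be a list; str() narrows it.
    return str(message.content)
```

For the text-only path `str()` is a no-op on the value; a multi-modal list would be stringified as a whole, so code that expects such responses would need its own handling.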
CI's docs spell-check (sphinxcontrib-spelling) flagged four words not in
the global wordlist. Rephrased to use plainer terms instead of padding
the wordlist:

- "initialisers" / "initializers" -> "entry-point functions"
- "dispatchable" -> "accepted"
- Dropped the bare verb form "exampleinclude-d" from the example DAG
  docstring; the docstring now describes the file's content rather than
  how docs reference it.
@kaxil kaxil merged commit dcdd124 into main May 19, 2026
8 checks passed
@kaxil kaxil deleted the aip99-langchain branch May 19, 2026 23:27
kaxil added a commit that referenced this pull request May 21, 2026
…s, cloud URIs

Same playbook as #67192 (LangChain) and #67120 (DocumentLoader) plus
three LlamaIndex-specific architectural fixes:

Critical fixes
- Stop mutating LlamaIndex's global ``Settings`` singleton. The previous
  ``LlamaIndexHook.configure_settings()`` wrote ``Settings.embed_model``
  / ``Settings.llm`` process-wide, which leaks across concurrent tasks
  in the same worker. Replaced with per-call ``embed_model=`` /
  ``llm=`` parameters on ``VectorStoreIndex(...)`` and
  ``load_index_from_storage(...)``.
- Own ``llamaindex`` connection type instead of squatting on
  ``pydanticai``. Mirrors the LangChain / CrewAI fix.
- Remove ``documents`` from ``EmbeddingOperator.template_fields``.
  ``list[dict]`` doesn't survive Jinja stringification, and worse, a
  user document containing literal ``{{ var.value.api_key }}`` would
  leak secrets into the embedding store. Bind via ``loader.output``
  instead.
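
The leak described in the first bullet can be reproduced with a toy process-wide singleton. The sketch uses no LlamaIndex code: `Settings` here is a hypothetical stand-in, the "models" are plain strings, and the two tasks are simulated sequentially in one process.

```python
class _GlobalSettings:
    """Toy stand-in for a process-wide settings singleton."""

    embed_model = None


Settings = _GlobalSettings()


def embed_with_global(text: str) -> str:
    # Reads whatever the last writer put on the singleton.
    return f"{Settings.embed_model}:{text}"


def embed_per_call(text: str, embed_model: str) -> str:
    # The model travels with the call, so other tasks cannot change it.
    return f"{embed_model}:{text}"


Settings.embed_model = "model-a"  # task A configures the singleton
Settings.embed_model = "model-b"  # task B, same worker process, overwrites it

leaked = embed_with_global("doc from task A")  # task A now uses task B's model
isolated = embed_per_call("doc from task A", embed_model="model-a")
```

Passing the model per call trades a little verbosity for isolation between tasks that share a worker process.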

BYO embedding/LLM for non-OpenAI vendors
- LlamaIndex doesn't ship an ``init_chat_model`` / ``init_embedding_model``
  equivalent (verified in ``llama_index.core.embeddings.utils.resolve_embed_model``
  -- only ``"default"`` / ``"local"`` / ``"clip:"`` dispatch). The hook
  therefore covers OpenAI (matching LlamaIndex's own
  ``resolve_embed_model("default")`` behaviour) and operators accept a
  pre-built ``BaseEmbedding`` / ``LLM`` instance to bypass the hook for
  Cohere / Bedrock / Vertex / HuggingFace / etc.

Cloud-URI persistence
- ``EmbeddingOperator.persist_dir`` and
  ``RetrievalOperator.index_persist_dir`` accept storage URIs
  (``s3://``, ``gs://``, ``azure://``) resolved via
  ``ObjectStoragePath`` and fsspec, matching the merged
  ``DocumentLoaderOperator`` pattern.

Hook plumbing playbook (mirrors LangChain / CrewAI / DocumentLoader)
- ``conn_type = "llamaindex"`` + new ``connection-types`` entry in
  ``provider.yaml`` with ``embed_model`` / ``llm_model`` conn-fields.
- ``default_conn_name`` resolves at runtime via
  ``llm_conn_id: str | None = None``.
- ``_resolve_model`` honours ``conn.extra_dejson`` for parity with the
  sibling hooks (swallows ``JSONDecodeError``, applies secret masking).
- ``get_ui_field_behaviour`` added.
- ``[llamaindex]`` extra in ``pyproject.toml`` pinning
  ``llama-index-core``, ``llama-index-embeddings-openai``,
  ``llama-index-llms-openai`` (enough to back the hook's default
  OpenAI return values). Same in the ``dev`` group.
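
The difference between class-definition-time and runtime resolution of `default_conn_name` is easy to miss, so here is a minimal sketch. All four classes are hypothetical; only the `llm_conn_id: str | None = None` signature and the `llamaindex_default` name come from the notes above.

```python
from __future__ import annotations


class EagerHook:
    default_conn_name = "llamaindex_default"

    # The default value is evaluated once, while this class body runs.
    def __init__(self, llm_conn_id: str = default_conn_name) -> None:
        self.llm_conn_id = llm_conn_id


class RuntimeHook:
    default_conn_name = "llamaindex_default"

    # None is a sentinel; the attribute lookup happens on each instantiation.
    def __init__(self, llm_conn_id: str | None = None) -> None:
        self.llm_conn_id = llm_conn_id or self.default_conn_name


class EagerVendorHook(EagerHook):
    default_conn_name = "llamaindex_vendor_default"  # hypothetical subclass default


class RuntimeVendorHook(RuntimeHook):
    default_conn_name = "llamaindex_vendor_default"  # hypothetical subclass default
```

`EagerVendorHook()` still ends up with the parent's connection id, while `RuntimeVendorHook()` picks up its own override, which is the property per-vendor subclasses need.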

Misc operator/test fixes
- Wrap lazy ``llama_index`` imports with
  ``AirflowOptionalProviderFeatureException`` so missing extras surface
  cleanly.
- ``RetrievalOperator`` returns ``{"query": ..., "chunks": [...]}``
  (was ``"question"``) and ``chunks[*].node_id`` (was the misleading
  ``"source"`` key).
- ``RetrievalOperator`` raises ``FileNotFoundError`` with a "did you
  run EmbeddingOperator first?" hint when ``index_persist_dir`` is
  missing.
- All three test files get an autouse fixture stubbing
  ``llama_index.*`` in ``sys.modules`` so ``@patch`` resolves without
  ``llama-index-*`` packages installed in CI's non-DB test env
  (mirrors #67237).
- New ``example_llamaindex_hook.py`` with ``[START howto_*]`` markers
  for the docs to ``exampleinclude``.
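
The lazy-import wrapping in the first bullet might look roughly like the sketch below. `OptionalFeatureError` is a hypothetical stand-in for `AirflowOptionalProviderFeatureException` so the snippet runs without Airflow, and `lazy_import` is an illustrative helper rather than the operators' actual code.

```python
import importlib
from types import ModuleType


class OptionalFeatureError(Exception):
    """Hypothetical stand-in for AirflowOptionalProviderFeatureException."""


def lazy_import(module_name: str, extra: str) -> ModuleType:
    """Import at call time and turn a missing package into a clear error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise OptionalFeatureError(
            f"{module_name!r} is not installed; install the [{extra}] extra to use this feature"
        ) from e
```

Calling the helper inside `execute()` or a method body keeps the module importable when the optional packages are absent.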
kaxil added a commit that referenced this pull request May 21, 2026
* Add LlamaIndex operators to common.ai provider

  - Adds LlamaIndexHook to bridge Airflow connections to LlamaIndex's Settings singleton. Reuses the pydanticai connection type, supports separate embedding and LLM connections.
  - Adds EmbeddingOperator to chunk documents and produce embedding vectors via LlamaIndex's SentenceSplitter. Input is list[dict(text, metadata)] (same shape as DocumentLoaderOperator output), output includes chunks with vectors ready for downstream vector store ingest operators (pgvector, Pinecone, Weaviate).
  - Adds RetrievalOperator to load a persisted LlamaIndex index and perform similarity search. Output is scored chunks ready for synthesis via LLMOperator.

  Design notes

  All LlamaIndex imports are lazy (inside execute() / method bodies), so modules parse without llama-index installed. The hook currently hardcodes
  OpenAI embedding/LLM providers; a follow-up PR will refactor to use BaseAIHook for provider-agnostic model resolution when it lands.

  What's included

  ┌─────────────────────────────────────────┬──────────────────────────────────────────┐
  │                  File                   │                 Purpose                  │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ hooks/llamaindex.py                     │ Hook (~110 lines)                        │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ operators/llamaindex_embedding.py       │ EmbeddingOperator (~110 lines)           │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ operators/llamaindex_retrieval.py       │ RetrievalOperator (~90 lines)            │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ tests/.../test_llamaindex.py            │ 12 hook tests                            │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ tests/.../test_llamaindex_embedding.py  │ 10 operator tests                        │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ tests/.../test_llamaindex_retrieval.py  │ 8 operator tests                         │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/hooks/llamaindex.rst               │ Hook docs                                │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/operators/llamaindex_embedding.rst │ EmbeddingOperator docs                   │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/operators/llamaindex_retrieval.rst │ RetrievalOperator docs                   │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ provider.yaml                           │ Integration, hook, operator registration │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/index.rst                          │ LlamaIndex Hook in Guides toctree        │
  ├─────────────────────────────────────────┼──────────────────────────────────────────┤
  │ docs/operators/index.rst                │ Chooser table rows                       │
  └─────────────────────────────────────────┴──────────────────────────────────────────┘

  Test plan

  - uv run --project providers/common/ai pytest providers/common/ai/tests/unit/common/ai/hooks/test_llamaindex.py -xvs (12 tests)
  - uv run --project providers/common/ai pytest providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_embedding.py
  providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_retrieval.py -xvs (18 tests)
  - Hook: init defaults, separate embed_conn_id, connection kwargs extraction, embedding model, LLM, Settings configuration
  - EmbeddingOperator: output shape, chunking, index persistence, vector inclusion/omission, splitter params
  - RetrievalOperator: output shape, chunk keys, top_k forwarding, multiple results, storage context

  ---
  Was generative AI tooling used to co-author this PR?

  - Yes — Claude Code (Opus 4.6)

  Generated-by: Claude Code (Opus 4.6) following
  https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions

* Refactor LlamaIndex hook + operators: no Settings mutation, BYO models, cloud URIs

* Rename LlamaIndex operators with framework prefix; fold in #67189 RAG examples

Per Kaxil's review r3267387604: ``RetrievalOperator`` / ``EmbeddingOperator``
are too generic in the common.ai namespace -- they risk colliding when
other frameworks add their own embedding/retrieval operators. Renamed
both with the LlamaIndex prefix:

- ``EmbeddingOperator`` -> ``LlamaIndexEmbeddingOperator``
- ``RetrievalOperator`` -> ``LlamaIndexRetrievalOperator``

Renames applied across the two operator modules, three docs RSTs, the
two test files, both example DAGs, and the cross-refs in
``docs/operators/index.rst``, ``docs/hooks/llamaindex.rst``,
``docs/operators/document_loader.rst``, and ``docs/hooks/index.rst``.

Folds in #67189 (``example_llamaindex_rag.py``) which would otherwise
sit blocked waiting for this PR to merge. Rewritten for the new API:

- Uses the renamed classes
- Drops ``documents="{{ ti.xcom_pull(...) }}"`` Jinja templating
  (template_fields removed; bind via ``loader.output`` direct)
- Switches LlamaIndex operators to ``llamaindex_default`` conn (was
  ``pydanticai_default``); the synthesis-step ``LLMOperator`` keeps
  ``pydanticai_default`` because it's pydantic-ai-backed (different
  framework, intentional split documented in the module docstring)
- Adds explicit ``embed_model="text-embedding-3-small"`` to every
  embedding/retrieval call (new operator validation requires it)
- Fixes the string-reference task chains (``load >> "build_index"`` ->
  ``load >> build_index``) which weren't valid task dependencies

Closes #67189.

* Address code-review findings on LlamaIndex operators

- Fix ObjectStoragePath conn_id mangling: pass raw URI to LlamaIndex
  persist_dir= and supply target.fs separately. str(target) returns
  s3://<conn_id>@<bucket>/..., which fsspec misinterprets.
- Add documents / embed_model / embed_conn_id to template_fields so
  XComArg resolution fires. The previous "list[dict] doesn't survive
  stringification" rationale was wrong; Templater unwraps resolvables
  before Jinja.
- Default llm_conn_id to None on both operators; LlamaIndexHook
  resolves to default_conn_name at runtime. Hard-coding
  "llamaindex_default" undid the hook's careful runtime resolution.
- Add embed_conn_id pass-through for separate embedding credentials.
- Replace isinstance(str) duck-typing with hasattr-based BaseEmbedding
  check; raise TypeError with a clear pointer instead of letting an
  unresolved XComArg or random object explode later.
- Hoist 'import os' and 'from pathlib import Path' to module top.
- Pad RST title underlines and refresh docs/tests to match the new
  surface.
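
To see why the stringified path from the first bullet is a problem, it helps to split the two URI shapes with the standard library. The bucket, key, and connection id below are made up, and the sketch only shows where the connection id lands in the URI, not what fsspec does with it.

```python
from urllib.parse import urlsplit

raw_uri = "s3://my-bucket/indexes/docs"  # what the user passed
mangled_uri = "s3://my_conn@my-bucket/indexes/docs"  # shape of str(target)

raw_parts = urlsplit(raw_uri)
mangled_parts = urlsplit(mangled_uri)

# In the mangled form the connection id occupies the userinfo slot of the
# authority component, so the netloc is no longer just the bucket name.
conn_id_in_uri = mangled_parts.username
bucket_from_raw = raw_parts.netloc
netloc_from_mangled = mangled_parts.netloc
```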

* Fix mypy on LlamaIndex embedding operator

- Pass persist_dir as a typed str arg to _persist so the existing
  None-narrowing # type: ignore comments can go away.
- Cast SentenceSplitter nodes to list[TextNode] for the .text access:
  the splitter only ever returns TextNode, but the base
  get_nodes_from_documents signature is typed as list[BaseNode].

* Install llama-index in tests instead of stubbing sys.modules

llama-index-core / -embeddings-openai / -llms-openai were declared in
the common.ai provider's dev dependency group but missing from uv.lock,
so CI never actually installed them. The tests papered over that by
faking out llama_index.* in sys.modules with MagicMocks.

Refresh uv.lock so the packages get installed, then drop the
sys.modules manipulation:

- test_llamaindex.py: remove the autouse _stub_llama_index_modules
  fixture entirely; @patch resolves against the real modules.
- test_llamaindex_embedding.py / test_llamaindex_retrieval.py: replace
  the _stub_li fixture (sys.modules setitem) with a smaller _li fixture
  that uses monkeypatch.setattr against real llama_index.core symbols.

* Apply ruff lint/format fixes

---------

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>