Create document processor #195

afterrburn · 2025-06-15T22:34:51Z

AI Agent Documentation Sync

Overview

This PR implements an AI agent that automatically syncs documentation changes with a vector database for enhanced search and retrieval capabilities.

Workflow

Automated Sync (Main Branch)

Trigger: GitHub Action runs when a PR is merged into the main branch
Process:
- GitHub Action sends a curl request to the agent with the file changes from git diff
- Agent receives the list of changed and deleted file paths
- For each changed file:
  - Search vector store by metadata (file path)
  - Delete existing vectors if found
  - Re-embed updated document content
  - Store new vectors back to database
- Deleted files are automatically removed from vector store
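
The process above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the `VectorStore` interface, the `InMemoryStore` class, the blank-line chunking, and the `embed` placeholder are all hypothetical and exist only so the example runs without an external service.

```typescript
// Hypothetical shapes for illustration; the real SDK types may differ.
interface VectorRecord {
  key: string;
  text: string;
  embedding: number[];
  metadata: { path: string };
}

interface VectorStore {
  findByPath(path: string): Promise<VectorRecord[]>;
  delete(key: string): Promise<void>;
  upsert(record: VectorRecord): Promise<void>;
}

// In-memory stand-in for the vector database.
class InMemoryStore implements VectorStore {
  records = new Map<string, VectorRecord>();
  async findByPath(path: string): Promise<VectorRecord[]> {
    return Array.from(this.records.values()).filter((r) => r.metadata.path === path);
  }
  async delete(key: string): Promise<void> {
    this.records.delete(key);
  }
  async upsert(record: VectorRecord): Promise<void> {
    this.records.set(record.key, record);
  }
}

// Placeholder embedding: a real agent would call an embedding model here.
async function embed(text: string): Promise<number[]> {
  return [text.length];
}

// Delete every stored vector whose metadata points at the given file path.
async function removeFile(store: VectorStore, path: string): Promise<void> {
  for (const record of await store.findByPath(path)) {
    await store.delete(record.key);
  }
}

async function syncFiles(
  store: VectorStore,
  changed: { path: string; content: string }[],
  removed: string[],
): Promise<void> {
  for (const file of changed) {
    // Drop stale vectors first so old chunks never outlive an edit.
    await removeFile(store, file.path);
    // Naive chunking on blank lines, purely for the sketch.
    const chunks = file.content.split(/\n\s*\n/).filter((c) => c.trim() !== "");
    for (let i = 0; i < chunks.length; i++) {
      await store.upsert({
        key: `${file.path}#${i}`,
        text: chunks[i],
        embedding: await embed(chunks[i]),
        metadata: { path: file.path },
      });
    }
  }
  for (const path of removed) {
    await removeFile(store, path);
  }
}
```

Deleting before re-embedding keeps the update idempotent: running the same payload twice leaves the store in the same state.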

Document Processing

Chunking: Documents are chunked based on content type for optimal retrieval
Embedding: Uses OpenAI text-embedding-3-small model
- Note: May upgrade to larger model for improved accuracy based on quality assessment
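
To make "chunked based on content type" concrete, here is a small sketch that splits Markdown by heading and labels each section as a list or as plain text. The `Chunk` shape, the two content types, and the majority-of-lines rule are assumptions for illustration; the PR's chunk-mdx.ts may detect more types and split differently. The sketch also ignores fenced code blocks, so a `#` line inside one would be treated as a heading.

```typescript
type ContentType = "list" | "text";

interface Chunk {
  heading: string;
  body: string;
  contentType: ContentType;
}

// Classify a block as a list when most non-empty lines are list items.
function detectContentType(body: string): ContentType {
  const lines = body.split("\n").filter((l) => l.trim() !== "");
  const listLines = lines.filter((l) => /^([-*+]|\d+\.)\s+/.test(l.trim()));
  return lines.length > 0 && listLines.length > lines.length / 2 ? "list" : "text";
}

// Split a Markdown document into one chunk per heading section.
function chunkByHeading(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];
  const flush = (): void => {
    const body = buffer.join("\n").trim();
    if (body !== "") {
      chunks.push({ heading, body, contentType: detectContentType(body) });
    }
    buffer = [];
  };
  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section before starting a new one
      heading = match[1].trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```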

Manual Operations

Full Refresh: Agent supports on-demand vector store clearing and complete re-upload of all documentation

Benefits

Real-time documentation sync with vector database
Efficient incremental updates (only changed files processed)
Maintains search accuracy with latest document versions
Flexible manual override capabilities

Summary by CodeRabbit

New Features
- Added multiple GitHub Actions workflows for manual, push-triggered, and full synchronization of documentation files to an external vector store.
- Introduced modules for content-aware chunking, keyword extraction using LLMs, embedding generation, and orchestration of document synchronization without filesystem dependency.
- Enhanced the agent to process JSON payloads for syncing documentation changes with validation and detailed error handling.
Documentation
- Added comprehensive design, user stories, and TODO documents detailing the Retrieval-Augmented Generation (RAG) system architecture and implementation plan.
Tests
- Added extensive tests covering content type detection and document chunking to ensure accurate and reliable processing.
Chores
- Updated project dependencies to include new libraries supporting document parsing, embedding, and testing.

…etadata handling

coderabbitai · 2025-06-15T22:34:57Z

Warning

Rate limit exceeded

@afterrburn has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 9 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between 48600eb and e48edbd.

📒 Files selected for processing (3)

.github/workflows/sync-docs-full.yml (1 hunks)
agent-docs/src/agents/doc-processing/docs-orchestrator.ts (1 hunks)
agent-docs/src/agents/doc-processing/embed-chunks.ts (1 hunks)

Walkthrough

This update introduces a comprehensive Retrieval-Augmented Generation (RAG) document processing pipeline for documentation. It adds new modules for chunking, embedding, keyword extraction, and orchestrating synchronization with a vector store, along with supporting design/user story documents, workflow automation via GitHub Actions, and initial test coverage for chunking logic.

Changes

File(s)	Change Summary
.github/workflows/@sync-docs.yml .github/workflows/sync-docs.yml .github/workflows/sync-docs-full.yml	Added GitHub Actions workflows to detect documentation changes and sync them to an external vector store via webhook, supporting manual triggers, push events, and full syncs of all docs.
agent-docs/RAG-TODO.md agent-docs/RAG-design.md agent-docs/RAG-user-stories.md	Added design, TODO, and user story documents outlining requirements, architecture, flows, and success criteria for the RAG documentation system.
agent-docs/package.json	Added `gray-matter`, `langchain`, and `vitest` as dependencies for document parsing, language processing, and testing.
agent-docs/src/agents/doc-processing/chunk-mdx.ts	Added content-aware MDX chunking logic, content type detection, and document enrichment functions for downstream processing.
agent-docs/src/agents/doc-processing/config.ts	Added export of the `VECTOR_STORE_NAME` constant for vector store configuration.
agent-docs/src/agents/doc-processing/docs-orchestrator.ts	Added logic to sync documentation from a webhook payload, handling changed and removed files, and managing vector store updates.
agent-docs/src/agents/doc-processing/docs-processor.ts	Added pipeline to chunk, enrich, and embed document content, producing vector upsert parameters for storage.
agent-docs/src/agents/doc-processing/embed-chunks.ts	Added embedding utility to generate vector representations of text chunks using OpenAI models.
agent-docs/src/agents/doc-processing/index.ts	Added main agent handler for documentation sync requests, with validation, error handling, and orchestration.
agent-docs/src/agents/doc-processing/keyword-extraction.ts	Added LLM-based keyword extraction module for document chunks, with configurable options and structured results.
agent-docs/src/agents/doc-processing/test/chunk-mdx.test.ts	Added comprehensive tests for chunking and content type detection, covering a variety of Markdown structures and edge cases.
agent-docs/src/agents/doc-processing/types.ts	Added TypeScript interfaces for file payloads, sync payloads, and sync statistics to type synchronization data and results.

Sequence Diagram(s)

sequenceDiagram
    participant GitHub
    participant Workflow
    participant Webhook
    participant Agent
    participant VectorStore

    GitHub->>Workflow: Push or PR event (docs changed)
    Workflow->>Webhook: POST sync payload (changed/removed files)
    Webhook->>Agent: Forward payload
    Agent->>Agent: Validate payload
    Agent->>Agent: For each changed file:
    Agent->>Agent: - Decode content, chunk, enrich
    Agent->>Agent: - Embed chunks
    Agent->>VectorStore: Upsert vectors with metadata
    Agent->>Agent: For each removed file:
    Agent->>VectorStore: Delete vectors by file path
    Agent-->>Webhook: Respond with sync stats
    Webhook-->>Workflow: Sync result

Suggested reviewers

rblalock

Poem

A rabbit hops through docs anew,
Chunking, embedding, syncing too.
Keywords found with LLM’s might,
Vectors stored for future insight.
Workflows dance with every push,
Tests ensure there’s not a mush.
🥕—RAG leaps forward, what a sight!


cloudflare-workers-and-pages · 2025-06-15T22:35:02Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	docs	`e48edbd`	Commit Preview URL	Jun 18 2025, 04:26 AM

…dependency

coderabbitai

Actionable comments posted: 17

🧹 Nitpick comments (17)

agent-docs/.gitignore (1)
15-15: Correct log file ignore pattern

The current entry _.log only matches a file literally named _.log, because the underscore is not a wildcard in gitignore patterns. To ignore all .log files, use *.log instead:
- _.log
+ *.log
agent-docs/src/agents/doc-processing/config.ts (1)
1-1: Add explicit type annotation and use nullish coalescing

For clarity and to avoid unintentionally falling back on an empty string, annotate VECTOR_STORE_NAME and switch to ??:
-export const VECTOR_STORE_NAME = process.env.VECTOR_STORE_NAME || 'docs';
+export const VECTOR_STORE_NAME: string = process.env.VECTOR_STORE_NAME ?? 'docs';
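
The difference between the two operators only shows up for falsy-but-defined values such as the empty string. A short sketch, using hypothetical helper names, of how each one treats an unset versus an empty value:

```typescript
// Resolve a store name from an optional environment value.
function withOr(value: string | undefined): string {
  return value || "docs"; // falls back on undefined AND on ""
}

function withNullish(value: string | undefined): string {
  return value ?? "docs"; // falls back only on undefined (or null)
}
```

With `??`, a variable that is set but empty is kept as an empty store name, so whether that is desirable depends on how the deployment environment populates it.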
agent-docs/README.md (2)
5-7: Missing alt text on badge image

<img src="https://app.agentuity.com/img/deploy.svg" /> lacks an alt attribute, triggering MD045 and reducing accessibility.
-            <img src="https://app.agentuity.com/img/deploy.svg" /> 
+            <img src="https://app.agentuity.com/img/deploy.svg" alt="Deploy to Agentuity" />
65-71: Specify a language for the fenced code block

Markdownlint MD040 warns when language identifiers are omitted.
-```
+```text
agent-docs/agentuity.yaml (1)
75-78: Typos & missing metadata

description: (l 15) is empty – fill this so the project is discoverable.

"An applicaiton that process documents" (l 77) ➜ “An application that processes documents”.
-    description: An applicaiton that process documents
+    description: An application that processes documents
agent-docs/RAG-TODO.md (1)

1-1: Grammar nit: use “To-Dos” (hyphen)

Heading should read “RAG System Implementation To-Dos” for correctness.

agent-docs/.cursor/rules/agent.mdc (1)

11-15: Consistent casing for “TypeScript”

Update “Typescript” ➜ “TypeScript” to match official spelling.
agent-docs/src/agents/doc-processing/docs-orchestrator.ts (1)
63-75: Avoid mutating loop variable & ensure metadata merge

chunk is reused; mutating chunk.metadata inside the for-loop can cause hidden side-effects if the object is reused elsewhere.

Prefer creating a new object for upsert:
-        chunk.metadata = {
-          ...chunk.metadata,
-          path: logicalPath,
-        };
-        await ctx.vector.upsert(VECTOR_STORE_NAME, chunk);
+        await ctx.vector.upsert(VECTOR_STORE_NAME, {
+          ...chunk,
+          metadata: { ...chunk.metadata, path: logicalPath },
+        });
agent-docs/index.ts (1)

3-9: Ambient-type augmentation is too broad

Adding isBun to the global Process interface inside a top-level file pollutes every consumer that imports this module, risking declaration collisions in downstream packages/tests. Prefer a dedicated types/global.d.ts (referenced via tsconfig.json -> files) so the augmentation is scoped to the package instead of the compiled JS bundle.
agent-docs/src/agents/doc-processing/index.ts (1)
37-60: Validation logic is correct but repetitive – consider a schema validator

Three hand-rolled loops check structure and types. A small Zod/Valibot schema would:

Collapse 20 LOC into 2.

Produce consolidated, descriptive errors.

Future-proof the endpoint as payload evolves.

Not blocking, yet improves reliability & readability.
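import { z } from "zod";

// Schema for the sync payload received by the agent
const SyncSchema = z.object({
  commit: z.string().optional(),
  repo: z.string().optional(),
  changed: z.array(z.object({ path: z.string(), content: z.string() })),
  removed: z.array(z.string()),
});
// Inside the request handler: parse throws a ZodError describing invalid fields
const payload = SyncSchema.parse(await req.data.json());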
agent-docs/src/agents/doc-processing/test/chunk-mdx.test.ts (1)
5-6: Use the Document constructor to retain prototype helpers

Creating a plain object misses Document methods (e.g., .split()) that some downstream utilities rely on.
-const makeDoc = (content: string): Document => ({ pageContent: content, metadata: { contentType: "text" } });
+const makeDoc = (content: string) =>
+  new Document({ pageContent: content, metadata: { contentType: "text" } });
agent-docs/RAG-user-stories.md (1)

63-63: Minor wording duplication

“Answers” is repeated: the heading “For Quick Answers” is immediately followed by the bullet “Answers are accurate and up-to-date”.

agent-docs/.cursor/rules/sdk.mdc (1)

30-38: Consider using string[] instead of string for keywords in examples.

Throughout the documentation, keywords are treated conceptually as a list; showing them as a plain string (single value) may confuse users and diverge from the actual implementation in DocumentMetadata & the RAG pipeline, where keywords are an array.
Updating the sample types keeps docs and code aligned.

.github/workflows/@sync-docs.yml (2)

11-11: Upgrade to actions/checkout@v4 to avoid deprecation warnings.

v3 still works, but it runs on the deprecated Node 16 runtime and triggers deprecation warnings in CI. v4 runs on Node 20, is a drop-in replacement here, and future-proofs the workflow.

77-77: Add a newline at EOF & strip trailing spaces.
Resolves YAML-lint errors and keeps tooling quiet.

agent-docs/RAG-design.md (1)

16-23: keywords should be string[] for consistency.

All later sections treat keywords as an array (boosting, highlighting, etc.). Updating the interface prevents downstream type confusion.

agent-docs/src/agents/doc-processing/chunk-mdx.ts (1)

98-108: Avoid any[] – preserve chunk typing for downstream safety.

Use Document[] (or a dedicated MarkdownChunk interface) instead of any[] for finalChunks; this prevents silent shape drift later in the pipeline.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e84056 and b52980b.

⛔ Files ignored due to path filters (1)

agent-docs/bun.lock is excluded by !**/*.lock

📒 Files selected for processing (25)

.github/workflows/@sync-docs.yml (1 hunks)
.github/workflows/sync-docs.yml (1 hunks)
agent-docs/.cursor/rules/agent.mdc (1 hunks)
agent-docs/.cursor/rules/agentuity.mdc (1 hunks)
agent-docs/.cursor/rules/sdk.mdc (1 hunks)
agent-docs/.editorconfig (1 hunks)
agent-docs/.gitignore (1 hunks)
agent-docs/RAG-TODO.md (1 hunks)
agent-docs/RAG-design.md (1 hunks)
agent-docs/RAG-user-stories.md (1 hunks)
agent-docs/README.md (1 hunks)
agent-docs/agentuity.yaml (1 hunks)
agent-docs/biome.json (1 hunks)
agent-docs/index.ts (1 hunks)
agent-docs/package.json (1 hunks)
agent-docs/src/agents/doc-processing/chunk-mdx.ts (1 hunks)
agent-docs/src/agents/doc-processing/config.ts (1 hunks)
agent-docs/src/agents/doc-processing/docs-orchestrator.ts (1 hunks)
agent-docs/src/agents/doc-processing/docs-processor.ts (1 hunks)
agent-docs/src/agents/doc-processing/embed-chunks.ts (1 hunks)
agent-docs/src/agents/doc-processing/index.ts (1 hunks)
agent-docs/src/agents/doc-processing/keyword-extraction.ts (1 hunks)
agent-docs/src/agents/doc-processing/test/chunk-mdx.test.ts (1 hunks)
agent-docs/tsconfig.json (1 hunks)
tsconfig.json (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (2)

agent-docs/src/agents/doc-processing/index.ts (1)

agent-docs/src/agents/doc-processing/docs-orchestrator.ts (1)

syncDocsFromPayload (49-102)

agent-docs/src/agents/doc-processing/docs-orchestrator.ts (2)

agent-docs/src/agents/doc-processing/config.ts (1)

VECTOR_STORE_NAME (1-1)

agent-docs/src/agents/doc-processing/docs-processor.ts (1)

processDoc (21-25)

🪛 LanguageTool

agent-docs/RAG-TODO.md

[grammar] ~1-~1: It appears that a hyphen is missing in the plural noun “to-dos”?
Context: # RAG System Implementation TODOs ## 1. Document Chunking & Metadata - [...

(TO_DO_HYPHEN)

agent-docs/RAG-design.md

[style] ~94-~94: This phrase is redundant. Consider writing “relevant”.
Context: ...yword matches. Why? - Ensures that highly relevant technical results (e.g., containing exa...

(HIGHLY_RELEVANT)

[uncategorized] ~217-~217: You might be missing the article “the” here.
Context: ...tion" } ``` --- ## 12. Summary - Only main content is embedded; keywords and metad...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)

agent-docs/RAG-user-stories.md

[duplication] ~63-~63: Possible typo: you repeated a word.
Context: ...les ## Success Criteria ### For Quick Answers - Answers are accurate and up-to-date - Responses...

(ENGLISH_WORD_REPEAT_RULE)

🪛 markdownlint-cli2 (0.17.2)

agent-docs/README.md

6-6: Images should have alternate text (alt text)
null

(MD045, no-alt-text)

65-65: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

🪛 actionlint (1.7.7)

.github/workflows/@sync-docs.yml

11-11: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

.github/workflows/sync-docs.yml

14-14: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 YAMLlint (1.37.1)

.github/workflows/@sync-docs.yml

[error] 29-29: trailing spaces

(trailing-spaces)

[error] 33-33: trailing spaces

(trailing-spaces)

[error] 36-36: trailing spaces

(trailing-spaces)

[error] 64-64: trailing spaces

(trailing-spaces)

[error] 73-73: trailing spaces

(trailing-spaces)

[error] 77-77: no new line character at the end of file

(new-line-at-end-of-file)

[error] 77-77: trailing spaces

(trailing-spaces)

.github/workflows/sync-docs.yml

[error] 20-20: trailing spaces

(trailing-spaces)

[error] 24-24: trailing spaces

(trailing-spaces)

[error] 27-27: trailing spaces

(trailing-spaces)

[error] 55-55: trailing spaces

(trailing-spaces)

[error] 66-66: trailing spaces

(trailing-spaces)

[error] 71-71: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Workers Builds: docs

🔇 Additional comments (13)

agent-docs/.gitignore (1)

1-40: Comprehensive ignore rules for agent-docs directory

The .gitignore covers common artifacts—dependencies, build outputs, coverage, logs, env files, caches, IDE files, macOS metadata, and Agentuity-specific files. This establishes a clean repo surface and prevents accidental check-ins of transient or sensitive files.

agent-docs/biome.json (1)

1-27: Validated Biome configuration

The Biome setup is well-structured: imports are organized, linting is enabled with recommended rules, formatting rules enforce 2-space indentation, single quotes, ES5 trailing commas, and mandatory semicolons for JS, and the .agentuity folder is correctly excluded.

tsconfig.json (1)

28-28: Confirm exclusion of the agent-docs subproject

Excluding "agent-docs" from the root TS build is correct since it has its own tsconfig.json. Ensure that the subproject’s config covers all intended files.

agent-docs/.cursor/rules/agentuity.mdc (1)

1-10: File is configuration-only – no issues detected

Nothing actionable surfaced in this cursor rule.

agent-docs/package.json (1)

3-5: main points to a file that is never emitted

tsconfig.json sets "noEmit": true, and your runtime entry in the start script is .agentuity/index.js. The "main": "index.js" field therefore advertises a file that doesn’t exist in the published package, breaking consumers that import/require it.

Options:

Remove the main field entirely if the package is private and only invoked through agentuity.

Or point it to the bundled artifact (.agentuity/index.js) created in prestart.

agent-docs/tsconfig.json (1)

20-22: Non-existent type reference

"types": ["@types/bun", "@agentuity/sdk"] – only the first entry resolves to a definitely-typed package. Unless the SDK ships its own types entry-point, this directive will fail type-resolution in editors.

Confirm the presence of node_modules/@agentuity/sdk/index.d.ts; otherwise, drop it or add a types export in the SDK.

agent-docs/agentuity.yaml (1)

53-57: Resource limits look suspiciously low for embedding + vector-store workloads

350 Mi memory / 500 m CPU may be insufficient when:

loading the OpenAI SDK & streaming embeddings;

holding ~1 k embeddings in memory during upserts.

Please benchmark a realistic sync (e.g. full docs refresh) and tune these values to avoid OOM kills or throttling.
Consider starting at 1 Gi memory & 1 CPU.

agent-docs/src/agents/doc-processing/docs-processor.ts (1)

43-47: Verify VectorUpsertParams field name

Most vector DB SDKs expect the embedding under values or vector, not embeddings.
Double-check the Agentuity SDK; otherwise upserts will fail at runtime.

agent-docs/src/agents/doc-processing/index.ts (1)

66-75: Error response leaks nothing sensitive – good practice

Catching unknown and returning the message while still logging stack traces keeps the external API clean. Nice.

agent-docs/src/agents/doc-processing/test/chunk-mdx.test.ts (1)

1-4: Test suite is tied to Bun – portability concern

bun:test is great locally, but CI & most developers default to Node + Vitest/Jest. Unless the entire repo standardises on Bun, consider exporting test helpers so the logic can be executed under any runner, or add a Node-based parallel config to avoid fragmenting the toolchain.

agent-docs/src/agents/doc-processing/keyword-extraction.ts (2)

31-38: Prompt may exceed model context for large chunks.

chunkContent is injected verbatim; a large chunk could breach token limits and hard-fail the request. Consider truncating or recursively splitting very long chunks before calling the LLM.
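
One way to act on this is to split oversized chunks before building the prompt. The sketch below uses a character budget as a rough stand-in for a token limit, which is an assumption; a real implementation would count tokens with the model's tokenizer. The function name `splitToBudget` is hypothetical.

```typescript
// Split text into pieces of at most maxChars characters, preferring to
// break at whitespace and falling back to a hard cut for long words.
function splitToBudget(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const rest = text.trim();
  if (rest === "") return [];
  if (rest.length <= maxChars) return [rest];
  // Look one character past the budget so a break exactly at the limit counts.
  const window = rest.slice(0, maxChars + 1);
  const lastSpace = window.search(/\s\S*$/);
  const cut = lastSpace > 0 ? lastSpace : maxChars;
  return [rest.slice(0, cut).trimEnd(), ...splitToBudget(rest.slice(cut), maxChars)];
}
```

Each piece can then be sent to the keyword extractor separately and the resulting keywords merged. The recursion is one level per piece, so very large inputs would be better served by an iterative version.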

68-71: Return keywords even when extraction fails to keep schema stable.

If no keywords are extracted, return an empty array rather than undefined to avoid downstream undefined.map errors.
agent-docs/src/agents/doc-processing/chunk-mdx.ts (1)

44-46: List-item regex has incorrect precedence, causing false positives.

/^[-*+]\s+|^\d+\.\s+/ means “bullet or start-of-line digit” – the anchors only apply to the first alternative. Wrap the alternation:
- const listLines = lines.filter(line => /^[-*+]\s+|^\d+\.\s+/.test(line.trim()));
+ const listLines = lines.filter(line =>
+   /^([-*+]|\d+\.)\s+/.test(line.trim())
+ );
Likely an incorrect or invalid review comment.
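
Consistent with that note, both alternatives in the original pattern already carry their own `^` anchor, so the two patterns should accept the same lines and the grouped form is only a readability change. A quick check on a few sample lines, hand-picked here for illustration:

```typescript
const original = /^[-*+]\s+|^\d+\.\s+/;
const grouped = /^([-*+]|\d+\.)\s+/;

const samples = [
  "- item",
  "* item",
  "+ item",
  "1. item",
  "42. item",
  "text 1. item", // digit not at start of line
  "a - b", // bullet character mid-line
  "-no space",
  "1.no space",
];

// Collect any sample on which the two patterns disagree.
const disagreements = samples.filter((s) => original.test(s) !== grouped.test(s));
console.log(disagreements); // prints [] for these samples
```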

agent-docs/.editorconfig

agent-docs/src/agents/doc-processing/embed-chunks.ts

agent-docs/agentuity.yaml

agent-docs/src/agents/doc-processing/docs-processor.ts

.github/workflows/@sync-docs.yml

agent-docs/src/agents/doc-processing/keyword-extraction.ts

agent-docs/src/agents/doc-processing/chunk-mdx.ts

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Seng Rith <50646727+afterrburn@users.noreply.github.com>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

.github/workflows/sync-docs.yml (2)
65-71: Fix JSON payload quoting to avoid syntax errors with embedded quotes.

Wrapping ${{ steps.files.outputs.payload }} in single quotes will break valid JSON containing single quotes. Use printf with double quotes or a heredoc:
- echo '${{ steps.files.outputs.payload }}' | jq '.'
+ printf '%s\n' "${{ steps.files.outputs.payload }}" | jq '.'

...
- curl https://agentuity.ai/... \
-   -d '${{ steps.files.outputs.payload }}'
+ curl https://agentuity.ai/... \
+   --data "${{ steps.files.outputs.payload }}"
34-41: Safely iterate over changed files to handle spaces in filenames.

The for f in $CHANGED_FILES; do loop splits on IFS (spaces) and will break on filenames containing spaces or special characters. Consider switching to a null-terminated approach:
- CHANGED_FILES=$(git diff --name-only ${{ github.event.before }} ${{ github.sha }} -- 'content/**/*.mdx' | sed 's|^content/||')
+ CHANGED_FILES=$(git diff --name-only -z ${{ github.event.before }} ${{ github.sha }} -- 'content/**/*.mdx' \
+   | sed -z 's|^content/||' \
+   | tr '\0' '\n')

...
-   for f in $CHANGED_FILES; do
+   while IFS= read -r f; do

🧹 Nitpick comments (2)

.github/workflows/sync-docs.yml (2)

20-27: Remove trailing whitespace in YAML.

YAML-lint flags trailing spaces on lines 20, 24, 27, 55, and 66. Please trim these to comply with YAML standards.

Also applies to: 55-66

71-71: Ensure newline at end of file.

Add a newline character at EOF to satisfy the new-line-at-end-of-file rule.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1599923 and c3f2303.

📒 Files selected for processing (1)

.github/workflows/sync-docs.yml (1 hunks)

🧰 Additional context used

🪛 YAMLlint (1.37.1)

.github/workflows/sync-docs.yml

[error] 20-20: trailing spaces

(trailing-spaces)

[error] 24-24: trailing spaces

(trailing-spaces)

[error] 27-27: trailing spaces

(trailing-spaces)

[error] 55-55: trailing spaces

(trailing-spaces)

[error] 66-66: trailing spaces

(trailing-spaces)

[error] 71-71: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Workers Builds: docs

…gentuity/docs into srith/agent-391-doc-processor

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

agent-docs/src/agents/doc-processing/docs-orchestrator.ts (1)

9-24: Still capped at 1 000 vectors – previous feedback not addressed
The hard‐coded limit: 1000 means vectors beyond that window are silently left behind. This was already pointed out in an earlier review; please loop until no results remain or use a metadata bulk-delete API if available.
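
A sketch of the "loop until no results remain" approach. The `PagedStore` interface is hypothetical and is not the Agentuity SDK API; it only assumes a search that returns at most `limit` matches and a per-key delete.

```typescript
// Hypothetical store shape for illustration only.
interface PagedStore {
  search(path: string, limit: number): Promise<{ key: string }[]>;
  delete(key: string): Promise<void>;
}

// Keep fetching and deleting pages until a search comes back empty.
async function deleteAllForPath(
  store: PagedStore,
  path: string,
  pageSize = 1000,
): Promise<number> {
  let deleted = 0;
  for (;;) {
    const page = await store.search(path, pageSize);
    if (page.length === 0) return deleted;
    for (const { key } of page) {
      await store.delete(key);
      deleted++;
    }
  }
}
```

This relies on deletes being visible to the next search. If the store is eventually consistent, the loop may see already-deleted keys again and would need a retry cap or a set of seen keys.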

🧹 Nitpick comments (3)

agent-docs/src/agents/doc-processing/docs-orchestrator.ts (2)

19-21: Delete vectors in parallel or by bulk API for big docs
Sequential await ctx.vector.delete(...) can take minutes when thousands of vectors exist. If the store supports batch deletion or if you can Promise.all() deletes in chunks (e.g. groups of 50-100), overall sync time will drop dramatically.
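
If no bulk API is available, the chunked `Promise.all()` idea can be expressed with a small generic helper. The helper name `inBatches` is hypothetical; the batch size would need tuning against the store's rate limits.

```typescript
// Run an async task over items, at most batchSize at a time.
async function inBatches<T>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<void>,
): Promise<void> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Each batch runs concurrently; the next batch waits for this one.
    await Promise.all(batch.map((item) => task(item)));
  }
}
```

For example, `await inBatches(keys, 50, (key) => ctx.vector.delete(VECTOR_STORE_NAME, key))` would mirror the delete call quoted above, assuming that call accepts a single key. `Promise.all` rejects on the first failure, so a caller that wants to continue past individual errors would use `Promise.allSettled` instead.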

64-71: Consider bulk upsert for throughput
Upserting each chunk individually increases latency and API overhead. If the vector store supports it, collect chunks and issue a single upsertMany/batch call.

agent-docs/src/agents/doc-processing/index.ts (1)

23-25: Blind type-cast skips structural validation
await req.data.json() as unknown as SyncPayload forces the compiler to trust the payload shape. Using a runtime schema (e.g. zod, io-ts) would remove the need for manual field checks below and catch extra/invalid properties early.
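
Where adding a schema library is not wanted, a hand-written type guard gives the same runtime guarantee without the blind cast. The `SyncPayload` shape below is assumed from the fields named in the schema suggestion earlier in this review; the actual interface in types.ts may differ.

```typescript
// Assumed payload shape; check against the real types.ts before use.
interface SyncPayload {
  commit?: string;
  repo?: string;
  changed: { path: string; content: string }[];
  removed: string[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Narrow an unknown JSON value to SyncPayload by checking every field.
function isSyncPayload(value: unknown): value is SyncPayload {
  if (!isRecord(value)) return false;
  const { commit, repo, changed, removed } = value;
  if (commit !== undefined && typeof commit !== "string") return false;
  if (repo !== undefined && typeof repo !== "string") return false;
  if (!Array.isArray(changed) || !Array.isArray(removed)) return false;
  return (
    changed.every(
      (f) => isRecord(f) && typeof f.path === "string" && typeof f.content === "string",
    ) && removed.every((p) => typeof p === "string")
  );
}
```

The handler can then read the body as `unknown`, call `isSyncPayload`, and return a 400-style response when it is false. Unlike a schema library, this guard reports only pass or fail, not which field was wrong.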

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3f2303 and a0b7960.

📒 Files selected for processing (3)

agent-docs/src/agents/doc-processing/docs-orchestrator.ts (1 hunks)
agent-docs/src/agents/doc-processing/index.ts (2 hunks)
agent-docs/src/agents/doc-processing/types.ts (1 hunks)

✅ Files skipped from review due to trivial changes (1)

agent-docs/src/agents/doc-processing/types.ts

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Workers Builds: docs

coderabbitai · 2025-06-18T03:39:41Z

agent-docs/src/agents/doc-processing/docs-orchestrator.ts

+      for (const chunk of chunks) {
+        chunk.metadata = {
+          ...chunk.metadata,
+          path: logicalPath,
+        };
+        await ctx.vector.upsert(VECTOR_STORE_NAME, chunk);


⚠️ Potential issue

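chunk.metadata may be undefined – make the default explicit
If processDoc returns chunks without a metadata object, chunk.metadata is undefined at this point. Object spread of undefined is a no-op in JavaScript rather than a TypeError, so this does not crash at runtime, but an explicit default makes the intent clear: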

- chunk.metadata = {
-   ...chunk.metadata,
-   path: logicalPath,
- };
+ chunk.metadata = {
+   ...(chunk.metadata ?? {}),
+   path: logicalPath,
+ };

🤖 Prompt for AI Agents

In agent-docs/src/agents/doc-processing/docs-orchestrator.ts around lines 64 to 69, the code spreads chunk.metadata, which may be undefined. Default it to an empty object before spreading (chunk.metadata ?? {}) so the merge is explicit when metadata is missing.
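
A small sketch of the merge that applies the explicit default and also returns a new object instead of mutating the loop variable. The `Chunk` shape and the `withPath` name are assumptions for illustration; the real upsert parameter type comes from the SDK.

```typescript
// Assumed minimal chunk shape for illustration.
interface Chunk {
  key: string;
  metadata?: Record<string, unknown>;
}

// Return a copy of the chunk with `path` merged into its metadata.
function withPath(chunk: Chunk, path: string): Chunk {
  return { ...chunk, metadata: { ...(chunk.metadata ?? {}), path } };
}
```

Because the input chunk is left untouched, the same chunk object can safely be reused elsewhere in the pipeline.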

agent-docs/src/agents/doc-processing/docs-orchestrator.ts

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (2)

.github/workflows/@sync-docs.yml (2)

17-26: Diff logic misses branch and PR changes
Comparing only the last two commits on origin/main ignores feature-branch and PR diffs. Use GitHub-provided SHAs (github.event.before/github.event.after or PR head.sha/base.sha) to capture changes on the current ref.

73-76: Leak-prone hard-coded webhook URL
Store the webhook URL in an encrypted secret (e.g., AGENTUITY_WEBHOOK_URL) and reference ${{ secrets.AGENTUITY_WEBHOOK_URL }} to avoid exposing credentials.

🧹 Nitpick comments (3)

.github/workflows/sync-docs-full.yml (2)

3-5: Restrict full sync trigger
Triggering a full re-sync on every push can overload CI and the vector store; scope push events to main or use a schedule for periodic full syncs.

18-18: Clean up trailing whitespace and EOF newline
Multiple lines have trailing spaces and there's no final newline, causing YAML lint errors. Please remove trailing whitespace and add a newline at EOF.

Also applies to: 22-22, 29-29, 33-33, 41-41, 48-48, 53-53, 62-62, 65-65, 68-68, 72-72, 82-82, 87-87

.github/workflows/@sync-docs.yml (1)

28-28: Remove trailing spaces and ensure EOF newline
YAML lint reports multiple trailing spaces and a missing final newline. Please clean up whitespace and add a newline at EOF.

Also applies to: 32-32, 35-35, 63-63, 72-72, 76-76

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0b7960 and 48600eb.

📒 Files selected for processing (2)

.github/workflows/@sync-docs.yml (1 hunks)
.github/workflows/sync-docs-full.yml (1 hunks)

🧰 Additional context used

🪛 actionlint (1.7.7)

.github/workflows/sync-docs-full.yml

10-10: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue (action)

.github/workflows/@sync-docs.yml

10-10: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue (action)

🪛 YAMLlint (1.37.1)

.github/workflows/sync-docs-full.yml

[error] 18-18: trailing spaces (trailing-spaces)
[error] 22-22: trailing spaces (trailing-spaces)
[error] 29-29: trailing spaces (trailing-spaces)
[error] 33-33: trailing spaces (trailing-spaces)
[error] 41-41: trailing spaces (trailing-spaces)
[error] 48-48: trailing spaces (trailing-spaces)
[error] 53-53: trailing spaces (trailing-spaces)
[error] 62-62: trailing spaces (trailing-spaces)
[error] 65-65: trailing spaces (trailing-spaces)
[error] 68-68: trailing spaces (trailing-spaces)
[error] 72-72: trailing spaces (trailing-spaces)
[error] 82-82: trailing spaces (trailing-spaces)
[error] 87-87: no new line character at the end of file (new-line-at-end-of-file)
[error] 87-87: trailing spaces (trailing-spaces)

.github/workflows/@sync-docs.yml

[error] 28-28: trailing spaces (trailing-spaces)
[error] 32-32: trailing spaces (trailing-spaces)
[error] 35-35: trailing spaces (trailing-spaces)
[error] 63-63: trailing spaces (trailing-spaces)
[error] 72-72: trailing spaces (trailing-spaces)
[error] 76-76: no new line character at the end of file (new-line-at-end-of-file)
[error] 76-76: trailing spaces (trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Workers Builds: docs

.github/workflows/sync-docs-full.yml

.github/workflows/@sync-docs.yml

coderabbitai · 2025-06-18T04:01:23Z

.github/workflows/@sync-docs.yml

+name: Sync Docs to Vector Store (PR & Push)
+
+on:
+  workflow_dispatch


⚠️ Potential issue

Fix workflow_dispatch syntax
The workflow_dispatch event declaration must include a colon (workflow_dispatch:). Without it, the workflow won’t trigger.

🤖 Prompt for AI Agents

In .github/workflows/@sync-docs.yml at line 4, the workflow_dispatch event is missing a colon. Add a colon after workflow_dispatch to correct the syntax, changing it from "workflow_dispatch" to "workflow_dispatch:" so the workflow triggers properly.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Seng Rith <50646727+afterrburn@users.noreply.github.com>

afterrburn added 11 commits June 14, 2025 11:21

agentuity create init

0dc435b

delete bun.lock from root

003951b

remove root package langchain

9153b91

make build ignore agent-docs

610d803

add chunking logic

f9dab40

create document processor with keyword extractions

be26232

add design doc for easy reference for the LLM

d39127d

split doc processing into a new doc orchestrator for better modularity

a93bb39

Refactor document processing to enhance chunk structure and improve metadata handling

2c119d1

create github action to sync doc when PR merged to main

2a6d2ee

allow full reload option when needed

345c843

afterrburn added 10 commits June 16, 2025 07:47

update curl destination in github action

8d98f8d

use env var for db config

6a9dec1

code clean up

40afb44

add type safety to variables

3421fdd

update todo and design doc

4d30d18

test sync docs action

0de75c5

update test doc file

77a6a7f

fix test yaml

a6ae127

change orchestrator behavior to directly take content to remove path dependency

44eaee9

add current time to chunk metadata

9546e19

Base automatically changed from seng/create-agent-docs to main June 17, 2025 14:01

Merge branch 'main' into srith/agent-391-doc-processor

b52980b

coderabbitai bot requested review from jhaynie and rblalock June 17, 2025 14:04

coderabbitai bot reviewed Jun 17, 2025

afterrburn added 2 commits June 17, 2025 08:14

merge

cea4ce2

add deps

1599923

Update .github/workflows/sync-docs.yml

c3f2303

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Seng Rith <50646727+afterrburn@users.noreply.github.com>

coderabbitai bot requested a review from rblalock June 17, 2025 14:45

coderabbitai bot reviewed Jun 17, 2025

afterrburn added 3 commits June 17, 2025 21:34

simplify payload of the request

f622235

:Merge branch 'srith/agent-391-doc-processor' of https://github.com/agentuity/docs into srith/agent-391-doc-processor

a0b7960

test full doc upload

49692d5

coderabbitai bot reviewed Jun 18, 2025

afterrburn added 7 commits June 17, 2025 21:41

update

aa21c13

fix sync docs

008cbd8

test

3554f08

fix full sync

286e4e1

another test

eab5770

another test fix

cdc8f9e

test

48600eb

coderabbitai bot reviewed Jun 18, 2025

afterrburn and others added 7 commits June 17, 2025 22:04

get code ready for production

5f24647

Update agent-docs/src/agents/doc-processing/embed-chunks.ts

b84f041

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Seng Rith <50646727+afterrburn@users.noreply.github.com>

Update agent-docs/src/agents/doc-processing/embed-chunks.ts

dacd20a

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Seng Rith <50646727+afterrburn@users.noreply.github.com>

bump vector search result

388a818

loop on clearing vector

151347f

catch potential corrupted base64 encode

6cc9f21

update full sync yml for coderabbit

e48edbd

afterrburn merged commit 8d70955 into main Jun 19, 2025
2 checks passed

afterrburn deleted the srith/agent-391-doc-processor branch June 19, 2025 01:34

coderabbitai bot mentioned this pull request Jun 19, 2025

Seng/fix gh action #198

Merged

This was referenced Jun 28, 2025

QA Agent Interface #208

Merged

Fix Bugs from QA Docs Feedback #218

Merged

coderabbitai bot mentioned this pull request Jul 29, 2025

add Guides pages and update screenshots #250

Merged

coderabbitai bot mentioned this pull request Sep 5, 2025

Seng/chat prototype #279

Merged

coderabbitai bot mentioned this pull request Sep 24, 2025

Training docs #284

Open
