Merged
42 commits
0dc435b
agentuity create init
Jun 14, 2025
003951b
delete bun.lock from root
Jun 14, 2025
9153b91
remove root package langchain
Jun 14, 2025
610d803
make build ignore agent-docs
Jun 14, 2025
f9dab40
add chunking logic
Jun 14, 2025
be26232
create document processor with keyword extractions
Jun 14, 2025
d39127d
add design doc for easy reference for the LLM
Jun 14, 2025
a93bb39
split doc processing into a new doc orchestrator for better modularity
Jun 15, 2025
2c119d1
Refactor document processing to enhance chunk structure and improve m…
Jun 15, 2025
2a6d2ee
create github action to sync doc when PR merged to main
Jun 15, 2025
345c843
allow full reload option when needed
Jun 15, 2025
8d98f8d
update curl destination in github action
Jun 16, 2025
6a9dec1
use env var for db config
Jun 16, 2025
40afb44
code clean up
Jun 16, 2025
3421fdd
add type safety to variables
Jun 16, 2025
4d30d18
update todo and design doc
Jun 16, 2025
0de75c5
test sync docs action
Jun 17, 2025
77a6a7f
update test doc file
Jun 17, 2025
a6ae127
fix test yaml
Jun 17, 2025
44eaee9
change orchestrator behavior to directly take content to remove path …
Jun 17, 2025
9546e19
add current time to chunk metadata
Jun 17, 2025
b52980b
Merge branch 'main' into srith/agent-391-doc-processor
Jun 17, 2025
cea4ce2
merge
Jun 17, 2025
1599923
add deps
Jun 17, 2025
c3f2303
Update .github/workflows/sync-docs.yml
afterrburn Jun 17, 2025
f622235
simplify payload of the request
Jun 18, 2025
a0b7960
:Merge branch 'srith/agent-391-doc-processor' of https://github.com/a…
Jun 18, 2025
49692d5
test full doc upload
Jun 18, 2025
aa21c13
update
Jun 18, 2025
008cbd8
fix sync docs
Jun 18, 2025
3554f08
test
Jun 18, 2025
286e4e1
fix full sync
Jun 18, 2025
eab5770
another test
Jun 18, 2025
cdc8f9e
another test fix
Jun 18, 2025
48600eb
test
Jun 18, 2025
5f24647
get code ready for production
Jun 18, 2025
b84f041
Update agent-docs/src/agents/doc-processing/embed-chunks.ts
afterrburn Jun 18, 2025
dacd20a
Update agent-docs/src/agents/doc-processing/embed-chunks.ts
afterrburn Jun 18, 2025
388a818
bump vector search result
Jun 18, 2025
151347f
loop on clearing vector
Jun 18, 2025
6cc9f21
catch potential corrupted base64 encode
Jun 18, 2025
e48edbd
update full sync yml for coderabbit
Jun 18, 2025
58 changes: 58 additions & 0 deletions .github/workflows/sync-docs-full.yml
@@ -0,0 +1,58 @@
name: Full Docs Sync to Vector Store

on:
  workflow_dispatch:

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Get all MDX files and prepare payload
        id: files
        run: |
          # First find all MDX files recursively
          echo "Finding all MDX files..."
          find content -type f -name "*.mdx" | sed 's|^content/||' > mdx_files.txt
          echo "Found files:"
          cat mdx_files.txt

          # Create the changed array by processing each file through jq
          echo "Processing files..."
          jq -n --slurpfile paths <(
            while IFS= read -r path; do
              [ -z "$path" ] && continue
              if [ -f "content/$path" ]; then
                echo "Processing: content/$path" >&2  # log to stderr so the message is not piped into jq -s
                jq -n \
                  --arg path "$path" \
                  --arg content "$(base64 -w0 < "content/$path")" \
                  '{path: $path, content: $content}'
              fi
            done < mdx_files.txt | jq -s '.'
          ) \
          --slurpfile removed <(cat mdx_files.txt | jq -R . | jq -s .) \
          --arg repo "$GITHUB_REPOSITORY" \
          '{
            repo: $repo,
            changed: ($paths | .[0] // []),
            removed: ($removed | .[0] // [])
          }' > payload.json

          # Show debug info
          echo "Payload structure (without contents):"
          jq 'del(.changed[].content)' payload.json

      - name: Send to Agentuity
        run: |
          echo "About to sync these files:"
          jq -r '.changed[].path' payload.json
          echo -e "\nWill first remove these paths:"
          jq -r '.removed[]' payload.json

          # Send the payload to the Agentuity webhook
          curl https://agentuity.ai/webhook/f61d5ce9d6ed85695cc992c55ccdc2a6 \
            -X POST \
            -H "Content-Type: application/json" \
            -d @payload.json
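
The workflow above encodes each file with `base64 -w0` so that arbitrary MDX content (quotes, backslashes, newlines) fits into a single JSON string, which the receiving side then has to decode. The following is a minimal sketch of that round trip, assuming GNU coreutils `base64` as found on `ubuntu-latest`; the sample file and its contents are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample document standing in for a file under content/.
tmp=$(mktemp -d)
printf '# Title\n\nBody with "quotes" and a backslash \\ here.\n' > "$tmp/sample.mdx"

# Encode without line wrapping (-w0) so the result is a single line,
# which is what the workflow passes to jq via --arg content.
encoded=$(base64 -w0 < "$tmp/sample.mdx")

# Decode the way a consumer of the payload would.
printf '%s' "$encoded" | base64 -d > "$tmp/decoded.mdx"

if cmp -s "$tmp/sample.mdx" "$tmp/decoded.mdx"; then
  echo "round trip ok"
else
  echo "decoded content differs from the original" >&2
fi
```

Comparing files with `cmp` rather than comparing shell variables avoids the trailing-newline stripping that command substitution performs.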
71 changes: 71 additions & 0 deletions .github/workflows/sync-docs.yml
@@ -0,0 +1,71 @@
name: Sync Docs to Vector Store

on:
  push:
    branches:
      - main
    paths:
      - 'content/**'

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Get changed and removed files
        id: files
        run: |
          git fetch origin ${{ github.event.before }}

          # Get changed files (relative to content directory)
          CHANGED_FILES=$(git diff --name-only ${{ github.event.before }} ${{ github.sha }} -- 'content/**/*.mdx' | sed 's|^content/||')
          REMOVED_FILES=$(git diff --name-only --diff-filter=D ${{ github.event.before }} ${{ github.sha }} -- 'content/**/*.mdx' | sed 's|^content/||')

Comment on lines +16 to +24

🛠️ Refactor suggestion

Trailing spaces & shell quoting – both break POSIX compliance

Lines 20-24 contain stray spaces that YAML-lint already highlighted. More importantly, $CHANGED_FILES iteration will explode on filenames containing spaces or $IFS characters.

-          for f in $CHANGED_FILES; do
+          IFS=$'\n'       # iterate safely, preserve spaces
+          for f in $CHANGED_FILES; do
             …
           done
+          unset IFS

Cleaning whitespace plus safe iteration prevents subtle sync failures.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 20-20: trailing spaces

(trailing-spaces)


[error] 24-24: trailing spaces

(trailing-spaces)

🤖 Prompt for AI Agents
In .github/workflows/sync-docs.yml around lines 16 to 24, remove trailing spaces
from the shell script and update the iteration over CHANGED_FILES and
REMOVED_FILES to handle filenames safely by properly quoting variables and using
a null character delimiter with git diff and read loops. This prevents issues
with filenames containing spaces or special characters and ensures
POSIX-compliant, robust file handling.
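
The NUL-delimited approach described in the prompt above can be sketched as follows. This is an illustration rather than the PR's code: `collect_paths` is a hypothetical helper, `read -d ''` is a bash feature (not strictly POSIX), and a fixed sample stands in for the `git diff` output so the snippet runs outside a repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read NUL-delimited paths from stdin into the
# global array PATHS, stripping the leading "content/" prefix.
collect_paths() {
  PATHS=()
  local path
  while IFS= read -r -d '' path; do
    PATHS+=("${path#content/}")
  done
}

# In the workflow the input would come from something like:
#   git diff --name-only -z "$BEFORE" "$AFTER" -- 'content/**/*.mdx'
# ($BEFORE and $AFTER are placeholders for the two commit SHAs.)
# Process substitution keeps the loop in the current shell, so PATHS
# is still set afterwards.
collect_paths < <(printf '%s\0' 'content/guides/getting started.mdx' 'content/index.mdx')

# Quoting the array expansion keeps each path intact, spaces included.
for f in "${PATHS[@]}"; do
  printf 'changed: %s\n' "$f"
done
```

Compared with setting `IFS=$'\n'`, NUL delimiters also cope with file names that contain newlines, at the cost of requiring bash.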

          echo "Changed files: $CHANGED_FILES"
          echo "Removed files: $REMOVED_FILES"

          # Build JSON payload with file contents
          payload=$(jq -n \
            --arg commit "${{ github.sha }}" \
            --arg repo "${{ github.repository }}" \
            --argjson changed "$(
              if [ -n "$CHANGED_FILES" ]; then
                for f in $CHANGED_FILES; do
                  if [ -f "content/$f" ]; then
                    jq -n \
                      --arg path "$f" \
                      --arg content "$(base64 -w0 < "content/$f")" \
                      '{path: $path, content: $content}'
                  fi
                done | jq -s '.'
              else
                echo '[]'
              fi
            )" \
            --argjson removed "$(
              if [ -n "$REMOVED_FILES" ]; then
                printf '%s\n' $REMOVED_FILES | jq -R -s -c 'split("\n") | map(select(length > 0))'
              else
                echo '[]'
              fi
            )" \
            '{commit: $commit, repo: $repo, changed: $changed, removed: $removed}'
          )

          echo "payload<<EOF" >> $GITHUB_OUTPUT
          echo "$payload" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT

      - name: Trigger Agentuity Sync Agent
        env:
          AGENTUITY_TOKEN: ${{ secrets.AGENTUITY_TOKEN }}
        run: |
          echo "Sending payload to agent:"
          echo '${{ steps.files.outputs.payload }}' | jq '.'

          curl https://agentuity.ai/webhook/f61d5ce9d6ed85695cc992c55ccdc2a6 \
            -X POST \
            -H "Authorization: Bearer $AGENTUITY_TOKEN" \
            -H "Content-Type: application/json" \
            -d '${{ steps.files.outputs.payload }}'
Comment on lines +60 to +71

⚠️ Potential issue

JSON payload may be corrupted by single-quote wrapping

${{ steps.files.outputs.payload }} is injected verbatim before the shell executes. Wrapping it in single quotes makes every ' inside the JSON terminate the quote, resulting in a syntax error. Use double quotes and printf %s or a heredoc.

-          echo '${{ steps.files.outputs.payload }}' | jq '.'
+          printf '%s\n' "${{ steps.files.outputs.payload }}" | jq '.'
...
-            -d '${{ steps.files.outputs.payload }}'
+            --data "${{ steps.files.outputs.payload }}"

Also ensure a trailing newline at EOF to silence YAML-lint.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      - name: Trigger Agentuity Sync Agent
-        env:
-          AGENTUITY_TOKEN: ${{ secrets.AGENTUITY_TOKEN }}
-        run: |
-          echo "Sending payload to agent:"
-          echo '${{ steps.files.outputs.payload }}' | jq '.'
-          curl https://agentuity.ai/webhook/f61d5ce9d6ed85695cc992c55ccdc2a6 \
-            -X POST \
-            -H "Authorization: Bearer $AGENTUITY_TOKEN" \
-            -H "Content-Type: application/json" \
-            -d '${{ steps.files.outputs.payload }}'
+      - name: Trigger Agentuity Sync Agent
+        env:
+          AGENTUITY_TOKEN: ${{ secrets.AGENTUITY_TOKEN }}
+        run: |
+          echo "Sending payload to agent:"
+          printf '%s\n' "${{ steps.files.outputs.payload }}" | jq '.'
+          curl https://agentuity.ai/webhook/f61d5ce9d6ed85695cc992c55ccdc2a6 \
+            -X POST \
+            -H "Authorization: Bearer $AGENTUITY_TOKEN" \
+            -H "Content-Type: application/json" \
+            --data "${{ steps.files.outputs.payload }}"
🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 66-66: trailing spaces

(trailing-spaces)


[error] 71-71: no new line character at the end of file

(new-line-at-end-of-file)

🤖 Prompt for AI Agents
In .github/workflows/sync-docs.yml around lines 60 to 71, the JSON payload is
wrapped in single quotes which can break the JSON if it contains single quotes
inside, causing syntax errors. Replace the single quotes around the payload with
double quotes and use printf %s or a heredoc to safely pass the JSON string to
curl. Also, add a trailing newline at the end of the file to satisfy YAML-lint
requirements.
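
Because the expression is substituted into the script text before the shell parses it, double quotes have a similar weakness: JSON is full of `"` characters, which would end the quoted string early. One alternative, sketched below as an assumption rather than as the change adopted in this PR, is to hand the payload to the step through its `env:` block and let the shell see it only as a variable. `PAYLOAD`, `WEBHOOK_URL`, and the sample JSON are hypothetical, and the `curl` call is left commented out so the snippet runs offline.

```shell
#!/usr/bin/env bash
set -euo pipefail

# PAYLOAD stands in for a value set in the step's env: block from the
# earlier step's output. It deliberately contains a single quote and
# double quotes, the characters that break inline interpolation.
PAYLOAD='{"repo":"example/docs","changed":[{"path":"what'\''s-new.mdx","content":"aGk="}],"removed":[]}'

# Write the payload to a file byte for byte; printf '%s' does not
# reinterpret quotes or backslashes in its argument.
payload_file=$(mktemp)
printf '%s' "$PAYLOAD" > "$payload_file"

# The request body would then be read from the file, for example:
#   curl "$WEBHOOK_URL" -X POST \
#     -H "Content-Type: application/json" \
#     --data-binary @"$payload_file"
printf 'payload bytes on disk: %s\n' "$(wc -c < "$payload_file")"
```

For large payloads an environment variable may hit operating-system size limits; writing `payload.json` in the earlier step and sending it with `-d @payload.json`, as the full-sync workflow does, sidesteps that.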

54 changes: 54 additions & 0 deletions agent-docs/RAG-TODO.md
@@ -0,0 +1,54 @@
# RAG System Implementation TODOs

## 1. Document Chunking & Metadata
- [x] Refine and test the chunking logic for MDX files.
- [x] Implement full metadata enrichment (id, path, chunkIndex, contentType, heading, keywords) in the chunking/processing pipeline.
- [x] Write unit tests for chunking and metadata extraction.

## 2. Keyword Extraction
- [x] Implement LLM-based keyword extraction for each chunk.
- [x] Write tests to validate keyword extraction quality.
- [ ] Integrate keyword extraction into the document processing pipeline.

## 3. Embedding Generation
- [x] Implement embedding function for batch processing of chunk texts (using OpenAI SDK or Agentuity vector store as appropriate).
- [x] Integrate embedding generation into the chunk processing pipeline.
- [ ] Write tests to ensure embeddings are generated and stored correctly.

## 4. Vector Store Integration
- [x] Set up Agentuity vector database integration.
- [x] Store chunk content, metadata, keywords, and embeddings.

## 5. Hybrid Retrieval Logic
- [ ] Implement hybrid search (semantic + keyword boosting).
- [ ] Write tests to ensure correct ranking and recall.

## 6. Reranker Integration
- [ ] Integrate reranker model (API or local).
- [ ] Implement reranking step after hybrid retrieval.
- [ ] Write tests to validate reranker improves result quality.

## 7. API Layer
- [ ] Build modular API endpoints for search and retrieval.
- [ ] Ensure endpoints are stateless and testable.
- [ ] Write API tests (unit and integration).

## 8. UI Integration
- [ ] Add search bar and results display to documentation site.
- [ ] Implement keyword highlighting and breadcrumb navigation.
- [ ] Write UI tests for search and result presentation.

## 9. Monitoring & Analytics
- [ ] Add logging for search queries and result quality.
- [ ] Implement feedback mechanism for users to rate results.

## 10. Documentation & Developer Experience
- [ ] Document each module and its tests.
- [ ] Provide clear setup and usage instructions.

## 11. Sync/Processor Workflow Design
- [x] Design the documentation sync workflow:
  - [x] Primary: Trigger sync via CI/CD or GitHub Action after merges to main/deploy branch.
  - [x] Optional: Implement a webhook endpoint for manual or CMS-triggered syncs.
- [x] Ensure the sync process is idempotent and efficient (only updates changed docs/chunks).
- [x] Plan for operational workflow implementation after core modules are complete.