
fix: URL-based fallback for search result number and repo extraction #2379

Merged
lpcox merged 7 commits into main from fix/search-result-filtering
Mar 23, 2026

@lpcox
Collaborator

@lpcox lpcox commented Mar 23, 2026

Problem

Search results from the MCP gateway were getting filtered with none integrity because items lacked number and base.repo.full_name fields that the guard needs for per-item integrity labeling and enrichment.

Symptoms:

  • search_issues: items labeled issue:github/gh-aw-mcpg#unknown with integrity: ["none:github"]
  • search_pull_requests: items labeled pr:#unknown with integrity: ["none"]

Root Cause

  1. Missing number field: extract_resource_number() only checked item.number — search result items from the MCP server may not include this field directly
  2. Missing repo for PRs: response_items.rs PR handler only checked base.repo.full_name / head.repo.full_name — search results don't have this structure
  3. Blocked enrichment: Without a number, REST enrichment calls couldn't proceed, leaving items with empty integrity
  4. No tool_args fallback: search tools didn't check owner/repo in tool_args when query string extraction failed

Changes

helpers.rs — URL-based fallback extraction

  • extract_resource_number(): Falls back to parsing html_url/url for a trailing number (e.g. .../issues/2093 → 2093)
  • New extract_number_from_url() helper
  • pr_integrity() / issue_integrity(): Enrichment now uses URL-based number fallback
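
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from helpers.rs: the `Item` struct is a hypothetical stand-in for the JSON item the guard actually inspects, the signatures are assumed, and the order html_url before url is an assumption based on the wording of the bullet.

```rust
/// Hypothetical sketch: pull a trailing numeric path segment out of a URL.
fn extract_number_from_url(url: &str) -> Option<u64> {
    // Drop any query string or fragment before looking at path segments.
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    path.trim_end_matches('/')
        .rsplit('/')
        .next()
        .and_then(|segment| segment.parse::<u64>().ok())
}

/// Stand-in for a search result item (assumption: the real guard reads JSON).
struct Item<'a> {
    number: Option<u64>,
    html_url: Option<&'a str>,
    url: Option<&'a str>,
}

/// Prefer the explicit number, then html_url, then url; otherwise "unknown".
fn extract_resource_number(item: &Item) -> String {
    item.number
        .or_else(|| item.html_url.and_then(extract_number_from_url))
        .or_else(|| item.url.and_then(extract_number_from_url))
        .map(|n| n.to_string())
        .unwrap_or_else(|| "unknown".to_string())
}
```

With this shape, an item that has no number field but carries an html_url ending in /issues/2093 is labeled with 2093 instead of #unknown, which is what unblocks enrichment.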

response_items.rs — PR repo fallback

  • PR items now fall back to extract_repo_from_item() (parses repository_url, html_url) when base/head repo info is missing
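
As a rough sketch of the repo fallback, assuming GitHub's usual URL shapes (api.github.com/repos/{owner}/{repo} for repository_url and github.com/{owner}/{repo}/... for html_url): `repo_from_url`, `PrItem`, and `pr_repo` are hypothetical names for illustration and are not the actual extract_repo_from_item() implementation.

```rust
/// Hypothetical sketch: derive "owner/repo" from a GitHub API or web URL.
fn repo_from_url(url: &str) -> Option<String> {
    // Strip the scheme, then walk the non-empty path segments.
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    let mut segments = rest.split('/').filter(|s| !s.is_empty());
    let host = segments.next()?;
    // API URLs carry a leading "repos" segment before the owner.
    if host.starts_with("api.") && segments.next()? != "repos" {
        return None;
    }
    let owner = segments.next()?;
    let repo = segments.next()?;
    Some(format!("{}/{}", owner, repo))
}

/// Stand-in for a PR item (assumption: the real guard reads JSON fields).
struct PrItem<'a> {
    base_repo_full_name: Option<&'a str>,
    head_repo_full_name: Option<&'a str>,
    repository_url: Option<&'a str>,
    html_url: Option<&'a str>,
}

/// Try base/head repo info first, then fall back to URL parsing.
fn pr_repo(item: &PrItem) -> Option<String> {
    item.base_repo_full_name
        .or(item.head_repo_full_name)
        .map(String::from)
        .or_else(|| item.repository_url.and_then(repo_from_url))
        .or_else(|| item.html_url.and_then(repo_from_url))
}
```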

tool_rules.rs — Search scope from tool_args

  • Both search_issues and search_pull_requests now check owner/repo in tool_args as fallback when query extraction fails
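
The scope fallback might look roughly like this. It is a sketch under stated assumptions: tool_args is modeled as a flat string map, the query is assumed to carry a `repo:owner/name` qualifier, and `repo_from_query` and `search_scope` are hypothetical helper names, not functions from tool_rules.rs.

```rust
use std::collections::HashMap;

/// Hypothetical: find the first `repo:owner/name` qualifier in a search query.
fn repo_from_query(query: &str) -> Option<(String, String)> {
    query.split_whitespace().find_map(|token| {
        let (owner, repo) = token.strip_prefix("repo:")?.split_once('/')?;
        if owner.is_empty() || repo.is_empty() {
            return None;
        }
        Some((owner.to_string(), repo.to_string()))
    })
}

/// Prefer the repo named in the query; fall back to owner/repo in tool_args.
fn search_scope(tool_args: &HashMap<String, String>) -> Option<(String, String)> {
    tool_args
        .get("query")
        .and_then(|q| repo_from_query(q))
        .or_else(|| {
            let owner = tool_args.get("owner")?;
            let repo = tool_args.get("repo")?;
            Some((owner.clone(), repo.clone()))
        })
}
```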

Tests

  • 8 new tests (154 total, all passing)
  • URL-based number extraction (direct, html_url, api url, preference, unknown)
  • PR search with repository_url fallback (items + paths)
  • Issue search URL number fallback

Validation

repo-assist workflow configured to build local container image for end-to-end testing.

lpcox and others added 2 commits March 23, 2026 09:14
Search results from the MCP gateway were getting filtered with 'none'
integrity because items lacked 'number' and 'base.repo.full_name' fields.

Changes:
- extract_resource_number: falls back to parsing html_url/url for trailing
  number segment when the number field is missing
- pr_integrity/issue_integrity: enrichment now uses URL-based number
  fallback, enabling REST enrichment for search result items
- response_items.rs: PR items fall back to extract_repo_from_item()
  (parses repository_url, html_url) when base/head repo info is missing
- tool_rules.rs: search_issues and search_pull_requests now check
  owner/repo in tool_args as fallback when query extraction fails
- Added 8 new tests (154 total, all passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Temporarily switch repo-assist to build :local from this branch
to validate the search result filtering fixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 23, 2026 16:16
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses missing per-item integrity labeling for GitHub search results returned by the MCP gateway by adding fallbacks to derive issue/PR numbers and repository identity when key fields are absent.

Changes:

  • Add URL-based number extraction fallback (html_url / url) for issue/PR labeling and backend enrichment.
  • Add repository extraction fallback for PR search results when base/head repo info is missing.
  • Add tool_args (owner/repo) fallback for determining repo scope in search_issues / search_pull_requests tool labeling.
  • Update repo-assist lock workflow to build and run a local gateway container image for validation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| guards/github-guard/rust-guard/src/labels/helpers.rs | Adds URL-based fallback to extract issue/PR numbers and use it for enrichment. |
| guards/github-guard/rust-guard/src/labels/response_items.rs | Adds PR repo fallback extraction for search results missing base/head repo fields. |
| guards/github-guard/rust-guard/src/labels/tool_rules.rs | Adds repo-scope fallback for search tools using owner/repo from tool_args. |
| guards/github-guard/rust-guard/src/labels/mod.rs | Adds unit tests covering new URL/repo extraction behavior. |
| .github/workflows/repo-assist.lock.yml | Builds and uses a :local container image for end-to-end workflow validation. |

Comments suppressed due to low confidence (1)

guards/github-guard/rust-guard/src/labels/tool_rules.rs:252

  • Same issue as search_issues: this branch scopes labels to s_repo_id but leaves baseline_scope and repo_private tied to the initial tool_args-derived repo context. If repo scope is extracted from the query (or tool_args omit owner/repo), ensure_integrity_baseline may strip the scoped integrity and private repos may not receive writer integrity. Recommend setting baseline_scope = s_repo_id.clone() when available and deriving repo_private from s_owner/s_repo for the private-writer check.
            if !s_repo_id.is_empty() {
                desc = format!("search_pull_requests:{}", s_repo_id);
                secrecy =
                    apply_repo_visibility_secrecy(&s_owner, &s_repo, &s_repo_id, secrecy, ctx);
                integrity = private_writer_integrity(&s_repo_id, repo_private, ctx);

Comment on lines +137 to +141
             if !s_repo_id.is_empty() {
                 desc = format!("search_issues:{}", s_repo_id);
                 secrecy =
-                    apply_repo_visibility_secrecy(&q_owner, &q_repo, &q_repo_id, secrecy, ctx);
-                integrity = private_writer_integrity(&q_repo_id, repo_private, ctx);
+                    apply_repo_visibility_secrecy(&s_owner, &s_repo, &s_repo_id, secrecy, ctx);
+                integrity = private_writer_integrity(&s_repo_id, repo_private, ctx);

Copilot AI Mar 23, 2026

In this search_issues branch, integrity/secrecy are scoped to s_repo_id, but baseline_scope and repo_private are still derived from the top-level repo_id / tool_args owner+repo. If the repo scope comes from the query (or differs from tool_args), ensure_integrity_baseline can downgrade scoped integrity to none:*, and private repos may incorrectly get empty integrity. Consider setting baseline_scope = s_repo_id.clone() when present and computing repo_private using s_owner/s_repo for the private-writer decision.

This issue also appears on line 248 of the same file.

See below for a potential fix:

            // Only reuse the top-level repo_private flag when the scoped repo_id matches.
            let scoped_repo_private = if !s_repo_id.is_empty() && s_repo_id == repo_id {
                repo_private
            } else {
                false
            };
            if !s_repo_id.is_empty() {
                desc = format!("search_issues:{}", s_repo_id);
                secrecy =
                    apply_repo_visibility_secrecy(&s_owner, &s_repo, &s_repo_id, secrecy, ctx);
                integrity = private_writer_integrity(&s_repo_id, scoped_repo_private, ctx);

Comment on lines +401 to +407
      - name: Build local container image
        run: |
          # Install Rust toolchain for WASM guard build
          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
          source "$HOME/.cargo/env"
          rustup target add wasm32-wasip1

Copilot AI Mar 23, 2026

This workflow step installs Rust via curl https://sh.rustup.rs | sh and uses an unpinned stable toolchain. That makes the build less reproducible and increases supply-chain risk for the CI environment. Prefer using a pinned toolchain version (e.g., via dtolnay/rust-toolchain or actions-rs/toolchain) and avoid piping curl directly into a shell (or verify the installer/signature).

Suggested change
-      - name: Build local container image
-        run: |
-          # Install Rust toolchain for WASM guard build
-          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
-          source "$HOME/.cargo/env"
-          rustup target add wasm32-wasip1
+      - name: Install Rust toolchain
+        uses: actions-rs/toolchain@v1
+        with:
+          toolchain: 1.76.0
+          target: wasm32-wasip1
+          default: true
+          profile: minimal
+      - name: Build local container image
+        run: |

lpcox and others added 5 commits March 23, 2026 10:27
When search_pull_requests tool_args lack owner/repo (only query
parameter), the guard couldn't determine repo privacy. This caused
private repo items to be treated as public, skipping the automatic
approved-level integrity boost and failing enrichment.

Changes:
- tool_rules.rs: check is_repo_private using search query's repo when
  tool_args-based repo_private is None
- response_items.rs: extract repo from query string for default_repo_private
- response_paths.rs: same query-based repo fallback for both PR and issue
  search response labeling
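
The privacy fallback in this commit can be pictured as below. This is only a sketch: `resolve_repo_private` is a hypothetical name, and the lookup is passed in as a closure because the real is_repo_private signature is not shown in this conversation.

```rust
/// Hypothetical sketch: use the tool_args-based privacy flag when known,
/// otherwise look up the repo named in the search query.
fn resolve_repo_private(
    tool_args_private: Option<bool>,
    query_repo: Option<&str>,
    is_repo_private: impl Fn(&str) -> Option<bool>,
) -> Option<bool> {
    tool_args_private.or_else(|| query_repo.and_then(|repo| is_repo_private(repo)))
}
```

If both sources are missing, the result stays None, so the caller can still tell "unknown" apart from "public".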

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When search APIs return zero results ({"total_count":0} with no items
array), the response was incorrectly treated as a single data item via
the is_object() fallback, producing #unknown number and none integrity.

Similarly, when the MCP server returns a plain-text error message (not
JSON) in content[0].text, extract_mcp_response leaves the MCP wrapper
unchanged, and the wrapper was treated as a single data item.

Root cause: the single-item fallback in response_items.rs only checked
is_graphql_wrapper() but not these two additional wrapper types.

Fix:
- Add is_search_result_wrapper() to detect {total_count:N} objects
- Add is_mcp_text_wrapper() to detect MCP content wrappers with
  non-JSON text
- Both single-item fallback locations now exclude these wrapper types
- Empty search results and text errors produce zero labeled items,
  falling back to resource-level labels from tool_rules
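
A rough sketch of the two wrapper checks follows. Everything here is an assumption made for illustration: `Json` is a tiny stand-in for whatever JSON type the guard uses, "non-JSON text" is approximated by checking the first non-whitespace character, and `treat_as_single_item` is a hypothetical name for the guarded is_object() fallback.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (assumption: the guard uses a JSON library).
enum Json {
    Number(f64),
    Text(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// A search envelope: an object with `total_count` but no `items` array.
fn is_search_result_wrapper(v: &Json) -> bool {
    match v {
        Json::Object(m) => {
            m.contains_key("total_count") && !matches!(m.get("items"), Some(Json::Array(_)))
        }
        _ => false,
    }
}

/// An MCP content wrapper whose content[0].text does not look like JSON.
fn is_mcp_text_wrapper(v: &Json) -> bool {
    let text = match v {
        Json::Object(m) => match m.get("content") {
            Some(Json::Array(parts)) => match parts.first() {
                Some(Json::Object(part)) => match part.get("text") {
                    Some(Json::Text(t)) => t.trim_start(),
                    _ => return false,
                },
                _ => return false,
            },
            _ => return false,
        },
        _ => return false,
    };
    // Crude heuristic (assumption): JSON payloads start with '{' or '['.
    !(text.starts_with('{') || text.starts_with('['))
}

/// The single-item fallback only applies to objects that are not wrappers.
fn treat_as_single_item(v: &Json) -> bool {
    matches!(v, Json::Object(_)) && !is_search_result_wrapper(v) && !is_mcp_text_wrapper(v)
}
```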

Validated against CI run 23451003849 JSONL logs which showed:
- search_pull_requests: {total_count:0,incomplete_results:false}
- search_issues: {total_count:0,incomplete_results:false}
- list_issues page=2: cursor-based pagination text error

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Apply the same is_search_result_wrapper and is_mcp_text_wrapper guards
to all remaining single-item fallback locations in response_items.rs:
- get_file_contents
- list_commits / get_commit
- list_gists / get_gist
- list_releases / get_latest_release / get_release_by_tag

These had unprotected is_object() fallbacks that would treat MCP text
errors or empty search wrappers as data items, producing incorrect
per-item labels.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

MCP text error messages (e.g., pagination guidance) and empty search
results (total_count:0) contain no repository data — they are
server-generated metadata. Previously these fell through to
resource-level labels which gave them 'none' integrity for public
repos, causing DIFC filtering to block the agent from seeing
helpful instructional messages.

Now the fallback in label_response detects server metadata via
is_mcp_text_wrapper and is_search_result_wrapper(total_count==0)
and labels them with approved:<scope> integrity so they pass
through to the agent.

Also extends infer_scope_for_baseline to handle search_issues and
search_pull_requests (previously only search_code), ensuring proper
scope inference from repo:owner/name in search queries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The previous commit used raw repo-scoped tags (approved:github/gh-aw-mcpg)
but the DIFC system uses exact tag matching. When the policy scope is
owner-level (github/*), the agent's integrity uses 'approved:github' so
repo-scoped tags don't match.

Use writer_integrity() which calls normalize_scope() to produce tags that
match the policy scope token. Also extend infer_scope_for_baseline to
handle search_issues and search_pull_requests queries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
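
To illustrate why exact tag matching forces normalization, here is a simplified sketch. The two-argument signatures are assumptions made for the example (the guard's own writer_integrity() and normalize_scope() are not shown in this conversation), and the policy scope is modeled as a plain string such as github/* or github/gh-aw-mcpg.

```rust
/// Hypothetical sketch: collapse a repo id to the granularity of the policy scope.
fn normalize_scope(repo_id: &str, policy_scope: &str) -> String {
    match policy_scope.strip_suffix("/*") {
        // Owner-level policy: keep only the owner so tags match exactly.
        Some(owner) if repo_id.split('/').next() == Some(owner) => owner.to_string(),
        // Repo-level policy (or a different owner): keep the repo id as-is.
        _ => repo_id.to_string(),
    }
}

/// Build the approved-level tag from the normalized scope.
fn writer_integrity(repo_id: &str, policy_scope: &str) -> Vec<String> {
    vec![format!("approved:{}", normalize_scope(repo_id, policy_scope))]
}
```

Under an owner-level policy the tag becomes approved:github, which is the form the agent's integrity uses, whereas the raw repo-scoped tag approved:github/gh-aw-mcpg would not match.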
@lpcox lpcox merged commit 6859dff into main Mar 23, 2026
9 checks passed
@lpcox lpcox deleted the fix/search-result-filtering branch March 23, 2026 21:01
2 participants