Add data extension prompts, templates, and barrier/barrierGuard support #42
felickz wants to merge 2 commits into advanced-security:main
Add comprehensive CodeQL data extension development guidance:
- Common prompt with core principles, threat models, and CLI references
- Language-specific prompts for C++, C#, Go, Java/Kotlin, JS/TS, Python, Ruby
- Issue template and PR template for data extension workflow
- barrierModel (sanitizers) and barrierGuardModel (validators) support across all languages (CodeQL 2.25.2+)
data-douser left a comment
Great work — dedicated, language-specific models-as-data guidance with barrier/barrierGuard coverage aligned to CodeQL 2.25.2 is exactly what this repo needs. The YAML examples, API Graph vs MaD format documentation, and real-world samples (HTTP4k, Apache Camel, Databricks, Undertow) are excellent.
Key concern: Several places in the prompts and templates use language that implies the goal is to write a new CodeQL query (.ql file), when the primary value of models-as-data is that you only need simple YAML. This framing risks misleading LLMs — especially Copilot Cloud Agent — into scaffolding QL code when they should be creating/updating .model.yml files and/or publishing model packs.
The three primary use cases that need better coverage:
- Creating a new `.model.yml` for an unmodeled library (partially covered; needs an end-to-end procedural workflow including both the repo-level `.github/codeql/extensions/` path and the model pack path)
- Updating an existing `.model.yml`: adding new sinks/sources/barriers to an already-modeled library (not covered at all)
- Publishing a model pack to GHCR for org-wide Default Setup (referenced but not walked through as a workflow; see org-level model packs and extending coverage for all repos in an org)
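To make the first use case concrete, here is a minimal sketch of a repo-level data extension for Java. The file name and its placement under `.github/codeql/extensions/` are assumptions for illustration, and the two rows mirror the source and sink examples in the CodeQL guide on customizing Java library models rather than anything in this PR.

```yaml
# Hypothetical file: .github/codeql/extensions/my-library.model.yml
extensions:
  # A method whose return value is a source of remote (untrusted) data.
  - addsTo:
      pack: codeql/java-all
      extensible: sourceModel
    data:
      - ["java.net", "Socket", False, "getInputStream", "()", "", "ReturnValue", "remote", "manual"]
  # A method whose first argument is a SQL injection sink.
  - addsTo:
      pack: codeql/java-all
      extensible: sinkModel
    data:
      - ["java.sql", "Statement", True, "execute", "(String)", "", "Argument[0]", "sql-injection", "manual"]
```

No `.ql` file is involved: the existing queries pick up these rows as additional sources and sinks.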
Opened #44 to track adding .github/skills/{create,publish}-model-pack/ agent skills as a follow-up to provide the procedural workflows for these use cases.
See inline comments for specifics and typo fixes.
| @@ -0,0 +1,143 @@ |
| name: Request new CodeQL Data Exension |
| description: Request a new CodeQL query for detecting specific code patterns |
| title: "[Data Extension Create]: " |
This description says "Request a new CodeQL query" — but the whole point of data extensions is that you don't write a new CodeQL query. An LLM (especially Copilot Cloud Agent) reading this will anchor on "new CodeQL query" and may attempt to scaffold a .ql file instead of a .model.yml file.
Suggest:
description: Request a new CodeQL data extension (models-as-data) for an unmodeled library or framework
| @@ -0,0 +1,143 @@ |
| name: Request new CodeQL Data Exension |
Typo: "Exension" → "Extension"
| description: Which programming language should this query target? |
| options: |
| - actions |
| - cpp |
The actions language is listed as an option, but there's no corresponding actions_data_extension_development.prompt.md in this PR. If Actions doesn't support models-as-data, remove it from this dropdown to avoid confusing agents. If it does, it needs a prompt file.
| This prompt provides common guidance for developing CodeQL data extensions across all supported languages, while language-specific prompts reference this common guidance and add language-specific details. |
| ## Product Documentation |
The prompts are heavily oriented toward creating a brand new model from scratch, but the most common real-world workflows aren't well represented. Consider adding a "## Common Workflows" section covering:
- Creating a new `.model.yml` file: end-to-end (identify library → create YAML → test with `--additional-packs` → validate results)
- Updating an existing `.model.yml` file: adding rows to an already-modeled library (where to find existing models, how to add without breaking, re-testing)
- Publishing updates to an existing model pack: versioning, `codeql pack publish`, and configuring the pack for Default Setup across an org
These three use cases are the primary value proposition of models-as-data, and an agent needs explicit procedural guidance for each.
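For the model pack path, a minimal sketch of the pack manifest that `codeql pack publish` would operate on could look like the following. The pack name `my-org/java-models`, the version, and the `models/` directory layout are hypothetical placeholders. The `extensionTargets` and `dataExtensions` keys follow the CodeQL documentation on creating model packs.

```yaml
# Hypothetical qlpack.yml at the root of a model pack
name: my-org/java-models        # placeholder scope/name
version: 0.0.1                  # bump before each publish
library: true
extensionTargets:
  codeql/java-all: "*"          # the library pack these models extend
dataExtensions:
  - models/**/*.yml             # every data extension file shipped in the pack
```

Updating an existing pack then amounts to adding rows to the files under `models/`, raising `version`, and publishing again.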
| #### Default behavior |
| By default, only the **`remote`** threat model is enabled. This means only sources marked with `kind: "remote"` are active. To include local sources, you must explicitly enable additional threat models via `--threat-model` on the CLI or in the code scanning configuration. |
Typos: "organizaiton" → "organization", and in the Development section further down, "easilly" → "easily".
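As an illustration of the default-behavior paragraph quoted above, a minimal sketch of enabling the `local` threat model through a code scanning configuration file might look like this. The file path in the comment is an assumption, and this presumes an advanced setup that reads a configuration file.

```yaml
# Hypothetical path: .github/codeql/codeql-config.yml
# Enables sources in the local threat model in addition to the default remote ones.
threat-models: local
```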
| ### Threat Models |
| Threat models control which `sourceModel` entries are active during analysis. The `kind` column of a `sourceModel` determines its threat model category. |
This header says "Query Quality Criteria" but the body talks about models/extensions. Should be "Model Quality Criteria" or "Extension Quality Criteria" to avoid reinforcing query-writing framing for agents.
| ## Product Documentation |
| - [Extending coverage for a repository](https://docs.github.com/en/code-security/how-tos/scan-code-for-vulnerabilities/manage-your-configuration/editing-your-configuration-of-default-setup#extending-coverage-for-a-repository) - `.github/codeql/extensions directory` for local model pack refrences (does not need a qlpack.yml) |
| - [Extending coverage for all repositories in an organization](https://docs.github.com/en/code-security/how-tos/scan-code-for-vulnerabilities/manage-your-configuration/editing-your-configuration-of-default-setup#extending-coverage-for-all-repositories-in-an-organization) - publishing model packs and referencing them globally (must be done click button in UI) |
Typo: "refrences" → "references"
| Model packs can be used to expand code scanning analysis at scale. Model packs use data extensions, which are implemented as YAML and describe how to add data for new dependencies. When a model pack is specified, the data extensions in that pack will be added to the code scanning analysis automatically. |
| Generally each language will allow customization of the following extensible prdicates: |
Typo: "prdicates" → "predicates"
| For general CodeQL data extension model development guidance, see [Common Data Extension Development](./data_extensions_development.prompt.md). |
| For general CodeQL query development guidance, see [Common Query Development](./query_development.prompt.md). |
The cross-reference to query_development.prompt.md is prominently placed as the second line of every language prompt. For a data extension task, the agent should not need query development guidance — and this framing may cause an LLM to treat QL query writing as part of the expected workflow.
Consider moving this to the bottom under "Additional References" (where it already appears), or qualifying it: "If you need to write a custom CodeQL query instead of a data extension, see..." — making it clear data extensions are the primary path and QL queries are a fallback.
(Same feedback applies to all seven language-specific prompts.)
| ### Python Documentation |
| - [Customizing Library Models for Python](https://codeql.github.com/docs/codeql-language-guides/customizing-library-models-for-python/) |
| - Can also be found at [Customizing Library Models for Python Docs](https://github.com/github/codeql/blob/main/docs/codeql/codeql-language-guides/customizing-library-models-for-python.rst) |
Typo: "acess" → "access"
Summary
Add comprehensive CodeQL data extension (Models as Data) development guidance as Copilot prompts, issue template, and PR template.
Sample MaD models created
see usage
see usage
What's included
10 new files:
- `.github/prompts/data_extensions_development.prompt.md`
- `.github/prompts/cpp_data_extension_development.prompt.md`: `Argument[*n]`, namespace-based identification
- `.github/prompts/csharp_data_extension_development.prompt.md`: `get_`/`set_`
- `.github/prompts/go_data_extension_development.prompt.md`: `Argument[receiver]`
- `.github/prompts/java_data_extension_development.prompt.md`
- `.github/prompts/javascript_data_extension_development.prompt.md`: `Fuzzy`, `GuardedRouteHandler`, `typeModel`
- `.github/prompts/python_data_extension_development.prompt.md`: `builtins` type
- `.github/prompts/ruby_data_extension_development.prompt.md`: `Method[]` access paths, `!` suffix for class references
- `.github/ISSUE_TEMPLATE/data-extension-create.yml`
- `.github/PULL_REQUEST_TEMPLATE/data-extension-create.md`

Barrier and Barrier Guard support (CodeQL 2.25.2+)
All prompts include the new `barrierModel` (sanitizers) and `barrierGuardModel` (validators) extensible predicates announced in the April 21, 2026 changelog:
- `barrierModel`: Stops taint flow at the modeled element for a specified query kind (e.g., HTML-escaping prevents XSS)
- `barrierGuardModel`: Stops taint flow when a conditional check returns an expected boolean value (e.g., URL validation prevents open redirects)

Each language prompt includes barrier/barrier guard examples from the official CodeQL docs:
- `mysql_real_escape_string` (SQL injection barrier), `is_safe` (barrier guard)
- `HttpRequest.RawUrl` (URL redirection barrier), `Uri.IsAbsoluteUri` (barrier guard)
- `Htmlquote` (HTML injection barrier), `IsSafe` (barrier guard)
- `File.getName()` (path injection barrier), `URI.isAbsolute()` (request forgery barrier guard)
- `encodeURIComponent` (HTML injection barrier), `isValid` (barrier guard)
- `html.escape` (HTML injection barrier), Django `url_has_allowed_host_and_scheme` (barrier guard)
- `Mysql2::Client#escape` (SQL injection barrier), `Validator.is_safe` (barrier guard)

References