[SPARK-56074][INFRA] Improve AGENTS.md with inline build/test commands, PR workflow, and dev notes #54899
cloud-fan wants to merge 6 commits into apache:master
Conversation
…w, and dev notes Rewrite AGENTS.md to follow industry best practices: inline actionable commands instead of links, add project overview, git pre-flight checks, PySpark venv setup, development notes for SQLQueryTestSuite and Spark Connect protos, and PR workflow guidelines. Add CLAUDE.md as a symlink so Claude Code also picks up the instructions. Generated-by: Claude Opus 4.6 Co-authored-by: Isaac
```
> project sql
> testOnly *MySuite

### PySpark Tests
```
also cc @gaogaotiantian @Yicong-Huang for the pyspark test part.
dongjoon-hyun left a comment
Please use JIRA IDs for traceability, @cloud-fan, because this is a very hot area these days.
AGENTS.md (outdated):
```
build/sbt -Phive package

Set up and activate a virtual environment (Python 3.10+):
```
Many Python developers have more than one venv in the working directory, with names that contain the Python version (we probably have uv developers too, but let's deal with that when they speak up). Assuming .venv here is too idealistic: I don't have a .venv directory on any of my dev machines; I have something like venv3.12.
Do we assume the agent works from a new terminal or in the current user terminal session? It is probably smart enough to find the venv directory in the root directory and activate it. If not, we should help it.
The agent is supposed to work under the spark directory, and this .venv is inside the spark directory. Do we need multiple virtual envs for the spark project?
BTW, I picked .venv as it's kind of an industry standard to put the venv profile under the project dir, and it seems people have started using it already: #53836
It's common to use .venv as the virtualenv dir, but it's far from being an industry standard. Even the CPython docs mention two:
> Contained in a directory, conventionally named .venv or venv in the project directory, or under a container directory for lots of virtual environments, such as ~/.virtualenvs.
Unfortunately we do need multiple virtual envs to develop Spark, at least PySpark. We support all official Python versions, and it's not uncommon to have issues that only reproduce on a specific Python version. Also, python3 often does not point to the Python version we really want to use.
Currently I have two virtual env directories under spark and I probably will have more in the future.
Actually, if I remember correctly, our fragile type annotations can only pass on a specific Python version (3.12 for now). Oh, and I think it reports an error if you just pip install -r dev/requirements.txt because it has a slightly different version for numpy or its stub. That makes me realize it might be helpful to add the lint/format/type tooling to this file too, because we don't follow industry standards there either.
That being said, I don't have a perfect solution for Python environment management for agents; I don't have enough knowledge about what would work better. Environment management is not the best part of Python, but we still need to deal with it.
In my experience, I just let my agent create a venv if I haven't pointed it at one. However, I often do not create it under the spark root, as it is not picked up by .gitignore (yet?).
Given that there is no standard, maybe just let the agent choose a location to prepare a venv if one is not already given?
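To illustrate the multi-venv layout described in this thread, a minimal sketch; the `venv<version>` directory names follow the convention mentioned above and are not mandated by the repo:

```shell
# Hypothetical layout: one venv per interpreter version inside the spark
# checkout, named venv3.10, venv3.12, etc.
for py in python3.10 python3.12; do
  if command -v "$py" >/dev/null 2>&1; then
    # ${py#python} strips the "python" prefix, giving venv3.10 / venv3.12
    "$py" -m venv "venv${py#python}" || echo "could not create venv for $py"
  fi
done
# Activate the one matching the version under test, e.g.:
# . venv3.12/bin/activate
```

The point of the naming scheme is that the interpreter version stays visible in the directory name, which matters when bugs only reproduce on one Python version.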
AGENTS.md (outdated):
```
build/sbt -Phive package

Set up and activate a virtual environment (Python 3.10+):
```
Well, this means we need to update this AGENTS.md file every time we deprecate a Python version. Can we make this as robust as possible, so that we don't need to edit the file?
I removed it; python/run-tests already verifies Python versions.
AGENTS.md (outdated):
```
### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` or `[MINOR] Title` for trivial changes.
```
If possible, let's avoid mentioning MINOR here.
AGENTS.md (outdated):
```
Contributors push their feature branch to their personal fork and open PRs against `master` on the upstream Apache Spark repo. Run `git remote -v` to identify which remote is the fork and which is upstream (`apache/spark`). If the remotes are unclear, ask the user to set them up following the standard convention (`origin` for the fork, `upstream` for `apache/spark`).

- **PySpark Testing**: [python/docs/source/development/testing.rst](python/docs/source/development/testing.rst)
Use `gh pr create` to open PRs. If `gh` is not installed, generate the GitHub PR URL for the user and recommend installing the GitHub CLI.
```
This would be pretty useless if the agent does not have access to JIRA.
It doesn't just save clicking the create-PR button; it also generates the PR title and description.
I gave my agent access to JIRA; all you need is a Personal Access Token. Maybe we can add that?
AGENTS.md (outdated):
```
### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` or `[MINOR] Title` for trivial changes.
```
Shall we add an instruction to ask the user to get the SPARK-XXX ID?
Second this. My agents sometimes hallucinate a SPARK ID for me. It would be good to tell them a source to get this ID from.
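Since agents sometimes hallucinate JIRA IDs, a hypothetical guard an agent could run before using a title; this only checks the shape of `[SPARK-xxxx][COMPONENT] Title`, so the ID itself must still come from JIRA:

```shell
# Hypothetical shape check for PR titles like "[SPARK-56074][INFRA] ...".
# It cannot validate that the ticket exists; get the real ID from JIRA.
check_pr_title() {
  case "$1" in
    \[SPARK-[0-9]*\]\[*\]*) echo valid ;;
    *) echo invalid ;;
  esac
}
check_pr_title "[SPARK-56074][INFRA] Improve AGENTS.md"   # valid
check_pr_title "[SPARKLE-1][SQL] Oops"                    # invalid
```

A shape check like this catches malformed IDs early, but only a JIRA lookup catches fabricated ones.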
… drop Python version check Co-authored-by: Isaac
Addressed review comment: removed
Co-authored-by: Isaac
…re push Co-authored-by: Isaac
BTW, I sent an email to the Spark mailing list to request feedback. Thank you, Wenchen. This becomes more and more important.
AGENTS.md (outdated):
```
## Pull Request Workflow

### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` where `SPARK-xxxx` is the JIRA ticket ID. Ask the user to create a new ticket or provide an existing one if not given.
For follow-up fixes: `[SPARK-xxxx][COMPONENT][FOLLOWUP] Title`.

### PR Description

Follow the template in `.github/PULL_REQUEST_TEMPLATE`.
```
How about telling it not to raise PRs without a human review beforehand?
Or more accurately: I'd rather leave this part out; I still want a human to make the PRs.
I asked the LLM to get human approval before pushing commits and creating PRs; does that address your concern?
That's probably fine, although I'd still rather have humans make the PR (having the AI agent do it isn't a big time saver, and in making the PR hopefully they'd do the review).
To be clear: I think we can go with this and revisit if it becomes a problem
Yeah, I actually want to push people to review the code as the initial reviewer and write the PR title and description themselves. Someone shouldn't claim the output of an LLM as their own if they do not understand the code, and I'm kind of worried that making the process easier might eventually end up with people blindly proceeding with whatever the LLM suggests.
But we have a baseline of trust in the community, and people who want to delegate the PR effort can do the same with one more prompt anyway (no way to prevent it), so it's probably not a big deal.
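The approval gate discussed in this thread could look roughly like the sketch below; the prompt wording is hypothetical, and the point is simply that pushing and PR creation wait for an explicit human yes:

```shell
# Hypothetical confirmation step before any push or `gh pr create`.
confirm() {
  printf '%s [y/N] ' "$1"
  read -r ans
  [ "$ans" = y ] || [ "$ans" = Y ]
}
if confirm "Push branch and open a PR?"; then
  echo "approved: proceeding"
else
  echo "not approved: stopping"
fi
```

Defaulting to "No" on anything other than an explicit `y` keeps an absent or distracted human from silently approving.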
Co-authored-by: Isaac
dongjoon-hyun left a comment
+1, LGTM. Thank you for addressing my comments, @cloud-fan .
AGENTS.md (outdated):
```
# Apache Spark

This file provides context and guidelines for AI coding assistants working with the Apache Spark codebase.
Apache Spark is a multi-language engine for large-scale data processing and analytics, primarily written in Scala and Java. It supports both batch and streaming workloads, with Spark Connect as an optional server-client protocol.
```
we don't need this for agentic tools
Yeah, I'm not actually very sure how this would be useful for an agent. Do we see its usefulness?
Actually the entire overview section could be removed, as LLMs should be very familiar with Spark. But searching the web, it's very common to put an overview section with some general info.
AGENTS.md (outdated):
```
## Before Making Changes

Before the first edit in a session, check the git state:
```
I also think we should use worktree.
I think it is up to developers to add new worktrees?
git worktree is a bit heavier (a whole copy of all files) and I don't use it very often in my development. Maybe we can improve the "If there are uncommitted changes" part and ask users whether they want to create a new worktree or not.
When developers use agentic tools heavily, they will see how worktrees help: you don't need to work on a single PR at a time with AI.
Worktrees can help with multitasking and sub-agents, which is kind of hard with branches. I think at some point the slightly heavier overhead is justified. I adapted to worktrees pretty quickly.
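The one-worktree-per-task workflow suggested in this thread, as a sketch in a throwaway repo; the paths and branch names are hypothetical:

```shell
# Demo in a throwaway repo; in real use you would run the worktree
# commands from your existing spark checkout instead.
demo=$(mktemp -d) && cd "$demo"
git init -q main-checkout && cd main-checkout
git -c user.email=dev@example.com -c user.name=dev commit --allow-empty -qm "init"
git branch -M master

# One worktree per concurrent task, so multiple agents/PRs don't collide:
git worktree add -b task-1 ../spark-task-1 master
git worktree add -b task-2 ../spark-task-2 master
git worktree list            # main checkout plus both task worktrees
# Cleanup when a task is done:
# git worktree remove ../spark-task-1
```

Each worktree gets its own working directory and checked-out branch, so an agent in one cannot clobber uncommitted changes in another.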
AGENTS.md (outdated):
```
## Development Notes

SQL golden file tests are managed by `SQLQueryTestSuite` and its variants. Read the class documentation before running or updating these tests.
```
For rules and notes on specific areas, can we use AGENTS.md only as a menu, not for details?
It's already menu-like; this note asks the LLM to read the details in the classdoc.
AGENTS.md (outdated):
```
### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` where `SPARK-xxxx` is the JIRA ticket ID. Ask the user to create a new ticket or provide an existing one if not given.
For follow-up fixes: `[SPARK-xxxx][COMPONENT][FOLLOWUP] Title`.

### PR Description

Follow the template in `.github/PULL_REQUEST_TEMPLATE`.
```
The PR title and description are very easy things for AI tools to learn; this actually consumes too much room in AGENTS.md.
AGENTS.md (outdated):
```
Before the first edit in a session, check the git state:
1. If there are uncommitted changes, ask the user whether to continue editing or stash/commit first.
2. If the branch is `master`, or has new commits compared to `master` (check with `git log master..HEAD`), create a new branch from `master` before editing. Inform the user of the new branch name.
```
I wonder if we should add a pull from upstream master, so that the new branch is cut from a newer master. My local agent always forgets to pull from upstream and instead pulls from origin/master, which is my fork, then states that it is already up to date.
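The quoted pre-flight steps could be sketched as a small helper; the messages and structure are hypothetical, and the remote naming (`upstream` = apache/spark) follows the convention from this PR:

```shell
# Hypothetical pre-flight helper for the steps quoted above.
preflight() {
  if [ -n "$(git status --porcelain)" ]; then
    echo "dirty: ask the user to stash or commit first"
    return 1
  fi
  branch=$(git rev-parse --abbrev-ref HEAD)
  if [ "$branch" = "master" ]; then
    echo "on master: create a new branch before editing"
    return 1
  fi
  echo "clean: ok to edit on $branch"
}
# Per the upstream-pull suggestion, a fuller version would cut the new
# branch from a fresh upstream master:
#   git fetch upstream && git checkout -b <new-branch> upstream/master
```

Branching from `upstream/master` rather than `origin/master` avoids the stale-fork problem described above.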
AGENTS.md (outdated):
```
Set up and activate a virtual environment (Python 3.10+):

if [ ! -d .venv ]; then
python3 -c "import sys; assert sys.version_info >= (3, 10)" || { echo "Python 3.10+ required. Ask the user to install a compatible version."; exit 1; }
```
I know we support 3.10+, but is it a good idea to use a more recent Python for development? Especially since our CI environments are on 3.12, using a version similar to CI might help prevent some issues.
AGENTS.md (outdated):
```
### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` or `[MINOR] Title` for trivial changes.
```
Shall we add the supported components explicitly, e.g., [PYTHON], [SS], [TEST], [DOCS]? Otherwise, where can it find this information?
AGENTS.md (outdated):
```
## Build and Test

Prefer building in sbt over maven:
Prefer SBT over Maven for day-to-day development (faster incremental compilation). Replace `sql` below with the target module (e.g., `core`, `catalyst`, `connect`).
```
My Cursor somehow tries a few random names for a module like 'sql' before it finds it. I wonder, did you see that too, and did this help?
Last time @HyukjinKwon mentioned not duplicating the doc and just pointing to it; enhancing the doc instead is an alternative too, but I'm not sure.
For the majority of other docs, I think it's better to leave a pointer here rather than copy-paste content. But build and test instructions are pretty stable (unlikely to change often), yet would be used very frequently by the agent, so I'd actually prefer to have those instructions copied into this doc.
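One non-obvious bit that benefits from being inlined is the quoting for single-test runs. A sketch of how the nested quotes resolve; the sbt command form is the one verified in this PR's description:

```shell
# The outer double quotes make one sbt argument; the escaped inner quotes
# survive, so sbt receives:  sql/testOnly *MySuite -- -z "test name"
arg="sql/testOnly *MySuite -- -z \"test name\""
printf '%s\n' "$arg"
# Invocation shape (from the PR description): build/sbt "$arg"
```

Getting this wrong usually splits the argument at the space in the test name, which is why spelling it out inline helps an agent.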
AGENTS.md (outdated):
```
- SBT build instructions: See the ["Building with SBT"](docs/building-spark.md#building-with-sbt) section
- SBT testing: See the ["Testing with SBT"](docs/building-spark.md#testing-with-sbt) section
- Running individual tests: See the ["Running Individual Tests"](docs/building-spark.md#running-individual-tests) section
Contributors push their feature branch to their personal fork and open PRs against `master` on the upstream Apache Spark repo. Run `git remote -v` to identify which remote is the fork and which is upstream (`apache/spark`). If the remotes are unclear, ask the user to set them up following the standard convention (`origin` for the fork, `upstream` for `apache/spark`).
```
Nice. So one can say "make a PR that does this"?
Thanks for adding these. The context is much more useful to agents than links to docs. However, I feel it is a bit over-engineered. Shorter is better for agent instructions, reducing noise and cost. The most valuable content is the non-obvious or unique stuff:
vaquarkhan left a comment
Two suggestions that have become standard practice for AGENTS.md in large monorepos:

1. Add an "Architecture Boundaries" section with explicit negative constraints.
AI agents frequently hallucinate legacy patterns or bypass isolation layers. The most effective guardrail is a short "Never Do" list. For Spark, the critical ones are:

   Architecture Boundaries
   - Spark Connect: Uses a decoupled client-server architecture communicating via gRPC. Never share JVM memory or state between the client (`sql/connect/client/`) and the server (`sql/connect/server/`).
   - Dependencies: Never introduce cyclic dependencies. `common/utils` is foundational; higher-level modules like `sql/core` must not pull in dependencies that violate the module hierarchy.
   - Generated Files: Never hand-edit SQL golden files (`sql/core/src/test/resources/sql-tests/results/`) or generated Protobuf files. Regenerate golden files with `SPARK_GENERATE_GOLDEN_FILES=1` and protos with `dev/connect-gen-protos.sh`.

2. Establish the federated AGENTS.md pattern.
The AGENTS.md spec now explicitly supports subdirectory placement with nearest-wins cascading. To keep this root file lightweight (~100 lines) and establish that module-specific rules belong in nested files, add a note at the top:

   > Note: For component-specific instructions, look for nested `AGENTS.md` files in subdirectories (e.g., `sql/AGENTS.md`, `python/AGENTS.md`). Agents read the nearest file in the directory tree, so the closest one takes precedence.

This aligns with @yaooqinn's comment about keeping this file as a menu rather than details; sub-projects like PySpark or Catalyst can define their own hyper-specific guardrails locally without bloating the root file.
@vaquarkhan I think eventually AGENTS.md should contain an architecture overview, but it will need more time to get consensus (Spark is a large project with many modules and architectures at different granularities). I want to include only dev-ops-related stuff in the first version, which already needs a lot of time to reach agreement on (see the discussions we have now). I'll keep this PR open for a few days to collect more feedback.
AGENTS.md (outdated):
```
### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` where `SPARK-xxxx` is the JIRA ticket ID. Ask the user to create a new ticket or provide an existing one if not given.
```
Shall we specify the allowed values for COMPONENT?
I don't think we have a clear rule, and the LLM should be able to figure it out. For example, we use [SQL] a lot, but the actual module names are catalyst, sql/core, hive, etc.
…ceholders, simplify PR section Co-authored-by: Isaac
AGENTS.md (outdated):
```
build/sbt -Phive package

Activate the virtual environment specified by the user, or default to `~/.virtualenvs/pyspark`:
```
@gaogaotiantian how about this? We give more control to the users.
Users can say "run test ... with venv at <path>".
I think the agent should try to locate a virtual env at the root dir of the repo, instead of a global user venv. If no venv is found, the agent can create venv or .venv at the root dir of the repo.
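The lookup order suggested above could be sketched as follows; the directory names are the conventional ones mentioned in this thread, not a Spark standard:

```shell
# Prefer an existing venv in the repo root (covering both the .venv and
# the venv3.x naming styles from this discussion); otherwise create .venv.
find_or_create_venv() {
  for d in .venv venv venv3.*; do
    [ -d "$d" ] && { echo "$d"; return; }
  done
  python3 -m venv .venv && echo .venv
}
# Usage: . "$(find_or_create_venv)/bin/activate"
```

Checking the repo root first matches the "agent works under the spark directory" assumption, while still falling back to creating a fresh `.venv` when nothing exists.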
AGENTS.md (outdated):
```
This file provides context and guidelines for AI coding assistants working with the Apache Spark codebase.
## Before Making Changes

Before the first edit in a session, ensure a clean working environment:
```
This is a common issue I hit when working with LLMs. Sometimes I forget to clean up the working env before asking the LLM to make changes; it just goes ahead and messes up my working env.
Keep it simple and mandatory, without any conditions; otherwise the agent struggles with conditional guidelines.
- DON'T include any unrelated changes
AGENTS.md (outdated):
```
- SBT build instructions: See the ["Building with SBT"](docs/building-spark.md#building-with-sbt) section
- SBT testing: See the ["Testing with SBT"](docs/building-spark.md#testing-with-sbt) section
- Running individual tests: See the ["Running Individual Tests"](docs/building-spark.md#running-individual-tests) section
Contributors push their feature branch to their personal fork and open PRs against `master` on the upstream Apache Spark repo.
```
This is another common mistake. The LLM pushed my branch to upstream apache many times...
For the agent, we'd keep it simple, for example:
- MUST push to my fork
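A "MUST push to my fork" rule could also be enforced mechanically. A hypothetical guard, with the remote layout following the convention described in this PR (`origin` = fork, `upstream` = apache/spark):

```shell
# Refuse to push if the target remote resolves to the apache/spark upstream.
safe_push() {
  remote=${1:-origin}
  if git remote get-url "$remote" | grep -q 'github.com[:/]apache/spark'; then
    echo "refusing: '$remote' is the apache upstream, push to the fork instead"
    return 1
  fi
  echo "pushing to $remote"
  # git push "$remote" HEAD   # left commented out in this sketch
}
```

Checking the resolved URL rather than the remote name catches the case where someone named the apache remote `origin`.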
CLAUDE.md (new file, a symlink to AGENTS.md):
```
@@ -0,0 +1 @@
AGENTS.md
```
(No newline at end of file.)
Do we need this? AGENTS.md is now like an open standard for project harnesses.
seems not well supported: anthropics/claude-code#34235
@cloud-fan That makes complete sense. Getting community consensus on global architecture boundaries for a project as massive as Spark would definitely stall this PR. Starting strictly with the dev-ops, build, and test instructions is a great v1. As a lightweight compromise for this PR, perhaps we just include the federated-pattern note at the top?
If we establish that routing rule now, it unlocks the ability for individual component maintainers to add their own hyper-specific architecture boundaries in their respective subdirectories later, without ever needing to bloat this root file or block on global consensus. Thanks for driving this forward!
What changes were proposed in this pull request?
Rewrite AGENTS.md to follow industry best practices for AI coding agent instructions:
Why are the changes needed?
The existing AGENTS.md only contained links to build docs, which AI coding agents cannot follow. The rewrite provides inline commands and actionable guidance that agents can directly use. All commands have been verified to work.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
All build and test commands were manually verified:
- `build/sbt sql/compile`, `build/sbt sql/Test/compile`
- `build/sbt "sql/testOnly *MySuite"`, `build/sbt "sql/testOnly *MySuite -- -z \"test name\""`
- `python/run-tests` with both a test suite and a single test case
- `OBJC_DISABLE_INITIALIZE_FORK_SAFETY` is no longer needed on macOS

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.6