feat: implement Phase 2-4 - AST symbols, project metadata, doc-code sync #9
Conversation
**Phase 2 (analyzer-repo-map):** tree-sitter AST extraction for 6 languages (Rust, TypeScript, JavaScript, Python, Go, Java). Extracts exports, imports, definitions, struct/enum fields, interface properties, class methods, and enum variants. Computes cyclomatic complexity. Detects naming conventions and test patterns. Builds the import graph.

**Phase 3 (analyzer-collectors):** project metadata gathering: README sections, CI provider detection, license detection (SPDX + file pattern), language distribution, package manager detection.

**Phase 4 (analyzer-sync-check):** doc-code cross-referencing using pulldown-cmark. Matches inline code refs against the AST symbol table with camelCase-to-snake_case resolution. Detects stale references (deleted or renamed symbols) and hotspot references. Comprehensive noise filtering for URLs, file paths, CLI flags, env vars, and stdlib globals.

All phases are integrated into the repo-intel init/update pipeline. 142 tests, clippy clean. Validated on 16 repos across Rust (86% match), TypeScript (57-73% match), Go, and mixed-language projects.
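The camelCase-to-snake_case resolution mentioned above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; `camel_to_snake` and `resolves` are hypothetical helper names.

```rust
// Hypothetical sketch: a doc reference like `parseConfig` is matched against
// an AST symbol table that stores `parse_config`.
fn camel_to_snake(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    for (i, ch) in name.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            // Insert a separator before each interior uppercase letter.
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

// A doc code ref resolves if it appears in the symbol table either verbatim
// or after camelCase -> snake_case conversion.
fn resolves(doc_ref: &str, symbols: &[&str]) -> bool {
    symbols.contains(&doc_ref) || symbols.contains(&camel_to_snake(doc_ref).as_str())
}
```

The real implementation works against the full symbol table rather than a slice, but the resolution rule is the same shape.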
Code Review
This is a massive and impressive feature implementation, adding phases 2-4 for AST analysis, metadata collection, and doc-code sync. The changes are extensive and well-structured across new and existing crates. The implementation of the new features looks solid, with good use of tree-sitter for AST parsing and pulldown-cmark for documentation analysis. My review focuses on improving robustness, error handling, and maintainability in a few key areas. I've identified a fragile TOML parsing implementation that should be replaced, some inconsistencies in error handling within the CLI commands, and a few places where errors are silently ignored, which could be improved with warnings. Overall, this is a significant step forward for the project.
```rust
if let Ok(content) = std::fs::read_to_string(repo_path.join("Cargo.toml")) {
    let mut in_workspace_package = false;
    for line in content.lines() {
        let trimmed = line.trim();

        // Track [workspace.package] section for workspace-level license
        if trimmed.starts_with('[') {
            in_workspace_package = trimmed == "[workspace.package]";
        }

        // Match `license = "MIT"` but not `license.workspace = true`
        if trimmed.starts_with("license") && trimmed.contains('=') {
            // Skip workspace delegation (license.workspace = true)
            if trimmed.starts_with("license.workspace") || trimmed.starts_with("license.path") {
                continue;
            }
            let value = trimmed
                .split('=')
                .nth(1)?
                .trim()
                .trim_matches('"')
                .trim_matches('\'');
            if !value.is_empty() && !value.contains('{') && value != "true" && value != "false" {
                return Some(value.to_string());
            }
        }
    }

    // If this is a workspace member with `license.workspace = true`,
    // check if we found it in [workspace.package] section above
    if !in_workspace_package {
        // Re-scan for [workspace.package] license
        let mut in_ws = false;
        for line in content.lines() {
            let trimmed = line.trim();
            if trimmed == "[workspace.package]" {
                in_ws = true;
                continue;
            }
            if in_ws && trimmed.starts_with('[') {
                break;
            }
            if in_ws && trimmed.starts_with("license") && trimmed.contains('=') {
                if trimmed.starts_with("license.") {
                    continue;
                }
                let value = trimmed
                    .split('=')
                    .nth(1)?
                    .trim()
                    .trim_matches('"')
                    .trim_matches('\'');
                if !value.is_empty() && value != "true" && value != "false" {
                    return Some(value.to_string());
                }
            }
        }
    }
}
```
The current implementation for parsing `Cargo.toml` uses line-by-line string manipulation. This is fragile and can fail on valid TOML syntax such as comments, different string quoting, or key ordering. For example, `license = "MIT" # My license` would not be parsed correctly.

To make this more robust and maintainable, I recommend using a dedicated TOML parser like the `toml` crate. This would be similar to how `serde_json` is used for `package.json` in the same function.
```rust
            eprintln!(
                "[ERROR] No symbol data in map file. Run `repo-intel init` first."
            );
            std::process::exit(1);
        }
```
Using `std::process::exit(1)` here (and in the `Dependents` command and `sync_check.rs`) bypasses the `Result`-based error handling of the `run` function. It's better to return an error and let the `main` function handle printing it and exiting. This improves consistency and makes the function more reusable. You can use `anyhow::bail!` for this.
```diff
-            eprintln!(
-                "[ERROR] No symbol data in map file. Run `repo-intel init` first."
-            );
-            std::process::exit(1);
+            anyhow::bail!("[ERROR] No symbol data in map file. Run `repo-intel init` first.");
         }
```
```rust
fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
    let json = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&json)?)
}
```
This `load_map` function is duplicated in `crates/analyzer-cli/src/commands/sync_check.rs`. The other version provides more helpful error context using `anyhow`'s `with_context`. To improve error messages and reduce code duplication, you should use the more descriptive version and centralize it. Consider moving this helper function to a shared module like `crates/analyzer-cli/src/commands/utils.rs`.
```diff
-fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
-    let json = std::fs::read_to_string(path)?;
-    Ok(serde_json::from_str(&json)?)
-}
+fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
+    let json = std::fs::read_to_string(path)
+        .with_context(|| format!("failed to read map file: {}", path.display()))?;
+    serde_json::from_str(&json).context("failed to parse map JSON")
+}
```
```rust
        })
        .ok();
```
The `.ok()` call here silently swallows any I/O errors that might occur during `walk::walk_files`. This could lead to incomplete language statistics without any indication that something went wrong. It would be more robust to propagate the error.

I recommend changing `detect_languages` to return a `Result<Vec<LanguageInfo>>` and using `?` to propagate the error from `walk_files`. The caller, `collect_metadata`, already returns a `Result`, so it can handle the propagated error.
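A sketch of the recommended shape. The directory walk here is a stand-in for the project's `walk_files` (the `LanguageInfo` fields and helper body are illustrative, not the PR's actual API); the point is that `?` surfaces I/O errors to the caller instead of dropping them.

```rust
use std::io;
use std::path::Path;

#[derive(Debug)]
struct LanguageInfo {
    extension: String,
    files: usize,
}

fn detect_languages(root: &Path) -> io::Result<Vec<LanguageInfo>> {
    let mut counts = std::collections::HashMap::new();
    // `?` propagates read_dir/entry errors to the caller rather than
    // silently producing incomplete statistics.
    for entry in std::fs::read_dir(root)? {
        let entry = entry?;
        if let Some(ext) = entry.path().extension().and_then(|e| e.to_str()) {
            *counts.entry(ext.to_string()).or_insert(0usize) += 1;
        }
    }
    Ok(counts
        .into_iter()
        .map(|(extension, files)| LanguageInfo { extension, files })
        .collect())
}
```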
```rust
match extract_file_symbols(&source, *lang) {
    Ok(syms) => Some((rel.clone(), syms)),
    Err(_) => None, // skip files that fail to parse
}
```
Files that fail to parse are silently skipped. This can hide issues with the parsers or unexpected file contents from the user. It would be better to log a warning to stderr when a file cannot be parsed, so the user is aware of which files were skipped and why.
```rust
match extract_file_symbols(&source, *lang) {
    Ok(syms) => Some((rel.clone(), syms)),
    Err(e) => {
        eprintln!("[WARN] Failed to parse {}: {}", rel, e);
        None
    }
}
```

```rust
fn parse_rust_use(node: &tree_sitter::Node, source: &[u8]) -> Option<ImportEntry> {
    let text = node.utf8_text(source).ok()?;
    // Simple parse: "use crate::foo::Bar;" or "use std::collections::HashMap;"
    let text = text.trim().trim_start_matches("use ").trim_end_matches(';');
    let parts: Vec<&str> = text.split("::").collect();
    if parts.len() >= 2 {
        let module = parts[..parts.len() - 1].join("::");
        let name = parts.last().unwrap_or(&"").to_string();
        Some(ImportEntry {
            from: module,
            names: vec![name],
        })
    } else {
        None
    }
}
```
The `parse_rust_use` function uses a simple string split to parse `use` declarations. This is fragile and will not correctly handle more complex but valid `use` statements, such as grouped imports (`use std::collections::{HashMap, HashSet};`), aliasing (`use std::fmt::Result as FmtResult;`), or wildcard imports (`use std::io::*;`). This could lead to an incomplete or incorrect import graph.

A more robust approach would be to traverse the `use_declaration` node's syntax tree to extract the path components and imported items accurately. Given the scope of this PR, a full rewrite might not be necessary, but it's a significant limitation to be aware of for future improvements.
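Short of walking the `use_declaration` syntax tree, even a string-level parser can cover the cases above. The sketch below is illustrative only (`parse_use_text` is a hypothetical helper; the struct mirrors the `ImportEntry` shown in the quoted code) and handles groups, aliases, and wildcards:

```rust
struct ImportEntry {
    from: String,
    names: Vec<String>,
}

// Hypothetical sketch: expand a `use` declaration's text into (module, names),
// handling grouped imports (`{A, B}`), aliases (`as X`), and wildcards (`*`).
fn parse_use_text(text: &str) -> Option<ImportEntry> {
    let text = text.trim().strip_prefix("use ")?.trim_end_matches(';').trim();
    // Split off a trailing group `{...}` if present.
    if let Some(brace) = text.find('{') {
        let module = text[..brace].trim_end_matches("::").to_string();
        let inner = text[brace + 1..].trim_end_matches('}');
        let names = inner
            .split(',')
            .map(|n| {
                // `Foo as Bar` imports the name `Bar`.
                let n = n.trim();
                n.rsplit(" as ").next().unwrap_or(n).trim().to_string()
            })
            .filter(|n| !n.is_empty())
            .collect();
        return Some(ImportEntry { from: module, names });
    }
    let parts: Vec<&str> = text.split("::").collect();
    if parts.len() < 2 {
        return None;
    }
    let module = parts[..parts.len() - 1].join("::");
    let last = parts.last().unwrap();
    // `HashMap as Map` imports `Map`; a wildcard is recorded as `*`.
    let name = last.rsplit(" as ").next().unwrap_or(last).trim().to_string();
    Some(ImportEntry { from: module, names: vec![name] })
}
```

It still misses nested groups (`use a::{b::{c, d}};`), which is one more argument for the syntax-tree traversal.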
…ands

New `repo-intel query` types for Phase 2-4 data:

- `symbols <file>`: AST symbols from the symbol table
- `dependents <symbol>`: reverse lookup of symbol importers
- `stale-docs`: symbol-level doc staleness from sync-check
- `project-info`: project metadata (languages, CI, license)

Consistent query pattern for all consumers: plugins call `repo-intel query <type>` instead of separate commands.
Summary
All integrated into `repo-intel init/update`. 142 tests (was 77), clippy clean.

Validation results (16 repos)
JS plugin repos score lower because their source is a shared agent-core library while the docs reference prompt/skill concepts.
Test plan
- `cargo test --workspace`: 142 tests pass
- `cargo clippy --workspace -- -D warnings`: clean
- `Option` with `serde(default)`