-
Notifications
You must be signed in to change notification settings - Fork 327
[refactor] Semantic Function Clustering Analysis - Code Organization Opportunities #6410
Description
Overview
Comprehensive semantic analysis of 299 non-test Go files (83,614 lines) identified significant refactoring opportunities through function clustering, duplicate detection, and organizational improvements. The analysis cataloged function naming patterns, identified outliers, and detected high-impact consolidation opportunities.
Key Statistics:
- Total files analyzed: 299 (164 in pkg/workflow, 99 in pkg/cli, 14 in pkg/parser, 22 utility files)
- Total lines of code: 83,614 lines
- Files >1000 lines: 11 files requiring attention
- Validation files: 18 separate files (should be consolidated)
- Major duplicate patterns: Token handling (3 similar functions), safe outputs config (270+ lines of repetition), upload artifact generation (7 similar functions)
Major Findings:
- Large files mixing multiple responsibilities (compiler_jobs.go: 1,456 lines with 14 functions + 29 external builders)
- Token handling logic duplicated 3 times with similar precedence chains
- Safe outputs configuration has 270+ lines of near-identical checks repeated in 5 functions
- Upload artifact generation duplicated 7 times following identical patterns
- 18 validation files scattered instead of being in dedicated subdirectory
Full Analysis Report
Repository Statistics
Package Distribution
| Package | Files | Lines | Functions | Largest File |
|---|---|---|---|---|
| pkg/workflow | 164 | ~44,000 | 2,761 | compiler_jobs.go (1,456 lines) |
| pkg/cli | 99 | ~30,000 | 580+ | logs.go (1,338 lines) |
| pkg/parser | 14 | ~6,000 | 150+ | frontmatter.go (1,283 lines) |
| Utilities | 22 | ~3,600 | - | - |
| Total | 299 | 83,614 | 3,500+ | - |
Function Naming Patterns (Top 15)
Analysis of all functions across the codebase reveals clear semantic clusters:
| Pattern | Count | Primary Use | Well-Organized? |
|---|---|---|---|
get* |
118 | Getters, retrievers | ✓ Appropriate |
parse* |
110 | Parsing logic | ✓ Distributed by domain |
generate* |
107 | Code/config generation | |
extract* |
103 | Data extraction | ✓ Concentrated appropriately |
build* |
97 | Builders, constructors | ✓ Well-clustered |
validate* |
52 | Validation | |
is* |
50 | Boolean checks | ✓ Appropriate |
render* |
49 | Output rendering | |
add* |
25 | Addition operations | ✓ Appropriate |
find* |
22 | Search operations | ✓ Appropriate |
resolve* |
17 | Resolution logic | ✓ Appropriate |
has* |
17 | Existence checks | ✓ Appropriate |
ensure* |
16 | Ensure patterns | |
check* |
16 | Validation checks | |
process* |
13 | Processing logic | ✓ Appropriate |
Critical Findings by Package
1. pkg/workflow (164 files, 44,000+ lines, 2,761 functions)
Issue 1.1: compiler_jobs.go - Massive Duplication in Safe Outputs Job Creation (HIGH PRIORITY)
File: pkg/workflow/compiler_jobs.go (1,456 lines, 14 functions)
Problem: The buildSafeOutputsJobs() function (lines 276-765, 489 lines) contains 20+ near-identical code blocks for building different job types:
// Pattern repeated 23 times:
if data.SafeOutputs.CreateIssues != nil {
createIssueJob, err := c.buildCreateOutputIssueJob(data, jobName)
if err != nil {
return fmt.Errorf("failed to build create_issue job: %w", err)
}
if threatDetectionEnabled {
createIssueJob.Needs = append(createIssueJob.Needs, constants.DetectionJobName)
createIssueJob.If = AddDetectionSuccessCheck(createIssueJob.If)
}
if err := c.jobManager.AddJob(createIssueJob); err != nil {
return fmt.Errorf("failed to add create_issue job: %w", err)
}
safeOutputJobNames = append(safeOutputJobNames, createIssueJob.Name)
}Recommendation: Consolidate using configuration-driven approach:
type SafeOutputJobConfig struct {
ConfigField interface{}
BuildFunc func(*WorkflowData, string) (*Job, error)
ExtraDeps []string
}
var safeOutputJobs = []SafeOutputJobConfig{
{data.SafeOutputs.CreateIssues, c.buildCreateOutputIssueJob, nil},
{data.SafeOutputs.CreateAgentTasks, c.buildCreateOutputAgentTaskJob, nil},
// ... 21 more entries
}
for _, config := range safeOutputJobs {
if config.ConfigField != nil {
job, err := config.BuildFunc(data, jobName)
// ... common error handling and dependency logic
}
}Estimated Impact:
- Reduce ~200 lines of duplicate code
- Single source of truth for job creation pattern
- Easier to add new safe output types
- Effort: 3-4 hours
Issue 1.2: compiler_yaml.go - Upload Artifact Generation Duplication (HIGH PRIORITY)
File: pkg/workflow/compiler_yaml.go (1,411 lines, 29 functions)
Problem: Seven functions generating artifact uploads follow identical patterns (lines 466-641):
func (c *Compiler) generateUploadAgentLogs(yaml *strings.Builder, logFileFull string) {
c.generateArtifactUpload(yaml, ArtifactUploadConfig{
StepName: "Upload Agent Logs",
ArtifactName: "agent-logs",
UploadPaths: []string{logFileFull},
IfNoFilesFound: "warn",
})
}
// IDENTICAL pattern for:
// - generateUploadAssets()
// - generateUploadAwInfo()
// - generateUploadPrompt()
// - generateUploadMCPLogs()
// - generateUploadSafeInputsLogs()
// - generateUploadAccessLogs()All are one-liner wrappers around generateArtifactUpload().
Recommendation: Replace with configuration table:
type ArtifactSpec struct {
Name string
StepName string
Path string
IfNoFilesFound string
}
var artifactSpecs = []ArtifactSpec{
{"agent-logs", "Upload Agent Logs", logFileFull, "warn"},
{"assets", "Upload Assets", "/tmp/gh-aw/assets", "ignore"},
// ... 5 more entries
}
func (c *Compiler) generateAllArtifactUploads(yaml *strings.Builder, specs []ArtifactSpec) {
for _, spec := range specs {
c.generateArtifactUpload(yaml, ArtifactUploadConfig{...})
}
}Estimated Impact:
- Eliminate 7 wrapper functions (~40-50 lines)
- Easier to update artifact upload action versions
- Effort: 1-2 hours
Issue 1.3: safe_outputs_config.go - Massive Repetition in Configuration Processing (CRITICAL)
File: pkg/workflow/safe_outputs_config.go (1,024 lines, 15+ functions)
Problem: The file contains 270+ lines of near-identical code repeated in 5 different functions:
Location 1: extractSafeOutputsConfig() (lines 170-464) - Extracts 24+ config types:
issuesConfig := c.parseIssuesConfig(outputMap)
if issuesConfig != nil {
config.CreateIssues = issuesConfig
}
agentTaskConfig := c.parseAgentTaskConfig(outputMap)
if agentTaskConfig != nil {
config.CreateAgentTasks = agentTaskConfig
}
// ... repeated 22+ more timesLocation 2: generateSafeOutputsConfig() (lines 576-847) - Generates JSON for each type:
if data.SafeOutputs.CreateIssues != nil {
issueConfig := map[string]any{}
maxValue := 1
if data.SafeOutputs.CreateIssues.Max > 0 {
maxValue = data.SafeOutputs.CreateIssues.Max
}
issueConfig["max"] = maxValue
if len(data.SafeOutputs.CreateIssues.AllowedLabels) > 0 {
issueConfig["allowed_labels"] = data.SafeOutputs.CreateIssues.AllowedLabels
}
safeOutputsConfig["create_issue"] = issueConfig
}
// ... repeated 23+ more times with minor variationsLocation 3: generateFilteredToolsJSON() (lines 912-986) - Checks enabled tools:
if data.SafeOutputs.CreateIssues != nil {
enabledTools["create_issue"] = true
}
if data.SafeOutputs.CreateAgentTasks != nil {
enabledTools["create_agent_task"] = true
}
// ... repeated 22+ timesAlso repeated in:
HasSafeOutputsEnabled()(lines 33-55) - 23 nil checksGetEnabledSafeOutputToolNames()(lines 74-148) - 24+ type mappings
Recommendation: Create metadata-driven system:
type SafeOutputMetadata struct {
Key string // "create_issue"
ConfigFieldName string // "CreateIssues"
DefaultMax int
CustomFields []string // Config-specific fields
}
var SafeOutputRegistry = []SafeOutputMetadata{
{"create_issue", "CreateIssues", 1, []string{"allowed_labels", "required_title_prefix"}},
{"create_agent_task", "CreateAgentTasks", 1, nil},
// ... all 24+ entries
}
// Use reflection/metadata to iterate instead of manual checks
func extractSafeOutputsConfig(outputMap map[string]any) *SafeOutputsConfig {
for _, meta := range SafeOutputRegistry {
if configData, exists := outputMap[meta.Key]; exists {
// Use reflection to set field
}
}
}Estimated Impact:
- Reduce 1,024 lines → ~450-500 lines (55% reduction!)
- Single source of truth for safe output types
- New types require 1 metadata entry instead of 5 scattered additions
- Effort: 6-8 hours
Issue 1.4: Token Handling Duplication (MEDIUM PRIORITY)
File: pkg/workflow/github_token.go
Problem: Three nearly identical functions with different token precedence:
// Function 1: Generic
func getEffectiveGitHubToken(customToken, toplevelToken string) string {
if customToken != "" { return customToken }
if toplevelToken != "" { return toplevelToken }
return "${{ secrets.GH_AW_GITHUB_MCP_SERVER_TOKEN || secrets.GH_AW_GITHUB_TOKEN || secrets.GITHUB_TOKEN }}"
}
// Function 2: Copilot-specific (80% similar)
func getEffectiveCopilotGitHubToken(customToken, toplevelToken string) string {
if customToken != "" { return customToken }
if toplevelToken != "" { return toplevelToken }
return "${{ secrets.COPILOT_GITHUB_TOKEN || secrets.GH_AW_GITHUB_TOKEN }}"
}
// Function 3: Agent-specific (70% similar)
func getEffectiveAgentGitHubToken(customToken string) string {
if customToken != "" { return customToken }
return "${{ secrets.GH_AW_AGENT_TOKEN }}"
}Recommendation: Consolidate using configuration:
type TokenPrecedence struct {
CustomToken string
TopLevelToken string
Fallbacks []string
}
func getEffectiveToken(precedence TokenPrecedence) string {
if precedence.CustomToken != "" { return precedence.CustomToken }
if precedence.TopLevelToken != "" { return precedence.TopLevelToken }
return strings.Join(precedence.Fallbacks, " || ")
}Estimated Impact:
- Reduce ~30 lines of duplicate code
- Single source of truth for token precedence
- Effort: 2-3 hours
Issue 1.5: Validation Functions Scattered Across 18 Files (MEDIUM PRIORITY)
Current State: Validation logic spread across pkg/workflow:
pkg/workflow/
├── agent_validation.go
├── bundler_validation.go
├── compiler_filters_validation.go
├── docker_validation.go
├── engine_validation.go
├── expression_validation.go
├── github_toolset_validation_error.go
├── mcp_config_validation.go
├── npm_validation.go
├── pip_validation.go
├── repository_features_validation.go
├── runtime_validation.go
├── safe_output_validation_config.go
├── schema_validation.go
├── step_order_validation.go
├── strict_mode_validation.go
├── template_validation.go
└── validation.go
Problem: 18 validation files + 51 validate* functions make it hard to:
- Understand validation boundaries
- Maintain consistent validation patterns
- Find validation logic for a specific domain
Recommendation: Create pkg/workflow/validation/ subdirectory:
pkg/workflow/validation/
├── agent.go (agent_validation.go)
├── bundler.go (bundler_validation.go)
├── docker.go (docker_validation.go)
├── engine.go (engine_validation.go)
├── expression.go (expression_validation.go)
├── filters.go (compiler_filters_validation.go)
├── mcp_config.go (mcp_config_validation.go)
├── npm.go (npm_validation.go)
├── pip.go (pip_validation.go)
├── repository.go (repository_features_validation.go)
├── runtime.go (runtime_validation.go)
├── safe_outputs.go (safe_output_validation_config.go)
├── schema.go (schema_validation.go)
├── step_order.go (step_order_validation.go)
├── strict_mode.go (strict_mode_validation.go)
├── template.go (template_validation.go)
└── validation.go (core types)
Estimated Impact:
- Clear module boundaries for validation
- Natural import path:
workflow/validation - Easier to locate validation logic
- Effort: 3-4 hours (mostly file moves + import updates)
2. pkg/cli (99 files, ~30,000 lines, 580+ functions)
Issue 2.1: logs.go Oversized and Mixed Responsibilities (HIGH PRIORITY)
File: pkg/cli/logs.go (1,338 lines, 9 functions)
Functions: Download logs, parse logs, analyze logs, render logs, cache logs
Problem: Single file handles multiple distinct concerns:
- Command creation and flag parsing
- Job status fetching from GitHub API
- Log downloading and aggregation
- Log parsing and analysis (engine-specific)
- Error detection and reporting
- MCP tool usage analysis
- Firewall log analysis
- Metrics calculation
- Output formatting (JSON + console)
- Cache management
Current Support Files: logs_parsing.go, logs_metrics.go, logs_report.go, logs_download.go, logs_cache.go, logs_models.go
Recommendation: Further split logs.go:
pkg/cli/
├── logs_command.go (~300 lines - command setup and orchestration)
├── logs_download.go (exists - artifact downloading)
├── logs_parsing.go (exists - log parsing)
├── logs_analysis.go (~400 lines - NEW - analysis logic)
├── logs_metrics.go (exists - metrics calculation)
├── logs_report.go (exists - report generation)
├── logs_cache.go (exists - caching)
└── logs_models.go (exists - data models)
Estimated Impact:
- Complete separation of concerns
- logs_command.go becomes thin orchestrator
- Effort: 4-6 hours
Issue 2.2: update_command.go Handles Multiple Update Types (MEDIUM PRIORITY)
File: pkg/cli/update_command.go (1,331 lines, 20 functions)
Problem: Mixes workflow updates, action updates, and Git operations:
UpdateWorkflows(),updateWorkflow()- Workflow updatesUpdateActions(),getLatestActionRelease()- Action updatescheckExtensionUpdate()- Extension checkingcreateUpdatePR()- PR creationnormalizeWhitespace()- Utility
Recommendation: Split into:
pkg/cli/
├── update_workflows.go - Workflow update logic
├── update_actions.go - Action update logic
└── update_command.go - Command orchestration
Estimated Impact:
- Three ~450 line files vs one 1,331 line file
- Effort: 3-4 hours
3. pkg/parser (14 files, ~6,000 lines, 150+ functions)
Issue 3.1: frontmatter.go Mixed Responsibilities (HIGH PRIORITY)
File: pkg/parser/frontmatter.go (1,283 lines, 33 functions)
Functions: Import processing (6), include processing (6), extraction (12), merging (4), utilities (5)
Recommendation: Split by functional domain:
pkg/parser/
├── frontmatter.go (~100 lines - core types)
├── frontmatter_imports.go (~350 lines - ProcessImports* functions)
├── frontmatter_includes.go (~250 lines - ExpandIncludes*, ProcessIncludes*)
├── frontmatter_extract.go (~200 lines - extract*FromContent functions)
└── frontmatter_merge.go (~150 lines - MergeTools, merging logic)
Estimated Impact:
- Five focused files (~100-350 lines each)
- Clear separation: imports vs includes vs extraction
- Effort: 6-8 hours
Issue 3.2: schema.go Multiple Concerns (HIGH PRIORITY)
File: pkg/parser/schema.go (1,156 lines, 34 functions)
Functions:
- Validation orchestration (8 functions)
- Suggestion generation (4 functions)
- Schema compilation/caching (2 functions)
- Generic utilities (LevenshteinDistance, removeDuplicates, min/max)
Recommendation: Split and extract utilities:
pkg/parser/
├── schema.go (~150 lines - public API + types)
├── schema_validate.go (~400 lines - validation + custom rules)
├── schema_suggestions.go (~250 lines - suggestions and navigation)
├── schema_compile.go (~100 lines - compilation and caching)
└── schema_deprecated.go (~100 lines - deprecated field handling)
pkg/util/
├── strings.go (LevenshteinDistance, StripANSI)
├── slices.go (removeDuplicates, generic helpers)
└── math.go (min, max)
Estimated Impact:
- Five focused files vs one 1,156-line file
- Reusable utilities available across packages
- Effort: 6-8 hours
Semantic Function Clustering Analysis
Well-Organized Patterns (✓ Keep These)
1. Expression Handling (pkg/workflow/)
Excellent organization:
- expression_parser.go - Parsing
- expression_builder.go - Building
- expression_extraction.go - Extraction
- expression_nodes.go - AST nodes
- expression_validation.go - Validation
Use this as the pattern for other feature areas!
2. MCP Commands (pkg/cli/)
Well-structured subcommands:
- mcp.go - Main command
- mcp_add.go, mcp_inspect.go, mcp_list.go - Subcommands
- mcp_server.go, mcp_validation.go, mcp_config_file.go - Supporting modules
This is exemplary organization.
Patterns Needing Consolidation
1. Job Building Functions (pkg/workflow/compiler_jobs.go)
- Issue: 29 external
build*Job()functions called from one orchestrator - Pattern: All follow similar structure (job creation, dependency management)
- Recommendation: Already identified in Issue 1.1
2. Prompt Generation (pkg/workflow/)
- Issue: Scattered across compiler_yaml.go, prompts.go, repo_memory_prompt.go, template.go
- Pattern: All append content to prompt with heredoc
- Recommendation: Unify prompt generation logic
3. Safe Output Parsers (pkg/workflow/)
- Issue: 24+ individual parser functions in separate files
- Pattern: All call
parseBaseSafeOutputConfig()then add custom fields - Recommendation: Already identified in Issue 1.3
Priority Refactoring Recommendations
Priority 1: High-Impact Quick Wins (Immediate Action)
1.1 Consolidate Safe Outputs Configuration Logic ⭐⭐⭐
File: pkg/workflow/safe_outputs_config.go
Impact: Reduce 1,024 lines → ~500 lines (50% reduction)
Effort: 6-8 hours
Benefits: Single source of truth, easier to add new safe output types
1.2 Consolidate Upload Artifact Functions ⭐⭐⭐
File: pkg/workflow/compiler_yaml.go
Impact: Eliminate 7 wrapper functions (~40-50 lines)
Effort: 1-2 hours
Benefits: Easier action version updates, DRY principle
1.3 Consolidate Job Creation Pattern ⭐⭐
File: pkg/workflow/compiler_jobs.go
Impact: Reduce ~200 lines of duplicate code
Effort: 3-4 hours
Benefits: Easier to add new job types
Priority 2: Structural Improvements (Next Sprint)
2.1 Create Validation Subdirectory ⭐⭐
Location: pkg/workflow/validation/
Files: Move 18 validation files
Effort: 3-4 hours
Impact: Clear module boundaries
2.2 Split frontmatter.go ⭐⭐
Current: 1,283 lines, 33 functions
Target: 5 files (~100-350 lines each)
Effort: 6-8 hours
Impact: Clearer separation of concerns
2.3 Split schema.go ⭐⭐
Current: 1,156 lines, 34 functions
Target: 5 files (~100-400 lines each)
Effort: 6-8 hours
Impact: Clearer validation vs suggestions separation
2.4 Split logs.go ⭐
Current: 1,338 lines mixed responsibilities
Target: logs_command.go as thin orchestrator
Effort: 4-6 hours
Impact: Complete separation of concerns
Priority 3: Long-term Improvements
3.1 Consolidate Token Handling
Impact: Reduce ~30 lines, single source of truth
Effort: 2-3 hours
3.2 Extract Generic Utilities to pkg/util
Impact: Reusable utilities across packages
Effort: 2-3 hours
3.3 Split compiler_yaml.go
Current: 1,411 lines, 29 functions
Target: 4-5 focused modules
Effort: 4-5 hours
Summary Statistics
By Package
| Package | Files | Lines | Files >1000 | High-Priority Issues |
|---|---|---|---|---|
| pkg/workflow | 164 | ~44,000 | 6 | 5 critical |
| pkg/cli | 99 | ~30,000 | 3 | 2 high |
| pkg/parser | 14 | ~6,000 | 2 | 2 high |
| Utilities | 22 | ~3,600 | 0 | None |
Function Pattern Quality
| Pattern | Count | Status | Action Needed |
|---|---|---|---|
| build* | 97 | ✓ Well-clustered | None |
| parse* | 110 | ✓ Distributed appropriately | None |
| extract* | 103 | ✓ Concentrated appropriately | None |
| generate* | 107 | Consolidate uploads | |
| validate* | 52 | Create subdirectory | |
| render* | 49 | Extract to modules |
Impact Summary
| Category | Files Affected | Lines Reduced | Estimated Effort |
|---|---|---|---|
| Duplicate Code | 5 | ~470 lines | 12-17 hours |
| File Splitting | 6 | Reorganize 7,500 lines | 30-38 hours |
| Modularization | 18 | Better boundaries | 3-4 hours |
| Utilities | 3 | Reusable helpers | 2-3 hours |
| TOTAL | 32 | ~470 lines removed, 7,500 reorganized | 47-62 hours |
Implementation Checklist
Phase 1: Quick Wins (Weeks 1-2)
- Consolidate upload artifact functions (Issue 1.2)
- Consolidate token handling (Issue 1.4)
- Consolidate job creation pattern (Issue 1.1)
- Test and verify no regressions
Estimated: ~230 lines reduced, 8-11 hours
Phase 2: Critical Consolidation (Weeks 3-4)
- Consolidate safe_outputs_config.go logic (Issue 1.3)
- Test and verify no regressions
Estimated: ~500 lines reduced, 6-8 hours
Phase 3: Structural (Weeks 5-6)
- Create validation subdirectory (Issue 1.5)
- Split frontmatter.go (Issue 3.1)
- Split schema.go (Issue 3.2)
- Test and verify no regressions
Estimated: Better organization, 16-20 hours
Phase 4: CLI Package (Weeks 7-8)
- Split logs.go further (Issue 2.1)
- Split update_command.go (Issue 2.2)
- Test and verify no regressions
Estimated: Better organization, 7-10 hours
Analysis Metadata
- Analysis Date: 2025-12-14
- Repository: githubnext/gh-aw
- Total Files Analyzed: 299 non-test Go files
- Total Lines Analyzed: 83,614 lines
- Total Functions Cataloged: 3,500+
- Detection Methods:
- Automated function signature extraction
- Semantic clustering by naming patterns
- Detailed file-by-file analysis using specialized agents
- Manual pattern recognition for duplicates
- Cross-file reference analysis
Conclusion
The codebase demonstrates strong architectural foundations with clear naming conventions and good separation of concerns at the package level. The primary opportunities for improvement are:
- Eliminating massive duplication in safe_outputs_config.go (~500 lines saved)
- Consolidating repetitive patterns in job creation and uploads (~240 lines saved)
- Splitting oversized files to improve cognitive load (11 files >1000 lines)
- Organizing validation into dedicated subdirectory (18 scattered files)
- Maintaining excellent patterns from expression handling and MCP commands
Recommended Approach: Start with Priority 1 quick wins (duplication elimination) to build momentum and demonstrate value, then tackle Priority 2 structural improvements incrementally to avoid disrupting ongoing development. The high-impact consolidations (safe_outputs_config.go, job creation patterns) should be prioritized as they offer 50%+ code reduction while improving maintainability.
Total Estimated Effort: 47-62 hours spread across 8 weeks
Total Code Reduction: ~470 lines eliminated + 7,500 lines better organized
Expected Benefits: Improved maintainability, easier testing, reduced cognitive load, single sources of truth for core patterns
AI generated by Semantic Function Refactoring