Skip to content

[refactor] Semantic Function Clustering Analysis - Code Organization Opportunities #6410

@github-actions

Description

@github-actions

Overview

Comprehensive semantic analysis of 299 non-test Go files (83,614 lines) identified significant refactoring opportunities through function clustering, duplicate detection, and organizational improvements. The analysis cataloged function naming patterns, identified outliers, and detected high-impact consolidation opportunities.

Key Statistics:

  • Total files analyzed: 299 (164 in pkg/workflow, 99 in pkg/cli, 14 in pkg/parser, 22 utility files)
  • Total lines of code: 83,614 lines
  • Files >1000 lines: 11 files requiring attention
  • Validation files: 18 separate files (should be consolidated)
  • Major duplicate patterns: Token handling (3 similar functions), safe outputs config (270+ lines of repetition), upload artifact generation (7 similar functions)

Major Findings:

  • Large files mixing multiple responsibilities (compiler_jobs.go: 1,456 lines with 14 functions + 29 external builders)
  • Token handling logic duplicated 3 times with similar precedence chains
  • Safe outputs configuration has 270+ lines of near-identical checks repeated in 5 functions
  • Upload artifact generation duplicated 7 times following identical patterns
  • 18 validation files scattered instead of being in dedicated subdirectory
Full Analysis Report

Repository Statistics

Package Distribution

Package Files Lines Functions Largest File
pkg/workflow 164 ~44,000 2,761 compiler_jobs.go (1,456 lines)
pkg/cli 99 ~30,000 580+ logs.go (1,338 lines)
pkg/parser 14 ~6,000 150+ frontmatter.go (1,283 lines)
Utilities 22 ~3,600 - -
Total 299 83,614 3,500+ -

Function Naming Patterns (Top 15)

Analysis of all functions across the codebase reveals clear semantic clusters:

Pattern Count Primary Use Well-Organized?
get* 118 Getters, retrievers ✓ Appropriate
parse* 110 Parsing logic ✓ Distributed by domain
generate* 107 Code/config generation ⚠️ Some duplication
extract* 103 Data extraction ✓ Concentrated appropriately
build* 97 Builders, constructors ✓ Well-clustered
validate* 52 Validation ⚠️ SCATTERED - needs consolidation
is* 50 Boolean checks ✓ Appropriate
render* 49 Output rendering ⚠️ Could consolidate
add* 25 Addition operations ✓ Appropriate
find* 22 Search operations ✓ Appropriate
resolve* 17 Resolution logic ✓ Appropriate
has* 17 Existence checks ✓ Appropriate
ensure* 16 Ensure patterns ⚠️ Pattern suggests abstraction needed
check* 16 Validation checks ⚠️ Overlaps with validate*
process* 13 Processing logic ✓ Appropriate

Critical Findings by Package

1. pkg/workflow (164 files, 44,000+ lines, 2,761 functions)

Issue 1.1: compiler_jobs.go - Massive Duplication in Safe Outputs Job Creation (HIGH PRIORITY)

File: pkg/workflow/compiler_jobs.go (1,456 lines, 14 functions)

Problem: The buildSafeOutputsJobs() function (lines 276-765, 489 lines) contains 20+ near-identical code blocks for building different job types:

// Pattern repeated 23 times:
if data.SafeOutputs.CreateIssues != nil {
    createIssueJob, err := c.buildCreateOutputIssueJob(data, jobName)
    if err != nil {
        return fmt.Errorf("failed to build create_issue job: %w", err)
    }
    if threatDetectionEnabled {
        createIssueJob.Needs = append(createIssueJob.Needs, constants.DetectionJobName)
        createIssueJob.If = AddDetectionSuccessCheck(createIssueJob.If)
    }
    if err := c.jobManager.AddJob(createIssueJob); err != nil {
        return fmt.Errorf("failed to add create_issue job: %w", err)
    }
    safeOutputJobNames = append(safeOutputJobNames, createIssueJob.Name)
}

Recommendation: Consolidate using configuration-driven approach:

type SafeOutputJobConfig struct {
    ConfigField    interface{}
    BuildFunc      func(*WorkflowData, string) (*Job, error)
    ExtraDeps      []string
}

var safeOutputJobs = []SafeOutputJobConfig{
    {data.SafeOutputs.CreateIssues, c.buildCreateOutputIssueJob, nil},
    {data.SafeOutputs.CreateAgentTasks, c.buildCreateOutputAgentTaskJob, nil},
    // ... 21 more entries
}

for _, config := range safeOutputJobs {
    if config.ConfigField != nil {
        job, err := config.BuildFunc(data, jobName)
        // ... common error handling and dependency logic
    }
}

Estimated Impact:

  • Reduce ~200 lines of duplicate code
  • Single source of truth for job creation pattern
  • Easier to add new safe output types
  • Effort: 3-4 hours

Issue 1.2: compiler_yaml.go - Upload Artifact Generation Duplication (HIGH PRIORITY)

File: pkg/workflow/compiler_yaml.go (1,411 lines, 29 functions)

Problem: Seven functions generating artifact uploads follow identical patterns (lines 466-641):

func (c *Compiler) generateUploadAgentLogs(yaml *strings.Builder, logFileFull string) {
    c.generateArtifactUpload(yaml, ArtifactUploadConfig{
        StepName:       "Upload Agent Logs",
        ArtifactName:   "agent-logs",
        UploadPaths:    []string{logFileFull},
        IfNoFilesFound: "warn",
    })
}

// IDENTICAL pattern for:
// - generateUploadAssets()
// - generateUploadAwInfo()
// - generateUploadPrompt()
// - generateUploadMCPLogs()
// - generateUploadSafeInputsLogs()
// - generateUploadAccessLogs()

All are one-liner wrappers around generateArtifactUpload().

Recommendation: Replace with configuration table:

type ArtifactSpec struct {
    Name           string
    StepName       string
    Path           string
    IfNoFilesFound string
}

var artifactSpecs = []ArtifactSpec{
    {"agent-logs", "Upload Agent Logs", logFileFull, "warn"},
    {"assets", "Upload Assets", "/tmp/gh-aw/assets", "ignore"},
    // ... 5 more entries
}

func (c *Compiler) generateAllArtifactUploads(yaml *strings.Builder, specs []ArtifactSpec) {
    for _, spec := range specs {
        c.generateArtifactUpload(yaml, ArtifactUploadConfig{...})
    }
}

Estimated Impact:

  • Eliminate 7 wrapper functions (~40-50 lines)
  • Easier to update artifact upload action versions
  • Effort: 1-2 hours

Issue 1.3: safe_outputs_config.go - Massive Repetition in Configuration Processing (CRITICAL)

File: pkg/workflow/safe_outputs_config.go (1,024 lines, 15+ functions)

Problem: The file contains 270+ lines of near-identical code repeated in 5 different functions:

Location 1: extractSafeOutputsConfig() (lines 170-464) - Extracts 24+ config types:

issuesConfig := c.parseIssuesConfig(outputMap)
if issuesConfig != nil {
    config.CreateIssues = issuesConfig
}

agentTaskConfig := c.parseAgentTaskConfig(outputMap)
if agentTaskConfig != nil {
    config.CreateAgentTasks = agentTaskConfig
}
// ... repeated 22+ more times

Location 2: generateSafeOutputsConfig() (lines 576-847) - Generates JSON for each type:

if data.SafeOutputs.CreateIssues != nil {
    issueConfig := map[string]any{}
    maxValue := 1
    if data.SafeOutputs.CreateIssues.Max > 0 {
        maxValue = data.SafeOutputs.CreateIssues.Max
    }
    issueConfig["max"] = maxValue
    if len(data.SafeOutputs.CreateIssues.AllowedLabels) > 0 {
        issueConfig["allowed_labels"] = data.SafeOutputs.CreateIssues.AllowedLabels
    }
    safeOutputsConfig["create_issue"] = issueConfig
}
// ... repeated 23+ more times with minor variations

Location 3: generateFilteredToolsJSON() (lines 912-986) - Checks enabled tools:

if data.SafeOutputs.CreateIssues != nil {
    enabledTools["create_issue"] = true
}
if data.SafeOutputs.CreateAgentTasks != nil {
    enabledTools["create_agent_task"] = true
}
// ... repeated 22+ times

Also repeated in:

  • HasSafeOutputsEnabled() (lines 33-55) - 23 nil checks
  • GetEnabledSafeOutputToolNames() (lines 74-148) - 24+ type mappings

Recommendation: Create metadata-driven system:

type SafeOutputMetadata struct {
    Key             string   // "create_issue"
    ConfigFieldName string   // "CreateIssues"
    DefaultMax      int
    CustomFields    []string // Config-specific fields
}

var SafeOutputRegistry = []SafeOutputMetadata{
    {"create_issue", "CreateIssues", 1, []string{"allowed_labels", "required_title_prefix"}},
    {"create_agent_task", "CreateAgentTasks", 1, nil},
    // ... all 24+ entries
}

// Use reflection/metadata to iterate instead of manual checks
func extractSafeOutputsConfig(outputMap map[string]any) *SafeOutputsConfig {
    for _, meta := range SafeOutputRegistry {
        if configData, exists := outputMap[meta.Key]; exists {
            // Use reflection to set field
        }
    }
}

Estimated Impact:

  • Reduce 1,024 lines → ~450-500 lines (55% reduction!)
  • Single source of truth for safe output types
  • New types require 1 metadata entry instead of 5 scattered additions
  • Effort: 6-8 hours

Issue 1.4: Token Handling Duplication (MEDIUM PRIORITY)

File: pkg/workflow/github_token.go

Problem: Three nearly identical functions with different token precedence:

// Function 1: Generic
func getEffectiveGitHubToken(customToken, toplevelToken string) string {
    if customToken != "" { return customToken }
    if toplevelToken != "" { return toplevelToken }
    return "${{ secrets.GH_AW_GITHUB_MCP_SERVER_TOKEN || secrets.GH_AW_GITHUB_TOKEN || secrets.GITHUB_TOKEN }}"
}

// Function 2: Copilot-specific (80% similar)
func getEffectiveCopilotGitHubToken(customToken, toplevelToken string) string {
    if customToken != "" { return customToken }
    if toplevelToken != "" { return toplevelToken }
    return "${{ secrets.COPILOT_GITHUB_TOKEN || secrets.GH_AW_GITHUB_TOKEN }}"
}

// Function 3: Agent-specific (70% similar)
func getEffectiveAgentGitHubToken(customToken string) string {
    if customToken != "" { return customToken }
    return "${{ secrets.GH_AW_AGENT_TOKEN }}"
}

Recommendation: Consolidate using configuration:

type TokenPrecedence struct {
    CustomToken   string
    TopLevelToken string
    Fallbacks     []string
}

func getEffectiveToken(precedence TokenPrecedence) string {
    if precedence.CustomToken != "" { return precedence.CustomToken }
    if precedence.TopLevelToken != "" { return precedence.TopLevelToken }
    return strings.Join(precedence.Fallbacks, " || ")
}

Estimated Impact:

  • Reduce ~30 lines of duplicate code
  • Single source of truth for token precedence
  • Effort: 2-3 hours

Issue 1.5: Validation Functions Scattered Across 18 Files (MEDIUM PRIORITY)

Current State: Validation logic spread across pkg/workflow:

pkg/workflow/
├── agent_validation.go
├── bundler_validation.go
├── compiler_filters_validation.go
├── docker_validation.go
├── engine_validation.go
├── expression_validation.go
├── github_toolset_validation_error.go
├── mcp_config_validation.go
├── npm_validation.go
├── pip_validation.go
├── repository_features_validation.go
├── runtime_validation.go
├── safe_output_validation_config.go
├── schema_validation.go
├── step_order_validation.go
├── strict_mode_validation.go
├── template_validation.go
└── validation.go

Problem: 18 validation files + 51 validate* functions make it hard to:

  • Understand validation boundaries
  • Maintain consistent validation patterns
  • Find validation logic for a specific domain

Recommendation: Create pkg/workflow/validation/ subdirectory:

pkg/workflow/validation/
├── agent.go              (agent_validation.go)
├── bundler.go            (bundler_validation.go)
├── docker.go             (docker_validation.go)
├── engine.go             (engine_validation.go)
├── expression.go         (expression_validation.go)
├── filters.go            (compiler_filters_validation.go)
├── mcp_config.go         (mcp_config_validation.go)
├── npm.go                (npm_validation.go)
├── pip.go                (pip_validation.go)
├── repository.go         (repository_features_validation.go)
├── runtime.go            (runtime_validation.go)
├── safe_outputs.go       (safe_output_validation_config.go)
├── schema.go             (schema_validation.go)
├── step_order.go         (step_order_validation.go)
├── strict_mode.go        (strict_mode_validation.go)
├── template.go           (template_validation.go)
└── validation.go         (core types)

Estimated Impact:

  • Clear module boundaries for validation
  • Natural import path: workflow/validation
  • Easier to locate validation logic
  • Effort: 3-4 hours (mostly file moves + import updates)

2. pkg/cli (99 files, ~30,000 lines, 580+ functions)

Issue 2.1: logs.go Oversized and Mixed Responsibilities (HIGH PRIORITY)

File: pkg/cli/logs.go (1,338 lines, 9 functions)

Functions: Download logs, parse logs, analyze logs, render logs, cache logs

Problem: Single file handles multiple distinct concerns:

  • Command creation and flag parsing
  • Job status fetching from GitHub API
  • Log downloading and aggregation
  • Log parsing and analysis (engine-specific)
  • Error detection and reporting
  • MCP tool usage analysis
  • Firewall log analysis
  • Metrics calculation
  • Output formatting (JSON + console)
  • Cache management

Current Support Files: logs_parsing.go, logs_metrics.go, logs_report.go, logs_download.go, logs_cache.go, logs_models.go

Recommendation: Further split logs.go:

pkg/cli/
├── logs_command.go       (~300 lines - command setup and orchestration)
├── logs_download.go      (exists - artifact downloading)
├── logs_parsing.go       (exists - log parsing)
├── logs_analysis.go      (~400 lines - NEW - analysis logic)
├── logs_metrics.go       (exists - metrics calculation)
├── logs_report.go        (exists - report generation)
├── logs_cache.go         (exists - caching)
└── logs_models.go        (exists - data models)

Estimated Impact:

  • Complete separation of concerns
  • logs_command.go becomes thin orchestrator
  • Effort: 4-6 hours

Issue 2.2: update_command.go Handles Multiple Update Types (MEDIUM PRIORITY)

File: pkg/cli/update_command.go (1,331 lines, 20 functions)

Problem: Mixes workflow updates, action updates, and Git operations:

  • UpdateWorkflows(), updateWorkflow() - Workflow updates
  • UpdateActions(), getLatestActionRelease() - Action updates
  • checkExtensionUpdate() - Extension checking
  • createUpdatePR() - PR creation
  • normalizeWhitespace() - Utility

Recommendation: Split into:

pkg/cli/
├── update_workflows.go   - Workflow update logic
├── update_actions.go     - Action update logic
└── update_command.go     - Command orchestration

Estimated Impact:

  • Three ~450 line files vs one 1,331 line file
  • Effort: 3-4 hours

3. pkg/parser (14 files, ~6,000 lines, 150+ functions)

Issue 3.1: frontmatter.go Mixed Responsibilities (HIGH PRIORITY)

File: pkg/parser/frontmatter.go (1,283 lines, 33 functions)

Functions: Import processing (6), include processing (6), extraction (12), merging (4), utilities (5)

Recommendation: Split by functional domain:

pkg/parser/
├── frontmatter.go             (~100 lines - core types)
├── frontmatter_imports.go     (~350 lines - ProcessImports* functions)
├── frontmatter_includes.go    (~250 lines - ExpandIncludes*, ProcessIncludes*)
├── frontmatter_extract.go     (~200 lines - extract*FromContent functions)
└── frontmatter_merge.go       (~150 lines - MergeTools, merging logic)

Estimated Impact:

  • Five focused files (~100-350 lines each)
  • Clear separation: imports vs includes vs extraction
  • Effort: 6-8 hours

Issue 3.2: schema.go Multiple Concerns (HIGH PRIORITY)

File: pkg/parser/schema.go (1,156 lines, 34 functions)

Functions:

  • Validation orchestration (8 functions)
  • Suggestion generation (4 functions)
  • Schema compilation/caching (2 functions)
  • Generic utilities (LevenshteinDistance, removeDuplicates, min/max)

Recommendation: Split and extract utilities:

pkg/parser/
├── schema.go                 (~150 lines - public API + types)
├── schema_validate.go        (~400 lines - validation + custom rules)
├── schema_suggestions.go     (~250 lines - suggestions and navigation)
├── schema_compile.go         (~100 lines - compilation and caching)
└── schema_deprecated.go      (~100 lines - deprecated field handling)

pkg/util/
├── strings.go                (LevenshteinDistance, StripANSI)
├── slices.go                 (removeDuplicates, generic helpers)
└── math.go                   (min, max)

Estimated Impact:

  • Five focused files vs one 1,156-line file
  • Reusable utilities available across packages
  • Effort: 6-8 hours

Semantic Function Clustering Analysis

Well-Organized Patterns (✓ Keep These)

1. Expression Handling (pkg/workflow/)

Excellent organization:

  • expression_parser.go - Parsing
  • expression_builder.go - Building
  • expression_extraction.go - Extraction
  • expression_nodes.go - AST nodes
  • expression_validation.go - Validation

Use this as the pattern for other feature areas!

2. MCP Commands (pkg/cli/)

Well-structured subcommands:

  • mcp.go - Main command
  • mcp_add.go, mcp_inspect.go, mcp_list.go - Subcommands
  • mcp_server.go, mcp_validation.go, mcp_config_file.go - Supporting modules

This is exemplary organization.


Patterns Needing Consolidation

1. Job Building Functions (pkg/workflow/compiler_jobs.go)

  • Issue: 29 external build*Job() functions called from one orchestrator
  • Pattern: All follow similar structure (job creation, dependency management)
  • Recommendation: Already identified in Issue 1.1

2. Prompt Generation (pkg/workflow/)

  • Issue: Scattered across compiler_yaml.go, prompts.go, repo_memory_prompt.go, template.go
  • Pattern: All append content to prompt with heredoc
  • Recommendation: Unify prompt generation logic

3. Safe Output Parsers (pkg/workflow/)

  • Issue: 24+ individual parser functions in separate files
  • Pattern: All call parseBaseSafeOutputConfig() then add custom fields
  • Recommendation: Already identified in Issue 1.3

Priority Refactoring Recommendations

Priority 1: High-Impact Quick Wins (Immediate Action)

1.1 Consolidate Safe Outputs Configuration Logic ⭐⭐⭐

File: pkg/workflow/safe_outputs_config.go
Impact: Reduce 1,024 lines → ~500 lines (50% reduction)
Effort: 6-8 hours
Benefits: Single source of truth, easier to add new safe output types

1.2 Consolidate Upload Artifact Functions ⭐⭐⭐

File: pkg/workflow/compiler_yaml.go
Impact: Eliminate 7 wrapper functions (~40-50 lines)
Effort: 1-2 hours
Benefits: Easier action version updates, DRY principle

1.3 Consolidate Job Creation Pattern ⭐⭐

File: pkg/workflow/compiler_jobs.go
Impact: Reduce ~200 lines of duplicate code
Effort: 3-4 hours
Benefits: Easier to add new job types


Priority 2: Structural Improvements (Next Sprint)

2.1 Create Validation Subdirectory ⭐⭐

Location: pkg/workflow/validation/
Files: Move 18 validation files
Effort: 3-4 hours
Impact: Clear module boundaries

2.2 Split frontmatter.go ⭐⭐

Current: 1,283 lines, 33 functions
Target: 5 files (~100-350 lines each)
Effort: 6-8 hours
Impact: Clearer separation of concerns

2.3 Split schema.go ⭐⭐

Current: 1,156 lines, 34 functions
Target: 5 files (~100-400 lines each)
Effort: 6-8 hours
Impact: Clearer validation vs suggestions separation

2.4 Split logs.go ⭐

Current: 1,338 lines mixed responsibilities
Target: logs_command.go as thin orchestrator
Effort: 4-6 hours
Impact: Complete separation of concerns


Priority 3: Long-term Improvements

3.1 Consolidate Token Handling

Impact: Reduce ~30 lines, single source of truth
Effort: 2-3 hours

3.2 Extract Generic Utilities to pkg/util

Impact: Reusable utilities across packages
Effort: 2-3 hours

3.3 Split compiler_yaml.go

Current: 1,411 lines, 29 functions
Target: 4-5 focused modules
Effort: 4-5 hours


Summary Statistics

By Package

Package Files Lines Files >1000 High-Priority Issues
pkg/workflow 164 ~44,000 6 5 critical
pkg/cli 99 ~30,000 3 2 high
pkg/parser 14 ~6,000 2 2 high
Utilities 22 ~3,600 0 None

Function Pattern Quality

Pattern Count Status Action Needed
build* 97 ✓ Well-clustered None
parse* 110 ✓ Distributed appropriately None
extract* 103 ✓ Concentrated appropriately None
generate* 107 ⚠️ Some duplication Consolidate uploads
validate* 52 ⚠️ Scattered Create subdirectory
render* 49 ⚠️ Could consolidate Extract to modules

Impact Summary

Category Files Affected Lines Reduced Estimated Effort
Duplicate Code 5 ~470 lines 12-17 hours
File Splitting 6 Reorganize 7,500 lines 30-38 hours
Modularization 18 Better boundaries 3-4 hours
Utilities 3 Reusable helpers 2-3 hours
TOTAL 32 ~470 lines removed, 7,500 reorganized 47-62 hours

Implementation Checklist

Phase 1: Quick Wins (Weeks 1-2)

  • Consolidate upload artifact functions (Issue 1.2)
  • Consolidate token handling (Issue 1.4)
  • Consolidate job creation pattern (Issue 1.1)
  • Test and verify no regressions

Estimated: ~230 lines reduced, 8-11 hours

Phase 2: Critical Consolidation (Weeks 3-4)

  • Consolidate safe_outputs_config.go logic (Issue 1.3)
  • Test and verify no regressions

Estimated: ~500 lines reduced, 6-8 hours

Phase 3: Structural (Weeks 5-6)

  • Create validation subdirectory (Issue 1.5)
  • Split frontmatter.go (Issue 3.1)
  • Split schema.go (Issue 3.2)
  • Test and verify no regressions

Estimated: Better organization, 16-20 hours

Phase 4: CLI Package (Weeks 7-8)

  • Split logs.go further (Issue 2.1)
  • Split update_command.go (Issue 2.2)
  • Test and verify no regressions

Estimated: Better organization, 7-10 hours


Analysis Metadata

  • Analysis Date: 2025-12-14
  • Repository: githubnext/gh-aw
  • Total Files Analyzed: 299 non-test Go files
  • Total Lines Analyzed: 83,614 lines
  • Total Functions Cataloged: 3,500+
  • Detection Methods:
    • Automated function signature extraction
    • Semantic clustering by naming patterns
    • Detailed file-by-file analysis using specialized agents
    • Manual pattern recognition for duplicates
    • Cross-file reference analysis

Conclusion

The codebase demonstrates strong architectural foundations with clear naming conventions and good separation of concerns at the package level. The primary opportunities for improvement are:

  1. Eliminating massive duplication in safe_outputs_config.go (~500 lines saved)
  2. Consolidating repetitive patterns in job creation and uploads (~240 lines saved)
  3. Splitting oversized files to improve cognitive load (11 files >1000 lines)
  4. Organizing validation into dedicated subdirectory (18 scattered files)
  5. Maintaining excellent patterns from expression handling and MCP commands

Recommended Approach: Start with Priority 1 quick wins (duplication elimination) to build momentum and demonstrate value, then tackle Priority 2 structural improvements incrementally to avoid disrupting ongoing development. The high-impact consolidations (safe_outputs_config.go, job creation patterns) should be prioritized as they offer 50%+ code reduction while improving maintainability.

Total Estimated Effort: 47-62 hours spread across 8 weeks
Total Code Reduction: ~470 lines eliminated + 7,500 lines better organized
Expected Benefits: Improved maintainability, easier testing, reduced cognitive load, single sources of truth for core patterns

AI generated by Semantic Function Refactoring

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions