[refactor] Semantic Function Clustering Analysis - Code Organization Opportunities

## Overview

Comprehensive semantic analysis of **299 non-test Go files** (83,614 lines) identified significant refactoring opportunities through function clustering, duplicate detection, and organizational improvements. The analysis cataloged function naming patterns, identified outliers, and detected high-impact consolidation opportunities.

**Key Statistics:**
- **Total files analyzed**: 299 (164 in pkg/workflow, 99 in pkg/cli, 14 in pkg/parser, 22 utility files)
- **Total lines of code**: 83,614 lines
- **Files >1000 lines**: 11 files requiring attention
- **Validation files**: 18 separate files (should be consolidated)
- **Major duplicate patterns**: Token handling (3 similar functions), safe outputs config (270+ lines of repetition), upload artifact generation (7 similar functions)

**Major Findings:**
- Large files mixing multiple responsibilities (compiler_jobs.go: 1,456 lines with 14 functions + 29 external builders)
- Token handling logic duplicated 3 times with similar precedence chains
- Safe outputs configuration has 270+ lines of near-identical checks repeated in 5 functions
- Upload artifact generation duplicated 7 times following identical patterns
- 18 validation files scattered instead of being in dedicated subdirectory

<details>
<summary><b>Full Analysis Report</b></summary>

## Repository Statistics

### Package Distribution
| Package | Files | Lines | Functions | Largest File |
|---------|-------|-------|-----------|--------------|
| **pkg/workflow** | 164 | ~44,000 | 2,761 | compiler_jobs.go (1,456 lines) |
| **pkg/cli** | 99 | ~30,000 | 580+ | logs.go (1,338 lines) |
| **pkg/parser** | 14 | ~6,000 | 150+ | frontmatter.go (1,283 lines) |
| **Utilities** | 22 | ~3,600 | - | - |
| **Total** | 299 | 83,614 | 3,500+ | - |

### Function Naming Patterns (Top 15)
Analysis of all functions across the codebase reveals clear semantic clusters:

| Pattern | Count | Primary Use | Well-Organized? |
|---------|-------|-------------|-----------------|
| `get*` | 118 | Getters, retrievers | ✓ Appropriate |
| `parse*` | 110 | Parsing logic | ✓ Distributed by domain |
| `generate*` | 107 | Code/config generation | ⚠️ Some duplication |
| `extract*` | 103 | Data extraction | ✓ Concentrated appropriately |
| `build*` | 97 | Builders, constructors | ✓ Well-clustered |
| `validate*` | 52 | Validation | ⚠️ **SCATTERED - needs consolidation** |
| `is*` | 50 | Boolean checks | ✓ Appropriate |
| `render*` | 49 | Output rendering | ⚠️ Could consolidate |
| `add*` | 25 | Addition operations | ✓ Appropriate |
| `find*` | 22 | Search operations | ✓ Appropriate |
| `resolve*` | 17 | Resolution logic | ✓ Appropriate |
| `has*` | 17 | Existence checks | ✓ Appropriate |
| `ensure*` | 16 | Ensure patterns | ⚠️ Pattern suggests abstraction needed |
| `check*` | 16 | Validation checks | ⚠️ Overlaps with validate* |
| `process*` | 13 | Processing logic | ✓ Appropriate |

---

## Critical Findings by Package

### 1. pkg/workflow (164 files, 44,000+ lines, 2,761 functions)

#### Issue 1.1: compiler_jobs.go - Massive Duplication in Safe Outputs Job Creation (HIGH PRIORITY)

**File**: pkg/workflow/compiler_jobs.go (1,456 lines, 14 functions)

**Problem**: The `buildSafeOutputsJobs()` function (lines 276-765, 489 lines) contains **20+ near-identical code blocks** for building different job types:

```go
// Pattern repeated 23 times:
if data.SafeOutputs.CreateIssues != nil {
    createIssueJob, err := c.buildCreateOutputIssueJob(data, jobName)
    if err != nil {
        return fmt.Errorf("failed to build create_issue job: %w", err)
    }
    if threatDetectionEnabled {
        createIssueJob.Needs = append(createIssueJob.Needs, constants.DetectionJobName)
        createIssueJob.If = AddDetectionSuccessCheck(createIssueJob.If)
    }
    if err := c.jobManager.AddJob(createIssueJob); err != nil {
        return fmt.Errorf("failed to add create_issue job: %w", err)
    }
    safeOutputJobNames = append(safeOutputJobNames, createIssueJob.Name)
}
```

**Recommendation**: Consolidate using configuration-driven approach:
```go
type SafeOutputJobConfig struct {
    ConfigField    interface{}
    BuildFunc      func(*WorkflowData, string) (*Job, error)
    ExtraDeps      []string
}

var safeOutputJobs = []SafeOutputJobConfig{
    {data.SafeOutputs.CreateIssues, c.buildCreateOutputIssueJob, nil},
    {data.SafeOutputs.CreateAgentTasks, c.buildCreateOutputAgentTaskJob, nil},
    // ... 21 more entries
}

for _, config := range safeOutputJobs {
    if config.ConfigField != nil {
        job, err := config.BuildFunc(data, jobName)
        // ... common error handling and dependency logic
    }
}
```

**Estimated Impact**: 
- Reduce ~200 lines of duplicate code
- Single source of truth for job creation pattern
- Easier to add new safe output types
- **Effort**: 3-4 hours

---

#### Issue 1.2: compiler_yaml.go - Upload Artifact Generation Duplication (HIGH PRIORITY)

**File**: pkg/workflow/compiler_yaml.go (1,411 lines, 29 functions)

**Problem**: Seven functions generating artifact uploads follow identical patterns (lines 466-641):

```go
func (c *Compiler) generateUploadAgentLogs(yaml *strings.Builder, logFileFull string) {
    c.generateArtifactUpload(yaml, ArtifactUploadConfig{
        StepName:       "Upload Agent Logs",
        ArtifactName:   "agent-logs",
        UploadPaths:    []string{logFileFull},
        IfNoFilesFound: "warn",
    })
}

// IDENTICAL pattern for:
// - generateUploadAssets()
// - generateUploadAwInfo()
// - generateUploadPrompt()
// - generateUploadMCPLogs()
// - generateUploadSafeInputsLogs()
// - generateUploadAccessLogs()
```

**All are one-liner wrappers** around `generateArtifactUpload()`.

**Recommendation**: Replace with configuration table:
```go
type ArtifactSpec struct {
    Name           string
    StepName       string
    Path           string
    IfNoFilesFound string
}

var artifactSpecs = []ArtifactSpec{
    {"agent-logs", "Upload Agent Logs", logFileFull, "warn"},
    {"assets", "Upload Assets", "/tmp/gh-aw/assets", "ignore"},
    // ... 5 more entries
}

func (c *Compiler) generateAllArtifactUploads(yaml *strings.Builder, specs []ArtifactSpec) {
    for _, spec := range specs {
        c.generateArtifactUpload(yaml, ArtifactUploadConfig{...})
    }
}
```

**Estimated Impact**:
- Eliminate 7 wrapper functions (~40-50 lines)
- Easier to update artifact upload action versions
- **Effort**: 1-2 hours

---

#### Issue 1.3: safe_outputs_config.go - Massive Repetition in Configuration Processing (CRITICAL)

**File**: pkg/workflow/safe_outputs_config.go (1,024 lines, 15+ functions)

**Problem**: The file contains **270+ lines of near-identical code** repeated in 5 different functions:

**Location 1**: `extractSafeOutputsConfig()` (lines 170-464) - Extracts 24+ config types:
```go
issuesConfig := c.parseIssuesConfig(outputMap)
if issuesConfig != nil {
    config.CreateIssues = issuesConfig
}

agentTaskConfig := c.parseAgentTaskConfig(outputMap)
if agentTaskConfig != nil {
    config.CreateAgentTasks = agentTaskConfig
}
// ... repeated 22+ more times
```

**Location 2**: `generateSafeOutputsConfig()` (lines 576-847) - Generates JSON for each type:
```go
if data.SafeOutputs.CreateIssues != nil {
    issueConfig := map[string]any{}
    maxValue := 1
    if data.SafeOutputs.CreateIssues.Max > 0 {
        maxValue = data.SafeOutputs.CreateIssues.Max
    }
    issueConfig["max"] = maxValue
    if len(data.SafeOutputs.CreateIssues.AllowedLabels) > 0 {
        issueConfig["allowed_labels"] = data.SafeOutputs.CreateIssues.AllowedLabels
    }
    safeOutputsConfig["create_issue"] = issueConfig
}
// ... repeated 23+ more times with minor variations
```

**Location 3**: `generateFilteredToolsJSON()` (lines 912-986) - Checks enabled tools:
```go
if data.SafeOutputs.CreateIssues != nil {
    enabledTools["create_issue"] = true
}
if data.SafeOutputs.CreateAgentTasks != nil {
    enabledTools["create_agent_task"] = true
}
// ... repeated 22+ times
```

**Also repeated in**:
- `HasSafeOutputsEnabled()` (lines 33-55) - 23 nil checks
- `GetEnabledSafeOutputToolNames()` (lines 74-148) - 24+ type mappings

**Recommendation**: Create metadata-driven system:
```go
type SafeOutputMetadata struct {
    Key             string   // "create_issue"
    ConfigFieldName string   // "CreateIssues"
    DefaultMax      int
    CustomFields    []string // Config-specific fields
}

var SafeOutputRegistry = []SafeOutputMetadata{
    {"create_issue", "CreateIssues", 1, []string{"allowed_labels", "required_title_prefix"}},
    {"create_agent_task", "CreateAgentTasks", 1, nil},
    // ... all 24+ entries
}

// Use reflection/metadata to iterate instead of manual checks
func extractSafeOutputsConfig(outputMap map[string]any) *SafeOutputsConfig {
    for _, meta := range SafeOutputRegistry {
        if configData, exists := outputMap[meta.Key]; exists {
            // Use reflection to set field
        }
    }
}
```

**Estimated Impact**:
- Reduce 1,024 lines → ~450-500 lines (55% reduction!)
- Single source of truth for safe output types
- New types require 1 metadata entry instead of 5 scattered additions
- **Effort**: 6-8 hours

---

#### Issue 1.4: Token Handling Duplication (MEDIUM PRIORITY)

**File**: pkg/workflow/github_token.go

**Problem**: Three nearly identical functions with different token precedence:

```go
// Function 1: Generic
func getEffectiveGitHubToken(customToken, toplevelToken string) string {
    if customToken != "" { return customToken }
    if toplevelToken != "" { return toplevelToken }
    return "${{ secrets.GH_AW_GITHUB_MCP_SERVER_TOKEN || secrets.GH_AW_GITHUB_TOKEN || secrets.GITHUB_TOKEN }}"
}

// Function 2: Copilot-specific (80% similar)
func getEffectiveCopilotGitHubToken(customToken, toplevelToken string) string {
    if customToken != "" { return customToken }
    if toplevelToken != "" { return toplevelToken }
    return "${{ secrets.COPILOT_GITHUB_TOKEN || secrets.GH_AW_GITHUB_TOKEN }}"
}

// Function 3: Agent-specific (70% similar)
func getEffectiveAgentGitHubToken(customToken string) string {
    if customToken != "" { return customToken }
    return "${{ secrets.GH_AW_AGENT_TOKEN }}"
}
```

**Recommendation**: Consolidate using configuration:
```go
type TokenPrecedence struct {
    CustomToken   string
    TopLevelToken string
    Fallbacks     []string
}

func getEffectiveToken(precedence TokenPrecedence) string {
    if precedence.CustomToken != "" { return precedence.CustomToken }
    if precedence.TopLevelToken != "" { return precedence.TopLevelToken }
    return strings.Join(precedence.Fallbacks, " || ")
}
```

**Estimated Impact**:
- Reduce ~30 lines of duplicate code
- Single source of truth for token precedence
- **Effort**: 2-3 hours

---

#### Issue 1.5: Validation Functions Scattered Across 18 Files (MEDIUM PRIORITY)

**Current State**: Validation logic spread across pkg/workflow:

```
pkg/workflow/
├── agent_validation.go
├── bundler_validation.go
├── compiler_filters_validation.go
├── docker_validation.go
├── engine_validation.go
├── expression_validation.go
├── github_toolset_validation_error.go
├── mcp_config_validation.go
├── npm_validation.go
├── pip_validation.go
├── repository_features_validation.go
├── runtime_validation.go
├── safe_output_validation_config.go
├── schema_validation.go
├── step_order_validation.go
├── strict_mode_validation.go
├── template_validation.go
└── validation.go
```

**Problem**: 18 validation files + 51 validate* functions make it hard to:
- Understand validation boundaries
- Maintain consistent validation patterns
- Find validation logic for a specific domain

**Recommendation**: Create `pkg/workflow/validation/` subdirectory:
```
pkg/workflow/validation/
├── agent.go              (agent_validation.go)
├── bundler.go            (bundler_validation.go)
├── docker.go             (docker_validation.go)
├── engine.go             (engine_validation.go)
├── expression.go         (expression_validation.go)
├── filters.go            (compiler_filters_validation.go)
├── mcp_config.go         (mcp_config_validation.go)
├── npm.go                (npm_validation.go)
├── pip.go                (pip_validation.go)
├── repository.go         (repository_features_validation.go)
├── runtime.go            (runtime_validation.go)
├── safe_outputs.go       (safe_output_validation_config.go)
├── schema.go             (schema_validation.go)
├── step_order.go         (step_order_validation.go)
├── strict_mode.go        (strict_mode_validation.go)
├── template.go           (template_validation.go)
└── validation.go         (core types)
```

**Estimated Impact**:
- Clear module boundaries for validation
- Natural import path: `workflow/validation`
- Easier to locate validation logic
- **Effort**: 3-4 hours (mostly file moves + import updates)

---

### 2. pkg/cli (99 files, ~30,000 lines, 580+ functions)

#### Issue 2.1: logs.go Oversized and Mixed Responsibilities (HIGH PRIORITY)

**File**: pkg/cli/logs.go (1,338 lines, 9 functions)

**Functions**: Download logs, parse logs, analyze logs, render logs, cache logs

**Problem**: Single file handles multiple distinct concerns:
- Command creation and flag parsing
- Job status fetching from GitHub API
- Log downloading and aggregation
- Log parsing and analysis (engine-specific)
- Error detection and reporting
- MCP tool usage analysis
- Firewall log analysis
- Metrics calculation
- Output formatting (JSON + console)
- Cache management

**Current Support Files**: logs_parsing.go, logs_metrics.go, logs_report.go, logs_download.go, logs_cache.go, logs_models.go

**Recommendation**: Further split logs.go:
```
pkg/cli/
├── logs_command.go       (~300 lines - command setup and orchestration)
├── logs_download.go      (exists - artifact downloading)
├── logs_parsing.go       (exists - log parsing)
├── logs_analysis.go      (~400 lines - NEW - analysis logic)
├── logs_metrics.go       (exists - metrics calculation)
├── logs_report.go        (exists - report generation)
├── logs_cache.go         (exists - caching)
└── logs_models.go        (exists - data models)
```

**Estimated Impact**:
- Complete separation of concerns
- logs_command.go becomes thin orchestrator
- **Effort**: 4-6 hours

---

#### Issue 2.2: update_command.go Handles Multiple Update Types (MEDIUM PRIORITY)

**File**: pkg/cli/update_command.go (1,331 lines, 20 functions)

**Problem**: Mixes workflow updates, action updates, and Git operations:
- `UpdateWorkflows()`, `updateWorkflow()` - Workflow updates
- `UpdateActions()`, `getLatestActionRelease()` - Action updates
- `checkExtensionUpdate()` - Extension checking
- `createUpdatePR()` - PR creation
- `normalizeWhitespace()` - Utility

**Recommendation**: Split into:
```
pkg/cli/
├── update_workflows.go   - Workflow update logic
├── update_actions.go     - Action update logic
└── update_command.go     - Command orchestration
```

**Estimated Impact**:
- Three ~450 line files vs one 1,331 line file
- **Effort**: 3-4 hours

---

### 3. pkg/parser (14 files, ~6,000 lines, 150+ functions)

#### Issue 3.1: frontmatter.go Mixed Responsibilities (HIGH PRIORITY)

**File**: pkg/parser/frontmatter.go (1,283 lines, 33 functions)

**Functions**: Import processing (6), include processing (6), extraction (12), merging (4), utilities (5)

**Recommendation**: Split by functional domain:
```
pkg/parser/
├── frontmatter.go             (~100 lines - core types)
├── frontmatter_imports.go     (~350 lines - ProcessImports* functions)
├── frontmatter_includes.go    (~250 lines - ExpandIncludes*, ProcessIncludes*)
├── frontmatter_extract.go     (~200 lines - extract*FromContent functions)
└── frontmatter_merge.go       (~150 lines - MergeTools, merging logic)
```

**Estimated Impact**:
- Five focused files (~100-350 lines each)
- Clear separation: imports vs includes vs extraction
- **Effort**: 6-8 hours

---

#### Issue 3.2: schema.go Multiple Concerns (HIGH PRIORITY)

**File**: pkg/parser/schema.go (1,156 lines, 34 functions)

**Functions**:
- Validation orchestration (8 functions)
- Suggestion generation (4 functions)
- Schema compilation/caching (2 functions)
- Generic utilities (LevenshteinDistance, removeDuplicates, min/max)

**Recommendation**: Split and extract utilities:
```
pkg/parser/
├── schema.go                 (~150 lines - public API + types)
├── schema_validate.go        (~400 lines - validation + custom rules)
├── schema_suggestions.go     (~250 lines - suggestions and navigation)
├── schema_compile.go         (~100 lines - compilation and caching)
└── schema_deprecated.go      (~100 lines - deprecated field handling)

pkg/util/
├── strings.go                (LevenshteinDistance, StripANSI)
├── slices.go                 (removeDuplicates, generic helpers)
└── math.go                   (min, max)
```

**Estimated Impact**:
- Five focused files vs one 1,156-line file
- Reusable utilities available across packages
- **Effort**: 6-8 hours

---

## Semantic Function Clustering Analysis

### Well-Organized Patterns (✓ Keep These)

#### 1. Expression Handling (pkg/workflow/)
**Excellent organization**:
- expression_parser.go - Parsing
- expression_builder.go - Building
- expression_extraction.go - Extraction
- expression_nodes.go - AST nodes
- expression_validation.go - Validation

**Use this as the pattern for other feature areas!**

#### 2. MCP Commands (pkg/cli/)
**Well-structured subcommands**:
- mcp.go - Main command
- mcp_add.go, mcp_inspect.go, mcp_list.go - Subcommands
- mcp_server.go, mcp_validation.go, mcp_config_file.go - Supporting modules

**This is exemplary organization.**

---

### Patterns Needing Consolidation

#### 1. Job Building Functions (pkg/workflow/compiler_jobs.go)
- **Issue**: 29 external `build*Job()` functions called from one orchestrator
- **Pattern**: All follow similar structure (job creation, dependency management)
- **Recommendation**: Already identified in Issue 1.1

#### 2. Prompt Generation (pkg/workflow/)
- **Issue**: Scattered across compiler_yaml.go, prompts.go, repo_memory_prompt.go, template.go
- **Pattern**: All append content to prompt with heredoc
- **Recommendation**: Unify prompt generation logic

#### 3. Safe Output Parsers (pkg/workflow/)
- **Issue**: 24+ individual parser functions in separate files
- **Pattern**: All call `parseBaseSafeOutputConfig()` then add custom fields
- **Recommendation**: Already identified in Issue 1.3

---

## Priority Refactoring Recommendations

### Priority 1: High-Impact Quick Wins (Immediate Action)

#### 1.1 Consolidate Safe Outputs Configuration Logic ⭐⭐⭐
**File**: pkg/workflow/safe_outputs_config.go  
**Impact**: Reduce 1,024 lines → ~500 lines (50% reduction)  
**Effort**: 6-8 hours  
**Benefits**: Single source of truth, easier to add new safe output types

#### 1.2 Consolidate Upload Artifact Functions ⭐⭐⭐
**File**: pkg/workflow/compiler_yaml.go  
**Impact**: Eliminate 7 wrapper functions (~40-50 lines)  
**Effort**: 1-2 hours  
**Benefits**: Easier action version updates, DRY principle

#### 1.3 Consolidate Job Creation Pattern ⭐⭐
**File**: pkg/workflow/compiler_jobs.go  
**Impact**: Reduce ~200 lines of duplicate code  
**Effort**: 3-4 hours  
**Benefits**: Easier to add new job types

---

### Priority 2: Structural Improvements (Next Sprint)

#### 2.1 Create Validation Subdirectory ⭐⭐
**Location**: pkg/workflow/validation/  
**Files**: Move 18 validation files  
**Effort**: 3-4 hours  
**Impact**: Clear module boundaries

#### 2.2 Split frontmatter.go ⭐⭐
**Current**: 1,283 lines, 33 functions  
**Target**: 5 files (~100-350 lines each)  
**Effort**: 6-8 hours  
**Impact**: Clearer separation of concerns

#### 2.3 Split schema.go ⭐⭐
**Current**: 1,156 lines, 34 functions  
**Target**: 5 files (~100-400 lines each)  
**Effort**: 6-8 hours  
**Impact**: Clearer validation vs suggestions separation

#### 2.4 Split logs.go ⭐
**Current**: 1,338 lines mixed responsibilities  
**Target**: logs_command.go as thin orchestrator  
**Effort**: 4-6 hours  
**Impact**: Complete separation of concerns

---

### Priority 3: Long-term Improvements

#### 3.1 Consolidate Token Handling
**Impact**: Reduce ~30 lines, single source of truth  
**Effort**: 2-3 hours

#### 3.2 Extract Generic Utilities to pkg/util
**Impact**: Reusable utilities across packages  
**Effort**: 2-3 hours

#### 3.3 Split compiler_yaml.go
**Current**: 1,411 lines, 29 functions  
**Target**: 4-5 focused modules  
**Effort**: 4-5 hours

---

## Summary Statistics

### By Package

| Package | Files | Lines | Files >1000 | High-Priority Issues |
|---------|-------|-------|-------------|---------------------|
| pkg/workflow | 164 | ~44,000 | 6 | 5 critical |
| pkg/cli | 99 | ~30,000 | 3 | 2 high |
| pkg/parser | 14 | ~6,000 | 2 | 2 high |
| Utilities | 22 | ~3,600 | 0 | None |

### Function Pattern Quality

| Pattern | Count | Status | Action Needed |
|---------|-------|--------|---------------|
| build* | 97 | ✓ Well-clustered | None |
| parse* | 110 | ✓ Distributed appropriately | None |
| extract* | 103 | ✓ Concentrated appropriately | None |
| generate* | 107 | ⚠️ Some duplication | Consolidate uploads |
| validate* | 52 | ⚠️ Scattered | Create subdirectory |
| render* | 49 | ⚠️ Could consolidate | Extract to modules |

### Impact Summary

| Category | Files Affected | Lines Reduced | Estimated Effort |
|----------|----------------|---------------|------------------|
| **Duplicate Code** | 5 | ~470 lines | 12-17 hours |
| **File Splitting** | 6 | Reorganize 7,500 lines | 30-38 hours |
| **Modularization** | 18 | Better boundaries | 3-4 hours |
| **Utilities** | 3 | Reusable helpers | 2-3 hours |
| **TOTAL** | 32 | ~470 lines removed, 7,500 reorganized | 47-62 hours |

---

## Implementation Checklist

### Phase 1: Quick Wins (Weeks 1-2)
- [ ] Consolidate upload artifact functions (Issue 1.2)
- [ ] Consolidate token handling (Issue 1.4)
- [ ] Consolidate job creation pattern (Issue 1.1)
- [ ] **Test and verify no regressions**

**Estimated**: ~230 lines reduced, 8-11 hours

### Phase 2: Critical Consolidation (Weeks 3-4)
- [ ] Consolidate safe_outputs_config.go logic (Issue 1.3)
- [ ] **Test and verify no regressions**

**Estimated**: ~500 lines reduced, 6-8 hours

### Phase 3: Structural (Weeks 5-6)
- [ ] Create validation subdirectory (Issue 1.5)
- [ ] Split frontmatter.go (Issue 3.1)
- [ ] Split schema.go (Issue 3.2)
- [ ] **Test and verify no regressions**

**Estimated**: Better organization, 16-20 hours

### Phase 4: CLI Package (Weeks 7-8)
- [ ] Split logs.go further (Issue 2.1)
- [ ] Split update_command.go (Issue 2.2)
- [ ] **Test and verify no regressions**

**Estimated**: Better organization, 7-10 hours

---

## Analysis Metadata

- **Analysis Date**: 2025-12-14
- **Repository**: githubnext/gh-aw
- **Total Files Analyzed**: 299 non-test Go files
- **Total Lines Analyzed**: 83,614 lines
- **Total Functions Cataloged**: 3,500+
- **Detection Methods**:
  - Automated function signature extraction
  - Semantic clustering by naming patterns
  - Detailed file-by-file analysis using specialized agents
  - Manual pattern recognition for duplicates
  - Cross-file reference analysis

</details>

---

## Conclusion

The codebase demonstrates **strong architectural foundations** with clear naming conventions and good separation of concerns at the package level. The primary opportunities for improvement are:

1. **Eliminating massive duplication** in safe_outputs_config.go (~500 lines saved)
2. **Consolidating repetitive patterns** in job creation and uploads (~240 lines saved)
3. **Splitting oversized files** to improve cognitive load (11 files >1000 lines)
4. **Organizing validation** into dedicated subdirectory (18 scattered files)
5. **Maintaining excellent patterns** from expression handling and MCP commands

**Recommended Approach**: Start with Priority 1 quick wins (duplication elimination) to build momentum and demonstrate value, then tackle Priority 2 structural improvements incrementally to avoid disrupting ongoing development. The high-impact consolidations (safe_outputs_config.go, job creation patterns) should be prioritized as they offer 50%+ code reduction while improving maintainability.

**Total Estimated Effort**: 47-62 hours spread across 8 weeks  
**Total Code Reduction**: ~470 lines eliminated + 7,500 lines better organized  
**Expected Benefits**: Improved maintainability, easier testing, reduced cognitive load, single sources of truth for core patterns




> AI generated by [Semantic Function Refactoring](https://github.com/githubnext/gh-aw/actions/runs/20205009819)

Package	Files	Lines	Functions	Largest File
pkg/workflow	164	~44,000	2,761	compiler_jobs.go (1,456 lines)
pkg/cli	99	~30,000	580+	logs.go (1,338 lines)
pkg/parser	14	~6,000	150+	frontmatter.go (1,283 lines)
Utilities	22	~3,600	-	-
Total	299	83,614	3,500+	-

Pattern	Count	Primary Use	Well-Organized?
`get*`	118	Getters, retrievers	✓ Appropriate
`parse*`	110	Parsing logic	✓ Distributed by domain
`generate*`	107	Code/config generation	⚠️ Some duplication
`extract*`	103	Data extraction	✓ Concentrated appropriately
`build*`	97	Builders, constructors	✓ Well-clustered
`validate*`	52	Validation	⚠️ SCATTERED - needs consolidation
`is*`	50	Boolean checks	✓ Appropriate
`render*`	49	Output rendering	⚠️ Could consolidate
`add*`	25	Addition operations	✓ Appropriate
`find*`	22	Search operations	✓ Appropriate
`resolve*`	17	Resolution logic	✓ Appropriate
`has*`	17	Existence checks	✓ Appropriate
`ensure*`	16	Ensure patterns	⚠️ Pattern suggests abstraction needed
`check*`	16	Validation checks	⚠️ Overlaps with validate*
`process*`	13	Processing logic	✓ Appropriate

Package	Files	Lines	Files >1000	High-Priority Issues
pkg/workflow	164	~44,000	6	5 critical
pkg/cli	99	~30,000	3	2 high
pkg/parser	14	~6,000	2	2 high
Utilities	22	~3,600	0	None

Pattern	Count	Status	Action Needed
build*	97	✓ Well-clustered	None
parse*	110	✓ Distributed appropriately	None
extract*	103	✓ Concentrated appropriately	None
generate*	107	⚠️ Some duplication	Consolidate uploads
validate*	52	⚠️ Scattered	Create subdirectory
render*	49	⚠️ Could consolidate	Extract to modules

Category	Files Affected	Lines Reduced	Estimated Effort
Duplicate Code	5	~470 lines	12-17 hours
File Splitting	6	Reorganize 7,500 lines	30-38 hours
Modularization	18	Better boundaries	3-4 hours
Utilities	3	Reusable helpers	2-3 hours
TOTAL	32	~470 lines removed, 7,500 reorganized	47-62 hours

[refactor] Semantic Function Clustering Analysis - Code Organization Opportunities #6410

Description

Overview

Repository Statistics

Package Distribution

Function Naming Patterns (Top 15)

Critical Findings by Package

1. pkg/workflow (164 files, 44,000+ lines, 2,761 functions)

Issue 1.1: compiler_jobs.go - Massive Duplication in Safe Outputs Job Creation (HIGH PRIORITY)

Issue 1.2: compiler_yaml.go - Upload Artifact Generation Duplication (HIGH PRIORITY)

Issue 1.3: safe_outputs_config.go - Massive Repetition in Configuration Processing (CRITICAL)

Issue 1.4: Token Handling Duplication (MEDIUM PRIORITY)

Issue 1.5: Validation Functions Scattered Across 18 Files (MEDIUM PRIORITY)

2. pkg/cli (99 files, ~30,000 lines, 580+ functions)

Issue 2.1: logs.go Oversized and Mixed Responsibilities (HIGH PRIORITY)

Issue 2.2: update_command.go Handles Multiple Update Types (MEDIUM PRIORITY)

3. pkg/parser (14 files, ~6,000 lines, 150+ functions)

Issue 3.1: frontmatter.go Mixed Responsibilities (HIGH PRIORITY)

Issue 3.2: schema.go Multiple Concerns (HIGH PRIORITY)

Semantic Function Clustering Analysis

Well-Organized Patterns (✓ Keep These)

1. Expression Handling (pkg/workflow/)

2. MCP Commands (pkg/cli/)

Patterns Needing Consolidation

1. Job Building Functions (pkg/workflow/compiler_jobs.go)

2. Prompt Generation (pkg/workflow/)

3. Safe Output Parsers (pkg/workflow/)

Priority Refactoring Recommendations

Priority 1: High-Impact Quick Wins (Immediate Action)

1.1 Consolidate Safe Outputs Configuration Logic ⭐⭐⭐

1.2 Consolidate Upload Artifact Functions ⭐⭐⭐

1.3 Consolidate Job Creation Pattern ⭐⭐

Priority 2: Structural Improvements (Next Sprint)

2.1 Create Validation Subdirectory ⭐⭐

2.2 Split frontmatter.go ⭐⭐

2.3 Split schema.go ⭐⭐

2.4 Split logs.go ⭐

Priority 3: Long-term Improvements

3.1 Consolidate Token Handling

3.2 Extract Generic Utilities to pkg/util

3.3 Split compiler_yaml.go

Summary Statistics

By Package

Function Pattern Quality

Impact Summary

Implementation Checklist

Phase 1: Quick Wins (Weeks 1-2)

Phase 2: Critical Consolidation (Weeks 3-4)

Phase 3: Structural (Weeks 5-6)

Phase 4: CLI Package (Weeks 7-8)

Analysis Metadata

Conclusion

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions