Copilot AI commented Sep 25, 2025

This PR implements a new sarif-splitter plugin that addresses the need to split large SARIF files into smaller, categorized files for better organization and to overcome GitHub Advanced Security upload restrictions.

Problem Solved

Large SARIF files can exceed GitHub's upload size limits and make it difficult to organize security alerts effectively. The new splitter plugin enables teams to:

  • Split large SARIF files into manageable chunks
  • Organize alerts by application areas (tests, frontend, backend, etc.)
  • Prioritize security reviews by severity levels
  • Improve dashboard filtering and search capabilities

Key Features

Path-Based Splitting

Split alerts based on file path patterns using glob matching:

python -m sariftoolkit --enable-splitter --split-by-path --language python --sarif results.sarif

Default path categories:

  • Tests: **/test/**, **/tests/**, **/*test*
  • App: **/web/**, **/api/**, **/src/**, **/app/**
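
A minimal sketch of how the default categories above could be applied with first-match-wins ordering. It uses the standard library `fnmatch` module as a stand-in; the plugin's actual matcher is not shown in this PR, and `categorize_path` and `DEFAULT_PATH_RULES` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Default categories from the list above, checked in order (first match wins).
DEFAULT_PATH_RULES = [
    ("Tests", ["**/test/**", "**/tests/**", "**/*test*"]),
    ("App", ["**/web/**", "**/api/**", "**/src/**", "**/app/**"]),
]


def categorize_path(path, rules=DEFAULT_PATH_RULES):
    """Return the first category whose pattern matches, or None if none does."""
    # Assumption: prefix "/" so a leading "**/" also matches top-level directories.
    normalized = "/" + path.replace("\\", "/").lstrip("/")
    for name, patterns in rules:
        if any(fnmatchcase(normalized, pattern) for pattern in patterns):
            return name
    return None
```

In `fnmatch`, `*` also matches `/`, so `**` behaves like a recursive wildcard here; a stricter glob implementation may treat these patterns differently.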

Severity-Based Splitting

Split alerts by security severity levels automatically extracted from SARIF rule properties:

python -m sariftoolkit --enable-splitter --split-by-severity --language python --sarif results.sarif

Severity mapping:

  • Critical: security-severity ≥ 9.0
  • High: security-severity 7.0-8.9
  • Medium: security-severity 4.0-6.9
  • Low: security-severity < 4.0
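
The mapping above can be sketched as a small function. This is an illustration, not the plugin's code: `severity_level` is a hypothetical name, the ranges are treated as half-open (so a score such as 8.95 counts as High), and returning "Others" for missing or unparseable scores mirrors the fallback category this PR describes for unmatched severities.

```python
import math


def severity_level(score):
    """Map a security-severity score to a level name; "Others" if unusable."""
    try:
        value = float(score)
    except (TypeError, ValueError):
        return "Others"
    if math.isnan(value):
        return "Others"
    if value >= 9.0:
        return "Critical"
    if value >= 7.0:
        return "High"
    if value >= 4.0:
        return "Medium"
    return "Low"
```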

Single Splitting Method Restriction

The plugin enforces that only one splitting method can be used at a time. Users must choose either --split-by-path OR --split-by-severity, not both, to ensure focused and predictable splitting behavior.
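
One way to express this restriction is an `argparse` mutually exclusive group, sketched below. Only the two flag names come from this PR; how the plugin actually validates its arguments is not shown here, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Build a parser that accepts exactly one splitting method."""
    parser = argparse.ArgumentParser(prog="splitter-sketch")
    # argparse rejects the command line if both flags in the group are given,
    # and required=True also rejects a command line with neither flag.
    method = parser.add_mutually_exclusive_group(required=True)
    method.add_argument("--split-by-path", action="store_true")
    method.add_argument("--split-by-severity", action="store_true")
    return parser
```

With this setup, passing both flags makes `parse_args` print a usage error and exit with status 2.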

GitHub Advanced Security Integration

Each split SARIF file includes proper runAutomationDetails.id categories following GitHub's conventions:

  • Path-based: /language:python/category:Tests, /language:python/filter:none
  • Severity-based: /language:python/severity:Critical, /language:python/severity:High, /language:python/severity:Medium, /language:python/severity:Low
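
As a rough illustration of the category format above, the sketch below tags a copy of a SARIF run (as a plain dict) with an `automationDetails.id`. The helper name `with_category` is hypothetical, and the plugin itself uses model classes rather than raw dicts.

```python
import copy


def with_category(run, language, key, value):
    """Return a copy of a SARIF run dict tagged with a category id."""
    tagged = copy.deepcopy(run)  # leave the original run untouched
    tagged["automationDetails"] = {"id": f"/language:{language}/{key}:{value}"}
    return tagged
```

For example, `with_category(run, "python", "severity", "High")` yields a run whose id is `/language:python/severity:High`.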

Summary Output Table

The plugin provides a comprehensive summary table showing before/after views:

  • Original SARIF file names and alert counts
  • Generated split files with their categories and alert counts
  • Totals verification to ensure no alerts are lost during splitting
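
A sketch of how such a table and totals check might be produced. The column layout and the names `format_summary` and `totals_match` are assumptions; the plugin's real output format is not reproduced in this PR description.

```python
def format_summary(rows):
    """Render (file_name, alert_count, category) rows as an aligned text table."""
    header = ("SARIF file", "Alerts", "Category")
    table = [header] + [(name, str(count), category) for name, count, category in rows]
    widths = [max(len(row[col]) for row in table) for col in range(3)]
    lines = ["  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip() for row in table]
    lines.insert(1, "  ".join("-" * w for w in widths))
    return "\n".join(lines)


def totals_match(original_count, split_rows):
    """True when the split files together hold exactly the original alert count."""
    return original_count == sum(count for _, count, _ in split_rows)
```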

Configurable Rules

Custom splitting rules via JSON configuration files:

{
  "path_rules": [
    {
      "name": "Frontend", 
      "patterns": ["**/web/**", "**/*.js", "**/*.jsx"]
    },
    {
      "name": "Backend",
      "patterns": ["**/api/**", "**/*.py", "**/*.java"] 
    }
  ]
}
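
A minimal sketch of reading a configuration like the one above, assuming the `path_rules`, `name`, and `patterns` keys shown there. `load_path_rules` is a hypothetical helper, and the validation it performs is illustrative rather than the plugin's own.

```python
import json


def load_path_rules(text):
    """Parse a JSON config and return [(name, [patterns...]), ...] in file order."""
    config = json.loads(text)
    rules = []
    for entry in config.get("path_rules", []):
        name = entry.get("name")
        patterns = entry.get("patterns")
        if not isinstance(name, str) or not name:
            raise ValueError("each rule needs a non-empty string 'name'")
        if not isinstance(patterns, list) or not all(isinstance(p, str) for p in patterns):
            raise ValueError(f"rule {name!r}: 'patterns' must be a list of strings")
        rules.append((name, patterns))
    return rules
```

Keeping the rules in file order matters if the first matching rule wins, since more specific rules can then be listed first.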

Technical Implementation

SARIF Model Enhancement

  • Added AutomationDetailsModel to support GitHub Advanced Security categories
  • Enhanced RunsModel to include automationDetails field

Robust Property Access

The plugin handles various SARIF property formats for security-severity extraction:

# Handles multiple property name variations
security_severity = (props.get('security-severity') or 
                     props.get('security_severity') or
                     props.get('securitySeverity'))
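
One edge case with an `or` chain is that a numeric score of 0 is falsy and would be skipped. The sketch below is a possible variant, not the plugin's code: it checks for `None` explicitly and converts the value to a float. `extract_security_severity` is a hypothetical name.

```python
def extract_security_severity(props):
    """Return the security-severity score as a float, or None if absent or invalid."""
    for key in ("security-severity", "security_severity", "securitySeverity"):
        raw = props.get(key)
        if raw is None:
            continue
        try:
            return float(raw)
        except (TypeError, ValueError):
            # Unparseable value under this key; try the next spelling.
            continue
    return None
```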

No Alert Loss Guarantee

All alerts are preserved through fallback categories:

  • Unmatched file paths → /language:<lang>/filter:none
  • Unmatched severities → /language:<lang>/severity:Others
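
The fallback idea can be sketched as a bucketing step that never discards a result. This is an illustration under stated assumptions: `split_results` is a hypothetical function, `classify` stands for whichever path or severity rule is in use, and results are plain dicts.

```python
from collections import defaultdict


def split_results(results, classify, fallback):
    """Group results by classify(result); unmatched results go to the fallback bucket."""
    buckets = defaultdict(list)
    for result in results:
        buckets[classify(result) or fallback].append(result)
    # Every input result must land in exactly one bucket.
    if sum(len(items) for items in buckets.values()) != len(results):
        raise RuntimeError("alert count changed during splitting")
    return dict(buckets)
```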

Usage Examples

Basic splitting (single method only):

# Split by severity levels only
python -m sariftoolkit --enable-splitter \
  --split-by-severity \
  --language javascript --sarif scan-results.sarif \
  --output ./categorized-results

Custom configuration:

python -m sariftoolkit --enable-splitter \
  --split-by-path --language java \
  --path-config custom-paths.json \
  --sarif large-scan.sarif

Bug Fixes

This PR also fixes an existing dataclass configuration bug that was preventing the toolkit from running:

from dataclasses import field

# Before (caused ValueError)
relativepaths: PluginConfig = PluginConfig("RelativePaths", "...")

# After (uses field with default_factory)
relativepaths: PluginConfig = field(default_factory=lambda: PluginConfig("RelativePaths", "..."))
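
For context, on Python 3.11 and later the `dataclass` decorator rejects any unhashable object as a field default, which includes instances of ordinary dataclasses. The self-contained sketch below shows the `default_factory` pattern; the `PluginConfig` fields and the `Plugins` container are hypothetical stand-ins, since the toolkit's real definitions are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class PluginConfig:
    # Hypothetical fields; the real PluginConfig in the toolkit may differ.
    name: str
    path: str
    enabled: bool = False


@dataclass
class Plugins:
    # default_factory builds a fresh PluginConfig for every Plugins instance.
    relativepaths: PluginConfig = field(
        default_factory=lambda: PluginConfig("RelativePaths", "hypothetical.module.path")
    )


first, second = Plugins(), Plugins()
first.relativepaths.enabled = True  # does not affect `second`
```

Because each instance gets its own `PluginConfig`, enabling a plugin on one configuration object cannot leak into another.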

Testing

The implementation was tested with:

  • Unit Tests: 6 test methods, each cleaning up the files it writes and not run in parallel
  • Real SARIF Files: Tested with input-example.sarif and comprehensive-test.sarif
  • Alert Distribution Validation: Verified correct categorization across Critical (3), High (4), Medium (3), Low (2) severity levels
  • Category Validation: Confirmed proper runAutomationDetails.id formatting for GitHub Advanced Security
  • Summary Output: Validated before/after table shows correct file names, alert counts, and categories
  • Single Method Restriction: Verified plugin rejects multiple splitting methods with clear error messages
  • Security Analysis: Passed CodeQL analysis and dependency vulnerability checks

All generated SARIF files maintain complete metadata while properly categorizing alerts for improved dashboard organization with zero alert loss.
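
A sketch of what a file-writing test with cleanup might look like. The PR's actual tests are not reproduced here: `write_split_files` is a hypothetical stand-in for the plugin, and the `split-<level>.sarif` file names are invented for illustration.

```python
import json
import tempfile
import unittest
from pathlib import Path


def write_split_files(buckets, out_dir, language):
    """Hypothetical stand-in for the plugin: write one SARIF file per bucket."""
    written = []
    for level, results in buckets.items():
        run = {
            "automationDetails": {"id": f"/language:{language}/severity:{level}"},
            "results": results,
        }
        path = Path(out_dir) / f"split-{level.lower()}.sarif"
        path.write_text(json.dumps({"version": "2.1.0", "runs": [run]}))
        written.append(path)
    return written


class SeveritySplitTest(unittest.TestCase):
    def setUp(self):
        # Each test gets its own directory, removed again after the test.
        self.tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self.tmp.cleanup)

    def test_counts_and_categories(self):
        buckets = {"Critical": [{"ruleId": "r1"}] * 3, "High": [{"ruleId": "r2"}] * 4}
        for path in write_split_files(buckets, self.tmp.name, "python"):
            run = json.loads(path.read_text())["runs"][0]
            category = run["automationDetails"]["id"]
            level = category.rsplit(":", 1)[1]
            self.assertEqual(category, f"/language:python/severity:{level}")
            self.assertEqual(len(run["results"]), len(buckets[level]))
```

Registering the cleanup with `addCleanup` removes the temporary directory even when an assertion fails partway through the test.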

Original prompt

This section details the original issue you should resolve

<issue_title>Create a sarif tool to split a sarif file in to many SARIF files but with different categories</issue_title>
<issue_description>### Description
Create a SARIF toolkit plugin to split a SARIF file into many SARIF files with different categories. Very large SARIF files might not be uploadable due to size restrictions. Splitting alerts into different categories can also help narrow down search queries in dashboards.

Proposed Solution

Support different paradigms for splitting, and work through them in order to ensure specificity

  • based on a file path glob - split into a different category naming of the description of the path
    • Category: /language:/Path:
  • ex yaml:
  - Name: Tests
    - Paths: 
      - **/Tests/**

  - Name: App
    - Paths: 
      - **/web/**
      - **/api/**
  • based on alert security-severity
    • Category: /language:/Severity:<critical/high/medium/low/error/warning/note>
  - Severity:
    - Critical
  - Severity:
    - High
    - Medium
  - Severity:
    - *     //AKA all remaining

NOTE

SARIF format and Categories are for GitHub Advanced Security as described here: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning#runautomationdetails-object

The processor should start with the same SARIF run info but focus on "cutting" the alerts out of the original and moving them over to the proper new file so that none are missed.

If there are alerts that don't fit, they should not be dropped! We will potentially need to keep the original SARIF around and just give it a category of /language:<codeql language>/filter:none

</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #19

Copilot AI and others added 3 commits September 25, 2025 21:47
Co-authored-by: felickz <1760475+felickz@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Create a sarif tool to split a sarif file in to many SARIF files but with different categories" to "Create sarif-splitter plugin to split SARIF files by categories" on Sep 25, 2025
Copilot AI requested a review from felickz September 25, 2025 22:01
Copilot finished work on behalf of felickz September 25, 2025 22:01

felickz commented Sep 25, 2025

@copilot

Do not allow multiple different splitting methods in one pass; it's one technique only at a time.

When you do your testing, let's focus on the severity test with https://github.com/advanced-security/sarif-toolkit/blob/main/examples/input-example.sarif - this has criticals, highs, and mediums, and a SARIF file should be generated for each. Further, inspect those files to ensure the expected number of alerts is present in each. Why not create some unit tests for this? Since they will be writing files, ensure each test cleans up after itself and does not run in parallel. Ensure the category is as expected in each file, as this is critical.

Add a summary output to the tool that shows a before/after view: a small table with the SARIF file name, number of alerts, and category name for each file.

…d unit tests, and summary output

Co-authored-by: felickz <1760475+felickz@users.noreply.github.com>

Copilot AI commented Sep 25, 2025

@copilot

Do not allow multiple different splitting methods in one pass; it's one technique only at a time.
...

Implemented all requested feedback in commit 128847b:

✅ Single splitting method restriction: Plugin now rejects multiple methods with clear error message
✅ Comprehensive unit tests: Added 6 test methods with proper cleanup, not running in parallel
✅ Summary output table: Added before/after view showing file names, alert counts, and categories

Testing with input-example.sarif and comprehensive test:

  • input-example.sarif: 2 High alerts → 1 split file correctly created
  • comprehensive-test.sarif: 12 alerts → 4 split files created:
    • Critical: 3 alerts (security-severity ≥9.0)
    • High: 4 alerts (security-severity 7.0-8.9)
    • Medium: 3 alerts (security-severity 4.0-6.9)
    • Low: 2 alerts (security-severity <4.0)

Category validation: All files have correct /language:python/severity:X automation details for GitHub Advanced Security integration.

Perfect alert preservation: Total input alerts = Total output alerts (no loss).

Copilot finished work on behalf of felickz September 25, 2025 22:19