UN-3393 [FEAT] Add SanitizedSerializerMixin foundation for input validation#1965
UN-3393 [FEAT] Add SanitizedSerializerMixin foundation for input validation#1965chandrasekharan-zipstack wants to merge 3 commits into
Conversation
…dation
Foundation PR for the input-validation hardening plan (CWE-20). No
behaviour change yet — nothing imports the new pre-mixed classes; the
sweep that rewrites serializer imports lands in a follow-up PR.
- New `utils.serializer.sanitization` exposes `SanitizedSerializerMixin`
plus pre-mixed `ModelSerializer`, `Serializer`, and
`HyperlinkedModelSerializer`. The mixin walks `self.fields` in
`__init__` and attaches `validate_no_html_tags` to every writable
`CharField`. Opt out per-serializer via `Meta.html_safe_fields = (...)`
for fields that legitimately accept HTML-like content (e.g. prompt
text, regex literals). Read-only fields are naturally exempt — DRF
skips validators on `read_only=True`.
- `utils.serializer.__init__` re-exports the pre-mixed classes alongside
the existing `IntegrityErrorMixin`, so consumers can do
`from utils.serializer import ModelSerializer`.
- `input_sanitizer.validate_no_html_tags` now emits a structured
`logger.warning("input_validation_rejected", extra={"field", "reason"})`
in each rejection branch (`html_tag`, `js_protocol`,
`event_handler`). Searchable in logs; no metric / alert wired.
- New `cleanup_html_payloads` management command (under
`backend/commands/...`) replaces stored pentest payloads in the
columns flagged by the May 2026 prod scan: `custom_tool.tool_name`,
`custom_tool.description`, `workflow.workflow_name`,
`workflow.description`, `api_deployment.display_name`,
`api_deployment.description`, `adapter_instance.adapter_name`. Default
mode is dry-run; `--apply` performs the replacement inside a
transaction.
- 14 new mixin unit tests; existing 36 `test_input_sanitizer` tests
unchanged and still pass (50 total).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
Summary by CodeRabbit
WalkthroughThis PR introduces automatic HTML sanitization for Django REST Framework serializer fields. The input sanitizer is refactored with structured logging; a new ChangesHTML Sanitization for Serializer Fields
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
| Filename | Overview |
|---|---|
| backend/utils/input_sanitizer.py | Adds _reject helper with -> NoReturn annotation, structured logger.warning calls, and refactors three raise sites to go through the helper. Logic is unchanged; annotation is correct. |
| backend/utils/serializer/init.py | New package init that re-exports IntegrityErrorMixin and all sanitization classes. Clean, no logic changes. |
| backend/utils/serializer/sanitization.py | New SanitizedSerializerMixin correctly deep-copy-safe via DRF's lazy fields property; partial binding avoids closure trap. Known gap: ListField(child=CharField()) children are not walked. |
| backend/utils/tests/test_sanitized_serializer_mixin.py | 14 new tests cover happy path, all three rejection vectors, opt-out, missing Meta, and raise_exception path. Read-only exemption test does not supply HTML in the read-only field, leaving the claimed behaviour unverified. |
Sequence Diagram
sequenceDiagram
participant Client
participant DRFSerializer as DRF Serializer
participant Mixin as SanitizedSerializerMixin.__init__
participant Field as CharField.validators
participant Validator as validate_no_html_tags (partial)
participant Logger
Client->>DRFSerializer: "FooSerializer(data={...})"
DRFSerializer->>Mixin: "__init__(*args, **kwargs)"
Mixin->>DRFSerializer: super().__init__() initialises lazy _fields
Mixin->>Field: "iterate self.fields, append partial(validate_no_html_tags, field_name=name)"
Note over Mixin,Field: Skips read_only fields and html_safe_fields exemptions
Client->>DRFSerializer: serializer.is_valid()
DRFSerializer->>Field: run_validators(value)
Field->>Validator: validator(value)
alt value contains HTML / JS protocol / event handler
Validator->>Logger: logger.warning input_validation_rejected
Validator-->>DRFSerializer: raise ValidationError
DRFSerializer-->>Client: 400 Bad Request
else value is clean
Validator-->>DRFSerializer: return value
DRFSerializer-->>Client: 200 OK + validated_data
end
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
backend/utils/serializer/sanitization.py:35-41
**`ListField(child=CharField())` falls outside the mixin's coverage**
The validator is appended only to fields where `isinstance(field, drf.CharField)` holds. A top-level `ListField(child=CharField())` field is a `ListField` instance — the child `CharField` inside it is never walked, so HTML content in that list's items bypasses the check entirely. This is a gap in the defence-in-depth guarantee. Serializers that use list-of-string fields (e.g. `tags`, `aliases`, `labels`) won't be protected when developers adopt the new base classes in the follow-up sweep PR.
### Issue 2 of 2
backend/utils/tests/test_sanitized_serializer_mixin.py:72-74
**`test_read_only_field_is_naturally_exempt` doesn't prove exemption**
The test only passes `{"name": "ok"}` — the `computed` read-only field is absent from the input entirely. This validates that a missing read-only field doesn't break the serializer, but says nothing about the exemption claim in the test name. To prove the read-only field is truly exempt from HTML validation, the test should supply `"computed": "<script>alert(1)</script>"` in `data` and assert `is_valid()` returns `True`.
Reviews (3): Last reviewed commit: "UN-3393 [FIX] Annotate _reject as NoRetu..." | Re-trigger Greptile
Removing the one-shot data-cleanup command added in the previous commit. Decision: overkill for the actual risk. - The 19 stored pentest payloads on US prod are inert under React / Django auto-escape (the only render path today). They don't fire. - Forward-validation via the mixin (PR2) prevents any *new* payloads from landing. - If a future non-React render path is added (PDF/email/CSV/server template), that path's owner can run a one-off SQL update at that time. Carrying a dedicated management command and the regex-import coupling it introduced is more code than the situation warrants. The KB ADR 0007 (cleanup-via-management-command) is superseded accordingly; see KB updates landing alongside. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile P2 review comment on PR #1965. The `_reject` helper always raises `ValidationError`; annotating it `-> None` misleads mypy/pyright into thinking execution can continue past the call site, which can suppress dead-code / unreachable-branch warnings. `NoReturn` is the correct signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test ResultsSummary
Runner Tests - Full Report
SDK1 Tests - Full Report
|
|



What
SanitizedSerializerMixinplus pre-mixedModelSerializer/Serializer/HyperlinkedModelSerializerunderutils.serializer.sanitization, re-exported fromutils.serializer. Mixin auto-attachesvalidate_no_html_tagsto every writableCharField; opt out per-serializer viaMeta.html_safe_fields = (...)for fields that legitimately accept HTML-like content (e.g. prompt text, regex literals). Read-only fields are naturally exempt (DRF skips validators onread_only=True).logger.warning("input_validation_rejected", extra={"field", "reason"})in each rejection branch ofvalidate_no_html_tags(html_tag,js_protocol,event_handler). Searchable in logs; no metric or alert wired.cleanup_html_payloadsDjango management command (underbackend/commands/) — dry-run by default,--applyto redact stored pentest payloads incustom_tool.tool_name,custom_tool.description,workflow.workflow_name,workflow.description,api_deployment.display_name,api_deployment.description,adapter_instance.adapter_name.test_input_sanitizertests still pass (50 total).Why
<…>content. Migration risk for forward-validation is effectively zero.Obsidian Vault/zipstuff/UN-3393-input-validation/(ADRs 0001–0007).How
backend/utils/serializer/sanitization.py— new module.SanitizedSerializerMixin.__init__walksself.fields, readsMeta.html_safe_fields(default()), and appendsvalidate_no_html_tags(bound to the field name viafunctools.partial) to every writableCharField'svalidatorslist. Pre-mixedModelSerializer/Serializer/HyperlinkedModelSerializerare defined here.backend/utils/serializer/__init__.py— re-exports the new classes alongside the existingIntegrityErrorMixin. Consumer convention:from utils.serializer import ModelSerializer.backend/utils/input_sanitizer.py— adds a private_reject(field_name, reason, message)helper that logs and raises; the three rejection branches now go through it.backend/commands/management/commands/cleanup_html_payloads.py— usesdjango.apps.get_modelto resolve targets at runtime (so the command doesn't fail if a pluggable app is missing). Filters by*__isnull=Falseand non-empty, iterates with.only(...)and.iterator(), printsid+ truncated value per match, applies insidetransaction.atomic()when--applyis passed.backend/utils/tests/test_sanitized_serializer_mixin.py— covers happy path, HTML/JS-protocol/event-handler rejection,html_safe_fieldsopt-out, missingMeta, read-only exemption, pre-mixed-class inheritance,raise_exception=Truepath.Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
backend/imports fromutils.serializer.sanitizationor the new pre-mixedModelSerializer/Serializer/HyperlinkedModelSerializerre-exports yet. Existing serializers continue to importfrom rest_framework import serializersand inheritserializers.ModelSerializerdirectly — they are unchanged.validate_no_html_tagsis emitted only when a rejection already happens (no new rejection paths) — same writes that previously got 400 still get 400, plus a singleWARNINGlog line per rejection.cleanup_html_payloadsmanagement command is opt-in; it runs only when an operator explicitly invokes it. Default mode is--dry-run(read-only).Database Migrations
Env Config
Relevant Docs
Obsidian Vault/zipstuff/UN-3393-input-validation/— INDEX, threat model, target architecture, rollout plan, prod baseline, ADRs 0001–0007.Related Issues or PRs
from rest_framework import serializerstofrom utils.serializer import …acrossbackend/+backend/pluggable_apps/, deletes the redundant manualvalidate_<field>methods in the 8 serializers already covered, addsMeta.html_safe_fieldsopt-outs as needed.Dependencies Versions
Notes on Testing
cd backend && uv run pytest utils/tests/test_sanitized_serializer_mixin.py utils/tests/test_input_sanitizer.py -q→ 50 passed.
python -c \"from utils.serializer import ModelSerializer, Serializer, HyperlinkedModelSerializer, SanitizedSerializerMixin; print('ok')\".python manage.py cleanup_html_payloads— should print candidate rows without making changes.Screenshots
n/a (no UI change)
Checklist
I have read and understood the Contribution Guidelines.