ENG-2787: Add PBAC evaluation service to fides OSS#7741
Conversation
Greptile Summary: This PR introduces the open-source PBAC (Purpose-Based Access Control) evaluation service to fides. The architecture is well-structured — pure types in ...
Confidence Score: 3/5
Important Files Changed
Reviews (1): Last reviewed commit: "ENG-2787: Add PBAC evaluation service to..."
src/fides/service/pbac/sql_parser.py
Outdated
```python
def detect_statement_type(query_text: str) -> str:
    """Detect the SQL statement type from the query text."""
    import re
```
import re is placed inside detect_statement_type rather than at the top of the module. This violates the project's import placement rule.
```python
import re
from __future__ import annotations
from datetime import datetime, timezone
from uuid import uuid4
import sqlglot
import sqlglot.expressions as exp
```
Actually, since from __future__ import annotations must be the first statement, the import re should be placed after the existing imports at the top of the module (after line 13).
Rule Used: Python imports should always be placed at the top ... (source)
Learnt From
ethyca/fides#6536
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Done — moved import re to module-level imports.
```python
    purpose_repo: DataPurposeRedisRepository,
) -> RedisIdentityResolver:
    """Factory to create a RedisIdentityResolver with proper dependencies."""
    from fides.api.util.cache import get_cache
```
from fides.api.util.cache import get_cache is placed inside build_identity_resolver rather than at the top of the module. This import should be moved to the top-level imports section.
```python
from fides.api.util.cache import FidesopsRedis, get_cache
```
Then update the function signature to remove the local import.
Rule Used: Move imports to the top of the module rather than ... (source)
Learnt From
ethyca/fidesplus#2441
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Done — moved from fides.api.util.cache import get_cache to module-level imports.
src/fides/service/pbac/service.py
Outdated
```python
dataset_purposes=tuple(sorted(v.dataset_purposes)),
data_use=data_use,
reason=v.reason,
control="purpose_restriction",
```
Hardcoded magic string for control type
The string "purpose_restriction" is a magic string used as a control type identifier without a named constant. Per the project convention, define this as a named constant with an explanatory comment.
```python
# Control identifier for purpose-based access restriction violations
CONTROL_PURPOSE_RESTRICTION = "purpose_restriction"
```
Then reference it via control=CONTROL_PURPOSE_RESTRICTION at the call site.
Rule Used: Define magic strings as named variables with expla... (source)
Learnt From
ethyca/fidesplus#2515
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Done — extracted "purpose_restriction" to a PBAC_CONTROL_TYPE module-level constant.
src/fides/service/pbac/service.py
Outdated
```python
dk: DatasetPurposes(
    dataset_key=dk,
    purpose_keys=frozenset(dataset_purpose_overrides.get(dk, [])),
)
for dk in dataset_keys
```
Single-character and abbreviated variable names
Variables v (line 265) and dk (lines 127, 130, 131, 190, 191) are 1–2 character names that should use full descriptive names per the project's naming convention. Consider renaming dk to dataset_key and v to violation throughout these methods.
Rule Used: Use full names for variables, not 1 to 2 character... (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Done — renamed dk to dataset_key and v/pk to violation/purpose_key throughout.
```python
def _build_dataset_purposes(
    self,
    dataset_keys: list[str],
) -> dict[str, DatasetPurposes]:
    result: dict[str, DatasetPurposes] = {}
    for dk in dataset_keys:
        result[dk] = DatasetPurposes(
            dataset_key=dk,
            purpose_keys=frozenset(),
            collection_purposes={},
        )
    return result
```
Incomplete dataset purpose lookup causes silent non-enforcement
This method populates every dataset with an empty purpose_keys frozenset. Per Rule 3 in the evaluation engine ("if a dataset has no declared purposes, access is compliant"), this means all consumers with at least one declared purpose will always receive a compliant result, regardless of the actual dataset's allowed purposes.
Concretely: a consumer with purpose "marketing" accessing a dataset annotated for "billing" will not produce a violation, because the lookup returns empty rather than frozenset({"billing"}).
Violations are only generated today for consumers with zero declared purposes (Rule 1). Registered consumers with purposes bypass all checks unless dataset_purpose_overrides is explicitly provided on every call.
The intended implementation should fetch declared purposes from the Fides dataset catalog (fideslang annotations on datasets, collections, and fields). Please add a clear docstring or inline comment documenting this limitation so callers are not misled into thinking enforcement is active.
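The three evaluation rules referenced above can be sketched as a pure function. The names and signatures here (`evaluate_access`, `Violation`) are assumptions loosely based on the PR description, not the actual fides code; the point is the rule ordering and why an always-empty dataset-purpose lookup neutralizes Rule 2:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Violation:
    dataset_key: str
    reason: str

def evaluate_access(
    consumer_purposes: frozenset[str],
    dataset_purposes: dict[str, frozenset[str]],
) -> list[Violation]:
    """Sketch of the rule ordering described in the review.

    Rule 1: a consumer with no declared purposes always violates.
    Rule 3: a dataset with no declared purposes is compliant.
    Rule 2: otherwise, consumer and dataset purposes must overlap.
    """
    violations: list[Violation] = []
    for dataset_key, purposes in dataset_purposes.items():
        if not consumer_purposes:
            # Rule 1: an unregistered consumer never silently passes,
            # even against an unrestricted dataset.
            violations.append(Violation(dataset_key, "Consumer has no declared purposes"))
        elif not purposes:
            # Rule 3: unrestricted dataset -> compliant. If the lookup
            # ALWAYS returns empty, this branch fires for every dataset.
            continue
        elif consumer_purposes.isdisjoint(purposes):
            # Rule 2: the purpose-overlap check — dead code if Rule 3
            # always short-circuits first.
            violations.append(Violation(dataset_key, "No overlapping purposes"))
    return violations
```

With an empty-set lookup, `evaluate_access(frozenset({"marketing"}), {"ds": frozenset()})` returns no violations, illustrating the silent non-enforcement the comment describes.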
Done — _build_dataset_purposes now uses a pre-built dataset_purposes map passed at init time, so purpose-overlap enforcement works in the OSS path. Added both unit and integration tests.
Code Review: PBAC Evaluation Service
Good foundational work — the layered design (pure engine in evaluate.py, service boundary as a Protocol, no-op policy evaluator as a default) is clean and the module structure is well-thought-out. A few issues need attention before this lands.
Critical (Must Fix)
1. sqlglot not declared as a dependency (sql_parser.py:12-13)
sqlglot is imported at module level but is absent from pyproject.toml. This will cause an ImportError in any standard install. Needs to be added to the dependency manifest.
2. _build_dataset_purposes always returns empty purpose sets (service.py:184-195)
This is the most significant functional gap. Because every dataset resolves to purpose_keys=frozenset(), Rule 3 in the engine ("dataset with no purposes → compliant") fires for every dataset access by any consumer that has purposes. In practice, PBAC violations are only generated for Rule 1 (consumer with zero purposes). The core purpose-overlap check — the whole point of this service — is a no-op in the default path.
If fidesplus is expected to always pass dataset_purpose_overrides, that contract needs to be documented clearly and the parameter should arguably be required (or the method overridable). As-is, a caller that omits the override silently gets useless results.
Suggestions
3. Silent failure on SQL parse errors (sql_parser.py:48-49)
except Exception: return refs means parse failures are invisible. Downstream, the query looks compliant (no tables referenced). Add at minimum a logger.warning so operators know when queries couldn't be parsed.
4. import re inside function body (sql_parser.py:70)
re is stdlib and should be imported at module level with the other imports.
5. Read-then-write race condition in save() (redis_repository.py:74-85)
The existing entity is fetched outside the pipeline; index cleanup and new writes go inside it. A concurrent writer between the get and pipe.execute() can leave secondary indexes stale. This may be acceptable given expected write concurrency, but should be documented if so.
6. list_page loads all PKs into memory (redis_repository.py:173-174)
smembers materializes the full PK set before slicing. For large collections this will be slow. Worth noting as a known limitation.
7. _validate_purpose_keys raises ValueError (consumers/repository.py:207-210)
All other error paths raise from fides.service.pbac.exceptions. A ValueError breaks that convention — route handlers and callers that catch domain exceptions will miss this. Use ResourceNotFoundError("DataPurpose", fk) instead.
8. Members-based resolution undocumented gap (identity/resolver.py:21-67)
The README and BasicIdentityResolver both describe a three-step resolution chain including members-list lookup. RedisIdentityResolver only implements two steps, and the consumer repository doesn't maintain a members index. The README's "Resolution chain" section should reflect what's actually implemented in the OSS path.
Nice to Have
9. Rule 1 / Rule 3 asymmetry (evaluate.py:91-104)
A consumer with no purposes accessing a dataset with no purposes = violation. A consumer with purposes accessing a dataset with no purposes = compliant. The asymmetry is likely intentional but subtle enough to deserve a comment.
10. Test filtering logic duplicated from production (tests/service/pbac/policies/test_policy_filtering.py:115-136)
_filter_violations replicates the service's private method body. If the service implementation changes, these tests won't catch the regression. Consider testing through the actual service class.
11. Missing test coverage for core modules
evaluate.py, sql_parser.py, redis_repository.py, BasicIdentityResolver, and the end-to-end InProcessPBACEvaluationService.evaluate() path have no tests. The policy interface tests are thorough, but the engine logic and identity resolution are untested.
```python
from uuid import uuid4

import sqlglot
import sqlglot.expressions as exp
```
Critical: sqlglot is not declared as a dependency.
sqlglot is imported at the top of this file but does not appear in pyproject.toml. Any environment that installs fides without it will get an ImportError at module load time — not a lazy failure.
Add sqlglot to the appropriate dependency group in pyproject.toml before merging.
Done — added sqlglot~=30.0.3 to pyproject.toml dependencies and removed the type: ignore[import-not-found] comments from the imports.
```python
try:
    parsed = sqlglot.parse(query_text)
except Exception:
    return refs
```
Suggestion: silent failure on parse errors loses observability.
A bare except Exception: return refs means malformed SQL silently produces an empty table-reference list. Downstream, this will look like a compliant query (no datasets accessed) rather than an evaluation error, making bugs very hard to diagnose.
At minimum, add a logger.debug or logger.warning here so operators can see when queries fail to parse.
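A minimal sketch of that fix. The `parse` parameter is a stand-in for `sqlglot.parse` so the example stays self-contained (sqlglot may not be installed); the function name and exact signature are assumptions, not the PR's real code:

```python
import logging

logger = logging.getLogger(__name__)

def extract_table_refs(query_text: str, parse) -> list[str]:
    """Collect table references, logging parse failures instead of swallowing them.

    `parse` stands in for sqlglot.parse: assume it returns an iterable of
    table names or raises on malformed SQL.
    """
    refs: list[str] = []
    try:
        parsed = parse(query_text)
    except Exception:
        # Visible to operators: a failed parse no longer masquerades as a
        # compliant query that references no tables.
        logger.warning(
            "Failed to parse SQL for PBAC evaluation: %r", query_text, exc_info=True
        )
        return refs
    refs.extend(parsed)
    return refs
```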
Done — added logger.warning with exc_info=True so parse failures are visible in logs instead of silently returning an empty list.
src/fides/service/pbac/sql_parser.py
Outdated
```python
def detect_statement_type(query_text: str) -> str:
    """Detect the SQL statement type from the query text."""
    import re
```
Suggestion: move import re from function scope to module level.
re is a stdlib module and should be imported at the top of the file alongside other imports. Importing inside a function is unconventional and mildly confusing — it implies a lazy-import rationale that doesn't apply here.
Done — moved import re to module-level imports.
```python
        purpose_keys=frozenset(),
        collection_purposes={},
    )
return result
```
Critical: _build_dataset_purposes always returns empty purpose sets — PBAC violations will never be produced for consumers that have purposes.
By Rule 3 in evaluate.py: "If a dataset has NO declared purposes, access is considered compliant." This method always returns purpose_keys=frozenset() for every dataset, so unless dataset_purpose_overrides is passed by the caller, every consumer-with-purposes will always get a clean result regardless of what the dataset is.
The only case that still produces violations is Rule 1 (consumer has no purposes at all). The core PBAC purpose-overlap check (Rule 2) is effectively dead code in the default evaluation path.
If fidesplus is expected to supply purpose data via dataset_purpose_overrides, that assumption should be documented prominently here — and ideally the method should be overridable or accept a dataset-purpose lookup callable so the OSS service itself is useful.
Done — _build_dataset_purposes now uses a pre-built dataset_purposes map passed at init time. The caller is responsible for building the map from the dataset catalog (including collection-level and field/sub-field-level purposes). Keys not in the map fall back to empty purposes. Added unit tests for evaluate_access() covering all three rules plus collection inheritance, and integration tests for InProcessPBACEvaluationService.evaluate() covering registered/unregistered consumers, the init-time map, overrides precedence, and backward compat.
```python
pipe.sadd(self._all_key(), pk)
for field, value in self._get_index_entries(entity):
    pipe.sadd(self._index_key(field, value), pk)
pipe.execute()
```
Suggestion: read-then-write race condition in index maintenance.
The pattern here reads the existing entity with self._cache.get(key) (line 74) outside the pipeline, then queues the stale-index cleanup and new-index writes inside the pipeline. Between the get and pipe.execute(), another writer can mutate or delete the entity, leaving indexes in an inconsistent state.
Redis does support optimistic locking for this pattern: WATCH the key, perform the read under the watch, then queue the index cleanup and writes in a MULTI/EXEC block, retrying on WatchError. That would make this safe. Alternatively, document the known race as acceptable given the expected write throughput for this use case.
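A sketch of the WATCH/MULTI/EXEC retry pattern, using plain dicts with version counters as a stand-in for Redis so it runs anywhere. With a real redis-py client, the version check maps to `pipe.watch(key)` before the read, `pipe.multi()` before queueing writes, and `pipe.execute()` raising `redis.WatchError` on conflict:

```python
class WatchError(Exception):
    """Stand-in for redis.WatchError: a watched key changed before EXEC."""

def save_with_retry(
    store: dict,
    versions: dict,
    key: str,
    update,
    max_retries: int = 3,
):
    """Optimistic-locking save: re-read and retry if the key changed underneath us."""
    for _ in range(max_retries):
        watched = versions.get(key, 0)      # WATCH key
        current = store.get(key)            # read happens under WATCH
        new_value = update(current)         # compute index cleanup + new writes
        if versions.get(key, 0) != watched:
            continue                        # EXEC would raise WatchError -> retry
        store[key] = new_value              # MULTI-queued writes applied atomically
        versions[key] = watched + 1
        return new_value
    raise WatchError(key)
```

The race window the comment describes (entity mutated between the read and the pipeline execute) is exactly what the version check closes.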
Acknowledged — this is a known limitation at current write concurrency levels. The race window is narrow (single Redis round-trip) and the consequence is stale secondary indexes, not data loss. Will document and revisit if concurrent writes become a real pattern.
| """ | ||
| try: | ||
| pks = self._resolve_pks(filters) | ||
| sorted_pks = sorted(pks) if pks else [] |
Suggestion: list_page loads all PKs into memory to paginate.
_resolve_pks calls smembers on the :_all key (or performs a sinter), loading every PK in the set into a Python list before slicing. For large consumer/purpose sets this will be slow and memory-intensive. Redis SSCAN with cursor-based iteration would avoid materializing the full set, though it makes consistent ordering harder. Worth noting as a known limitation at this stage.
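The memory trade-off (and the ordering caveat) can be illustrated with plain Python; `islice` over a lazy iterator mirrors SSCAN-style cursor consumption, while the materialized version mirrors the current SMEMBERS approach. This only sketches the trade-off — a real SSCAN cursor additionally batches round-trips and may repeat elements:

```python
from itertools import islice
from typing import Iterable, Iterator

def page_materialized(pks: Iterable[str], offset: int, limit: int) -> list[str]:
    # Current approach: SMEMBERS-style — load every PK, sort, then slice.
    # Deterministic ordering, but O(n) memory for the whole set.
    sorted_pks = sorted(pks)
    return sorted_pks[offset : offset + limit]

def page_streamed(pks: Iterator[str], offset: int, limit: int) -> list[str]:
    # SSCAN-style: consume only offset+limit items from the cursor.
    # Bounded memory, but ordering is whatever the cursor yields —
    # which is why consistent pagination gets harder.
    return list(islice(pks, offset, offset + limit))
```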
Acknowledged as a known limitation. At current expected collection sizes this is acceptable. Will revisit with cursor-based pagination if collections grow large enough to cause memory pressure.
| """Check that all purpose fides_keys exist in the purpose repo.""" | ||
| for fk in fides_keys: | ||
| if not self._purpose_repo.exists(fk): | ||
| raise ValueError(f"DataPurpose '{fk}' not found") |
Suggestion: _validate_purpose_keys raises ValueError instead of a domain exception.
Every other error path in this repository raises ResourceNotFoundError or ResourceAlreadyExistsError from fides.service.pbac.exceptions. A ValueError here breaks that pattern — callers (and route handlers) that catch domain exceptions will miss this. Consider raising ResourceNotFoundError("DataPurpose", fk) for consistency.
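A sketch of the suggested change. The exception shape is an assumption modeled on the review's `ResourceNotFoundError("DataPurpose", fk)` call, and `exists` stands in for `self._purpose_repo.exists` to keep the example self-contained:

```python
from typing import Callable, Iterable

class ResourceNotFoundError(Exception):
    """Domain exception matching the pattern the review describes."""
    def __init__(self, resource_type: str, identifier: str):
        self.resource_type = resource_type
        self.identifier = identifier
        super().__init__(f"{resource_type} '{identifier}' not found")

def validate_purpose_keys(
    fides_keys: Iterable[str],
    exists: Callable[[str], bool],
) -> None:
    """Check that all purpose fides_keys exist in the purpose repo."""
    for fides_key in fides_keys:
        if not exists(fides_key):
            # Domain exception instead of ValueError, so route handlers
            # that catch PBAC exceptions see this failure too.
            raise ResourceNotFoundError("DataPurpose", fides_key)
```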
Done — replaced ValueError with ResourceNotFoundError("DataPurpose", fk) for consistency with other error paths.
| """Resolve platform table references to Fides dataset fides_keys. | ||
| Uses a configurable mapping strategy: | ||
| - Direct match: ``{dataset}.{table}`` → Fides dataset ``fides_key`` |
Suggestion: members-based resolution is documented but not wired up.
The README and BasicIdentityResolver both document a three-step resolution chain: contact_email → external_id → members list. RedisIdentityResolver only implements the first two. The members index is not maintained in DataConsumerRedisRepository._get_index_entries, so there is no way to find a consumer by member email via Redis.
Either remove the members resolution step from the README's "Resolution chain" section for the OSS path, or add a members index to the consumer repository and implement the lookup here.
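The two-step chain that RedisIdentityResolver actually implements can be sketched like this. The dict-based lookups stand in for the Redis secondary indexes; all names here are illustrative, not the real fides API:

```python
from typing import Optional

def resolve_consumer(
    principal: str,
    by_contact_email: dict[str, str],
    by_external_id: dict[str, str],
) -> Optional[str]:
    """Resolve a principal to a consumer key: contact_email first, then external_id.

    The members-list step documented in the README would be a third lookup
    here, backed by a members index the consumer repository does not
    currently maintain.
    """
    consumer = by_contact_email.get(principal)
    if consumer is not None:
        return consumer
    return by_external_id.get(principal)
```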
Done — updated README to note that members-based resolution is not yet implemented in the OSS path. The resolution chain section now reflects what RedisIdentityResolver actually supports (contact_email and external_id only).
```python
consumer_purposes=consumer.purpose_keys,
dataset_purposes=effective,
reason="Consumer has no declared purposes",
)
```
Nice to have: Rule 1 asymmetry with Rule 3 — consider documenting intent.
Rule 1 fires a violation when a consumer has no purposes, even when the dataset also has no declared purposes (effective == frozenset()). Rule 3, by contrast, says that a consumer with purposes accessing a dataset with no purposes is compliant.
This asymmetry means "consumer with no purposes + dataset with no purposes" = violation, while "consumer with purposes + dataset with no purposes" = compliant. That may be intentional (a consumer must declare something to access any dataset), but it's subtle enough to warrant a comment in the code or the docstring explaining why.
Done — added a comment explaining the intentional asymmetry: an unregistered consumer (no purposes) should never silently pass evaluation, even against an unrestricted dataset.
```python
result = _make_validation_result(violations=violations)
evaluator = AllowSpecificDatasetEvaluator(allowed_datasets={"ds_allowed"})
filtered = _filter_violations(result, evaluator)
assert len(filtered.violations) == 1
```
Suggestion: the filtering logic under test is duplicated from production code.
_filter_violations replicates the body of InProcessPBACEvaluationService._filter_violations_through_policies verbatim. If the production implementation changes, these tests will silently continue to pass against the copy rather than the real code.
Consider testing the method on the actual service instance (with a test double for the cache), or at minimum extract the filtering function to a standalone helper that both the service and the tests import.
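The "extract to a standalone helper" option might look like the sketch below, with simplified stand-in types. The evaluator interface (`allows`) is an assumption; the real `AccessPolicyEvaluator` Protocol may expose a different method:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Violation:
    dataset_key: str

@dataclass(frozen=True)
class ValidationResult:
    violations: tuple

def filter_violations(result: ValidationResult, evaluator) -> ValidationResult:
    """Standalone helper both the service and the tests could import,
    so tests exercise the same code path as production instead of a copy."""
    kept = tuple(v for v in result.violations if not evaluator.allows(v))
    return replace(result, violations=kept)
```

The service's private method would then delegate to this function, and the test module would import it directly.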
Acknowledged — deferring this refactor to a follow-up. The duplicated logic is a test isolation choice; will revisit when the service interface stabilizes.
kruulik
left a comment
I don't have anything to add beyond what Claude has already commented - approving in advance to unblock.
Purpose-Based Access Control (PBAC) evaluation engine and service boundary for open-source use. Evaluates whether data consumers have the declared purposes required to access datasets.

Components:
- types.py: Standardized input/output types (RawQueryLogEntry, EvaluationResult)
- evaluate.py: Zero-dependency PBAC engine (purpose overlap check)
- service.py: PBACEvaluationService Protocol + InProcessPBACEvaluationService
- sql_parser.py: Generic SQL -> RawQueryLogEntry via sqlglot
- identity/: IdentityResolver Protocol + BasicIdentityResolver + RedisIdentityResolver
- policies/: AccessPolicyEvaluator Protocol + NoOpPolicyEvaluator + implementation guide
- consumers/, purposes/: Redis-backed entity storage
- README.md: OSS user flow documentation

The evaluation service accepts a RawQueryLogEntry and returns a flat EvaluationResult. Fidesplus adds platform connectors (BigQuery, Snowflake) and the access control dashboard on top.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add type: ignore comments for imports that reference modules not yet available in fides OSS (data_purpose, data_consumer, sqlglot). Add type annotation for the collections variable in evaluate.py. Add changelog entry for this PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move type: ignore[import-not-found] to the from-line so mypy respects the suppression. Use empty tuple default for collections to match the dict[str, tuple[str, ...]] return type. Update changelog 7700 PR number to 7741 to match this PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed c4da20e to f6ac648.
- Add sqlglot~=30.0.3 to pyproject.toml dependencies
- Add logger.warning on SQL parse failures for observability
- Move import re and get_cache to module-level imports
- Wire dataset purposes through init-time map instead of empty stub
- Rename short variables (dk, v, pk) to descriptive names
- Extract PBAC_CONTROL_TYPE constant for magic string
- Replace ValueError with ResourceNotFoundError in purpose validation
- Update README to reflect actual OSS resolution chain (no members yet)
- Add comment explaining Rule 1/3 asymmetry in evaluate engine
- Add unit tests for evaluate_access and integration tests for the service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dependency Review: The following issues were found:
Snapshot Warnings: Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.
Ticket ENG-2787
Description Of Changes
Adds the open-source PBAC (Purpose-Based Access Control) evaluation service to fides. This defines the service boundary that fidesplus delegates to for all PBAC evaluation — purpose-based checks, identity resolution, and the Policy v2 evaluator interface.
The entire evaluation stack is OSS:
- `PBACEvaluationService` Protocol as the service boundary (input: `RawQueryLogEntry`, output: `EvaluationResult`)
- `InProcessPBACEvaluationService` with identity resolution, purpose overlap engine, and policy filtering
- `AccessPolicyEvaluator` Protocol with `NoOpPolicyEvaluator` default and implementation guide for Policy v2
- `BasicIdentityResolver` for OSS (email, external_id, members list matching)

Fidesplus adds platform connectors (BigQuery, Snowflake) and the access control dashboard on top.
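The service boundary described above can be sketched as a `typing.Protocol`. The field names on the two types are assumptions based on the PR description, not the actual fides definitions:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass(frozen=True)
class RawQueryLogEntry:
    # Assumed minimal shape: the SQL text plus the principal who ran it.
    query_text: str
    principal: str

@dataclass(frozen=True)
class EvaluationResult:
    compliant: bool
    violations: tuple = ()

@runtime_checkable
class PBACEvaluationService(Protocol):
    """Service boundary: fidesplus delegates evaluation through this interface."""
    def evaluate(self, entry: RawQueryLogEntry) -> EvaluationResult: ...

class AlwaysCompliantService:
    """Trivial implementation for illustration; the real
    InProcessPBACEvaluationService adds identity resolution, the
    purpose-overlap engine, and policy filtering behind the same method."""
    def evaluate(self, entry: RawQueryLogEntry) -> EvaluationResult:
        return EvaluationResult(compliant=True)
```

Any class with a matching `evaluate` method satisfies the Protocol structurally, which is what lets fidesplus swap in a richer implementation without inheriting from the OSS one.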
Code Changes
- `src/fides/service/pbac/types.py` — All shared types: `RawQueryLogEntry`, `TableRef`, `ConsumerPurposes`, `DatasetPurposes`, `QueryAccess`, `Violation`, `ValidationResult`, `ResolvedConsumer`, `EvaluationViolation`, `EvaluationResult`
- `src/fides/service/pbac/evaluate.py` — Zero-dependency PBAC engine (purpose overlap check)
- `src/fides/service/pbac/reason.py` — Human-readable violation reason generation
- `src/fides/service/pbac/service.py` — `PBACEvaluationService` Protocol + `InProcessPBACEvaluationService`
- `src/fides/service/pbac/sql_parser.py` — Generic SQL -> `RawQueryLogEntry` via sqlglot
- `src/fides/service/pbac/identity/` — `IdentityResolver` Protocol, `BasicIdentityResolver`, `RedisIdentityResolver`
- `src/fides/service/pbac/policies/` — `AccessPolicyEvaluator` Protocol, `NoOpPolicyEvaluator`, implementation guide
- `src/fides/service/pbac/consumers/` — `DataConsumerEntity` + `DataConsumerRedisRepository`
- `src/fides/service/pbac/purposes/` — `DataPurposeEntity` + `DataPurposeRedisRepository`
- `src/fides/service/pbac/redis_repository.py` — Generic `RedisRepository[T]` base class
- `src/fides/service/pbac/exceptions.py` — Domain exceptions
- `src/fides/service/pbac/engine/` — Backward-compat re-export shim
- `src/fides/service/pbac/evaluation/` — Backward-compat re-export shim
- `tests/service/pbac/policies/` — Unit tests for interface types, NoOp evaluator, and violation filtering

Steps to Confirm
- `from fides.service.pbac.types import EvaluationResult`
- `from fides.service.pbac.engine import evaluate_access`
- `pytest tests/service/pbac/`
- Install with the `fides-pkg` flag and run `python demo/pbac_demo.py` — all 7 steps should pass
- `CHANGELOG.md` updated

🤖 Generated with Claude Code