Conversation
Vercel: 1 Skipped Deployment
Codecov Report

❌ Your patch status has failed because the patch coverage (37.73%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main    #6199       +/-   ##
===========================================
- Coverage   87.14%   76.69%   -10.45%
===========================================
  Files         433      439        +6
  Lines       26863    27278      +415
  Branches     2935     2969       +34
===========================================
- Hits        23410    20922     -2488
- Misses       2820     5638     +2818
- Partials      633      718       +85
JadeCara left a comment:
A few suggestions, no blockers though :) Nice work and so many tests!
`return f"{self.model_class}/{instance_id}/{self.field_name}/{timestamp}.txt"`
`def __get__(self, instance: Any, owner: Type) -> Any:`
I don't see owner being used anywhere in this function - can probably be removed from the signature?
This is one of the methods that are part of the descriptor protocol; we need to keep it even if we don't use it: https://docs.python.org/3/reference/datamodel.html#object.__get__
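For context, Python itself always invokes the hook as `__get__(instance, owner)`, passing the owning class even when the descriptor ignores it, so the parameter has to stay in the signature. A minimal sketch with a hypothetical `Logged` descriptor (not the PR's code):

```python
class Logged:
    """Minimal data descriptor. Python always calls __get__ with
    (instance, owner), so the signature must accept both, even if
    owner goes unused in the body."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner):
        if instance is None:  # accessed on the class itself
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


class Model:
    field = Logged()


m = Model()
m.field = 42
print(m.field)  # 42
```

Accessing `Model.field` on the class (instance is `None`) returns the descriptor object itself, which is the conventional behavior.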
I am not sure if this belongs in models top level? It kind of blends in with the regular models.
I moved it under /models/field_types
`return is_large`
These top two functions might be useful in other areas as well. I wonder if putting them into a data_size.py type util would make them easier to find/use?
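A hypothetical `data_size.py` along those lines might look like this (the module name, helper names, and threshold are illustrative; the PR's actual helpers may differ):

```python
# data_size.py -- hypothetical shared util module; names and the
# threshold value are illustrative, not the PR's actual implementation.
import json
from typing import Any

# PostgreSQL caps a single column value at 1 GB; trigger offload well below it.
LARGE_DATA_THRESHOLD_BYTES = 640 * 1024 * 1024  # illustrative threshold


def calculate_data_size(data: Any) -> int:
    """Approximate the serialized size of `data` in bytes."""
    return len(json.dumps(data, default=str).encode("utf-8"))


def is_large_data(data: Any, threshold: int = LARGE_DATA_THRESHOLD_BYTES) -> bool:
    """True when `data` would exceed the external-storage threshold."""
    return calculate_data_size(data) > threshold
```

Centralizing the threshold in one module also keeps the "how big is too big" decision in a single testable place.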
# Get storage config
if storage_key:
    storage_config = (
        db.query(StorageConfig)
        .filter(StorageConfig.key == storage_key)
        .first()
    )
    if not storage_config:
        msg = f"Storage configuration with key '{storage_key}' not found"
        logger.error(msg)
        raise ExternalDataStorageError(msg)
else:
    storage_config = get_active_default_storage_config(db)
    if not storage_config:
        msg = "No active default storage configuration available for large data"
        logger.error(msg)
        raise ExternalDataStorageError(msg)
This function is doing a number of things - this top part might be a good candidate to pull out into its own testable function.
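One way to make that top chunk independently testable is to extract it and inject the two lookups, so the branching logic can be unit-tested without a database session. A hypothetical sketch (the function and parameter names are illustrative, not the PR's code):

```python
# Hypothetical extraction of the storage-config lookup. The two lookups
# are injected as callables so the branch logic is testable without a
# real SQLAlchemy session.
from typing import Callable, Optional


class ExternalDataStorageError(Exception):
    """Raised when no usable storage configuration can be found."""


def resolve_storage_config(
    lookup_by_key: Callable[[str], Optional[object]],
    get_default: Callable[[], Optional[object]],
    storage_key: Optional[str],
):
    """Return the storage config to use, or raise ExternalDataStorageError."""
    if storage_key:
        config = lookup_by_key(storage_key)
        if config is None:
            raise ExternalDataStorageError(
                f"Storage configuration with key '{storage_key}' not found"
            )
        return config
    config = get_default()
    if config is None:
        raise ExternalDataStorageError(
            "No active default storage configuration available for large data"
        )
    return config
```

With this shape, both branches (explicit key vs. active default) and both failure modes are four small unit tests.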
Cypress run summary:

| Project | fides |
|---|---|
| Branch Review | main |
| Run duration | 00m 53s |
| Committer | Adrian Galvan |
| Test results | 0 / 0 / 0 / 0 / 5 |
Closes ENG-684
Description Of Changes
This PR introduces an automatic external storage fallback system to handle data that exceeds PostgreSQL's 1 GB column limit. The solution centers around a new `EncryptedLargeDataDescriptor` that can be seamlessly applied to SQLAlchemy model columns. This descriptor maintains backward compatibility by storing small data directly in the database as before, while automatically detecting large datasets and transparently offloading them to encrypted external storage through configurable backends (local, S3, or GCS).

EncryptedLargeDataDescriptor
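The offload-on-write descriptor pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: a dict stands in for encrypted external storage, the threshold is tiny for demonstration, and there is no encryption or SQLAlchemy integration here.

```python
# Minimal sketch of the external-storage-fallback descriptor pattern.
import json

EXTERNAL_STORE = {}      # stand-in for encrypted external storage
THRESHOLD_BYTES = 1024   # illustrative; the real check guards the 1 GB limit


class LargeDataDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        raw = json.dumps(value)
        if len(raw.encode("utf-8")) > THRESHOLD_BYTES:
            # Large: offload the payload and keep only a pointer inline.
            key = f"{type(instance).__name__}/{id(instance)}/{self.name}"
            EXTERNAL_STORE[key] = raw
            instance.__dict__[self.name] = {"external_ref": key}
        else:
            # Small: store inline, exactly as before.
            instance.__dict__[self.name] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        stored = instance.__dict__.get(self.name)
        if isinstance(stored, dict) and "external_ref" in stored:
            return json.loads(EXTERNAL_STORE[stored["external_ref"]])
        return stored


class RequestTask:
    access_results = LargeDataDescriptor()


task = RequestTask()
task.access_results = ["small"]      # below threshold: stays inline
task.access_results = ["x" * 2000]   # above threshold: offloaded transparently
```

Callers read and write `task.access_results` identically in both cases, which is what makes the fallback backward compatible.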
Applied to `RequestTask.access_results`, `RequestTask.data_for_erasures`, and `PrivacyRequest.filtered_final_upload`.

AES GCM Encryption Utilities
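The thread doesn't show the PR's actual encryption utilities, but AES-GCM encrypt/decrypt helpers of this kind can be sketched with the `cryptography` package. The helper names `encrypt_bytes`/`decrypt_bytes` are assumptions for illustration:

```python
# Sketch of AES-GCM helpers using the `cryptography` package; the PR's
# exact utility names and key handling are not shown in this thread.
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def encrypt_bytes(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt with AES-GCM, prepending the random 96-bit nonce so the
    returned blob is self-contained for decryption."""
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)


def decrypt_bytes(key: bytes, blob: bytes) -> bytes:
    """Split off the nonce and decrypt; raises InvalidTag on tampering."""
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)


key = AESGCM.generate_key(bit_length=256)
blob = encrypt_bytes(key, b"large serialized payload")
assert decrypt_bytes(key, blob) == b"large serialized payload"
```

Because GCM is authenticated, decryption fails loudly if the stored blob is corrupted or tampered with, which matters for data parked in external storage.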
ExternalStorageService
Additional Code Changes
Steps to Confirm
- In `.env`: `FIDES__EXECUTION__USE_DSR_3_0=true`
- html
- Large Data Test Connector. The connection test will fail, that's ok
- jane@example.com
- fidesplus/fides_uploads
- This could take a while, it took ~13 minutes from beginning to end
Pre-Merge Checklist
- `CHANGELOG.md` updated
- main
- `downgrade()` migration is correct and works