HIVE-27323: Iceberg: malformed manifest file or list can cause data breach. #4910
Conversation
| "The number of threads to be used for deleting files during expire snapshot. If set to 0 or below it uses the" + | ||
| " defult DirectExecutorService"), | ||
|
|
||
| HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY("hive.iceberg.allow.data.in.table.location.only", false, |
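For context, a hedged reconstruction of how the full ConfVars entry might read once the truncated definition above is closed out; only the key name and default value come from the diff, the description string is an assumption:

```java
// Hypothetical completion of the ConfVars entry quoted above; the description
// wording is assumed, not the PR's exact text.
HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY("hive.iceberg.allow.data.in.table.location.only", false,
    "If true, reject Iceberg reads that reference data files located outside the table location."),
```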
can we use managed table config for that?
No, unfortunately the issue that is addressed here is Iceberg specific.
@ayushtkn Could we extend the description with the note that this breaks Iceberg tables with data files located outside of the table location?
@ayushtkn, maybe hive.iceberg.allow.datafiles.in.table.location.only
Also, it should be added to the restricted list so that only the administrator can change it; otherwise it could be changed at the session level.
Changed & added to restricted list. I think what it breaks is meant for the documentation, we don't usually mention the after effects of the configs in the description.
It is indicative the files should be within table location
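As a side note, a minimal sketch of what landing on the restricted list means in practice, assuming the standard HiveConf semantics where verifyAndSet rejects keys on hive.conf.restricted.list:

```java
import org.apache.hadoop.hive.conf.HiveConf;

// Minimal sketch: once a key is on hive.conf.restricted.list, session-level
// overrides are rejected by HiveConf (standard behavior, not PR-specific code).
public class RestrictedConfSketch {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    try {
      conf.verifyAndSet("hive.iceberg.allow.data.in.table.location.only", "false");
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected at session level: " + e.getMessage());
    }
  }
}
```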
job.getBoolean(HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.varname,
    HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.defaultBoolVal);
if (dataFilesWithingTableLocationOnly) {
  Path tableLocation = new Path(job.get(InputFormatConfig.TABLE_LOCATION));
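For readers following along, a minimal sketch of the kind of containment check this snippet leads into, using a hypothetical helper rather than the PR's actual validation code:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: reject any data file whose qualified path is not under
// the table root, so a crafted manifest cannot point reads at foreign files.
final class TableLocationCheck {
  static void assertWithinTableLocation(Path tableLocation, Path dataFile, Configuration conf)
      throws IOException {
    FileSystem fs = tableLocation.getFileSystem(conf);
    String tableRoot = fs.makeQualified(tableLocation).toString();
    String file = fs.makeQualified(dataFile).toString();
    if (!file.startsWith(tableRoot + Path.SEPARATOR)) {
      throw new IOException("Data file " + file + " is outside table location " + tableRoot);
    }
  }
}
```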
Why do we restrict on reads - that is expensive? We shouldn't allow any additions outside of the table dir instead.
If there is Spark or whatever that is not secure - that is not our issue.
They didn't write the data files via Hive; the files were already there in a secured table. They got hold of the paths & put them in the metadata of their new table.
So the new table was referencing paths of another table, which they don't have access to. Hence it is the read flow only.
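To make the scenario concrete, a hedged illustration of how such a table could be constructed with the public Iceberg API; the table handle and victim path are hypothetical, and (as noted below) real attacks most probably craft the metadata by other means:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;

final class MaliciousManifestSketch {
  // attackerTable: a handle to a table the attacker owns (hypothetical).
  static void referenceForeignFile(Table attackerTable) {
    DataFile stolen = DataFiles.builder(PartitionSpec.unpartitioned())
        // Path of a file in someone else's secured table, never written by the attacker.
        .withPath("hdfs://nn/warehouse/secured_db/victim_table/data/part-00000.parquet")
        .withFormat("PARQUET")
        .withFileSizeInBytes(1024L)   // plausible placeholder values
        .withRecordCount(100L)
        .build();
    // The attacker's manifest now lists the foreign file; a later SELECT runs
    // with Hive's service credentials (doAs=false) and reads it.
    attackerTable.newAppend().appendFile(stolen).commit();
  }
}
```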
If they managed to do so, they can read the data with some custom script or even patch Hive with a custom jar. I don't see how this prevents the data breach.
The read-path constraint is to avoid reading another location's - aka another table's - data from such a malicious table. Such a table can be constructed manually, not necessarily written by Spark (actually most probably constructed with other methods).
The problem here is that without this read limitation, the user can use Hive's elevated privileges (doAs=false) to access secured data even if the data doesn't belong to the user's own table.
If jar injection to override behaviour is only possible by an admin, then this should not be a blocker for this scope. E.g. an admin could even inject a jar to override the AuthN chain and dump username/password pairs from those who connect to HS2 via JDBC with this auth method.
doAs=true is not an option for Fine-Grained Access Control, where this issue is the most significant (e.g. otherwise masked data is breached as non-masked). Spark has no FGAC and no decoupling of elevated-privilege data access from end-user file access.

> if the data location is sensitive - why should we leak it?

Historically it was not sensitive, and even with Iceberg it should not be treated as sensitive. The issue comes from Iceberg's new behaviour: instead of limiting reads to a directory as in the Hive table format case, Iceberg can now read data files from anywhere the Hive service user has access to, if the location is in its manifest file.
Note that this would also constrain the existing Iceberg functionality where you can load data into the table from multiple source locations and avoid a data copy.
I agree, and it also breaks multi-data-location setups, e.g. tiered-storage usage or a moved write.data.path for the Iceberg table's writes. But that's why this is behind a configuration flag that is not enabled by default, and it is considered only a temporary quick fix for those who would rather break that functionality temporarily (especially if they use Iceberg as a standard external table only, never noticing the limitations) than experience a data breach. We are also already discussing a more complex solution where neither the Iceberg API and functionality are limited, nor can malicious tables expose other tables' data.
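For reference, a hedged example of the legitimate multi-location feature this flag constrains: redirecting new data files outside the table location via the standard write.data.path table property (the table handle and bucket name are hypothetical):

```java
import org.apache.iceberg.Table;

final class TieredStorageSketch {
  static void moveDataPath(Table table) {
    // write.data.path is a standard Iceberg table property; with the new flag
    // enabled, files written under this external path become unreadable in Hive.
    table.updateProperties()
        .set("write.data.path", "s3://cold-tier-bucket/warehouse/tbl_data")
        .commit();
  }
}
```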
OK. Also, one of the findings (Implement LOAD data for partitioned tables via the Append API) won't be possible under this config.
Right. We can articulate the limitation better in the HiveConf.java change, like adding that this breaks all Iceberg tables with data located outside of the table's location.
job.getBoolean(HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.varname,
    HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.defaultBoolVal);
if (dataFilesWithingTableLocationOnly) {
  Path tableLocation = new Path(job.get(InputFormatConfig.TABLE_LOCATION));
> place a custom jar into the classpath

That would be adding a malicious jar to the AUX path, like faking it as a UDF, right? Could a custom/per-user jar lead to the same class override?
- If only an AUX-path-placed jar is the problem, then I would not bring it into this issue's scope, as adding a jar to the AUX path is something the user can't do, only admins.
- If a user can add a jar at runtime to override these Iceberg classes, then this should definitely be a follow-up fix on this issue, but only if it is not a global problem, like overriding other classes, especially e.g. Ranger authorization classes or masking functions that policies could rely on.

> Why don't we consider storage-based authorization

This is a quick fix until the storage-based authorization solution - or another more robust one - is worked out.

> not expose the metadata

Unfortunately that might break Iceberg functionality, e.g. by limiting the output of metadata table queries like catalog.table.files or .snapshots, etc., and a fix that depends on assuming the data locations are not leaked is not a robust one either.
Plus, the issue this fix works around for now also allows targeting non-Iceberg tables, where getting the data locations is easy via INPUT__FILE__NAME.
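To illustrate that last point, a hedged sketch of how easily file locations leak through Hive's INPUT__FILE__NAME virtual column over JDBC (the URL and table name are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class LeakFileLocations {
  public static void main(String[] args) throws Exception {
    // Placeholder HS2 endpoint and table; any table the user can SELECT from works.
    try (Connection c = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default");
         Statement s = c.createStatement();
         ResultSet rs = s.executeQuery(
             "SELECT DISTINCT INPUT__FILE__NAME FROM some_table")) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // full path of each backing data file
      }
    }
  }
}
```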
// Until the vectorized reader can handle delete files, let's fall back to non-vector mode for V2 tables
fallbackToNonVectorizedModeBasedOnProperties(tableDesc.getProperties());

boolean allowDataInTableLocationOnly =
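The quoted snippet is truncated mid-assignment; judging from the identical pattern quoted earlier in the thread, it presumably continues along these lines (an inference, not the verbatim diff):

```java
boolean allowDataInTableLocationOnly =
    job.getBoolean(HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.varname,
        HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.defaultBoolVal);
```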
Should it be consistent with the IcebergInputFormat var dataFilesWithinTableLocationOnly? Maybe change both to allowDataFilesWithinTableLocationOnly.
deniskuzZ left a comment:
LGTM, minor comment on var naming
Kudos, SonarCloud Quality Gate passed!
…le location. (#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)









What changes were proposed in this pull request?
Add a config which ensures all the data files are within the table location.
Why are the changes needed?
Security use-cases
Does this PR introduce any user-facing change?
Yes. If the config is turned on, Iceberg tables with data files outside the table directory won't be readable.
Is the change a dependency upgrade?
No
How was this patch tested?
UT