
Conversation

@ayushtkn (Member)

What changes were proposed in this pull request?

Add a config which ensures all the data files are within the table location.

Why are the changes needed?

Security use-cases

Does this PR introduce any user-facing change?

Yes. If the config is turned on, Iceberg tables with data files outside the table directory won't be readable.

Is the change a dependency upgrade?

No

How was this patch tested?

UT (unit tests)

"The number of threads to be used for deleting files during expire snapshot. If set to 0 or below it uses the" +
" defult DirectExecutorService"),

HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY("hive.iceberg.allow.data.in.table.location.only", false,
Member

can we use managed table config for that?


No, unfortunately the issue that is addressed here is Iceberg specific.


@ayushtkn Could we extend the description with the note that this breaks Iceberg tables with data files located outside of the table location?

@deniskuzZ (Member) Dec 1, 2023

@ayushtkn, maybe hive.iceberg.allow.datafiles.in.table.location.only
also, it should be added to the restricted list so that only the administrator can change it; otherwise it could be changed at the session level

@ayushtkn (Member, Author)

Changed & added to restricted list. I think what it breaks is meant for the documentation, we don't usually mention the after effects of the configs in the description.
It is indicative the files should be within table location
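
(For context: restricted configs in Hive are driven by hive.conf.restricted.list in HiveConf.java. A rough sketch of what the entry involves; the neighbouring list items shown here are illustrative, and the real default contains many more:)

// HiveConf.java - options named in hive.conf.restricted.list cannot be
// overridden with SET at the session level; only an administrator can
// change them. Sketch only: the real default value lists many more entries.
HIVE_CONF_RESTRICTED_LIST("hive.conf.restricted.list",
    "hive.security.authenticator.manager," +
    "hive.security.authorization.manager," +
    "hive.iceberg.allow.datafiles.in.table.location.only",
    "Comma separated list of configuration options which are immutable at runtime"),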

job.getBoolean(HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.varname,
HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.defaultBoolVal);
if (dataFilesWithinTableLocationOnly) {
Path tableLocation = new Path(job.get(InputFormatConfig.TABLE_LOCATION));
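
(A minimal sketch of the kind of containment check this snippet feeds into; the helper name and the exact validation below are assumptions, not the literal patch code:)

import org.apache.hadoop.fs.Path;

// Sketch: walk up the data file's parents and require that the table
// location appears among them. Comparing Path objects rather than raw
// strings avoids the naive prefix pitfall where /warehouse/tbl would
// also match /warehouse/tbl_other. Scheme/authority normalization is
// elided here for brevity.
static void validateDataFileLocation(Path tableLocation, String dataFilePath) {
  Path parent = new Path(dataFilePath).getParent();
  while (parent != null) {
    if (parent.equals(tableLocation)) {
      return; // the file lives under the table directory
    }
    parent = parent.getParent();
  }
  throw new IllegalArgumentException(
      "Data file is outside of the table location: " + dataFilePath);
}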
Member

why do we restrict on reads - that is expensive? we shouldn't allow any additions outside of the table dir

Member

if there is Spark or whatever that is not secure - that is not our issue

@ayushtkn (Member, Author)

They didn't write the data files via Hive; the files were already there in a secured table. They got hold of the paths & put them in the metadata of their new table.
So the new table was referencing paths of another table which they don't have access to. So it is the read flow only.
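
(To illustrate the attack path described above, a hedged sketch using the Iceberg core API; the table, path, and counts are made up:)

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.Table;

// Sketch: attackerTable is a table the attacker legitimately owns, while the
// path below points into some other, secured table's directory.
static void referenceForeignDataFile(Table attackerTable) {
  DataFile stolen = DataFiles.builder(attackerTable.spec())
      .withPath("hdfs://nn/warehouse/secured_db.db/secured_tbl/data/000000_0.parquet")
      .withFormat("PARQUET")
      .withFileSizeInBytes(1024L) // size and record count can simply be guessed
      .withRecordCount(100L)
      .build();
  attackerTable.newAppend().appendFile(stolen).commit();
  // A later SELECT through Hive with doAs=false reads the file with the Hive
  // service user's privileges - exactly the read flow this patch restricts.
}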

@deniskuzZ (Member) Nov 30, 2023

if they managed to do so, they can read the data with some custom script or even patch Hive with a custom jar. I don't see how that prevents the data breach.


The read path constraint is to avoid reading other locations' - aka other tables' - data from such a malicious table. Such a table can be constructed manually, not necessarily written by Spark (actually most probably constructed with other methods).
The problem here is that without this read limitation, the user can use Hive's elevated privileges (doAs=false) to access secured data even if the data doesn't belong to the user's own table.


If jar injection to override behaviour is only possible for an admin, then this should not be a blocker for this scope. E.g. an admin could even inject a jar to override the AuthN chain and dump username/password pairs from those who connect to HS2 (e.g. via JDBC) with that auth method.

doAs=true is not an option for Fine-Grained Access Control, where this issue is the most significant (e.g. otherwise masked data is breached as non-masked). Spark has no FGAC and no decoupling of elevated-privilege data access from end-user file access.

> if the data location is sensitive - why should we leak it?

Historically it was not sensitive, and even with Iceberg it should not be treated as sensitive. The issue comes from Iceberg's new behaviour: instead of limiting reads to a directory, as in the Hive table format case, Iceberg can now read data files from anywhere the Hive service user has access to, as long as the location is in its manifest file.

Member

note that this would also constrain existing Iceberg functionality, where you can load data into the table from multiple source locations and avoid a data copy.


I agree, and it also breaks multiple data locations in the case of tiered-storage usage, or just the movement of the Iceberg table's write.data.path. But that's why this is behind a configuration flag that is not enabled by default: it is considered only a temporary quick fix for those who would rather break those features temporarily (especially if they only use Iceberg as a standard external table and don't notice the limitations) than experience a data breach. We are also already discussing a more complex possible solution where neither the Iceberg API and functionality are limited, nor malicious tables could expose other tables' data.

Member

ok, also one of the findings (Implement LOAD data for partitioned tables via Append API) won't be possible under this config


Right. We can articulate the limitation better in the HiveConf.java change, like adding that this breaks all Iceberg tables with data located outside of the table's location.
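
(For instance, a hedged sketch of a more explicit description; the wording is illustrative, not the committed text:)

// Illustrative only: a description that spells out the limitation.
HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY(
    "hive.iceberg.allow.datafiles.in.table.location.only", false,
    "If true, data files can only be read if they reside within the table directory. " +
    "Note: enabling this breaks all Iceberg tables whose data files live outside " +
    "the table location, e.g. multi-location loads or a custom write.data.path."),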

job.getBoolean(HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.varname,
HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.defaultBoolVal);
if (dataFilesWithinTableLocationOnly) {
Path tableLocation = new Path(job.get(InputFormatConfig.TABLE_LOCATION));


> place a custom jar into the classpath

That would be adding a malicious jar to the AUX path, like faking it as a UDF, right?
Could a custom/per-user jar lead to the same class override?

  • If only a jar placed on the AUX path is the problem, then I would not bring it into this issue's scope, as adding a jar to the AUX path is something only admins can do, not users.
  • If a user can add a jar at runtime to override these Iceberg classes, then this should definitely be a follow-up fix on this issue, but only if it is not a global problem, like overriding other classes (especially e.g. Ranger authorization classes, or masking functions that policies could rely on).

> Why don't we consider storage-based authorization

This is a quick fix until the storage-based authorization solution - or another, more robust one - is worked out.

> not expose the metadata

Unfortunately that might break Iceberg functionality, e.g. by limiting the output of metadata table queries like catalog.table.files or .snapshots, etc., and a fix that depends on assuming the data locations are never leaked is not a robust one either.
Plus, the issue this fix works around for now can also target non-Iceberg tables' data, where getting the data locations is easy via INPUT__FILE__NAME.

// Until the vectorized reader can handle delete files, let's fall back to non-vector mode for V2 tables
fallbackToNonVectorizedModeBasedOnProperties(tableDesc.getProperties());

boolean allowDataInTableLocationOnly =
Member

should it be consistent with the IcebergInputFormat var dataFilesWithinTableLocationOnly? Maybe change it in both places to allowDataFilesWithinTableLocationOnly.

@deniskuzZ (Member) left a comment

LGTM, minor comment on var naming

sonarqubecloud bot commented Dec 5, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)

No Coverage information
No Duplication information

Warning: the version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

@ayushtkn merged commit 66b51d6 into apache:master on Dec 5, 2023
asfgit pushed a commit that referenced this pull request Dec 5, 2023
…le location. (#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)
ayushtkn added a commit to ayushtkn/hive that referenced this pull request Dec 7, 2023
…le location. (apache#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…le location. (apache#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)
dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Mar 7, 2024
…le location. (apache#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)