
Conversation

@ayushtkn (Member)

What changes were proposed in this pull request?

Add a config which ensures all the data files are within the table location.

Why are the changes needed?

Security use-cases

Does this PR introduce any user-facing change?

Yes. If the config is turned on, Iceberg tables with data files outside the table directory won't be readable.

Is the change a dependency upgrade?

No

How was this patch tested?

UT (unit tests)

"The number of threads to be used for deleting files during expire snapshot. If set to 0 or below it uses the" +
" defult DirectExecutorService"),

HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY("hive.iceberg.allow.data.in.table.location.only", false,
Member

can we use managed table config for that?


No, unfortunately the issue that is addressed here is Iceberg specific.


@ayushtkn Could we extend the description with the note that this breaks Iceberg tables with data files located outside of the table location?

@deniskuzZ (Member) Dec 1, 2023

@ayushtkn, maybe hive.iceberg.allow.datafiles.in.table.location.only
also, it should be added to the restricted list so that only the administrator can change it; otherwise it could be changed at the session level

@ayushtkn (Member, Author)

Changed & added to restricted list. I think what it breaks is meant for the documentation, we don't usually mention the after effects of the configs in the description.
It is indicative the files should be within table location
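
(For context: restricted configs in Hive are driven by hive.conf.restricted.list in HiveConf.java. A rough sketch of what the entry involves; the neighbouring list items shown here are illustrative, and the real default contains many more:)

// HiveConf.java - options named in hive.conf.restricted.list cannot be
// overridden with SET at the session level; only an administrator can
// change them. Sketch only: the real default value lists many more entries.
HIVE_CONF_RESTRICTED_LIST("hive.conf.restricted.list",
    "hive.security.authenticator.manager," +
    "hive.security.authorization.manager," +
    "hive.iceberg.allow.datafiles.in.table.location.only",
    "Comma separated list of configuration options which are immutable at runtime"),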

job.getBoolean(HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.varname,
HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.defaultBoolVal);
if (dataFilesWithinTableLocationOnly) {
Path tableLocation = new Path(job.get(InputFormatConfig.TABLE_LOCATION));
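
(A minimal sketch of the kind of containment check this snippet feeds into; the helper name and the exact validation below are assumptions, not the literal patch code:)

import org.apache.hadoop.fs.Path;

// Sketch: walk up the data file's parents and require that the table
// location appears among them. Comparing Path objects rather than raw
// strings avoids the naive prefix pitfall where /warehouse/tbl would
// also match /warehouse/tbl_other. Scheme/authority normalization is
// elided here for brevity.
static void validateDataFileLocation(Path tableLocation, String dataFilePath) {
  Path parent = new Path(dataFilePath).getParent();
  while (parent != null) {
    if (parent.equals(tableLocation)) {
      return; // the file lives under the table directory
    }
    parent = parent.getParent();
  }
  throw new IllegalArgumentException(
      "Data file is outside of the table location: " + dataFilePath);
}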
Member

why do we restrict on reads - that is expensive? we shouldn't allow any additions outside of the table dir

Member

if there is Spark or whatever that is not secure - that is not our issue

@ayushtkn (Member, Author)

They didn't write the data files via Hive; the files were already there in a secured table. They got hold of the paths & put them in the metadata of their new table.
So the new table was referencing paths of another table which they don't have access to. So it is the read flow only.
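
(To illustrate the attack path described above, a hedged sketch using the Iceberg core API; the table, path, and counts are made up:)

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.Table;

// Sketch: attackerTable is a table the attacker legitimately owns, while the
// path below points into some other, secured table's directory.
static void referenceForeignDataFile(Table attackerTable) {
  DataFile stolen = DataFiles.builder(attackerTable.spec())
      .withPath("hdfs://nn/warehouse/secured_db.db/secured_tbl/data/000000_0.parquet")
      .withFormat("PARQUET")
      .withFileSizeInBytes(1024L) // size and record count can simply be guessed
      .withRecordCount(100L)
      .build();
  attackerTable.newAppend().appendFile(stolen).commit();
  // A later SELECT through Hive with doAs=false reads the file with the Hive
  // service user's privileges - exactly the read flow this patch restricts.
}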

@deniskuzZ (Member) Nov 30, 2023

if they managed to do so, they can read the data with some custom script or even patch Hive with a custom jar. I don't see how that prevents the data breach.


The read path constraint is to avoid reading other locations' - aka other tables' - data from such a malicious table. Such a table can be constructed manually, not necessarily written by Spark (actually most probably constructed with other methods).
The problem here is that without this read limitation, the user can use Hive's elevated privileges (doAs=false) to access secured data even if the data doesn't belong to the user's own table.


If jar injection to override behaviour is only possible for an admin, then this should not be a blocker for this scope. E.g. an admin could even inject a jar to override the AuthN chain and dump username/password pairs from those who connect to HS2 (e.g. via JDBC) with that auth method.

doAs=true is not an option for Fine-Grained Access Control, where this issue is the most significant (e.g. otherwise masked data is breached as non-masked). Spark has no FGAC and no decoupling of elevated-privilege data access from end-user file access.

> if the data location is sensitive - why should we leak it?

Historically it was not sensitive, and even with Iceberg it should not be treated as sensitive. The issue comes from Iceberg's new behaviour: instead of limiting reads to a directory, as in the Hive table format case, Iceberg can now read data files from anywhere the Hive service user has access to, as long as the location is in its manifest file.

Member

note that this would also constrain existing Iceberg functionality, where you can load data into the table from multiple source locations and avoid a data copy.


I agree, and it also breaks multiple data locations in the case of tiered-storage usage, or just the movement of the Iceberg table's write.data.path. But that's why this is behind a configuration flag that is not enabled by default: it is considered only a temporary quick fix for those who would rather break those features temporarily (especially if they only use Iceberg as a standard external table and don't notice the limitations) than experience a data breach. We are also already discussing a more complex possible solution where neither the Iceberg API and functionality are limited, nor malicious tables could expose other tables' data.

Member

ok, also one of the findings (Implement LOAD data for partitioned tables via Append API) won't be possible under this config


Right. We can articulate the limitation better in the HiveConf.java change, like adding that this breaks all Iceberg tables with data located outside of the table's location.
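
(For instance, a hedged sketch of a more explicit description; the wording is illustrative, not the committed text:)

// Illustrative only: a description that spells out the limitation.
HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY(
    "hive.iceberg.allow.datafiles.in.table.location.only", false,
    "If true, data files can only be read if they reside within the table directory. " +
    "Note: enabling this breaks all Iceberg tables whose data files live outside " +
    "the table location, e.g. multi-location loads or a custom write.data.path."),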

job.getBoolean(HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.varname,
HiveConf.ConfVars.HIVE_ICEBERG_ALLOW_DATA_IN_TABLE_LOCATION_ONLY.defaultBoolVal);
if (dataFilesWithinTableLocationOnly) {
Path tableLocation = new Path(job.get(InputFormatConfig.TABLE_LOCATION));


> place a custom jar into the classpath

That would be adding a malicious jar to the AUX path, like faking it as a UDF, right?
Could a custom/per-user jar lead to the same class override?

  • If only a jar placed on the AUX path is the problem, then I would not bring it into this issue's scope, as adding a jar to the AUX path is something only admins can do, not users.
  • If a user can add a jar at runtime to override these Iceberg classes, then this should definitely be a follow-up fix on this issue, but only if it is not a global problem, like overriding other classes (especially e.g. Ranger authorization classes, or masking functions that policies could rely on).

> Why don't we consider storage-based authorization

This is a quick fix until the storage-based authorization solution - or another, more robust one - is worked out.

> not expose the metadata

Unfortunately that might break Iceberg functionality, e.g. by limiting the output of metadata table queries like catalog.table.files or .snapshots, etc., and a fix that depends on assuming the data locations are never leaked is not a robust one either.
Plus, the issue this fix works around for now can also target non-Iceberg tables' data, where getting the data locations is easy via INPUT__FILE__NAME.

// Until the vectorized reader can handle delete files, let's fall back to non-vector mode for V2 tables
fallbackToNonVectorizedModeBasedOnProperties(tableDesc.getProperties());

boolean allowDataInTableLocationOnly =
Member

should it be consistent with the IcebergInputFormat var dataFilesWithinTableLocationOnly? Maybe change it in both places to allowDataFilesWithinTableLocationOnly.

@deniskuzZ (Member) left a comment

LGTM, minor comment on var naming

sonarqubecloud bot commented Dec 5, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)

No Coverage information
No Duplication information

Warning: the version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

@ayushtkn merged commit 66b51d6 into apache:master on Dec 5, 2023
asfgit pushed a commit that referenced this pull request Dec 5, 2023
…le location. (#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)
ayushtkn added a commit to ayushtkn/hive that referenced this pull request Dec 7, 2023
…le location. (apache#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…le location. (apache#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)
dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Mar 7, 2024
…le location. (apache#4910). (Ayush Saxena, reviewed by Denys Kuzmenko)