Vectorized Reads of Parquet with Identity Partitions #1287
Changes from all commits
```diff
@@ -31,10 +31,13 @@
 import org.apache.iceberg.hadoop.HadoopTables;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.spark.SparkTableUtil;
 import org.apache.iceberg.types.Types;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.TableIdentifier;
 import org.junit.AfterClass;
 import org.junit.Assert;
 import org.junit.Before;
@@ -110,24 +113,52 @@ public static void stopSpark() {
   private Table table = null;
   private Dataset<Row> logs = null;
 
+  /**
+   * Use the Hive based table to make identity partition columns with no duplication of the data in the underlying
+   * parquet files. This makes sure that if the identity mapping fails, the test will also fail.
+   */
+  private void setupParquet() throws Exception {
+    File location = temp.newFolder("logs");
+    File hiveLocation = temp.newFolder("hive");
+    String hiveTable = "hivetable";
+    Assert.assertTrue("Temp folder should exist", location.exists());
+
+    Map<String, String> properties = ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format);
+    this.logs = spark.createDataFrame(LOGS, LogMessage.class).select("id", "date", "level", "message");
+    spark.sql(String.format("DROP TABLE IF EXISTS %s", hiveTable));
+    logs.orderBy("date", "level", "id").write().partitionBy("date", "level").format("parquet")
+        .option("path", hiveLocation.toString()).saveAsTable(hiveTable);
+
+    this.table = TABLES.create(SparkSchemaUtil.schemaForTable(spark, hiveTable),
+        SparkSchemaUtil.specForTable(spark, hiveTable), properties, location.toString());
+
+    SparkTableUtil.importSparkTable(spark, new TableIdentifier(hiveTable), table, location.toString());
+  }
+
   @Before
   public void setupTable() throws Exception {
-    File location = temp.newFolder("logs");
-    Assert.assertTrue("Temp folder should exist", location.exists());
-
-    Map<String, String> properties = ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format);
-    this.table = TABLES.create(LOG_SCHEMA, spec, properties, location.toString());
-    this.logs = spark.createDataFrame(LOGS, LogMessage.class).select("id", "date", "level", "message");
-
-    logs.orderBy("date", "level", "id").write().format("iceberg").mode("append").save(location.toString());
+    if (format.equals("parquet")) {
+      setupParquet();
+    } else {
+      File location = temp.newFolder("logs");
+      Assert.assertTrue("Temp folder should exist", location.exists());
+
+      Map<String, String> properties = ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format);
+      this.table = TABLES.create(LOG_SCHEMA, spec, properties, location.toString());
+      this.logs = spark.createDataFrame(LOGS, LogMessage.class).select("id", "date", "level", "message");
+
+      logs.orderBy("date", "level", "id").write().format("iceberg").mode("append").save(location.toString());
+    }
   }
 
   @Test
   public void testFullProjection() {
     List<Row> expected = logs.orderBy("id").collectAsList();
     List<Row> actual = spark.read().format("iceberg")
         .option("vectorization-enabled", String.valueOf(vectorized))
-        .load(table.location()).orderBy("id").collectAsList();
+        .load(table.location()).orderBy("id")
+        .select("id", "date", "level", "message")
```
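For context on the `setupParquet` change: a Hive-style `partitionBy` write stores the partition values only in the directory names, never inside the parquet data files, so the imported table genuinely exercises Iceberg's identity-partition handling. A minimal sketch of how to observe this (paths are hypothetical, reusing the `spark` session and imports from the test above):

```java
// Reading a leaf partition directory directly exposes only the columns that
// are physically stored in the files: id and message.
Dataset<Row> leaf = spark.read().parquet("/tmp/hive/date=2020-02-02/level=info");
leaf.printSchema();

// Reading from the table root lets Spark reconstruct date and level from the
// directory names via partition discovery.
Dataset<Row> full = spark.read().parquet("/tmp/hive");
full.printSchema();
```

If the vectorized read path failed to materialize the identity-partitioned columns, they would come back missing or wrong, and the row comparisons in these tests would fail.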
|
Contributor (on the added `select` line): Isn't this the default? Why was it necessary to add it?
Member (Author): When I added the Hive import, it gets the schema in a different order. I think this may be an issue with the import code? I'm not sure, but I know the default column order does not come out the same way :/
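The reordering is easy to reproduce outside the import path: for partitioned datasource tables, Spark moves the partition columns to the end of the saved schema. A standalone sketch (the table name `t` and the values are hypothetical):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionOrderRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("repro").master("local[1]").getOrCreate();

    // Columns are written in the order: id, date, level, message.
    Dataset<Row> df = spark.sql(
        "SELECT 1 AS id, '2020-02-02' AS date, 'info' AS level, 'hi' AS message");
    df.write().partitionBy("date", "level").format("parquet").saveAsTable("t");

    // Prints id, message, date, level: the partition columns are listed last,
    // so the saved schema no longer matches the original column order.
    spark.table("t").printSchema();

    spark.stop();
  }
}
```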
Contributor: That's suspicious. We'll have to look into why the schema has the wrong order.
Member (Author): I'll try to figure out the actual issue today, but I agree it shouldn't work this way. My assumption is that the Hive table schema is just being listed in a different order, or that the order is getting scrambled when we use SparkSchemaUtil.
Member (Author): I spent some time digging into this. Spark's `saveAsTable` builds the catalog entry as:

```scala
val tableDesc = CatalogTable(
  identifier = tableIdent,
  tableType = tableType,
  storage = storage,
  schema = new StructType,
  provider = Some(source),
  partitionColumnNames = partitioningColumns.getOrElse(Nil),
  bucketSpec = getBucketSpec)
```

which strips out whatever incoming schema you have, so the new table is created without any information about the actual ordering of the columns you used in the create. Then, when the relation is resolved, the attributes are looked up again and the schema is created from the attribute output. Long story short: `saveAsTable` doesn't care about your field ordering, as far as I can tell. This is all in Spark, and I'm not sure we can do anything about it here.
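In practice that means re-imposing a canonical column order before comparing, which is what the added `select` does. A sketch of the pattern, reusing names from the test:

```java
// Pin the column order explicitly instead of relying on the order the
// catalog reports back after saveAsTable/import.
Dataset<Row> normalized = spark.table("hivetable")
    .select("id", "date", "level", "message");
```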
Contributor: I'm fine with this, then. Thanks for looking into it!
```diff
+        .collectAsList();
     Assert.assertEquals("Rows should match", expected, actual);
   }
```
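One detail that makes the explicit projection matter for the assertion: `Row` equality in Spark is positional, so identical values in a different column order never compare equal. A tiny illustration (values are hypothetical):

```java
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Row.equals compares values by position, not by field name, so the same
// four values in a different order are not equal.
Row a = RowFactory.create(1, "2020-02-02", "info", "hello");
Row b = RowFactory.create(1, "info", "2020-02-02", "hello");
System.out.println(a.equals(b));  // false
```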
|
|