
DRILL-7156: Support empty Parquet files creation #1836

Closed · wants to merge 1 commit

Conversation

@oleg-zinovev (Contributor) commented Aug 5, 2019

PR adding read and write support for empty Parquet files in Drill.

Known limitations:

  1. Does not work for Hive Parquet for now
  2. Ignores all schemas except the last one while writing an empty Parquet file
  3. Does not support empty schemas (e.g. create table .. as select * from empty.json, where empty.json contains {})

Short description of the changes:

  1. Parquet footer metadata added
  2. The Parquet writer checks that at least one row has been written. If not, it creates an empty Parquet file containing only the footer (see the sketch after this list).
  3. EmptyParquetRowGroupScan and EmptyParquetScanBatchCreator added
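
A minimal sketch of the check described in item 2, using hypothetical names (this is not the actual Drill ParquetRecordWriter code; it only illustrates the "no rows written" flag):

// Hypothetical illustration; names and structure are assumptions, not Drill's API.
public class EmptyAwareParquetWriterSketch {

  private boolean empty = true;    // stays true until the first row is written
  private Object lastKnownSchema;  // schema captured when the writer is set up

  public void writeRow(Object[] row) {
    empty = false;                 // at least one row has been written
    // ... append the row to the current row group ...
  }

  public void cleanup() {
    if (empty) {
      // No rows were provided: still write a Parquet file that carries only the
      // schema and the footer metadata, so readers see a valid empty table.
      writeFileWithFooterOnly(lastKnownSchema);
    } else {
      flushRowGroupsAndFooter();   // normal path: flush buffered row groups and the footer
    }
  }

  private void writeFileWithFooterOnly(Object schema) { /* write schema + footer, no row groups */ }

  private void flushRowGroupsAndFooter() { /* flush row groups, then the footer */ }
}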

Questions:

  1. TestParquetWriterEmptyFiles#testMultipleWriters now creates several empty files but no longer fails, since reading empty Parquet files is supported. Should I rewrite the comment or remove the test?
@oleg-zinovev (Contributor, Author) commented Aug 5, 2019

@arina-ielchiieva, could you please review?

@oleg-zinovev force-pushed the oleg-zinovev:DRILL-7156-alt branch from 7d6bf74 to c803a09 on Aug 5, 2019
@arina-ielchiieva (Member) commented Aug 5, 2019

@oleg-zinovev thanks for making the changes, though the situation is a little bit awkward, since I was working on similar changes and did not know you intended to do them as well (https://issues.apache.org/jira/browse/DRILL-4517). I was working on reading empty Parquet files, though, not on writing them. I suggest you separate writing empty Parquet files into its own PR; for reading, it might be better to use my changes instead: first, you change the metadata cache files, which affects backward compatibility and requires storing more information than needed; second, your changes do not seem to optimize reading of complex types. What do you think?

@oleg-zinovev (Contributor, Author) commented Aug 5, 2019

Ok, I'll try to rewrite the commit within a week.

@arina-ielchiieva (Member) commented Aug 13, 2019

@oleg-zinovev support for reading empty Parquet files has already been merged into master (4f4e1af). Do you plan to work on adding support for writing empty Parquet files? We plan to include it in the next Drill release (end of August / beginning of September). If yes, please factor out writing empty Parquet files and update the PR.

@oleg-zinovev (Contributor, Author) commented Aug 13, 2019

> @oleg-zinovev support for reading empty Parquet files has already been merged into master (4f4e1af). Do you plan to work on adding support for writing empty Parquet files? We plan to include it in the next Drill release (end of August / beginning of September). If yes, please factor out writing empty Parquet files and update the PR.

Yes, I think I’ll do it today.

@arina-ielchiieva (Member) commented Aug 13, 2019

Regarding this comment:

> Questions:
> TestParquetWriterEmptyFiles#testMultipleWriters now creates several empty files but no longer fails, since reading empty Parquet files is supported. Should I rewrite the comment or remove the test?

I guess you can remove these tests and add new tests to org.apache.drill.exec.store.parquet.TestEmptyParquet.

@arina-ielchiieva (Member) commented Aug 13, 2019

> Ignores all schemas except the last one while writing an empty Parquet file

Please provide an example.

> Does not support empty schemas (e.g. create table .. as select * from empty.json, where empty.json contains {})

What will the behavior be in this case? Failure? No-op?

@oleg-zinovev force-pushed the oleg-zinovev:DRILL-7156-alt branch from c803a09 to 537b3b9 on Aug 13, 2019
@oleg-zinovev (Contributor, Author) commented Aug 13, 2019

> What will the behavior be in this case? Failure? No-op?

The behavior remains unchanged (no-op).

@oleg-zinovev force-pushed the oleg-zinovev:DRILL-7156-alt branch from 537b3b9 to 5c0a54d on Aug 13, 2019
@oleg-zinovev (Contributor, Author) commented Aug 13, 2019

> Please provide an example.

I added a test, TestParquetWriterEmptyFiles#testWriteEmptySchemaChange. As you can see, there is no "a" field in the written schema.

It would probably be correct to write the schema for all empty scans, but that would lead to writing "garbage" empty Parquet files if the scan with data comes at the end of the batch.

@arina-ielchiieva (Member) commented Aug 13, 2019

Is there a way to write a combined schema in this case?

@oleg-zinovev (Contributor, Author) commented Aug 13, 2019

> Is there a way to write a combined schema in this case?

@arina-ielchiieva, thank you for your comment.

I can try to build a combined schema, but:

  • What type should be written for a field if the first BatchSchema contains field "A" with type "bigint" and the second contains field "A" with type "varchar"? Does the last one win?
  • What about union vectors?

@arina-ielchiieva (Member) commented Aug 13, 2019

Good questions. You can investigate how union types are handled now when there is data. Regarding who wins, maybe you can look into org.apache.drill.exec.physical.impl.union.UnionAllRecordBatch to see how it creates a combined schema using precedence rules.
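
For illustration only, a combined schema built with a precedence rule could look like the sketch below. This does not reproduce Drill's actual UnionAllRecordBatch or type-cast rules; the field types and their precedence order are assumptions made just for the example.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not Drill's implementation. Field types and precedence are assumed.
public class SchemaMergeSketch {

  enum FieldType { BIGINT, FLOAT8, VARCHAR }   // higher ordinal = higher precedence (assumption)

  // When the same field appears with two types, the higher-precedence type wins.
  static FieldType merge(FieldType a, FieldType b) {
    return a.ordinal() >= b.ordinal() ? a : b;
  }

  // Combine two schemas (field name -> type), keeping the field order of the left schema first.
  static Map<String, FieldType> combine(Map<String, FieldType> left, Map<String, FieldType> right) {
    Map<String, FieldType> combined = new LinkedHashMap<>(left);
    right.forEach((name, type) -> combined.merge(name, type, SchemaMergeSketch::merge));
    return combined;
  }

  public static void main(String[] args) {
    Map<String, FieldType> first = new LinkedHashMap<>();
    first.put("A", FieldType.BIGINT);
    Map<String, FieldType> second = new LinkedHashMap<>();
    second.put("A", FieldType.VARCHAR);
    // Prints {A=VARCHAR}: precedence decides, rather than "the last schema wins".
    System.out.println(combine(first, second));
  }
}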

@oleg-zinovev (Contributor, Author) commented Aug 13, 2019

> Good questions. You can investigate how union types are handled now when there is data. Regarding who wins, maybe you can look into org.apache.drill.exec.physical.impl.union.UnionAllRecordBatch to see how it creates a combined schema using precedence rules.

@arina-ielchiieva, thanks for your advice.
I will try to build a combined schema by the end of the week.

@arina-ielchiieva changed the title from "DRILL-7156: empty parquet files support" to "DRILL-7156: support empty Parquet files creation" on Aug 13, 2019
@arina-ielchiieva changed the title from "DRILL-7156: support empty Parquet files creation" to "DRILL-7156: Support empty Parquet files creation" on Aug 13, 2019
@oleg-zinovev force-pushed the oleg-zinovev:DRILL-7156-alt branch from 5c0a54d to 575f7c4 on Aug 14, 2019
@oleg-zinovev (Contributor, Author) commented Aug 14, 2019

@arina-ielchiieva
Some time spent debugging the test showed that the last schema contains all fields. The field is added in ProjectRecordBatch#setupNewSchemaFromInput.
In the original version of the test, field A was not added due to a plan optimization: the condition 1=0 was replaced by LIMIT 0.

I can still provide a solution that combines schemas if required.

  }

  private void flushParquetFileWriter() throws IOException {
    assert parquetFileWriter != null;

    codecFactory.release();
  }

  private void createParquetFileWriter() throws IOException {
    assert parquetFileWriter == null;

@arina-ielchiieva (Member) commented Aug 14, 2019

Please remove, there is no need for this check.

        newSchema();
      }
    } catch (Exception e) {
      throw new DrillRuntimeException(e);
    }
  }

- private void flush() throws IOException {
+ private void flush(final boolean cleanUp) throws IOException {

@arina-ielchiieva (Member) commented Aug 14, 2019

Suggested change:
- private void flush(final boolean cleanUp) throws IOException {
+ private void flush(boolean cleanUp) throws IOException {

test("CREATE TABLE dfs.tmp.%s AS %s", outputFileName, query);

// Make sure that only 1 parquet file was created
Assert.assertEquals(1, outputFile.list((dir, name) -> name.endsWith("parquet")).length);

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Aug 14, 2019

Member

Please use static import.
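
For reference, the requested static-import form looks like this (standard JUnit 4 Assert, shown as a fragment matching the snippet above):

import static org.junit.Assert.assertEquals;

    // Make sure that only 1 parquet file was created
    assertEquals(1, outputFile.list((dir, name) -> name.endsWith("parquet")).length);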

      .unOrdered()
      .sqlQuery("select * from dfs.tmp.%s", outputFileName)
      .schemaBaseLine(expectedSchema)
      .go();
  }

  @Test // see DRILL-2408

@arina-ielchiieva (Member) commented Aug 14, 2019

Please remove the "see DRILL-2408" references in the class; they make no sense now.

runSQL("alter session set `planner.slice_target` = " + ExecConstants.SLICE_TARGET_DEFAULT);
}
// Make sure that only 1 parquet file was created
Assert.assertEquals(1, outputFile.list((dir, name) -> name.endsWith("parquet")).length);

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Aug 14, 2019

Member

Static import.

  public void testEmptyFileSchema() throws Exception {
    final String outputFileName = "testparquetwriteremptyfiles_testemptyfileschema";

    test("CREATE TABLE dfs.tmp.%s AS SELECT * FROM cp.`employee.json` WHERE 1=0", outputFileName);

@arina-ielchiieva (Member) commented Aug 14, 2019

Please replace the select from a JSON file with a select from Parquet; we need to test that the schema is created correctly from an already known schema.
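
One possible shape for the revised test, shown only as a sketch: the Parquet source path below is hypothetical (any Parquet-backed table with a known schema would do), while @Test, test(), and the dfs.tmp workspace are the ones already used in this class.

  // Sketch only: the source path is a hypothetical Parquet resource, not an actual file name.
  @Test
  public void testEmptyFileSchema() throws Exception {
    final String outputFileName = "testparquetwriteremptyfiles_testemptyfileschema";

    // Selecting from a Parquet source gives the writer an already known schema;
    // WHERE 1=0 ensures no rows are written, so the output carries only schema + footer.
    test("CREATE TABLE dfs.tmp.%s AS SELECT * FROM cp.`parquet/some_known_schema.parquet` WHERE 1=0", outputFileName);
  }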

@@ -122,6 +122,8 @@
  private PrimitiveTypeName logicalTypeForDecimals;
  private boolean usePrimitiveTypesForDecimals;

+ private boolean empty = true;

@arina-ielchiieva (Member) commented Aug 14, 2019

Please add a comment describing the purpose of this flag.

@arina-ielchiieva (Member) commented Aug 14, 2019

> @arina-ielchiieva
> Some time spent debugging the test showed that the last schema contains all fields. The field is added in ProjectRecordBatch#setupNewSchemaFromInput.
> In the original version of the test, field A was not added due to a plan optimization: the condition 1=0 was replaced by LIMIT 0.
>
> I can still provide a solution that combines schemas if required.

In this case we don't need a schema merge.

@oleg-zinovev (Contributor, Author) commented Aug 14, 2019

@arina-ielchiieva, thanks for the review. Fixed.

@oleg-zinovev force-pushed the oleg-zinovev:DRILL-7156-alt branch from 6a3f023 to 77f9c79 on Aug 14, 2019

@Category({ParquetTest.class, UnlikelyTest.class})
public class TestParquetWriterEmptyFiles extends BaseTestQuery {

  @BeforeClass
  public static void initFs() throws Exception {
    updateTestCluster(3, null);
    dirTestWatcher.copyResourceToRoot(Paths.get("schemachange"));
    dirTestWatcher.copyResourceToRoot(Paths.get("parquet", "empty"));
  }

  @Test // see DRILL-2408

@arina-ielchiieva (Member) commented Aug 14, 2019

Please remove "see DRILL-2408".

@oleg-zinovev (Author, Contributor) commented Aug 14, 2019

My bad. Done.

@oleg-zinovev force-pushed the oleg-zinovev:DRILL-7156-alt branch from 77f9c79 to 506a679 on Aug 14, 2019
@@ -122,6 +122,9 @@
  private PrimitiveTypeName logicalTypeForDecimals;
  private boolean usePrimitiveTypesForDecimals;

+ /** Whether no rows was written. */

@arina-ielchiieva (Member) commented Aug 14, 2019

Is used to ensure that an empty Parquet file will be written if no rows were provided.

  }

- @Test // see DRILL-2408
+ @Test
  public void testSimpleEmptyFileSchema() throws Exception {

@arina-ielchiieva (Member) commented Aug 14, 2019

This test is redundant since testComplexEmptyFileSchema covers both cases.

@arina-ielchiieva (Member) commented Aug 14, 2019

Also we need to add a test where we select from a non-empty Parquet file but the filter condition eliminates all rows, similar to what you have for JSON.

@arina-ielchiieva (Member) commented Aug 14, 2019

@oleg-zinovev thanks for making the changes, a couple of minor comments are left.

@oleg-zinovev force-pushed the oleg-zinovev:DRILL-7156-alt branch from 506a679 to 6814da8 on Aug 15, 2019
@oleg-zinovev (Contributor, Author) commented Aug 15, 2019

@arina-ielchiieva, thanks for the review. Done.

@arina-ielchiieva (Member) commented Aug 15, 2019

Ran all tests on the test cluster, all passed. LGTM, +1
@oleg-zinovev thanks for making the changes.

@asfgit closed this in ef93515 on Aug 27, 2019