
DRILL-7978: Fixed Width Format Plugin #2282

Draft · wants to merge 46 commits into master

Conversation

@MFoss19 commented Jul 29, 2021

DRILL-7978: Fixed Width Format Plugin

Description

This PR develops a format plugin to parse fixed-width files.

Fixed Width Text File Definition: https://www.oracle.com/webfolder/technetwork/data-quality/edqhelp/Content/introduction/getting_started/configuring_fixed_width_text_file_formats.htm

Documentation

Users can now create a format configuration to parse fixed width files.

Testing

Unit tests added. More to come.

@luocooong added the documentation, enhancement, and new-storage labels on Jul 29, 2021
@cgivre (Contributor) left a comment:

@MFoss19
Thank you very much for this contribution to Drill, and nice work! I left some review comments. A general comment: please remove any extraneous commented-out code. Also, Drill has an automated style sheet which you can install for IntelliJ or Eclipse.

Also, to pass the CI and merge this, all the files will need to have the Apache license at the beginning of the file. The only exception would be the fwf test files. For those, you're going to have to add an exclusion to the RAT check in the root pom.xml file.

contrib/format-fixedwidth/pom.xml


```java
new RowSetComparison(expected).verifyAndClearAll(results);
System.out.println("Test complete.");
```
Contributor:

Please add tests for:

  • Serialization/Deserialization
  • Compressed file
  • Various invalid schemata. For instance, what happens if you don't have fields defined in the config and try to query the data?

```java
return EasyFormatConfig.builder()
    .readable(true)
    .writable(false)
    .blockSplittable(false)
```
Contributor:

I think this actually might be block-splittable.

@paul-rogers (Contributor) left a comment:

@MFoss19, thanks much for the contribution. This should be very helpful for folks.

My comments are about some of the nuances of creating a storage plugin. I reference a couple of Drill internals. Let me know if you want more details about these.

```java
this.fieldName = fieldName;
this.dateTimeFormat = dateTimeFormat;
this.startIndex = startIndex;
this.fieldWidth = fieldWidth;
```
Contributor:

Since configs are created by hand, having good defaults is helpful. Perhaps:

  • name: required; throw an exception if blank, or if the stripped name is not a valid SQL symbol.
  • type: default to VARCHAR
  • dateTimeFormat: null is allowed, so no default.
  • index: required, must be >= 0.
  • width: either required, or can be optional. If provided must be > 0. (See below.)

For this plugin, we also have to check the set of fields.

  • No two fields can have the same name.
  • Should fields be allowed to overlap?

We could be clever. Scan all fields and sort into ascending order. If a field omits the width, just compute it from the index of this and the next field.
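
A minimal sketch of that validation and width inference, assuming hypothetical getters/setters on FixedwidthFieldConfig (not the PR's actual API):

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FieldConfigValidator {
  // Validate the configured fields and fill in any omitted widths.
  public static void validate(List<FixedwidthFieldConfig> fields, int lineWidth) {
    fields.sort(Comparator.comparingInt(FixedwidthFieldConfig::getIndex));
    Set<String> names = new HashSet<>();
    for (int i = 0; i < fields.size(); i++) {
      FixedwidthFieldConfig field = fields.get(i);
      String name = field.getName() == null ? "" : field.getName().trim();
      if (name.isEmpty()) {
        throw new IllegalArgumentException("Field name is required");
      }
      if (!names.add(name)) {
        throw new IllegalArgumentException("Duplicate field name: " + name);
      }
      if (field.getIndex() < 0) {
        throw new IllegalArgumentException("Field index must be >= 0: " + name);
      }
      // Omitted width: compute it from the start of the next field
      // (or from the line width for the last field).
      if (field.getWidth() <= 0) {
        int end = (i + 1 < fields.size()) ? fields.get(i + 1).getIndex() : lineWidth;
        field.setWidth(end - field.getIndex());
      }
    }
  }
}
```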

```java
private final int startIndex;
private final int fieldWidth;

public FixedwidthFieldConfig(@JsonProperty("dataType") TypeProtos.MinorType dataType,
```
Contributor:

Does it work to use the MinorType here? Is that type set up for Jackson serialization? I don't know the answer, just noting we should double-check to ensure it works OK.

Author:

This was tested by configuring the plugin within the Drill UI (a manual test). I will also add an automated unit test for parsing the JSON. To answer your question: yes, it works to use MinorType here. Jackson can always read in Java enums; MinorType was generated as part of our protobuf, and for this type of data protobuf generates proper Java enums.
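
For reference, a minimal round-trip sketch of that behavior (a standalone check, not code from this PR):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.drill.common.types.TypeProtos;

public class MinorTypeJsonCheck {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Jackson serializes enums by name, so the protobuf-generated
    // MinorType enum round-trips without a custom (de)serializer.
    String json = mapper.writeValueAsString(TypeProtos.MinorType.VARCHAR);
    TypeProtos.MinorType back = mapper.readValue(json, TypeProtos.MinorType.class);
    System.out.println(json + " -> " + back);
  }
}
```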

```java
return EasyFormatConfig.builder()
    .readable(true)
    .writable(false)
    .blockSplittable(false)
```
Contributor:

I'm pretty sure fixed-width files are splittable. If every record resides on a single line, then the file is splittable if we add code that, when the start offset != 0, scans to the next newline. And, on read, read rows until the file position is greater than the block end. See the text file (CSV) plugin for details (though don't follow its implementation; that implementation is rather unique to that one use case).
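
A rough sketch of that split handling, assuming a reader positioned at the split's start offset (names and the newline accounting are illustrative only):

```java
import java.io.BufferedReader;
import java.io.IOException;

// Reads only the rows owned by this block split. A row is owned by the
// split in which its first byte falls.
void readSplit(BufferedReader reader, long startOffset, long blockEnd) throws IOException {
  long pos = startOffset;
  if (startOffset != 0) {
    // The partial first line belongs to the previous split; skip it.
    String partial = reader.readLine();
    pos += (partial == null) ? 0 : partial.length() + 1; // assumes '\n' endings
  }
  String line;
  while (pos <= blockEnd && (line = reader.readLine()) != null) {
    pos += line.length() + 1;
    // parseLine(line, writer);  // hypothetical per-line parse into the row writer
  }
}
```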

```java
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class FixedwidthBatchReader implements ManagedReader<FileSchemaNegotiator> {
```
Contributor:

This uses "EVF V1". Your plugin provides schema information, and is thus a perfect fit for "EVF V2" which can use your schema information to set up the row set loader schema automatically for you.

Contributor:

@paul-rogers Is there an example somewhere of how to use the rowset loader to set up the schema automatically? Is this as simple as checking to see whether the schema is provided and if so, use that?

Contributor:

@cgivre, it's been a while since I looked at that stuff. As I recall, the CSV reader was converted, but it is a bit obscure. The unit tests also show what can be done, IIRC.

Basically, V2 gets schema from multiple sources:

  • Schema info in the plan object (if available in the future)
  • Provided schema (the current interim solution to system-provided schema)
  • Schema implied by the selected columns (for example, a[1] means a has to be an array.)
  • Schema defined by the data source itself (as in Parquet, or in CSV where you can choose the columns array)
  • Schema discovered by the traditional "schema on read"

The V2 SchemaNegotiator provides the needed methods. V2 will then come up with the final schema. V2 lets you read the entire row (columns a, b, c, say) even if the query wants only column a: V2 silently discards the unwanted columns.

```java
dateTimeFormat = field.getDateTimeFormat();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern(dateTimeFormat, Locale.ENGLISH);
try {
  switch (dataType) {
```
Contributor:

This is OK, but slow because of the switch. There is a set of field converter classes which can handle the string-to-whatever conversions. With those, there is a direct call per field (in the inner loop) from reading the field, to converting it, to writing it into value vectors. The field-specific-type switch is done only once, at setup time.

These converters are used in the CSV reader when a schema is provided. I can dig up more examples if helpful.

Contributor:

To be clearer: what we want is to minimize the per-field work. Ideally, we'd set up an array of column converters so the loop looks like:

```java
for (int i = 0; i < config.getFields().size(); i++) {
  FixedwidthFieldConfig field = config.getFields().get(i);
  value = line.substring(field.getIndex() - 1, field.getIndex() + field.getWidth() - 1);
  writer.scalar(i).setString(value);
}
```


However, the above requires the user to specify the config. For every file. On every Drill cluster. Better, if the schema is not given, infer it from the first (few) row(s). Then, set up runtime field objects:

```java
for (FieldReader fieldReader : fieldReaders) {
  fieldReader.load(line);
}
```

The field reader:

  • Remembers the field's start offset and width
  • Pulls out the string, handling a truncated line
  • Calls a cached column writer

@cgivre mentioned the idea of a column converter. There is a defined set for the common cases. The underlying mechanism sets them up for you. (V2 makes it simpler.) That way, a call to setString() directly invokes the thing that converts from string and writes the resulting value: no per-column switch needed.
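
A minimal sketch of such a runtime field object (a hypothetical helper, not the PR's code), assuming a ScalarWriter cached at setup time:

```java
import org.apache.drill.exec.vector.accessor.ScalarWriter;

// One instance per configured field, built once at open time, so the
// per-row inner loop does no type dispatch.
class FieldReader {
  private final int start;
  private final int width;
  private final ScalarWriter colWriter; // cached column writer

  FieldReader(int start, int width, ScalarWriter colWriter) {
    this.start = start;
    this.width = width;
    this.colWriter = colWriter;
  }

  void load(String line) {
    // Clamp bounds so a truncated line yields a short (possibly empty) value.
    int from = Math.min(start, line.length());
    int to = Math.min(start + width, line.length());
    // setString() invokes the column converter that writes the typed value.
    colWriter.setString(line.substring(from, to));
  }
}
```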

lgtm-com bot commented Oct 18, 2021

This pull request introduces 2 alerts when merging 2d17f1b into 0c9451e - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Oct 20, 2021

This pull request introduces 2 alerts when merging 18380ea into f4ea90c - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Oct 21, 2021

This pull request introduces 2 alerts when merging a91be4c into f4ea90c - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Oct 26, 2021

This pull request introduces 2 alerts when merging dc60d28 into b6da35e - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Oct 28, 2021

This pull request introduces 2 alerts when merging 05ae3f1 into 58ced60 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

@luocooong added the new-format label and removed the new-storage label on Nov 1, 2021
```java
TypeProtos.MinorType dataType;
String dateTimeFormat;
String value;
for (FixedwidthFieldConfig field : config.getFields()) {
```
@jnturton (Contributor) commented Nov 4, 2021:

@cgivre @MFoss19 @estherbuchwalter here we are reading column data types from the format config, where we also specify their start and stop offsets. But this format plugin can also accept data types from a provided schema. So my question is: can we drop the data type information from the format config, so that we don't introduce multiple ad-hoc ways of specifying this info? This is genuinely a question because I don't know this subject well, but shouldn't we handle data type specs here exactly the same way we do for CSV (where they can't be provided in the format config, I don't think)?

Contributor:

My original understanding was that the fixed width plugin would work in a similar manner to the log regex reader, where the user provides the schema either in the format config or at query time using the table() function.

Contributor:

What you want to do is to resolve the schema at open time, not when parsing. At open time, you can:

  • Get the schema from the plugin, if provided, and use that as the schema.
  • Else, sample the first line to infer the schema.

Since this is a fixed format, we don't want to rediscover the schema on every line: that costs too much. (Think of the case of reading 100M or 1B rows: optimizing the inner loop is critical.)
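
Inside the batch reader's open() method, that might look roughly like this (inferSchemaFromSample() is a hypothetical helper; the negotiator calls are the EVF ones already used in this PR):

```java
@Override
public boolean open(FileSchemaNegotiator negotiator) {
  // Resolve the schema once, at open time, never per line.
  TupleMetadata schema;
  if (config.getFields() != null && !config.getFields().isEmpty()) {
    schema = buildSchema();                     // schema from the plugin config
  } else {
    schema = inferSchemaFromSample(firstLine);  // hypothetical: sample the first line(s)
  }
  negotiator.tableSchema(schema, true);
  loader = negotiator.build();
  return true;
}
```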

lgtm-com bot commented Nov 5, 2021

This pull request introduces 2 alerts when merging 9d66f91 into 52838ef - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Nov 5, 2021

This pull request introduces 2 alerts when merging 9b95c45 into 52838ef - view on LGTM.com

new alerts:

  • 2 for Unused format argument

@jnturton (Contributor) commented Nov 8, 2021

@MFoss19 @estherbuchwalter following some recent chat with @paul-rogers and my last comment here, how about a reduced format config such as the following? The goal is to get to something terse and consistent with what we do for other text formats.

"fixedwidth": {
  "type": "fixedwidth",
  "extensions": [
    "fwf"
  ],
  "extractHeader": true,
  "trimStrings": true,
  "columnOffsets": [1, 11, 21, 31],
  "columnWidths": [10, 10, 10, 10]
}

Column names and types can already come from a provided schema or from aliasing after calls to CAST(). Incidentally, the settings above can be overridden per query using a provided schema too.

There's also a part of me that wonders whether we could have justified adding our fixed width functionality to the existing delimited text format reader.

@cgivre (Contributor) commented Nov 8, 2021

> @MFoss19 @estherbuchwalter following some recent chat with @paul-rogers and my last comment here, how about a reduced format config such as the following? The goal is to get to something terse and consistent with what we do for other text formats.
>
> ```json
> "fixedwidth": {
>   "type": "fixedwidth",
>   "extensions": [
>     "fwf"
>   ],
>   "extractHeader": true,
>   "trimStrings": true,
>   "columnOffsets": [1, 11, 21, 31],
>   "columnWidths": [10, 10, 10, 10]
> }
> ```
>
> Column names and types can already come from a provided schema or from aliasing after calls to CAST(). Incidentally, the settings above can be overridden per query using a provided schema too.
>
> There's also a part of me that wonders whether we could have justified adding our fixed width functionality to the existing delimited text format reader.

@dzamo I'd respectfully disagree here. In effect, the configuration is providing a schema to the user, similar to the way the logRegex reader works. In this case, the user will get the best data possible if we can include datatypes and field names in the schema, so that they can just do a SELECT * and not have to worry about casting etc.

Let's consider a real world use case: some fixed width log generated by a database. Since the fields may be mashed together, there isn't a delimiter that you can use to divide the fields. You could, however, use the logRegex reader to do this. That point aside for the moment, the way I imagined someone using this was that different configs could be set up and linked to workspaces, such that if a file was in the mysql_logs folder it would use the mysql log config, and if it was in the postgres folder it would use another.

My opinion here is that the goal should be to get the cleanest data to the user as possible without the user having to rely on CASTs and other complicating factors.

@jnturton (Contributor) commented Nov 8, 2021

> Let's consider a real world use case: some fixed width log generated by a database. Since the fields may be mashed together, there isn't a delimiter that you can use to divide the fields. You could, however, use the logRegex reader to do this. That point aside for the moment, the way I imagined someone using this was that different configs could be set up and linked to workspaces, such that if a file was in the mysql_logs folder it would use the mysql log config, and if it was in the postgres folder it would use another.

@cgivre This use case would still work after two CREATE SCHEMA statements to set the names and data types, wouldn't it? The schemas would be applied to every subsequent query.

> My opinion here is that the goal should be to get the cleanest data to the user as possible without the user having to rely on CASTs and other complicating factors.

Let's drop the CASTs, those aren't fun. So we're left with different ways a user can specify column names and types.

  1. With a CREATE SCHEMA against a directory.
  2. With an inline schema to a table function.
  3. With some plugin-specific format config that works for this plugin but generally not for others.

Any one requires some effort; any one gets you to SELECT * returning nice results (disclaimer: is this claim I'm making actually true?), which is super valuable. So shouldn't we avoid the quirky option 3 and commit to options 1 and 2 consistently wherever we can?

lgtm-com bot commented Nov 17, 2021

This pull request introduces 2 alerts when merging f9e96fe into 42e7b77 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Nov 18, 2021

This pull request introduces 2 alerts when merging 428a512 into 14d96d1 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Nov 23, 2021

This pull request introduces 2 alerts when merging 428a2dd into 17f3654 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Nov 24, 2021

This pull request introduces 2 alerts when merging 881d465 into 38d0c1d - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com bot commented Nov 24, 2021

This pull request introduces 2 alerts when merging 56d8f6e into 38d0c1d - view on LGTM.com

new alerts:

  • 2 for Unused format argument

@cgivre (Contributor) left a comment:

@MFoss19 @estherbuchwalter I added some review comments. Please make sure the unit tests pass and also that there are no code style violations before pushing to GitHub.

```java
.dataReadError(e)
.message("Failed to open input file: {}", split.getPath().toString())
.addContext(errorContext)
.addContext(e.getMessage())
```
Contributor:

You can remove this second line. Also, please add e.getMessage() to the message line.

Contributor:

I like that second line (if you mean the message call). Another solution is to change message() to addContext(). This way, we preserve the message from the actual error, and add context to explain the source of the error. Then, as Charles suggested, we don't need the addContext(e.getMessage()) bit.

```java
@Override
public boolean next() { // Use loader to read data from file to turn into Drill rows
  String line;
  RowSetLoader writer = loader.writer();
```
Contributor:

This line should be in the open() method.

```java
.dataReadError(e)
.message("Failed to read input file: {}", split.getPath().toString())
.addContext(errorContext)
.addContext(e.getMessage())
```
Contributor:

For the error message, you don't need to have multiple addContext() calls. The main thing is to pass the errorContext. I would add the e.getMessage() to the message() call.

Contributor:

Here and elsewhere.

Contributor:

See explanation above. Ideally:

```java
throw UserException
  .dataReadError(e)
  .addContext("Failed to read input file: {}", split.getPath().toString())
  .addContext(errorContext)
  .addContext("Line Number", lineNum)
  .build(logger);
```

Thanks for adding the line number: nice touch.

```java
RowSetLoader writer = loader.writer();

try {
  line = reader.readLine();
```
Contributor:

Why not include this in the loop?

```java
.addContext("Line Number", lineNum)
.build(logger);
}
return writer.limitReached(maxRecords); // returns false when maxRecords limit has been reached
```
Contributor:

The next() method needs some work. Really, this should be called nextBatch(), as the next() method returns true when there is more data to read, false if not.

```java
@Override
public boolean next() {
  while (!rowWriter.isFull()) {
    if (!processNextLine()) {
      return false;
    }
  }
  return true;
}
```

This method will iterate through the batch of data, and when the rowWriter is full (i.e. the batch is full) it will stop reading, BUT the method will return true because there is more data to read. The limit is pushed down in the processNextLine() method.
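
A minimal sketch of what processNextLine() might look like with the limit pushed down (a hypothetical shape, not the PR's code):

```java
private boolean processNextLine() throws IOException {
  String line = reader.readLine();
  if (line == null) {
    return false;                 // end of file: no more data
  }
  rowWriter.start();
  parseLine(line, rowWriter);
  rowWriter.save();
  // Return false (stop reading) once the pushed-down LIMIT is satisfied.
  return !rowWriter.limitReached(maxRecords);
}
```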

```java
lineNum = 0;
try {
  fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
  negotiator.tableSchema(buildSchema(), true);
```
Contributor:

Here's where you can check to see whether the user provided a schema or not. You could do something like this:

```java
if (negotiator.hasProvidedSchema()) {
  TupleMetadata providedSchema = negotiator.providedSchema();
  // Build column writer array
  negotiator.tableSchema(finalSchema, true);
} else {
  negotiator.tableSchema(buildSchema(), true);
}
```

Contributor:

This is done internally in V2. V2 does it that way because it became clear that there is no reason for every reader to have to know to do this same pattern.

```java
@Override
public void close() {
  if (fsStream != null){
    AutoCloseables.closeSilently(fsStream);
```
Contributor:

This line should be out of the if statement.

Contributor:

The reason it can be out of the if statement is that the method itself handles null values. Otherwise, the code would be fine as it is if it called close() directly.


```java
return EasyFormatConfig.builder()
    .readable(true)
    .writable(false)
    .blockSplittable(false) // Change to true
```
Contributor:

Change to true.

Contributor:

But only if the additional work described above is done.




```java
writer.start();
parseLine(line, writer);
writer.save();
line = reader.readLine();
```
Contributor:

You can lose a line here: read the next line, but the writer is full, so we discard it. As @cgivre suggests:

```java
while (!writer.isFull()) {
  String line = reader.readLine();
  if (line == null) {
    break;
  }
  writer.start();
  parseLine(line, writer);
  writer.save();
}
```



```java
.addNullable("Letter", TypeProtos.MinorType.VARCHAR)
.buildSchema();
RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
    .addRow(567, Instant.parse("2021-02-10T15:30:27.00Z"), LocalDate.parse("2021-02-10"), "test")
```
Contributor:

Generally, for something like this, you can use a LIMIT 2 and only check the first row or two. The code is not going to change column order after the 10th row!
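
A minimal sketch of that approach, reusing the fixtures already in this test (the query path is illustrative):

```java
// Check only the first two rows rather than asserting on the whole file.
String sql = "SELECT * FROM dfs.`fwf/test.fwf` LIMIT 2";
RowSet results = client.queryBuilder().sql(sql).rowSet();

RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
    .addRow(567, Instant.parse("2021-02-10T15:30:27.00Z"), LocalDate.parse("2021-02-10"), "test")
    // ... second expected row ...
    .build();
new RowSetComparison(expected).verifyAndClearAll(results);
```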

```
77.77 yzzz 777 06-05-7777 05:11:11 06-05-7777T05:11:11.11Z
88.88 aabb 888 07-06-8888 06:22:22 07-07-8888T06:22:22.22Z
88.88 aabb 888 07-06-8888 06:22:22 07-07-8888T06:22:22.22Z

```
Contributor:

Is the blank line intentional? It's kind of hard to document data files. You could allow comments (lines that start with #, say, but there is always someone who has valid data that starts with #). Another idea would be to add an "explanation.txt" file that explains the purpose of each of the data files.

Contributor:

Also, I'd hoped to see all manner of invalid files:

  • Truncated line (not just an entirely blank line)
  • String where a number should be.
  • Badly formatted date or time.

Such files will ensure that the error handling works and will raise that perpetual question: if we're a hundred million lines into a fixed-width file, and we hit an error, should we ignore that line and move to the next, or should we fail the query?

Author:

It is intentional to test what would happen if there was a blank line in a user input file. Good idea to add the explanation.txt file and additional invalid files.

Megan Foss and others added 27 commits March 22, 2022 12:19
- Simplified FieldConfig variables
- Added compressed file test
- Added unit test for explicit column references
- Modified close() to include AutoCloseables
- Added Long data type
- Added Decimal data type - not fully implemented
- …eFormat when not appropriate, started adding methods to perform field name verification (not complete).
- …licates. Modified tests to enable testing of new constructor.
- …ing two setters in field config to set default value for data types and calculate/set width based on indices.
- Commented out a few files
lgtm-com bot commented Mar 22, 2022

This pull request introduces 1 alert when merging bf6a16c into 4e97f5c - view on LGTM.com

new alerts:

  • 1 for Unused format argument

Labels: documentation, enhancement, new-format

7 participants