Skip to content

JSON serialization of LocalInputSource changes meaning of files #13359

@paul-rogers

Description

@paul-rogers

Serialization of the LocalInputSourc object converts relative file paths to absolute paths, changing the meaning of the MSQ ingest query.

Affected Version

Latest 25.0.0-SNAPSHOT

Description

MSQ introduces the extern function to specify an external data source. One of the possible input source is LocalInputSource, one of the properties defined as:

  @JsonProperty
  @JsonInclude(JsonInclude.Include.NON_EMPTY)
  public List<File> getFiles()
  {
    return files;
  }

Unfortunately, it seems that Jackson does not take the actual value of the File object, but rather the absolute path. Doing so completely changes the meaning of the query.

This was discovered in writing a unit test for the catalog extensions to MSQ. Here is the definition used:

  protected final ExternalDataSource localDataSource = new ExternalDataSource(
      new LocalInputSource(
          new File("/tmp"),
          "*.csv",
          Arrays.asList(new File("foo.csv"), new File("bar.csv"))
      ),
      ...

This is then run though the existing CalciteIngestionDmlTest.externSql() method which returns:

TABLE(extern(U&'\007B\0022type\0022\003A\0022local\0022\002C\0022baseDir\0022\003A\0022\002Ftmp\0022\002C\0022filter\0022\003A\0022\002A\002Ecsv\0022\002C\0022files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv\0022\002C\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Fbar\002Ecsv\0022\005D\007D', U&'\007B\0022type\0022\003A\0022csv\0022\002C\0022columns\0022\003A\005B\0022x\0022\002C\0022y\0022\002C\0022z\0022\005D\007D', U&'\005B\007B\0022name\0022\003A\0022x\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022y\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022z\0022\002C\0022type\0022\003A\0022LONG\0022\007D\005D'))

Unfortunately, the above is hard to read. But, notice this excerpt:

files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv

That indicates that Jackson has included the absolute path (relative to my Druid build directory) in the list of files.

As a result, query validation failse:

expected:

<ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
    baseDir="/tmp",filter=*.csv,files=[foo.csv, bar.csv]}, ...

but was:

<ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
    baseDir="/tmp",filter=*.csv,files=[/Users/paul/git/druid/sql/foo.csv, /Users/paul/git/druid/sql/bar.csv]}, ...

Impact

Though the above was found during unit testing of a MSQ table function, the issue very likely affects all users who use the local input source. Anytime we serialize that object (such as creating the MSQ controller task), the meaning of the files argument will change.

Suggested Fix

The correct fix is probably to change the baseDir and files parameters to work with Strings rather than with File objects. This would be a breaking change for anyone using the code API, but should be transparent to anyone using the JSON form. One mediation is to provide a second constructor that is backward compatible, while changing the JSON versions to work with String objects.

Workaround

The workaround is to not use the baseDir property and always include the absolute path in the files argument. This workaround was used to allow the unit test in question to pass, but will be surprising to users who read the documentation.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions