-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Description
Serialization of the LocalInputSourc object converts relative file paths to absolute paths, changing the meaning of the MSQ ingest query.
Affected Version
Latest 25.0.0-SNAPSHOT
Description
MSQ introduces the extern function to specify an external data source. One of the possible input source is LocalInputSource, one of the properties defined as:
@JsonProperty
@JsonInclude(JsonInclude.Include.NON_EMPTY)
public List<File> getFiles()
{
return files;
}Unfortunately, it seems that Jackson does not take the actual value of the File object, but rather the absolute path. Doing so completely changes the meaning of the query.
This was discovered in writing a unit test for the catalog extensions to MSQ. Here is the definition used:
protected final ExternalDataSource localDataSource = new ExternalDataSource(
new LocalInputSource(
new File("/tmp"),
"*.csv",
Arrays.asList(new File("foo.csv"), new File("bar.csv"))
),
...This is then run though the existing CalciteIngestionDmlTest.externSql() method which returns:
TABLE(extern(U&'\007B\0022type\0022\003A\0022local\0022\002C\0022baseDir\0022\003A\0022\002Ftmp\0022\002C\0022filter\0022\003A\0022\002A\002Ecsv\0022\002C\0022files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv\0022\002C\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Fbar\002Ecsv\0022\005D\007D', U&'\007B\0022type\0022\003A\0022csv\0022\002C\0022columns\0022\003A\005B\0022x\0022\002C\0022y\0022\002C\0022z\0022\005D\007D', U&'\005B\007B\0022name\0022\003A\0022x\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022y\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022z\0022\002C\0022type\0022\003A\0022LONG\0022\007D\005D'))
Unfortunately, the above is hard to read. But, notice this excerpt:
files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv
That indicates that Jackson has included the absolute path (relative to my Druid build directory) in the list of files.
As a result, query validation failse:
expected:
<ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
baseDir="/tmp",filter=*.csv,files=[foo.csv, bar.csv]}, ...
but was:
<ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
baseDir="/tmp",filter=*.csv,files=[/Users/paul/git/druid/sql/foo.csv, /Users/paul/git/druid/sql/bar.csv]}, ...
Impact
Though the above was found during unit testing of a MSQ table function, the issue very likely affects all users who use the local input source. Anytime we serialize that object (such as creating the MSQ controller task), the meaning of the files argument will change.
Suggested Fix
The correct fix is probably to change the baseDir and files parameters to work with Strings rather than with File objects. This would be a breaking change for anyone using the code API, but should be transparent to anyone using the JSON form. One mediation is to provide a second constructor that is backward compatible, while changing the JSON versions to work with String objects.
Workaround
The workaround is to not use the baseDir property and always include the absolute path in the files argument. This workaround was used to allow the unit test in question to pass, but will be surprising to users who read the documentation.