Skip to content

Warning about RemoveOrphanFiles#1466

Merged
rdblue merged 1 commit intoapache:masterfrom
RussellSpitzer:RemoveOrphansWarning
Sep 16, 2020
Merged

Warning about RemoveOrphanFiles#1466
rdblue merged 1 commit intoapache:masterfrom
RussellSpitzer:RemoveOrphansWarning

Conversation

@RussellSpitzer
Copy link
Member

@rdblue here is the warning we discussed in the ASF Slack

@rdblue rdblue merged commit da16227 into apache:master Sep 16, 2020
@rdblue
Copy link
Contributor

rdblue commented Sep 16, 2020

Thanks for adding this! I merged it and will redeploy the docs.

What do you think the solution for this is? Should we have a callback to rewrite file paths to match?

@RussellSpitzer
Copy link
Member Author

I guess that may be the correct answer, if you change authorities you should probably re-write all old path information?

We were tossing around other ideas of just comparing scheme + path, but I keep imagining HDFS-like implementations which put metadata in other fields. Like one could imagine file://path#someOtherMetadataThatIsUserSpecific screwing everything up again ...

@RussellSpitzer RussellSpitzer deleted the RemoveOrphansWarning branch September 16, 2020 18:33
@aokolnychyi
Copy link
Contributor

@rdblue @RussellSpitzer @manishmalhotrawork It sounds reasonable to me to just look at scheme and path parts of URI. That way, we should be safe w.r.t. changing authorities or query params (is it even possible?). I think we definitely need a fix for this.

@aokolnychyi
Copy link
Contributor

@rdblue @RussellSpitzer @manishmalhotrawork Right now, we have a UDF to produce file names that we use in the join condition. What about having a UDF that would take 2 strings (one actual location and one location referenced in the metadata) and produce a boolean to indicate whether they match or not. Inside the UDF, we can construct Path from String, call toUri on it, and then call getScheme and getPath. Path parts must match and the scheme in the metadata may be either null or must match the actual one.

@aokolnychyi
Copy link
Contributor

Having a UDF that accepts columns from two relations does not eliminate the cross join.

I guess we have two options:

  • keep the join by file name and replace contains condition with another UDF that would ignore authority
  • replace the existing UDF that produces file names with another UDF that would produce a scheme and a relative path and then use DataFrame operations. That way, we will have only one UDF.
Column pathCond = actualFileDF.col("relative_path").equalTo(validDataFileDF.col("relative_path"));
Column schemeEquality = actualFileDF.col("scheme").equalTo(validDataFileDF.col("scheme")); 
Column schemeCond = validDataFileDF.col("scheme").isNull().or(schemeEquality);
Column joinCond = pathCond.and(schemeCond);

@manishmalhotrawork
Copy link
Contributor

manishmalhotrawork commented Sep 17, 2020

Having a UDF that accepts columns from two relations does not eliminate the cross join.

I guess we have two options:

  • keep the join by file name and replace contains condition with another UDF that would ignore authority
  • replace the existing UDF that produces file names with another UDF that would produce a scheme and a relative path and then use DataFrame operations. That way, we will have only one UDF.
Column pathCond = actualFileDF.col("relative_path").equalTo(validDataFileDF.col("relative_path"));
Column schemeEquality = actualFileDF.col("scheme").equalTo(validDataFileDF.col("scheme")); 
Column schemeCond = validDataFileDF.col("scheme").isNull().or(schemeEquality);
Column joinCond = pathCond.and(schemeCond);

@aokolnychyi thanks for adding dtails.

yeah as our internal discussion and internal PR I have is similar to this.
Though while testing, realized checking name of the scheme can also be troublesome as name of the scheme is configurable like instead of hdfs we can myhdfs and same for S3.

I believe not considering scheme should also be ok, as there should not be case where table location is in HDFS but data its pointing to S3 which needs to be removed or vice versa?

So, either can avoid checking scheme as well, or have a flag to consider that or not.

@manishmalhotrawork
Copy link
Contributor

@rdblue @RussellSpitzer @manishmalhotrawork Right now, we have a UDF to produce file names that we use in the join condition. What about having a UDF that would take 2 strings (one actual location and one location referenced in the metadata) and produce a boolean to indicate whether they match or not. Inside the UDF, we can construct Path from String, call toUri on it, and then call getScheme and getPath. Path parts must match and the scheme in the metadata may be either null or must match the actual one.

@aokolnychyi yeah almost similar the change I did, will be raising PR shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants