Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter out external feature columns in consistency job comparison #599

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aarongreen-stripe
Copy link
Collaborator

Summary

Filters out contextual features from external parts from the comparison dataframe before computing consistency metrics with the logged data.

Why / Goal

The logged data from the online chronon system does not include the contextual features and the consistency job will fail if you don't filter out these features first.

Test Plan

  • Integration tested

Checklist

  • Documentation update

Reviewers

@pengyu-hou
Copy link
Collaborator

@hzding621 this seems legit. Could you take a look? Then we could fix the conflicts

// External parts / contextual features don't get logged in the online data but do appear in the comparison table.
// We need to remove them from the comparison df otherwise the comparison metric computation will fail.
// We need to prepend `ext_contextual_` to the feature name since that's how it appears in the comparison df.
val externalFeatureColumns = joinConf.getExternalFeatureCols.map(col => s"ext_contextual_$col")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the column name logic seems wrong. getExternalFeatureCols currently has no usages in the codebase and doesn't handle prefix correctly.

ext_contextual_ is the prefix for only contextual features which is a special case of external features, the general case external features has prefix based on its source name.

So instead we should use:

Option(joinConf.onlineExternalParts).toSeq.flatMap(_.toScala.map(_.fullName))

@@ -88,10 +88,18 @@ class ConsistencyJob(session: SparkSession, joinConf: Join, endDate: String) ext
if (unfilledRanges.isEmpty) return null
val allMetrics = unfilledRanges.map { unfilled =>
val comparisonDf = tableUtils.sql(unfilled.genScanQuery(null, joinConf.metaData.comparisonTable))
// External parts / contextual features don't get logged in the online data but do appear in the comparison table.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the root cause description might be wrong.

External / contextual parts do get logged in the online data and show up in the log table. They also show up in the comparison table but the values will be NULL.

However, currently CompareBaseJob.compare will only compare columns that appear both on the left and on the right. In the current code, we pass in loggedDfNoExternalCols to CompareBaseJob.compare, and therefore already remove the external / contextual parts. So IIUC the current code shouldn't fail for this scenario.

Copy link
Collaborator

@hzding621 hzding621 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plz see my other comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants