Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SL-1704] [SL-1703] [Feature] Enable fill_nulls_with and join_to_timespine on metric spec #1031

Open
3 tasks done
courtneyholcomb opened this issue Feb 14, 2024 · 6 comments
Assignees
Labels
enhancement New feature or request Low priority Created by Linear-GitHub Sync Metricflow Created by Linear-GitHub Sync triage Tasks that need to be triaged

Comments

@courtneyholcomb
Copy link
Contributor

courtneyholcomb commented Feb 14, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing metricflow functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, fill_nulls_with and join_to_timespine are only available on metric input measures. This can result in unwanted nulls if the measure is nested in a ratio or derived metric and then joined to other metrics. If we add those options to the metric config, users will be able to resolve that issue and also have more granular control over when nulls are filled.

This means that after the metric is calculated, the metric subquery can be joined to a time spine subquery to fill missing dates, and then coalesce nulls to the integer value specified in the fill_nulls_with property. We also may consider allowing variables and other expressions to be used to fill nulls instead of only integers.

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

SL-1703

@courtneyholcomb courtneyholcomb added enhancement New feature or request triage Tasks that need to be triaged linear labels Feb 14, 2024
@courtneyholcomb courtneyholcomb changed the title [Feature] Enable fill_nulls_with and join_to_timespine on metric spec [SL-1704] [Feature] Enable fill_nulls_with and join_to_timespine on metric spec Feb 14, 2024
@courtneyholcomb courtneyholcomb changed the title [SL-1704] [Feature] Enable fill_nulls_with and join_to_timespine on metric spec [SL-1703] [Feature] Enable fill_nulls_with and join_to_timespine on metric spec Feb 14, 2024
@courtneyholcomb courtneyholcomb changed the title [SL-1703] [Feature] Enable fill_nulls_with and join_to_timespine on metric spec [SL-1704] [SL-1703] [Feature] Enable fill_nulls_with and join_to_timespine on metric spec Feb 14, 2024
@courtneyholcomb courtneyholcomb closed this as not planned Won't fix, can't repro, duplicate, stale Feb 14, 2024
@courtneyholcomb courtneyholcomb added the Metricflow Created by Linear-GitHub Sync label Feb 15, 2024
@courtneyholcomb courtneyholcomb self-assigned this Feb 15, 2024
@courtneyholcomb courtneyholcomb added the Low priority Created by Linear-GitHub Sync label Feb 15, 2024
@siljamardla
Copy link

siljamardla commented Mar 22, 2024

Also reported #1098, which is basically a duplicate for this.

Adding some business context, to push the priority up.

Imagine this very common scenario:
total_revenue =
revenue_from_business_line1 (a business line with regular incremental revenue)
+ revenue_from_business_line2 (a business line with rare but big revenue, so has nulls in some periods!)
+ revenue_from_business_line3 (very seasonal business, so has nulls in some periods!)
total_cost = cost_from_business_line1 + cost_from_business_line2 + headquarter_costs
total_net (income/cost) = total revenue - total cost
cost_to_revenue_rate = total_cost/total_revenue

Revenue and costs (especially for different business lines) come from different tables. Not an uncommon scenario.

Now, we want time based reporting (daily, weekly, monthly metric values). As soon as ANY of the input metrics have a NULL in them for this particular period, the whole thing will be NULL. Total revenue is NULL because we didn't have any revenue to show from business line 3 !?!

No workaround for this feature missing, as far as I know.

This feature should have the highest priority! It's a significant blocker for real world use.

@Jstein77
Copy link
Contributor

@siljamardla I think there is an easy workaround for this. You can add coalesce statements to the expr: for derived metrics. Going back to your example in #1098. I created another metric which is the sum of metric8 and metric7 and added a coalesce statement in the expression

  - name: metric10
    label: metric10
    type: derived
    type_params:
      metrics:
        - name: metric7
        - name: metric8
      expr: coalesce(metric7,0) + coalesce(metric8,0)

Metric10 is able to safely sum metric7 and metric8 even when there are nulls present.

METRIC_TIME__DAY CALENDAR_DATE_LOCAL USER_ID METRIC1 METRIC7 METRIC8 METRIC10
2024-01-01 2024-01-01 1.00 1 3.00 nan 3
2024-01-01 2024-01-01 2.00 1 3.00 3.00 6
2024-01-01 2024-01-01 3.00 0 nan 3.00 3
2024-01-02 2024-01-02 1.00 1 3.00 nan 3
2024-01-03 2024-01-03 2.00 0 nan 3.00 3

Definitely agree that it would be better to support plumbing the null values all the way down to the input metric, but with this approach we are still able to accurately calculate the metric value for metric 10 for user id = 1 on 2024-01-01. This makes this feature feel less urgent to me since there is a workaround.

I believe adding a coalesce to the expr should allow you to calculate the total_net metric you described as well. Let me know if there is a case i'm not thinking about.

@siljamardla
Copy link

siljamardla commented Mar 26, 2024

Thanks @Jstein77, this looks promising.

I'm also extending this to cover the ratio metric case:

  - name: metric11
    label: metric11
    type: derived # instead of ratio, because we want to do manual null handling
    type_params:
      metrics:
        - name: metric7
        - name: metric8
      expr: COALESCE(metric7,0) / NULLIF(metric8,0)

I did try adding the coalesce into the ratio metric definition, but that throws an error.

  - name: metric12
    label: metric12
    type: ratio
    type_params:
      numerator: COALESCE(metric7,0)
      denominator: metric8

That's why I've used derived also for ratio.

@siljamardla
Copy link

siljamardla commented Mar 26, 2024

After some additional testing I have to come back here.

I have made updates to my metric7, metric8 and metric9. I've also added your metric10.

  # Derived metrics

  - name: metric7
    label: metric7
    type: derived
    type_params:
          expr: COALESCE(metric1_for_metric7,0) + COALESCE(metric2_for_metric7,0) + COALESCE(metric3_for_metric7,0)
          metrics:
            - name: metric1
              alias: metric1_for_metric7
            - name: metric2
              alias: metric2_for_metric7
            - name: metric3
              alias: metric3_for_metric7

  - name: metric8
    label: metric8
    type: derived
    type_params:
          expr: COALESCE(metric4_for_metric8,0) + COALESCE(metric5_for_metric8,0) + COALESCE(metric6_for_metric8,0)
          metrics:
            - name: metric4
              alias: metric4_for_metric8
            - name: metric5
              alias: metric5_for_metric8
            - name: metric6
              alias: metric6_for_metric8

  - name: metric9
    label: metric9
    type: derived
    type_params:
          expr: COALESCE(metric7_for_metric9,0) / NULLIF(COALESCE(metric8_for_metric9,0),0)
          metrics:
            - name: metric7
              alias: metric7_for_metric9
            - name: metric8
              alias: metric8_for_metric9

  - name: metric10
    label: metric10
    type: derived
    type_params:
          expr: COALESCE(metric7_for_metric9,0) + COALESCE(metric8_for_metric9,0)
          metrics:
            - name: metric7
              alias: metric7_for_metric9
            - name: metric8
              alias: metric8_for_metric9

Querying all of them together:

mf query --metrics metric1,metric2,metric3,metric4,metric5,metric6,metric7,metric8,metric9,metric10\
 --group-by calendar_date_local,user_id\
 --order calendar_date_local,user_id

Will return:

calendar_date_local user_id metric1 metric2 metric3 metric4 metric5 metric6 metric7 metric8 metric9 metric10
2024-01-01 1 1 1 1 0 0 0 3.00 nan 3
2024-01-01 2 1 1 1 1 1 1 3.00 3.00 1.00 6
2024-01-01 3 0 0 0 1 1 1 nan 3.00 0.00 3
2024-01-02 1 1 1 1 0 0 0 3.00 nan 3
2024-01-03 2 0 0 0 1 1 1 nan 3.00 0.00 3

Initially I was going to suggest there's a difference between derived metrics defined on top of simple vs other derived metrics.

Notice that metric7 and metric8 are based on simple metrics and they still display nulls, while metric9 and metric10 are based on derived metrics and produce the expected results.

However, when I review the compiled code, I think the metric type used in the derived metric is irrelevant. The beginning of the compiled query reads:

SELECT
  COALESCE(subq_4.calendar_date_local, subq_9.calendar_date_local, subq_15.calendar_date_local, subq_21.calendar_date_local, subq_35.calendar_date_local, subq_49.calendar_date_local) AS calendar_date_local
  , COALESCE(subq_4.user_id, subq_9.user_id, subq_15.user_id, subq_21.user_id, subq_35.user_id, subq_49.user_id) AS user_id
  , COALESCE(MAX(subq_4.metric1), 0) AS metric1
  , COALESCE(MAX(subq_4.metric2), 0) AS metric2
  , COALESCE(MAX(subq_4.metric3), 0) AS metric3
  , COALESCE(MAX(subq_9.metric4), 0) AS metric4
  , COALESCE(MAX(subq_9.metric5), 0) AS metric5
  , COALESCE(MAX(subq_9.metric6), 0) AS metric6
  , MAX(subq_15.metric7) AS metric7
  , MAX(subq_21.metric8) AS metric8
  , MAX(subq_35.metric9) AS metric9
  , MAX(subq_49.metric10) AS metric10
FROM ...

All the null handling happens in subq_15, subq_21, subq_35 and subq_49. However, we add more null rows in this stage of FULL OUTER JOINing them together, but without null handling. I will try to produce an updated dataset that will show this. This happens to all the derived metrics.

@siljamardla
Copy link

siljamardla commented Mar 26, 2024

I have added a third table (in this table we have data for user_id 4 and 5, nothing for 1, 2 and 3):

id user_id calendar_date_local value33
1 4 2024-01-01 1
2 5 2024-01-03 1

I have defined a metric on that third table:

  - name: metric33
    label: metric33
    type: simple
    type_params:
      measure:
        name: sum_of_value33
        fill_nulls_with: 0
        join_to_timespine: true

I am now querying all of the above plus the new metric:

mf query --metrics metric1,metric2,metric3,metric4,metric5,metric6,metric7,metric8,metric9,metric10,metric33\
 --group-by calendar_date_local,user_id\
 --order calendar_date_local,user_id

and voila, we see the null problem also in metric10:

calendar_date_local user_id metric1 metric2 metric3 metric4 metric5 metric6 metric7 metric8 metric9 metric10 metric33
2024-01-01 1 1 1 1 0 0 0 3.00 nan 3.00 0
2024-01-01 2 1 1 1 1 1 1 3.00 3.00 1.00 6.00 0
2024-01-01 3 0 0 0 1 1 1 nan 3.00 0.00 3.00 0
2024-01-01 4 0 0 0 0 0 0 nan nan nan 1
2024-01-02 1 1 1 1 0 0 0 3.00 nan 3.00 0
2024-01-03 2 0 0 0 1 1 1 nan 3.00 0.00 3.00 0
2024-01-03 5 0 0 0 0 0 0 nan nan nan 1

However, it looks like there are no errors in the values any more. Now it's just about rendering the output.

How to fix it?

It should be relatively simple to add a fill_nulls_with config to derived metrics and then use it in the outer most SELECT. This config would only handle the final output while the null handling within a derived metric calculation would remain a part of the derived metric definition, as introduced in the workaround. In fact, the workaround is starting to look like part of the actual solution.

For addition and subtraction metrics (metric7, metric8 and metric10) I would set the fill_nulls_with to 0, while for division I would keep the nulls and for multiplication it might depend on business logic.

What do you think? I'm wondering if there are cases not covered by this logic...

Current workarounds

Whenever a user uses a MetricFlow query similar to this, they have to process the output by manually wrapping or not wrapping the output in COALESCE depending on the metric type (that they have to know).

@Jstein77
Copy link
Contributor

@siljamardla thanks for providing such detailed test cases! Metric10 will still have null values because of the full outer join, but i believe the actual metric value calculation is correct if you use the workaround.

I agree with your proposed fix. Adding fill_nulls_with to metric inputs will allow us to correctly fill any null values in the output table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Low priority Created by Linear-GitHub Sync Metricflow Created by Linear-GitHub Sync triage Tasks that need to be triaged
Projects
None yet
Development

No branches or pull requests

3 participants