Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use DB provided analytical functions #538

Closed
klausb opened this issue Mar 24, 2020 · 9 comments
Closed

How to use DB provided analytical functions #538

klausb opened this issue Mar 24, 2020 · 9 comments
Labels
question The issue is a question. Please use Stack Overflow for questions.

Comments

@klausb
Copy link

klausb commented Mar 24, 2020

How can I use analytical functions from e.g. Vertica alongside with raw data access in a schema?
For example, I want to expose a calculated metric from a moving average computed in the DB, like this

SELECT ts, bid, AVG(bid) 
   OVER(ORDER BY ts
       RANGE BETWEEN INTERVAL '40 seconds' 
       PRECEDING AND CURRENT ROW)
FROM ticks 
WHERE stock = 'abc' 
GROUP BY bid, ts 
ORDER BY ts;

See here: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/SQLAnalytics/CalculatingMovingAverage.htm

Whenever I create a calculated metric and use this snippet

AVG(bid) 
   OVER(ORDER BY ts
       RANGE BETWEEN INTERVAL '40 seconds' 
       PRECEDING AND CURRENT ROW)

I see SQL errors. I need to add the corresponding GROUP BY to the end. But how?

@paveltiunov
Copy link
Member

Hey @klausb ! You can define it as a number measure:

averageBid: {
  sql: `AVG(${CUBE}.bid) 
   OVER(ORDER BY ${CUBE}.ts
       RANGE BETWEEN INTERVAL '40 seconds' 
       PRECEDING AND CURRENT ROW)`,
  type: `number`
}

@paveltiunov paveltiunov added the question The issue is a question. Please use Stack Overflow for questions. label Mar 24, 2020
@klausb
Copy link
Author

klausb commented Mar 25, 2020

Hey @paveltiunov ,

I tried that, but it does not work because of the mentioned error.
The above analytical function requires that the SQL statement contains a GROUP BY bid at the end, which is not there in the generated SQL.

My example is a variation and fails because of the same issue

cube(`ScopeCpu`, {
  sql: `SELECT * FROM store.scope_cpu`,
    joins: {  
  },
  
  measures: {
    count: {
      type: `count`,
      drillMembers: []
    },
    moveAvg: {
      sql: `AVG(${CUBE}.BYCPU_CPU_SYS_MODE_UTIL) 
              OVER(ORDER BY ${ts} 
                RANGE BETWEEN INTERVAL '40 seconds' 
                PRECEDING AND CURRENT ROW)`,
      type: `number`
    },
    sysUtil: {
      sql: `${CUBE}.BYCPU_CPU_SYS_MODE_UTIL`,
      type: `avg`
    },
  },
  
  dimensions: {    
    ts: {
      sql: `to_timestamptz(timestamp_utc)`,
      type: `time`
    }, 
  }
});

Asking for moveAvg using the Cube Playground app gets me this error from the DB:

Error: Error: Column "scope_cpu.BYCPU_CPU_SYS_MODE_UTIL" must appear in the GROUP BY clause or be used in an aggregate function

Any idea how to add this GROUP BY to the resulting statement?

@paveltiunov
Copy link
Member

@klausb Ah. Yep. Sorry. Right. Because it's dimension in fact as it's rolling window. Could you please try to define it as dimension instead?

@klausb
Copy link
Author

klausb commented Mar 26, 2020

@paveltiunov With a dimension I get closer. One problem still remains: cubejs automatically appends GROUP BY snippets for selected dimensions. For the above dimension this is an error ( Analytic functions not allowed in GROUP BY clause).
Is there a way to tell cubejs to not apply a grouping for a particular dimension?

Edit: I tried with subQuery, but then the schema compiler complains about missing "case".

@paveltiunov
Copy link
Member

@klausb Ah. Yep. Right. I believe the only option is to put it in cube SQL then and use it as a dimension referencing to a column.

@klausb
Copy link
Author

klausb commented Mar 27, 2020

...use it as a dimension referencing to a column...

Can you explain how that works?

I tried this so far:

cube(`ScopeCpu`, {
  sql: `select * FROM store.scope_cpu`,
  
  measures: {
    sysModeUtilAvg: {
      sql: `${CUBE}.BYCPU_CPU_SYS_MODE_UTIL`,
      type: `avg`
    },
    sysModeUtil: {
      sql: `${CUBE}.BYCPU_CPU_SYS_MODE_UTIL`,
      type: `number`
    }
  },
  
  dimensions: {    
    ts: {
      sql: `to_timestamptz(timestamp_utc)`,
      type: `time`
    }, 

    avgCpu: {
      sql: `AVG(${CUBE}.BYCPU_CPU_SYS_MODE_UTIL) 
            OVER(ORDER BY ${ts} 
              RANGE BETWEEN INTERVAL '60 seconds' 
              PRECEDING AND CURRENT ROW)`,
      type: `number`
    },
       
    hostName: {
      sql: `host_name`,
      type: `string`
    }   
  }
});

With the cube Playground I create this query:

"query": {
    "measures": [
      "ScopeCpu.sysModeUtil"
    ],
    "timeDimensions": [],
    "dimensions": [
      "ScopeCpu.avgCpu",
      "ScopeCpu.ts"
    ],
    "filters": []
  }

Which leads to this SQL

SELECT
      AVG("scope_cpu".BYCPU_CPU_SYS_MODE_UTIL)
            OVER(ORDER BY to_timestamptz(timestamp_utc)
              RANGE BETWEEN INTERVAL '60 seconds'
              PRECEDING AND CURRENT ROW) "scope_cpu__avg_cpu", to_timestamptz(timestamp_utc) "scope_cpu__ts", "scope_cpu".BYCPU_CPU_
SYS_MODE_UTIL "scope_cpu__sys_mode_util"
    FROM
      store.scope_cpu AS "scope_cpu"
  GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10000

And then this error:

Error: Column "scope_cpu.BYCPU_CPU_SYS_MODE_UTIL" must appear in the GROUP BY clause or be used in an aggregate function.

It works, if I manually change to GROUP BY 2,3.

@klausb
Copy link
Author

klausb commented Apr 3, 2020

@paveltiunov , any further ideas?
Otherwise I suggest to close it and assume that this is not possible with cubejs.

@paveltiunov
Copy link
Member

@klausb You can do it as simply as

cube(`ScopeCpu`, {
  sql: `select *, AVG(bid) 
   OVER(ORDER BY ts
       RANGE BETWEEN INTERVAL '40 seconds' 
       PRECEDING AND CURRENT ROW) as avg_bid FROM ticks`,
   // ...
});

This approach however expects you have fixed set of dimensions you want to analyze. If your dimension set should change please use https://cube.dev/docs/measures#parameters-rolling-window

@paveltiunov
Copy link
Member

@klausb Did it work out for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question The issue is a question. Please use Stack Overflow for questions.
Projects
None yet
Development

No branches or pull requests

2 participants