Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregate function to transform data into "wide" schema #4332

Open
philrz opened this issue Jan 24, 2023 · 1 comment
Open

Aggregate function to transform data into "wide" schema #4332

philrz opened this issue Jan 24, 2023 · 1 comment

Comments

@philrz
Copy link
Contributor

philrz commented Jan 24, 2023

I've been doing some recent work transforming data using Zed to prepare it for time-series plotting. The attached sample data fuel.csv illustrates the challenge.

$ cat fuel.csv 
SURVEY_DATE,PRODUCT_ID,PRODUCT_NAME,PRICE,VAT,EXCISE,NET,CHANGE
2005-01-03,1,"Euro-Super 95",1115.75,185.96,558.64,371.15,-1.57
2005-01-03,2,"Automotive gas oil",1018.28,169.71,403.21,445.36,-0.33
2005-01-03,3,"Heating gas oil",948.5,158.08,403.21,387.21,-22.55
2005-01-03,5,LPG,552.5,92.08,156.62,303.8,0.22
2005-01-03,6,"Residual fuel oil",553.25,50.3,166.84,336.11,-12.21
2005-01-03,8,"Heavy fuel oil",229.52,0,31.39,198.13,-5.37
2005-01-10,1,"Euro-Super 95",1088,181.33,558.64,348.03,-27.75
2005-01-10,2,"Automotive gas oil",1004.39,167.4,403.21,433.78,-13.89
2005-01-10,3,"Heating gas oil",947.94,157.99,403.21,386.74,-0.56
2005-01-10,5,LPG,552.57,92.09,156.62,303.86,0.07
2005-01-10,6,"Residual fuel oil",554.22,50.38,166.84,337,0.97
2005-01-10,8,"Heavy fuel oil",238.37,0,31.39,206.98,8.85

As we can see, the SURVEY_DATE column repeats six times, once for each category of fuel. However, many time-series tools have a much easier time reading such data if a "wide" schema is used with a single time/date stamp and multiple metrics per row. So in this case if we wanted to isolate just the PRICE data, we'd prefer it to look something like:

$ zq -version
Version: v1.7.0-2-g36dd506e

$ zq -f csv '
  map(|{PRODUCT_NAME:PRICE}|) by SURVEY_DATE
  | over map with d=SURVEY_DATE => (
    yield {key:[key],value}
    | collect(this)
    | unflatten(this)
    | put SURVEY_DATE:=d
  )' fuel.csv 
  
LPG,Euro-Super 95,Heavy fuel oil,Heating gas oil,Residual fuel oil,Automotive gas oil,SURVEY_DATE
552.5,1115.75,229.52,948.5,553.25,1018.28,2005-01-03
552.57,1088,238.37,947.94,554.22,1004.39,2005-01-10

That multi-line Zed idiom has been handy and I've since used it in multiple contexts (e.g., the https://github.com/brimdata/grafana-zed-datasource README). However, it's not the most self-documenting thing and I'm concerned that it might make the Zed language look needlessly complex to new users. Therefore I'd propose an aggregate function that provides this functionality. As a starting idea for a name, I'd propose widen() since the end data format is sometimes referred to as a "wide schema" in time-series materials.

It has also been noted by the team that as an alternative to delivering this in a purpose-built aggregation function, we could wrap the idiom in a parameterized subgraph (#4152).

@philrz
Copy link
Contributor Author

philrz commented Jun 25, 2023

I was indeed able to implement this with the recently added user-defined operators.

$ cat widen.zed 
op widen(key_field, value_field, by_field): (
  map(|{key_field:value_field}|) by by_field
  | over map with d=by_field => (
    yield {key:[key],value}
    | collect(this)
    | unflatten(this)
    | put by_field:=d
  )
)

$ zq -version
Version: v1.8.1-39-gcf5826b1

$ zq -f csv -I widen.zed 'widen(PRODUCT_NAME, PRICE, SURVEY_DATE)' fuel.csv
LPG,Euro-Super 95,Heavy fuel oil,Heating gas oil,Residual fuel oil,Automotive gas oil,SURVEY_DATE
552.5,1115.75,229.52,948.5,553.25,1018.28,2005-01-03
552.57,1088,238.37,947.94,554.22,1004.39,2005-01-10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant