Aggregate function to transform data into "wide" schema #4332

philrz · 2023-01-24T23:02:46Z

I've been doing some recent work transforming data using Zed to prepare it for time-series plotting. The attached sample data fuel.csv illustrates the challenge.

$ cat fuel.csv 
SURVEY_DATE,PRODUCT_ID,PRODUCT_NAME,PRICE,VAT,EXCISE,NET,CHANGE
2005-01-03,1,"Euro-Super 95",1115.75,185.96,558.64,371.15,-1.57
2005-01-03,2,"Automotive gas oil",1018.28,169.71,403.21,445.36,-0.33
2005-01-03,3,"Heating gas oil",948.5,158.08,403.21,387.21,-22.55
2005-01-03,5,LPG,552.5,92.08,156.62,303.8,0.22
2005-01-03,6,"Residual fuel oil",553.25,50.3,166.84,336.11,-12.21
2005-01-03,8,"Heavy fuel oil",229.52,0,31.39,198.13,-5.37
2005-01-10,1,"Euro-Super 95",1088,181.33,558.64,348.03,-27.75
2005-01-10,2,"Automotive gas oil",1004.39,167.4,403.21,433.78,-13.89
2005-01-10,3,"Heating gas oil",947.94,157.99,403.21,386.74,-0.56
2005-01-10,5,LPG,552.57,92.09,156.62,303.86,0.07
2005-01-10,6,"Residual fuel oil",554.22,50.38,166.84,337,0.97
2005-01-10,8,"Heavy fuel oil",238.37,0,31.39,206.98,8.85

As we can see, the SURVEY_DATE column repeats six times, once for each category of fuel. However, many time-series tools have a much easier time reading such data if a "wide" schema is used with a single time/date stamp and multiple metrics per row. So in this case if we wanted to isolate just the PRICE data, we'd prefer it to look something like:

$ zq -version
Version: v1.7.0-2-g36dd506e

$ zq -f csv '
  map(|{PRODUCT_NAME:PRICE}|) by SURVEY_DATE
  | over map with d=SURVEY_DATE => (
    yield {key:[key],value}
    | collect(this)
    | unflatten(this)
    | put SURVEY_DATE:=d
  )' fuel.csv 
  
LPG,Euro-Super 95,Heavy fuel oil,Heating gas oil,Residual fuel oil,Automotive gas oil,SURVEY_DATE
552.5,1115.75,229.52,948.5,553.25,1018.28,2005-01-03
552.57,1088,238.37,947.94,554.22,1004.39,2005-01-10

That multi-line Zed idiom has been handy and I've since used it in multiple contexts (e.g., the https://github.com/brimdata/grafana-zed-datasource README). However, it's not the most self-documenting thing and I'm concerned that it might make the Zed language look needlessly complex to new users. Therefore I'd propose an aggregate function that provides this functionality. As a starting idea for a name, I'd propose widen() since the end data format is sometimes referred to as a "wide schema" in time-series materials.

It has also been noted by the team that as an alternative to delivering this in a purpose-built aggregation function, we could wrap the idiom in a parameterized subgraph (#4152).

The text was updated successfully, but these errors were encountered:

philrz · 2023-06-25T23:52:53Z

I was indeed able to implement this with the recently added user-defined operators.

$ cat widen.zed 
op widen(key_field, value_field, by_field): (
  map(|{key_field:value_field}|) by by_field
  | over map with d=by_field => (
    yield {key:[key],value}
    | collect(this)
    | unflatten(this)
    | put by_field:=d
  )
)

$ zq -version
Version: v1.8.1-39-gcf5826b1

$ zq -f csv -I widen.zed 'widen(PRODUCT_NAME, PRICE, SURVEY_DATE)' fuel.csv
LPG,Euro-Super 95,Heavy fuel oil,Heating gas oil,Residual fuel oil,Automotive gas oil,SURVEY_DATE
552.5,1115.75,229.52,948.5,553.25,1018.28,2005-01-03
552.57,1088,238.37,947.94,554.22,1004.39,2005-01-10

philrz mentioned this issue Mar 23, 2023

Grouping by every field not referenced in aggregation function #4452

Open

philrz mentioned this issue Apr 27, 2023

Zed shorthand to create a record field name from string #4555

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Aggregate function to transform data into "wide" schema #4332

Aggregate function to transform data into "wide" schema #4332

philrz commented Jan 24, 2023 •

edited

Loading

philrz commented Jun 25, 2023

Aggregate function to transform data into "wide" schema #4332

Aggregate function to transform data into "wide" schema #4332

Comments

philrz commented Jan 24, 2023 • edited Loading

philrz commented Jun 25, 2023

philrz commented Jan 24, 2023 •

edited

Loading