Transform functions in Pinot schema #5135

Closed
npawar opened this issue Mar 10, 2020 · 3 comments

npawar commented Mar 10, 2020

Consider,
X: Data at source. This can be either a stream or data files. The formats are typically JSON, AVRO, CSV etc.
Y: Data in Pinot. This is the record/document in Pinot.

When data is ingested into Pinot (either realtime or batch ingestion), every column in X needs to map directly to a column in Y. The only exception is the time column, where we allow a transformation from one time format to another, but this is limited to a single column. This means that every column in the destination schema must be present exactly as-is in the source schema (except the time column).
This is not always practical. It is often desirable to apply some transformations to the source columns before they reach the destination.
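
For reference, the existing single-time-column transformation is expressed through the schema's timeFieldSpec, which pairs an incoming granularity (the format at the source) with an outgoing granularity (the format stored in Pinot). A minimal sketch, assuming the legacy timeFieldSpec field names (these may vary between Pinot versions):

  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "timestamp",
      "dataType": "LONG",
      "timeType": "MILLISECONDS"
    },
    "outgoingGranularitySpec": {
      "name": "daysSinceEpoch",
      "dataType": "INT",
      "timeType": "DAYS"
    }
  }

Anything beyond this one incoming-to-outgoing conversion currently has to be done by preprocessing the data outside Pinot.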

For example, consider this sample ads data:
Source columns - userID, name.firstName, name.lastName, IP, eventType, cost, timestamp

  [
    {
      "userID": 1,
      "name": { "firstName": "John", "lastName": "Doe" },
      "IP": "10.1.2.3",
      "eventType": "IMPRESSION",
      "cost": 2000,
      "timestamp": 1583882502198
    },
    {
      "userID": 2,
      "name": { "firstName": "Mary", "lastName": "Smith" },
      "IP": "10.5.6.7",
      "eventType": "IMPRESSION",
      "cost": 4000,
      "timestamp": 1583882502198
    },
    {
      "userID": 3,
      "name": { "firstName": "Rita", "lastName": "Skeeter" },
      "IP": "10.9.8.7",
      "eventType": "CLICK",
      "cost": 600,
      "timestamp": 1583882502198
    }
  ]

Destination columns - userId, fullName, country, zipcode, impressions, clicks, cost, hoursSinceEpoch, daysSinceEpoch
userId - Map userID to userId
fullName - Concatenate name.firstName and name.lastName
country - Extract country from IP
zipcode - Extract zipcode from IP
impressions - 1 if eventType=IMPRESSION, 0 otherwise
clicks - 1 if eventType=CLICK, 0 otherwise
cost - Maps directly from cost, no transformation
hoursSinceEpoch - Convert timestamp to epoch hours
daysSinceEpoch - Convert timestamp to epoch days
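
To make the proposal concrete, here is a minimal sketch of how some of these mappings could be declared directly in the destination schema. The transformFunction field, the Groovy({expression}, arguments...) evaluator syntax, and the nested-field arguments (name.firstName) are illustrative assumptions for this sketch, not a committed design:

  {
    "schemaName": "adsEvents",
    "dimensionFieldSpecs": [
      {
        "name": "userId",
        "dataType": "INT",
        "transformFunction": "Groovy({userID}, userID)"
      },
      {
        "name": "fullName",
        "dataType": "STRING",
        "transformFunction": "Groovy({firstName + ' ' + lastName}, name.firstName, name.lastName)"
      }
    ],
    "metricFieldSpecs": [
      {
        "name": "impressions",
        "dataType": "LONG",
        "transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1 : 0}, eventType)"
      },
      {
        "name": "clicks",
        "dataType": "LONG",
        "transformFunction": "Groovy({eventType == 'CLICK' ? 1 : 0}, eventType)"
      },
      {
        "name": "cost",
        "dataType": "LONG"
      }
    ]
  }

Columns such as country and zipcode would rely on shareable custom functions (motivation 4 below), and hoursSinceEpoch/daysSinceEpoch on date-time transform functions; both are covered by the follow-up issues listed later in this thread.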

The only way to achieve this in Pinot today is for the user to write a custom transformation job that prepares the data according to the destination schema.

Hence, the motivations for this proposal are as follows:

  1. Source and destination are not always 1:1 - Users have to write a transformation job, separately for realtime and for offline, which can lead to inconsistencies. It also adds an extra step to user onboarding.
  2. Be able to read nested fields from the source data
  3. Be able to support multiple time columns - in order to use dateTimeFieldSpec, we need support for derived functions.
  4. Be able to share transformation functions across use cases, instead of each user writing their own
  5. Better schema evolution - when a new column is derived from existing columns, it can be backfilled with the correct values instead of default null values.
@npawar npawar self-assigned this Mar 10, 2020

npawar commented May 7, 2020

Next steps:

  1. Transformations using columns which themselves are a product of transformation - Transformations using columns which themselves are a result of transformation #5351
  2. Support for custom functions (non-Groovy function evaluators) - Support for custom functions in schema transformation #5352
  3. Date time related custom functions - Add date time transform functions #5313
  4. Flatten - Flatten fields during ingestion #5264
  5. Filter - Filter during ingestion #5268

npawar commented May 7, 2020

Closing this, as there's an issue for every follow-up.

@npawar npawar closed this as completed May 7, 2020