Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DateTimeFieldSpecs mainstream, deprecate TimeFieldSpec #2756

Open
npawar opened this issue Apr 24, 2018 · 3 comments
Open

Make DateTimeFieldSpecs mainstream, deprecate TimeFieldSpec #2756

npawar opened this issue Apr 24, 2018 · 3 comments
Assignees

Comments

@npawar
Copy link
Contributor

npawar commented Apr 24, 2018

In the Pinot Schema, stop using TimeFieldSpec, in favor of DateTimeFieldSpec.

The TimeFieldSpec is not adequate for the following reasons

  1. A schema can have only one TimeFieldSpec. As a result, we can have only one time column in the schema.
  2. Users get around the above limitation by adding the other time columns as dimensions. However, dimension columns can only have attributes name and dataType. TimeFieldSpec has attributes such as time unit (HOURS, MINUTES), time format (EPOCH/SDF), pattern (yyyyMMdd), which help describe the time column in more detail. Adding additional time columns as dimensions makes us lose out on these attributes of the time column.
  3. TimeFieldSpec has attributes which describe the format of the time column. However, it doesn’t convey anything about the bucketing of the time column. On the other hand, DateTimeFieldSpec has a richer spec, wherein the bucketing information is also captured. Bucketing is a useful concept when time columns are expressed in formats different from the bucketing. For example, if the time column is expressed in epoch millis, but every value has been rounded to the nearest hour, we can set DateTimeFieldSpec as
"dateTimeFieldSpecs": [
    {
      "name": "timestamp",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:HOURS"
    }
  ]
  1. The conversion mechanism used in TimeFieldSpec is not very flexible. The conversion can only be done from an incoming TimeGranularitySpec to an outgoing TimeGranularitySpec.
@npawar
Copy link
Contributor Author

npawar commented Apr 19, 2020

@npawar npawar changed the title Deprecate TimeFieldSpec and start using only DateTimeFieldSpec Support DateTimeFieldSpecs, deprecate TimeFieldSpec Apr 19, 2020
@npawar npawar changed the title Support DateTimeFieldSpecs, deprecate TimeFieldSpec Make DateTimeFieldSpecs mainstream, deprecate TimeFieldSpec Apr 19, 2020
@npawar npawar self-assigned this Apr 20, 2020
@snleee
Copy link
Contributor

snleee commented Apr 24, 2020

I want to discuss about enforcing a uniform type & format and only allow users to pick the granularity for the primary time column, which is configured in the table config.

  • start with the primary time column (enforced to have long, "1:milliseconds:epoch", user-defined-granularity).

  • Users can initially try to use time conversion UDF on the primary time column for their needs

  • If they need the better performance, they can add a new time column to dateTimeFieldSpecs with the specific time format & type that they want to materialize.

I think that having a uniform type & format for the primary column can simplify a lot of components because we won't need to support all different type formats when we need to play with the time. For example, segment roll-up config can be very simple - instead of having type, format, granularity, only granularity will be needed.

Another huge benefit will be an easier integration with external UIs such as Superset. I believe that Pinot-Superset connector have some extra logic to convert the time column values into a format that Superset understands. Since pinot's time column format/type will be different for each table, the external system now needs to know about Pinot's table configuration & schema (or similar time conversion config need to be specified for each table on Superset) for correctly parsing the time column value.

Since we make the interface change to the time column, it would be great if we can discuss about this.

@kishoreg @npawar @mayankshriv @Jackie-Jiang How do you guys think?

@npawar
Copy link
Contributor Author

npawar commented Apr 24, 2020

I agree this would be nice to have, and see the benefits we gain for merge/rollup and integrations.
I would prefer if we first get to making dateTimeFieldSpec the default, and then focus on this. Once we have only dateTimeFieldSpec, it'll be easier to add tooling and validations for achieving this. I feel this change is kinda tricky as it is, and don't want to mix up goals.

npawar added a commit that referenced this issue May 5, 2020
This is an item of project: #2756
We plan to internally start treating timeFieldSpec as dateTimeFieldSpec. This PR adds a utility function which converts a timeFieldSpec to an equivalent dateTimeFieldSpec.
Note that dateTimeFieldSpec doesn't have the concept of incoming/outgoing, and hence we construct and add a transform function to convey the conversion. Introduced a lot of epoch time transform functions for this.

NOTE: the toEpochXXXBucket method is added to help with conversion from timeFieldSpec to dateTimeFieldSpec. Practically, we would only be using toEpochXXX and toEpochXXXRounded.
npawar added a commit that referenced this issue May 7, 2020
Another step towards Issue #2756. As I was making changes to Schema to treat TIME as DATE_TIME, found a subset of changes that can go prior to the bigger change.
Removing 3 methods from the Schema which are related to TimeFieldSpec
1. getIncomingTimeUnit - was unused
2. getOutgoingTimeUnit - was unused
3. getTimeColumnName - was able to remove the 4 usages
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants