-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EVG-16134: CLI to convert BSON to Parquet #435
Conversation
402ccc2
to
3c080ec
Compare
model/test_results.go
Outdated
@@ -88,6 +90,11 @@ func (t *TestResults) Setup(e cedar.Environment) { t.env = e } | |||
// IsNil returns if the TestResults is populated or not. | |||
func (t *TestResults) IsNil() bool { return !t.populated } | |||
|
|||
// PrestoPartitionKey returns the partition key for the S3 bucket in Presto. | |||
func (t *TestResults) PrestoPartitionKey() string { | |||
return fmt.Sprintf("project=%s/task_create_iso=%s/%s", t.Info.Project, t.CreatedAt.UTC().Format("2006-01-02"), t.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're doing date then project iirc
LogURL *string `parquet:"name=log_url, type=BYTE_ARRAY, convertedtype=UTF8"` | ||
RawLogURL *string `parquet:"name=raw_log_url, type=BYTE_ARRAY, convertedtype=UTF8"` | ||
LineNum *int32 `parquet:"name=line_num, type=INT32"` | ||
TestStartTime int64 `parquet:"name=test_start_time, type=INT64, logicaltype=TIMESTAMP, logicaltype.unit=MILLIS, logicaltype.isadjustedtoutc=true"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a raw int64, should this variable be named TestStartTimeUnixMillis
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rtimmons what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-iterating a bit from slack here for posterity:
I think the type of the field matters in this case. If it's a timestamp field then it's not a count of millis from unix epoch, it's in fact a timestamp of when the test started (which under the covers may use epoch millis or whatever but that's its business)
so what's weird is that in go it's an int64 but in parquet/presto it's a timestamp
the "correct" thing seems to be to call it different things in parquet versus go, but that seems potentially confusing
so tbh it's a judgment call. I would choose the same name and live with the type oddity in go (perhaps make the field "hidden" in go and add an accessor function/method to use the timestamp-typed version which is constructed from the int)? But there's downsides there about consistency so shrug.
Either way it's no biggie imho as long as we're storing it as a proper timestamp in presto and that the field name in presto isn't confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to leave it as is since the struct tags make it explicit that this a millisecond unix timestamp in parquet. I am going to merge this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mod the one comment thread.
JIRA: https://jira.mongodb.org/browse/EVG-16134
The CLI is just for testing and not for production, the following ticket EVG-16137 will reuse some of the code to actually write parquet test results to an S3 bucket.