Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EVG-16134: CLI to convert BSON to Parquet #435

Merged
merged 26 commits into from
Feb 18, 2022

Conversation

julianedwards
Copy link
Contributor

@julianedwards julianedwards commented Feb 3, 2022

JIRA: https://jira.mongodb.org/browse/EVG-16134

The CLI is just for testing and not for production, the following ticket EVG-16137 will reuse some of the code to actually write parquet test results to an S3 bucket.

@@ -88,6 +90,11 @@ func (t *TestResults) Setup(e cedar.Environment) { t.env = e }
// IsNil returns if the TestResults is populated or not.
func (t *TestResults) IsNil() bool { return !t.populated }

// PrestoPartitionKey returns the partition key for the S3 bucket in Presto.
func (t *TestResults) PrestoPartitionKey() string {
return fmt.Sprintf("project=%s/task_create_iso=%s/%s", t.Info.Project, t.CreatedAt.UTC().Format("2006-01-02"), t.ID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're doing date then project iirc

cmd/convert-bson-to-parquet/.gitignore Show resolved Hide resolved
model/test_results.go Outdated Show resolved Hide resolved
model/test_results.go Outdated Show resolved Hide resolved
model/test_results.go Show resolved Hide resolved
model/test_results_test.go Outdated Show resolved Hide resolved
LogURL *string `parquet:"name=log_url, type=BYTE_ARRAY, convertedtype=UTF8"`
RawLogURL *string `parquet:"name=raw_log_url, type=BYTE_ARRAY, convertedtype=UTF8"`
LineNum *int32 `parquet:"name=line_num, type=INT32"`
TestStartTime int64 `parquet:"name=test_start_time, type=INT64, logicaltype=TIMESTAMP, logicaltype.unit=MILLIS, logicaltype.isadjustedtoutc=true"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a raw int64, should this variable be named TestStartTimeUnixMillis or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rtimmons what do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-iterating a bit from slack here for posterity:

I think the type of the field matters in this case. If it's a timestamp field then it's not a count of millis from unix epoch, it's in fact a timestamp of when the test started (which under the covers may use epoch millis or whatever but that's its business)

so what's weird is that in go it's an int64 but in parquet/presto it's a timestamp

the "correct" thing seems to be to call it different things in parquet versus go, but that seems potentially confusing

so tbh it's a judgment call. I would choose the same name and live with the type oddity in go (perhaps make the field "hidden" in go and add an accessor function/method to use the timestamp-typed version which is constructed from the int)? But there's downsides there about consistency so shrug.

Either way it's no biggie imho as long as we're storing it as a proper timestamp in presto and that the field name in presto isn't confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to leave it as is since the struct tags make it explicit that this a millisecond unix timestamp in parquet. I am going to merge this.

Copy link
Contributor

@Kimchelly Kimchelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mod the one comment thread.

@julianedwards julianedwards merged commit 9153ea7 into evergreen-ci:main Feb 18, 2022
@julianedwards julianedwards deleted the EVG-16134 branch February 18, 2022 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants