feat: Add `contents_as_json` column to `github_workflows` table #16846
This PR has the following changes to source plugin(s) tables:
cb10954
to
b91db33
Compare
b91db33
to
328ba5f
Compare
Hi @akash1810 and thanks for the PR.
I understand the use case of querying JSON instead of YAML but there are a few things to consider:
- A workflow file can be invalid YAML either because of a syntax error, or simply because GitHub workflows YAML syntax differs from the spec, so we can't assume the content is YAML.
- The PR to fix AWS CloudFormation templates is a different use case. We assumed the data to be JSON, however later discovered it can be YAML, so we try to coerce the data to be JSON, and added a new column with the raw data, for the JSON column to be deprecated sometime in the future.

Generally CloudQuery doesn't do any transformation on the data and saves it as raw as possible, since there can be endless use cases on how to transform it. We found it's better to do any transformations post sync, and I believe you could run a query to add this column if needed.
Please let me know if that makes sense
@@ -86,3 +94,17 @@ func resolveContents(ctx context.Context, meta schema.ClientMeta, resource *sche
}
	return resource.Set(c.Name, content)
}

func resolveContentsAsJson(ctx context.Context, meta schema.ClientMeta, resource *schema.Resource, c schema.Column) error {
Column resolvers run in parallel so the data in `contents` might not be ready when this code runs.
The way to do this is to add `resource.Set("contents_as_json", contentAsJson)` to the existing resolver.
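A minimal sketch of that suggestion, for anyone following along. It does not use the real plugin SDK: `row` is a hypothetical stand-in for `*schema.Resource`, `convertFunc` is a hypothetical placeholder for whichever YAML-to-JSON converter is chosen, and `resolveBoth` is an illustrative name. The only point it makes is that one resolver writes both columns, so `contents_as_json` never has to read a column that another resolver may not have populated yet.

```go
package main

// row is a hypothetical stand-in for the SDK's *schema.Resource.
// It only records the values passed to Set.
type row map[string]any

// Set mimics the shape of resource.Set(name, value) from the comment above.
func (r row) Set(name string, value any) error {
	r[name] = value
	return nil
}

// convertFunc is a placeholder for a YAML-to-JSON converter (assumption:
// it takes the raw file bytes and returns JSON bytes or an error).
type convertFunc func(yamlText []byte) ([]byte, error)

// resolveBoth sets the raw text and its JSON form from the same resolver,
// so there is no ordering dependency between two column resolvers.
func resolveBoth(r row, content string, convert convertFunc) error {
	if err := r.Set("contents", content); err != nil {
		return err
	}
	asJSON, err := convert([]byte(content))
	if err != nil {
		// In this sketch a conversion error is returned to the caller.
		return err
	}
	return r.Set("contents_as_json", string(asJSON))
}
```

With this shape, the resolver registered for `contents_as_json` itself has nothing left to do.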
return resource.Set(c.Name, content)
Ah! That's a lot cleaner!
Maybe opening an issue with your specific use case will help clarify the need for the column? Could be easier to discuss this over an issue
Issue opened - #16850
Hi @akash1810, sorry for the delay. We discussed this change internally and agreed it makes sense to be able to query the workflow YAML via JSON syntax as suggested in the issue #16850. We've considered recommending using a Postgres extension to add a virtual column/view to handle the YAML conversion, however realized that might not work for hosted Postgres solutions. Would you like to follow up on this PR with my suggestion from #16846 (comment) to extend the existing resolver?
Ah excellent.
Will do! I hope to get this done this week. Any thoughts on schema validation (as mentioned in #16850) here? If CloudQuery tries to represent the source APIs as closely as possible, it's probably out of scope to perform a schema check here too?
Is the purpose to rely on the schema to query the YAML? If so I think we could verify it. I'd say not validate for now, and write queries that rely on the schema but can gracefully handle an invalid one if not too complicated
328ba5f
to
f0ff17e
Compare
Adds a new column `contents_as_json` to the `github_workflows` table, representing the Workflow's content in JSON form. This column depends on the `contents` column, and is set within the `resolveContents` resolver. The resolver for the `contents_as_json` column is a noop.
f0ff17e
to
afa4dab
Compare
Pushed an update now. I've not written much Go, so any feedback on code style, etc. is welcome!
@akash1810 thanks for following up, just got around to testing this. Works great 🚀
Nice one with the no-op resolver
OK added 7b90067 as I don't think we should fail if YAML parsing fails. We should still keep the text content. If that makes sense to you I'll go ahead and merge the PR
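One way to picture that behaviour, as a sketch only and not the code in 7b90067: `jsonOrNil` and `convertFunc` are hypothetical names, and the converter is passed in rather than tied to a specific YAML library. A failed or malformed conversion yields a nil value for the JSON column instead of an error, so the row keeps its raw `contents` text.

```go
package main

import "encoding/json"

// convertFunc is a placeholder for a YAML-to-JSON converter (assumption:
// raw file bytes in, JSON bytes or an error out).
type convertFunc func(yamlText []byte) ([]byte, error)

// jsonOrNil returns the JSON form of content, or nil when conversion
// fails or produces invalid JSON. It never returns an error, so a bad
// workflow file cannot fail the whole row.
func jsonOrNil(content string, convert convertFunc) json.RawMessage {
	out, err := convert([]byte(content))
	if err != nil || !json.Valid(out) {
		return nil
	}
	return json.RawMessage(out)
}
```

The caller would set `contents` unconditionally and set `contents_as_json` to whatever `jsonOrNil` returns.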
LGTM! Thanks also for the |
🤖 I have created a release *beep* *boop*

---

## [8.1.0](plugins-source-github-v8.0.2...plugins-source-github-v8.1.0) (2024-02-28)

### This Release has the Following Changes to Tables

- Table `github_workflows`: column added with name `contents_as_json` and type `json`

### Features

* Add `contents_as_json` column to `github_workflows` table ([#16846](#16846)) ([16d9db0](16d9db0))

### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.31.0 ([#16899](#16899)) ([2fac27a](2fac27a))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
This change includes an update to the `github_workflows` table, which could allow us to replace the `github_actions_usage` lambda with a SQL statement. See cloudquery/cloudquery#16846.
Summary
Adds a new column `contents_as_json` to the `github_workflows` table, representing the Workflow's content in JSON form.

The `contents` column of this table is stringified YAML. Postgres doesn't have first class support for querying YAML; JSON columns can be natively queried.

The column's resolver gets the value in the `contents` column, and uses `github.com/ghodss/yaml` to convert it to JSON. This is similar to how AWS CloudFormation templates are processed.

❓ I'm not sure if the mechanism used to get the `contents` column is preferable. It is different from the docs because we've different types - the table has `github.Workflow` and the column `github.RepositoryContent`. Is this likely to cause issues?