Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add contents_as_json column to github_workflows table #16846

Merged
merged 5 commits into from
Feb 28, 2024

Conversation

akash1810
Copy link
Contributor

Summary

Adds a new column contents_as_json to the github_workflows table, representing the Workflow's content in JSON form.

The contents column of this table is stringified YAML. Postgres doesn't have first class support for querying YAML; JSON columns can be natively queried.

The column's resolver gets the value in the contents column, and uses github.com/ghodss/yaml to convert it to JSON. This is similar to how AWS CloudFormation templates are processed.

❓ I'm not sure if the mechanism used to get the contents column is preferable. It is different from the docs because we've different types - the table has github.Workflow and the column github.RepositoryContent. Is this likely to cause issues?

@akash1810 akash1810 requested a review from a team as a code owner February 23, 2024 18:20
@akash1810 akash1810 requested review from erezrokah and removed request for a team February 23, 2024 18:20
@cq-bot
Copy link
Contributor

cq-bot commented Feb 23, 2024

This PR has the following changes to source plugin(s) tables:

  • Table github_workflows: column added with name contents_as_json and type json

Copy link
Contributor

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @akash1810 and thanks for the PR.
I understand the use case of querying JSON instead of YAML but there are a few things to consider:

  1. A workflow file can be invalid YAML either because of a syntax error, or simply because GitHub workflows YAML syntax differs from the spec, so we can't assume the content is YAML

  2. The PR to fix AWS CloudFormation templates is a different use case. We assumed the data to be JSON, however later discovered it can be YAML, so we try to coerce the data to be JSON, and added a new column with the raw data, for the JSON column to be deprecated sometime in the future.

Generally CloudQuery doesn't do any transformation on the data and saves it as raw as possible, since there can be endless use cases on how to transform it. We found it's better to do any transformations post sync, and I believe you could run a query to add this column if needed

Please let me know if that makes sense

@@ -86,3 +94,17 @@ func resolveContents(ctx context.Context, meta schema.ClientMeta, resource *sche
}
return resource.Set(c.Name, content)
}

func resolveContentsAsJson(ctx context.Context, meta schema.ClientMeta, resource *schema.Resource, c schema.Column) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column resolvers run in parallel so the data in contents might not be ready when this code runs.
The way to do this is to add resource.Set("contents_as_json", contentAsJson) to the existing resolver

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That's a lot cleaner!

@erezrokah
Copy link
Contributor

Maybe opening an issue with your specific use case will help clarify the need for the column? Could be easier to discuss this over an issue

@akash1810
Copy link
Contributor Author

Maybe opening an issue with your specific use case will help clarify the need for the column? Could be easier to discuss this over an issue

Issue opened - #16850

@akash1810 akash1810 marked this pull request as draft February 24, 2024 08:26
@erezrokah
Copy link
Contributor

Hi @akash1810, sorry for the delay. We discussed this change internally and agreed it makes sense to be able to query the workflow YAML via JSON syntax as suggested in the issue #16850.

We've considered recommending using a Postgres extension to add a virtual column/view to handle the YAML conversion, however realized that might not work for hosted Postgres solutions.

Would you like to follow up on this PR with my suggestion from #16846 (comment) to extend the existing resolver?

@akash1810
Copy link
Contributor Author

We discussed this change internally and agreed it makes sense to be able to query the workflow YAML via JSON syntax as suggested in the issue #16850.

Ah excellent.

Would you like to follow up on this PR with my suggestion from #16846 (comment) to extend the existing resolver?

Will do! I'll hope to get this done this week.

Any thoughts on schema validation (as mentioned in #16850) here? If CloudQuery tries to represent the source APIs as closely as possible, its probably out of scope to perform a schema check here too?

@erezrokah
Copy link
Contributor

its probably out of scope to perform a schema check here too?

Is the purpose to rely on the schema to query the YAML? If so I think we could verify it.
However the file could still be valid YAML without matching the schema (in case of user error), so you'd lose querying capabilities in that case. Another issue is that the schema can change over time so you'd need to know which schema was used during the sync to validate.

I'd say not validate for now, and write queries that rely on the schema but can gracefully handle invalid one if not too complicated

Adds a new column `contents_as_json` to the `github_workflows` table,
representing the Workflow's content in JSON form.

This column depends on the `contents` column,
and is set within the `resolveContents` resolver.

The resolver for the `contents_as_json` column is a noop.
@akash1810
Copy link
Contributor Author

I'll hope to get this done this week.

Pushed an update now. I've not written much Go, so any feedback on code style, etc welcomed!

@akash1810 akash1810 marked this pull request as ready for review February 27, 2024 22:16
Copy link
Contributor

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akash1810 thanks for the following, just got around to testing this. Works great 🚀
Nice one with the no-op resolver

@erezrokah erezrokah added automerge Automatically merge once required checks pass and removed automerge Automatically merge once required checks pass labels Feb 28, 2024
@erezrokah
Copy link
Contributor

OK added 7b90067 as I don't think we should fail if YAML parsing fail. We should still keep the text content. If that makes sense to you I'll go ahead and merge the PR

@akash1810
Copy link
Contributor Author

OK added 7b90067 as I don't think we should fail if YAML parsing fail. We should still keep the text content. If that makes sense to you I'll go ahead and merge the PR

LGTM!

Thanks also for the style: commit; I forgot to run make lint before pushing...

@erezrokah erezrokah added the automerge Automatically merge once required checks pass label Feb 28, 2024
@kodiakhq kodiakhq bot merged commit 16d9db0 into cloudquery:main Feb 28, 2024
13 checks passed
kodiakhq bot pushed a commit that referenced this pull request Feb 28, 2024
🤖 I have created a release *beep* *boop*
---


## [8.1.0](plugins-source-github-v8.0.2...plugins-source-github-v8.1.0) (2024-02-28)


### This Release has the Following Changes to Tables
- Table `github_workflows`: column added with name `contents_as_json` and type `json`

### Features

* Add `contents_as_json` column to `github_workflows` table ([#16846](#16846)) ([16d9db0](16d9db0))


### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.31.0 ([#16899](#16899)) ([2fac27a](2fac27a))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
akash1810 added a commit to guardian/service-catalogue that referenced this pull request Feb 29, 2024
This change includes an update to the `github_workflows` column,
which could allow us to replace the `github_actions_usage` lambda
with a SQL statement.

See cloudquery/cloudquery#16846.
akash1810 added a commit to guardian/service-catalogue that referenced this pull request Feb 29, 2024
This change includes an update to the `github_workflows` table,
which could allow us to replace the `github_actions_usage` lambda
with a SQL statement.

See cloudquery/cloudquery#16846.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/plugin/source/github automerge Automatically merge once required checks pass
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants