feat: Add `contents_as_json` column to `github_workflows` table #16846

akash1810 · 2024-02-23T18:20:19Z

Summary

Adds a new column contents_as_json to the github_workflows table, representing the Workflow's content in JSON form.

The contents column of this table is stringified YAML. Postgres doesn't have first-class support for querying YAML, whereas JSON columns can be queried natively.

The column's resolver gets the value in the contents column, and uses github.com/ghodss/yaml to convert it to JSON. This is similar to how AWS CloudFormation templates are processed.

❓ I'm not sure whether the mechanism used to get the contents column is preferable. It differs from the docs because we have different types: the table has github.Workflow and the column has github.RepositoryContent. Is this likely to cause issues?

cq-bot · 2024-02-23T18:21:20Z

This PR has the following changes to source plugin(s) tables:

Table github_workflows: column added with name contents_as_json and type json

erezrokah

Hi @akash1810 and thanks for the PR.
I understand the use case of querying JSON instead of YAML but there are a few things to consider:

- A workflow file can be invalid YAML, either because of a syntax error or simply because GitHub's workflow YAML syntax differs from the spec, so we can't assume the content is valid YAML.
- The PR to fix AWS CloudFormation templates is a different use case. We assumed the data was JSON but later discovered it can be YAML, so we try to coerce the data to JSON, and we added a new column with the raw data so that the JSON column can be deprecated sometime in the future.

Generally CloudQuery doesn't do any transformation on the data and saves it as raw as possible, since there can be endless use cases on how to transform it. We found it's better to do any transformations post sync, and I believe you could run a query to add this column if needed

Please let me know if that makes sense

erezrokah · 2024-02-23T18:54:38Z

plugins/source/github/resources/services/actions/workflows.go

@@ -86,3 +94,17 @@ func resolveContents(ctx context.Context, meta schema.ClientMeta, resource *sche
 	}
 	return resource.Set(c.Name, content)
 }
+
+func resolveContentsAsJson(ctx context.Context, meta schema.ClientMeta, resource *schema.Resource, c schema.Column) error {


Column resolvers run in parallel so the data in contents might not be ready when this code runs.
The way to do this is to add resource.Set("contents_as_json", contentAsJson) to the existing resolver

cloudquery/plugins/source/github/resources/services/actions/workflows.go

Line 87 in 8ab9b43

return resource.Set(c.Name, content)

Ah! That's a lot cleaner!
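
The suggestion in this thread can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: `resource`, `converter`, and the simplified resolver signatures are hypothetical stand-ins for the SDK's `schema.Resource` and for the YAML-to-JSON step (the PR uses github.com/ghodss/yaml for that). The only point is that both columns are written from one resolver, so `contents_as_json` never depends on another resolver having finished.

```go
package main

import "sync"

// resource is a hypothetical stand-in for the SDK's *schema.Resource.
// Only the Set/Get behaviour needed for this sketch is modelled.
type resource struct {
	mu   sync.Mutex
	cols map[string]any
}

func newResource() *resource {
	return &resource{cols: map[string]any{}}
}

// Set stores a column value; the mutex keeps concurrent resolvers safe.
func (r *resource) Set(name string, value any) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cols[name] = value
	return nil
}

// Get returns a column value, or nil if it has not been set.
func (r *resource) Get(name string) any {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.cols[name]
}

// converter is a hypothetical hook for the YAML-to-JSON conversion.
type converter func(yamlDoc []byte) ([]byte, error)

// resolveContents sets both columns in one place, so the JSON form is
// always derived from content that is already in hand.
func resolveContents(r *resource, raw string, toJSON converter) error {
	if err := r.Set("contents", raw); err != nil {
		return err
	}
	asJSON, err := toJSON([]byte(raw))
	if err != nil {
		return err
	}
	return r.Set("contents_as_json", asJSON)
}

// resolveContentsAsJSON is deliberately a no-op: the value is set by
// resolveContents, so there is nothing left to do here.
func resolveContentsAsJSON(*resource) error {
	return nil
}
```

Because the dedicated resolver does nothing, it no longer matters in which order the two column resolvers are scheduled.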

erezrokah · 2024-02-23T19:10:36Z

Maybe opening an issue with your specific use case will help clarify the need for the column? Could be easier to discuss this over an issue

akash1810 · 2024-02-23T20:03:41Z

> Maybe opening an issue with your specific use case will help clarify the need for the column? Could be easier to discuss this over an issue

Issue opened - #16850

erezrokah · 2024-02-27T14:20:43Z

Hi @akash1810, sorry for the delay. We discussed this change internally and agreed it makes sense to be able to query the workflow YAML via JSON syntax as suggested in the issue #16850.

We've considered recommending using a Postgres extension to add a virtual column/view to handle the YAML conversion, however realized that might not work for hosted Postgres solutions.

Would you like to follow up on this PR with my suggestion from #16846 (comment) to extend the existing resolver?

akash1810 · 2024-02-27T15:01:10Z

> We discussed this change internally and agreed it makes sense to be able to query the workflow YAML via JSON syntax as suggested in the issue #16850.

Ah excellent.

> Would you like to follow up on this PR with my suggestion from #16846 (comment) to extend the existing resolver?

Will do! I hope to get this done this week.

Any thoughts on schema validation (as mentioned in #16850) here? If CloudQuery tries to represent the source APIs as closely as possible, it's probably out of scope to perform a schema check here too?

erezrokah · 2024-02-27T15:09:40Z

> it's probably out of scope to perform a schema check here too?

Is the purpose to rely on the schema to query the YAML? If so I think we could verify it.
However the file could still be valid YAML without matching the schema (in case of user error), so you'd lose querying capabilities in that case. Another issue is that the schema can change over time so you'd need to know which schema was used during the sync to validate.

I'd say don't validate for now, and write queries that rely on the schema but gracefully handle an invalid one, if that's not too complicated.
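
As a rough illustration of relying on the schema while tolerating documents that don't match it, here is a small Go sketch. It assumes a consumer reads the `contents_as_json` value as raw bytes; the function name `jobNames` is hypothetical. It looks for the top-level `jobs` mapping of a workflow and returns nothing, rather than failing, when the value is missing or has an unexpected shape.

```go
package main

import (
	"encoding/json"
	"sort"
)

// jobNames returns the sorted keys of the top-level "jobs" mapping.
// A nil value, malformed JSON, or a "jobs" field that is not a mapping
// all yield nil, so callers never have to handle an error.
func jobNames(contentsAsJSON []byte) []string {
	var doc struct {
		Jobs map[string]json.RawMessage `json:"jobs"`
	}
	if err := json.Unmarshal(contentsAsJSON, &doc); err != nil {
		return nil
	}
	if len(doc.Jobs) == 0 {
		return nil
	}
	names := make([]string, 0, len(doc.Jobs))
	for name := range doc.Jobs {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```

The same idea carries over to SQL queries against the column: treat a missing or oddly shaped key as "no data" instead of letting it abort the query.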

Adds a new column `contents_as_json` to the `github_workflows` table, representing the Workflow's content in JSON form. This column depends on the `contents` column, and is set within the `resolveContents` resolver. The resolver for the `contents_as_json` column is a no-op.

akash1810 · 2024-02-27T22:16:44Z

> I hope to get this done this week.

Pushed an update now. I've not written much Go, so any feedback on code style, etc. is welcome!

erezrokah

@akash1810 thanks for following up, just got around to testing this. Works great 🚀
Nice one with the no-op resolver

erezrokah · 2024-02-28T15:38:09Z

OK, added 7b90067 as I don't think we should fail if YAML parsing fails. We should still keep the text content. If that makes sense to you, I'll go ahead and merge the PR.
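
The behaviour described here, warning instead of failing while keeping the text content, might look something like the sketch below. It is not the code from 7b90067: `converter`, `jsonOnly`, and `columnsFor` are hypothetical names, and `jsonOnly` is only a stand-in for a real YAML-to-JSON converter so that the example stays dependency-free.

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
)

// converter is a hypothetical hook for the YAML-to-JSON conversion.
type converter func(doc []byte) ([]byte, error)

// jsonOnly is a stand-in converter for this sketch only: it accepts
// input that is already JSON and rejects everything else.
func jsonOnly(doc []byte) ([]byte, error) {
	if !json.Valid(doc) {
		return nil, errors.New("document could not be converted")
	}
	return doc, nil
}

// columnsFor returns the values for the contents and contents_as_json
// columns. A conversion failure is logged as a warning and leaves the
// JSON value nil; the raw text is returned either way.
func columnsFor(path, raw string, toJSON converter) (string, []byte) {
	out, err := toJSON([]byte(raw))
	if err != nil {
		log.Printf("warning: could not convert %s to JSON: %v", path, err)
		return raw, nil
	}
	return raw, out
}
```

With this shape, one unparseable workflow file produces a log line and an empty JSON column for that row rather than an error for the whole resource.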

akash1810 · 2024-02-28T18:45:11Z

> OK, added 7b90067 as I don't think we should fail if YAML parsing fails. We should still keep the text content. If that makes sense to you, I'll go ahead and merge the PR.

LGTM!

Thanks also for the style: commit; I forgot to run make lint before pushing...

🤖 I have created a release *beep* *boop*

---

## [8.1.0](plugins-source-github-v8.0.2...plugins-source-github-v8.1.0) (2024-02-28)

### This Release has the Following Changes to Tables

- Table `github_workflows`: column added with name `contents_as_json` and type `json`

### Features

* Add `contents_as_json` column to `github_workflows` table ([#16846](#16846)) ([16d9db0](16d9db0))

### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-sdk/v4 to v4.31.0 ([#16899](#16899)) ([2fac27a](2fac27a))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

This change includes an update to the `github_workflows` table, which could allow us to replace the `github_actions_usage` lambda with a SQL statement. See cloudquery/cloudquery#16846.

akash1810 requested a review from a team as a code owner February 23, 2024 18:20

akash1810 requested review from erezrokah and removed request for a team February 23, 2024 18:20

cq-bot added the area/plugin/source/github label Feb 23, 2024

akash1810 force-pushed the aa/gh-workflow-json branch from cb10954 to b91db33 Compare February 23, 2024 18:24

akash1810 mentioned this pull request Feb 23, 2024

Create github-actions-usage lambda guardian/service-catalogue#810

Merged

akash1810 force-pushed the aa/gh-workflow-json branch from b91db33 to 328ba5f Compare February 23, 2024 18:36

erezrokah requested changes Feb 23, 2024

akash1810 mentioned this pull request Feb 23, 2024

feat: Save GitHub Workflow content as JSON #16850

Closed

1 task

akash1810 marked this pull request as draft February 24, 2024 08:26

akash1810 force-pushed the aa/gh-workflow-json branch from 328ba5f to f0ff17e Compare February 27, 2024 22:12

akash1810 force-pushed the aa/gh-workflow-json branch from f0ff17e to afa4dab Compare February 27, 2024 22:14

akash1810 marked this pull request as ready for review February 27, 2024 22:16

style: Remove extra space

5ab0197

erezrokah approved these changes Feb 28, 2024

erezrokah added automerge Automatically merge once required checks pass and removed automerge Automatically merge once required checks pass labels Feb 28, 2024

fix: Log warning instead of failing on yaml parse failure

7b90067

erezrokah approved these changes Feb 28, 2024

erezrokah added the automerge Automatically merge once required checks pass label Feb 28, 2024

erezrokah added 2 commits February 28, 2024 20:05

chore: Tidy

8e235fd

Merge branch 'main' into aa/gh-workflow-json

77e12a2

vercel bot deployed to Preview February 28, 2024 20:22 View deployment

kodiakhq bot merged commit 16d9db0 into cloudquery:main Feb 28, 2024
13 checks passed

cq-bot mentioned this pull request Feb 28, 2024

chore(main): Release plugins-source-github v8.1.0 #16927

Merged

akash1810 mentioned this pull request Feb 29, 2024

chore(deps): Update the GitHub plugin for CloudQuery to v8.1.0 guardian/service-catalogue#828

Closed
