
Add metadata scraper #27

Merged (3 commits, Jul 8, 2022)

Conversation

@ulucinar (Collaborator) commented Jun 29, 2022

Description of your changes

Related to https://github.com/upbound/official-providers/issues/19
Depends on #19

Adds a metadata scraper, which is invoked during provider code generation to generate a provider metadata file.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested?

Enabled the example generation pipelines in this PR.

@ulucinar force-pushed the fix-providers-19 branch 3 times, most recently from 6276f9a to af6b041 on June 30, 2022 08:43
@muvaf (Member) left a comment

Thanks a lot for this PR @ulucinar! It's quite challenging to follow recursive traversals like the ones in most functions here. Would you mind adding unit tests, so that we don't break them accidentally and can see immediately what we expect them to do?

@@ -61,12 +61,17 @@ type Resource struct {
// external name used in the generated example manifests to be
// overridden for a specific resource via configuration.
ExternalName string `yaml:"-"`
scrapeConfig *ScrapeConfiguration
@muvaf (Member)

This seems to be intended as per-resource but in practice, we set it to the same for all resources. I wonder how it'd look if we had ScrapeConfiguration include existing fields as well as XPath configs and used as input only for the ScrapeRepo function call. This way registry.ProviderMetadata and registry.Resource would be free of scraping-methodology-specific details. What do you think?

@ulucinar (Collaborator, author)

Removed scraper configuration fields from registry.Resource and registry.ProviderMetadata. Thanks @muvaf!

@ulucinar ulucinar force-pushed the fix-providers-19 branch 2 times, most recently from 8ca4bc4 to d5b8bb2 Compare July 3, 2022 22:55
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
…etadata}

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@muvaf (Member) left a comment

Opened a couple of issues for things that may require refactoring, since we need to wrap up the metadata work given our timelines and can iterate on it later. Dropped a few non-blocking comments, but LGTM! Thanks!

go.mod (outdated)
@@ -25,6 +28,13 @@ require (
sigs.k8s.io/yaml v1.3.0
)

require (
@muvaf (Member)

Do we need this separate require block?

@ulucinar (Collaborator, author)

Done. It looks like go mod tidy populated those dependencies in a separate block. I manually merged the two require blocks and ran go mod tidy again.
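For reference, since Go 1.17 `go mod tidy` records indirect dependencies in their own require block, marked with `// indirect` comments; merging by hand yields a single block like the following (module paths and versions here are illustrative, apart from the yaml line visible in the diff above):

```
require (
	github.com/example/somedep v1.2.3 // illustrative direct dependency
	sigs.k8s.io/yaml v1.3.0
)
```

Running `go mod tidy` again after such a manual merge is safe: it keeps the requirements but re-normalizes the `// indirect` markers.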

return errors.Wrap(err, "failed to parse HTML")
}

if err := r.scrapePrelude(doc, config.PreludeXPath); err != nil {
@muvaf (Member)

Seems like all scrape functions require the doc and an XPath string. Do you think there is room for a common interface with a Scrape(r *Resource, doc, xpath string) error function, which would let us write unit tests focused on single units?

@ulucinar (Collaborator, author)

I have not included unit tests for the scrape functions, as they are not exported (only the registry.ProviderMetadata.ScrapeRepo and registry.ProviderMetadata.Store methods are currently exported). Let's consider this a future enhancement.

// parse prelude
nodes := htmlquery.Find(doc, preludeXPath)
rawData := ""
if len(nodes) > 0 {
@muvaf (Member)

nitpick: seems like a chance to return early here.

@ulucinar (Collaborator, author)

Done.

codeNodes := htmlquery.Find(doc, fieldXPath)
for _, n := range codeNodes {
attrName := ""
doc := r.scrapeDocString(n, &attrName, processed)
@muvaf (Member)

doc is already in use here; this declaration shadows the outer variable.

@ulucinar (Collaborator, author) commented Jul 8, 2022

Done. Renamed as docStr.

if doc == "" {
continue
}
r.AddArgumentDoc(attrName, doc)
@muvaf (Member)

nitpick: does this deserve its own function given it's used only here?

@ulucinar (Collaborator, author)

Done. Embedded the function.

}

func (r *Resource) addExampleManifest(file *hcl.File, body *hclsyntax.Block) error {
refs, err := r.findReferences("", file, body)
@muvaf (Member)

I still think that the metadata document should be as raw as it can be, and that we can do everything references-related in a single place, which in this PR is Upjet itself. But changing it won't have much effect on the authoring experience at the moment, even though I believe it would have been simpler to reason about how Upjet works. Opened an issue to have a place to drop feedback about it.

}
if b.Labels[0] != *resourceName {
if exactMatch {
dependencies[depKey] = m
@muvaf (Member)

From our chat, I remember that having a separate dependencies array doesn't make much difference for the later stages, so we could possibly remove it to reduce the complexity here. Opened an issue to track the discussion.

continue
}
}
r.Name = *resourceName
@muvaf (Member)

In https://github.com/upbound/official-providers/pull/312, I see that all entries have the same value for name, such as aws_acm_certificate_validation in provider-aws. Is that intentional?

@ulucinar (Collaborator, author)

From the markdown files, we scrape the resource name first from the title of the documentation page; then, if any example manifests are available, we use what's in that HCL configuration's block label. In the past I have seen some cases where the title did not correctly reflect the resource name, but for most resources the title and the extracted resource name match. When/if we generate per-resource metadata files, we can get rid of the duplicated map keys.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@ulucinar (Collaborator, author) commented Jul 8, 2022

Thanks @muvaf.

@ulucinar ulucinar merged commit 7d3d113 into crossplane:main Jul 8, 2022
@ulucinar ulucinar deleted the fix-providers-19 branch July 8, 2022 16:10