
Add metadata scraper #27

Merged (3 commits, Jul 8, 2022)

Conversation

@ulucinar (Collaborator) commented Jun 29, 2022

Description of your changes

Related to https://github.com/upbound/official-providers/issues/19
Depends on #19

Adds a metadata scraper, which is invoked during provider code generation to generate a provider metadata file.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested?

Enabled the example generation pipelines in this PR.

@ulucinar force-pushed the fix-providers-19 branch 3 times, most recently from 6276f9a to af6b041 on June 30, 2022 08:43
@muvaf (Member) left a comment

Thanks a lot for this PR @ulucinar! It's quite challenging to follow recursive traversals like the ones in most functions here. Would you mind adding unit tests, so that we don't break them accidentally and can see immediately what we expect them to do?

@@ -61,12 +61,17 @@ type Resource struct {
// external name used in the generated example manifests to be
// overridden for a specific resource via configuration.
ExternalName string `yaml:"-"`
scrapeConfig *ScrapeConfiguration
@muvaf (Member)

This seems to be intended as per-resource but in practice, we set it to the same for all resources. I wonder how it'd look if we had ScrapeConfiguration include existing fields as well as XPath configs and used as input only for the ScrapeRepo function call. This way registry.ProviderMetadata and registry.Resource would be free of scraping-methodology-specific details. What do you think?

@ulucinar (Collaborator, author)

Removed scraper configuration fields from registry.Resource and registry.ProviderMetadata. Thanks @muvaf!

@ulucinar ulucinar force-pushed the fix-providers-19 branch 2 times, most recently from 8ca4bc4 to d5b8bb2 Compare July 3, 2022 22:55
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
…etadata}

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@muvaf (Member) left a comment

Opened a couple of issues for things that may require refactoring, since we need to wrap up the metadata work given our timelines and can iterate on it later. Dropped a few non-blocking comments, but LGTM! Thanks!

go.mod (outdated)
@@ -25,6 +28,13 @@ require (
sigs.k8s.io/yaml v1.3.0
)

require (
@muvaf (Member)

Do we need this separate require block?

@ulucinar (Collaborator, author)

Done. It looks like go mod tidy populated those dependencies in a separate block. I manually merged the two require blocks and ran go mod tidy again.
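For reference, since Go 1.17 `go mod tidy` records indirect dependencies in their own require block, marked with `// indirect` comments; merging by hand yields a single block like the following (module paths and versions here are illustrative, apart from the yaml line visible in the diff above):

```
require (
	github.com/example/somedep v1.2.3 // illustrative direct dependency
	sigs.k8s.io/yaml v1.3.0
)
```

Running `go mod tidy` again after such a manual merge is safe: it keeps the requirements but re-normalizes the `// indirect` markers.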

return errors.Wrap(err, "failed to parse HTML")
}

if err := r.scrapePrelude(doc, config.PreludeXPath); err != nil {
@muvaf (Member)

Seems like all scrape functions require the doc and an XPath string. Do you think there is room for a common interface with a Scrape(r *Resource, doc, xpath string) error function, which would let us write unit tests focused on single units?

@ulucinar (Collaborator, author)

I have not included unit tests for the scrape functions, as they are not exported (only the registry.ProviderMetadata.ScrapeRepo and registry.ProviderMetadata.Store methods are currently exported). Let's consider this a future enhancement.

// parse prelude
nodes := htmlquery.Find(doc, preludeXPath)
rawData := ""
if len(nodes) > 0 {
@muvaf (Member)

nitpick: seems like a chance to return early here.

@ulucinar (Collaborator, author)

Done.

codeNodes := htmlquery.Find(doc, fieldXPath)
for _, n := range codeNodes {
attrName := ""
doc := r.scrapeDocString(n, &attrName, processed)
@muvaf (Member)

doc is already in use here; this declaration shadows the outer variable.

@ulucinar (Collaborator, author) commented Jul 8, 2022

Done. Renamed as docStr.

if doc == "" {
continue
}
r.AddArgumentDoc(attrName, doc)
@muvaf (Member)

nitpick: does this deserve its own function given it's used only here?

@ulucinar (Collaborator, author)

Done. Embedded the function.

}

func (r *Resource) addExampleManifest(file *hcl.File, body *hclsyntax.Block) error {
refs, err := r.findReferences("", file, body)
@muvaf (Member)

I still think that the metadata document should be as raw as it can be, and that we can do everything references-related in a single place, which in this PR is Upjet itself. But changing it won't have much effect on the authoring experience at the moment, even though I believe it would have been simpler to reason about how Upjet works. Opened an issue to have a place to drop feedback about it.

}
if b.Labels[0] != *resourceName {
if exactMatch {
dependencies[depKey] = m
@muvaf (Member)

From our chat, I remember that having a separate dependencies array doesn't make much difference for the later stages, so we could possibly remove it to reduce the complexity here. Opened an issue to track the discussion.

continue
}
}
r.Name = *resourceName
@muvaf (Member)

In https://github.com/upbound/official-providers/pull/312, I see that all entries have the same value for name, such as aws_acm_certificate_validation in provider-aws. Is that intentional?

@ulucinar (Collaborator, author)

From the markdown files, we scrape the resource name first from the title of the documentation page; then, if any example manifests are available, we use what's in that HCL configuration's block label. In the past I have seen some cases where the title did not correctly reflect the resource name, but for most resources the title and the extracted resource name match. When/if we generate per-resource metadata files, we can get rid of the duplicated map keys.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@ulucinar (Collaborator, author) commented Jul 8, 2022

Thanks @muvaf.

@ulucinar ulucinar merged commit 7d3d113 into crossplane:main Jul 8, 2022
@ulucinar ulucinar deleted the fix-providers-19 branch July 8, 2022 16:10