Add metadata scraper #27
Conversation
Thanks a lot for this PR @ulucinar! It's quite challenging to follow the recursive traversals in most of the functions here; would you mind adding unit tests so that we don't break them accidentally and can see immediately what we expect them to do?
pkg/registry/resource.go
Outdated
@@ -61,12 +61,17 @@ type Resource struct {
	// external name used in the generated example manifests to be
	// overridden for a specific resource via configuration.
	ExternalName string `yaml:"-"`
	scrapeConfig *ScrapeConfiguration
This seems to be intended as per-resource, but in practice we set it to the same value for all resources. I wonder how it would look if we had ScrapeConfiguration include the existing fields as well as the XPath configs, and used it only as input to the ScrapeRepo function call. That way registry.ProviderMetadata and registry.Resource would be free of scraping-methodology-specific details. What do you think?
Removed the scraper configuration fields from registry.Resource and registry.ProviderMetadata. Thanks @muvaf!
Opened a couple of issues for things that may require refactoring, since we need to wrap up the metadata work given our timelines; we can iterate on it later. Dropped a few non-blocking comments, but LGTM! Thanks!
go.mod
Outdated
@@ -25,6 +28,13 @@ require (
	sigs.k8s.io/yaml v1.3.0
)

require (
Do we need this separate require block?
Done. Looks like go mod tidy populated those dependencies in a separate block. Manually merged the two require blocks and ran go mod tidy again.
	return errors.Wrap(err, "failed to parse HTML")
}

if err := r.scrapePrelude(doc, config.PreludeXPath); err != nil {
Seems like all the scrape functions take the doc and an XPath string. Do you think there is room for a common interface with a Scrape(r *Resource, doc, xpath string) error function, which would let us write unit tests focused on single units?
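For illustration, the suggested interface could be sketched like this. The doc parameter is a string here for brevity (in the PR it would be the parsed HTML node), and all concrete type names are hypothetical:

```go
package main

import "fmt"

// Resource is a minimal stand-in for registry.Resource.
type Resource struct {
	Title string
}

// Scraper is the common interface suggested above: each scrape step
// takes the doc and an XPath string, so each can be tested in isolation.
type Scraper interface {
	Scrape(r *Resource, doc, xpath string) error
}

// preludeScraper is one hypothetical implementation of a scrape step.
type preludeScraper struct{}

func (preludeScraper) Scrape(r *Resource, doc, xpath string) error {
	if doc == "" {
		return fmt.Errorf("empty document for xpath %q", xpath)
	}
	// Pretend the XPath query matched the title node.
	r.Title = doc
	return nil
}

func main() {
	r := &Resource{}
	var s Scraper = preludeScraper{}
	if err := s.Scrape(r, "aws_s3_bucket", "//h1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.Title)
}
```

Each implementation could then get a small table-driven unit test without exercising the whole recursive traversal.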
I have not included unit tests for the scrape functions as they are not exported (only the registry.ProviderMetadata.ScrapeRepo and registry.ProviderMetadata.Store methods are currently exported). Let's consider this as a future enhancement.
pkg/registry/meta.go
Outdated
// parse prelude
nodes := htmlquery.Find(doc, preludeXPath)
rawData := ""
if len(nodes) > 0 {
nitpick: seems like a chance to return early here.
Done.
pkg/registry/meta.go
Outdated
codeNodes := htmlquery.Find(doc, fieldXPath)
for _, n := range codeNodes {
	attrName := ""
	doc := r.scrapeDocString(n, &attrName, processed)
doc seems to be already used.
Done. Renamed as docStr.
pkg/registry/meta.go
Outdated
	if doc == "" {
		continue
	}
	r.AddArgumentDoc(attrName, doc)
nitpick: does this deserve its own function given it's used only here?
Done. Embedded the function.
}

func (r *Resource) addExampleManifest(file *hcl.File, body *hclsyntax.Block) error {
	refs, err := r.findReferences("", file, body)
I still think the metadata document should be as raw as it can be, and that we should do everything references-related in a single place, which in this PR is Upjet itself. Changing it won't have much effect on the authoring experience at the moment, though I believe it would have been simpler to reason about how Upjet works. Opened this issue to have a place to collect feedback about it.
}
if b.Labels[0] != *resourceName {
	if exactMatch {
		dependencies[depKey] = m
From our chat, I remember that having a separate dependencies array doesn't make much difference for the later stages so we could possibly remove it to reduce the complexity here. Opened this issue to track the discussion.
		continue
	}
}
r.Name = *resourceName
In https://github.com/upbound/official-providers/pull/312, I see that all entries have the same value for name, such as aws_acm_certificate_validation in provider-aws. Is that intentional?
From the markdown files, we first scrape the resource name from the title of the documentation page, and then, if any example manifests are available, use what's in that HCL configuration's block label. I think in the past I have seen cases where the title did not correctly reflect the resource name, but for most resources the title and the extracted resource name match. When/if we generate per-resource metadata files, we can get rid of the duplicated map keys.
Thanks @muvaf.
Description of your changes
Related to https://github.com/upbound/official-providers/issues/19
Depends on #19
Adds a metadata scraper, which is invoked during provider code generation to generate a provider metadata file.
I have:
- Run make reviewable to ensure this PR is ready for review.
- Added backport release-x.y labels to auto-backport this PR if necessary.
How has this code been tested
Enabled the example generation pipelines in this PR.