Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 60 additions & 16 deletions build_docs.pl
Original file line number Diff line number Diff line change
Expand Up @@ -331,6 +331,7 @@ sub check_links {
$link_checker->check;

check_kibana_links( $build_dir, $link_checker ) if exists $Conf->{repos}{kibana};
check_elasticsearch_links( $build_dir, $link_checker ) if exists $Conf->{repos}{elasticsearch};
if ( $link_checker->has_bad ) {
say $link_checker->report;
}
Expand All @@ -350,22 +351,6 @@ sub check_kibana_links {

say "Checking Kibana links";

# ${baseUrl}guide/en/elasticsearch/reference/${urlVersion}/modules-scripting-expression.html
# ${ELASTIC_WEBSITE_URL}guide/en/beats/filebeat/${DOC_LINK_VERSION}
# ${ELASTIC_DOCS}search-aggregations-bucket-datehistogram-aggregation.html
# ${ELASTICSEARCH_DOCS}update-transform.html
# ${KIBANA_DOCS}canvas.html
# ${PLUGIN_DOCS}repository-s3.html
# ${FLEET_DOCS}fleet-overview.html
# ${APM_DOCS}overview.html
# ${STACK_DOCS}upgrading-elastic-stack.html
# ${SECURITY_SOLUTION_DOCS}sec-requirements.html
# ${STACK_GETTING_STARTED}get-started-elastic-stack.html
# ${APP_SEARCH_DOCS}authentication.html
# ${ENTERPRISE_SEARCH_DOCS}authentication.html
# ${WORKPLACE_SEARCH_DOCS}workplace-search-getting-started.html
# ${MACHINE_LEARNING_DOCS}machine-learning-intro.html

my $extractor = sub {
my $contents = shift;
return sub {
Expand Down Expand Up @@ -453,6 +438,65 @@ sub check_kibana_links {
}
}

#===================================
sub check_elasticsearch_links {
#===================================
my $build_dir = shift;
my $link_checker = shift;
my $branch;
my $version;

say "Checking Elasticsearch links";

# Grab URLs from the JSON file. This is lame, but we sort of need to parse
# using regexes because that's what the rest of the infrastructure expects.
# So we grab all quoted strings that contain `html`. This *should* be fine
# for a while because the keys in the file are all in SHOUTING_SNAKE_CASE
# so even if one contains "html" it'll contain "HTML" which doesn't match.
my $extractor = sub {
my $contents = shift;
return sub {
while ( $contents =~ m!"([^"\#]+)(?:\#([^"]+))?"!g ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing filenames out of JSON with Regular expressions is fun :-), but not any worse than what we're already doing with TypeScript for Kibana.

If I had to summarize what this did, I would say it:

  • pulls out any quoted elements in the JSON, splitting off any URL fragment (part after the "#" if it exists)
  • If the path contains "html" consider it a path to a file that should get checked.

Is that about right? Is it worth adding a comment with a bit of this detail?

This is probably robust enough. Looking at the current file the keys are all UPPER_SNAKE_CASE so won't match the lower case "html", correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. a comment is good. I could pull in a json parsing library easy enough. but then I'd have to emit this into a line per url to check and use a url parser to see it because the link checker really wants that. I think. It does feel dirty. But you got it right. Including the SHOUTING_SNAKE_CASE defense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of questions:

  • Would a different format in the ES source make this more robust? I used JSON mainly because JSON is easy in ES, but we could do something else for sure. YAML permits unquoted strings for instance. Or, have we run any Gradle stuff at this point? If so we could have a Gradle task that extracts just the bits you need into a more pleasant format.
  • Does this mean we do not validate the fragment ID (the bit after the #)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One line per link in a file at the top level would amazing. But we can deal. I really could have imported a json parser, but it would have made more work in perl which I'm not familiar with.

We are certainly supposed to be testing the fragment. We should be selecting it out here, but I admit to having fought with the silly regex for longer than I'd like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are certainly supposed to be testing the fragment.

Ok that's good enough for me :) I admit I gave up trying to parse the regex myself.

Copy link
Contributor

@DaveCTurner DaveCTurner Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I thought I'd just try this in elastic/elasticsearch#94794 elastic/elasticsearch#94815 and it looks like the docs build passes even though I used an invalid fragment. It does check the bit before the # tho.

Edit: opened a dedicated test PR at elastic/elasticsearch#94815

my $path = $1;
next unless $path =~ m!html!;
return "en/elasticsearch/reference/$version/$path";
}
return;
};
};

my $src_path = 'server/src/main/resources/org/elasticsearch/common/reference-docs-links.json';
my $repo = ES::Repo->get_repo('elasticsearch');

my @versions = sort map { $_->basename }
grep { $_->is_dir } $build_dir->subdir('en/elasticsearch/reference')->children;

my $link_check_name = 'link-check-elasticsearch';

for (@versions) {
$version = $_;
# check versions after 8.6
next if $version eq 'current' || $version =~ /^(\d+)\.(\d+)/ && ($1 lt 8 || ($1 eq 8 && $2 lt 7));
# @versions is looping through the directories in the output (which
# still contains `master`), but we need to look in the `main` branch of
# the ES repo for this file.
#
# TODO: remove as part of
# https://github.com/elastic/docs/issues/2264
$branch = $version eq "master" ? "main" : $version;
say " Branch: $branch, Version: $version";
my $source = $repo->show_file( $link_check_name, $branch, $src_path );

$link_checker->check_source( $source, $extractor,
"Elasticsearch [$version]: $src_path" );

# Mark the file that we need for the link check done so we can use
# --keep_hash with it during some other build.
$repo->mark_done( $link_check_name, $branch, $src_path, 0 );
}
}


#===================================
sub build_entries {
#===================================
Expand Down
1 change: 0 additions & 1 deletion lib/ES/LinkCheck.pm
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,6 @@ sub check_source {
my $seen = $self->seen;

while ( my ( $path, $fragment ) = $link_it->() ) {

my $dest = $self->root->file($path);
unless ( $self->_file_exists( $dest, $path ) ) {
$self->add_bad( $file_descr, $path );
Expand Down