Stop making validation tests for non-HTML content #2623

Conversation

@lunkwill42 (Member) commented May 25, 2023

The web crawler would generate HTML validation tests for all reachable pages and mark them as *skipped* if the page was blacklisted or the retrieved page had a content-type that doesn't appear to be HTML-related.

Skipped tests should be an indicator of some unusual situation or missing precondition that prevents a test from running. However, non-HTML pages can never be validated as HTML, so for them the *normal* situation is that these tests are useless. Since a lot of the crawled content isn't actually HTML, test reports end up with a large number of SKIP results, which only serve to hide the results of tests that were skipped for *factually* unexpected or extraordinary reasons.

This change will filter crawled pages based on blacklist status and content-type before the result is used to parametrize the `test_page_should_be_valid_html` test, as sketched below.
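To make the before/after concrete, here is a purely illustrative pytest sketch. The `Page` class, the `BLACKLIST` constant and the `looks_like_html()` helper are assumptions invented for the example and are not NAV's actual code; only `crawl()`, `crawl_only_html()` and the test name are taken from this PR.

```python
# Illustrative sketch only: Page, BLACKLIST and looks_like_html are assumptions,
# not NAV's actual code.
from dataclasses import dataclass

import pytest

BLACKLIST = []  # hypothetical module-level list of URLs excluded from HTML validation


@dataclass
class Page:
    url: str
    content_type: str
    content: str = ""


def crawl():
    """Crawl the site and return every reachable page (stubbed out in this sketch)."""
    return []


def looks_like_html(page):
    """Guess from the Content-Type whether a page is HTML."""
    return page.content_type.startswith("text/html")


# Before this change: every crawled page became a validation test,
# and most of them ended up as SKIP results.
@pytest.mark.parametrize("page", crawl())
def test_page_should_be_valid_html_before(page):
    if page.url in BLACKLIST or not looks_like_html(page):
        pytest.skip("page should not be validated as HTML")
    ...  # run page.content through an HTML validator


# After this change: pages that should not be validated never become tests at all.
def crawl_only_html():
    return [page for page in crawl()
            if page.url not in BLACKLIST and looks_like_html(page)]


@pytest.mark.parametrize("page", crawl_only_html())
def test_page_should_be_valid_html(page):
    ...  # run page.content through an HTML validator
```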

github-actions bot commented May 25, 2023

Test results

12 files, 12 suites, 12m 32s ⏱️
3 152 tests: 3 152 passed ✔️, 0 skipped 💤, 0 failed
8 931 runs: 8 931 passed ✔️, 0 skipped 💤, 0 failed

Results for commit 40ed7e3.

♻️ This comment has been updated with latest results.

codecov bot commented May 25, 2023

Codecov Report

Merging #2623 (40ed7e3) into master (4672022) will increase coverage by 0.31%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2623      +/-   ##
==========================================
+ Coverage   54.20%   54.52%   +0.31%     
==========================================
  Files         558      558              
  Lines       40634    40644      +10     
==========================================
+ Hits        22026    22160     +134     
+ Misses      18608    18484     -124     

see 14 files with indirect coverage changes


@johannaengland (Contributor) left a comment

https://github.com/Uninett/nav/pull/2623/files#diff-3fe8b88842890c3fc40bf5de74cf87a0bf99537ed5ae7c50f35958ebebc44ff5L265-L266

This could be removed since we call `should_validate` in `crawl_only_html`, or am I not understanding this correctly?

@lunkwill42 (Member, Author) commented:

> https://github.com/Uninett/nav/pull/2623/files#diff-3fe8b88842890c3fc40bf5de74cf87a0bf99537ed5ae7c50f35958ebebc44ff5L265-L266
>
> This could be removed since we call `should_validate` in `crawl_only_html`, or am I not understanding this correctly?

Indeed, you are correct. I was probably thinking "better safe than sorry", but this condition (and the one above it) should never be triggered any more.

1. As mentioned in review comments, `test_page_should_be_valid_html` no longer needs to test whether a page should be validated, since its input is now guaranteed to be filtered.

2. `should_validate()` now performs both filtering checks: a blacklisted page should not be validated, and a non-HTML page should not be validated.

3. With the above changes, `crawl_only_html()` can now be refactored to a one-liner.
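Reusing the hypothetical names from the sketch under the description above (again, an illustration rather than NAV's literal code), those three points could amount to roughly this:

```python
def should_validate(page):
    """Both filtering checks in one place: not blacklisted, and HTML-like content."""
    return page.url not in BLACKLIST and looks_like_html(page)


def crawl_only_html():
    """Now essentially a one-liner: crawl everything, keep only validatable pages."""
    return [page for page in crawl() if should_validate(page)]


@pytest.mark.parametrize("page", crawl_only_html())
def test_page_should_be_valid_html(page):
    # No blacklist or content-type check needed here; the input is pre-filtered.
    ...  # validate page.content as HTML
```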
sonarcloud bot commented May 26, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
Duplication: 0.0%

@hmpf (Contributor) left a comment

This all looks sensible except for the names. Approved as is, but as later polish, rename `crawl`, `crawl_only_html` and `should_validate`. Naming, it is hard, yes.

`should_validate` checks if something is crawlable, so `is_crawlable` is a better name.

`crawl_only_html` makes the actual list of urls to crawl, so maybe `urls_to_crawl`, `crawl` or `filtered_crawl`.

The actual crawling is now done by `crawl`. It is only used by `crawl_only_html`, so it should at least start with an underscore.

@lunkwill42 (Member, Author) commented:

> This all looks sensible except for the names. Approved as is, but as later polish, rename `crawl`, `crawl_only_html` and `should_validate`. Naming, it is hard, yes.
>
> `should_validate` checks if something is crawlable, so `is_crawlable` is a better name.
>
> `crawl_only_html` makes the actual list of urls to crawl, so maybe `urls_to_crawl`, `crawl` or `filtered_crawl`.
>
> The actual crawling is now done by `crawl`. It is only used by `crawl_only_html`, so it should at least start with an underscore.

You might get this sense of things if you only read the diff and not the whole code, @hmpf.

`crawl` performs the full crawl of the web site and returns all pages that it found a link to. `crawl` is still used as the input parameter to `test_link_should_be_reachable`, as we do want to test all (internal) links for reachability.

`should_validate` does not verify that something is crawlable, it verifies whether a page that has already been crawled should be validated as HTML (images, CSS files, JavaScript should not be validated as HTML, but they should still be tested for reachability by `test_link_should_be_reachable`).
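In terms of the earlier illustrative sketch, the two tests are simply parametrized from different sets; only the test and function names here are taken from the discussion:

```python
# Every reachable page, HTML or not, is still tested for reachability ...
@pytest.mark.parametrize("page", crawl())
def test_link_should_be_reachable(page):
    ...  # assert the page was retrieved successfully


# ... but only HTML pages are parametrized into the validation test.
@pytest.mark.parametrize("page", crawl_only_html())
def test_page_should_be_valid_html(page):
    ...  # validate page.content as HTML
```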

@lunkwill42 merged commit 08236b7 into Uninett:master on Jun 5, 2023 (13 checks passed).
@lunkwill42 deleted the test/do-not-generate-validation-tests-for-non-html branch on June 5, 2023, 08:37.
@hmpf (Contributor) commented Jun 5, 2023

> This all looks sensible except for the names.
>
> You might get this sense of things if you only read the diff and not the whole code, @hmpf.

I read more than the diff, but not everything, no.

> `crawl` performs the full crawl of the web site and returns all pages that it found a link to. `crawl` is still used as the input parameter to `test_link_should_be_reachable`, as we do want to test all (internal) links for reachability.

It looks like it is crawled twice this way. Unless pytest runs the fixture before any tests use it?

> `should_validate` does not verify that something is crawlable, it verifies whether a page that has already been crawled should be validated as HTML (images, CSS files, JavaScript should not be validated as HTML, but they should still be tested for reachability by `test_link_should_be_reachable`).

So rename it `should_be_valid_html`. When are pages put on the blacklist?

@lunkwill42 (Member, Author) commented:

> `crawl` performs the full crawl of the web site and returns all pages that it found a link to. `crawl` is still used as the input parameter to `test_link_should_be_reachable`, as we do want to test all (internal) links for reachability.
>
> It looks like it is crawled twice this way. Unless pytest runs the fixture before any tests use it?

`crawl()` caches its output and responds from the cache on subsequent calls (after all, `WebCrawler` was written specifically for use with our test suite).
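For illustration only, such caching could be as simple as memoizing the first crawl. The attribute and method names below (other than `crawl()`) are assumptions, not `WebCrawler`'s actual implementation:

```python
class WebCrawler:
    def __init__(self, base_url):
        self.base_url = base_url
        self._seen_pages = None  # filled on the first crawl, reused afterwards

    def crawl(self):
        """Crawl the site once; subsequent calls are answered from the cache."""
        if self._seen_pages is None:
            self._seen_pages = list(self._do_crawl())
        return self._seen_pages

    def _do_crawl(self):
        """The actual crawling logic, omitted in this sketch."""
        return []
```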

> `should_validate` does not verify that something is crawlable, it verifies whether a page that has already been crawled should be validated as HTML (images, CSS files, JavaScript should not be validated as HTML, but they should still be tested for reachability by `test_link_should_be_reachable`).
>
> So rename it `should_be_valid_html`.

Agreed.

I have another branch where I am re-working the test suite to properly use fixtures for external dependencies, so that we don't need to run the entire test suite from within a very specific Docker environment. I have reworked the crawler tests somewhat there, and any such naming cleanups would fit nicely there.

> When are pages put on the blacklist?

The blacklist is actually a global constant of the module. It's currently empty, but IIRC it's there because it once used to include pages from 3rd party tools that were mounted on the NAV site (i.e. we had no control over their HTML, but they still needed to be reachable in a complete site).
