Add a test for gallery RSS content #3676

tartley · 2023-03-28T04:34:09Z

Pull Request Checklist

I’ve read the guidelines for contributing.
I updated AUTHORS.txt and CHANGES.txt (if the change is non-trivial) and documentation (if applicable).
I tested my changes.

Description

I added a new test, which asserts about the content of the samplesite gallery RSS feed.

This PR advances issue #3671, by completing the checklist item "Add a test of gallery RSS content".

I believe this PR is complete and mergeable.

FIXED now: There's a build fail. The RSS 'publish-date' of the recently added .webp image is not the same as the text expects.
REBASE done: I realized this PR also contained the diffs from my other branch. I don't use github regularly and forgot it behaves this way (my daily repo host, Launchpad, is smarter about this). I just waited for the other PR to merge, rebased this one, and all is good.
FIXED now: I also just realized the new test's use of 'zip' is faulty. Zip will stop at the end of the shortest of the two iterables. So if the 'actual' result is too short, we will just compare the items that are present, and then silently stop, and pass.
FIXED: There was another build failure: In CI on Windows the inferred timestamp of one of the other images was an hour off. At first I thought this was a timezone/daylight savings bug. But it turned out to be confusion of use of forward/backslashes on Windows during the generation of excluded_image_list, causing the exclude.meta file not to work properly, hence the first image in the gallery was not the expected one, and had an unexpected timestamp. This is all fixed in this MP, and discussed below.
FIXED: Currently my parsing of the RSS file returns empty enclosures. I want the test to get access to the information in the enclosures so it can assert about the mimetypes in there, so that this test will be useful when I make the imminent code change to fall-back to an "application/octet-stream" mime type (in a subsequent PR, or maybe this one, I don't have strong opinions).
FIXED: One Windows, various parts of the RSS feed like the item GUIDs and URLs erroneously used platform specific directory separators (ie backslashes). These have all been fixed to use forward slashes.
There are outstanding parts of the new test marked 'TODO'. My tentative understanding is that they mark parts of the test that demonstrate buggy behavior, and we ought to change the implementation and change these parts of the test to match. But to avoid biting off too much, other than the above, the aim of this PR is to just introduce a test that demonstrates existing behavior as-is, and then we can discuss other potential fixes later, in a separate PR.

tartley · 2023-03-29T13:03:45Z

Update on the test's failure to parse the enclosures in the generated RSS: This seems to be caused by a problem in the RSS parser I chose, "rss_parser". I do not understand how its parsing of enclosures can ever work. I filed yet another verbose issue there, but I think perhaps the situation there is nontrivial, so I don't have much hope of seeing that resolved soon. I think maybe I made a mistake by selecting someone's hobby RSS parser, rather than something that's actually currently usable. (I searched for "python rss parser", it was late and I clicked the top result.) Current plan is to switch to another RSS parser, maybe "feedparser".

tartley · 2023-03-29T23:43:16Z

Update on the CI failure in Python 3.11 on Windows latest. The image timestamp is off by one hour, and GMT clocks sprang forward an hour on Sunday night. Is it a timezone/daylight savings bug?

tartley · 2023-03-30T14:19:06Z

I fixed the failure of the new test to parse RSS enclosures by switching RSS parsing library, from parse-rss to feedparser. Seems to work now.

The only wrinkle is that it doesn't seem to automatically return the gallery 'description' field (which is empty in our data, but i'd like to assert about it so that maybe later we could fill it with something, like the gallery description text.)

Kwpolska · 2023-04-08T18:24:45Z

If the issue is with dates and times only, it might be worth it to ignore it somehow (e.g. by removing the date/time field altogether).

I have a few outstanding questions, marked 'TODO', to resolve in code review.

The publish date on the webp image is now different, since we rebased onto a change in the preceding branch, which correctly populates this EXIF field on the new webp file. This affects the order if the items in the RSS feed, and the title of the RSS feed. (possibly that last part is a bug? But it is not today's problem.)

The use of zip here was faulty. Zip stops iterating at the length of the shortest given iterable. So if the actual RSS feed contained initial correct items, but then ended early (including the degenerate case of being totally empty), then the zip would iterate over the number of items in 'parsed.feed', comparing them to the initial items of 'expected_items', and then would silently stop iterating and the test would erroneously pass. Using zip_longest guards against that. Adding a default value to `to_dict()`'s use of getattr, and a message to the assertion at the end of test_gallery_rss produces much nicer error messages on failure.

The previous library, rss_parser, doesn't successfully parse enclosures: dhvcc/rss-parser#37 I switched to feedparser, changed the test to use their format of returned parsed data structure. I still need to tell it to grab the (custom?) field of the gallery description, but everything else seems to work.

The new test now makes assertions about the content of the enclosure (a data structure about linked items, ie the images in this case) in each RSS feed item. This will enable us, in a later commit, to test the fallback to a different MIME type when a file isn't recognized.

The new gallery RSS test can get access to the 'description' field in the feed XML. It is a field labelled 'subtitle' in the parsed feed data. Delete the TODO related to this item. Delete some debug printing. Add a TODO wondering whether the gallery description should be blank, or should be populated with the content from the gallery index.txt.

Use new variable BUILDTIME to explicitly show that the RSS output feed.updated field derives its value from the time the `nikola build` was done.

(eg. on Windows, comparison is case-insensitive, and converts slashes into backslashes)

tartley · 2023-04-12T04:49:59Z

I pushed some test changes which give better error messages. Turns out, the test failure due to an off-by-one-hour timestamp isn't a daylight savings problem at all. It is caused by the gallery's exclude.meta file not working properly on Windows. Hence, the excluded image tesla2_lg.jpg is still included in the gallery RSS. (Incidentally, this image has an EXIF 'created' timestamp which is one hour less than the expected first RSS item, which is why we were seeing a one hour time difference exception earlier.) It's late, so I won't make any decisions now, but I think the options are: (a) This PR could include a fix for this, if it's trivial, or (b) This PR could change the test to exercise the current erroneous behavior and pass.

tartley · 2023-04-12T04:54:51Z

A pure guess: Might it be nikola/plugins/task/galleries.py:547, in get_excluded_images():

excluded_image_list = ["{0}/{1}".format(gallery_path, i) for i in excluded_image_name_list]

The / in the format template perhaps causes problems on Windows when the image removal task runs.

The newly-added test fails because the gallery image tesla2_lg.jpg should be excluded from the gallery - it is mentioned in exclude.meta. But on Windows, it does not get excluded, so the gallery RSS feed includes one erroneous extra item.

tartley · 2023-04-12T13:28:23Z

OK, so, this is interesting. The 'tentative fix' commit, above, did fix the previous test failure, ie:

The test failure was that the 'id' entry of each item in the RSS feed was unexpected. On linux/mac, it is:

'galleries/demo/tesla4_lg.jpg'

but on Windows, it was:

'galleries\\demo\\tesla4_lg.jpg'

i.e. with single backslashes (the above line shows them escaped, as the failing test output does).

The 'tentative fix' commit is for galleries.py to generate the same 'id' on Windows as it does elsewhere, i.e. use os.path.join instead of {0}/{1}".format(...), as seems common elsewhere in galleries.py. I'm not at all sure this is the right thing to do. Maybe the 'id' really should be different on Windows, to match the filename. Advice welcome. I guess my current thinking is that this PR should be conservative, and not change behavior unless I'm absolutely sure it is a bug, so I'll probably modify this to restore the original behavior, and make the test demonstrate whatever that is.

Also, now with that fix in place, there is a new, similar failure:

FAILED tests/integration/test_demo_build.py::test_gallery_rss - AssertionError: Item 0 mismatch: link: [https://example.com/galleries\demo\tesla4_lg.jpg](https://example.com/galleries/demo/tesla4_lg.jpg) != https://example.com/galleries/demo/tesla4_lg.jpg
assert 'https://exam...tesla4_lg.jpg/' == 'https://exam...tesla4_lg.jpg/'
  - https://example.com/galleries/demo/tesla4_lg.jpg
  ?                              ^    ^
  + [https://example.com/galleries\demo\tesla4_lg.jpg](https://example.com/galleries/demo/tesla4_lg.jpg)
  ?                              ^    ^

The same chain of reasoning applies: I'm pretty certain the windows version is not a valid URL, but it works because browsers will helpfully massage the backslashes into forward slashes and resolve the resulting URL. So probably the ultimately right thing to do is fix the URL generation on Windows? But in the absence of strong advice from the project owners, this PR should probably be conservative, preserve current behavior, and tweak the test to demonstrate that (with a "TODO" comment in the test that I think this might be a bug.)

Thoughts welcome. More as it happens. Thanks people!

Prepare test to diagnostically output the actual results on Windows, so that subsequent commits can modify the test to expect the actual current behavior.

Kwpolska · 2023-04-12T20:27:10Z

The URL generation on Windows is probably buggy and should be fixed, even if that would mean some minor regressions in places depending on the buggy behavior (e.g. RSS feed GUIDs). We usually don’t pay much attention to pull requests being small or very self-contained.

Due to a probable bug, images in the gallery exclude.meta are not excluded on Windows. The test now expects this, to demonstrate the current behavior.

...instead of dict literals, for consistency with the other larger dicts in the same test.

Windows RSS item ids are now of same form as on other OSes, ie: 'galleries/demo/tesla_tower1_lg.jpg' Formerly these used backslashes on Windows, which caused a bug where excluded images (listed in exclude.meta), were not excluded because their id didn't match. This means I've removed a big ugly TODO from the test, which no longer expects these excluded items to appear in the RSS feed on Windows. This is a re-application of the 'tentative fix' that was applied and reverted earlier in this PR.

tartley · 2023-07-04T15:27:35Z

@Kwpolska: The URL generation on Windows is probably buggy and should be fixed, even if that would mean some minor regressions in places depending on the buggy behavior (e.g. RSS feed GUIDs).

Thank you for the advice! This is now fixed, and I believe this PR is complete and AFAIK can be merged.

Kwpolska · 2023-07-06T20:48:58Z

nikola/nikola.py

@@ -1926,7 +1927,8 @@ def path(self, kind, name, lang=None, is_link=False, **kwargs):
            else:
                return link
        else:
-            return os.path.join(*path)
+            # URLs should always use forward slash separators, even on Windows
+            return pathlib.PurePosixPath(*path)


This should return a string, not a pathlib.Path instance.

Ahar, fair point, thank you. I was curious about why this wasn't causing problems in the tests, so I had a quick look around and it seems like almost every call to this function is using the return value as a component passed to os.path.join(), which accepts pathlib.Paths. So that explains why the problem isn't visibly manifesting. But yes, you are right, future callers to this (and maybe the existing one in Nikola.link()) might barf on pathlib.Paths. So I'll convert it to str, and swallow my idealism over this fix not requiring a test change.

tartley mentioned this pull request Mar 28, 2023

WebP images in a gallery cause an unhandled exception during build #3671

Closed

9 tasks

tartley force-pushed the test-gallery-rss branch from af46402 to 0d3819a Compare March 29, 2023 00:57

tartley marked this pull request as draft March 30, 2023 14:28

tartley added 9 commits April 11, 2023 22:13

Add a test of the Gallery RSS content

91a68b7

I have a few outstanding questions, marked 'TODO', to resolve in code review.

Rename variable. Add TODO comment.

badea00

Linting

457b8ac

Use common var to show origin of output timestamp

2f43d3c

Use new variable BUILDTIME to explicitly show that the RSS output feed.updated field derives its value from the time the `nikola build` was done.

tartley force-pushed the test-gallery-rss branch from 107a057 to 2f43d3c Compare April 12, 2023 03:14

tartley added 4 commits April 11, 2023 22:28

Add some diagnostic prints for Windows CI build

4a509b5

More CI test diagnostics

fc26715

Test OS normalizes generated paths

f30c9f0

(eg. on Windows, comparison is case-insensitive, and converts slashes into backslashes)

lint

473e9ab

Restore original behavior on Windows

38bca3d

Prepare test to diagnostically output the actual results on Windows, so that subsequent commits can modify the test to expect the actual current behavior.

tartley added 3 commits April 12, 2023 19:12

Reword TODO comments for clarity

e997e3f

Test now expects RSS exclude.meta bug on Windows

881e368

Due to a probable bug, images in the gallery exclude.meta are not excluded on Windows. The test now expects this, to demonstrate the current behavior.

Format test expected data dicts as constructors

339aca1

...instead of dict literals, for consistency with the other larger dicts in the same test.

Remove a TODO that apparently needs no work

10d1fe5

tartley force-pushed the test-gallery-rss branch from 9dd464c to 10d1fe5 Compare April 13, 2023 00:16

tartley added 4 commits April 12, 2023 19:24

Merge branch 'master' into test-gallery-rss

a0c5199

Use forward slash in RSS GUIDs on all platforms

3495d39

Lint fix

8944d20

tartley force-pushed the test-gallery-rss branch 2 times, most recently from 7899cad to ad03124 Compare July 4, 2023 00:58

Use forward slashes in RSS links fields

2b40c40

tartley force-pushed the test-gallery-rss branch 2 times, most recently from 20aa60a to 66da315 Compare July 4, 2023 04:35

Fix source of backslashes in URLs

ff131b8

tartley force-pushed the test-gallery-rss branch from 66da315 to ff131b8 Compare July 4, 2023 05:26

Fix conversion of paths to use forward slashes

e15f2b9

tartley marked this pull request as ready for review July 4, 2023 15:16

Kwpolska reviewed Jul 6, 2023

View reviewed changes

Return values from Nikola.path should be str

03b1fce

Kwpolska approved these changes Jul 7, 2023

View reviewed changes

Kwpolska merged commit 84f6a2b into getnikola:master Jul 7, 2023
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a test for gallery RSS content #3676

Add a test for gallery RSS content #3676

tartley commented Mar 28, 2023 •

edited

Loading

tartley commented Mar 29, 2023 •

edited

Loading

tartley commented Mar 29, 2023

tartley commented Mar 30, 2023

Kwpolska commented Apr 8, 2023

tartley commented Apr 12, 2023

tartley commented Apr 12, 2023

tartley commented Apr 12, 2023 •

edited

Loading

Kwpolska commented Apr 12, 2023

tartley commented Jul 4, 2023

Kwpolska Jul 6, 2023

tartley Jul 7, 2023 •

edited

Loading

Add a test for gallery RSS content #3676

Add a test for gallery RSS content #3676

Conversation

tartley commented Mar 28, 2023 • edited Loading

Pull Request Checklist

Description

tartley commented Mar 29, 2023 • edited Loading

tartley commented Mar 29, 2023

tartley commented Mar 30, 2023

Kwpolska commented Apr 8, 2023

tartley commented Apr 12, 2023

tartley commented Apr 12, 2023

tartley commented Apr 12, 2023 • edited Loading

Kwpolska commented Apr 12, 2023

tartley commented Jul 4, 2023

Kwpolska Jul 6, 2023

Choose a reason for hiding this comment

tartley Jul 7, 2023 • edited Loading

Choose a reason for hiding this comment

tartley commented Mar 28, 2023 •

edited

Loading

tartley commented Mar 29, 2023 •

edited

Loading

tartley commented Apr 12, 2023 •

edited

Loading

tartley Jul 7, 2023 •

edited

Loading