Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a test for gallery RSS content #3676

Merged
merged 27 commits into from
Jul 7, 2023
Merged

Conversation

tartley
Copy link
Contributor

@tartley tartley commented Mar 28, 2023

Pull Request Checklist

  • I’ve read the guidelines for contributing.
  • I updated AUTHORS.txt and CHANGES.txt (if the change is non-trivial) and documentation (if applicable).
  • I tested my changes.

Description

I added a new test, which asserts about the content of the samplesite gallery RSS feed.

This PR advances issue #3671, by completing the checklist item "Add a test of gallery RSS content".

I believe this PR is complete and mergeable.

  • FIXED now: There's a build fail. The RSS 'publish-date' of the recently added .webp image is not the same as the text expects.
  • REBASE done: I realized this PR also contained the diffs from my other branch. I don't use github regularly and forgot it behaves this way (my daily repo host, Launchpad, is smarter about this). I just waited for the other PR to merge, rebased this one, and all is good.
  • FIXED now: I also just realized the new test's use of 'zip' is faulty. Zip will stop at the end of the shortest of the two iterables. So if the 'actual' result is too short, we will just compare the items that are present, and then silently stop, and pass.
  • FIXED: There was another build failure: In CI on Windows the inferred timestamp of one of the other images was an hour off. At first I thought this was a timezone/daylight savings bug. But it turned out to be confusion of use of forward/backslashes on Windows during the generation of excluded_image_list, causing the exclude.meta file not to work properly, hence the first image in the gallery was not the expected one, and had an unexpected timestamp. This is all fixed in this MP, and discussed below.
  • FIXED: Currently my parsing of the RSS file returns empty enclosures. I want the test to get access to the information in the enclosures so it can assert about the mimetypes in there, so that this test will be useful when I make the imminent code change to fall-back to an "application/octet-stream" mime type (in a subsequent PR, or maybe this one, I don't have strong opinions).
  • FIXED: One Windows, various parts of the RSS feed like the item GUIDs and URLs erroneously used platform specific directory separators (ie backslashes). These have all been fixed to use forward slashes.
  • There are outstanding parts of the new test marked 'TODO'. My tentative understanding is that they mark parts of the test that demonstrate buggy behavior, and we ought to change the implementation and change these parts of the test to match. But to avoid biting off too much, other than the above, the aim of this PR is to just introduce a test that demonstrates existing behavior as-is, and then we can discuss other potential fixes later, in a separate PR.

@tartley
Copy link
Contributor Author

tartley commented Mar 29, 2023

Update on the test's failure to parse the enclosures in the generated RSS: This seems to be caused by a problem in the RSS parser I chose, "rss_parser". I do not understand how its parsing of enclosures can ever work. I filed yet another verbose issue there, but I think perhaps the situation there is nontrivial, so I don't have much hope of seeing that resolved soon. I think maybe I made a mistake by selecting someone's hobby RSS parser, rather than something that's actually currently usable. (I searched for "python rss parser", it was late and I clicked the top result.) Current plan is to switch to another RSS parser, maybe "feedparser".

@tartley
Copy link
Contributor Author

tartley commented Mar 29, 2023

Update on the CI failure in Python 3.11 on Windows latest. The image timestamp is off by one hour, and GMT clocks sprang forward an hour on Sunday night. Is it a timezone/daylight savings bug?

@tartley
Copy link
Contributor Author

tartley commented Mar 30, 2023

I fixed the failure of the new test to parse RSS enclosures by switching RSS parsing library, from parse-rss to feedparser. Seems to work now.

The only wrinkle is that it doesn't seem to automatically return the gallery 'description' field (which is empty in our data, but i'd like to assert about it so that maybe later we could fill it with something, like the gallery description text.)

@tartley tartley marked this pull request as draft March 30, 2023 14:28
@Kwpolska
Copy link
Member

Kwpolska commented Apr 8, 2023

If the issue is with dates and times only, it might be worth it to ignore it somehow (e.g. by removing the date/time field altogether).

I have a few outstanding questions, marked 'TODO', to resolve in
code review.
The publish date on the webp image is now different,
since we rebased onto a change in the preceding branch,
which correctly populates this EXIF field on the new
webp file.

This affects the order if the items in the RSS feed,
and the title of the RSS feed. (possibly that last part
is a bug? But it is not today's problem.)
The use of zip here was faulty. Zip stops iterating at the length of the
shortest given iterable. So if the actual RSS feed contained initial
correct items, but then ended early (including the degenerate case of
being totally empty), then the zip would iterate over the number of
items in 'parsed.feed', comparing them to the initial items of
'expected_items', and then would silently stop iterating and the test
would erroneously pass.

Using zip_longest guards against that.

Adding a default value to `to_dict()`'s use of getattr, and a message
to the assertion at the end of test_gallery_rss produces much nicer
error messages on failure.
The previous library, rss_parser, doesn't successfully parse enclosures:
dhvcc/rss-parser#37

I switched to feedparser, changed the test to use their format of
returned parsed data structure. I still need to tell it to grab the
(custom?) field of the gallery description, but everything else seems to
work.
The new test now makes assertions about the content of the enclosure (a
data structure about linked items, ie the images in this case) in each
RSS feed item.

This will enable us, in a later commit, to test the fallback to a
different MIME type when a file isn't recognized.
The new gallery RSS test can get access to the 'description' field
in the feed XML. It is a field labelled 'subtitle' in the parsed
feed data.

Delete the TODO related to this item.

Delete some debug printing.

Add a TODO wondering whether the gallery description should be blank,
or should be populated with the content from the gallery index.txt.
Use new variable BUILDTIME to explicitly show that the RSS output
feed.updated field derives its value from the time the `nikola build`
was done.
@tartley
Copy link
Contributor Author

tartley commented Apr 12, 2023

I pushed some test changes which give better error messages. Turns out, the test failure due to an off-by-one-hour timestamp isn't a daylight savings problem at all. It is caused by the gallery's exclude.meta file not working properly on Windows. Hence, the excluded image tesla2_lg.jpg is still included in the gallery RSS. (Incidentally, this image has an EXIF 'created' timestamp which is one hour less than the expected first RSS item, which is why we were seeing a one hour time difference exception earlier.) It's late, so I won't make any decisions now, but I think the options are: (a) This PR could include a fix for this, if it's trivial, or (b) This PR could change the test to exercise the current erroneous behavior and pass.

@tartley
Copy link
Contributor Author

tartley commented Apr 12, 2023

A pure guess: Might it be nikola/plugins/task/galleries.py:547, in get_excluded_images():

excluded_image_list = ["{0}/{1}".format(gallery_path, i) for i in excluded_image_name_list]

The / in the format template perhaps causes problems on Windows when the image removal task runs.

The newly-added test fails because the gallery image tesla2_lg.jpg
should be excluded from the gallery - it is mentioned in exclude.meta.
But on Windows, it does not get excluded, so the gallery RSS feed
includes one erroneous extra item.
@tartley
Copy link
Contributor Author

tartley commented Apr 12, 2023

OK, so, this is interesting. The 'tentative fix' commit, above, did fix the previous test failure, ie:

The test failure was that the 'id' entry of each item in the RSS feed was unexpected. On linux/mac, it is:

'galleries/demo/tesla4_lg.jpg'

but on Windows, it was:

'galleries\\demo\\tesla4_lg.jpg'

i.e. with single backslashes (the above line shows them escaped, as the failing test output does).

The 'tentative fix' commit is for galleries.py to generate the same 'id' on Windows as it does elsewhere, i.e. use os.path.join instead of {0}/{1}".format(...), as seems common elsewhere in galleries.py. I'm not at all sure this is the right thing to do. Maybe the 'id' really should be different on Windows, to match the filename. Advice welcome. I guess my current thinking is that this PR should be conservative, and not change behavior unless I'm absolutely sure it is a bug, so I'll probably modify this to restore the original behavior, and make the test demonstrate whatever that is.

Also, now with that fix in place, there is a new, similar failure:

FAILED tests/integration/test_demo_build.py::test_gallery_rss - AssertionError: Item 0 mismatch: link: [https://example.com/galleries\demo\tesla4_lg.jpg](https://example.com/galleries/demo/tesla4_lg.jpg) != https://example.com/galleries/demo/tesla4_lg.jpg
assert 'https://exam...tesla4_lg.jpg/' == 'https://exam...tesla4_lg.jpg/'
  - https://example.com/galleries/demo/tesla4_lg.jpg
  ?                              ^    ^
  + [https://example.com/galleries\demo\tesla4_lg.jpg](https://example.com/galleries/demo/tesla4_lg.jpg)
  ?                              ^    ^

The same chain of reasoning applies: I'm pretty certain the windows version is not a valid URL, but it works because browsers will helpfully massage the backslashes into forward slashes and resolve the resulting URL. So probably the ultimately right thing to do is fix the URL generation on Windows? But in the absence of strong advice from the project owners, this PR should probably be conservative, preserve current behavior, and tweak the test to demonstrate that (with a "TODO" comment in the test that I think this might be a bug.)

Thoughts welcome. More as it happens. Thanks people!

Prepare test to diagnostically output the actual results on
Windows, so that subsequent commits can modify the test to
expect the actual current behavior.
@Kwpolska
Copy link
Member

The URL generation on Windows is probably buggy and should be fixed, even if that would mean some minor regressions in places depending on the buggy behavior (e.g. RSS feed GUIDs). We usually don’t pay much attention to pull requests being small or very self-contained.

Due to a probable bug, images in the gallery exclude.meta are not
excluded on Windows. The test now expects this, to demonstrate the
current behavior.
...instead of dict literals, for consistency with
the other larger dicts in the same test.
Windows RSS item ids are now of same form as on other OSes, ie:

    'galleries/demo/tesla_tower1_lg.jpg'

Formerly these used backslashes on Windows, which caused a bug where
excluded images (listed in exclude.meta), were not excluded because
their id didn't match.

This means I've removed a big ugly TODO from the test, which no
longer expects these excluded items to appear in the RSS feed on
Windows.

This is a re-application of the 'tentative fix' that
was applied and reverted earlier in this PR.
@tartley tartley force-pushed the test-gallery-rss branch 2 times, most recently from 7899cad to ad03124 Compare July 4, 2023 00:58
@tartley tartley force-pushed the test-gallery-rss branch 2 times, most recently from 20aa60a to 66da315 Compare July 4, 2023 04:35
@tartley tartley marked this pull request as ready for review July 4, 2023 15:16
@tartley
Copy link
Contributor Author

tartley commented Jul 4, 2023

@Kwpolska: The URL generation on Windows is probably buggy and should be fixed, even if that would mean some minor regressions in places depending on the buggy behavior (e.g. RSS feed GUIDs).

Thank you for the advice! This is now fixed, and I believe this PR is complete and AFAIK can be merged.

nikola/nikola.py Outdated
@@ -1926,7 +1927,8 @@ def path(self, kind, name, lang=None, is_link=False, **kwargs):
else:
return link
else:
return os.path.join(*path)
# URLs should always use forward slash separators, even on Windows
return pathlib.PurePosixPath(*path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should return a string, not a pathlib.Path instance.

Copy link
Contributor Author

@tartley tartley Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahar, fair point, thank you. I was curious about why this wasn't causing problems in the tests, so I had a quick look around and it seems like almost every call to this function is using the return value as a component passed to os.path.join(), which accepts pathlib.Paths. So that explains why the problem isn't visibly manifesting. But yes, you are right, future callers to this (and maybe the existing one in Nikola.link()) might barf on pathlib.Paths. So I'll convert it to str, and swallow my idealism over this fix not requiring a test change.

@Kwpolska Kwpolska merged commit 84f6a2b into getnikola:master Jul 7, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants