Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Images failing to load in live builds #234

Closed
eloquence opened this issue Jun 24, 2021 · 13 comments
Closed

Images failing to load in live builds #234

eloquence opened this issue Jun 24, 2021 · 13 comments
Labels
bug Something isn't working

Comments

@eloquence
Copy link
Member

Appears LFS-related, see journalist guide for example:

https://docs.securedrop.org/en/stable/journalist.html

Example image URL: https://docs.securedrop.org/en/stable/_images/journalist-delete_submissions.png

The actual file contents are an LFS pointer, so something appears to be going wrong with the LFS setup in conf.py

@eloquence eloquence added the bug Something isn't working label Jun 24, 2021
@zenmonkeykstop
Copy link
Contributor

Latest failing build log is here: https://readthedocs.org/api/v2/build/14089116.txt

@gonzalo-bulnes
Copy link
Contributor

gonzalo-bulnes commented Jun 24, 2021

It seems to me that this part of the log may be relevant:

2021-06-24 17:24:42 (19.3 MB/s) - ‘git-lfs-linux-amd64-v2.12.0.tar.gz’ saved [4675677/4675677]

README.md
CHANGELOG.md
git-lfs
install.sh
Updated git hooks.
Git LFS initialized.
fetch: Fetching reference HEAD
Error updating the git index:
git-lfs filter-process: 1: git-lfs filter-process: git-lfs: not found
error: packet write with format failed
error: Could not write client identification
error: initialization for subprocess 'git-lfs filter-process' failed
fatal: docs/diagrams/SecureDrop-0.3-DFD.png: clean filter 'lfs' failed

Checking out LFS objects: 100% (228/228), 20 MB | 0 B/s, done.

Errors logged to /home/docs/checkouts/readthedocs.org/user_builds/securedrop-docs/checkouts/latest/.git/lfs/logs/20210624T172446.320040923.log
Use `git lfs logs last` to view the log.

Note: This may also be a red herring because the checkout of the objects themselves seems to have succeeded: Checking out LFS objects: 100% (228/228), 20 MB | 0 B/s, done.

On that, see also further down:

[...]
copying images... [ 97%] images/tails_update_notification.png
copying images... [ 98%] upgrade/../images/securedrop-updater.png
copying images... [ 98%] images/yubikey_overview.png
copying images... [ 99%] images/yubikey_oath_hotp_configuration.png
copying images... [100%] images/yubikey_configuration_successful.png

copying downloadable files... [ 10%] images/labels/admin_workstation.png
copying downloadable files... [ 20%] images/labels/journalist_workstation.png
copying downloadable files... [ 30%] images/labels/secure_viewing_station_offline_warning.png
copying downloadable files... [ 40%] images/labels/firewall.png
copying downloadable files... [ 50%] images/labels/app_server.png
copying downloadable files... [ 60%] images/labels/mon_server.png
copying downloadable files... [ 70%] images/labels/usb_admin.png
copying downloadable files... [ 80%] images/labels/usb_journalist.png
copying downloadable files... [ 90%] images/labels/usb_svs.png
copying downloadable files... [100%] images/labels/usb_file_transfer.png

copying static files... done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 1 warning.

The HTML pages are in _build/html.

As far as I can tell, the warning count refers to: /home/docs/checkouts/readthedocs.org/user_builds/securedrop-docs/checkouts/latest/docs/kernel_troubleshooting.rst:110: WARNING: Cannot analyze code. No Pygments lexer found for "none"., which seems unrelated.

@eloquence
Copy link
Member Author

@gonzalo-bulnes Pretty sure it's the LFS setup failing; note how images like https://docs.securedrop.org/en/stable/_images/journalist-delete_source_account.png resolve to LFS pointers:

$ curl https://docs.securedrop.org/en/stable/_images/journalist-delete_source_account.png

version https://git-lfs.github.com/spec/v1
oid sha256:b8dd91d867a9886050dd986516d3eeff8131697623cb92551132fbbc2b7ab119
size 127752

This from the build log

git-lfs filter-process: 1: git-lfs filter-process: git-lfs: not found

suggests to me that the installation of git-lfs is failing. Note that RTD does not support LFS out of the box; this is bolted on in our conf.py here:

if on_rtd:
if not os.path.exists("./git-lfs"):
os.system("wget {}".format(GIT_LFS_URL))
os.system("tar xvfz {}".format(GIT_LFS_PATH))
os.system("./git-lfs install")
os.system("./git-lfs fetch")
os.system("./git-lfs checkout")

Restarting the build makes no difference; this is a persistent failure that started about 2 days ago. I don't see any changes in this repo that are clearly correlated. It's possible that RTD's build environment has changed. I'll see if I can reproduce the issue in a fork as a first step.

@eloquence
Copy link
Member Author

eloquence commented Jun 25, 2021

Tested a bit more. Findings so far:

  1. I was able to reproduce the issue in a fork immediately.
  2. I tried building from the 1.8.2 tag - no difference.
  3. I tried bumping the git-lfs version to latest - no difference.
  4. I tried changing os.system calls for the LFS setup to the recommended subprocess.run - no difference.
  5. I tried changing git's HTTP POST buffer size per a common recommendation when encountering the "remote end hung up unexpectedly" error - no difference.
  6. For good measure, I tried removing and re-adding the file SecureDrop-0.3-DFD.png (which sometimes is included in the error output) - no difference.

I did get exactly one successful build with images in my fork, just after making the change to subprocess.run above (but not repeatably thereafter). The build log from that build is here:
https://readthedocs.org/api/v2/build/14092907.txt

Note how it does not include the fatal: The remote end hung up unexpectedly error.

One additional mystery: I am seeing both the git-lfs filter-process: 1: git-lfs filter-process: git-lfs: not found and fatal: The remote end hung up unexpectedly errors in build logs from months ago, when builds were passing and images were definitely working. Not sure if we're dealing with breakage masked by caching, or if there's something entirely different going on.

One thing I didn't get to try yet is to replace the direct git-lfs invocation with use of the Python git_lfs module. Here's an example of that: https://scm.cms.hu-berlin.de/jakimowb/eo-time-series-viewer/-/blob/f4d90e73765e77808f5d5df831a4d24b8c1dd4e1/doc/source/conf.py

@gonzalo-bulnes
Copy link
Contributor

Wow, that's a super clear recap! I'll try to reproduce it tomorrow as well, and take a look at the Python git_lfs module use.

[...] the checkout of the objects themselves seems to have succeeded: Checking out LFS objects: 100% (228/228), 20 MB | 0 B/s, done.

One additional mystery: I am seeing both the git-lfs filter-process: 1: git-lfs filter-process: git-lfs: not found and fatal: The remote end hung up unexpectedly errors in build logs from months ago, when builds were passing and images were definitely working.

Disclaimer: I haven't looked. Are the builds actually failing (being marked as failed)? Or might there be an issue with serving the images vs retrieving them / putting them in the right place? I guess what I'm asking is if the git-lfs errors could be non-fatal and something else started failing downstream?

@rmol
Copy link
Contributor

rmol commented Jun 25, 2021

Somewhere in the build process LFS content is being reverted to pointers, even if we do get images reified in the working tree at the beginning of the build, via git lfs pull (or fetch and checkout) in conf.py.

I tried using the RTD configuration file to install git-lfs in the container with apt, to ensure it was available throughout the process, and could verify in conf.py that LFS was being installed and content pulled correctly, but could never get any build products to contain the actual images.

I also used the configuration file to ensure RTD was using our requirements file. That made no difference.

None of the Sphinx or RTD code I've looked at so far seems to be doing anything more than simple copies of the images, so I don't know what's happening to them.

@eloquence
Copy link
Member Author

Filed upstream issue (though LFS is not officially supported, perhaps they can give some pointers): readthedocs/readthedocs.org#8288

@gonzalo-bulnes
Copy link
Contributor

gonzalo-bulnes commented Jun 26, 2021

👋 @eloquence @rmol
I'm catching up with the above and it is not immediately clear to me how I could help troubleshooting further before the US weekend. Meanwhile, I was thinking about the impact of this image outage and thinking that #70 (or a subset of it) could be relevant to mitigating some of that impact.

Concretely, I could go through the docs today and open a PR to improve the sections that suffer the most when images are missing if that seems useful.

To illustrate what I'm thinking: a couple of days ago, I noticed that the section Set Up the Network Firewall relies on two screenshots to describe the firewall rules. There is a description of abstract rules further down, but arguably those could be made more visible, and there could also be more precise content complementing or replacing the missing image. (One of the alt texts is "3 NIC Firewall LAN rules" and constitutes the entirety of the LAN interface subsection). That's an example that's fresh in my mind, but I assume I'd probably find other if performing a systematic review. Any thoughts?

@conorsch
Copy link
Contributor

Hello, @gonzalo-bulnes! Thanks for offering to help out. You're right, troubleshooting this problem without direct commit access is likely to be a frustrating problem. =/ I took a brief look just now, and I'm sorry to say I don't have anything to add to @rmol's analysis earlier today. Here's the latest info:

As @rmol observed, the images are indeed valid PNGs, but then they're rather bizarrely reverted to pointers again at some later step in the build prep. Perhaps inspecting the sphinx build logic in more detail would let us inject steps in various parts of the run. It doesn't appear that RTD will let us run any cleanup tasks after the sphinx build command has run.

So, unfortunately, it appears the images in the docs will remain broken over the weekend, but will pick this back up on Monday and try again.

@gonzalo-bulnes
Copy link
Contributor

👌 I'll take a closer look at what the docs look like without images, and see what I can propose to make that better. If it ends up there is some immediate value to it, that will be an available option, and if not it will anyway be a step towards making the docs more resilient in the future. (I like to think of accessibility in broad terms as a problem of content availability in varied circumstances.)

@eloquence
Copy link
Member Author

The plan of record is to investigate this a bit further today/tomorrow (upstream has also weighed in), and to potentially disable LFS on this repo if we can't resolve.

Longer term, we are considering self-hosting as an option, which would give us more control over the build architecture (this is no criticism of RTD, which is a great service, but SecureDrop may have outgrown it a while ago).

@gonzalo-bulnes
Copy link
Contributor

gonzalo-bulnes commented Jun 28, 2021

Two cents comment: I notice that some images in the docs could easily be optimized significantly. (Proof of concept, running /docs/_build/html/_images/yubikey_configuration_successful.png through TinyPNG (online service) yields 70% gains, from 103.5kB to 31.3kB.)

If disabling LFS, it may be worth adding a step (with an automated check or not) to optimize the images. (Since images would be optimized by the time they are integrated into main, I believe an automated step at CI time could be enough to keep the Git repo small if everyone works from forks and makes sure their build is green before merging?)

@eloquence
Copy link
Member Author

#239 is merged and I pushed a new stable tag and purged the Cloudflare cache for all of securedrop.org (which was necessary to purge cached LFS pointers). Filed #240 for follow-up; closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants