heads up...tracking a problem with archive.today and also wget options #35

nrvale0 · 2020-05-10T21:37:21Z

Just a heads up...

For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.

I'm going to try to solve the archive.today problem first but I'll race ya! ;)

alphapapa · 2020-05-11T04:54:52Z

Unfortunately, archive.today (or whatever its alias of the day may be) generally seems like an unreliable service for using as a backend. It's not intended to be used except through a browser. So don't be surprised if it doesn't work sometimes.

If a change has been made to it that requires a change in this code, we can do that.

For Wget, you'll have to be more specific than "it does not like the option." Obviously it works for me and always has.

nrvale0 · 2020-05-11T11:28:22Z

(use-package org-web-tools)

produces a timeout in the archive.is function and then the following error in the *Messages* buffer for the wget function:

wget output:

/usr/bin/wget: unrecognized option '--execute robots=off'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

The following then fixes wget params:

(use-package org-web-tools
    :config
    (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
    (add-to-list 'org-web-tools-archive-wget-options "-e robots=off"))

My wget man page still shows both the -e and --execute options as valid but apparently not.

$ wget --version | head -n1
GNU Wget 1.20.3 built on linux-gnu.
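A possible explanation for why the "-e robots=off" string is accepted while "--execute robots=off" is not (an assumption, not confirmed anywhere in this thread): getopt-style parsers let a short option carry its argument in the same word, whereas a long option only takes an attached argument after an `=` sign. The sketch below uses the shell's `getopts` builtin as a stand-in, not Wget's own parser, and `short_optarg` is a hypothetical helper name.

```shell
# Hypothetical helper: parse the first option with getopts, where "e"
# takes an argument, and print the option letter and its argument.
short_optarg() {
  OPTIND=1                      # reset so the function can be called repeatedly
  getopts "e:" opt "$@" || return 1
  printf '%s|%s\n' "$opt" "$OPTARG"
}

# One word: the text after "-e" (including the space) becomes the argument.
short_optarg "-e robots=off"

# Two words: the next word becomes the argument.
short_optarg "-e" "robots=off"
```

In the one-word case the argument keeps its leading space; whether a given program tolerates that is up to the program.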

nrvale0 · 2020-05-11T11:48:51Z

I tried a:

(setq org-web-tools-attach-archive-fn #'org-web-tools-archive--wget-tar)

to just skip the archive.is attempts completely, but it's still trying archive.is. I'm new to elisp, so I'm probably missing something important.

alphapapa · 2020-05-14T00:52:44Z

If those Wget options don't work on your Wget version, I don't know what to suggest other than to not use them. Hopefully you won't need them, but be aware of their purpose. Maybe there is a new, alternative option syntax in your Wget version?

I recommend using the customization system rather than setq for package options. i.e. M-x customize-group RET org-web-tools RET. use-package also has the :custom keyword.

alphapapa · 2020-05-14T00:57:55Z

> wget output:
>
> /usr/bin/wget: unrecognized option '--execute robots=off'
> Usage: wget [OPTION]... [URL]...
>
> Try `wget --help' for more options.

I just recognized something: the option string Wget complains about includes both --execute and robots=off as a single string with a space in between. I think this may be a problem with argument parsing. I encountered a similar problem with Wget when experimenting with something recently, and IIRC I wasn't able to find any workaround.

Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option. (The customization UI makes it easier.) I don't know if that will work, but if it does, it's an easy fix or workaround.

But I can't explain why my Wget doesn't complain about that option.
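To illustrate the difference between one string with a space and two separate strings, here is a minimal shell sketch. It assumes the entries of the option list are handed to the program as-is, one argument per string, with no shell in between to split on spaces. `count_args` is a hypothetical helper, not part of Wget or org-web-tools.

```shell
# Hypothetical helper: report how many arguments the callee receives.
count_args() {
  echo "$#"
}

# One string containing a space arrives as a single argument.
count_args "--execute robots=off"     # prints 1

# Two strings arrive as two arguments: the option and its value.
count_args "--execute" "robots=off"   # prints 2
```

In the first call, the receiving program sees a single option literally named "--execute robots=off", which matches the text quoted in the error message.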

nrvale0 · 2020-10-24T18:13:53Z

Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I'm still not able to get the archive.is based archiving working but the above is a suitable work-around.

alphapapa · 2020-10-24T18:35:52Z

> Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I think there may be a bug in Wget, because I recently noticed this problem when calling it from outside of Emacs. I guess we have to work around it in Emacs.

> I'm still not able to get the archive.is based archiving working but the above is a suitable work-around.

archive.is doesn't seem to provide zip archives at all anymore. I can't even download them through a browser, and I couldn't find any explanation on its "blog" where people ask questions. In one case I tried to use Wget on the archive.is HTML view (because the page I was trying to save rendered most of its content with JavaScript, so Wget on the actual site was useless), but the downloaded page had about 90% of the content missing, even though it displayed correctly in a browser.

Archiving contemporary web pages is mostly a disaster. I guess if you are serious about it, you'd better look into WARC or WebRecorder tools, something like that, but those are much more complicated, and AFAIK they require specialized "playback" tools. Imagine what people are going to have to do a few decades from now, running ancient browsers in ancient VMs just to render a newspaper article of the day. Or, almost as bad, looking at image-based archives of newspapers, like microfilm from before the digital age. It seems like no one ever knows when to say, "Stop, that's complicated enough. Just because we could doesn't mean that we should."

nrvale0 · 2020-10-24T23:31:11Z

Y, thanks for the confirmation. I feel ya' on future-state stuff.

gety9 · 2022-11-05T21:41:35Z

> Just a heads up...
>
> For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.
>
> I'm going to try to solve the archive.today problem first but I'll race ya! ;)

@nrvale0 Were you able to solve the 1) archive.today and 2) wget params problems? I have the same in #52

gety9 · 2022-11-05T22:28:47Z

> Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option

@alphapapa

does this look right

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
    '(--execute
    robots=off)
  )
)

?

UPDATE: the wget error is solved with the following:

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
   '("--execute" "robots=off")))

but archive.today always fails...

deadcombo · 2023-10-28T14:19:10Z

This is still an issue on wget 1.21.4. "--execute" and "robots=off" must be separated.

alphapapa · 2023-10-29T04:54:12Z

@deadcombo Thanks for reminding me. I've pushed a fix to master.

nrvale0 closed this as completed Oct 24, 2020

gety9 mentioned this issue Nov 5, 2022

Archive not yet available. #52

Closed
