heads up...tracking a problem with archive.today and also wget options #35

Closed
nrvale0 opened this issue May 10, 2020 · 12 comments

Comments

@nrvale0

nrvale0 commented May 10, 2020

Just a heads up...

For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.

I'm going to try to solve the archive.today problem first but I'll race ya! ;)

@alphapapa
Owner

Unfortunately, archive.today (or whatever its alias of the day may be) generally seems like an unreliable service for using as a backend. It's not intended to be used except through a browser. So don't be surprised if it doesn't work sometimes.

If a change has been made to it that requires a change in this code, we can do that.

For Wget, you'll have to be more specific than "it does not like the option." Obviously it works for me and always has.

@nrvale0
Author

nrvale0 commented May 11, 2020

(use-package org-web-tools)

produces timeout of archive.is function and then the following error in Messages for wget function:

wget output:

/usr/bin/wget: unrecognized option '--execute robots=off'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

The following then fixes wget params:

(use-package org-web-tools
    :config
    (setq org-web-tools-archive-wget-options
        (delete "--execute robots=off" org-web-tools-archive-wget-options))
    (add-to-list 'org-web-tools-archive-wget-options "-e robots=off"))

My wget man page still shows both the -e and --execute options as valid but apparently not.

$ wget --version | head -n1
GNU Wget 1.20.3 built on linux-gnu.

@nrvale0
Author

nrvale0 commented May 11, 2020

I tried a:

(setq org-web-tools-attach-archive-fn #'org-web-tools-archive--wget-tar)

to skip the archive.is attempts completely, but it's still trying archive.is. I'm new to elisp, so I'm probably missing something important.

@alphapapa
Owner

alphapapa commented May 14, 2020

If those Wget options don't work on your Wget version, I don't know what to suggest other than to not use them. Hopefully you won't need them, but be aware of their purpose. Maybe there is a new, alternative option syntax in your Wget version?

I recommend using the customization system rather than setq for package options. i.e. M-x customize-group RET org-web-tools RET. use-package also has the :custom keyword.

@alphapapa
Owner

> wget output:
>
> /usr/bin/wget: unrecognized option '--execute robots=off'
> Usage: wget [OPTION]... [URL]...
>
> Try `wget --help' for more options.

I just recognized something: the option string Wget complains about includes both --execute and robots=off as a single string with a space in between. I think this may be a problem with argument parsing. I encountered a similar problem with Wget when experimenting with something recently, and IIRC I wasn't able to find any workaround.

Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option. (The customization UI makes it easier.) I don't know if that will work, but if it does, it's an easy fix or workaround.

But I can't explain why my Wget doesn't complain about that option.
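
One way to check the argument-parsing theory from inside Emacs is the minimal sketch below. It relies on `call-process` handing each argument string to the program unchanged, with no shell word-splitting, so a list element such as "--execute robots=off" reaches Wget as a single argument containing a space. The helper name `my/wget-accepts-p` is hypothetical and not part of org-web-tools, and using `--version` as a harmless way to make Wget exit early is an assumption; it also assumes `wget` is on the `exec-path` (otherwise `call-process` signals an error).

```elisp
;; Sketch only: compare one combined string against two separate strings.
;; `my/wget-accepts-p' is a hypothetical helper, not part of org-web-tools.
(defun my/wget-accepts-p (&rest args)
  "Return non-nil if running wget with ARGS plus --version exits with status 0."
  (with-temp-buffer
    ;; Each element of ARGS becomes exactly one argument to the wget process.
    (eq 0 (apply #'call-process "wget" nil t nil
                 (append args '("--version"))))))

;; One argument that contains a space, as in the reported error message:
(my/wget-accepts-p "--execute robots=off")

;; Two separate arguments, as suggested above:
(my/wget-accepts-p "--execute" "robots=off")
```

If the two calls return different values, that points at how the option string is split rather than at the option itself. The results may differ between Wget versions, as this thread suggests.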

@nrvale0 nrvale0 closed this as completed Oct 24, 2020
@nrvale0
Author

nrvale0 commented Oct 24, 2020

Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I'm still not able to get the archive.is based archiving working but the above is a suitable work-around.

@alphapapa
Owner

> Indeed adding "--execute" and "robots=off" as their own customize entries seems to have solved the issue with wget archiving.

I think there may be a bug in Wget, because I recently noticed this problem when calling it from outside of Emacs. I guess we have to work around it in Emacs.

> I'm still not able to get the archive.is based archiving working but the above is a suitable work-around.

archive.is doesn't seem to provide zip archives at all anymore. I can't even download them through a browser, and I couldn't find any explanation on its "blog" where people ask questions. In one case I tried to use Wget on the archive.is HTML view (because the page I was trying to save rendered most of its content with JavaScript, so Wget on the actual site was useless), but the downloaded page had about 90% of the content missing, even though it displayed correctly in a browser.

Archiving contemporary web pages is mostly a disaster. I guess if you are serious about it, you'd better look into WARC or WebRecorder tools, something like that, but those are much more complicated, and AFAIK they require specialized "playback" tools. Imagine what people are going to have to do a few decades from now, running ancient browsers in ancient VMs just to render a newspaper article of the day. Or, almost as bad, looking at image-based archives of newspapers, like microfilm from before the digital age. It seems like no one ever knows when to say, "Stop, that's complicated enough. Just because we could doesn't mean that we should."

@nrvale0
Author

nrvale0 commented Oct 24, 2020

Y, thanks for the confirmation. I feel ya' on future-state stuff.

@gety9

gety9 commented Nov 5, 2022

> Just a heads up...
>
> For some reason archive.today requests are failing (no, not using Cloudflare) and then the backup wget is failing because it does not like the '--execute robots=off' option.
>
> I'm going to try to solve the archive.today problem first but I'll race ya! ;)

@nrvale0 Were you able to solve 1) the archive.today and 2) the wget params problems? I have the same problems in #52

@gety9

gety9 commented Nov 5, 2022

> Try putting --execute and robots=off in separate strings in the org-web-tools-archive-wget-options option

@alphapapa

does this look right

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
    '(--execute
    robots=off)
  )
)

?


UPDATE: wget error solved with following:

(use-package org-web-tools
  :ensure t
  :custom
  (org-web-tools-archive-wget-options
    '("--execute"
    "robots=off")
  )
)

but archive.today always fails...

@deadcombo

This is still an issue on wget 1.21.4. "--execute" and "robots=off" must be separated.
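
For anyone on a version without a fix, the separation can also be done programmatically instead of editing each entry by hand. The following is a rough sketch, not the maintainer's fix: `my/split-wget-options` is a hypothetical helper, and it assumes that no option value in the list legitimately contains unquoted whitespace. It uses the built-in `split-string-and-unquote` to break each entry on whitespace.

```elisp
;; Sketch only: split every option string that contains whitespace into
;; separate strings. `my/split-wget-options' is a hypothetical helper.
(defun my/split-wget-options (options)
  "Return OPTIONS with each whitespace-separated word as its own string."
  (apply #'append (mapcar #'split-string-and-unquote options)))

;; Example with a made-up option list:
(my/split-wget-options '("--execute robots=off" "--no-verbose"))
;; => ("--execute" "robots=off" "--no-verbose")

;; Applying it to the package option (assumes org-web-tools is loaded):
(with-eval-after-load 'org-web-tools
  (setq org-web-tools-archive-wget-options
        (my/split-wget-options org-web-tools-archive-wget-options)))
```

Entries without whitespace pass through unchanged, so running the helper more than once does no harm.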

@alphapapa
Owner

@deadcombo Thanks for reminding me. I've pushed a fix to master.
