Commands and functions for retrieving web page content and processing it into and displaying it as Org-mode content.
Switch branches/tags
Nothing to show
Clone or download
alphapapa Change: (-archive--wget-tar) Warn if wget or tar exit non-zero
If wget exits non-zero because one of the page requisites is not
found, the archiving will fail without apparent reason.  At least this
shows the wget output so the user can figure out what happened.
Latest commit 3fdacf4 Dec 15, 2018

README.org

org-web-tools

https://melpa.org/packages/org-web-tools-badge.svg https://stable.melpa.org/packages/org-web-tools-badge.svg

This file contains library functions and commands useful for retrieving web page content and processing it into Org-mode content.

For example, you can copy a URL to the clipboard or kill-ring, then run a command that downloads the page, isolates the “readable” content with eww-readable, converts it to Org-mode content with Pandoc, and displays it in an Org-mode buffer. Another command does all of that but inserts it as an Org entry instead of displaying it in a new buffer.

Installation

Requirements

  • Emacs 25.1 or later.
  • Commands that process HTML into Org require Pandoc. Note: The output of current Pandoc versions differs substantially from versions that may still be present in stable Linux distros. If you encounter any issues, please install a more recent version of Pandoc.

MELPA

If you installed from MELPA, just run one of the commands below. If you want to use any of the functions in your own code, you should (require 'org-web-tools).

Manual

Install dash.el, esxml, request, and s.el. Then require this package in your init file:

(require 'org-web-tools)

Usage

Commands

  • org-web-tools-insert-link-for-url: Insert an Org-mode link to the URL in the clipboard or kill-ring. Downloads the page to get the HTML title.
  • org-web-tools-insert-web-page-as-entry: Insert the web page for the URL in the clipboard or kill-ring as an Org-mode entry, as a sibling heading of the current entry.
  • org-web-tools-read-url-as-org: Display the web page for the URL in the clipboard or kill-ring as Org-mode text in a new buffer, processed with eww-readable.
  • org-web-tools-convert-links-to-page-entries: Convert all URLs and Org links in current Org entry to Org headings, each containing the web page content of that URL, converted to Org-mode text and processed with eww-readable. This should be called on an entry that solely contains a list of URLs or links.
  • org-web-tools-attach-url-archive: Download a Zip archive of a web page from archive.is and attach with org-attach. (See also org-board).
  • org-web-tools-view-archive: Open Zip file archive of web page. Extracts to a temp directory and opens with browse-url-default-browser. Note: the extracted files are left on-disk in the temp directory.

Functions

These are used in the commands above and may be useful in building your own commands.

  • org-web-tools--dom-to-html: Return parsed HTML DOM as an HTML string. Note: This is an approximation and is not necessarily correct HTML (e.g. IMG tags may be rendered with a closing “</img>” tag).
  • org-web-tools--eww-readable: Return “readable” part of HTML with title.
  • org-web-tools--get-url: Return content for URL as string.
  • org-web-tools--html-title: Return title of HTML page.
  • org-web-tools--html-to-org-with-pandoc: Return string of HTML converted to Org with Pandoc. When SELECTOR is non-nil, the HTML is filtered using esxml-query SELECTOR and re-rendered to HTML with org-web-tools--dom-to-html, which see.
  • org-web-tools--url-as-readable-org: Return string containing Org entry of URL’s web page content. Content is processed with eww-readable and Pandoc. Entry will be a top-level heading, with article contents below a second-level “Article” heading, and a timestamp in the first-level entry for writing comments.
  • org-web-tools--demote-headings-below: Demote all headings in buffer so the highest level is below LEVEL.
  • org-web-tools--get-first-url: Return URL in clipboard, or first URL in the kill-ring, or nil if none.
  • org-web-tools--read-url: Return a URL by searching at point, then in clipboard, then in kill-ring, and finally prompting the user.
  • org-web-tools--read-org-bracket-link: Return (TARGET . DESCRIPTION) for Org bracket LINK or next link on current line.
  • org-web-tools--remove-dos-crlf: Remove all DOS CRLF (^M) in buffer.

Changelog

1.2-pre

Improvements

  • Archiving tools:
    • Can use multiple functions to attempt archiving.
    • Associated options control retry attempts, delays, and fallbacks to other functions.
    • Function org-web-tools-archive--wget-tar archives a URL using wget and tar.
    • Command org-web-tools-archive-view handles both zip and tar archives.
    • The default settings attempt to archive with archive.is, and if that fails after retrying for 75 seconds, falls back to using wget and tar.

1.1

Additions

  • Command org-web-tools-attach-url-archive.
  • Command org-web-tools-view-archive.
  • Function org-web-tools--read-url.

1.0.1

Changes

  • Remove all property drawers that contain the CUSTOM_ID property from Pandoc output.

1.0

  • First declared stable release.

Development

Contributions and suggestions are welcome.

License

GPLv3