Switch from shell script to Emacs-native code
Now the shell script is an optional utility.  Capturing from the browser
doesn't require it, nor does it require the extra protocol-handler.
alphapapa committed Oct 3, 2015
1 parent 04ee024 commit d990adb
Showing 3 changed files with 132 additions and 86 deletions.
77 changes: 12 additions & 65 deletions README.org
@@ -1,19 +1,18 @@
* Introduction
org-protocol is awesome, but browsers do a pretty poor job of turning a page's HTML content into plain-text. However, Pandoc supports converting /from/ HTML /to/ org-mode! So with a bit of JavaScript and a shell script, we can pipe the HTML from the browser's selection to Pandoc, then send it to Emacs through org-protocol!
* org-protocol-capture-html
org-protocol is awesome, but browsers do a pretty poor job of turning a page's HTML content into plain-text. However, Pandoc supports converting /from/ HTML /to/ org-mode, so we can use it to turn HTML into Org-mode content! It can even turn HTML tables into Org tables!
* Screenshot
Here's an example of what you get in Emacs from capturing [[http://kitchingroup.cheme.cmu.edu/blog/2014/07/17/Pandoc-does-org-mode-now/][this page]]:

[[screenshot.png]]
* Contents :TOC:
- [[#introduction][Introduction]]
- [[#org-protocol-capture-html][org-protocol-capture-html]]
- [[#screenshot][Screenshot]]
- [[#requirements][Requirements]]
- [[#bookmarklet][Bookmarklet]]
- [[#html-grabbing-function][HTML-grabbing function]]
- [[#bookmarklet][Bookmarklet]]
- [[#emacs][Emacs]]
- [[#shell-script][Shell script]]
- [[#protocol-handler][Protocol-handler]]
- [[#notes][Notes]]

* Requirements
+ *[[http://orgmode.org/worg/org-contrib/org-protocol.html][org-protocol]]*: This is what connects org-mode to the "outside world" using a MIME protocol handler. The instructions on the org-protocol page are a bit out of date, so you might want to try [[http://stackoverflow.com/questions/7464951/how-to-make-org-protocol-work/12751732#12751732][these instructions]] instead.
@@ -47,67 +46,15 @@ Here's a one-line version of it, better for pasting into bookmarklets and such:
function () {var html = ""; if (typeof window.getSelection != "undefined") {var sel = window.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} return html;}();
#+END_SRC
** Bookmarklet
That function goes in the bookmarklet, resulting in this:
That function goes in the bookmarklet, and the =org-protocol= sub-protocol is changed to =capture-html:=, resulting in this:
#+BEGIN_SRC js
location.href='org-protocol-html://capture://w/'+encodeURIComponent(content.location.href)+'/'+encodeURIComponent(content.document.title)+'/'+encodeURIComponent(function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof content.document.selection != "undefined") {if (content.document.selection.type == "Text") {html = content.document.selection.createRange().htmlText;}} return html;}());
location.href='org-protocol://capture-html://w/'+encodeURIComponent(content.location.href)+'/'+encodeURIComponent(content.document.title)+'/'+encodeURIComponent(function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof content.document.selection != "undefined") {if (content.document.selection.type == "Text") {html = content.document.selection.createRange().htmlText;}} return html;}());
#+END_SRC
*Note:* I use the Pentadactyl extension, so I had to use ~content.location.href~ instead of ~location.href~, ~content.document.getSelection()~ instead of ~window.getSelection()~, and ~content.document.selection~ instead of ~document.selection~. This might work in plain Firefox too, or you might need to adjust it.
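To make the shape of that URL easier to see, here is a minimal sketch that assembles it from its parts. The helper name =buildCaptureHtmlUrl= and the sample values are hypothetical and not part of this repository; the sketch only assumes the =org-protocol://capture-html://TEMPLATE/URL/TITLE/HTML= layout used by the bookmarklet above.
#+BEGIN_SRC js
// Hypothetical helper (not part of this repository): builds the same
// URL shape that the bookmarklet assembles inline.
function buildCaptureHtmlUrl(templateKey, pageUrl, pageTitle, html) {
  // Percent-encode each field so that a "/" inside a field is not
  // mistaken for a separator between fields.
  var fields = [pageUrl, pageTitle, html].map(function (field) {
    return encodeURIComponent(field);
  });
  return 'org-protocol://capture-html://' + templateKey + '/' + fields.join('/');
}

// Sample values, chosen only for illustration.
console.log(buildCaptureHtmlUrl('w', 'http://example.com/a b', 'Title', '<b>Hi</b>'));
// prints "org-protocol://capture-html://w/http%3A%2F%2Fexample.com%2Fa%20b/Title/%3Cb%3EHi%3C%2Fb%3E"
#+END_SRC
Because every field is encoded separately, the only literal slashes after the sub-protocol prefix are the ones that separate the template key, URL, title, and HTML.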
* Shell script
This shell script takes the HTML, sends it through Pandoc (and Python), and then sends it to =org-protocol= through =emacsclient=. Put it in a file somewhere in your =PATH= named ~org-protocol-capture-html.sh~.

#+NAME: org-protocol-capture-html.sh
#+BEGIN_SRC sh
#!/bin/bash

if [[ $@ ]]
then
# Get data from args
data="$@"
else
# Get data from STDIN
data=$(cat)
fi

if [[ -z $data ]]
then
# No data; quit
exit 1
fi

# Fix protocol
data=$(sed 's|^org-protocol-html://|org-protocol://|' <<<"$data")

# Split data
readarray -t data <<<"$(sed -r 's|^(org-protocol://capture://w/[^/]+/[^/]+/)(.*)|\1\n\2|' <<<"$data")"

start="${data[0]}"
end="${data[1]}"

# Decode URL-encoded/quoted data
end=$(python -c "import sys, urllib; print urllib.unquote(' '.join(sys.argv[1:]))" "$end")

# Convert with Pandoc
end=$(pandoc --no-wrap -f html -t org <<<"$end")

# Reencode data
end=$(python -c "import sys, urllib; print urllib.quote(' '.join(sys.argv[1:]), safe='')" "$end")

# Send to Emacs
emacsclient "${start}${end}"
* Emacs
Put =org-protocol-capture-html.el= in your =load-path= and add to your init file:
#+BEGIN_SRC elisp
(require 'org-protocol-capture-html)
#+END_SRC
* Protocol-handler
Just like with the standard org-protocol setup, you need to add a protocol handler. Put this in =~/.local/share/applications/org-protocol-capture-html.desktop=, then run ~update-desktop-database ~/.local/share/applications~.

#+NAME: ~/.local/share/applications/org-protocol-capture-html.desktop
#+BEGIN_SRC conf
[Desktop Entry]
Name=org-protocol-html
Exec=org-protocol-capture-html.sh %u
Type=Application
Terminal=false
Categories=System;
MimeType=x-scheme-handler/org-protocol-html;
#+END_SRC

* Notes
If you wanted to, you could skip the shell script and write an org-protocol sub-protocol handler that called Pandoc from Emacs (or perhaps made use of =pandoc-mode=). It's probably simpler to do it with the shell script, especially since you have to un-escape and re-escape the HTML around Pandoc, but if you put together a "plain Emacs" solution, please feel free to share it and I'll add it here.
* Shell script
The [[org-protocol-capture-html.sh][shell script]] is handy for piping any HTML (or plain-text) content to Org through the shell, but it's not required.
67 changes: 67 additions & 0 deletions org-protocol-capture-html.el
@@ -0,0 +1,67 @@
;;; org-protocol-capture-html.el --- Capture HTML with org-protocol

;;; Commentary:
;; This makes it possible to capture HTML into Org-mode with
;; org-protocol by passing it through Pandoc to convert the HTML into
;; Org syntax. You can use a JavaScript function like the ones found
;; here[0] to get the HTML from the browser's selection, or here's one
;; that seems to work:
;;
;; function () {var html = ""; if (typeof window.getSelection != "undefined") {var sel = window.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} return html;}();
;;
;; [0] http://stackoverflow.com/a/6668159/712624

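;;; Code:
;; `org-protocol-protocol-alist' and the `org-protocol-*' helpers used
;; below are defined in org-protocol.
(require 'org-protocol)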
(defun org-protocol-capture-html-with-pandoc (data)
  "Process an org-protocol://capture-html:// URL.
This function is basically a copy of `org-protocol-do-capture', but
it passes the captured content (not the URL or title) through
Pandoc, converting HTML to Org-mode."
  ;; It would be nice to not basically duplicate
  ;; `org-protocol-do-capture', but passing the data back to that
  ;; function would require re-encoding the data into a URL string
  ;; with Emacs after Pandoc converts it. Since we've already split
  ;; it up, we might as well go ahead and run the capture directly.
  (let* ((parts (org-protocol-split-data data t org-protocol-data-separator))
         (template (or (and (>= 2 (length (car parts))) (pop parts))
                       org-protocol-default-template-key))
         (url (org-protocol-sanitize-uri (car parts)))
         (type (if (string-match "^\\([a-z]+\\):" url)
                   (match-string 1 url)))
         (title (or (cadr parts) ""))
         (content (or (caddr parts) ""))
         (orglink (org-make-link-string
                   url (if (string-match "[^[:space:]]" title) title url)))
         (query (or (org-protocol-convert-query-to-plist (cadddr parts)) ""))
         (org-capture-link-is-already-stored t)) ;; avoid call to org-store-link

    (setq org-stored-links
          (cons (list url title) org-stored-links))
    (kill-new orglink)

    (with-temp-buffer
      (insert content)
      ;; Replace the buffer's HTML with Pandoc's output (or its error text).
      (if (not (= 0 (call-process-region
                     (point-min) (point-max)
                     "pandoc" t t nil "--no-wrap" "-f" "html" "-t" "org")))
          ;; `message' needs a %s in the format string to show the output.
          (message "Pandoc failed: %s" (buffer-string))
        (progn
          ;; Pandoc succeeded
          (org-store-link-props :type type
                                :link url
                                :description title
                                :orglink orglink
                                :initial (buffer-string))
          (raise-frame)
          (funcall 'org-capture nil template))))
    nil))

(add-to-list 'org-protocol-protocol-alist
             '("capture-html"
               :protocol "capture-html"
               :function org-protocol-capture-html-with-pandoc
               :kill-client t))

(provide 'org-protocol-capture-html)
;;; org-protocol-capture-html.el ends here
74 changes: 53 additions & 21 deletions org-protocol-capture-html.sh
@@ -1,37 +1,69 @@
#!/bin/bash

# ** Defaults
heading="Heading"
template="w"
url="http://example.com"

# ** Functions
function urlencode {
    # Note: this is Python 2 syntax (print statement, urllib.quote).
    python -c "import sys, urllib; print urllib.quote(' '.join(sys.argv[1:]), safe='')" "$@"
}
function usage {
    cat <<EOF
org-protocol-capture-html [-h HEADING] [-t TEMPLATE] [-u URL] [HTML]
Send HTML to Emacs through org-protocol, passing it through Pandoc to
convert HTML to Org-mode. HTML may be passed as an argument or
through STDIN.
Options:
-h HEADING Use HEADING as the Org heading (default: Heading)
-t TEMPLATE Use the org-capture template with TEMPLATE key (default: w)
-u URL Use URL for the heading link (default: http://example.com)
EOF
}

# ** Args
while getopts "h:t:u:" opt
do
    case $opt in
        h) heading=$OPTARG ;;
        t) template=$OPTARG ;;
        u) url=$OPTARG ;;
        *) usage; exit ;;
    esac
done
shift $(( OPTIND - 1 ));

# ** Get HTML
if [[ $@ ]]
then
# Get data from args
# Get from args
data="$@"
else
# Get data from STDIN
# Get from STDIN
data=$(cat)
fi

if [[ -z $data ]]
if ! [[ $data ]]
then
# No data; quit
echo "No data passed via args or STDIN." >&2
exit 1
fi

# Fix protocol
data=$(sed 's|^org-protocol-html://|org-protocol://|' <<<"$data")

# Split data
readarray -t data <<<"$(sed -r 's|^(org-protocol://capture://w/[^/]+/[^/]+/)(.*)|\1\n\2|' <<<"$data")"

start="${data[0]}"
end="${data[1]}"

# Decode URL-encoded/quoted data
end=$(python -c "import sys, urllib; print urllib.unquote(' '.join(sys.argv[1:]))" "$end")

# Convert with Pandoc
end=$(pandoc --no-wrap -f html -t org <<<"$end")
# ** Check template length
if [[ ${#template} -gt 1 ]]
then
echo "Template key should be one letter." >&2
exit 1
fi

# Reencode data
end=$(python -c "import sys, urllib; print urllib.quote(' '.join(sys.argv[1:]), safe='')" "$end")
# ** URL-encode data
heading=$(urlencode "$heading")
url=$(urlencode "$url")
data=$(urlencode "$data")

# Send to Emacs
emacsclient "${start}${end}"
# ** Send to Emacs
emacsclient "org-protocol://capture-html://$template/$url/$heading/$data"
