Configurable escaping #30

Closed
wants to merge 3 commits into from
5 changes: 4 additions & 1 deletion .gitignore
@@ -9,4 +9,7 @@ pom.xml
.lein-deps-sum
.lein-failures
.lein-plugins
.lein-repl-history
.lein-repl-history
/.nrepl-port
/.cljs_rhino_repl/
/out/
60 changes: 30 additions & 30 deletions README.md
@@ -19,7 +19,7 @@ parsing functions, `parse` and `parse-fragment`. Both take a string
containing HTML and return the parser objects representing the
document. (It happens that these parser objects are Jsoup Documents
and Nodes, but I do not consider this to be an aspect worth preserving
if a change in parser should become necessary).
if a change in parser should become necessary).

The first function, `parse`, expects an entire HTML document, and
parses it using an HTML5 parser ([Jsoup](http://jsoup.org) on Clojure and
@@ -35,7 +35,7 @@ simply give you the list of nodes that it parsed.
These parsed objects can be turned into either Hiccup vector trees or
Hickory DOM maps using the functions `as-hiccup` or `as-hickory`.

Here's a usage example.
Here's a usage example.

```clojure
user=> (use 'hickory.core)
@@ -48,10 +48,10 @@ user=> (as-hickory parsed-doc)
{:type :document, :content [{:type :element, :attrs nil, :tag :html, :content [{:type :element, :attrs nil, :tag :head, :content nil} {:type :element, :attrs nil, :tag :body, :content [{:type :element, :attrs {:href "foo"}, :tag :a, :content ["foo"]}]}]}]}
user=> (def parsed-frag (parse-fragment "<a href=\"foo\">foo</a> <a href=\"bar\">bar</a>"))
#'user/parsed-frag
user=> (as-hiccup parsed-frag)
user=> (as-hiccup parsed-frag false)
IllegalArgumentException No implementation of method: :as-hiccup of protocol: #'hickory.core/HiccupRepresentable found for class: clojure.lang.PersistentVector clojure.core/-cache-protocol-fn (core_deftype.clj:495)

user=> (map as-hiccup parsed-frag)
user=> (map #(as-hiccup % false) parsed-frag)
([:a {:href "foo"} "foo"] " " [:a {:href "bar"} "bar"])
user=> (map as-hickory parsed-frag)
({:type :element, :attrs {:href "foo"}, :tag :a, :content ["foo"]} " " {:type :element, :attrs {:href "bar"}, :tag :a, :content ["bar"]})
@@ -75,25 +75,25 @@ user=> (use 'hickory.zip)
nil
user=> (require '[clojure.zip :as zip])
nil
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/node)
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>") false)) zip/node)
([:html {} [:head {}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]])
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/next zip/node)
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>") false)) zip/next zip/node)
[:html {} [:head {}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>"))) zip/next zip/next zip/node)
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>") false)) zip/next zip/next zip/node)
[:head {}]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>")))
zip/next zip/next
(zip/replace [:head {:id "a"}])
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>") false))
zip/next zip/next
(zip/replace [:head {:id "a"}])
zip/node)
[:head {:id "a"}]
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>")))
zip/next zip/next
(zip/replace [:head {:id "a"}])
user=> (-> (hiccup-zip (as-hiccup (parse "<a href=foo>bar<br></a>") false))
zip/next zip/next
(zip/replace [:head {:id "a"}])
zip/root)
([:html {} [:head {:id "a"}] [:body {} [:a {:href "foo"} "bar" [:br {}]]]])
user=> (-> (hickory-zip (as-hickory (parse "<a href=foo>bar<br></a>")))
zip/next zip/next
(zip/replace {:type :element :tag :head :attrs {:id "a"} :content nil})
user=> (-> (hickory-zip (as-hickory (parse "<a href=foo>bar<br></a>")))
zip/next zip/next
(zip/replace {:type :element :tag :head :attrs {:id "a"} :content nil})
zip/root)
{:type :document, :content [{:type :element, :attrs nil, :tag :html, :content [{:content nil, :type :element, :attrs {:id "a"}, :tag :head} {:type :element, :attrs nil, :tag :body, :content [{:type :element, :attrs {:href "foo"}, :tag :a, :content ["bar" {:type :element, :attrs nil, :tag :br, :content nil}]}]}]}]}
user=> (hickory-to-html *1)
@@ -139,11 +139,11 @@ nil
user=> (def site-htree (-> (client/get "http://formula1.com/default.html") :body parse as-hickory))
#'user/site-htree
user=> (-> (s/select (s/child (s/class "subCalender") ; sic
(s/tag :div)
(s/id :raceDates)
(s/tag :div)
(s/id :raceDates)
s/first-child
(s/tag :b))
site-htree)
(s/tag :b))
site-htree)
first :content first string/trim)
"10, 11, 12 May 2013"
```
@@ -182,19 +182,19 @@ There are also selector combinators, which take as argument some number of other
- `child`: Takes any number of selectors as arguments and returns a selector that returns true when the zipper location given as the argument is at the end of a chain of direct child relationships specified by the selectors given as arguments.
- `descendant`: Takes any number of selectors as arguments and returns a selector that returns true when the zipper location given as the argument is at the end of a chain of descendant relationships specified by the selectors given as arguments.

We can illustrate the selector combinators by continuing the Formula 1 example above. We suspect, to our dismay, that Sebastian Vettel is leading the championship for the fourth year in a row.
We can illustrate the selector combinators by continuing the Formula 1 example above. We suspect, to our dismay, that Sebastian Vettel is leading the championship for the fourth year in a row.

```clojure
user=> (-> (s/select (s/descendant (s/class "subModule")
(s/class "standings")
(s/and (s/tag :tr)
s/first-child)
(s/and (s/tag :td)
(s/nth-child 2))
(s/tag :a))
site-htree)
user=> (-> (s/select (s/descendant (s/class "subModule")
(s/class "standings")
(s/and (s/tag :tr)
s/first-child)
(s/and (s/tag :td)
(s/nth-child 2))
(s/tag :a))
site-htree)
first :content first string/trim)
"Sebastian Vettel"
"Sebastian Vettel"
```

Our fears are confirmed: Sebastian Vettel is well on his way to a fourth consecutive championship. If you were to inspect the page by hand (as of around May 2013, at least), you would see that, unlike the `child` selector we used in the example above, the `descendant` selector allows the argument selectors to skip stages in the tree; we've left out some elements in this descendant relationship. The first table row in the driver standings table is selected with the `and`, `tag` and `first-child` selectors, and then the second `td` element is chosen, which is the element that has the driver's name (the first table element has the driver's standing) inside an `A` element. All of this is dependent on the exact layout of the HTML in the site we are examining, of course, but it should give an idea of how you can combine selectors to reach into a specific node of an HTML document very easily.
24 changes: 4 additions & 20 deletions project.clj
@@ -5,36 +5,20 @@
:url "http://www.eclipse.org/legal/epl-v10.html"}
:source-paths ["src" "target/generated-src"]
:test-paths ["target/generated-test"]
:dependencies [[org.clojure/clojure "1.5.1"]
:dependencies [[org.clojure/clojure "1.7.0"]
[quoin "0.1.0"]
[org.jsoup/jsoup "1.7.1"]]
:plugins [[codox "0.6.4"]]
:profiles {:dev
{:dependencies [[org.clojure/clojurescript "0.0-2227"]]
:plugins [[lein-cljsbuild "1.0.3"]
[com.keminglabs/cljx "0.4.0"]
[com.cemerick/clojurescript.test "0.3.1"]]}}
:hooks [cljx.hooks]
{:dependencies [[org.clojure/clojurescript "1.7.145"]]
:plugins [[lein-cljsbuild "1.1.1"]
[com.cemerick/clojurescript.test "0.3.3"]]}}
:codox {:sources ["src" "target/generated-src"]
:output-dir "codox-out"
:src-dir-uri "http://github.com/davidsantiago/hickory/blob/master"
:src-linenum-anchor-prefix "L"}

:cljx {:builds [{:source-paths ["src"]
:output-path "target/generated-src"
:rules :clj}
{:source-paths ["src"]
:output-path "target/generated-src"
:rules :cljs}
{:source-paths ["test"]
:output-path "target/generated-test"
:rules :clj}
{:source-paths ["test"]
:output-path "target/generated-test"
:rules :cljs}]}
:cljsbuild {:builds [{:source-paths ["target/generated-src" "target/generated-test"]
:compiler {:output-to "target/cljs/testable.js"}
:optimizations :whitespace
:pretty-print true}]
:test-commands {"unit-tests" ["phantomjs" :runner "target/cljs/testable.js"]}})

File renamed without changes.
28 changes: 16 additions & 12 deletions src/hickory/core.clj
@@ -13,7 +13,7 @@
(defprotocol HiccupRepresentable
"Objects that can be represented as Hiccup nodes implement this protocol in
order to make the conversion."
(as-hiccup [this]
(as-hiccup [this escape]
"Converts the node given into a hiccup-format data structure. The
node must have an implementation of the HiccupRepresentable
protocol; nodes created by parse or parse-fragment already do."))
@@ -39,20 +39,20 @@
(extend-protocol HiccupRepresentable
Attribute
;; Note the attribute value is not html-escaped; see comment for Element.
(as-hiccup [this] [(utils/lower-case-keyword (.getKey this))
(as-hiccup [this escape] [(utils/lower-case-keyword (.getKey this))
(.getValue this)])
Attributes
(as-hiccup [this] (into {} (map as-hiccup this)))
(as-hiccup [this escape] (into {} (map #(as-hiccup % escape) this)))
Comment
(as-hiccup [this] (str "<!--" (.getData this) "-->"))
(as-hiccup [this escape] (str "<!--" (.getData this) "-->"))
DataNode
(as-hiccup [this] (str this))
(as-hiccup [this escape] (str this))
Document
(as-hiccup [this] (map as-hiccup (.childNodes this)))
(as-hiccup [this escape] (map #(as-hiccup % escape) (.childNodes this)))
DocumentType
(as-hiccup [this] (str this))
(as-hiccup [this escape] (str this))
Element
(as-hiccup [this]
(as-hiccup [this escape]
;; There is an issue with the hiccup format, which is that it
;; can't quite cover all the pieces of HTML, so anything it
;; doesn't cover is thrown into a string containing the raw
@@ -67,15 +67,19 @@
;; unescapable nodes.
(let [tag (utils/lower-case-keyword (.tagName this))]
(into [] (concat [tag
(as-hiccup (.attributes this))]
(as-hiccup (.attributes this) escape)]
(if (utils/unescapable-content tag)
(map str (.childNodes this))
(map as-hiccup (.childNodes this)))))))
(map #(as-hiccup % escape) (.childNodes this)))))))
TextNode
;; See comment for Element re: html escaping.
(as-hiccup [this] (utils/html-escape (.getWholeText this)))
(as-hiccup [this escape]
(let [unescaped (.getWholeText this)]
(if escape
(utils/html-escape unescaped)
unescaped)))
XmlDeclaration
(as-hiccup [this] (str this)))
(as-hiccup [this escape] (str this)))

(extend-protocol HickoryRepresentable
Attribute
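To make the behavioural change in this file easier to see, here is a minimal, self-contained sketch of the same pattern: an `escape` flag threaded through a protocol method and consulted only at text nodes. The names `HiccupLike`, `to-hiccup`, `Text`, `Elem` and `html-escape*` are hypothetical stand-ins, not part of Hickory, and `html-escape*` only approximates what `utils/html-escape` does.

```clojure
(require '[clojure.string :as string])

(defn html-escape*
  "Hypothetical, simplified stand-in for hickory.utils/html-escape."
  [s]
  (string/escape s {\& "&amp;" \< "&lt;" \> "&gt;" \" "&quot;"}))

(defprotocol HiccupLike
  "Hypothetical protocol mirroring the two-argument as-hiccup in this diff."
  (to-hiccup [this escape]))

;; Toy node types standing in for the parser's TextNode and Element.
(defrecord Text [s])
(defrecord Elem [tag attrs children])

(extend-protocol HiccupLike
  Text
  ;; Only text nodes look at the flag.
  (to-hiccup [this escape]
    (if escape (html-escape* (:s this)) (:s this)))
  Elem
  ;; Elements just pass the flag down to their children.
  (to-hiccup [this escape]
    (into [(:tag this) (:attrs this)]
          (map #(to-hiccup % escape) (:children this)))))

(def node (->Elem :a {:href "foo"} [(->Text "1 < 2 & 3")]))

(to-hiccup node true)   ; text is HTML-escaped
(to-hiccup node false)  ; text is returned as parsed
```

With `escape` set to `false`, callers get the parser's unescaped text back, which is presumably why the README examples in this PR now pass `false`.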
56 changes: 30 additions & 26 deletions src/hickory/core.cljs
@@ -10,7 +10,7 @@
(defprotocol HiccupRepresentable
"Objects that can be represented as Hiccup nodes implement this protocol in
order to make the conversion."
(as-hiccup [this]
(as-hiccup [this escape]
"Converts the node given into a hiccup-format data structure. The
node must have an implementation of the HiccupRepresentable
protocol; nodes created by parse or parse-fragment already do."))
@@ -68,31 +68,35 @@

(extend-protocol HiccupRepresentable
object
(as-hiccup [this] (condp = (aget this "nodeType")
Attribute [(utils/lower-case-keyword (aget this "name"))
(aget this "value")]
Comment (str "<!--" (aget this "data") "-->")
Document (map as-hiccup (aget this "childNodes"))
DocumentType (format-doctype this)
;; There is an issue with the hiccup format, which is that it
;; can't quite cover all the pieces of HTML, so anything it
;; doesn't cover is thrown into a string containing the raw
;; HTML. This presents a problem because it is then never the case
;; that a string in a hiccup form should be html-escaped (except
;; in an attribute value) when rendering; it should already have
;; any escaping. Since the HTML parser quite properly un-escapes
;; HTML where it should, we have to go back and un-un-escape it
;; wherever text would have been un-escaped. We do this by
;; html-escaping the parsed contents of text nodes, and not
;; html-escaping comments, data-nodes, and the contents of
;; unescapable nodes.
Element (let [tag (utils/lower-case-keyword (aget this "tagName"))]
(into [] (concat [tag
(into {} (map as-hiccup (aget this "attributes")))]
(if (utils/unescapable-content tag)
(map #(aget % "wholeText") (aget this "childNodes"))
(map as-hiccup (aget this "childNodes"))))))
Text (utils/html-escape (aget this "wholeText")))))
(as-hiccup [this escape]
(condp = (aget this "nodeType")
Attribute [(utils/lower-case-keyword (aget this "name"))
(aget this "value")]
Comment (str "<!--" (aget this "data") "-->")
Document (map #(as-hiccup % escape) (aget this "childNodes"))
DocumentType (format-doctype this)
;; There is an issue with the hiccup format, which is that it
;; can't quite cover all the pieces of HTML, so anything it
;; doesn't cover is thrown into a string containing the raw
;; HTML. This presents a problem because it is then never the case
;; that a string in a hiccup form should be html-escaped (except
;; in an attribute value) when rendering; it should already have
;; any escaping. Since the HTML parser quite properly un-escapes
;; HTML where it should, we have to go back and un-un-escape it
;; wherever text would have been un-escaped. We do this by
;; html-escaping the parsed contents of text nodes, and not
;; html-escaping comments, data-nodes, and the contents of
;; unescapable nodes.
Element (let [tag (utils/lower-case-keyword (aget this "tagName"))]
(into [] (concat [tag
(into {} (map #(as-hiccup % escape) (aget this "attributes")))]
(if (utils/unescapable-content tag)
(map #(aget % "wholeText") (aget this "childNodes"))
(map #(as-hiccup % escape) (aget this "childNodes"))))))
Text (let [unescaped (aget this "wholeText")]
(if escape
(utils/html-escape unescaped)
unescaped)))))

(extend-protocol HickoryRepresentable
object
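Because the protocol method now requires a second argument, existing one-argument calls such as `(as-hiccup parsed-doc)` would stop working. One possible alternative, which is not what this diff does, is to declare both arities and let the one-argument form default to escaping. A sketch under that assumption, using a hypothetical `Renderable` protocol extended to `String` rather than the real parser node types:

```clojure
(require '[clojure.string :as string])

(defprotocol Renderable
  "Hypothetical protocol with both a one- and a two-argument arity."
  (render [this] [this escape]))

(extend-protocol Renderable
  String
  (render
    ;; One-argument form keeps the pre-PR behaviour: escape by default.
    ([this] (render this true))
    ([this escape]
     (if escape
       ;; Simplified escaping; the real code calls utils/html-escape.
       (string/escape this {\& "&amp;" \< "&lt;" \> "&gt;"})
       this))))

(render "a < b")        ; one-argument call, escapes
(render "a < b" false)  ; explicit opt-out, returns the text unchanged
```

Whether a default is desirable is a design decision for the maintainers; the sketch only shows that both call shapes can coexist on one protocol method.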
12 changes: 6 additions & 6 deletions src/hickory/hiccup_utils.cljx → src/hickory/hiccup_utils.cljc
@@ -11,8 +11,8 @@
(first-idx -1 2) => 2
(first-idx 5 -1) => 5
(first-idx 5 3) => 3"
[#+clj ^long a #+clj ^long b
#+cljs a #+cljs b]
[#?(:clj ^long a) #?(:clj ^long b)
#?(:cljs a) #?(:cljs b)]
(if (== a -1)
b
(if (== b -1)
@@ -21,11 +21,11 @@

(defn- index-of
([^String s c]
#+clj (.indexOf s (int c))
#+cljs (.indexOf s c))
#?(:clj (.indexOf s (int c)))
#?(:cljs (.indexOf s c)))
([^String s c idx]
#+clj (.indexOf s (int c) (int idx))
#+cljs (.indexOf s c idx)))
#?(:clj (.indexOf s (int c) (int idx)))
#?(:cljs (.indexOf s c idx))))

(defn- split-keep-trailing-empty
"clojure.string/split is a wrapper on java.lang.String/split with the limit
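The `.cljx` to `.cljc` conversions in this PR replace each `#+clj`/`#+cljs` annotation with its own single-branch reader conditional. Reader conditionals (available from Clojure 1.7) also accept several platform branches in one form, and `#?@` splices a sequence into the surrounding form, which can be a little more compact. The sketch below uses `read-string` with `:read-cond :allow` so that it can be evaluated outside a `.cljc` file; the function body mirrors `starts-with` from `hickory.utils`, and the `goog.string/startsWith` spelling in the `:cljs` branch is an assumption that is only read here, never evaluated.

```clojure
;; Read a form the way the Clojure (JVM) reader would read it from a .cljc file.
(def clj-form
  (read-string {:read-cond :allow}
               "(defn starts-with [s prefix]
                  #?(:clj  (.startsWith s prefix)
                     :cljs (goog.string/startsWith s prefix)))"))

;; On the JVM only the :clj branch survives, so the form can be evaluated.
(eval clj-form)
(starts-with "No matching clause: :foo" "No matching clause: ")

;; #?@ splices the chosen vector into the enclosing parameter vector.
(def params
  (read-string {:read-cond :allow}
               "[#?@(:clj [^long a ^long b] :cljs [a b])]"))
```

Inside an actual `.cljc` file the quoted forms would be written directly, without `read-string`.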
5 changes: 2 additions & 3 deletions src/hickory/render.cljx → src/hickory/render.cljc
@@ -55,9 +55,9 @@
"</" (name (:tag dom)) ">"))
:comment
(str "<!--" (apply str (:content dom)) "-->"))
(catch #+clj IllegalArgumentException #+cljs js/Error e
(catch #?(:clj IllegalArgumentException) #?(:cljs js/Error) e
(throw
(if (utils/starts-with #+clj (.getMessage e) #+cljs (aget e "message") "No matching clause: ")
(if (utils/starts-with #?(:clj (.getMessage e)) #?(:cljs (aget e "message")) "No matching clause: ")
(ex-info (str "Not a valid node: " (pr-str dom)) {:dom dom})
e))))))

@@ -128,4 +128,3 @@
hiccup's."
[hiccup-forms]
(apply str (map #(render-hiccup-form (hu/normalize-form %)) hiccup-forms)))

3 changes: 1 addition & 2 deletions src/hickory/select.cljx → src/hickory/select.cljc
@@ -5,7 +5,7 @@
(:require [clojure.zip :as zip]
[clojure.string :as string]
[hickory.zip :as hzip])
#+clj (:import clojure.lang.IFn)
#?(:clj (:import clojure.lang.IFn))
(:refer-clojure :exclude [and or not class]))

;;
@@ -660,4 +660,3 @@
zip/right
#(nil? %))
hzip-loc)))))

12 changes: 6 additions & 6 deletions src/hickory/utils.cljx → src/hickory/utils.cljc
@@ -1,8 +1,8 @@
(ns hickory.utils
"Miscellaneous utilities used internally."
#+clj (:require [quoin.text :as qt])
#?(:clj (:require [quoin.text :as qt]))
(:require [clojure.string :as string]
#+cljs [goog.string :as gstring]))
#?(:cljs [goog.string :as gstring])))

;;
;; Data
@@ -23,13 +23,13 @@

(defn html-escape
[s]
#+clj (qt/html-escape s)
#+cljs (gstring/htmlEscape s))
#?(:clj (qt/html-escape s))
#?(:cljs (gstring/htmlEscape s)))

(defn starts-with
[^String s ^String prefix]
#+clj (.startsWith s prefix)
#+cljs (goog.string.startsWith s prefix))
#?(:clj (.startsWith s prefix))
#?(:cljs (goog.string.startsWith s prefix)))

(defn lower-case-keyword
"Converts its string argument into a lowercase keyword."
File renamed without changes.