brool/beaujiful-soup

Beaujiful Soup

Beautiful Soup is a really nice Python library for extracting content from possibly-sloppy HTML, and I wanted some reasonably close Clojure equivalent. Unfortunately, the standard classes don't work well with malformed HTML; as an example:

=> (require '(clojure [xml :as xml]))
=> (xml/parse "http://www.google.com")
org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed. (NO_SOURCE_FILE:0)

Fortunately, there is already a TagSoup library that can parse non-perfect HTML, and it is very easy to integrate TagSoup into xml/parse. This module hardly does anything; it simply adds a few helper routines and brings the most-used calls into one amazingly bad namespace name.
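The hook that makes this integration possible is the optional second argument to clojure.xml/parse: a startparse function that receives the source and a SAX content handler. The sketch below is not this module's actual source; it only illustrates the hook. The names startparse-with, jdk-reader, and parse-string are made up for illustration, and the example uses the strict XMLReader that ships with the JDK so that it runs without extra dependencies. TagSoup's org.ccil.cowan.tagsoup.Parser also implements org.xml.sax.XMLReader, so, assuming the TagSoup jar is on the classpath, a function that constructs one could be passed in instead.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io StringReader)
        '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax InputSource XMLReader))

(defn startparse-with
  "Hypothetical helper: returns a startparse function for clojure.xml/parse
  that obtains its SAX XMLReader by calling make-reader."
  [make-reader]
  (fn [source content-handler]
    (let [^XMLReader reader (make-reader)]
      (.setContentHandler reader content-handler)
      (.parse reader ^InputSource source))))

(defn jdk-reader
  "Strict XMLReader bundled with the JDK; rejects malformed markup."
  []
  (.getXMLReader (.newSAXParser (SAXParserFactory/newInstance))))

(defn parse-string
  "Parses the string s using whatever reader make-reader constructs."
  [make-reader s]
  (xml/parse (InputSource. (StringReader. s))
             (startparse-with make-reader)))

(def tree (parse-string jdk-reader "<ul><li>One</li><li>Two</li></ul>"))

(:tag tree)                      ; => :ul
(map :content (:content tree))   ; => (["One"] ["Two"])

;; Assumption: with TagSoup on the classpath, a lenient reader could be
;; swapped in like this (not run here):
;; (parse-string #(org.ccil.cowan.tagsoup.Parser.) "<ul><li>One<li>Two</ul>")
```

Keeping the reader behind a constructor function means the parsing code does not care which SAX implementation is used.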

Examples

Building your soup:

(use 'beaujiful-soup.core)

; build soup from URL
(def t (build-soup "http://www.google.com"))

; build soup from (deliberately malformed) string
(def t2 (build-string-soup "<html><body><ul><li>One<li>Two</ul></body></html>"))

Extract information with the xml-> call. The last step in the chain is usually node, text, or (attr :attribute), which converts the results into a more workable type:

; you can "walk" down the tree with successive tag names.  For
; example, get every list item inside the unordered list
; immediately inside the body.
(xml-> t2 :body :ul :li node)
; => ({:tag :li, :attrs nil, :content ["One"]} {:tag :li, :attrs nil, :content ["Two"]})

; get the text for the list items
(xml-> t2 :body :ul :li text)
; => ("One" "Two")

; Get textareas immediately inside the body.
(xml-> t :body :textarea node)
; => ({:tag :textarea, :attrs {:id "csi", :style "display:none"}, :content nil})

; Use descendants to iterate through all nodes, not just the immediate children.
; Get the text from all <a> tags anywhere in the document.
(xml-> t descendants :a text)
; => ("Images" "Videos" "Maps" ...)

; Get the href attribute from all <a> tags
(xml-> t descendants :a (attr :href))
; => ("http://www.google.com/imghp?hl=en&tab=wi" ... )

Use the (attr=) predicate to match an attribute value:

; find invisible stuff (in the Google page, t)
(xml-> t descendants (attr= :style "display:none") tag)
; => (:textarea :iframe)

Strings match the text inside nodes:

; find the link for the <a> that has "Videos" for content
(xml-> t descendants :a "Videos" (attr :href))
; => ("http://video.google.com/?hl=en&tab=wv")

Arbitrary predicates can be used as well. Each predicate takes a loc (location), which is usually converted to a node before being inspected:

; find any :p or :div
(defn p-or-div [loc] (contains? #{:p :div} (:tag (node loc))))
(xml-> t descendants p-or-div tag)
; => (:div :div :div :div :div :div :div :div :div :div :div :p :div :div)

; find the link for <a> that has case-insensitive "Videos" for content
(require 'clojure.string)
(defn f [loc]
  (let [n (node loc)
        s (first (:content n))]
    ; guard against <a> tags whose first child is missing or is not a string
    (and (= (:tag n) :a)
         (string? s)
         (= (clojure.string/upper-case s) "VIDEOS"))))
(xml-> t descendants f (attr :href))
; => ("http://video.google.com/?hl=en&tab=wv")
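To make the loc/node distinction concrete, here is a small sketch that uses only clojure.zip. It assumes that node behaves like clojure.zip/node, which this README does not state explicitly. The sample tree and the name p-or-div? are made up for illustration. A loc is a position in the tree that can be moved up, down, or sideways; node returns the plain map (or string) found at that position.

```clojure
(require '[clojure.zip :as zip])

;; Made-up sample tree in the same shape that clojure.xml/parse produces.
(def doc
  {:tag :body :attrs nil
   :content [{:tag :p :attrs {:id "intro"} :content ["Hello"]}
             {:tag :div :attrs nil :content ["World"]}]})

(def root-loc (zip/xml-zip doc))     ; loc pointing at <body>
(def p-loc    (zip/down root-loc))   ; loc pointing at the first child, <p>
(def text-loc (zip/down p-loc))      ; loc pointing at the string "Hello"

(defn p-or-div?
  "Predicate on a loc: true when the node at that loc is a <p> or <div>."
  [loc]
  (contains? #{:p :div} (:tag (zip/node loc))))

(zip/node text-loc)                  ; => "Hello"
(:tag (zip/node (zip/right p-loc)))  ; => :div
(:tag (zip/node (zip/up p-loc)))     ; => :body
(p-or-div? p-loc)                    ; => true
(p-or-div? text-loc)                 ; => false (text nodes have no :tag)
(p-or-div? root-loc)                 ; => false
```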

Fundamentally, the xml-> call returns a list of locations, and you can apply arbitrary transforms as necessary. For example, let's say that you want to build a map of href => text from the links on the page:

(defn loc-to-pair [loc]
    [ (attr loc :href), (text loc) ])
(apply hash-map (xml-> t descendants :a loc-to-pair))
; => {"/services/" "Business Solutions",  ... }
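Because the result is an ordinary sequence, the usual sequence functions apply to it. The sketch below mimics the example above with only clojure.zip so that it runs without extra dependencies. The names page, all-locs, and link-pair are made up, and the traversal is only a rough stand-in for descendants, not its actual implementation. It also shows (into {} ...) over pairs as an alternative to apply hash-map.

```clojure
(require '[clojure.zip :as zip])

;; Made-up sample tree in the same shape that clojure.xml/parse produces.
(def page
  {:tag :body :attrs nil
   :content [{:tag :div :attrs {:id "nav"}
              :content [{:tag :a :attrs {:href "/images"} :content ["Images"]}
                        {:tag :a :attrs {:href "/videos"} :content ["Videos"]}]}
             {:tag :p :attrs nil :content ["No links here"]}]})

(defn all-locs
  "Every loc in the tree, depth first, starting from the root loc."
  [root]
  (take-while (complement zip/end?) (iterate zip/next root)))

(defn link-pair
  "Turns the loc of an <a> element into an [href text] pair."
  [loc]
  (let [n (zip/node loc)]
    [(get-in n [:attrs :href])
     (apply str (filter string? (:content n)))]))

(def links
  (->> (all-locs (zip/xml-zip page))
       (filter #(= :a (:tag (zip/node %))))
       (map link-pair)
       (into {})))

links  ; => {"/images" "Images", "/videos" "Videos"}
```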

A vector in the chain acts as a sub-query: all the predicates within the vector are applied, and any location for which they match nothing is filtered out. The chain then continues from the original location, so it acts a little like a lookahead in a regex. For example:

; Find the IDs of all divs that contain an <a> immediately within them
(xml-> t descendants :div [ :a ] (attr :id))
; => ("fll")

; Find the IDs of all divs that contain an <a> anywhere within them
(xml-> t descendants :div [ descendants :a ] (attr :id))
; => ("ghead" "gbar" "guser" "fll")
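The lookahead behaviour can be imitated by hand. The following sketch uses only clojure.zip on a made-up tree; child-locs, descendant-locs, tag?, and ids-of-divs-with are hypothetical helpers, not part of this module. The point it illustrates is that the sub-query only decides whether a div is kept; the value extracted afterwards still comes from the div itself.

```clojure
(require '[clojure.zip :as zip])

;; Made-up sample tree in the same shape that clojure.xml/parse produces.
(def page
  {:tag :body :attrs nil
   :content [{:tag :div :attrs {:id "ghead"}
              :content [{:tag :div :attrs {:id "gbar"}
                         :content [{:tag :a :attrs {:href "/images"}
                                    :content ["Images"]}]}]}
             {:tag :div :attrs {:id "fll"}
              :content [{:tag :a :attrs {:href "/about"}
                         :content ["About"]}]}
             {:tag :div :attrs {:id "empty"} :content nil}]})

(defn child-locs
  "Locs of the immediate children of loc, left to right."
  [loc]
  (take-while some? (iterate zip/right (zip/down loc))))

(defn descendant-locs
  "Locs of everything below loc, depth first (loc itself excluded)."
  [loc]
  (mapcat #(cons % (descendant-locs %)) (child-locs loc)))

(defn tag?
  "Returns a predicate on locs that matches elements with the given tag."
  [tag]
  (fn [loc] (= tag (:tag (zip/node loc)))))

(defn ids-of-divs-with
  "Ids of all divs for which sub-query finds at least one <a>."
  [sub-query root]
  (->> (descendant-locs root)
       (filter (tag? :div))
       (filter #(some (tag? :a) (sub-query %)))   ; the lookahead step
       (map #(get-in (zip/node %) [:attrs :id]))))

(def root (zip/xml-zip page))

(ids-of-divs-with child-locs root)       ; => ("gbar" "fll")
(ids-of-divs-with descendant-locs root)  ; => ("ghead" "gbar" "fll")
```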
