Created by Calle Ekdahl.
Current version: 1.0
jsoup is an open-source library written in Java, which excels at parsing HTML and manipulating the DOM. jsoupLink is a package written for Mathematica in Wolfram Language which aims to provide an interface to jsoup that feels natural for Mathematica users.
While traditionally HTML has been worked on in Mathematica by importing it as symbolic XML and painstakingly transforming it with pattern matching, jsoupLink introduces the concept of HTML element objects, which make it super easy to traverse the DOM tree and to modify it.
The most common application for jsoupLink is to extract information from websites, for example table data.
jsoupLink is distributed in the form of a paclet. Download the latest version of the paclet from the releases page and install it using the PacletManager package (which you already have because it comes with Mathematica):
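For instance, assuming the paclet file was saved as jsoupLink-1.0.paclet in your Downloads folder (a hypothetical name and location, so substitute your own), the installation might look like this:

```mathematica
(* Load the paclet manager, which ships with Mathematica *)
Needs["PacletManager`"]

(* Hypothetical file name and location: point this at the paclet you downloaded *)
PacletInstall[FileNameJoin[{$HomeDirectory, "Downloads", "jsoupLink-1.0.paclet"}]]
```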
Use Needs to load jsoupLink:
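A minimal sketch, assuming the package context has the same name as the package:

```mathematica
(* The context name jsoupLink` is an assumption based on the package name *)
Needs["jsoupLink`"]
```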
Importing and Exporting Documents
It is easy to import and export HTML using jsoupLink, with the built-in Import and Export commands. Specify HTMLDOM as the file format.
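As a sketch, importing a page could look like the following. The URL is a placeholder, and the context name in Needs is assumed from the package name:

```mathematica
Needs["jsoupLink`"]  (* assumed context name *)

(* Placeholder URL; "HTMLDOM" is the file format described above *)
doc = Import["http://example.com", "HTMLDOM"]
```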
The returned value is an HTML element object. It has properties that can be used to access information about itself or its children. It also has properties that can modify itself or its children. Having modified the object, exporting it back to HTML is equally simple:
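A hedged sketch of the export step, using the standard Export argument order (file, expression, format). The URL and the output file name are placeholders:

```mathematica
Needs["jsoupLink`"]  (* assumed context name *)

(* Import a document, then write it back out as HTML *)
doc = Import["http://example.com", "HTMLDOM"];

(* Hypothetical output location in the temporary directory *)
Export[FileNameJoin[{$TemporaryDirectory, "example.html"}], doc, "HTMLDOM"]
```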
HTML is just a set of nested elements. For example, <div><p>Paragraph 1</p><p>Paragraph 2</p></div> is made up of a div element and two p elements, the div being the parent of its two p children, and the p elements being siblings. The idea of jsoup is to assign one object to each element, and to relate the objects to each other through properties. The property Children of the object corresponding to div would list the two objects corresponding to the p elements, the property Parent on either of the p elements would return the object corresponding to div, and the Siblings property of either of the p elements would list the other p element. Other properties retrieve other types of information. The InnerHTML property of div would return <p>Paragraph 1</p><p>Paragraph 2</p> as a string, whereas the OuterHTML property of the first p would return <p>Paragraph 1</p>.
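The relationships described above can be explored along these lines. This is only a sketch: the URL is a placeholder, the page is assumed to contain at least one div with child elements, and the "Select" property is assumed to return a list of elements.

```mathematica
Needs["jsoupLink`"]  (* assumed context name *)
doc = Import["http://example.com", "HTMLDOM"];  (* placeholder URL *)

(* Pick the first div on the page; First fails if there is none *)
div = First[doc["Select", "div"]];

(* The elements directly under the div *)
children = div["Children"];

(* Going back up from a child should give the div again *)
First[children]["Parent"]

(* The other elements on the same level as the first child *)
First[children]["Siblings"]

(* Markup inside the div, and markup of the first child including its own tag *)
div["InnerHTML"]
First[children]["OuterHTML"]
```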
jsoupLink provides direct access to all of these objects and their properties. In a notebook, these objects have a distinctive display:
Starting with the object corresponding to the outermost element, html, various properties can be used to find all other elements of interest. Properties can be retrieved as subvalues of the objects, as in the image.
Unlike normal Wolfram Language expressions, objects representing elements are mutable, and there are several properties that can modify elements. Most properties can be accessed as obj["property"]. Some take several arguments, e.g. obj["Attribute", "attributeName"], or obj["Attribute", "key", "value"], which will set the attribute key to the value value. Since setting attributes is a common task, the shorthand notation obj[key] = val is also provided.
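A short sketch of reading and setting an attribute with the explicit forms. The URL is a placeholder, the page is assumed to contain at least one link, and the attribute values are arbitrary examples:

```mathematica
Needs["jsoupLink`"]  (* assumed context name *)
doc = Import["http://example.com", "HTMLDOM"];  (* placeholder URL *)

(* First link on the page; First fails if the page has no a elements *)
link = First[doc["Select", "a"]];

(* Read an attribute *)
link["Attribute", "href"]

(* Set an attribute; this modifies the element in place *)
link["Attribute", "target", "_blank"];

(* The change should be visible in the element's markup *)
link["OuterHTML"]
```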
Throughout this list, objects representing HTML elements will be referred to simply as elements. Elements are arranged in a tree structure, called the DOM tree. Whenever descriptions such as "the same level", "topmost", or "beneath" are used in the following text, they refer to this tree structure. (See also the first paragraph of the preceding section.)
This is a complete listing of the properties available to all elements:
element["TagName"]: Tag name. Example: link elements return a, paragraph elements return p.
element["TagName", "tag"]: Set the element's tag name. Example: use it to change an h1 element into an element with another tag.
element["Root"]: Topmost element, usually html.
element["Parent"]: Immediate ancestor of element. Example: the parent to an li element is usually a ul or ol element.
element["Children"]: All elements that lie directly under element. Example: li elements are usually children of a ul or ol element.
element["Siblings"]: All elements on the same level as element. Example: the siblings of an li element are usually other li elements.
element["Select", "selector"]: All elements anywhere beneath element that match the CSS selector "selector". For more information about valid syntax, see "Use selector syntax to find elements" in the jsoup documentation.
element["AllElements"]: All elements beneath element.
element["InnerHTML"]: HTML corresponding to the offspring of element. Example: the inner HTML of <div><p>Paragraph 1</p><p>Paragraph 2</p></div> is <p>Paragraph 1</p><p>Paragraph 2</p>.
element["OuterHTML"]: HTML corresponding to element and all offspring. Example: the outer HTML of the first p in <div><p>Paragraph 1</p><p>Paragraph 2</p></div> is <p>Paragraph 1</p>.
element["OwnText"]: Text which resides directly under element. Example: the own text of <p>text <b>more text</b></p> is text.
element["AllText"]: All text beneath element. Example: for the html element, this returns all text in the document.
element["AllText", "text"]: Remove existing elements and text beneath element and replace with text.
element["ClassNames"]: List of classes in the class attribute.
…: The value attribute, if the element has it.
…: True if the attribute attr is given, and False otherwise.
element["Attribute", "attr"]: Value of the attribute attr.
element["Attribute", "attr", "val"]: Set attribute attr to the value val.
element["Attribute", "attr", True | False]: Set attribute attr to True or False.
element["Attribute", assoc]: Set all attributes as given by the association assoc.
element["Attributes"]: Association with all attributes and their values.
element["RemoveAttribute", "attr"]: Remove the attribute attr.
…: True if element is a block-level element, and False otherwise.
…: True if element["AllText"] is not equal to the empty string, and False if it is.
element["BaseURI"]: The base URI of the document.
element["BaseURI", "uri"]: Set the base URI of the document.
…: … element's class attribute, …
…: … element's class attribute.
…: … element's class attribute.
…: Add … to element's class attribute if it doesn't have it, and remove it if it is already there.
…: Parse html and insert the resulting object before element.
element["Before", el]: Insert element el before element.
…: Parse html and insert the resulting object after element.
element["After", el]: Insert element el after element.
…: Parse html and prepend the resulting object to element.
element["Prepend", el]: Prepend element el to element.
…: Parse html and append the resulting object to element.
element["Append", el]: Append element el to element.
…: Make element a child of the object resulting from parsing html.
…: Remove element but keep its children, essentially moving them up one level.
…: Pass element and all its offspring through a whitelist. Used to e.g. prevent XSS attacks.
element["DeepCopy"]: Return a copy of element, such that modifications done to the copy do not affect element.
element["Properties"]: List all properties.
element["DOMTree"]: Display the DOM tree. Details below.
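As an illustration of how a few of these properties combine, the following sketch extracts the cell text of a table. The URL is hypothetical, the page is assumed to contain a table, and "Select" is assumed to return a list of elements; the selectors themselves are ordinary CSS.

```mathematica
Needs["jsoupLink`"]  (* assumed context name *)

(* Hypothetical page that contains an HTML table *)
doc = Import["http://example.com/page-with-table.html", "HTMLDOM"];

(* Every table row anywhere beneath the document root *)
rows = doc["Select", "table tr"];

(* For each row, collect the text of its header and data cells *)
data = Map[
  Function[row, Map[#["AllText"] &, row["Select", "th, td"]]],
  rows
]
```

The result is a list of lists of strings, one inner list per row, which can then be processed with ordinary list functions.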
DOM Tree Interface
element["DOMTree"] opens an interface to view the DOM tree with element as root:
Elements can be selected by clicking on them. The "copy node" button writes the corresponding element to the clipboard, so that it can be pasted into a notebook. "Copy CSS selector" writes a CSS selector that uniquely identifies the selected element to the clipboard.
Retrieving absolute URLs
If you are having problems retrieving absolute URLs from links, you may try to retrieve the abs:href attribute instead of the href attribute.
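A minimal sketch of this, assuming a placeholder URL and that "Select" returns a list of elements:

```mathematica
Needs["jsoupLink`"]  (* assumed context name *)
doc = Import["http://example.com", "HTMLDOM"];  (* placeholder URL *)

(* All links that carry an href attribute *)
links = doc["Select", "a[href]"];

(* Ask for abs:href rather than href to get absolute URLs *)
Map[#["Attribute", "abs:href"] &, links]
```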