Python library for extracting HTML5 microdata
microdata.py is a small utility library for extracting HTML5 Microdata from HTML. It depends on html5lib to do the heavy lifting of building the DOM. For more about HTML5 Microdata, see Mark Pilgrim's chapter on it in Dive Into HTML5.

Here's the basic usage, with https://raw.github.com/edsu/microdata/master/test-data/example.html as the example document:

>>> import microdata
>>> items = microdata.get_items(open("test-data/example.html"))
>>> item = items[0]
>>> item.itemtype
u"http://schema.org/Person"
>>> item.name
u"Jane Doe"
>>> item.colleagues
u"http://www.xyz.edu/students/alicejones.html"
>>> item.get_all('colleagues')
[u"http://www.xyz.edu/students/alicejones.html", u"http://www.xyz.edu/students/bobsmith.html"]
>>> print item.json()
{ 
  "$itemtype": "http://schema.org/Person",
  "$itemid": "http://www.xyz.edu/~jane",
  "colleagues": [
    "http://www.xyz.edu/students/alicejones.html",
    "http://www.xyz.edu/students/bobsmith.html"
  ],
  "name": [
    "Jane Doe"
  ],
  "url": [
    "www.janedoe.com"
  ],
  "image": [
    "janedoe.jpg"
  ],
  "address": [
    { 
      "$itemtype": "http://schema.org/PostalAddress",
      "addressLocality": [
        "Seattle"
      ],
      "streetAddress": [
        "\n          20341 Whitworth Institute\n          405 N. Whitworth\n" 
      ],
      "postalCode": [
        "98052"
      ],
      "addressRegion": [
        "WA"
      ]
    }
  ],
  "telephone": [
    "(425) 123-4567"
  ],
  "jobTitle": [
    "Professor"
  ],
  "email": [
    "mailto:jane-doe@xyz.edu"
  ]
}
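
The JSON above stores every property as a list of values, and a nested item (such as the address) appears as an object with its own "$itemtype". If you hand that JSON to other code, a minimal sketch of reading it back with only the standard library might look like the following. The helper names `first` and `nested_items` are hypothetical and are not part of microdata.py, and the document here is a trimmed copy of the output above.

```python
import json

# A trimmed copy of the JSON output shown above (only a few properties kept).
doc = json.loads("""
{
  "$itemtype": "http://schema.org/Person",
  "$itemid": "http://www.xyz.edu/~jane",
  "name": ["Jane Doe"],
  "colleagues": [
    "http://www.xyz.edu/students/alicejones.html",
    "http://www.xyz.edu/students/bobsmith.html"
  ],
  "address": [
    {
      "$itemtype": "http://schema.org/PostalAddress",
      "addressLocality": ["Seattle"],
      "addressRegion": ["WA"]
    }
  ]
}
""")


def first(item, name, default=None):
    """Hypothetical helper: first value of a property, or default if absent."""
    values = item.get(name, [])
    return values[0] if values else default


def nested_items(item):
    """Hypothetical helper: yield (property, value) pairs whose value is an item."""
    for name, values in item.items():
        if name.startswith("$"):
            continue  # "$itemtype" and "$itemid" are metadata, not property lists
        for value in values:
            if isinstance(value, dict):
                yield name, value


name = first(doc, "name")                  # single-valued lookup
colleagues = doc["colleagues"]             # every value of a property
address = first(doc, "address")            # nested item as a dict
city = first(address, "addressLocality")   # property of the nested item
```

This mirrors the split in the session above between `item.name` (one value) and `item.get_all('colleagues')` (all values), but it works on the serialized form rather than on the library's item objects.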