Easy HTML parsing for Haskell

HandsomeSoup

Current Status: Usable and stable. Needs GHC 7.6. Please file bugs!

HandsomeSoup is the library I wish I had when I started parsing HTML in Haskell.

It is built on top of HXT and adds a few functions that make it easier to work with HTML.

Most importantly, it adds CSS selectors to HXT. The goal of HandsomeSoup is to be a complete CSS2 selector parser for HXT.

Install

cabal install HandsomeSoup

Example

Nokogiri, the HTML parser for Ruby, has an example showing how to scrape Google search results. This is easy in HandsomeSoup:

import Text.XML.HXT.Core
import Text.HandsomeSoup

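main :: IO ()
main = do
    -- fromUrl builds an arrow that fetches and parses the page when run
    let doc = fromUrl "http://www.google.com/search?q=egon+schiele"
    -- select the result links and pull out each href attribute
    links <- runX $ doc >>> css "h3.r a" ! "href"
    mapM_ putStrLn links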

What can HandsomeSoup do for you?

Easily parse an online page using fromUrl

let doc = fromUrl "http://example.com"

Or a local page using parseHtml

contents <- readFile [filename]
let doc = parseHtml contents
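
As a minimal sketch of how parseHtml fits into a complete program, the example below parses an in-memory string instead of a file. The sample markup and the helper name extractHrefs are made up for illustration and are not part of HandsomeSoup.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample document, standing in for the contents of a local file.
sampleHtml :: String
sampleHtml =
    "<html><body><p><a href=\"/one\">One</a></p><a href=\"/two\">Two</a></body></html>"

-- Hypothetical helper: parse an HTML string and collect the href of every <a> element.
extractHrefs :: String -> IO [String]
extractHrefs html = runX $ parseHtml html >>> css "a" ! "href"

main :: IO ()
main = extractHrefs sampleHtml >>= mapM_ putStrLn
```

To work from a file, read it with readFile and pass the resulting string to the same helper.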

Easily extract elements using css

Here are some valid selectors:

doc >>> css "a"
doc >>> css "*"
doc >>> css "a#link1"
doc >>> css "a.foo"
doc >>> css "p > a"
doc >>> css "p strong"
doc >>> css "#container h1"
doc >>> css "img[width]"
doc >>> css "img[width=400]"
doc >>> css "a[class~=bar]"
doc >>> css "a:first-child"
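
To see what a few of these selectors match, here is a rough sketch that runs them against a small made-up page and prints the text found directly inside each matched element. The page, the helper name matchText, and the choice of selectors are assumptions for illustration; `/>` and getText come from HXT.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical page used only for this example.
samplePage :: String
samplePage = concat
    [ "<html><body><div id=\"container\"><h1>Title</h1>"
    , "<p><a id=\"link1\" class=\"foo\" href=\"/one\">One</a> <strong>Bold</strong></p>"
    , "</div></body></html>"
    ]

-- Hypothetical helper: text nodes that are direct children of each element
-- matched by the given selector.
matchText :: String -> IO [String]
matchText sel = runX $ parseHtml samplePage >>> css sel /> getText

main :: IO ()
main = mapM_ report ["a#link1", "a.foo", "p strong", "#container h1"]
  where
    -- Print each selector next to whatever text it matched.
    report sel = do
        found <- matchText sel
        putStrLn (sel ++ " -> " ++ show found)
```

Note that css calls error on a selector it cannot parse, so it is worth checking new selectors against LIST_OF_SUPPORTED_SELECTORS.markdown first.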

Easily get attributes using (!)

doc >>> css "img" ! "src"
doc >>> css "a" ! "href"
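
(!) yields a single attribute value per element. When you want an attribute together with something else from the same element, one option is to drop down to plain HXT arrows. The sketch below pairs each link's href with its text using HXT's getAttrValue and the standard `&&&` combinator; the sample markup and the name linkPairs are hypothetical.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample document.
linksHtml :: String
linksHtml =
    "<html><body><a href=\"/one\">One</a><a href=\"/two\">Two</a></body></html>"

-- Hypothetical helper: pair each link's href with the text directly inside it.
-- getAttrValue yields an empty string when the attribute is missing.
linkPairs :: String -> IO [(String, String)]
linkPairs html = runX $
    parseHtml html
        >>> css "a"
        >>> (getAttrValue "href" &&& (getChildren >>> getText))

main :: IO ()
main = linkPairs linksHtml >>= mapM_ print
```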

Docs

Find Haddock docs on Hackage.

I also wrote The Complete Guide To Parsing HXT With Haskell.

Credits

Made by Adit.