awesome-html-parsing

You have HTML. We have parsers.

HTML parsing for kicks.

Draft list

grep. Just no.
sed
awk?
Node / JavaScript
htmlq
- Written in Rust
- Uses CSS selectors
- sudo apt install -y cargo && time cargo install htmlq
- cat file.html | htmlq 'div > song > content > h3'
- Extracts text content with --text (-t)
- Does not trim leading whitespace. For that, and for dropping blank lines, pipe the output through sed: sed 's/^ *//' | sed '/^[[:space:]]*$/d'
Perl regex
- Similar to grep, but more powerful
- cat file.html | perl -C7 -0777 -e '(@s) = <STDIN> =~ m{<h3>(.+?)</h3>}sg; @s = map { s/^\s*//; s/\s*$//; s/\s{2,}/ /g; $_ } @s; print "@s"'
Perl modules
- TODO
Python
- html.parser
- Event-based, not DOM. Ugh
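
To show what "event-based" means in practice, here is a minimal sketch that pulls the text out of every h3 using only the standard library's html.parser. The names H3Collector and extract_h3 are made up for this example, and it assumes h3 is the element you care about, as in the Perl one-liner above. Treat it as a starting point, not a robust scraper.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collect the text inside every <h3> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside an <h3>
        self.parts = []     # text chunks of the <h3> being read
        self.headings = []  # finished, whitespace-normalized headings

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h3" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the chunks, then collapse whitespace runs to one space.
                text = " ".join("".join(self.parts).split())
                self.headings.append(text)
                self.parts = []

    def handle_data(self, data):
        # Text events arrive for the whole document; keep only <h3> text.
        if self.depth:
            self.parts.append(data)


def extract_h3(html):
    """Return the text of each <h3> in the given HTML string."""
    parser = H3Collector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

Call it as extract_h3(open("file.html", encoding="utf-8").read()). There is no tree to query, so you track state (here, depth) yourself across start-tag, data, and end-tag events. That is the "Ugh".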
Ruby
- nokogiri
- TODO
Go
- TODO
GUIs, Applications, Web services
- Real Programmers Don't Use GUIs
