Wextract.py

A CLI program that reads an HTML/XML stream from stdin, extracts text from it, and prints the result to stdout.

Requirements

Install Python 3 and pip on Ubuntu/Debian Linux:

sudo apt install python3.6 python3-pip

Install dependencies:

pip3 install lxml

or

pip3 install -r requirements.txt

Download wextract.py and make it executable:

curl -LJO https://raw.githubusercontent.com/byte-cook/wextract/main/wextract.py
chmod +x wextract.py

Usage examples

Show help:

wextract.py -h

Build a simple list from an HTML table, skipping the header row:

cat file.html | wextract.py -l td -s "table tr" td text - ": " "td:nth-child(2)" text

Explanation:
-l td : skip the output line if the text of the td tag is empty
-s "table tr" : select the tr tags of the table as root elements (each one is processed in turn, together with its sub-elements)
td text : print the text of the td tag
- ": " : print ": " as a separator
"td:nth-child(2)" text : print the text of the second td tag
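
To make the expected result of the command above concrete, here is a minimal sketch that produces the same kind of list using only the Python standard library. It is not how wextract.py is implemented (wextract.py depends on lxml); the names RowCollector, table_to_lines and SAMPLE are hypothetical and exist only for this illustration. The sketch assumes that header rows use th cells, that data rows contain only td cells (so the second td is also the second child), and that rows with fewer than two td cells can be dropped.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell texts per <tr>
        self._cell = None   # text fragments of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_lines(html, separator=": "):
    """Return 'first cell<separator>second cell' for each data row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    lines = []
    for cells in parser.rows:
        # A header row made of <th> cells yields no <td> text and is skipped,
        # as is any row whose first <td> is empty.
        if len(cells) >= 2 and cells[0]:
            lines.append(cells[0] + separator + cells[1])
    return lines


# Hypothetical sample input, standing in for file.html.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

if __name__ == "__main__":
    for line in table_to_lines(SAMPLE):
        print(line)
```

For the sample table this prints "alpha: 1" and "beta: 2"; the header row produces no output because it contains no td text.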
