A CLI program that reads an HTML/XML stream from stdin, extracts text, and prints it to stdout.
- Install Python 3 and pip on Ubuntu/Debian Linux:
sudo apt install python3.6 python3-pip
- Install dependencies:
pip3 install lxml
or
pip3 install -r requirements.txt
- Download wextract.py and make it executable:
curl -LJO https://raw.githubusercontent.com/byte-cook/wextract/main/wextract.py
chmod +x wextract.py
Show help:
wextract.py -h
Create a simple list from an HTML table, skipping the header row:
cat file.html | wextract.py -l td -s "table tr" td text - ": " "td:nth-child(2)" text
Explanation:
-l td
: skip the line if the text of the td tag is empty
-s "table tr"
: select each tr tag of the table as the root element (all of its sub-elements are processed)
td text
: print text of td tag
- ": "
: print ": " as separator
"td:nth-child(2)" text
: print text of the second td tag
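
To make the effect of these options concrete, here is a minimal Python sketch of the same row-by-row logic. It is not how wextract.py is implemented: it uses the standard library's xml.etree.ElementTree instead of lxml and CSS selectors, so it assumes well-formed markup. The sample table and the function name table_to_lines are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; it is well-formed so the stdlib XML parser accepts it.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""


def table_to_lines(markup):
    """Return one 'first: second' string per row whose first td has text."""
    root = ET.fromstring(markup)
    lines = []
    for tr in root.iter("tr"):  # roughly: -s "table tr"
        cells = tr.findall("td")
        first = "".join(cells[0].itertext()).strip() if cells else ""
        if not first:  # roughly: -l td (skip rows without td text, e.g. the header)
            continue
        # Second td of the row; matches td:nth-child(2) when the row holds only td cells.
        second = "".join(cells[1].itertext()).strip() if len(cells) > 1 else ""
        lines.append(first + ": " + second)  # roughly: td text - ": " "td:nth-child(2)" text
    return lines


if __name__ == "__main__":
    print("\n".join(table_to_lines(SAMPLE)))
```

The header row contains only th cells, so its td text is empty and the row is dropped, which is why the list starts with the first data row.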