An automatic runner for the data extraction methods RoadRunner and Webstemmer, as well as a custom Scrapy crawler. Works on news articles. Once your environment is set up, you can run all methods at once by running the executable Python script:
```
$ ./main.py
```
`main.py` runs four functions:

- `scrape()`, which downloads the HTML of the desired webpages into `./examples/` and creates a zip of them for Webstemmer to use
- `roadrunner()`, which executes the RoadRunner method and saves results into `./roadrunner/output/`
- `webstemmer()`, which executes the Webstemmer method and saves results (in the form of `*.txt` files) and generated wrappers (in the form of `*.pat` files) into `./webstemmer/webstemmer/`
- `scrapy()`, which executes a custom implementation of Scrapy, a web crawler that extracts data using XPath, and writes the results into `*.json` files in `./scrapynews/scraped-content/`
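To illustrate the kind of XPath-based extraction the `scrapy()` step performs, here is a minimal standalone sketch; the HTML snippet, the element names, and the JSON shape are made up for illustration and are not taken from the project (which uses Scrapy's full XPath support rather than the limited subset in the standard library):

```python
import json
from xml.etree import ElementTree

# Toy, well-formed HTML standing in for a downloaded news page
html = """<html><body>
<h1 class="title">Example headline</h1>
<p>Article body text.</p>
</body></html>"""

root = ElementTree.fromstring(html)
# findall() accepts a limited XPath subset; Scrapy itself evaluates full XPath
titles = [el.text for el in root.findall(".//h1")]
record = json.dumps({"title": titles[0]})
print(record)  # {"title": "Example headline"}
```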
At the end, the time each data extraction method needed per webpage is printed.
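The per-method timing could be collected with a simple wrapper like the one below; `timed()` and the stub `scrape()` are hypothetical stand-ins for illustration, not the project's actual code:

```python
import time

def timed(func):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

def scrape():
    # Stand-in for the real scrape(); the actual function downloads HTML
    time.sleep(0.01)
    return "done"

result, elapsed = timed(scrape)
print(f"scrape() took {elapsed:.3f} s")
```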
- Add URLs to `constants.py` to choose the webpages from which data will be extracted
- Create a new file `.env` in the same folder as `main.py` and set the `SCRAPE_DEST_FOLDER` environment variable to the absolute path of the folder in which you wish to save scraped data, e.g. `SCRAPE_DEST_FOLDER="/home/scraped-folders/"`
- Initialize pipenv by running `pipenv install`
- Open the pipenv shell by running `pipenv shell`
- Run the program by running `./main.py` from the repository's root folder
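For reference, reading `SCRAPE_DEST_FOLDER` from `.env` could look like the following; this minimal parser is only a sketch under the assumption of simple `KEY="value"` lines (the project may well use a library such as python-dotenv instead):

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: KEY="value" lines become environment variables."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

load_env()
# Fallback path here is purely illustrative
dest = os.environ.get("SCRAPE_DEST_FOLDER", "/tmp/scraped/")
```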