
Data Extraction Methods

An automatic runner of the data extraction methods RoadRunner and Webstemmer, as well as a custom Scrapy crawler. Works on news articles. Once your environment is set up, you can run all methods at once by running the executable Python script:

$ ./main.py

main.py runs four functions:

  1. scrape(), which downloads the HTML from desired webpages into ./examples/ and creates a zip of them for Webstemmer to use
  2. roadrunner(), which executes the RoadRunner method and saves results into ./roadrunner/output/
  3. webstemmer(), which executes the Webstemmer method and saves results (in the form of *.txt files) and generated wrappers (in the form of *.pat files) into ./webstemmer/webstemmer/
  4. scrapy(), which executes a custom implementation of Scrapy, a web crawler that extracts data using XPath, and writes the results into *.json files in ./scrapynews/scraped-content/

At the end, the time needed per webpage by each data extraction method is reported. A rough sketch of this flow is shown below.
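A minimal sketch of what main.py's top-level flow might look like, assuming each of the four methods is wrapped in a function and timed individually (the placeholder bodies below are illustrative only, not the repository's actual code):

#!/usr/bin/env python3
"""Illustrative sketch of main.py's flow; the real function bodies live in the repository."""
import time

def scrape():      # placeholder: downloads HTML into ./examples/ and zips it for Webstemmer
    pass

def roadrunner():  # placeholder: runs RoadRunner, saving results into ./roadrunner/output/
    pass

def webstemmer():  # placeholder: runs Webstemmer, saving *.txt results and *.pat wrappers
    pass

def scrapy():      # placeholder: runs the custom Scrapy spider, writing *.json results
    pass

def main():
    timings = {}
    for name, method in [("scrape", scrape), ("roadrunner", roadrunner),
                         ("webstemmer", webstemmer), ("scrapy", scrapy)]:
        start = time.time()
        method()
        timings[name] = time.time() - start
    # Report how long each data extraction method took.
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.2f}s")

if __name__ == "__main__":
    main()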

Environment Setup

  • Add URLs to constants.py to choose the webpages from which data will be extracted (see the example below)
  • Create a new file .env in the same folder as main.py and add the absolute path of the folder in which you wish to save scraped data to the SCRAPE_DEST_FOLDER environment variable, e.g. SCRAPE_DEST_FOLDER="/home/scraped-folders/"
  • Initialize pipenv by running pipenv install
  • Open pipenv shell by running pipenv shell
  • Run the program by running ./main.py from the repository's root folder
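For example, constants.py could hold the target URLs in a plain list and .env the destination folder. The URLS variable name below is an assumption for illustration, so check constants.py for the exact names the scripts expect:

# constants.py - list of news pages to scrape (variable name is illustrative)
URLS = [
    "https://example.com/news/article-1",
    "https://example.com/news/article-2",
]

# .env - absolute path of the folder where scraped data is saved
SCRAPE_DEST_FOLDER="/home/scraped-folders/"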

