
Data Extraction Methods

An automatic runner of the data extraction methods RoadRunner and Webstemmer, as well as a custom Scrapy crawler. Works on news articles. Once your environment is set up, you can run all methods at once by running the executable Python script:

$ ./main.py

main.py runs four functions:

  1. scrape(), which downloads the HTML from desired webpages into ./examples/ and creates a zip of them for Webstemmer to use
  2. roadrunner(), which executes the RoadRunner method and saves results into ./roadrunner/output/
  3. webstemmer(), which executes the Webstemmer method and saves results (in the form of *.txt files) and generated wrappers (in the form of *.pat files) into ./webstemmer/webstemmer/
  4. scrapy(), which executes a custom implementation of Scrapy, a web crawler that extracts data using XPath, and writes the results into *.json files in ./scrapynews/scraped-content/

At the end, the time needed per webpage by each data extraction method is reported. A rough sketch of this flow is shown below.
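A minimal sketch of what main.py's top-level flow might look like, assuming each of the four methods is wrapped in a function and timed individually (the placeholder bodies below are illustrative only, not the repository's actual code):

#!/usr/bin/env python3
"""Illustrative sketch of main.py's flow; the real function bodies live in the repository."""
import time

def scrape():      # placeholder: downloads HTML into ./examples/ and zips it for Webstemmer
    pass

def roadrunner():  # placeholder: runs RoadRunner, saving results into ./roadrunner/output/
    pass

def webstemmer():  # placeholder: runs Webstemmer, saving *.txt results and *.pat wrappers
    pass

def scrapy():      # placeholder: runs the custom Scrapy spider, writing *.json results
    pass

def main():
    timings = {}
    for name, method in [("scrape", scrape), ("roadrunner", roadrunner),
                         ("webstemmer", webstemmer), ("scrapy", scrapy)]:
        start = time.time()
        method()
        timings[name] = time.time() - start
    # Report how long each data extraction method took.
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.2f}s")

if __name__ == "__main__":
    main()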

Environment Setup

  • Add URLs to constants.py to choose the webpages from which data will be extracted (see the example below)
  • Create a new file .env in the same folder as main.py and add the absolute path of the folder in which you wish to save scraped data to the SCRAPE_DEST_FOLDER environment variable, e.g. SCRAPE_DEST_FOLDER="/home/scraped-folders/"
  • Initialize pipenv by running pipenv install
  • Open pipenv shell by running pipenv shell
  • Run the program by running ./main.py from the repository's root folder
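For example, constants.py could hold the target URLs in a plain list and .env the destination folder. The URLS variable name below is an assumption for illustration, so check constants.py for the exact names the scripts expect:

# constants.py - list of news pages to scrape (variable name is illustrative)
URLS = [
    "https://example.com/news/article-1",
    "https://example.com/news/article-2",
]

# .env - absolute path of the folder where scraped data is saved
SCRAPE_DEST_FOLDER="/home/scraped-folders/"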

