A scraper designed to scrape Bioschemas markup, in either JSON-LD or RDFa, from a set of known web pages. Implementation decisions are discussed here.
There are 3 sub-modules:
- core provides core scraping functionality.
- service extends core to be a process that can be run via the command line. URLs to be scraped are read from a database.
- web turns core into a very basic webapp that scrapes a single URL fed into the app.
Design decisions:
- Using Apache Any23 to parse structured data from HTML. Apache Any23 is believed to have the best RDFa parser available for Java, in the sense that it does not require the HTML to be perfect.
- If not using RDFa, simply pulling the HTML and extracting the relevant div blocks with JSoup is faster.
- Using a headless Chrome driver with Selenium to load pages. As pages are increasingly dynamic, the JavaScript on each page needs to be executed before scraping (see the sketch after this list).
- Quads are used so that basic provenance can be captured at the page/context/graph level.
- Quads are not automatically loaded into a triplestore, as that is very slow and scraping is slow enough already. Saving to file and then bulk importing is much quicker. This also means the end user can choose their own triplestore.
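Taken together, these decisions suggest a pipeline along the lines of the sketch below. This is a minimal illustration, not the project's actual code: the class name and target URL are hypothetical, and it assumes the Selenium and Any23 dependencies from the POM are on the classpath.

```java
// Hypothetical sketch (not part of this repo): render a page with headless
// Chrome so its JavaScript runs, then let Apache Any23 extract any embedded
// JSON-LD/RDFa and serialise it as NQuads.
import java.io.ByteArrayOutputStream;

import org.apache.any23.Any23;
import org.apache.any23.source.StringDocumentSource;
import org.apache.any23.writer.NQuadsWriter;
import org.apache.any23.writer.TripleHandler;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class ScrapeSketch {
    public static void main(String[] args) throws Exception {
        String url = "https://example.org/some-page"; // hypothetical target

        // Load the page in headless Chrome; getPageSource() returns the DOM
        // after the page's JavaScript has run.
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        String html;
        try {
            driver.get(url);
            html = driver.getPageSource();
        } finally {
            driver.quit();
        }

        // Any23 parses the structured data; writing NQuads keeps the page URL
        // as the context/graph, giving the basic provenance mentioned above.
        Any23 runner = new Any23();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NQuadsWriter(out);
        try {
            runner.extract(new StringDocumentSource(html, url), handler);
        } finally {
            handler.close();
        }
        System.out.print(out); // append this to a .nq dump for bulk import
    }
}
```

Bulk-loading the resulting NQuads file into a triplestore of your choice is then a separate step, as described above.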
Requirements for core:
- Java 1.8
- Maven v3.6.0
- Google Chrome browser and ChromeDriver; you must use the same version for both, e.g., v77.
- other requirements are provided through the POM file.
Additional requirements for service:
- MySQL v8.0.13
- an appropriate ChromeDriver for your version of Chrome
- other requirements are provided through the POM file.
The web module has no additional requirements, but you will need some way of running WAR files (e.g., a servlet container such as Apache Tomcat).
First, clone the repo to your machine. Core is relied on by both service and web; however, core can also be used in a standalone manner.
The core module provides the core functionality as an abstract class. Additionally, two example classes are provided that can be used to scrape either a single given URL or a series of URLs from a given file. For most purposes the file scraper is likely to be sufficient, and there is no need to explore further. If you follow the instructions below, you will run the file scraper.
To use this:
- Update `core > src > main > resources > applications.properties` (an illustrative example is sketched at the end of this section). You need to specify:
  - output location: currently all RDF is saved as NQuads to a folder.
  - location of the sites file: where the list of URLs you wish to scrape is located. There is an example in `core > src > main > resources > urls2scrape.txt`.
  - location of the chrome driver. This is not the location of the folder, but the full path to the driver file. On Windows this will be called `chromedriver.exe`.
    - NOTE: if you are using Windows you will have to use a double backslash as the file separator, i.e., `\\` not `\`.
- Create/edit your list of URLs file.
- Package with Maven: `mvn clean package`
  - If you only want to compile/run core, you can do this from inside the core directory.
  - If you also want to use service or web, then run Maven from the top-level Scraper folder.
- Inside the `core > target` directory you will find two JARs. The fat JAR is called `core-x.x.x-SNAPSHOT.jar` and the skinny JAR is `original-core-x.x.x-SNAPSHOT.jar`. Run the fat JAR however you wish, via Maven or the command line, e.g., `java -jar core-x.x.x-SNAPSHOT.jar`. This will run the file scraper.
Note: Running this will produce a `localProperties.Properties` file, which can be ignored. It is simply used to maintain an auto-incrementing count of the number of sites scraped (a.k.a. the `contextCounter`). You can reset this count to 0 by deleting the `localProperties.Properties` file.
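For orientation, the applications.properties referred to above might look something like the sketch below. The key names here are assumptions for illustration only; the actual names are defined in the copy of the file shipped in the repo.

```properties
# Illustrative sketch only: these key names are assumed, not the project's
# actual ones -- check core/src/main/resources/applications.properties.

# Folder where the scraped NQuads are written (assumed key name)
outputFolder=/home/user/scraper-output

# Full path to the file listing the URLs to scrape (assumed key name)
locationOfSitesFile=/home/user/Scraper/core/src/main/resources/urls2scrape.txt

# Full path to the ChromeDriver binary itself, not its folder (assumed key name)
# On Windows use double backslashes, e.g., C:\\tools\\chromedriver.exe
locationOfChromeDriver=/usr/local/bin/chromedriver
```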
The service module assumes a database of URLs that need to be scraped. It will collect a list of URLs from the database, scrape them, and write the output, in NQuads format, to a specified folder.
To use this:
- You may want to set the JVM parameters to increase the amount of RAM available to Java (e.g., via `-Xmx`).
- Add your database connection to Hibernate; we are using `service > src > main > resources > META-INF > persistence.xml` (an illustrative example is sketched after this list).
- If your database is empty, running the program (by following the steps below) will create an empty table before stopping, as there are no URLs to scrape. You can then populate this table and re-run the program to perform the scrape. Alternatively, you can create the table and populate the database manually. An example script for this can be found in `service > src > main > resources > setUpDatabaseScript.sql`. If you run this before running the program, it will start scraping immediately.
- Update `service > src > main > resources > applications.properties`. You need to specify:
  - how long you want to wait between fetching pages, measured in tenths of a second (default: 5 = 0.5 seconds).
  - output location: currently all RDF is saved as NQuads to a folder.
  - how many pages you want to crawl in a single loop (default: 8).
  - how many pages you want to crawl in a single session; there are multiple loops in a session (default: 32). The default settings are enough to run the scraper and check that everything is working; they should be increased for a real-world scrape.
  - location of the chrome driver.
- Package with Maven: `mvn clean package` from the top level, i.e., the Scraper folder, not the service folder.
- Inside the `service > target` directory you will find `service.jar`. Run it however you wish, via Maven or the command line, e.g., `java -jar service.jar`.
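As referenced in the list above, the sketch below shows one way the persistence.xml connection settings might look, using standard JPA property names. The persistence-unit name, database name, and credentials are placeholders, not the project's actual values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: the unit name, database, user, and password
     below are placeholders; adapt them to your own MySQL instance. -->
<persistence xmlns="http://xmlns.jcp.org/xml/ns/persistence"
             version="2.1">
  <persistence-unit name="scraper">
    <properties>
      <!-- Standard JPA 2.1 connection properties for MySQL Connector/J 8 -->
      <property name="javax.persistence.jdbc.driver" value="com.mysql.cj.jdbc.Driver"/>
      <property name="javax.persistence.jdbc.url"
                value="jdbc:mysql://localhost:3306/scraper?serverTimezone=UTC"/>
      <property name="javax.persistence.jdbc.user" value="scraper_user"/>
      <property name="javax.persistence.jdbc.password" value="changeme"/>
    </properties>
  </persistence-unit>
</persistence>
```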
The web module is still in development, so its use is not recommended. The goal is to provide a small web app that receives a URL as a request and returns the (bio)schema markup from that URL in JSON format.
A project by SWeL funded through Elixir-Excelerate.