Squirrel - Crawler of linked data.

Introduction

Squirrel is a crawler for the Linked Web. It provides several tools to search and collect data from its heterogeneous content.

You can find the crawler documentation, tutorials and more here: https://dice-group.github.io/squirrel.github.io/

Build notes

You can build the project with a simple mvn clean install and then use the Makefile:

  $ make build dockerize
  $ docker-compose build
  $ docker-compose up

Run

You can run Squirrel by using the docker-compose file.

  $ docker-compose -f docker-compose-sparql.yml up

Squirrel uses Spring context configuration to define the implementations of its components at runtime. You can check the default configuration file in spring-config/sparqlStoreBased.xml and define your own beans in it.

You can also define a different context for each worker: check the docker-compose file and set a different configuration file in each worker's environment variable.
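As a sketch, a worker service in the docker-compose file might point to its own context file via an environment variable. The variable name, image tag and paths below are illustrative assumptions; check the shipped docker-compose files for the exact keys used by your release:

```yaml
# Hypothetical docker-compose excerpt: each worker gets its own Spring context,
# so worker1 and worker2 can run with different component implementations.
services:
  worker1:
    image: squirrel.worker:latest
    environment:
      # assumed variable name; see the real docker-compose file for the actual key
      - CONTEXT_CONFIG_FILE=/var/squirrel/spring-config/sparqlStoreBased.xml
  worker2:
    image: squirrel.worker:latest
    environment:
      - CONTEXT_CONFIG_FILE=/var/squirrel/spring-config/myCustomContext.xml
```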

These are the components of Squirrel that can be customized:

Fetcher

  • HTTPFetcher - Fetches data from HTTP sources.

  • FTPFetcher - Fetches data from FTP sources.

  • SparqlBasedFetcher - Fetches data from SPARQL endpoints.

  • Note: the fetchers are not managed as Spring beans yet, since only three are available.

Analyzer

Analyzes the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the SimpleAnalyzerManager. Any implementation should be passed to the constructor of this class, as in the example below:

<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1">
        <array value-type="java.lang.String">
            <value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
        </array>
    </constructor-arg>
</bean>

Also, if you want to implement your own analyzer, you must implement the method isEligible(), which checks whether the analyzer matches the conditions to have its analyze method called.
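The isEligible()/analyze() contract can be sketched as follows. This is a simplified illustration only: the real Squirrel Analyzer interface works on CrawleableUri, File and Sink objects, whereas the hypothetical interface below uses plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the real Analyzer interface (an assumption for illustration).
interface Analyzer {
    // Decides whether this analyzer can handle the fetched data;
    // the manager only calls analyze() when this returns true.
    boolean isEligible(String uri, String contentType);

    // Extracts triples (here returned as N-Triples strings) from the data.
    List<String> analyze(String uri, String data);
}

public class CsvAnalyzer implements Analyzer {
    @Override
    public boolean isEligible(String uri, String contentType) {
        // Only claim CSV responses; all other content is left to other analyzers.
        return "text/csv".equals(contentType) || uri.endsWith(".csv");
    }

    @Override
    public List<String> analyze(String uri, String data) {
        List<String> triples = new ArrayList<>();
        for (String line : data.split("\n")) {
            String[] cells = line.split(",");
            if (cells.length >= 2) {
                // Hypothetical predicate, just to produce one triple per row.
                triples.add("<" + uri + "> <urn:ex:hasValue> \"" + cells[1] + "\" .");
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        CsvAnalyzer analyzer = new CsvAnalyzer();
        System.out.println(analyzer.isEligible("http://example.org/data.csv", "text/csv"));
        System.out.println(analyzer.analyze("http://example.org/data.csv", "id,name\n1,Squirrel"));
    }
}
```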

Collectors

Collects new URIs found during the analysis process and serializes them before they are sent to the Frontier.

  • SimpleUriCollector - Serializes URIs and stores them in memory (mainly used for testing purposes).
  • SqlBasedUriCollector - Serializes URIs and stores them in an HSQLDB database.

Sink

Responsible for persisting the collected RDF data.

  • FileBasedSink - persists the triples in NT files.
  • InMemorySink - persists the triples only in memory, not on disk (mainly used for testing purposes).
  • HdtBasedSink - persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
  • SparqlBasedSink - persists the triples in a SPARQL endpoint.
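To swap the sink implementation, you would replace the corresponding bean in the Spring context file. A sketch is shown below; the bean id, package path and constructor arguments are assumptions, so compare with the sink bean in spring-config/sparqlStoreBased.xml before using it:

```xml
<!-- Illustrative only: the class package and constructor arguments are assumed,
     not taken from the project; check spring-config/sparqlStoreBased.xml. -->
<bean id="sinkBean" class="org.aksw.simba.squirrel.sink.impl.file.FileBasedSink">
    <!-- directory into which the NT files would be written -->
    <constructor-arg index="0" value="/var/squirrel/data" />
</bean>
```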