# focused_crawler

## Prerequisites

To start the focused crawler REST API service, a Java Servlet container is required. Find out more about [Apache Tomcat](https://tomcat.apache.org/).

## Installation

To set up the API, follow these steps:

```
> git clone https://github.com/bfetahu/focused_crawler.git
> cd focused_crawler
> mvn compile
> mvn war:war
```

This builds the WAR file in the `target` directory. Copy the WAR into the deployment directory of your installed Java Servlet container (e.g. `apache-tomcat/webapps/`).
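A minimal deployment sketch, assuming Tomcat is installed at `/opt/apache-tomcat` and the build produced `target/focused_crawler.war` (both the path and the WAR name depend on your setup and the `pom.xml` configuration):

```
> cp target/focused_crawler.war /opt/apache-tomcat/webapps/
> /opt/apache-tomcat/bin/startup.sh
```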

## Methods

Once the REST API is deployed, the individual methods are accessible through URLs.

The URL is formatted as: `http://<route_to_the_api>/CrawlAPI/method_name?parameter_1=value_1&parameter_2=value_2`

e.g. `http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/crawl?seeds=http://dbpedia.org/resource/Berlin;http://dbpedia.org/resource/Social_Democratic_Party_of_Germany;&user=1&depth=2`
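Such a URL can be invoked directly from the command line, e.g. with curl. The host below, and in the per-method sketches that follow, is the illustrative one from the example above; quote the URL so the shell does not interpret the `;` and `&` characters:

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/crawl?seeds=http://dbpedia.org/resource/Berlin;&user=1&depth=2"
```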

### 1. Initiate a crawl

Method: crawl

Parameters: {seeds, user, depth}

Output: The crawl identification number, which is later used to perform other operations implemented in the focused crawling module.
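A minimal sketch that stores the returned crawl identification number for the follow-up calls below, assuming the response body is the plain crawl ID (adjust if the service wraps it in another format):

```
> CRAWL_ID=$(curl -s "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/crawl?seeds=http://dbpedia.org/resource/Berlin;&user=1&depth=2")
```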

### 2. Filter crawled entity candidates

Method: filterCrawlCandidates

Parameters: {crawl_id, crawl_filter}

Output: Confirmation that the matching candidate entities are filtered out of the candidate entity set.
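An illustrative call reusing `$CRAWL_ID` from above; `<filter_condition>` is a placeholder, as the concrete filter syntax is defined by the crawling module:

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/filterCrawlCandidates?crawl_id=$CRAWL_ID&crawl_filter=<filter_condition>"
```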

### 3. Export crawl data to SDA

Method: exportToSDA

Parameters: {crawl_id}

Output: A message confirming that the crawled entity candidates have been successfully exported into the SDA.
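Illustrative call:

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/exportToSDA?crawl_id=$CRAWL_ID"
```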

### 4. Delete a crawl

Method: deleteCrawl

Parameters: {crawl_id}

Output: A message confirming that all information from a specific crawl (crawled candidates and crawl metadata) has been deleted.
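Illustrative call:

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/deleteCrawl?crawl_id=$CRAWL_ID"
```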

### 5. Load a crawl

Method: loadCrawl

Parameters: {crawl_id}

Output: A ranked list of crawled candidate entities.
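Illustrative call:

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/loadCrawl?crawl_id=$CRAWL_ID"
```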


### 6. Load the list of all crawls

Method: loadAllRegisteredCrawls

Parameters: {N/A}

Output: The list of all running/finished crawls, with the detailed crawl metadata.
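Illustrative call (no parameters):

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/loadAllRegisteredCrawls"
```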

### 7. Load the list of finished crawls

Method: loadFinishedCrawls

Parameters: {N/A}

Output: The list of all finished crawls, with the detailed crawl metadata.
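Illustrative call (no parameters):

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/loadFinishedCrawls"
```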

### 8. Load the list of all crawls initiated by a specific user

Method: loadCrawlsByUser

Parameters: {user}

Output: The list of all crawls initiated by a specific user, with the detailed crawl metadata.
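Illustrative call, using the example user ID from the crawl above:

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/loadCrawlsByUser?user=1"
```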

### 9. Load the list of all crawls containing a specific seed entity

Method: loadCrawlsBySeed

Parameters: {seeds}

Output: The list of all crawls containing a specific seed entity, with the detailed crawl metadata.
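Illustrative call, using one of the example seed entities from above:

```
> curl "http://tomcatserver.xxx.yyy.de:8080/api/CrawlAPI/loadCrawlsBySeed?seeds=http://dbpedia.org/resource/Berlin"
```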

## Parameters Description

| parameter | methods | description |
| --------- | ------- | ----------- |
| user | crawl, loadCrawlsByUser | The user ID or username, as part of an authentication system within the DURAARK WorkbenchUI. |
| depth | crawl | The maximal depth to which links are followed into the Linked Open Data graph from the initial seed entities. |
| seeds | crawl, loadCrawlsBySeed | Seed entities from a specific knowledge base (e.g. DBpedia) from which the focused crawl is initiated; in the case of loadCrawlsBySeed, the seed entity for which all containing crawls are listed. |
| crawl_id | filterCrawlCandidates, deleteCrawl, exportToSDA, loadCrawl | The crawl identification number used to filter, delete, export, or load the candidate entities of a specific crawl. |
| crawl_filter | filterCrawlCandidates | The filter condition used to delete already crawled candidate entities from a specific crawl. |