Update index.rst
julianafreire committed Jun 16, 2017
1 parent 785dbf3 commit 590332d
Showing 1 changed file with 4 additions and 3 deletions.
docs/index.rst
@@ -2,14 +2,15 @@ Welcome to ACHE's Documentation!
========================================

ACHE is a focused web crawler. It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern.
- ACHE differs from generic crawlers in sense that it uses *page classifiers* to distinguish between relevant and irrelevant pages in a given domain. A page classifier can be from a simple regular expression (that matches every page that contains a specific word, for example), to a machine-learning based classification model.
- ACHE can also automatically learn how to prioritize links in order to efficiently locate relevant content while avoiding the retrieval of irrelevant content.
+ ACHE differs from generic crawlers in the sense that it uses *page classifiers* to distinguish between relevant and irrelevant pages in a given domain. A page classifier can be defined as a simple regular expression (e.g., one that matches every page that contains a specific word) or a machine-learning-based classification model.
+ ACHE also automatically learns how to prioritize links in order to efficiently locate relevant content while avoiding the retrieval of irrelevant pages.
+ While ACHE was originally designed to perform focused crawls, it also supports other crawling tasks, including crawling all pages in a given web site and crawling Dark Web sites (using the TOR protocol).
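
For readers skimming this change, a rough sketch of what a regex-based page classifier boils down to may help. The ``RegexPageClassifier`` class below is a hypothetical Python illustration, not ACHE's actual API (ACHE itself is a Java project, and its real classifier interface is not shown in this diff):

.. code-block:: python

   import re

   # Hypothetical sketch of a regex-style page classifier, in the spirit
   # of the paragraph above. Names are illustrative, not ACHE's API.
   class RegexPageClassifier:
       def __init__(self, pattern: str):
           self.pattern = re.compile(pattern, re.IGNORECASE)

       def is_relevant(self, page_content: str) -> bool:
           # A page is relevant if its content matches the pattern.
           return self.pattern.search(page_content) is not None

   classifier = RegexPageClassifier(r"focused crawl")
   print(classifier.is_relevant("An intro to focused crawling"))  # True
   print(classifier.is_relevant("Cat pictures"))                  # False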

ACHE supports many features, such as:

* Regular crawling of a fixed list of web sites
* Discovery and crawling of new relevant web sites through automatic link prioritization
- * Configuration of different types of pages classifiers (machine-learning, regex, etc)
+ * Configuration of different types of page classifiers (machine-learning, regex, etc.)
* Continuous re-crawling of sitemaps to discover new pages
* Indexing of crawled pages using Elasticsearch
* Web interface for searching crawled pages in real-time
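
The automatic link prioritization mentioned in the list above amounts to a best-first crawl frontier: links predicted to lead to relevant pages are fetched first. Below is a minimal sketch assuming a simple keyword-overlap score; ``score_link`` and ``TOPIC_KEYWORDS`` are illustrative stand-ins, since ACHE actually learns its link scores from crawled data:

.. code-block:: python

   import heapq

   # Minimal best-first frontier sketch. The keyword-overlap score is a
   # stand-in assumption; ACHE's real model is learned, not hand-coded.
   TOPIC_KEYWORDS = {"focused", "crawler", "classifier"}

   def score_link(anchor_text: str) -> float:
       words = set(anchor_text.lower().split())
       return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

   frontier = []  # heapq is a min-heap, so scores are negated for best-first
   links = [
       ("http://example.com/gallery", "photo gallery"),
       ("http://example.com/ache", "a focused crawler with page classifier"),
   ]
   for url, anchor in links:
       heapq.heappush(frontier, (-score_link(anchor), url))

   while frontier:
       neg_score, url = heapq.heappop(frontier)
       print(f"fetch {url} (score={-neg_score:.2f})")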
