
Some questions Ubuntu 16.04 #172

Closed

tpolo777 opened this issue Nov 15, 2018 · 1 comment

tpolo777 commented Nov 15, 2018

The main page of the ACHE repo says "Web interface for searching crawled pages in real-time", but when I started running ACHE there was no interface. (Maybe I misunderstood something.) I also got stuck a little when creating a configuration file. Is this documentation up to date? (https://ache.readthedocs.io/en/latest/index.html)

How can I set ACHE to return a CSV file of relevant pages without the URLs that were already in the seed file? It is quite time-consuming to find new URLs by going through a harvest of 15000+ pages.

My aim is to find databases with scientific publications/books. Do you have any suggestions on how to increase accuracy? Thank you for your quick support and help! You are doing a great job.

Here is my configuration file:

#
# Example of configuration for running a Focused Crawl
#

# Store pages classified as irrelevant pages by the target page classifier
target_storage.store_negative_pages: true

# Limit the max number of pages crawled per domain, in order to avoid crawling
# too many pages from the same domain and favor the discovery of new domains
link_storage.max_pages_per_domain: 10000

# Disable "seed scope" to allow crawl pages from any domain
link_storage.link_strategy.use_scope: false

# Set the initial link classifier to a simple one
link_storage.link_classifier.type: MaxDepthLinkClassifier
link_storage.link_classifier.max_depth: 3
# Train a new link classifier while the crawler is running. This allows
# the crawler to automatically learn how to prioritize links in order to
# efficiently locate relevant content while avoiding the retrieval of
# irrelevant content.
link_storage.online_learning.enabled: true
link_storage.online_learning.type: FORWARD_CLASSIFIER_BINARY
link_storage.online_learning.learning_limit: 1000

# Always select top-k links with highest priority to be scheduled
link_storage.link_selector: TopkLinkSelector

# Configure the minimum time interval (in milliseconds) to wait between requests
# to the same host to avoid overloading servers. If you are crawling your own
# web site, you can decrease this value to speed up the crawl.
link_storage.scheduler.host_min_access_interval: 5000

# Configure the User-Agent of the crawler
crawler_manager.downloader.user_agent.name: ACHE
crawler_manager.downloader.user_agent.url: https://github.com/ViDA-NYU/ache
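
For context, a configuration like this is typically passed to ACHE when starting a crawl. The command below is only a sketch with placeholder paths; check the output of ache help (or the docs) for the exact options of your ACHE version.

# Sketch only: start a focused crawl using the config above (placeholder paths)
ache startCrawl -c <config-directory> -s <seed-file> -o <output-directory> -m <model-directory>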

aecio commented Nov 27, 2018

Yes, the docs are up to date.

Whenever a crawl is started, a web server is started by default on port 8080, and ACHE prints its address in the logs. When you open it in the browser, you can see some crawler statistics as well as search the content. The search will only work if you have configured Elasticsearch. The sample configuration in {ACHE_ROOT}/config/config_docker includes a configuration for Elasticsearch using Docker; you can use it to try out the search feature.
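
To illustrate, the snippet below is a rough sketch of what enabling the Elasticsearch back end in ache.yml can look like. The key names are assumptions based on the ACHE documentation and may differ between versions; the config_docker sample mentioned above is the authoritative reference.

# Sketch only (key names may differ by ACHE version; see config/config_docker):
target_storage.data_formats:
  - ELASTICSEARCH
target_storage.data_format.elasticsearch.rest.hosts:
  - http://localhost:9200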

Finally, ACHE also stores some TSV files in its output folder. One of the files, relevantpages.csv, includes only the pages classified as relevant by the page classifier provided.
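
If you want that output without the URLs from your seed file, a small post-processing script can filter them out. The sketch below assumes the file is tab-separated with the page URL in the first column; this layout is an assumption rather than a documented ACHE guarantee, so adjust the delimiter and column index to match your output.

#!/usr/bin/env python3
# Sketch: filter ACHE's relevantpages.csv to drop URLs that were already in the seed file.
# Assumes a tab-separated file with the page URL in the first column; adjust if your version differs.
import csv
import sys

def filter_seeds(relevant_path, seeds_path, out_path):
    # Load the seed URLs (one per line; blank lines and comment lines are ignored).
    with open(seeds_path) as f:
        seeds = {line.strip() for line in f if line.strip() and not line.startswith("#")}

    with open(relevant_path, newline="") as infile, open(out_path, "w", newline="") as outfile:
        reader = csv.reader(infile, delimiter="\t")
        writer = csv.writer(outfile, delimiter="\t")
        for row in reader:
            # Keep only rows whose URL (first column) is not one of the seeds.
            if row and row[0] not in seeds:
                writer.writerow(row)

if __name__ == "__main__":
    # Usage: python filter_seeds.py relevantpages.csv seeds.txt relevant_without_seeds.csv
    filter_seeds(sys.argv[1], sys.argv[2], sys.argv[3])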

aecio added the question label Mar 25, 2019
aecio closed this as completed Jan 11, 2022