
Some questions Ubuntu 16.04 #172

Closed

tpolo777 opened this issue Nov 15, 2018 · 1 comment

tpolo777 commented Nov 15, 2018

The main page of the ACHE repo says "Web interface for searching crawled pages in real-time", but when I started running ACHE there was no interface. (Maybe I misunderstood something.) I also got stuck a little when creating a configuration file. Is this documentation up to date? (https://ache.readthedocs.io/en/latest/index.html)

How can I set ACHE to return a CSV file of relevant pages without the URLs that were already in the seed file? It is quite time-consuming to find new URLs by going through a harvest of 15000+ pages.

My aim is to find databases with scientific publications/books. Do you have any suggestions on how to increase accuracy? Thank you for your quick support and help! You are doing a great job.

Here is my configuration file:

#
# Example of configuration for running a Focused Crawl
#

# Store pages classified as irrelevant pages by the target page classifier
target_storage.store_negative_pages: true

# Limit the max number of pages crawled per domain, in order to avoid crawling
# too many pages from the same domain and favor the discovery of new domains
link_storage.max_pages_per_domain: 10000

# Disable "seed scope" to allow crawl pages from any domain
link_storage.link_strategy.use_scope: false

# Set the initial link classifier to a simple one
link_storage.link_classifier.type: MaxDepthLinkClassifier
link_storage.link_classifier.max_depth: 3
# Train a new link classifier while the crawler is running. This allows
# the crawler to automatically learn how to prioritize links in order to
# efficiently locate relevant content while avoiding the retrieval of
# irrelevant content.
link_storage.online_learning.enabled: true
link_storage.online_learning.type: FORWARD_CLASSIFIER_BINARY
link_storage.online_learning.learning_limit: 1000

# Always select top-k links with highest priority to be scheduled
link_storage.link_selector: TopkLinkSelector

# Configure the minimum time interval (in milliseconds) to wait between requests
# to the same host to avoid overloading servers. If you are crawling your own
# web site, you can decrease this value to speed up the crawl.
link_storage.scheduler.host_min_access_interval: 5000

# Configure the User-Agent of the crawler
crawler_manager.downloader.user_agent.name: ACHE
crawler_manager.downloader.user_agent.url: https://github.com/ViDA-NYU/ache
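
For context, a configuration like this is typically passed to ACHE when starting a crawl. The command below is only a sketch with placeholder paths; check the output of ache help (or the docs) for the exact options of your ACHE version.

# Sketch only: start a focused crawl using the config above (placeholder paths)
ache startCrawl -c <config-directory> -s <seed-file> -o <output-directory> -m <model-directory>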

aecio commented Nov 27, 2018

Yes, the docs are up to date.

Whenever a crawl is started, a web server is started by default on port 8080, and ACHE prints its address in the logs. When you open it in the browser, you can see some crawler statistics as well as search the content. The search will only work if you have configured Elasticsearch. The sample configuration in {ACHE_ROOT}/config/config_docker includes a configuration for Elasticsearch using Docker; you can use it to try out the search feature.
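
To illustrate, the snippet below is a rough sketch of what enabling the Elasticsearch back end in ache.yml can look like. The key names are assumptions based on the ACHE documentation and may differ between versions; the config_docker sample mentioned above is the authoritative reference.

# Sketch only (key names may differ by ACHE version; see config/config_docker):
target_storage.data_formats:
  - ELASTICSEARCH
target_storage.data_format.elasticsearch.rest.hosts:
  - http://localhost:9200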

Finally, ACHE also stores some TSV files in its output folder. One of the files, relevantpages.csv, includes only the pages classified as relevant by the page classifier provided.
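
If you want that output without the URLs from your seed file, a small post-processing script can filter them out. The sketch below assumes the file is tab-separated with the page URL in the first column; this layout is an assumption rather than a documented ACHE guarantee, so adjust the delimiter and column index to match your output.

#!/usr/bin/env python3
# Sketch: filter ACHE's relevantpages.csv to drop URLs that were already in the seed file.
# Assumes a tab-separated file with the page URL in the first column; adjust if your version differs.
import csv
import sys

def filter_seeds(relevant_path, seeds_path, out_path):
    # Load the seed URLs (one per line; blank lines and comment lines are ignored).
    with open(seeds_path) as f:
        seeds = {line.strip() for line in f if line.strip() and not line.startswith("#")}

    with open(relevant_path, newline="") as infile, open(out_path, "w", newline="") as outfile:
        reader = csv.reader(infile, delimiter="\t")
        writer = csv.writer(outfile, delimiter="\t")
        for row in reader:
            # Keep only rows whose URL (first column) is not one of the seeds.
            if row and row[0] not in seeds:
                writer.writerow(row)

if __name__ == "__main__":
    # Usage: python filter_seeds.py relevantpages.csv seeds.txt relevant_without_seeds.csv
    filter_seeds(sys.argv[1], sys.argv[2], sys.argv[3])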

aecio added the question label Mar 25, 2019
aecio closed this as completed Jan 11, 2022