On the main page of the ACHE repo, the description reads "Web interface for searching crawled pages in real-time", but when I started running ACHE there was no interface (maybe I misunderstood something). I also got stuck a little when creating a configuration file. Is this documentation up to date? (https://ache.readthedocs.io/en/latest/index.html) How can I set ACHE to return a CSV file of relevant pages that excludes the URLs set in the seed file? It is quite time-consuming to find new URLs among 15000+ harvested pages. My aim is to find databases of scientific publications/books; do you have any suggestions on how to increase accuracy? Thank you for your quick support and help! You are doing a great job.
Here is my configuration file:
## Example of configuration for running a Focused Crawl

# Store pages classified as irrelevant by the target page classifier
target_storage.store_negative_pages: true

# Limit the max number of pages crawled per domain, in order to avoid crawling
# too many pages from the same domain and to favor discovery of new domains
link_storage.max_pages_per_domain: 10000

# Disable "seed scope" to allow crawling pages from any domain
link_storage.link_strategy.use_scope: false

# Set the initial link classifier to a simple one
link_storage.link_classifier.type: MaxDepthLinkClassifier
link_storage.link_classifier.max_depth: 3

# Train a new link classifier while the crawler is running. This allows
# the crawler to automatically learn how to prioritize links in order to
# efficiently locate relevant content while avoiding the retrieval of
# irrelevant content.
link_storage.online_learning.enabled: true
link_storage.online_learning.type: FORWARD_CLASSIFIER_BINARY
link_storage.online_learning.learning_limit: 1000

# Always select the top-k links with the highest priority to be scheduled
link_storage.link_selector: TopkLinkSelector

# Configure the minimum time interval (in milliseconds) to wait between requests
# to the same host, to avoid overloading servers. If you are crawling your own
# web site, you can decrease this value to speed up the crawl.
link_storage.scheduler.host_min_access_interval: 5000

# Configure the User-Agent of the crawler
crawler_manager.downloader.user_agent.name: ACHE
crawler_manager.downloader.user_agent.url: https://github.com/ViDA-NYU/ache
Whenever a crawl is started, a web server is started by default on port 8080, and ACHE prints its address in the logs. When you open it in a browser, you can see some crawler statistics as well as search the crawled content. The search only works if you have configured Elasticsearch. The sample configuration in {ACHE_ROOT}/config/config_docker includes a configuration for Elasticsearch using Docker; you can use it to try out the search feature.
Finally, ACHE also stores some TSV files in its output folder. One of them, relevantpages.csv, includes only the pages classified as relevant by the page classifier you provided.
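Since the relevant-pages file also contains the seed URLs themselves, one way to get only newly discovered pages is to post-process it. Below is a minimal sketch, assuming the file is tab-separated with the page URL in the first column (check your ACHE version's actual output format); the file paths in the usage comment are just examples.

```python
import csv

def filter_out_seeds(relevant_path, seeds_path, output_path):
    """Copy rows from ACHE's relevant-pages file to output_path, dropping
    any row whose URL (assumed to be the first tab-separated column) also
    appears in the seed file."""
    # Collect seed URLs, skipping blank lines and comments
    with open(seeds_path) as f:
        seeds = {line.strip() for line in f
                 if line.strip() and not line.startswith("#")}
    with open(relevant_path, newline="") as fin, \
         open(output_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            if row and row[0] not in seeds:
                writer.writerow(row)

# Usage (paths are examples, adjust to your output folder):
# filter_out_seeds("output/data_pages/relevantpages.csv",
#                  "seeds.txt", "relevant_no_seeds.tsv")
```

This keeps the post-processing outside the crawler, so no ACHE configuration change is needed.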