Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] #792
Closed
Conversation
- bug fix: do not use the time of the last fetch as the last seen time; it is zero during updatedb for items which have not been fetched but have only been found as links
- bug fix: save robots.txt also if not storing content
- bug fix: write the digest into the CDX file if it exists
if they point to an already known target (known in CrawlDb or known as the target of a second redirect)
- new tool DedupRedirectsJob extending DeduplicationJob
- add deduplication unit tests
Fix CDX output of revisit records (HTTP status 304 Not Modified):
- set "mime" to "warc/revisit" (as done by PyWB)
- no "mime-detected"
- add payload "digest" (required by the columnar Parquet index)
- based on CLD2 bindings
- adds charset and language to metadata records and CDX
- fix language codes passed into the CDX file
- make detection more configurable
- disable the best-effort strategy by default
…y-output' into cc-1.16-1
- supports only URLs pointing to sitemaps in plain-text files
- can check for cross-submits
- configurable limits of URLs per sitemap
- random sampling if the limit is reached
- distributes score over URLs
- robust regarding fetch and parser failures and timeouts
- checks robots.txt, skips disallowed URLs
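The "random sampling if the limit is reached" step can be implemented with classic reservoir sampling, which keeps a uniform random sample from a stream without knowing its length up front. A minimal sketch, not the actual Nutch code (`UrlSampler` is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UrlSampler {
    // Classic reservoir sampling: keeps a uniform random sample of at most
    // `limit` items from a stream of unknown length.
    public static List<String> sample(Iterable<String> urls, int limit, Random random) {
        List<String> reservoir = new ArrayList<>(limit);
        int seen = 0;
        for (String url : urls) {
            if (reservoir.size() < limit) {
                // Fill the reservoir with the first `limit` items.
                reservoir.add(url);
            } else {
                // Replace a random slot with probability limit/(seen+1).
                int j = random.nextInt(seen + 1);
                if (j < limit) {
                    reservoir.set(j, url);
                }
            }
            seen++;
        }
        return reservoir;
    }
}
```

Each URL in the stream ends up in the sample with equal probability, which matches the goal of a fair cap on URLs per sitemap.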
duplicates found in first step
- based on secondary sorting by host/domain and decreasing score (no per-host or per-domain counts are held in memory)
- only selects top-scoring URLs per host or domain (no support for global topN top-scoring URLs)
- partitions all generated segments in a single job
- older Ant versions seem not to treat "<include ...>" as an exclusive include
- add an empty <exclude> which excludes all transitive dependencies
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
- protocol-okhttp: initialize the SSLContext used to ignore SSL/TLS certificate verification outside of a static code block
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
- code improvements to the Nutch plugin system:
  - use `Class<?>` and remove warning suppressions
  - javadocs: fix typos
  - remove superfluous whitespace
  - autoformat using the code style template
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
NUTCH-2949 Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers
- cache URLStreamHandlers for each protocol to avoid creating handlers anew
- use the cache to route standard protocols (http, https, file, jar) to handlers implemented by the JVM: this fixes NUTCH-2936
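The caching described for NUTCH-2949 can be sketched as a `URLStreamHandlerFactory` wrapper that memoizes one handler per protocol and returns `null` for the standard protocols, so the JVM falls back to its built-in handlers. This is an illustrative sketch, not the actual Nutch code (the class name and the delegate wiring are assumptions):

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CachingHandlerFactory implements URLStreamHandlerFactory {
    // Protocols routed to the JVM's built-in handlers by returning null.
    private static final Set<String> STANDARD = Set.of("http", "https", "file", "jar");

    private final Map<String, URLStreamHandler> cache = new ConcurrentHashMap<>();
    private final URLStreamHandlerFactory delegate; // e.g. a plugin-backed factory

    public CachingHandlerFactory(URLStreamHandlerFactory delegate) {
        this.delegate = delegate;
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if (STANDARD.contains(protocol)) {
            return null; // JVM falls back to its default handler
        }
        // Create the handler at most once per protocol; concurrent callers
        // share the cached instance instead of each paying the creation cost.
        return cache.computeIfAbsent(protocol, delegate::createURLStreamHandler);
    }
}
```

Returning `null` is the documented way for a `URLStreamHandlerFactory` to decline a protocol, which makes the JVM use its own handler; that matches the second bullet above.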
- add include/exclude rules as a list of IP addresses, CIDR notations, or predefined IP ranges (localhost, loopback, sitelocal)
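A CIDR rule boils down to masking the address and comparing network prefixes. A minimal IPv4-only sketch (hypothetical `CidrRule` class, for illustration; the actual filter also supports the predefined ranges listed above, and IPv6 is omitted here):

```java
public class CidrRule {
    private final int network;
    private final int mask;

    // Accepts "a.b.c.d/prefix" or a bare address (treated as /32).
    public CidrRule(String cidr) {
        String[] parts = cidr.split("/");
        int prefix = parts.length > 1 ? Integer.parseInt(parts[1]) : 32;
        // -1 << (32 - prefix) sets the top `prefix` bits of the mask.
        this.mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        this.network = toInt(parts[0]) & mask;
    }

    // An address matches if its masked form equals the rule's network.
    public boolean matches(String ip) {
        return (toInt(ip) & mask) == network;
    }

    private static int toInt(String ip) {
        int v = 0;
        for (String octet : ip.split("\\.")) {
            v = (v << 8) | Integer.parseInt(octet);
        }
        return v;
    }
}
```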
- upgrade to Nutch 1.19 / 1.20-SNAPSHOT
- add a configurable random component to the generator sort value for pages to be refetched, based on the time elapsed since the last fetch
- update Javadoc
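One plausible shape for such a randomized sort value is a random boost that grows with the time since the last fetch. The formula, class name, and parameter names below are assumptions for illustration, not taken from the commit:

```java
import java.util.Random;

public class RefetchJitter {
    // Adds a random boost that grows with the time since the last fetch,
    // so long-unfetched pages are more likely to be selected by the generator.
    // maxRandomBoost is the (hypothetical) configurable upper bound.
    public static float sortValue(float baseScore, long daysSinceFetch,
                                  float maxRandomBoost, Random random) {
        // Scale the boost linearly up to one year since the last fetch.
        float elapsedFactor = Math.min(1.0f, daysSinceFetch / 365.0f);
        return baseScore + random.nextFloat() * maxRandomBoost * elapsedFactor;
    }
}
```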
- no URI, content, protocol status - duplicate
(avoid that captures of some less useful MIME types, e.g. software packages and archives, occupy too much storage in WARC files)
- `warc.skip.mimetype.pattern` defines a regex pattern to match MIME types to be skipped
- `warc.skip.mimetype.factor` defines a factor by which matched captures are randomly skipped (0.0: never), depending on their size relative to http.content.limit
- `warc.skip.mimetype.truncated.factor` adds a factor to make skipping captures more likely if the content is truncated
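One plausible reading of how the pattern and factor combine is a skip probability proportional to the factor and to the capture's size relative to the content limit. The formula and class below are assumptions for illustration, not the actual implementation:

```java
import java.util.Random;
import java.util.regex.Pattern;

public class MimeSkipPolicy {
    private final Pattern skipPattern;  // warc.skip.mimetype.pattern
    private final double factor;        // warc.skip.mimetype.factor, 0.0 = never skip
    private final int contentLimit;     // corresponds to http.content.limit
    private final Random random = new Random();

    public MimeSkipPolicy(String regex, double factor, int contentLimit) {
        this.skipPattern = Pattern.compile(regex);
        this.factor = factor;
        this.contentLimit = contentLimit;
    }

    // Skip with a probability proportional to the configured factor and the
    // capture's size relative to the content limit (assumed combination).
    public boolean skip(String mimeType, int contentLength) {
        if (factor <= 0.0 || !skipPattern.matcher(mimeType).matches()) {
            return false;
        }
        double sizeRatio = Math.min(1.0, (double) contentLength / contentLimit);
        return random.nextDouble() < factor * sizeRatio;
    }
}
```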
- archived captures - captures skipped because of a duplicated URL
- the WARC format requires a valid URI as WARC-Target-URI
- if the URL of a successfully fetched page is not a valid URI, normalize it and check whether the normalized form is a valid URI
- use `urlnormalizer.scope.indexer` to allow for independently configurable normalizers
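The validity check itself can rely on `java.net.URI`, which throws `URISyntaxException` for strings that violate RFC 2396 syntax. A sketch with a hypothetical helper name:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class WarcTargetUri {
    // Returns true if the URL parses as a URI; a false result signals that a
    // normalizer should be applied and the normalized form re-checked before
    // the record can be written with a valid WARC-Target-URI.
    public static boolean isValidUri(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```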
…ld is reached - if QueueFeeder is still alive, also block queues which are empty right now
- pass a collection of lower-cased user-agent names
- update unit tests for the merging of groups of rules (if multiple user-agent names are defined)
- URL filters exclude the robots.txt URL and the property fetcher.robotstxt.archiving.filter.url is true
- dependent on the path and query of the URL: RFC 9309 says that "the /robots.txt URI is implicitly allowed."
Upgrade crawler-commons 1.4-SNAPSHOT -> 1.5-SNAPSHOT
#26 Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Obviously, pulled more changes than I meant to
See description in https://issues.apache.org/jira/browse/NUTCH-3017
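The headline change, loading the fast-urlfilter rules file from HDFS or S3 and accepting gzipped input, involves two pieces: opening the path through Hadoop's `FileSystem` abstraction (which resolves `hdfs://` and `s3a://` URIs alike via `FileSystem.get(conf)`), and decompressing gzip transparently. A minimal, dependency-free sketch of gzip detection via the magic bytes `0x1f 0x8b` (the `GzipAwareReader` class is hypothetical, not code from the patch):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

public class GzipAwareReader {
    // Wraps the stream in a GZIPInputStream if it starts with the gzip
    // magic bytes 0x1f 0x8b; otherwise returns the stream unchanged.
    public static InputStream maybeDecompress(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] magic = new byte[2];
        int n = pb.read(magic);
        if (n > 0) {
            pb.unread(magic, 0, n); // put the peeked bytes back
        }
        if (n == 2 && (magic[0] & 0xff) == 0x1f && (magic[1] & 0xff) == 0x8b) {
            return new GZIPInputStream(pb);
        }
        return pb;
    }
}
```

Sniffing the magic bytes instead of trusting a `.gz` file extension means the same code path handles both plain and compressed rule files, which is the behavior the PR title describes.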