
Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] #792

Closed
wants to merge 302 commits
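For context on what NUTCH-3017 is about: the idea is to resolve the fast-urlfilter rule file through Hadoop's FileSystem abstraction, so that local, HDFS and S3 locations are handled uniformly, and to decompress gzipped input transparently. A minimal sketch under those assumptions (class and method names are illustrative, not the actual patch):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RuleFileLoader {

  /**
   * Opens a rule file through the Hadoop FileSystem API so that local,
   * hdfs:// and s3a:// locations are all handled the same way, and
   * transparently decompresses *.gz input.
   */
  public static BufferedReader openRuleFile(String location, Configuration conf)
      throws IOException {
    Path path = new Path(location);
    FileSystem fs = path.getFileSystem(conf);
    InputStream in = fs.open(path);
    if (location.toLowerCase(Locale.ROOT).endsWith(".gz")) {
      in = new GZIPInputStream(in);
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }
}
```

Any path the Hadoop client can resolve, for example an `s3a://` or `hdfs://` URL ending in `.gz`, would then be read the same way as a plain local text file.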

Conversation

jnioche
Contributor

@jnioche jnioche commented Oct 30, 2023

sebastian-nagel and others added 30 commits August 10, 2018 18:27
- bug fix: do not use the time of the last fetch as last seen time,
  it's zero during updatedb for items which haven't been fetched
  but have been found only as links
- bug fix: save robots.txt also if not storing content
- bug fix: write digest into cdx file if it does exist
if they point to an already known target (known in CrawlDb
or known as target of a second redirect)
- new tool DedupRedirectsJob extending DeduplicationJob
- add deduplication unit tests
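The deduplication rule in the commit above boils down to: a redirect is redundant if its target is already known, either from the CrawlDb or because another redirect points at the same target. Stripped of the MapReduce plumbing, the test looks roughly like this (hypothetical data structures, not the DedupRedirectsJob code):

```java
import java.util.Map;
import java.util.Set;

public class RedirectDedupSketch {

  /**
   * A redirect is considered redundant if its target is already known:
   * either present in the CrawlDb, or pointed at by a second redirect.
   * The in-memory maps are hypothetical; the real job works on CrawlDb
   * records.
   */
  static boolean isDuplicateRedirect(String source, Map<String, String> redirects,
      Set<String> crawlDbUrls) {
    String target = redirects.get(source);
    if (target == null) {
      return false; // not a redirect at all
    }
    long redirectsToTarget = redirects.values().stream()
        .filter(target::equals).count();
    return crawlDbUrls.contains(target) || redirectsToTarget > 1;
  }
}
```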
Fix cdx output of revisit records (HTTP status 304 notmodified):
- set "mime" to "warc/revisit" (as done by PyWB)
- no "mime-detected"
- add payload "digest" (required by columnar Parquet index)
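For readers not familiar with CDX output: a revisit record (HTTP 304) replaces the MIME type with the synthetic value `warc/revisit`, omits `mime-detected`, and still needs the payload digest so a columnar (Parquet) index can tie it back to the original capture. A simplified illustration of how such a CDXJ line could be assembled (not the actual writer code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CdxRevisitSketch {

  /** Builds the JSON part of a CDXJ line for an HTTP 304 revisit record. */
  static String revisitCdxJson(String url, String payloadDigest, String warcFile,
      long offset, long length) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("url", url);
    fields.put("mime", "warc/revisit"); // as done by PyWB; no "mime-detected" field
    fields.put("status", "304");
    fields.put("digest", payloadDigest); // payload digest, needed by the Parquet index
    fields.put("length", Long.toString(length));
    fields.put("offset", Long.toString(offset));
    fields.put("filename", warcFile);
    // Naive JSON assembly; a real writer would escape the values properly.
    StringBuilder sb = new StringBuilder("{");
    fields.forEach((k, v) -> sb.append(sb.length() > 1 ? ", " : "")
        .append('"').append(k).append("\": \"").append(v).append('"'));
    return sb.append('}').toString();
  }
}
```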
- based on CLD2 bindings
- adds charset and language to metadata records and CDX
- fix of language codes passed into cdx file
- make detection more configurable
- disable best-effort strategy by default
- supports only URLs pointing to sitemaps in plain-text files
- can check for cross-submits
- configurable limits of URLs per sitemap
- random sampling if limit is reached
- distributes score over URLs
- robust regarding fetch and parser failures and timeouts
- check robots.txt, skip disallowed URLs
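The "random sampling if limit is reached" item can be realized with classic reservoir sampling, which keeps a uniformly random subset of fixed size without knowing the total number of URLs in advance. A minimal sketch of that technique (not necessarily the exact sampling used here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SitemapSampler {

  /** Keeps at most `limit` URLs, each with equal probability (reservoir sampling). */
  static List<String> sample(Iterable<String> sitemapUrls, int limit, Random rnd) {
    List<String> reservoir = new ArrayList<>(limit);
    long seen = 0;
    for (String url : sitemapUrls) {
      seen++;
      if (reservoir.size() < limit) {
        reservoir.add(url);
      } else {
        long j = (long) (rnd.nextDouble() * seen); // uniform index in [0, seen)
        if (j < limit) {
          reservoir.set((int) j, url); // replace with decreasing probability
        }
      }
    }
    return reservoir;
  }
}
```

The sample stays uniform regardless of how large the sitemap is, which fits the "configurable limits of URLs per sitemap" item above.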
- based on secondary sorting by host/domain and decreasing score
  (no per-host or per-domain counts are held in memory)
- only selects top-scoring URLs per host or domain
  (no support for global topN top-scoring URLs)
- partitions all generated segments in a single job
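The secondary sort is what keeps memory usage flat: records arrive grouped by host (or domain) and ordered by decreasing score, so the generator only needs a counter for the current group to pick the top N per host. Outside of MapReduce the same selection looks roughly like this:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopPerHostSketch {

  static class Candidate {
    final String host;
    final String url;
    final float score;

    Candidate(String host, String url, float score) {
      this.host = host;
      this.url = url;
      this.score = score;
    }
  }

  /**
   * Emulates the secondary sort: candidates are ordered by host and, within
   * a host, by decreasing score, so selecting the top N per host needs only
   * a counter for the current group, not per-host counts in memory.
   */
  static List<Candidate> selectTopPerHost(List<Candidate> candidates, int topN) {
    candidates.sort(Comparator.comparing((Candidate c) -> c.host)
        .thenComparing(Comparator.comparingDouble((Candidate c) -> -c.score)));
    List<Candidate> selected = new ArrayList<>();
    String currentHost = null;
    int emitted = 0;
    for (Candidate c : candidates) {
      if (!c.host.equals(currentHost)) {
        currentHost = c.host;
        emitted = 0;
      }
      if (emitted++ < topN) {
        selected.add(c); // only the top-scoring URLs of this host are kept
      }
    }
    return selected;
  }
}
```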
- older Ant versions seem not to treat "<include ...>"
  as an exclusive include
- add empty <exclude> which excludes all transitive
  dependencies
sebastian-nagel and others added 27 commits June 14, 2022 11:00
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

- protocol-okhttp: initialize SSLContext used to ignore SSL/TLS certificate verification
  not in a static code block
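The fix moves the creation of the "ignore certificate verification" SSLContext out of a static initializer, so that a failure surfaces as a runtime configuration error instead of breaking class loading of the plugin in distributed mode. Schematically (simplified, not the actual protocol-okhttp code):

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllSslContext {

  // Created lazily per instance instead of in a static block, so that an
  // exception here does not turn into a class-loading failure when the
  // plugin class is first touched.
  private SSLContext sslContext;

  public synchronized SSLContext getTrustAllContext() throws GeneralSecurityException {
    if (sslContext == null) {
      TrustManager[] trustAll = new TrustManager[] { new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {}

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {}

        @Override
        public X509Certificate[] getAcceptedIssuers() {
          return new X509Certificate[0];
        }
      } };
      SSLContext ctx = SSLContext.getInstance("TLS");
      ctx.init(null, trustAll, new SecureRandom());
      sslContext = ctx;
    }
    return sslContext;
  }
}
```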
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

- code improvements Nutch plugin system:
  - use `Class<?>` and remove suppressions of warnings
  - javadocs: fix typos
  - remove superfluous white space
  - autoformat using code style template
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
NUTCH-2949 Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

- cache URLStreamHandlers for each protocol so that handlers are not
  created anew each time

- utilize the cache to route standard protocols (http, https, file, jar)
  to handlers implemented by the JVM: this fixes NUTCH-2936
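NUTCH-2949 in a nutshell: creating a URLStreamHandler can be expensive when plugin lookup is involved, so handlers are cached per protocol and the standard protocols are handed back to the JVM. A conceptual sketch of such a factory (illustrative, not the exact Nutch class; returning null is one way to let java.net.URL fall back to its built-in handlers):

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CachingUrlStreamHandlerFactory implements URLStreamHandlerFactory {

  // Standard protocols: returning null makes java.net.URL use the JVM's
  // built-in handlers for them.
  private static final Set<String> JVM_PROTOCOLS = Set.of("http", "https", "file", "jar");

  // One handler per protocol, looked up at most once and then reused.
  private final Map<String, URLStreamHandler> cache = new ConcurrentHashMap<>();

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol)) {
      return null;
    }
    return cache.computeIfAbsent(protocol, this::lookupPluginHandler);
  }

  /** Placeholder for the (slow) plugin-based handler lookup. */
  private URLStreamHandler lookupPluginHandler(String protocol) {
    return null; // in Nutch this would consult the plugin repository
  }
}
```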
- add include/exclude rules as list of IP address, CIDR notation
  or predefined IP ranges (localhost, loopback, sitelocal)
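Matching an address against a CIDR rule is a mask-and-compare on the network prefix. A small IPv4-only sketch of that test (the actual filter additionally understands predefined ranges such as localhost, loopback and site-local, for which java.net.InetAddress offers isLoopbackAddress() and isSiteLocalAddress()):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class CidrMatchSketch {

  /** Returns true if the IPv4 address falls inside the CIDR range, e.g. "192.168.0.0/16". */
  static boolean matchesCidr(String address, String cidr) throws UnknownHostException {
    String[] parts = cidr.split("/");
    int prefixLength = Integer.parseInt(parts[1]);
    int ip = toInt(InetAddress.getByName(address).getAddress());
    int network = toInt(InetAddress.getByName(parts[0]).getAddress());
    // Build the network mask; a prefix of 0 matches everything.
    int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
    return (ip & mask) == (network & mask);
  }

  private static int toInt(byte[] bytes) {
    int value = 0;
    for (byte b : bytes) {
      value = (value << 8) | (b & 0xFF);
    }
    return value;
  }
}
```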
- upgrade to Nutch 1.19 / 1.20-SNAPSHOT
- add configurable random to generator sort value for pages to be
  refetched based on the time elapsed since the last fetch
- update Javadoc
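The intent of the random component is to spread re-fetches out: the longer a page has waited since its last fetch, the larger the randomized boost it may get in the generator sort value. One possible reading of that, with illustrative names and scaling (not the actual property semantics):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class GeneratorSortSketch {

  /**
   * Adds a random boost to the generator sort value, scaled by the time
   * elapsed since the last fetch. `maxRandomFactor` would come from a
   * configuration property; the formula is illustrative only.
   */
  static float sortValue(float score, long lastFetchTimeMs, long nowMs,
      float maxRandomFactor, Random rnd) {
    long elapsedDays = TimeUnit.MILLISECONDS.toDays(Math.max(0, nowMs - lastFetchTimeMs));
    float randomBoost = rnd.nextFloat() * maxRandomFactor * elapsedDays;
    return score + randomBoost;
  }
}
```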
- no URI, content, protocol status
- duplicate
(avoid captures of some less useful MIME types, e.g.
 software packages and archives, occupying too much storage
 in WARC files)
- `warc.skip.mimetype.pattern` defines a regex pattern to
  match MIME types to be skipped
- `warc.skip.mimetype.factor` defines a factor by which
  matched captures are skipped randomly (0.0 = never)
  and depending on their size (relative to http.content.limit)
- `warc.skip.mimetype.truncated.factor` adds a factor to make
  skipping captures more likely if content is truncated
- archived captures
- captures skipped because of a duplicated URL
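Read together, the three properties describe a probabilistic skip: captures whose MIME type matches `warc.skip.mimetype.pattern` are dropped with a probability that grows with their size relative to `http.content.limit`, scaled by `warc.skip.mimetype.factor` (0.0 = never), and made more likely by `warc.skip.mimetype.truncated.factor` when the content is truncated. A sketch of one plausible interpretation (not the exact implementation):

```java
import java.util.Random;
import java.util.regex.Pattern;

public class WarcSkipSketch {

  /**
   * Decides whether a capture should be skipped. The probability grows with
   * the content size relative to http.content.limit and, optionally, when
   * the content is truncated. Illustrative only; the real logic lives in the
   * WARC exporter.
   */
  static boolean skipCapture(String mimeType, long contentLength, long contentLimit,
      boolean truncated, Pattern skipPattern, float factor, float truncatedFactor,
      Random rnd) {
    if (skipPattern == null || !skipPattern.matcher(mimeType).find()) {
      return false; // MIME type not matched: always keep the capture
    }
    double probability = factor * Math.min(1.0, (double) contentLength / contentLimit);
    if (truncated) {
      probability += truncatedFactor;
    }
    return rnd.nextDouble() < probability;
  }
}
```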
- the WARC format requires a valid URI as WARC-Target-URI
- if the URL of a successfully fetched page is not a valid URI,
  normalize it and try whether the normalized form is a valid URI
- use `urlnormalizer.scope.indexer` to allow for independently
  configurable normalizers
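The check itself is simple: parse the fetched URL with java.net.URI; if that fails, run the URL through the normalizers (the `urlnormalizer.scope.indexer` scope mentioned above) and test the normalized form. A schematic version, with the normalizer reduced to a placeholder function:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.function.UnaryOperator;

public class TargetUriSketch {

  /**
   * Returns a URL that is a valid URI (required for WARC-Target-URI),
   * normalizing it first if the raw form does not parse. `normalizer`
   * stands in for Nutch's URL normalizers run in a dedicated scope; it is
   * a placeholder, not the actual API.
   */
  static Optional<String> validTargetUri(String url, UnaryOperator<String> normalizer) {
    if (isValidUri(url)) {
      return Optional.of(url);
    }
    String normalized = normalizer.apply(url);
    if (normalized != null && isValidUri(normalized)) {
      return Optional.of(normalized);
    }
    return Optional.empty(); // caller decides what to do with unusable URLs
  }

  private static boolean isValidUri(String url) {
    try {
      new URI(url);
      return true;
    } catch (URISyntaxException e) {
      return false;
    }
  }
}
```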
…ld is reached

- if QueueFeeder is still alive, also block queues which are empty right now
- pass a collection of lower-cased user-agent names
- update unit tests for the merging of groups of rules
  (if multiple user-agent names are defined)
- URL filters exclude the robots.txt URL and the property
  fetcher.robotstxt.archiving.filter.url is true
- dependent on the path and query of the URL: RFC 9309 says
  that "the /robots.txt URI is implicitly allowed."
Upgrade crawler-commons 1.4-SNAPSHOT -> 1.5-SNAPSHOT #26

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
@jnioche jnioche closed this Oct 30, 2023
@jnioche
Contributor Author

jnioche commented Oct 30, 2023

Obviously, pulled more changes than I meant to

@jnioche jnioche deleted the 26 branch November 8, 2023 14:06