
Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] #792

Closed
wants to merge 302 commits
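For context on what NUTCH-3017 is about: the idea is to resolve the fast-urlfilter rule file through Hadoop's FileSystem abstraction, so that local, HDFS and S3 locations are handled uniformly, and to decompress gzipped input transparently. A minimal sketch under those assumptions (class and method names are illustrative, not the actual patch):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RuleFileLoader {

  /**
   * Opens a rule file through the Hadoop FileSystem API so that local,
   * hdfs:// and s3a:// locations are all handled the same way, and
   * transparently decompresses *.gz input.
   */
  public static BufferedReader openRuleFile(String location, Configuration conf)
      throws IOException {
    Path path = new Path(location);
    FileSystem fs = path.getFileSystem(conf);
    InputStream in = fs.open(path);
    if (location.toLowerCase(Locale.ROOT).endsWith(".gz")) {
      in = new GZIPInputStream(in);
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }
}
```

Any path the Hadoop client can resolve, for example an `s3a://` or `hdfs://` URL ending in `.gz`, would then be read the same way as a plain local text file.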

Conversation

jnioche
Contributor

@jnioche jnioche commented Oct 30, 2023

sebastian-nagel and others added 30 commits August 10, 2018 18:27
- bug fix: do not use the time of the last fetch as last seen time,
  it's zero during updatedb for items which haven't been fetched
  but have been found only as links
- bug fix: save robots.txt also if not storing content
- bug fix: write digest into cdx file if it does exist
if they point to an already known target (known in CrawlDb
or known as target of a second redirect)
- new tool DedupRedirectsJob extending DeduplicationJob
- add deduplication unit tests
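The deduplication rule in the commit above boils down to: a redirect is redundant if its target is already known, either from the CrawlDb or because another redirect points at the same target. Stripped of the MapReduce plumbing, the test looks roughly like this (hypothetical data structures, not the DedupRedirectsJob code):

```java
import java.util.Map;
import java.util.Set;

public class RedirectDedupSketch {

  /**
   * A redirect is considered redundant if its target is already known:
   * either present in the CrawlDb, or pointed at by a second redirect.
   * The in-memory maps are hypothetical; the real job works on CrawlDb
   * records.
   */
  static boolean isDuplicateRedirect(String source, Map<String, String> redirects,
      Set<String> crawlDbUrls) {
    String target = redirects.get(source);
    if (target == null) {
      return false; // not a redirect at all
    }
    long redirectsToTarget = redirects.values().stream()
        .filter(target::equals).count();
    return crawlDbUrls.contains(target) || redirectsToTarget > 1;
  }
}
```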
Fix cdx output of revisit records (HTTP status 304 notmodified):
- set "mime" to "warc/revisit" (as done by PyWB)
- no "mime-detected"
- add payload "digest" (required by columnar Parquet index)
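For readers not familiar with CDX output: a revisit record (HTTP 304) replaces the MIME type with the synthetic value `warc/revisit`, omits `mime-detected`, and still needs the payload digest so a columnar (Parquet) index can tie it back to the original capture. A simplified illustration of how such a CDXJ line could be assembled (not the actual writer code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CdxRevisitSketch {

  /** Builds the JSON part of a CDXJ line for an HTTP 304 revisit record. */
  static String revisitCdxJson(String url, String payloadDigest, String warcFile,
      long offset, long length) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("url", url);
    fields.put("mime", "warc/revisit"); // as done by PyWB; no "mime-detected" field
    fields.put("status", "304");
    fields.put("digest", payloadDigest); // payload digest, needed by the Parquet index
    fields.put("length", Long.toString(length));
    fields.put("offset", Long.toString(offset));
    fields.put("filename", warcFile);
    // Naive JSON assembly; a real writer would escape the values properly.
    StringBuilder sb = new StringBuilder("{");
    fields.forEach((k, v) -> sb.append(sb.length() > 1 ? ", " : "")
        .append('"').append(k).append("\": \"").append(v).append('"'));
    return sb.append('}').toString();
  }
}
```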
- based on CLD2 bindings
- adds charset and language to metadata records and CDX
- fix of language codes passed into cdx file
- make detection more configurable
- disable best-effort strategy by default
- supports only URLs pointing to sitemaps in plain-text files
- can check for cross-submits
- configurable limits of URLs per sitemap
- random sampling if limit is reached
- distributes score over URLs
- robust regarding fetch and parser failures and timeouts
- check robots.txt, skip disallowed URLs
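The "random sampling if limit is reached" item can be realized with classic reservoir sampling, which keeps a uniformly random subset of fixed size without knowing the total number of URLs in advance. A minimal sketch of that technique (not necessarily the exact sampling used here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SitemapSampler {

  /** Keeps at most `limit` URLs, each with equal probability (reservoir sampling). */
  static List<String> sample(Iterable<String> sitemapUrls, int limit, Random rnd) {
    List<String> reservoir = new ArrayList<>(limit);
    long seen = 0;
    for (String url : sitemapUrls) {
      seen++;
      if (reservoir.size() < limit) {
        reservoir.add(url);
      } else {
        long j = (long) (rnd.nextDouble() * seen); // uniform index in [0, seen)
        if (j < limit) {
          reservoir.set((int) j, url); // replace with decreasing probability
        }
      }
    }
    return reservoir;
  }
}
```

The sample stays uniform regardless of how large the sitemap is, which fits the "configurable limits of URLs per sitemap" item above.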
- based on secondary sorting by host/domain and decreasing score
  (no per-host or per-domain counts are held in memory)
- only selects top-scoring URLs per host or domain
  (no support for global topN top-scoring URLs)
- partitions all generated segments in a single job
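The secondary sort is what keeps memory usage flat: records arrive grouped by host (or domain) and ordered by decreasing score, so the generator only needs a counter for the current group to pick the top N per host. Outside of MapReduce the same selection looks roughly like this:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopPerHostSketch {

  static class Candidate {
    final String host;
    final String url;
    final float score;

    Candidate(String host, String url, float score) {
      this.host = host;
      this.url = url;
      this.score = score;
    }
  }

  /**
   * Emulates the secondary sort: candidates are ordered by host and, within
   * a host, by decreasing score, so selecting the top N per host needs only
   * a counter for the current group, not per-host counts in memory.
   */
  static List<Candidate> selectTopPerHost(List<Candidate> candidates, int topN) {
    candidates.sort(Comparator.comparing((Candidate c) -> c.host)
        .thenComparing(Comparator.comparingDouble((Candidate c) -> -c.score)));
    List<Candidate> selected = new ArrayList<>();
    String currentHost = null;
    int emitted = 0;
    for (Candidate c : candidates) {
      if (!c.host.equals(currentHost)) {
        currentHost = c.host;
        emitted = 0;
      }
      if (emitted++ < topN) {
        selected.add(c); // only the top-scoring URLs of this host are kept
      }
    }
    return selected;
  }
}
```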
- older Ant versions seem not to treat "<include ...>"
  as an exclusive include
- add empty <exclude> which excludes all transitive
  dependencies
sebastian-nagel and others added 27 commits June 14, 2022 11:00
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

- protocol-okhttp: initialize SSLContext used to ignore SSL/TLS certificate verification
  not in a static code block
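The fix moves the creation of the "ignore certificate verification" SSLContext out of a static initializer, so that a failure surfaces as a runtime configuration error instead of breaking class loading of the plugin in distributed mode. Schematically (simplified, not the actual protocol-okhttp code):

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllSslContext {

  // Created lazily per instance instead of in a static block, so that an
  // exception here does not turn into a class-loading failure when the
  // plugin class is first touched.
  private SSLContext sslContext;

  public synchronized SSLContext getTrustAllContext() throws GeneralSecurityException {
    if (sslContext == null) {
      TrustManager[] trustAll = new TrustManager[] { new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {}

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {}

        @Override
        public X509Certificate[] getAcceptedIssuers() {
          return new X509Certificate[0];
        }
      } };
      SSLContext ctx = SSLContext.getInstance("TLS");
      ctx.init(null, trustAll, new SecureRandom());
      sslContext = ctx;
    }
    return sslContext;
  }
}
```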
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

- code improvements Nutch plugin system:
  - use `Class<?>` and remove suppressions of warnings
  - javadocs: fix typos
  - remove superfluous white space
  - autoformat using code style template
…ins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
NUTCH-2949 Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

- cache URLStreamHandlers for each protocol so that handlers are not
  created anew each time

- utilize the cache to route standard protocols (http, https, file, jar)
  to handlers implemented by the JVM: this fixes NUTCH-2936
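NUTCH-2949 in a nutshell: creating a URLStreamHandler can be expensive when plugin lookup is involved, so handlers are cached per protocol and the standard protocols are handed back to the JVM. A conceptual sketch of such a factory (illustrative, not the exact Nutch class; returning null is one way to let java.net.URL fall back to its built-in handlers):

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CachingUrlStreamHandlerFactory implements URLStreamHandlerFactory {

  // Standard protocols: returning null makes java.net.URL use the JVM's
  // built-in handlers for them.
  private static final Set<String> JVM_PROTOCOLS = Set.of("http", "https", "file", "jar");

  // One handler per protocol, looked up at most once and then reused.
  private final Map<String, URLStreamHandler> cache = new ConcurrentHashMap<>();

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol)) {
      return null;
    }
    return cache.computeIfAbsent(protocol, this::lookupPluginHandler);
  }

  /** Placeholder for the (slow) plugin-based handler lookup. */
  private URLStreamHandler lookupPluginHandler(String protocol) {
    return null; // in Nutch this would consult the plugin repository
  }
}
```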
- add include/exclude rules as list of IP address, CIDR notation
  or predefined IP ranges (localhost, loopback, sitelocal)
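Matching an address against a CIDR rule is a mask-and-compare on the network prefix. A small IPv4-only sketch of that test (the actual filter additionally understands predefined ranges such as localhost, loopback and site-local, for which java.net.InetAddress offers isLoopbackAddress() and isSiteLocalAddress()):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class CidrMatchSketch {

  /** Returns true if the IPv4 address falls inside the CIDR range, e.g. "192.168.0.0/16". */
  static boolean matchesCidr(String address, String cidr) throws UnknownHostException {
    String[] parts = cidr.split("/");
    int prefixLength = Integer.parseInt(parts[1]);
    int ip = toInt(InetAddress.getByName(address).getAddress());
    int network = toInt(InetAddress.getByName(parts[0]).getAddress());
    // Build the network mask; a prefix of 0 matches everything.
    int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
    return (ip & mask) == (network & mask);
  }

  private static int toInt(byte[] bytes) {
    int value = 0;
    for (byte b : bytes) {
      value = (value << 8) | (b & 0xFF);
    }
    return value;
  }
}
```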
- upgrade to Nutch 1.19 / 1.20-SNAPSHOT
- add configurable random to generator sort value for pages to be
  refetched based on the time elapsed since the last fetch
- update Javadoc
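The intent of the random component is to spread re-fetches out: the longer a page has waited since its last fetch, the larger the randomized boost it may get in the generator sort value. One possible reading of that, with illustrative names and scaling (not the actual property semantics):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class GeneratorSortSketch {

  /**
   * Adds a random boost to the generator sort value, scaled by the time
   * elapsed since the last fetch. `maxRandomFactor` would come from a
   * configuration property; the formula is illustrative only.
   */
  static float sortValue(float score, long lastFetchTimeMs, long nowMs,
      float maxRandomFactor, Random rnd) {
    long elapsedDays = TimeUnit.MILLISECONDS.toDays(Math.max(0, nowMs - lastFetchTimeMs));
    float randomBoost = rnd.nextFloat() * maxRandomFactor * elapsedDays;
    return score + randomBoost;
  }
}
```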
- no URI, content, protocol status
- duplicate
(avoid captures of some less useful MIME types, e.g.
 software packages and archives, occupying too much storage
 in WARC files)
- `warc.skip.mimetype.pattern` defines a regex pattern to
  match MIME types to be skipped
- `warc.skip.mimetype.factor` defines a factor by which
  matched captures are skipped randomly (0.0 = never)
  and depending on their size (relative to http.content.limit)
- `warc.skip.mimetype.truncated.factor` adds a factor to make
  skipping captures more likely if content is truncated
- archived captures
- captures skipped because of a duplicated URL
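Read together, the three properties describe a probabilistic skip: captures whose MIME type matches `warc.skip.mimetype.pattern` are dropped with a probability that grows with their size relative to `http.content.limit`, scaled by `warc.skip.mimetype.factor` (0.0 = never), and made more likely by `warc.skip.mimetype.truncated.factor` when the content is truncated. A sketch of one plausible interpretation (not the exact implementation):

```java
import java.util.Random;
import java.util.regex.Pattern;

public class WarcSkipSketch {

  /**
   * Decides whether a capture should be skipped. The probability grows with
   * the content size relative to http.content.limit and, optionally, when
   * the content is truncated. Illustrative only; the real logic lives in the
   * WARC exporter.
   */
  static boolean skipCapture(String mimeType, long contentLength, long contentLimit,
      boolean truncated, Pattern skipPattern, float factor, float truncatedFactor,
      Random rnd) {
    if (skipPattern == null || !skipPattern.matcher(mimeType).find()) {
      return false; // MIME type not matched: always keep the capture
    }
    double probability = factor * Math.min(1.0, (double) contentLength / contentLimit);
    if (truncated) {
      probability += truncatedFactor;
    }
    return rnd.nextDouble() < probability;
  }
}
```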
- the WARC format requires a valid URI as WARC-Target-URI
- if the URL of a successfully fetched page is not a valid URI,
  normalize it and try whether the normalized form is a valid URI
- use `urlnormalizer.scope.indexer` to allow for independently
  configurable normalizers
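The check itself is simple: parse the fetched URL with java.net.URI; if that fails, run the URL through the normalizers (the `urlnormalizer.scope.indexer` scope mentioned above) and test the normalized form. A schematic version, with the normalizer reduced to a placeholder function:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.function.UnaryOperator;

public class TargetUriSketch {

  /**
   * Returns a URL that is a valid URI (required for WARC-Target-URI),
   * normalizing it first if the raw form does not parse. `normalizer`
   * stands in for Nutch's URL normalizers run in a dedicated scope; it is
   * a placeholder, not the actual API.
   */
  static Optional<String> validTargetUri(String url, UnaryOperator<String> normalizer) {
    if (isValidUri(url)) {
      return Optional.of(url);
    }
    String normalized = normalizer.apply(url);
    if (normalized != null && isValidUri(normalized)) {
      return Optional.of(normalized);
    }
    return Optional.empty(); // caller decides what to do with unusable URLs
  }

  private static boolean isValidUri(String url) {
    try {
      new URI(url);
      return true;
    } catch (URISyntaxException e) {
      return false;
    }
  }
}
```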
…ld is reached

- if QueueFeeder is still alive, also block queues which are empty right now
- pass a collection of lower-cased user-agent names
- update unit tests for the merging of groups of rules
  (if multiple user-agent names are defined)
- URL filters exclude the robots.txt URL and the property
  fetcher.robotstxt.archiving.filter.url is true
- dependent on the path and query of the URL: RFC 9309 says
  that "the /robots.txt URI is implicitly allowed."
Upgrade crawler-commons 1.4-SNAPSHOT -> 1.5-SNAPSHOT #26

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
@jnioche jnioche closed this Oct 30, 2023
@jnioche
Contributor Author

jnioche commented Oct 30, 2023

Obviously, pulled more changes than I meant to

@jnioche jnioche deleted the 26 branch November 8, 2023 14:06