Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] #792

Closed
wants to merge 302 commits into from
302 commits
b091f38
Merge branch 'small-package-exclude-plugins' into cc-1.16-1
sebastian-nagel Aug 10, 2018
4fce081
Plugin scoring-adaptive:
sebastian-nagel Aug 12, 2018
ba2c987
Merge branch 'cc-scoring-adaptive' into cc-1.16-1
sebastian-nagel Aug 12, 2018
6de2640
Common Crawl WARC writer/exporter
sebastian-nagel Aug 14, 2018
32039f5
Common Crawl WARC writer/exporter
sebastian-nagel Aug 14, 2018
d6443c9
Common Crawl WARC writer/exporter
sebastian-nagel Aug 14, 2018
95cf059
Deduplication of redirects: mark redirects as duplicates
sebastian-nagel Aug 16, 2018
9224201
Merge branch 'redirect-deduplication' into cc-1.16-1
sebastian-nagel Aug 16, 2018
36a4c02
Common Crawl WARC writer/exporter
sebastian-nagel Aug 14, 2018
4fc4f06
Common Crawl WARC writer/exporter
sebastian-nagel Aug 21, 2018
0bb1aad
Common Crawl WARC writer/exporter
sebastian-nagel Aug 21, 2018
9362e78
Integrate detection of text language into WARC exporter/writer
sebastian-nagel Aug 6, 2018
98bde03
Integrate detection of text language into WARC exporter/writer
sebastian-nagel Aug 8, 2018
45e1ac8
Merge branch 'cc-warc-writer' into cc-1.16-1
sebastian-nagel Aug 21, 2018
ff81408
Integrate detection of text language into WARC exporter/writer
sebastian-nagel Aug 8, 2018
fd9ad2c
Merge branch 'cc-language-detection' into cc-1.16-1
sebastian-nagel Aug 21, 2018
d755482
Merge remote-tracking branch 'sebastian/NUTCH-2635-generator-temporar…
sebastian-nagel Aug 21, 2018
12db351
SitemapInjector
sebastian-nagel Aug 22, 2018
5bd86ea
Refactor S3 output committers
sebastian-nagel Aug 22, 2018
8c229c7
Merge branch 'cc-s3-null-output-committer' into cc-1.16-1
sebastian-nagel Aug 22, 2018
e316bd1
Deduplication of redirects: do not run merge job if there are no
sebastian-nagel Aug 22, 2018
34a1982
Merge branch 'redirect-deduplication' into cc-1.16-1
sebastian-nagel Aug 22, 2018
f931117
Generator2 - alternative generation of fetch lists:
sebastian-nagel Aug 22, 2018
436ad90
Add DedupRedirectsJob to bin/nutch and log4j.properties
sebastian-nagel Aug 22, 2018
6d5db89
Merge branch 'redirect-deduplication' into cc-1.16-1
sebastian-nagel Aug 22, 2018
d9764fe
Use crawler-commons development version
sebastian-nagel Aug 2, 2018
dd0b8ea
Build with Hadoop 2.6 / CDH 5.15.1
sebastian-nagel Sep 11, 2018
a8b237a
Merge branch 'Hadoop_2.6.0-cdh5.15.1' into cc
sebastian-nagel Sep 11, 2018
8ec9aec
Fix exclusion of tika-parsers dependencies:
sebastian-nagel Sep 12, 2018
8421929
Merge branch 'cc-language-detection' into cc
sebastian-nagel Sep 12, 2018
5b7c469
Extend local maven resolver to also use remote repositories as fall-back
sebastian-nagel Sep 12, 2018
5b246e4
Merge branch 'tika-dev' into cc
sebastian-nagel Sep 12, 2018
e1a3d3a
NUTCH-2644 CrawlDbReader -dump ignores filter options
sebastian-nagel Sep 12, 2018
d9e37f3
Generator2: exit early if no URLs have been selected
sebastian-nagel Sep 13, 2018
86067a8
Adaptive scoring filter: allow generator sort value to depend on MIME…
sebastian-nagel Sep 13, 2018
33fb9a6
Merge branch 'cc-scoring-adaptive' into cc
sebastian-nagel Sep 13, 2018
fd7fa94
WARC writer incorrectly adds extra line in response records
sebastian-nagel Sep 14, 2018
edc7398
WARC writer incorrectly adds extra line in response records
sebastian-nagel Sep 14, 2018
8eecc67
Generator2:
sebastian-nagel Sep 18, 2018
4ab7b68
Common Crawl WARC writer/exporter
sebastian-nagel Sep 25, 2018
9e57eb5
Merge branch 'cc-warc-writer' into cc
sebastian-nagel Sep 25, 2018
0b393af
Merge branch 'master' into cc
sebastian-nagel Oct 10, 2018
5ae466a
Use crawler-commons development version
sebastian-nagel Aug 2, 2018
62233d0
Merge branch 'master' into cc
sebastian-nagel Oct 10, 2018
2f3ce2c
WARC writer incorrectly adds extra line in response records
sebastian-nagel Oct 10, 2018
7a9ea78
Merge branch 'cc-warc-writer' into cc
sebastian-nagel Oct 10, 2018
3b6172c
Upgrade to use Tika 1.19 (core and parse-tika)
sebastian-nagel Oct 10, 2018
e9f32dc
CCFetchSchedule: reset not-modified time during fetch list generation
sebastian-nagel Oct 10, 2018
e94d9a2
Merge branch 'cc-fetch-schedule' into cc
sebastian-nagel Oct 10, 2018
7c02acb
Upgrade to use Tika 1.19.1 (core and parse-tika)
sebastian-nagel Oct 12, 2018
b57faf3
Merge branch 'tika-1.19.1' into cc
sebastian-nagel Oct 12, 2018
efa0044
WARC writer charset detection #6
sebastian-nagel Oct 16, 2018
06e63c7
WARC writer charset detection #6
sebastian-nagel Oct 16, 2018
1121dbb
NUTCH-2674 HostDb: dump shows wrong column headers
sebastian-nagel Nov 8, 2018
04ae437
Add job/task counters to track status and errors of language detection
sebastian-nagel Nov 8, 2018
9e6ee31
Merge branch 'cc-language-detection' into cc
sebastian-nagel Nov 8, 2018
63267fc
Use crawler-commons development version
sebastian-nagel Aug 2, 2018
5faa22a
Merge branch 'master' (upstream apache/nutch) into cc
sebastian-nagel Nov 8, 2018
872ebe5
Deduplication of redirects: allow to optionally skip the sorting job
sebastian-nagel Nov 8, 2018
7abf930
Merge branch 'redirect-deduplication' into cc
sebastian-nagel Nov 8, 2018
20c2d4e
Add job/task counters to track status and errors of language detection:
sebastian-nagel Nov 8, 2018
671417e
Deduplication of redirects: allow to optionally skip the sorting job
sebastian-nagel Nov 8, 2018
92e097a
Merge branch 'redirect-deduplication' into cc
sebastian-nagel Nov 8, 2018
e19ffb9
Adaptive scoring filter: allow to configure time span after which
sebastian-nagel Nov 9, 2018
d327738
Merge branch 'cc-scoring-adaptive' into cc
sebastian-nagel Nov 9, 2018
9a8f78a
Generator2 if run with -keep: do not cleanup step1 if step2 fails
sebastian-nagel Nov 12, 2018
e53e96d
Generator2:
sebastian-nagel Dec 17, 2018
993a00b
Use crawler-commons development version
sebastian-nagel Aug 2, 2018
6abcb39
NUTCH-2683 DeduplicationJob: add option to prefer https:// over http://
sebastian-nagel Jan 7, 2019
5d3a8f9
Merge branch 'master' into cc
sebastian-nagel Jan 11, 2019
e0486d7
Merge branch NUTCH-2682-upgrade-tika into cc
sebastian-nagel Jan 11, 2019
8fcb209
Add FastURLFilter which first does fast exact matches on host/domain …
sebastian-nagel Nov 28, 2016
655d19d
Add urlfilter-fast to javadoc, eclipse, maven-jar
sebastian-nagel Aug 14, 2017
2077ffa
Speed-up urlfilter-fast:
sebastian-nagel Nov 10, 2017
c3b1d2d
Improve urlfilter-fast
sebastian-nagel Jun 7, 2018
813ae7b
urlfilter-fast: also look up hostname in domain rules
sebastian-nagel Jun 13, 2018
1278970
Improve plugin urlfilter-fast:
sebastian-nagel Jan 21, 2019
7e1c8ce
Merge branch 'master' into cc
sebastian-nagel Jan 22, 2019
d6c6325
Merge branch 'cc-fast-url-filter' into cc
sebastian-nagel Jan 22, 2019
e68e085
WARC writer: improve logging for URLs skipped because of no content
sebastian-nagel Jan 24, 2019
676cac0
Integrate detection of text language into WARC exporter/writer
sebastian-nagel Aug 6, 2018
d0fe4d7
Integrate detection of text language into WARC exporter/writer
sebastian-nagel Aug 8, 2018
e5803ff
Integrate detection of text language into WARC exporter/writer
sebastian-nagel Aug 8, 2018
6e90e76
Fix exclusion of tika-parsers dependencies:
sebastian-nagel Sep 12, 2018
53fd5c6
WARC writer charset detection #6
sebastian-nagel Oct 16, 2018
e5fe23f
Add job/task counters to track status and errors of language detection
sebastian-nagel Nov 8, 2018
b548c03
Add job/task counters to track status and errors of language detection:
sebastian-nagel Nov 8, 2018
e2364e6
SitemapInjector: allow to run sitemap and inject jobs independently
sebastian-nagel Jan 24, 2019
562ec81
Merge branch 'cc-language-detection' into cc
sebastian-nagel Jan 24, 2019
82bbcbf
Merge branch 'cc-sitemap-injector' into cc
sebastian-nagel Jan 24, 2019
7b9f247
Upgrade to Tika 1.20: also upgrade tika-parsers in core
sebastian-nagel Jan 24, 2019
9f4fdeb
Merge branch 'master' into cc
sebastian-nagel Feb 7, 2019
4152309
Generator2: code clean-up and minor improvements
sebastian-nagel Feb 13, 2019
4c14d26
Generator2: domain-specific limits
sebastian-nagel Feb 13, 2019
a4c6de8
Generator2: domain-specific limits
sebastian-nagel Feb 15, 2019
041bbd8
Generator2: domain-specific limits
sebastian-nagel Feb 20, 2019
3f95636
Merge branch 'master' into cc
sebastian-nagel Mar 7, 2019
c183ba0
Improve checks to ensure that HTTP headers end in `\r\n\r\n` (cf. #5)
sebastian-nagel Mar 14, 2019
c3e07a8
Merge branch 'cc-warc-writer' into cc
sebastian-nagel Mar 14, 2019
278234e
Generator2: domain-specific limits
sebastian-nagel Mar 22, 2019
9ae8908
Add tool UrlCleaner which cleans a list of URLs
sebastian-nagel Apr 1, 2019
7f81ea3
Merge branch 'master' into cc
sebastian-nagel Apr 11, 2019
5a4eaa4
Merge branch 'cc-fast-url-filter' into cc
sebastian-nagel Apr 11, 2019
9adac54
Merge branch 'master' into cc
sebastian-nagel Apr 12, 2019
6fd0911
Upgrade to CDH 6.0.0 (Hadoop 3.0)
sebastian-nagel Oct 10, 2018
3e82b1a
Update to CDH 6.1.0
sebastian-nagel Jan 11, 2019
21b77e1
Update to CDH 6.2.0
sebastian-nagel Apr 29, 2019
864fdd0
Merge branch 'master' into cc
sebastian-nagel May 15, 2019
05cf8f3
Merge branch 'master' into cc
sebastian-nagel Jun 12, 2019
8781b59
SitemapInjector: implement configurable depth limit when processing
sebastian-nagel Jun 17, 2019
2e5b0b9
Language detection: avoid unnecessary reencoding of UTF-8
sebastian-nagel Jul 13, 2019
10df4ba
Set task status when WARC writers are closed
sebastian-nagel Aug 9, 2019
ff52f39
Merge branch 'cc-warc-writer' into cc
sebastian-nagel Aug 9, 2019
706337e
Merge branch 'master' into cc
sebastian-nagel Aug 11, 2019
8dae896
[WARC writer] end datetime in WARC file name must be fixed to timelimit,
sebastian-nagel Aug 11, 2019
d3f9200
Merge branch 'cc-warc-writer' into cc
sebastian-nagel Aug 11, 2019
e1e820e
Merge 'NUTCH-2726-tika-1.22' into cc
sebastian-nagel Aug 12, 2019
220fa0c
NUTCH-2729 protocol-okhttp: fix marking of truncated content
sebastian-nagel Aug 13, 2019
ea36ebb
Merge branch 'NUTCH-2728-protocol-okhttp-3.14.2' into cc
sebastian-nagel Aug 13, 2019
7937eec
Merge branch 'NUTCH-2729-protocol-okhttp-mark-truncated' into cc
sebastian-nagel Aug 13, 2019
3663f35
WARC writer: always add HTTP `Content-Length` header,
sebastian-nagel Aug 13, 2019
fc1ffdd
Merge branch 'cc-warc-writer' into cc
sebastian-nagel Aug 13, 2019
fda1c89
NUTCH-2729 protocol-okhttp: fix marking of truncated content
sebastian-nagel Aug 26, 2019
c399fbf
Merge branch 'NUTCH-2729-protocol-okhttp-mark-truncated' into cc
sebastian-nagel Aug 26, 2019
3ea50e0
Merge changes related to
sebastian-nagel Aug 29, 2019
c309dbe
NUTCH-2729 protocol-okhttp: fix marking of truncated content
sebastian-nagel Aug 30, 2019
61eca49
Merge branch 'NUTCH-2729-protocol-okhttp-mark-truncated' into cc
sebastian-nagel Aug 30, 2019
51292de
Merge branch 'master' into cc
sebastian-nagel Sep 9, 2019
e80ed65
Merge branch 'master' into cc
sebastian-nagel Oct 9, 2019
25fa6a5
Merge branch 'master' into cc
sebastian-nagel Nov 7, 2019
c1b8ce9
Merge remote-tracking branch 'sebastian/NUTCH-2746-url-normalizer-bas…
sebastian-nagel Nov 7, 2019
adfcc45
WARC writer (CDX writer): new CDX fields/keys in JSON data
sebastian-nagel Nov 7, 2019
1d82265
RobotsRulesParser: also store fetch date if robots.txt content is stored
sebastian-nagel Nov 7, 2019
e99284b
Merge branch 'warc-cdx-mark-truncation-and-redirects' into cc
sebastian-nagel Nov 12, 2019
eaeb7b1
Code cleanup:
sebastian-nagel Nov 7, 2019
f91c8fc
Merge branch 'master' into cc
sebastian-nagel Dec 2, 2019
fdc1c68
Merge branch 'master' into cc
sebastian-nagel Dec 2, 2019
81d4e1b
Merge branch 'NUTCH-2754-max-crawl-delay' into cc
sebastian-nagel Dec 6, 2019
5cbfac9
NUTCH-2759 bin/crawl: Rename option --num-slaves
sebastian-nagel Jan 9, 2020
c4d407e
Merge branch 'master' into cc
sebastian-nagel Jan 9, 2020
6669783
NUTCH-2733 protocol-okhttp: add support for Brotli compression (Conte…
sebastian-nagel Jan 9, 2020
816e510
NUTCH-2733 protocol-okhttp: add support for Brotli compression (Conte…
sebastian-nagel Jan 9, 2020
7347c42
Merge remote-tracking branch 'sebastian/NUTCH-2733-protocol-okhttp-su…
sebastian-nagel Jan 17, 2020
bbdea28
Add tool to convert CrawlDb (optionally filtered) to Nutch seed files
sebastian-nagel Jan 17, 2020
bd84ae9
Merge branch 'master' into cc
sebastian-nagel Jan 19, 2020
92b0da0
Merge branch 'master' into cc
sebastian-nagel Feb 10, 2020
cb55401
NUTCH-2763 protocol-okhttp (store.http.headers): add whitespace in st…
sebastian-nagel Jan 20, 2020
6152d0d
Tool UrlCleaner: compress output if configured
sebastian-nagel Feb 13, 2020
792f1ba
Merge branch 'master' into cc
sebastian-nagel Mar 4, 2020
e5033be
Merge branch 'master' into cc
sebastian-nagel Mar 20, 2020
c2e9b5b
NUTCH-2623 Fetcher to guarantee delay for same host/domain/ip indepen…
sebastian-nagel Mar 20, 2020
d3897fa
NUTCH-2776 Fetcher to temporarily deduplicate followed redirects
sebastian-nagel Mar 20, 2020
7855a4b
Merge branch 'NUTCH-2776-fetcher-dedup-redirects' into cc
sebastian-nagel Mar 21, 2020
7038ccc
NUTCH-2775 Fetcher to guarantee minimum delay even if robots.txt defi…
sebastian-nagel Mar 25, 2020
9871174
Merge branch 'master' into cc
sebastian-nagel May 7, 2020
dd9c1c6
Add section about Common Crawl modifications to README
sebastian-nagel May 11, 2020
6ada324
Merge branch 'master' into cc
sebastian-nagel May 18, 2020
ac480a0
SitemapInjector:
sebastian-nagel May 20, 2020
41b00a5
Merge branch 'cc-sitemap-injector' into cc
sebastian-nagel May 20, 2020
4ab1254
Merge branch 'master' into cc
sebastian-nagel Jun 28, 2020
13f0e5e
Fetcher writing WARCs: do not pass robots.txt content to record writer
sebastian-nagel Jun 28, 2020
fc75109
Upgrade crawler-commons dependency to 1.2-SNAPSHOT
sebastian-nagel Jun 28, 2020
93ca14c
SitemapInjector:
sebastian-nagel Jul 1, 2020
5b73e16
Merge branch 'master' into cc
sebastian-nagel Jul 1, 2020
b3b78bb
WarcCdxWriter: extraction of redirect targets for CDX should not be c…
sebastian-nagel Jul 3, 2020
9a3cef5
SitemapInjector:
sebastian-nagel Jul 30, 2020
49a483f
Tool UrlCleaner: allow to sum integer values (command-line option -su…
sebastian-nagel Jul 30, 2020
9f9b44e
Tool UrlCleaner: allow multiple input paths
sebastian-nagel Jul 31, 2020
14ba0fa
Generator2: reduce memory foot-print by loading only domain-specific
sebastian-nagel Jul 31, 2020
f60a941
Add tool UrlSampler:
sebastian-nagel Aug 2, 2020
15468b6
Merge branch 'master' into cc
sebastian-nagel Aug 18, 2020
899f6ec
Generator2: log number of URLs selected and skipped per host/domain
sebastian-nagel Aug 18, 2020
af635f7
Generator2: load domain limits file from hdfs:// or any supported fil…
sebastian-nagel Aug 20, 2020
421db1a
Merge branch 'master' into cc
sebastian-nagel Sep 14, 2020
5ff6f42
Generator2: update Javadoc class description
sebastian-nagel Oct 16, 2020
31f4f45
Merge branch 'master' into cc
sebastian-nagel Nov 18, 2020
891b1b7
Merge branch 'master' into cc
sebastian-nagel Jan 12, 2021
f0814f6
Upgrade JNA dependency
sebastian-nagel Jan 12, 2021
fd1d613
Language detection: log content type with reduced level (info -> debug)
sebastian-nagel Jan 21, 2021
be3f0e5
Update nutch-default.xml
sebastian-nagel Feb 15, 2021
ae5089c
Merge branch 'master' into cc
sebastian-nagel Feb 15, 2021
8fba6dd
Merge branch 'master' into cc
sebastian-nagel Feb 22, 2021
a063809
SitemapInjector: refactor and improve logging and counters
sebastian-nagel Feb 22, 2021
e266a21
Merge branch 'master' into cc
sebastian-nagel Mar 25, 2021
eb07f51
Downgrade to build using Java 8 (for now)
sebastian-nagel Mar 25, 2021
73560c4
Generator2: improve distribution of URLs per host over segments
sebastian-nagel Mar 25, 2021
b72d3d9
Merge branch 'master' into cc
sebastian-nagel Apr 6, 2021
98a1379
WARC writer: update link to WARC specification
sebastian-nagel Apr 6, 2021
809f3e1
UrlSampler: increase impact of random
sebastian-nagel May 20, 2021
57f963b
Merge branch 'master' into cc
sebastian-nagel Jun 8, 2021
ad3e47d
Log start/end/elapsed time in Common Crawl tools:
sebastian-nagel Jun 8, 2021
5eab6a1
Adaptive scoring filter: add `@Override` annotations
sebastian-nagel Jun 10, 2021
5ca3056
Merge branch 'master' into cc
sebastian-nagel Jun 14, 2021
78acb32
Generator2: reduce memory usage while reading domain limits in the
sebastian-nagel Jun 17, 2021
d00dc75
Merge branch 'master' into cc
sebastian-nagel Jul 15, 2021
88affb0
Upgrade to Hadoop 3.2.2
sebastian-nagel Jul 20, 2021
9885421
README: update instructions how to build CC's fork of Nutch
sebastian-nagel Aug 3, 2021
3597e23
Fetcher: optionally do not archive robots.txt responses in WARC files if
sebastian-nagel Jul 20, 2021
e1a4f9d
Update to compile for Java 11
sebastian-nagel Aug 30, 2021
45fced0
NUTCH-2896 Protocol-okhttp: make connection pool configurable
sebastian-nagel Sep 21, 2021
3b77820
NUTCH-2896 Protocol-okhttp: make connection pool configurable
sebastian-nagel Sep 21, 2021
6e02ac1
Merge branch 'master' into cc
sebastian-nagel Oct 11, 2021
7558c6b
Merge branch 'NUTCH-2896-okhttp-connection-pool' into cc
sebastian-nagel Oct 11, 2021
6ef53c9
Merge branch 'master' into cc
sebastian-nagel Nov 22, 2021
f7c666f
Merge branch 'NUTCH-2891-tika-2.1' into cc
sebastian-nagel Nov 22, 2021
ca0c9a3
Upgrade to use Tika 2.1.0 in LanguageDetector
sebastian-nagel Nov 23, 2021
cde92e7
Upgrade to use Tika 2.1.0 in LanguageDetector
sebastian-nagel Nov 26, 2021
93f3a1d
Merge branch 'master' into cc
sebastian-nagel Dec 1, 2021
3b5dfe9
Merge branch 'master' into cc
sebastian-nagel Dec 3, 2021
009b314
Merge branch 'master' into cc
sebastian-nagel Jan 10, 2022
d52cfca
Merge branch 'NUTCH-2929-fetcher-threads-slow-start' into cc
sebastian-nagel Jan 11, 2022
b124c63
NUTCH-2935 DeduplicationJob: failure on URLs with invalid percent enc…
sebastian-nagel Jan 14, 2022
e07bb69
Merge branch 'master' into cc
sebastian-nagel Jan 14, 2022
b289248
Merge branch 'NUTCH-2935' into cc
sebastian-nagel Jan 14, 2022
ae9b390
Merge branch 'master' into cc
sebastian-nagel Jan 18, 2022
422e6be
Merge branch 'master' into cc
sebastian-nagel May 2, 2022
0407edf
NUTCH-2946 Fetcher: slow down fetching from hosts where requests fail…
sebastian-nagel Jan 14, 2022
2a57303
NUTCH-2946 Fetcher: optionally slow down fetching from hosts with rep…
sebastian-nagel May 3, 2022
f3b29b2
Merge branch 'NUTCH-2946-fetcher-queue-exception-slow-down' into cc
sebastian-nagel May 12, 2022
014f5c6
NUTCH-2947 Fetcher: keep state of empty but stateful fetch queues
sebastian-nagel Jan 27, 2022
2659f68
NUTCH-2947 Fetcher: keep state of empty but stateful fetch queues
sebastian-nagel May 12, 2022
b324e18
Merge branch 'NUTCH-2947-keep-stateful-fetch-queues-cc' into cc
sebastian-nagel May 12, 2022
ca0f089
NUTCH-2948 Upgrade dependencies to Any23 2.7 and Tika 2.3.0
sebastian-nagel May 5, 2022
9dfb6a9
NUTCH-2936 Early registration of URL stream handlers provided by plug…
sebastian-nagel Jun 14, 2022
90e1f6a
NUTCH-2936 Early registration of URL stream handlers provided by plug…
sebastian-nagel May 19, 2022
327ab2c
NUTCH-2936 Early registration of URL stream handlers provided by plug…
sebastian-nagel Jun 15, 2022
01e54b7
Merge branch 'master' into cc
sebastian-nagel Jun 21, 2022
9b676ae
Merge branch 'NUTCH-2936-url-stream-handler-protocol-okhttp-failing-j…
sebastian-nagel Jun 21, 2022
6b50e19
NUTCH-2930 Protocol-okhttp: implement IP filter
sebastian-nagel Jan 11, 2019
d34d641
Merge branch 'NUTCH-2930-okhttp-ip-address-filter' into cc
sebastian-nagel Jun 23, 2022
f0e03a7
Upgrade to crawler-commons 1.3 / 1.4-SNAPSHOT
sebastian-nagel Aug 4, 2022
04b54e9
Merge branch 'master' into cc
sebastian-nagel Sep 12, 2022
3ca954f
Merge branch 'master' into cc
sebastian-nagel Jan 8, 2023
0715793
Merge branch 'master' into cc
sebastian-nagel Mar 17, 2023
8ec96ed
Adaptive scoring filter:
sebastian-nagel Mar 18, 2023
ca58eed
WARC writer: add counters for records skipped because of
sebastian-nagel Mar 19, 2023
9233de6
WARC writer: add options to skip captures by MIME type
sebastian-nagel Mar 20, 2023
04d7503
WARC writer: document properties in nutch-default.xml
sebastian-nagel Mar 22, 2023
53722b3
WARC writer: log fetch time, status code and size of
sebastian-nagel Mar 28, 2023
1729f0c
WARC writer: try to normalize invalid URIs
sebastian-nagel Mar 28, 2023
17bb09c
NUTCH-2992 Fetcher: always block fetch queues when exceptions thresho…
sebastian-nagel May 16, 2023
433dfee
Plugin lib-http: format robots.txt unit tests
sebastian-nagel May 23, 2023
1d13f20
Upgrade robots.txt parser to forthcoming changes in crawler-commons 1.4
sebastian-nagel May 24, 2023
0b406b4
Work-around for NUTCH-2749 Fetcher and scoring-opic: transfer score t…
sebastian-nagel May 25, 2023
e8f9901
Fetcher: add option to archive robots.txt responses in WARC files if
sebastian-nagel May 25, 2023
3bdb58f
WARC writer: use URI.toASCIIString() instead of URI.toString(), fixes…
sebastian-nagel Jun 24, 2023
adeb861
Merge branch 'master' into cc
sebastian-nagel Aug 22, 2023
b76798b
Add Override annotations
sebastian-nagel Aug 22, 2023
ce45a92
Allow fast-urlfilter to load from HDFS/S3 and support gzipped input, …
jnioche Oct 30, 2023
f777105
[Nutch-3017] Apply Nutch formatting
jnioche Oct 30, 2023
39 changes: 36 additions & 3 deletions README.md
@@ -1,9 +1,42 @@
-Apache Nutch README
-===================
+Common Crawl Fork of Apache Nutch
+=================================
+
+Please also have a look at the [Apache Nutch](/apache/nutch) repository and all information about Apache Nutch given below.
+
+Notable additions in Common Crawl's fork of Nutch (not yet pushed to upstream Nutch although this is planned):
+- WARC and CDX writer integrated into Fetcher and able to detect the language of HTML pages using the CLD2 language detector
+- [Generator2](src/java/org/apache/nutch/crawl/Generator2.java): alternative implementation of Generator
+- allowing to combine per-domain and per-host limits and
+- optimized to create many (eg. 100) segments in a single job
+
+How to install additional requirements to build this fork of Nutch:
+- [crawler-commons](/crawler-commons/crawler-commons) development snapshot package:
+```
+git clone git@github.com:crawler-commons/crawler-commons.git
+cd crawler-commons/
+mvn install
+```
+- install the latest public suffix list into `conf/` to ensure that it is definitely used (see #17):
+```
+wget https://publicsuffix.org/list/public_suffix_list.dat -O conf/effective_tld_names.dat
+```
+- [Java wrapper for CLD2 language detection](/commoncrawl/language-detection-cld2)
+```
+git clone git@github.com:commoncrawl/language-detection-cld2.git
+cd language-detection-cld2/
+mvn install
+```
+For runtime, if WARC language detection is enabled (`warc.detect.language` = true), also the CLD2 shared objects are required, e.g. on Ubuntu
+```
+sudo apt install libcld2-0 libcld2-dev
+```
+
+Apache Nutch
+============

<img src="https://nutch.apache.org/assets/img/nutch_logo_tm.png" align="right" width="300" />

-For the latest information about Nutch, please visit our website at:
+For the latest information about Nutch, please visit the Nutch website at:

https://nutch.apache.org/

7 changes: 5 additions & 2 deletions build.xml
@@ -126,7 +126,7 @@
<javac
encoding="${build.encoding}"
srcdir="${src.dir}"
-includes="org/apache/nutch/**/*.java"
+includes="org/apache/nutch/**/*.java org/commoncrawl/**/*.java"
destdir="${build.classes}"
debug="${javac.debug}"
optimize="${javac.optimize}"
@@ -250,6 +250,7 @@
<packageset dir="${plugins.dir}/protocol-okhttp/src/java"/>
<packageset dir="${plugins.dir}/protocol-selenium/src/java"/>
<packageset dir="${plugins.dir}/publish-rabbitmq/src/java"/>
+<packageset dir="${plugins.dir}/scoring-adaptive/src/java"/>
<packageset dir="${plugins.dir}/scoring-depth/src/java"/>
<packageset dir="${plugins.dir}/scoring-link/src/java"/>
<packageset dir="${plugins.dir}/scoring-opic/src/java"/>
@@ -456,7 +457,7 @@
<javac
encoding="${build.encoding}"
srcdir="${test.src.dir}"
-includes="org/apache/nutch/**/*.java"
+includes="org/apache/nutch/**/*.java org/commoncrawl/**/*.java"
destdir="${test.build.classes}"
debug="${javac.debug}"
optimize="${javac.optimize}"
@@ -735,6 +736,7 @@
<packageset dir="${plugins.dir}/protocol-okhttp/src/java"/>
<packageset dir="${plugins.dir}/protocol-selenium/src/java"/>
<packageset dir="${plugins.dir}/publish-rabbitmq/src/java"/>
+<packageset dir="${plugins.dir}/scoring-adaptive/src/java"/>
<packageset dir="${plugins.dir}/scoring-depth/src/java"/>
<packageset dir="${plugins.dir}/scoring-link/src/java"/>
<packageset dir="${plugins.dir}/scoring-opic/src/java"/>
@@ -1256,6 +1258,7 @@
<source path="${plugins.dir}/protocol-okhttp/src/test/" />
<source path="${plugins.dir}/protocol-selenium/src/java"/>
<source path="${plugins.dir}/publish-rabbitmq/src/java"/>
+<source path="${plugins.dir}/scoring-adaptive/src/java"/>
<source path="${plugins.dir}/scoring-depth/src/java/" />
<source path="${plugins.dir}/scoring-link/src/java/" />
<source path="${plugins.dir}/scoring-opic/src/java/" />
17 changes: 17 additions & 0 deletions conf/adaptive-scoring.txt.template
@@ -0,0 +1,17 @@
#
# Configuration file for scoring-adaptive
#
# See also properties
# scoring.adaptive.sort.by_status.file
# scoring.adaptive.factor.fetchtime
# scoring.adaptive.penalty.fetch_retry
# scoring.adaptive.boost.injected
#
# Format:
# <status> <tab> <sortvalue>
# e.g.
# db_unfetched .1
# db_gone -.5
#
# The sort value is added to other sort values (score, fetch time).
# It may be negative to penalize fetch items.
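
To illustrate the file format described in this template, here is a minimal Python sketch of a reader for `<status> <tab> <sortvalue>` lines. It is not the plugin's Java implementation. The function names `parse_status_sort_values` and `combined_sort_value` are hypothetical, and skipping blank lines and `#` comments is an assumption based on how the template itself is written. Only the score is combined with the status value; the fetch-time component mentioned in the template is left out.

```python
def parse_status_sort_values(text):
    """Parse '<status> <tab> <sortvalue>' lines into a dict of floats.

    Assumption: blank lines and lines starting with '#' are skipped.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        status, sort_value = line.split("\t", 1)
        values[status.strip()] = float(sort_value)
    return values


def combined_sort_value(status, score, values):
    """Add the per-status sort value to a score.

    Statuses without an entry contribute 0.0 (an assumption of this sketch).
    """
    return score + values.get(status, 0.0)


# The two example entries from the template, with the comment marker removed.
example = "# status<tab>sortvalue\ndb_unfetched\t.1\ndb_gone\t-.5\n"
values = parse_status_sort_values(example)
```

With these values, a `db_gone` item is ranked lower than an item of an unlisted status with the same score, which matches the template's note that negative values penalize fetch items.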
15 changes: 15 additions & 0 deletions conf/generate-domain-limits.txt.template
@@ -0,0 +1,15 @@
#
# Fetch list limits by domain
#
# Note: please register this file using the property `generate.domain.limits.file`.
#
# Fields (tab-separated):
# 1 - domain name
# 2 - max. number of URLs per domain and segment
# 3 - max. number of URLs per host (in domain) and segment
# 4 - max. number of hosts per domain over all segments
# 5 - max. number of partitions (fetcher tasks) for domain
#
# Lines starting with `#` are ignored.
#
# wikipedia.org 5000 500 20 1
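
The five tab-separated fields listed above can be read as in the following Python sketch. It only illustrates the file format and is not how Generator2 (Java) loads the file. The names `DomainLimits` and `parse_domain_limits` are hypothetical, and requiring all five fields on every line is an assumption of this sketch.

```python
from typing import NamedTuple


class DomainLimits(NamedTuple):
    domain: str
    max_urls_per_domain: int   # field 2: per domain and segment
    max_urls_per_host: int     # field 3: per host (in domain) and segment
    max_hosts: int             # field 4: hosts per domain over all segments
    max_partitions: int        # field 5: partitions (fetcher tasks) for domain


def parse_domain_limits(text):
    """Parse tab-separated domain limit lines into a dict keyed by domain.

    Lines starting with '#' are ignored, as stated in the template.
    Assumption: every data line carries exactly five fields.
    """
    limits = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}: {line!r}")
        domain = fields[0].strip()
        limits[domain] = DomainLimits(domain, *(int(f) for f in fields[1:]))
    return limits


# The template's example entry, with the comment marker removed.
example = "# Fetch list limits by domain\nwikipedia.org\t5000\t500\t20\t1\n"
limits = parse_domain_limits(example)
```

Here `limits["wikipedia.org"]` holds the limits 5000 URLs per domain and segment, 500 URLs per host and segment, 20 hosts, and 1 partition.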