
BIGTOP-284 Integrate Apache Nutch into the Apache Bigtop ecosystem#1380

Draft
lewismc wants to merge 26 commits into apache:master from lewismc:BIGTOP-284

Conversation

@lewismc
Member

@lewismc lewismc commented Feb 28, 2026

Description of PR

BIGTOP-284 seeks to introduce Apache Nutch smoke tests into the Bigtop ecosystem. I commented on the original ticket way back in 2011 and never did anything about it. This PR seeks to address that.
Nutch is a highly extensible, highly scalable, mature, production-ready web crawler that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks. Because Nutch relies on Apache Hadoop data structures, it is well suited to batch processing of large data volumes via MapReduce jobs, but it can also be tailored to smaller jobs.

How was this patch tested?

Testing is ongoing. The goal is for the Nutch community to test this patch and hopefully update this thread with feedback. More details to follow.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'BIGTOP-3638. Your PR title ...')?
  • Make sure that newly added files do not have any licensing issues. When in doubt refer to https://www.apache.org/licenses/

@lewismc
Member Author

lewismc commented Feb 28, 2026

Testing the Nutch integration

This guidance is intended for peer reviewers interested in the Apache Nutch integration in Bigtop. Nutch is built from source with Ant (the ant runtime target), packaged using runtime/deploy for Hadoop cluster execution, and all smoke tests run against a Hadoop cluster using HDFS.

Prerequisites

  • Around 20GB of free disk space (to be safe)
sudo apt update && sudo apt upgrade && sudo apt install zip openjdk-11-jdk
# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF

sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-compose ruby
sudo usermod -aG docker $USER

Log out and back in (or restart the session) so the docker group membership takes effect.

git clone https://github.com/lewismc/bigtop.git && cd bigtop && git checkout -b BIGTOP-284 && git pull origin BIGTOP-284
  • Hadoop cluster – Smoke tests require a running cluster (HDFS and YARN). They use HADOOP_CONF_DIR and will not run without it.
  • x86_64 Linux – Building Nutch packages via nutch-pkg-ind uses the Bigtop Docker slave image, which is only published for x86_64. On Apple Silicon (arm64), the build script uses --platform linux/amd64; running amd64 containers under emulation can fail with "exec format error", so building packages is most reliable on native x86_64 Linux.
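Before building, it can help to confirm that the environment matches the prerequisites above. A minimal check (sketch only; the exact versions reported will vary by machine):

```shell
# Quick environment sanity check against the prerequisites above.
uname -m                                # x86_64 is needed for the Docker-based package build
java -version 2>&1 | head -n 1          # expect an OpenJDK 11 build
docker --version 2>/dev/null || echo "docker not installed yet"
df -h . | tail -n 1                     # confirm roughly 20GB free in the working directory
```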

1. Build the Nutch package

From the Bigtop repo root:

./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged"

To build Nutch and its dependencies (e.g. Hadoop) in Docker:

./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged" -Dbuildwithdeps=true

Output appears under build/nutch/ and output/nutch/. The installed Nutch uses runtime/deploy (uber jar and scripts that run via hadoop jar on the cluster).

2. Run the Nutch smoke tests

Smoke tests require a Hadoop cluster: they use HDFS for seed URLs, crawldb, and segments, and they expect HADOOP_CONF_DIR to be set.

On a host where Nutch and Hadoop are already installed

Set the environment and run the Nutch smoke tests:

export JAVA_HOME=/path/to/jdk
export HADOOP_CONF_DIR=/etc/hadoop/conf   # or your cluster's conf dir
./gradlew bigtop-tests:smoke-tests:nutch:test -Psmoke.tests

Or from the smoke-tests directory:

cd bigtop-tests/smoke-tests
../../gradlew nutch:test -Psmoke.tests

Tests run in order: usage, inject subcommand, inject + readdb on HDFS, then generate on HDFS. Cleanup removes /user/root/nutch-smoke from HDFS.
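For reference, the HDFS steps above can be reproduced by hand on a cluster node. This is a sketch only: it assumes nutch and hdfs are on the PATH with HADOOP_CONF_DIR set, and the seed URL is an arbitrary example, not the one the tests use.

```shell
# Manual equivalent of the smoke-test crawl steps (sketch; paths match the
# test description above).
BASE=/user/root/nutch-smoke

nutch_smoke_manual() {
  printf 'http://example.com/\n' > seed.txt        # arbitrary example seed URL
  hdfs dfs -mkdir -p "$BASE/urls"
  hdfs dfs -put -f seed.txt "$BASE/urls/seed.txt"
  nutch inject "$BASE/crawldb" "$BASE/urls"        # inject seeds into the crawldb
  nutch readdb "$BASE/crawldb" -stats              # expect crawldb stats output
  nutch generate "$BASE/crawldb" "$BASE/segments" -topN 1
  hdfs dfs -ls "$BASE/segments"                    # expect at least one segment
  hdfs dfs -rm -r -f "$BASE"                       # cleanup, as the tests do
}

if command -v hdfs >/dev/null 2>&1; then
  nutch_smoke_manual
else
  echo "skipping: hdfs not on PATH (run this on a cluster node)"
fi
```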

Via Docker provisioner (full stack + smoke)

  1. Build packages (with deps if needed) and enable the local repo in provisioner/docker/config.yaml:

    • enable_local_repo: true
    • nutch is already in components and smoke_test_components.
  2. From provisioner/docker/:

    ./docker-hadoop.sh --create 3 --smoke-tests

    This provisions a cluster (including Nutch), then runs all smoke tests (including Nutch). Ensure the provisioner has enough resources and that the Nutch packages are present in the local repo (e.g. under output/apt or equivalent).

3. Deploy Nutch with Puppet

To deploy Nutch on a Bigtop-managed cluster, include nutch in the cluster components (e.g. in Hiera or site.yaml):

hadoop_cluster_node::cluster_components:
  - hdfs
  - yarn
  - mapreduce
  - nutch

Nodes that receive the nutch-client role will have the Nutch package installed and /etc/default/nutch configured with NUTCH_HOME, NUTCH_CONF_DIR, and HADOOP_CONF_DIR. Run crawl commands (e.g. nutch inject, nutch generate) from a gateway/client node against HDFS paths.
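As an illustration, /etc/default/nutch might look like the following. The HADOOP_CONF_DIR value matches the one used earlier in this guide; the other paths are assumptions and depend on how the Bigtop packages lay out the installation:

```shell
# Illustrative /etc/default/nutch (NUTCH_HOME and NUTCH_CONF_DIR values
# are assumptions, not confirmed package paths):
export NUTCH_HOME=/usr/lib/nutch
export NUTCH_CONF_DIR=/etc/nutch/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
```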

4. Quick sanity checks (no cluster)

Without a cluster you can still confirm that the test project loads and compiles:

./gradlew bigtop-tests:smoke-tests:nutch:tasks
./gradlew bigtop-tests:smoke-tests:nutch:compileTestGroovy

The full test suite will not pass without HADOOP_CONF_DIR and a running cluster.

5. What the smoke tests do

  • testNutchUsage – runs nutch with no arguments; expects exit 0 and usage output.
  • testNutchInjectSubcommand – runs nutch inject with no arguments; expects a non-zero exit and a usage/error message.
  • testNutchInjectAndReaddb – creates /user/root/nutch-smoke/urls/seed.txt on HDFS, runs nutch inject and nutch readdb -stats against HDFS paths, and asserts on the stats output.
  • testNutchGenerate – runs nutch generate with HDFS crawldb and segments paths, then verifies that at least one segment exists under the segments directory.

All tests use the deploy runtime (cluster mode) and HDFS only; there are no local-mode or /tmp-based crawl directories.
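The first two tests reduce to exit-code checks, which can be sketched as follows. Here expect_exit is a hypothetical helper, and true/false stand in for the real nutch and nutch inject invocations, which need an installed Nutch:

```shell
# Sketch of the exit-code conventions the first two smoke tests assert on.
expect_exit() {
  want="$1"; shift
  "$@" >/dev/null 2>&1
  got=$?
  if [ "$want" = zero ] && [ "$got" -eq 0 ]; then
    echo "PASS: $* exited 0"
  elif [ "$want" = nonzero ] && [ "$got" -ne 0 ]; then
    echo "PASS: $* exited $got"
  else
    echo "FAIL: $* exited $got (wanted $want)"
  fi
}

expect_exit zero true       # models: nutch (no args)        -> usage, exit 0
expect_exit nonzero false   # models: nutch inject (no args) -> usage/error, exit != 0
```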

@lewismc
Member Author

lewismc commented Mar 1, 2026

Testing based on the above guidance.

BUILD SUCCESSFUL in 10m 25s
1 actionable task: 1 executed

OS information

Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.4 LTS
Release:	24.04
Codename:	noble

@lewismc lewismc marked this pull request as draft March 1, 2026 20:10
@lewismc
Member Author

lewismc commented Mar 2, 2026

Current test failures are as follows:

./gradlew nutch-pkg-ind -POS=ubuntu-24.04 -Pdocker-run-option="--privileged" -Dbuildwithdeps=true && cd provisioner/docker && ./docker-hadoop.sh --create 3 --smoke-tests nutch
...

> Task :bigtop-tests:smoke-tests:nutch:test
Downloading https://repo.maven.apache.org/maven2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.pom to /tmp/gradle_download13039068898071309938bin
Downloading https://repo.maven.apache.org/maven2/javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.pom to /tmp/gradle_download6096608470298673698bin
Downloading https://repo.maven.apache.org/maven2/javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.jar to /tmp/gradle_download15367241625943263473bin
Downloading https://repo.maven.apache.org/maven2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar to /tmp/gradle_download18172895790539033857bin
Caching disabled for task ':bigtop-tests:smoke-tests:nutch:test' because:
  Build cache is disabled
Task ':bigtop-tests:smoke-tests:nutch:test' is not up-to-date because:
  No history is available.
Starting process 'Gradle Test Executor 2'. Working directory: /bigtop-home/bigtop-tests/smoke-tests/nutch Command: /usr/lib/jvm/java-11-openjdk-amd64/bin/java -Dawt.toolkit=sun.awt.X11.XToolkit -Dfile.separator=/ -Djava.awt.graphicsenv=sun.awt.X11GraphicsEnvironment -Djava.awt.printerjob=sun.print.PSPrinterJob -Djava.class.path=/root/.gradle/wrapper/dists/gradle-5.6.4-bin/c9880aa85176bf8c458862eb99f7e0a9/gradle-5.6.4/lib/gradle-launcher-5.6.4.jar -Djava.class.version=55.0 -Djava.home=/usr/lib/jvm/java-11-openjdk-amd64 -Djava.library.path=/usr/java/packages/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib -Djava.runtime.name=OpenJDK Runtime Environment -Djava.runtime.version=11.0.30+7-post-Ubuntu-1ubuntu124.04 -Djava.specification.maintenance.version=3 -Djava.specification.name=Java Platform API Specification -Djava.specification.vendor=Oracle Corporation -Djava.specification.version=11 -Djava.vendor=Ubuntu -Djava.vendor.url=https://ubuntu.com/ -Djava.vendor.url.bug=https://bugs.launchpad.net/ubuntu/+source/openjdk-lts -Djava.version=11.0.30 -Djava.version.date=2026-01-20 -Djava.vm.compressedOopsMode=Zero based -Djava.vm.info=mixed mode, sharing -Djava.vm.name=OpenJDK 64-Bit Server VM -Djava.vm.specification.name=Java Virtual Machine Specification -Djava.vm.specification.vendor=Oracle Corporation -Djava.vm.specification.version=11 -Djava.vm.vendor=Ubuntu -Djava.vm.version=11.0.30+7-post-Ubuntu-1ubuntu124.04 -Djdk.debug=release -Dlibrary.jansi.path=/root/.gradle/native/jansi/1.17.1/linux64 -Dline.separator=
 -Dorg.gradle.appname=gradlew -Dorg.gradle.native=false -Dos.arch=amd64 -Dos.name=Linux -Dos.version=6.14.0-1018-aws -Dpath.separator=: -Dsun.arch.data.model=64 -Dsun.boot.library.path=/usr/lib/jvm/java-11-openjdk-amd64/lib -Dsun.cpu.endian=little -Dsun.cpu.isalist -Dsun.io.unicode.encoding=UnicodeLittle -Dsun.java.command=org.gradle.launcher.daemon.bootstrap.GradleDaemon 5.6.4 -Dsun.java.launcher=SUN_STANDARD -Dsun.jnu.encoding=ANSI_X3.4-1968 -Dsun.management.compiler=HotSpot 64-Bit Tiered Compilers -Dsun.os.patch.level=unknown -Duser.dir=/bigtop-home -Duser.home=/root -Duser.name=root -Duser.timezone @/tmp/gradle-worker-classpath3267578986700153504txt -Xmx512m -Dfile.encoding=ANSI_X3.4-1968 -Djava.io.tmpdir=/tmp -Duser.country=US -Duser.language=en -Duser.variant -ea worker.org.gradle.process.internal.worker.GradleWorkerMain 'Gradle Test Executor 2'
Successfully started process 'Gradle Test Executor 2'
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.vmplugin.v7.Java7$1 (file:/root/.gradle/caches/modules-2/files-2.1/org.codehaus.groovy/groovy/2.5.4/86b94e2949bcff3a13b7ad200e4c5299b52ad994/groovy-2.5.4.jar) to constructor java.lang.invoke.MethodHandles$Lookup(java.lang.Class,int)
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.vmplugin.v7.Java7$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

org.apache.bigtop.itest.nutch.TestNutchSmoke > testNutchInjectAndReaddb FAILED
    java.lang.AssertionError: nutch inject (HDFS) failed: [26/03/01 22:34:22 INFO plugin.PluginManifestParser: Plugins: looking in: /tmp/hadoop-unjar10809354754474963427/classes/plugins, 26/03/01 22:34:23 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true], 26/03/01 22:34:23 INFO plugin.PluginRepository: Registered Plugins:, 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	HTTP Framework (lib-http), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Anchor Indexing Filter (index-anchor), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Tika Parser Plug-in (parse-tika), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Http Protocol Plug-in (protocol-http), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	SolrIndexWriter (indexer-solr), 26/03/01 22:34:23 INFO plugin.PluginRepository: Registered Extension-Points:, 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Publisher (org.apache.nutch.publisher.NutchPublisher), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Exchange (org.apache.nutch.exchange.Exchange), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Index Writer (org.apache.nutch.indexer.IndexWriter), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter), 26/03/01 22:34:23 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter), 26/03/01 22:34:23 INFO crawl.Injector: Injector: starting, 26/03/01 22:34:23 INFO crawl.Injector: Injector: crawlDb: /user/root/nutch-smoke/crawldb, 26/03/01 22:34:23 INFO crawl.Injector: Injector: urlDir: /user/root/nutch-smoke/urls, 26/03/01 22:34:23 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries., 26/03/01 22:34:24 INFO crawl.Injector: Injecting seed URL file hdfs://ee1644932a99.bigtop.apache.org:8020/user/root/nutch-smoke/urls/seed.txt, 26/03/01 22:34:24 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ee1644932a99.bigtop.apache.org/172.18.0.4:8032, 26/03/01 22:34:25 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1772404362998_0001, 26/03/01 22:34:33 INFO input.FileInputFormat: Total input files to process : 1, 26/03/01 22:34:33 INFO input.FileInputFormat: Total input files to process : 0, 26/03/01 22:34:33 INFO 
mapreduce.JobSubmitter: number of splits:1, 26/03/01 22:34:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1772404362998_0001, 26/03/01 22:34:33 INFO mapreduce.JobSubmitter: Executing with tokens: [], 26/03/01 22:34:33 INFO conf.Configuration: resource-types.xml not found, 26/03/01 22:34:33 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'., 26/03/01 22:34:34 INFO impl.YarnClientImpl: Submitted application application_1772404362998_0001, 26/03/01 22:34:34 INFO mapreduce.Job: The url to track the job: http://ee1644932a99.bigtop.apache.org:20888/proxy/application_1772404362998_0001/, 26/03/01 22:34:34 INFO mapreduce.Job: Running job: job_1772404362998_0001, 26/03/01 22:35:06 INFO mapreduce.Job: Job job_1772404362998_0001 running in uber mode : false, 26/03/01 22:35:06 INFO mapreduce.Job:  map 0% reduce 0%, 26/03/01 22:35:06 INFO mapreduce.Job: Job job_1772404362998_0001 failed with state FAILED due to: Application application_1772404362998_0001 failed 2 times due to AM Container for appattempt_1772404362998_0001_000002 exited with  exitCode: 1, Failing this attempt.Diagnostics: [2026-03-01 22:35:06.070]Exception from container-launch., Container id: container_1772404362998_0001_02_000001, Exit code: 1, , [2026-03-01 22:35:06.103]Container exited with a non-zero exit code 1. Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , [2026-03-01 22:35:06.103]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , For more detailed output, check the application tracking page: http://ee1644932a99:8088/cluster/app/application_1772404362998_0001 Then click on links to logs of each attempt., . Failing the application., 26/03/01 22:35:06 INFO mapreduce.Job: Counters: 0, 26/03/01 22:35:06 ERROR crawl.Injector: Injector job did not succeed, job id: job_1772404362998_0001, job status: FAILED, reason: Application application_1772404362998_0001 failed 2 times due to AM Container for appattempt_1772404362998_0001_000002 exited with  exitCode: 1, Failing this attempt.Diagnostics: [2026-03-01 22:35:06.070]Exception from container-launch., Container id: container_1772404362998_0001_02_000001, Exit code: 1, , [2026-03-01 22:35:06.103]Container exited with a non-zero exit code 1. Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , [2026-03-01 22:35:06.103]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , For more detailed output, check the application tracking page: http://ee1644932a99:8088/cluster/app/application_1772404362998_0001 Then click on links to logs of each attempt., . Failing the application., 26/03/01 22:35:06 ERROR crawl.Injector: Injector:, java.lang.RuntimeException: Injector job did not succeed, job id: job_1772404362998_0001, job status: FAILED, reason: Application application_1772404362998_0001 failed 2 times due to AM Container for appattempt_1772404362998_0001_000002 exited with  exitCode: 1, Failing this attempt.Diagnostics: [2026-03-01 22:35:06.070]Exception from container-launch., Container id: container_1772404362998_0001_02_000001, Exit code: 1, , [2026-03-01 22:35:06.103]Container exited with a non-zero exit code 1. Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , [2026-03-01 22:35:06.103]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , For more detailed output, check the application tracking page: http://ee1644932a99:8088/cluster/app/application_1772404362998_0001 Then click on links to logs of each attempt., . Failing the application., 	at org.apache.nutch.crawl.Injector.inject(Injector.java:495), 	at org.apache.nutch.crawl.Injector.run(Injector.java:631), 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82), 	at org.apache.nutch.crawl.Injector.main(Injector.java:595), 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method), 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62), 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), 	at java.base/java.lang.reflect.Method.invoke(Method.java:566), 	at org.apache.hadoop.util.RunJar.run(RunJar.java:328), 	at org.apache.hadoop.util.RunJar.main(RunJar.java:241)]
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert$assertTrue.callStatic(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:55)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:196)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:216)
        at org.apache.bigtop.itest.nutch.TestNutchSmoke.testNutchInjectAndReaddb(TestNutchSmoke.groovy:72)

org.apache.bigtop.itest.nutch.TestNutchSmoke > testNutchGenerate FAILED
    java.lang.AssertionError: nutch generate (HDFS) failed: [26/03/01 22:35:11 INFO plugin.PluginManifestParser: Plugins: looking in: /tmp/hadoop-unjar8823238033492348761/classes/plugins, 26/03/01 22:35:12 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true], 26/03/01 22:35:12 INFO plugin.PluginRepository: Registered Plugins:, 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	HTTP Framework (lib-http), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Anchor Indexing Filter (index-anchor), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Tika Parser Plug-in (parse-tika), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Http Protocol Plug-in (protocol-http), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	SolrIndexWriter (indexer-solr), 26/03/01 22:35:12 INFO plugin.PluginRepository: Registered Extension-Points:, 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter), 26/03/01 22:35:12 INFO plugin.PluginRepository: 
	HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Publisher (org.apache.nutch.publisher.NutchPublisher), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Exchange (org.apache.nutch.exchange.Exchange), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Index Writer (org.apache.nutch.indexer.IndexWriter), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter), 26/03/01 22:35:12 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter), 26/03/01 22:35:13 INFO crawl.Generator: Generator: starting, 26/03/01 22:35:13 INFO crawl.Generator: Generator: selecting best-scoring urls due for fetch., 26/03/01 22:35:13 INFO crawl.Generator: Generator: filtering: true, 26/03/01 22:35:13 INFO crawl.Generator: Generator: normalizing: true, 26/03/01 22:35:13 INFO crawl.Generator: Generator: topN: 1, 26/03/01 22:35:13 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ee1644932a99.bigtop.apache.org/172.18.0.4:8032, 26/03/01 22:35:14 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1772404362998_0002, 26/03/01 22:35:22 INFO input.FileInputFormat: Total input files to process : 0, 26/03/01 22:35:22 INFO mapreduce.JobSubmitter: number of splits:0, 26/03/01 22:35:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1772404362998_0002, 26/03/01 22:35:22 INFO mapreduce.JobSubmitter: Executing with 
tokens: [], 26/03/01 22:35:23 INFO conf.Configuration: resource-types.xml not found, 26/03/01 22:35:23 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'., 26/03/01 22:35:23 INFO impl.YarnClientImpl: Submitted application application_1772404362998_0002, 26/03/01 22:35:23 INFO mapreduce.Job: The url to track the job: http://ee1644932a99.bigtop.apache.org:20888/proxy/application_1772404362998_0002/, 26/03/01 22:35:23 INFO mapreduce.Job: Running job: job_1772404362998_0002, 26/03/01 22:35:50 INFO mapreduce.Job: Job job_1772404362998_0002 running in uber mode : false, 26/03/01 22:35:51 INFO mapreduce.Job:  map 0% reduce 0%, 26/03/01 22:35:51 INFO mapreduce.Job: Job job_1772404362998_0002 failed with state FAILED due to: Application application_1772404362998_0002 failed 2 times due to AM Container for appattempt_1772404362998_0002_000002 exited with  exitCode: 1, Failing this attempt.Diagnostics: [2026-03-01 22:35:50.197]Exception from container-launch., Container id: container_1772404362998_0002_02_000001, Exit code: 1, , [2026-03-01 22:35:50.229]Container exited with a non-zero exit code 1. Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , [2026-03-01 22:35:50.229]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , For more detailed output, check the application tracking page: http://ee1644932a99:8088/cluster/app/application_1772404362998_0002 Then click on links to logs of each attempt., . Failing the application., 26/03/01 22:35:51 INFO mapreduce.Job: Counters: 0, 26/03/01 22:35:51 ERROR crawl.Generator: Generator job did not succeed, job id: job_1772404362998_0002, job status: FAILED, reason: Application application_1772404362998_0002 failed 2 times due to AM Container for appattempt_1772404362998_0002_000002 exited with  exitCode: 1, Failing this attempt.Diagnostics: [2026-03-01 22:35:50.197]Exception from container-launch., Container id: container_1772404362998_0002_02_000001, Exit code: 1, , [2026-03-01 22:35:50.229]Container exited with a non-zero exit code 1. Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , [2026-03-01 22:35:50.229]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , For more detailed output, check the application tracking page: http://ee1644932a99:8088/cluster/app/application_1772404362998_0002 Then click on links to logs of each attempt., . Failing the application., 26/03/01 22:35:51 ERROR crawl.Generator: Generator:, java.lang.RuntimeException: Generator job did not succeed, job id: job_1772404362998_0002, job status: FAILED, reason: Application application_1772404362998_0002 failed 2 times due to AM Container for appattempt_1772404362998_0002_000002 exited with  exitCode: 1, Failing this attempt.Diagnostics: [2026-03-01 22:35:50.197]Exception from container-launch., Container id: container_1772404362998_0002_02_000001, Exit code: 1, , [2026-03-01 22:35:50.229]Container exited with a non-zero exit code 1. Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , [2026-03-01 22:35:50.229]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err., Last 4096 bytes of prelaunch.err :, Last 4096 bytes of stderr :, log4j:ERROR Could not find value for key log4j.appender.CLA, log4j:ERROR Could not instantiate appender named "CLA"., log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster)., log4j:WARN Please initialize the log4j system properly., log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info., , , For more detailed output, check the application tracking page: http://ee1644932a99:8088/cluster/app/application_1772404362998_0002 Then click on links to logs of each attempt., . Failing the application., 	at org.apache.nutch.crawl.Generator.generate(Generator.java:1012), 	at org.apache.nutch.crawl.Generator.run(Generator.java:1232), 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82), 	at org.apache.nutch.crawl.Generator.main(Generator.java:1179), 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method), 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62), 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), 	at java.base/java.lang.reflect.Method.invoke(Method.java:566), 	at org.apache.hadoop.util.RunJar.run(RunJar.java:328), 	at org.apache.hadoop.util.RunJar.main(RunJar.java:241)]
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert$assertTrue.callStatic(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:55)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:196)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:216)
        at org.apache.bigtop.itest.nutch.TestNutchSmoke.testNutchGenerate(TestNutchSmoke.groovy:83)

Gradle Test Executor 2 finished executing tests.

> Task :bigtop-tests:smoke-tests:nutch:test

4 tests completed, 2 failed
Finished generating test XML results (0.022 secs) into: /bigtop-home/bigtop-tests/smoke-tests/nutch/build/test-results/test
Generating HTML test report...
Finished generating test html results (0.032 secs) into: /bigtop-home/bigtop-tests/smoke-tests/nutch/build/reports/tests/test

> Task :bigtop-tests:smoke-tests:nutch:test FAILED
:bigtop-tests:smoke-tests:nutch:test (Thread[Daemon worker,5,main]) completed. Took 1 mins 52.7 secs.

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':bigtop-tests:smoke-tests:nutch:test'.
> There were failing tests. See the report at: file:///bigtop-home/bigtop-tests/smoke-tests/nutch/build/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 2m 14s
29 actionable tasks: 6 executed, 23 up-to-date
Stopped 1 worker daemon(s).
+ rm -rf buildSrc/build/test-results/binary
+ rm -rf /bigtop-home/.gradle
