NUTCH-2883 Provide means to run server as a persistent service in Docker container #691

Closed
18 changes: 18 additions & 0 deletions docker/.dockerfilelintrc
@@ -0,0 +1,18 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

rules:
  invalid_port: off
  missing_tag: off
77 changes: 72 additions & 5 deletions docker/Dockerfile
@@ -13,17 +13,41 @@
# See the License for the specific language governing permissions and
# limitations under the License.

FROM alpine:3.13
MAINTAINER Apache Nutch Committers <dev@nutch.apache.org>
# NOTE TO DEVELOPERS: Make sure this file passes linting tests
# by running https://github.com/replicatedhq/dockerfilelint

# BUILD_MODE can be either
# 0 == Nutch master branch source install with 'crawl' and 'nutch' scripts on PATH
# 1 == Same as mode 0 with addition of Nutch REST Server
# 2 == Same as mode 1 with addition of Nutch WebApp
ARG BUILD_MODE=0

FROM alpine:3.13 AS base

ARG SERVER_PORT=8081
ARG SERVER_HOST=0.0.0.0
ARG WEBAPP_PORT=8080

LABEL maintainer="Apache Nutch Developers <dev@nutch.apache.org>"
LABEL org.opencontainers.image.authors="Apache Nutch Developers <dev@nutch.apache.org>"
LABEL org.opencontainers.image.description="Docker image for running Apache Nutch, a highly extensible and scalable open source web crawler software project. Visit the project website at https://nutch.apache.org"
LABEL org.opencontainers.image.documentation="https://hub.docker.com/r/apache/nutch"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.source="https://raw.githubusercontent.com/apache/nutch/master/docker/Dockerfile"
LABEL org.opencontainers.image.title="Apache Nutch 1.x Docker Image"
LABEL org.opencontainers.image.url="https://hub.docker.com/r/apache/nutch"
LABEL org.opencontainers.image.vendor="Apache Nutch https://nutch.apache.org"

WORKDIR /root/

# Install dependencies
RUN apk update
RUN apk --no-cache add apache-ant bash git openjdk11
RUN apk --no-cache add apache-ant bash git openjdk11 supervisor

# Establish environment variables
RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
env NUTCH_HOME='/root/nutch_source/runtime/local'
ENV JAVA_HOME='/usr/lib/jvm/java-11-openjdk'
ENV NUTCH_HOME='/root/nutch_source/runtime/local'

# Checkout and build the Nutch master branch (1.x)
RUN git clone https://github.com/apache/nutch.git nutch_source && \
@@ -34,4 +58,47 @@ RUN git clone https://github.com/apache/nutch.git nutch_source && \

# Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl
RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/
RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/

FROM base AS branch-version-0

RUN echo "Nutch master branch source install with 'crawl' and 'nutch' scripts on PATH"

FROM base AS branch-version-1

RUN echo "Nutch master branch source install with 'crawl' and 'nutch' scripts on PATH and Nutch REST Server on $SERVER_HOST:$SERVER_PORT"

ENV SERVER_PORT=$SERVER_PORT
ENV SERVER_HOST=$SERVER_HOST

# Arrange necessary setup for supervisord
RUN mkdir -p /var/log/supervisord
COPY ./config/supervisord_startserver.conf /etc/supervisord.conf

# Expose port for server which can only be accessed if
# the same port is published when the container is run.
EXPOSE $SERVER_PORT

ENTRYPOINT [ "supervisord", "--nodaemon", "--configuration", "/etc/supervisord.conf" ]

FROM base AS branch-version-2

RUN echo "Nutch master branch source install with 'crawl' and 'nutch' scripts on PATH, Nutch REST Server on $SERVER_HOST:$SERVER_PORT and WebApp on this container port $WEBAPP_PORT"

ENV SERVER_PORT=$SERVER_PORT
ENV SERVER_HOST=$SERVER_HOST
ENV WEBAPP_PORT=$WEBAPP_PORT

# Arrange necessary setup for supervisord
RUN mkdir -p /var/log/supervisord
COPY ./config/supervisord_startserver_webapp.conf /etc/supervisord.conf

# Expose ports for server and webapp, these can only be accessed if
# the same ports are published when the container is run.
EXPOSE $SERVER_PORT
EXPOSE $WEBAPP_PORT

ENTRYPOINT [ "supervisord", "--nodaemon", "--configuration", "/etc/supervisord.conf" ]

FROM branch-version-$BUILD_MODE AS final
RUN echo "Successfully built image, see https://s.apache.org/m5933 for guidance on running a container instance."
71 changes: 61 additions & 10 deletions docker/README.md
@@ -45,21 +45,72 @@ The easiest way to do this:

2. Build from files in this directory:

$(boot2docker shellinit | grep export)
docker build -t apache/nutch .
There are three build **modes** which can be activated using the `--build-arg BUILD_MODE=0` flag. All values used here are defaults.
* 0 == Nutch master branch source install with `crawl` and `nutch` scripts on `$PATH`
* 1 == Same as mode 0 with addition of **Nutch REST Server**; additional build args `--build-arg SERVER_PORT=8081` and `--build-arg SERVER_HOST=0.0.0.0`
* 2 == Same as mode 1 with addition of **Nutch WebApp**; additional build args `--build-arg WEBAPP_PORT=8080`

For example, to install the Nutch master branch and run both the Nutch REST server and the webapp, run the following:

```bash
$(boot2docker shellinit | grep export) #may not be necessary
docker build -t apache/nutch . --build-arg BUILD_MODE=2 --build-arg SERVER_PORT=8081 --build-arg SERVER_HOST=0.0.0.0 --build-arg WEBAPP_PORT=8080
```
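
The three modes differ only in the `BUILD_MODE` value (plus the optional port and host arguments), so the build command can be assembled by a small helper. The following is a minimal sketch, not part of this change: the function name `nutch_build_cmd` is hypothetical, and it only prints the command so that it can be inspected before being run.

```bash
# Hypothetical helper (an assumption, not shipped with Nutch): print the
# docker build command for a given BUILD_MODE, using the image tag from
# the example above. Port and host build args are left at their defaults.
nutch_build_cmd() {
  nbc_mode="${1:-0}"   # default mode is 0, matching the Dockerfile
  case "$nbc_mode" in
    0|1|2) ;;
    *) echo "BUILD_MODE must be 0, 1 or 2" >&2; return 1 ;;
  esac
  echo "docker build -t apache/nutch . --build-arg BUILD_MODE=$nbc_mode"
}

nutch_build_cmd 1
# prints: docker build -t apache/nutch . --build-arg BUILD_MODE=1
```

Rejecting values other than 0, 1 or 2 up front avoids a less obvious failure later, since the Dockerfile's final stage is selected by name from `BUILD_MODE`.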

## Usage

Start docker
If not already running, start docker
```bash
boot2docker up
$(boot2docker shellinit | grep export)
```

Run a container

```bash
docker run -t -i -d -p 8080:8080 -p 8081:8081 --name nutchcontainer apache/nutch
c5401810e50a606f43256b4b24602443508bd9badcf2b7493bd97839834571fc

docker logs c5401810e50a606f43256b4b24602443508bd9badcf2b7493bd97839834571fc
2021-06-29 19:14:32,922 CRIT Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.
2021-06-29 19:14:32,925 INFO supervisord started with pid 1
2021-06-29 19:14:33,929 INFO spawned: 'nutchserver' with pid 8
2021-06-29 19:14:33,932 INFO spawned: 'nutchwebapp' with pid 9
2021-06-29 19:14:36,012 INFO success: nutchserver entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2021-06-29 19:14:36,012 INFO success: nutchwebapp entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
```

You can now access the webapp at `http://localhost:8080` and interact with the REST API on port 8081, e.g.

```bash
curl http://localhost:8081/admin
{"startDate":1625118207995,"configuration":["default"],"jobs":[],"runningJobs":[]}
```
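
The services may take a moment to accept connections after the container starts, so a script that calls the REST API right after `docker run` may want to poll first. Below is a minimal sketch of such a retry loop; the helper name `wait_for` and its arguments are assumptions made for illustration.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# usage: wait_for <attempts> <delay_seconds> <command> [args...]
wait_for() {
  wf_attempts="$1"; wf_delay="$2"; shift 2
  wf_try=1
  while [ "$wf_try" -le "$wf_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0            # the command succeeded
    fi
    wf_try=$((wf_try + 1))
    sleep "$wf_delay"     # pause before the next attempt
  done
  return 1                # every attempt failed
}
```

For instance, `wait_for 30 1 curl -fsS http://localhost:8081/admin` would poll the REST server's `/admin` endpoint roughly once per second for up to 30 attempts, assuming the default `SERVER_PORT` of 8081 has been published.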

Attach to the container

boot2docker up
$(boot2docker shellinit | grep export)
```bash
docker exec -it c5401810e50a606f43256b4b24602443508bd9badcf2b7493bd97839834571fc /bin/bash
```

View supervisord logs
```bash
cat /tmp/supervisord.log
2021-06-29 19:14:32,922 CRIT Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.
2021-06-29 19:14:32,925 INFO supervisord started with pid 1
2021-06-29 19:14:33,929 INFO spawned: 'nutchserver' with pid 8
2021-06-29 19:14:33,932 INFO spawned: 'nutchwebapp' with pid 9
2021-06-29 19:14:36,012 INFO success: nutchserver entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2021-06-29 19:14:36,012 INFO success: nutchwebapp entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
```

Start up an image and attach to it
View supervisord subprocess logs

docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
docker attach --sig-proxy=false nutchcontainer
```bash
ls /var/log/supervisord/
nutchserver_stderr.log nutchserver_stdout.log nutchwebapp_stderr.log nutchwebapp_stdout.log
```
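
Each supervised program writes to `/var/log/supervisord/<program>_<stream>.log`, so a log path can be derived from the program name. The sketch below assumes that naming pattern stays as configured in this change; `nutch_log_path` is a hypothetical helper name.

```bash
# Hypothetical helper: build the path of a supervisord child log, following
# the /var/log/supervisord/<program>_<stream>.log pattern from the configs.
nutch_log_path() {
  nlp_program="$1"            # e.g. nutchserver or nutchwebapp
  nlp_stream="${2:-stdout}"   # stdout (default) or stderr
  case "$nlp_stream" in
    stdout|stderr) ;;
    *) echo "stream must be stdout or stderr" >&2; return 1 ;;
  esac
  echo "/var/log/supervisord/${nlp_program}_${nlp_stream}.log"
}
```

From the host, something like `docker exec nutchcontainer tail -n 50 "$(nutch_log_path nutchserver stderr)"` would then show the most recent REST server errors without attaching a shell.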

Nutch is located in ~/nutch and is almost ready to run.
You will need to set seed URLs and update the configuration with your crawler's Agent Name.
Nutch is located in `$NUTCH_HOME` and is almost ready to run.
You will need to set seed URLs and update the `http.agent.name` configuration property in `$NUTCH_HOME/conf/nutch-site.xml` with your crawler's Agent Name.
For additional "getting started" information, check out the [Nutch Tutorial](https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial).
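
The two preparation steps above (seed URLs and `http.agent.name`) can be scripted. This is a minimal sketch under stated assumptions: the helper name `nutch_bootstrap`, the `urls/seed.txt` location and the agent name are illustrative choices, and the function overwrites any existing `nutch-site.xml` rather than merging into it.

```bash
# Hypothetical helper: write a seed list and a minimal nutch-site.xml.
# usage: nutch_bootstrap <conf_dir> <urls_dir> <agent_name> <seed_url>...
# NOTE: overwrites <conf_dir>/nutch-site.xml; the agent name is not XML-escaped.
nutch_bootstrap() {
  nb_conf="$1"; nb_urls="$2"; nb_agent="$3"; shift 3
  mkdir -p "$nb_conf" "$nb_urls"
  printf '%s\n' "$@" > "$nb_urls/seed.txt"   # one seed URL per line
  cat > "$nb_conf/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$nb_agent</value>
  </property>
</configuration>
EOF
}
```

Inside the container this might be invoked as `nutch_bootstrap "$NUTCH_HOME/conf" "$NUTCH_HOME/urls" MyNutchCrawler https://nutch.apache.org/`, after which the seed directory can be passed to the `crawl` script.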
47 changes: 47 additions & 0 deletions docker/config/supervisord_startserver.conf
@@ -0,0 +1,47 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[supervisord]
childlogdir=/var/log/supervisord/
logfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10 ; (num of main logfile rotation backups;default 10)
loglevel=info ; (log level;default info; others: debug,warn,trace)
minfds=1024 ; (min. avail startup file descriptors;default 1024)
minprocs=200 ; (min. avail process descriptors;default 200)
nodaemon=false ; (start in foreground if true;default false)
pidfile=/tmp/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
user=root

[program:nutchserver]
autorestart=true
autostart=true
command=nutch startserver -port %(ENV_SERVER_PORT)s -host %(ENV_SERVER_HOST)s
environment=PATH=/usr/local/bin:%(ENV_PATH)s
process_name=%(program_name)s
numprocs=1
#redirect_stderr=true
startsecs=0
stderr_capture_maxbytes=10MB
stderr_logfile=/var/log/supervisord/%(program_name)s_stderr.log
stderr_logfile_backups=5
stderr_logfile_maxbytes=10MB
#stderr_syslog=
stdout_capture_maxbytes=10MB
stdout_logfile=/var/log/supervisord/%(program_name)s_stdout.log
stdout_logfile_backups=5
stdout_logfile_maxbytes=10MB
#stdout_syslog=
user=root
68 changes: 68 additions & 0 deletions docker/config/supervisord_startserver_webapp.conf
@@ -0,0 +1,68 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[supervisord]
childlogdir=/var/log/supervisord/
logfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10 ; (num of main logfile rotation backups;default 10)
loglevel=info ; (log level;default info; others: debug,warn,trace)
minfds=1024 ; (min. avail startup file descriptors;default 1024)
minprocs=200 ; (min. avail process descriptors;default 200)
nodaemon=false ; (start in foreground if true;default false)
pidfile=/tmp/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
user=root

[program:nutchserver]
autorestart=true
autostart=true
command=nutch startserver -port %(ENV_SERVER_PORT)s -host %(ENV_SERVER_HOST)s
environment=PATH=/usr/local/bin:%(ENV_PATH)s
process_name=%(program_name)s
numprocs=1
#redirect_stderr=true
startsecs=0
stderr_capture_maxbytes=10MB
stderr_logfile=/var/log/supervisord/%(program_name)s_stderr.log
stderr_logfile_backups=5
stderr_logfile_maxbytes=10MB
#stderr_syslog=
stdout_capture_maxbytes=10MB
stdout_logfile=/var/log/supervisord/%(program_name)s_stdout.log
stdout_logfile_backups=5
stdout_logfile_maxbytes=10MB
#stdout_syslog=
user=root

[program:nutchwebapp]
autorestart=true
autostart=true
command=nutch webapp -port %(ENV_WEBAPP_PORT)s
environment=PATH=/usr/local/bin:%(ENV_PATH)s
process_name=%(program_name)s
numprocs=1
#redirect_stderr=true
startsecs=0
stderr_capture_maxbytes=10MB
stderr_logfile=/var/log/supervisord/%(program_name)s_stderr.log
stderr_logfile_backups=5
stderr_logfile_maxbytes=10MB
#stderr_syslog=
stdout_capture_maxbytes=10MB
stdout_logfile=/var/log/supervisord/%(program_name)s_stdout.log
stdout_logfile_backups=5
stdout_logfile_maxbytes=10MB
#stdout_syslog=
user=root