
Enable pyspark console in docker container #20

Closed
8 changes: 6 additions & 2 deletions Dockerfile
@@ -11,9 +11,10 @@ LABEL website="http://archivesunleashed.org/"
ARG SPARK_VERSION=2.4.5

# Git and Wget
-RUN apk add --update \
+RUN apk --no-cache --virtual build-dependencies add --update \
     git \
-    wget
+    wget \
+    && apk add --update python
Member:
Any reason this line can't just be python?

@sepastian (Contributor, Author), Feb 25, 2020:
Both git and wget are installed into the virtual package build-dependencies, which gets deleted at the end of the build. To keep python out of this package, a separate add without --virtual is required.

Actually, --update is not required again here.
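
The virtual-package pattern described above can be sketched as follows. This is a minimal illustration of the Alpine `apk` behavior being discussed, not the exact Dockerfile from this PR:

```dockerfile
# Build-only tools go into a named virtual package; python is installed
# separately so it survives the cleanup step.
RUN apk add --no-cache --virtual build-dependencies git wget \
    && apk add --no-cache python

# ... git and wget are used during the image build ...

# Deleting the virtual package removes git and wget, but python,
# installed outside the virtual group, remains in the final image.
RUN apk del build-dependencies
```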

Member:
Ah, gotcha!

Member:
(2/2) Installing python2 (2.7.16-r2)

We should go with Python 3 here. Any reason we're using Python 2?


# Sample resources
RUN git clone https://github.com/archivesunleashed/aut-resources.git
@@ -31,4 +32,7 @@ RUN mkdir /spark \
&& tar -xf "/tmp/spark-$SPARK_VERSION-bin-hadoop2.7.tgz" -C /spark --strip-components=1 \
&& rm "/tmp/spark-$SPARK_VERSION-bin-hadoop2.7.tgz"

# Cleanup package manager
RUN apk del build-dependencies

CMD /spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.50.1-SNAPSHOT"
16 changes: 13 additions & 3 deletions README.md
@@ -45,13 +45,13 @@ You can also build this Docker image locally with the following steps:

### Overrides

-You can add any Spark flags to the build if you need to.
+You can add any Spark flags when starting the container, if you need to.

```
$ docker run --rm -it archivesunleashed/docker-aut:0.17.0 /spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.17.0" --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s
```

-Once the build finishes, you should see:
+Once the container has started, you should see:

```bash
$ docker run --rm -it aut
@@ -64,7 +64,7 @@ Welcome to
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
@@ -73,6 +73,16 @@ scala>

```

### Python - PySpark

You can start a Python shell (pyspark) with the following command:

```shell
docker run --rm -it -v "$(pwd)/your/datadir:/data" aut /spark/bin/pyspark --py-files /aut/target/aut.zip --jars /aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar
```

See [the official documentation](https://github.com/archivesunleashed/aut-docs/tree/master/current#the-archives-unleashed-toolkit-latest-documentation) for usage examples in Python.
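
Once the pyspark shell is up inside the container, a session might look roughly like the sketch below. This is not runnable outside the container; the `WebArchive` entry point, its method names, and the `/data` path are assumptions drawn from the AUT documentation linked above, not verified against this exact image:

```python
# Sketch only: assumes the aut Python package shipped in /aut/target/aut.zip
# exposes a WebArchive entry point, as described in the AUT documentation.
from aut import WebArchive

# sc and sqlContext are created automatically by the pyspark shell.
# "/data" is the directory mounted via -v in the docker run command above.
archive = WebArchive(sc, sqlContext, "/data")

# Count the pages in the mounted web archive collection
# (method name assumed from the AUT docs).
archive.webpages().count()
```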

## Example

When the image is running, you will be brought to the Spark Shell interface. Try running the following command.