Clone this repository to easily set up a local PySpark environment for interacting with data on Amazon S3.
It was inspired by a post on r/dataengineering: "(RANT) I think I'll die trying to setup and run Spark with Python in my local environment".
Note: The following commands require Docker. Optionally, you can also use Jupyter, VS Code, and AWS credentials.
- Build the image:

```sh
docker build -t local-spark .
```
- Start a shell in the container:

```sh
docker run -it local-spark
```
Now that you're running in the container, you can set AWS credentials to access S3 and then run `spark-shell` or `pyspark`.
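As a minimal sketch, here is what reading data from S3 inside `pyspark` might look like. The bucket and path are placeholders, and it assumes your AWS credentials are available through the default credentials chain (for example, exported as `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables before launching):

```python
# Inside pyspark, a SparkSession is already available as `spark`.
# "my-bucket" and the path are placeholders -- point them at data you own.
# Depending on how the image's S3 connector is configured, you may need
# the s3a:// scheme instead of s3://.
df = spark.read.parquet("s3://my-bucket/path/to/data/")

# Basic sanity checks on the loaded DataFrame.
df.printSchema()
df.show(5)
```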
You can also spin up a Jupyter server if you want to connect a notebook:

```sh
docker run --rm -it -p 8888:8888 local-spark -c "jupyter server --ip='*'"
```
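On startup, Jupyter prints a URL containing an access token; open it in your browser, or point a local notebook client (e.g., VS Code) at `http://localhost:8888`.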
If you want, you can specify a different EMR release version by providing a `--build-arg`. For example, to get Spark 3.4.0, use EMR 6.12.0:
```sh
docker build --build-arg EMR_RELEASE=6.12.0 -t local-spark:emr-6.12.0 .
docker run -it local-spark:emr-6.12.0
```
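To confirm the container has the Spark version you expect, a quick check inside `pyspark`:

```python
# Inside pyspark: print the Spark version bundled with the EMR release.
# For EMR 6.12.0 this should report 3.4.0.
print(spark.version)
```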