Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

PySpark / Jupyter Notebook Demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks. For complete information on this project, see the related blog post, "Getting Started with PySpark for Big Data Analytics, using Jupyter Notebooks and Docker".

Architecture

Set-up

  1. Clone this project from GitHub with git clone
  2. Create the $HOME/data/postgre directory for PostgreSQL files
  3. Deploy the Docker stack: docker stack deploy -c stack.yml pyspark
  4. Download 'BreadBasket_DMS.csv' from Kaggle to the work/ directory
  5. From a Jupyter terminal, install the Psycopg PostgreSQL adapter for Python: pip install psycopg2 psycopg2-binary
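
Steps 2 and 3 above can also be scripted. The following is a minimal sketch, not part of the project: the helper names `ensure_data_dir` and `deploy_command` are hypothetical, while the directory, stack file, and stack name are taken from the steps above.

```python
from pathlib import Path
from typing import List


def ensure_data_dir(home: Path) -> Path:
    """Create <home>/data/postgre (step 2) if it does not exist yet."""
    data_dir = home / "data" / "postgre"
    # parents=True creates <home>/data as well; exist_ok=True makes reruns safe.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


def deploy_command(stack_file: str = "stack.yml", stack_name: str = "pyspark") -> List[str]:
    """Build the `docker stack deploy` command from step 3 as an argument list."""
    return ["docker", "stack", "deploy", "-c", stack_file, stack_name]
```

To run it for real, call `ensure_data_dir(Path.home())` and pass `deploy_command()` to `subprocess.run(..., check=True)` from the project directory. This assumes Docker is installed and swarm mode is enabled, since `docker stack deploy` requires it.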

Demo

From a Jupyter terminal window:

  1. Sample Python script: python ./01_simple_script.py
  2. Sample PySpark script: $SPARK_HOME/bin/spark-submit 02_bakery_dataframes.py
  3. Load PostgreSQL sample data: python ./03_load_sql.py
  4. Sample Jupyter Notebook: open 04_pyspark_demo_notebook.ipynb from Jupyter Console
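
The first three steps can be collected into one helper for repeat runs. This is a minimal sketch under stated assumptions: `demo_commands` is a hypothetical helper, not part of the project, and `/usr/local/spark` is used as the fallback for `SPARK_HOME` only because that path appears in the commands later in this README.

```python
import os
from typing import List, Mapping, Optional


def demo_commands(env: Optional[Mapping[str, str]] = None) -> List[List[str]]:
    """Return the commands for demo steps 1-3, in order, as argument lists."""
    env = os.environ if env is None else env
    # Assumption: fall back to /usr/local/spark when SPARK_HOME is not set.
    spark_home = env.get("SPARK_HOME", "/usr/local/spark")
    spark_submit = f"{spark_home}/bin/spark-submit"
    return [
        ["python", "./01_simple_script.py"],
        [spark_submit, "02_bakery_dataframes.py"],
        ["python", "./03_load_sql.py"],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` from the directory that holds the scripts, so that a failing step stops the sequence before the next one starts.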

Jupyter Notebook

Misc. Commands

docker pull jupyter/all-spark-notebook:latest
docker stack ps pyspark --no-trunc
docker logs $(docker ps | grep _pyspark | awk '{print $NF}') --follow

docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

apt-get update -y && apt-get upgrade -y
apt-get install htop
htop --sort-key help
htop --sort-key

# Optional: run from a Jupyter terminal if the JAR is not already set via spark.driver.extraClassPath in the SparkSession
cp postgresql-42.2.5.jar /usr/local/spark/jars
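
The alternative to copying the JAR is to point the driver at it when the session is created. The following is a minimal sketch, assuming the JAR sits in the current working directory; the helper names, the app name `pyspark_demo`, and the JAR location are illustrative assumptions and are not taken from the project's scripts.

```python
from typing import Dict


def driver_conf(jar_path: str = "postgresql-42.2.5.jar") -> Dict[str, str]:
    """Spark settings that put the PostgreSQL JDBC driver on the driver classpath."""
    return {"spark.driver.extraClassPath": jar_path}


def build_session(conf: Dict[str, str], app_name: str = "pyspark_demo"):
    """Create a SparkSession with the given settings applied."""
    # Imported here so the module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook this would be used as `spark = build_session(driver_conf())`. The classpath setting is likely to take effect only if no Spark session is already running in the kernel, because `getOrCreate()` returns an existing session when there is one.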

References