Source for my "Introduction to Apache Spark using Frameless" talk
The slides are written in Markdown and must be translated to HTML + Reveal.js using Pandoc. The following executables must be present in your shell's PATH to build the slides:
- pandoc (version 2.3.1 or better)
- lessc (version 3.0.4 or better), for LESS stylesheet translation
- git, to check out the Reveal.js repository
To build the slides, just run ./build.sh. It builds a standalone slides.html file in the top-level directory.
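Before running the build, it can help to confirm that the required tools are visible on your PATH. The following is a minimal sketch, not part of this repository: the `missing_tools` helper is a hypothetical name introduced here, and the script only reports installed versions without comparing them against the minimums listed above.

```shell
#!/usr/bin/env bash
# Sketch only: missing_tools is a hypothetical helper, not part of build.sh.

# Print the name of every argument that is NOT found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools pandoc lessc git)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names are joined onto one line.
  echo "Missing from PATH:" $missing
else
  # Show the installed versions; compare them by eye with the minimums above.
  pandoc --version | head -n 1
  lessc --version
  git --version
fi
```

If nothing is reported missing and the versions look recent enough, ./build.sh should have what it needs.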
The Databricks notebooks
The notebooks folder contains the individual notebooks used during the
presentation. You'll need all three. If you want, you can import them
individually. Or, you can simply download and import the
file in this directory; it contains all three notebooks.
For information on how to import notebooks into Databricks, including Databricks Community Edition, see https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook
There are three notebooks:
- Defs.scala: definitions shared across the other two notebooks (each of which invokes it).
- 00-Create-Data-Files.scala, which downloads a data file of tweets from early 2018 and also parses a Kafka stream of current tweets, producing the new data files needed by the presentation. Follow the instructions in this notebook to create local copies of the data. But also see the note below about using existing data.
- 01-Presentation.scala is the hands-on notebook part of the presentation.
I ran the notebooks in Databricks, with:
- Spark 2.3
- Scala 2.11
You can use the 00-Create-Data-Files.scala notebook to download and create the data.
However, if you'd prefer to use existing data, you can also just get existing
Parquet files from the following locations:
- Download those zip files.
- Unzip them.
- Upload them to your own S3 bucket.
- In a Databricks workspace (such as Databricks Community Edition), mount your S3 bucket to DBFS.
- Update the paths (in the Defs.scala notebook) to point to your S3 bucket.
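The unzip and upload steps above can be sketched from the command line, assuming the AWS CLI (`aws`) and `unzip` are installed and your AWS credentials are already configured. The bucket name, prefix, and download directory below are hypothetical placeholders, as are the `s3_dest` and `upload_zip` helpers. Mounting the bucket to DBFS and editing the paths still happen inside Databricks.

```shell
#!/usr/bin/env bash
# Sketch only: BUCKET, PREFIX and ZIP_DIR are hypothetical placeholders.
BUCKET="${BUCKET:-my-bucket}"
PREFIX="${PREFIX:-frameless-talk}"
ZIP_DIR="${ZIP_DIR:-$HOME/Downloads}"

# Build the S3 destination that the unzipped data is copied under.
s3_dest() {
  echo "s3://$BUCKET/$PREFIX/"
}

# Unzip one archive into a scratch directory, then copy its contents to S3,
# keeping whatever directory layout the archive contains.
upload_zip() {
  workdir="$(mktemp -d)"
  unzip -q "$1" -d "$workdir"
  aws s3 cp "$workdir" "$(s3_dest)" --recursive
  rm -rf "$workdir"
}

# Upload only when explicitly requested, so sourcing this file has no side effects.
if [ "${RUN_UPLOAD:-0}" = "1" ]; then
  for zip in "$ZIP_DIR"/*.zip; do
    [ -e "$zip" ] || continue   # skip when the glob matched nothing
    upload_zip "$zip"
  done
fi
```

Run it as `RUN_UPLOAD=1 BUCKET=your-bucket ./upload.sh` (the script name is also a placeholder), then use the resulting s3:// location when mounting the bucket and updating the paths.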
Feel free to drop me an email (email@example.com) if you need help.