RDDs, DataFrames and Datasets in Apache Spark
This repo contains the source for my 2016 Northeast Scala Symposium talk, RDDs, DataFrames and Datasets in Apache Spark, which I updated (a little) for Apache Spark 2.0 and gave again at a Philly Area Scala Enthusiasts (PHASE) Meetup in June, 2016 (http://www.meetup.com/scala-phase/events/229870987/).
Slides: You can see the actual deck, in action, here.
Video: The talk at the Northeast Scala Symposium was recorded. The video is here.
The Git tag nescala captures the code and presentation as given at
the Northeast Scala Symposium.
The Git tag phase captures the code and presentation as given at the PHASE
Meetup.
The presentation is in the presentation directory. The demo notebooks
are in the demo directory, in runnable source form. Also in demo is
notebooks.dbc, which can be loaded directly into Databricks.
Feel free to sign up for the free
Databricks Community Edition and try them yourself.
The presentation is built with Reveal.js, augmented with some custom
build code. To build the presentation, run
rake from the top level.
The presentation will end up in the dist directory.
Preparing to build the slides
- Install NodeJS and npm.
- Install the LESS preprocessor:
npm install -g less
- Install Bower:
npm install -g bower
- Make sure you have a version of Ruby 2 installed. (This stuff has been tested with 2.2.3.)
- Install Bundler:
gem install bundler
- Use Bundler to install the required Ruby gems:
bundle install
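Before moving on, it can help to confirm that the tools from the list above are actually on your PATH. The following is a minimal sketch, not part of this repo. It assumes the executables are named node, npm, lessc, bower, ruby, gem and bundle, which may differ on your system.

```shell
#!/bin/sh
# Report which of the given tools can be found on the PATH.
# Assumption: the tool names passed below match the executables that the
# preparation steps install; adjust the list if your platform differs.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  # The exit status is the number of missing tools (0 means all present).
  return "$missing"
}

check_tools node npm lessc bower ruby gem bundle ||
  echo "Install the missing tools before running rake."
```

The function only looks for the executables. It does not check versions, so a Ruby older than 2 would still be reported as found.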
Building the slides
Once you've successfully completed preparation, building the slide deck is as simple as:
rake
Rake will build
dist/index.html, a Reveal.js slide show. Just
open the file in your browser, and away you go.
Installing the slide show
If you want to install the slide show somewhere (e.g., a web server), copy
the dist directory (presumably renaming it).
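As a rough illustration of that copy step, here is a small sketch. The install_slides function and the destination path in the example are hypothetical, not something this repo provides. The sketch refuses to overwrite an existing destination, because cp -R would otherwise nest the copy inside it.

```shell
#!/bin/sh
# Copy a built slide deck to a new location under a new name.
# Assumptions: $1 is the dist directory produced by rake; $2 is a
# destination path of your choosing (hypothetical in the example below).
install_slides() {
  src=$1
  dest=$2
  if [ ! -f "$src/index.html" ]; then
    echo "No index.html in $src; run rake first." >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    echo "$dest already exists; choose another name." >&2
    return 1
  fi
  # Create the parent directory, then copy dist to the new name.
  mkdir -p "$(dirname "$dest")"
  cp -R "$src" "$dest"
}

# Example (hypothetical destination):
# install_slides dist /var/www/html/spark-talk
```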
To create PDF versions of the slides, open the HTML slides in Chrome or
Chromium. Then, tack
?print-pdf on the end of the URL, and print the result.
See the Reveal.js documentation for details.
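If you open the deck from the local file system, the URL to print looks like the output of the sketch below. The print_pdf_url helper is hypothetical, and it assumes the path contains no characters that need URL-encoding.

```shell
#!/bin/sh
# Print the file:// URL for a built deck with ?print-pdf appended.
# Assumption: the path has no spaces or other characters that would
# need URL-encoding.
print_pdf_url() {
  # Resolve the containing directory to an absolute path.
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf 'file://%s/%s?print-pdf\n' "$dir" "$(basename "$1")"
}

# Only print a URL if the deck has already been built.
if [ -f dist/index.html ]; then
  print_pdf_url dist/index.html
fi
```

Paste the printed URL into Chrome or Chromium and print from there, as described above.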