Apache Spark Workshop @ Scala World 2017
Importing the notebooks
You can import the notebooks directly into Databricks Community Edition, Databricks' free Spark-as-a-Service product. The notebooks are packaged as a DBC archive file, available at this URL: https://github.com/bmc/scala-world-2017-spark-workshop/blob/master/notebooks.dbc?raw=true.
Start by logging into your Community Edition account. For best results, use Firefox or Chrome. Then:

1. In the Workspace area, bring up the Import dialog (for instance, by right-clicking a folder and selecting Import).
2. In the popup, click URL.
3. Paste the DBC URL (from above) into the text box.
4. Wait for the import to finish. This can take a minute or so.
Running the notebooks
Create a cluster
Start by creating a Spark cluster. (If you already have a running cluster, skip this bit. You can only have one cluster per Community Edition account.)
Name your cluster. In Community Edition, since you can only have one running cluster, the name doesn't matter too much.
Select the cluster type. I recommend the latest runtime (3.2, 3.3, etc.) and Scala 2.11.
Open the first notebook
Select the Home button in the sidebar, and select the "Scala World 2017 Spark Workshop Presentation Notebooks" folder. Then, click on the notebook you want to run (e.g., "01 ETL").
You can run individual cells with Shift+Enter, which runs the cell and moves the cursor to the next cell. You can use Ctrl+Enter (all platforms) to run the cell and leave the cursor where it is.
The first time you try to run a non-Markdown cell, Databricks will prompt you to attach the notebook to your cluster.
The data files used by the notebooks are at the following locations. You can't run the notebooks without them.
You can download these raw files, copy them into your own S3 bucket, and mount the S3 bucket to DBFS.
You'll need to change the paths in the notebooks: they all start with /mnt/bmc, and you'll need to change that prefix to correspond to the mount point you choose.
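For instance, here's a minimal sketch of mounting a bucket from a Scala notebook cell with dbutils.fs.mount. The bucket name, credentials, and the /mnt/crime-data mount point are placeholder assumptions; substitute your own values.

    // Mount an S3 bucket to DBFS from a Scala notebook cell.
    // Everything in angle brackets is a placeholder.
    val accessKey  = "<your-aws-access-key>"
    // Secret keys can contain "/" characters, which must be URL-encoded.
    val secretKey  = java.net.URLEncoder.encode("<your-aws-secret-key>", "UTF-8")
    val bucket     = "<your-s3-bucket>"
    val mountPoint = "/mnt/crime-data"  // use whatever mount point you like

    dbutils.fs.mount(s"s3a://$accessKey:$secretKey@$bucket", mountPoint)

    // Sanity check: list the files you copied into the bucket.
    display(dbutils.fs.ls(mountPoint))

With that mount point, you'd change each /mnt/bmc prefix in the notebooks to /mnt/crime-data.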
Creating compatible data
If you want to get updated data or data for different years, go to https://data.police.uk/ and select the years and police forces you want. Do not select outcomes or stop-and-search data.
Unpack the resulting zip file; it expands into separate directories for each year-month combination. For instance, if you download 2015 data, you'll get directories 2015-01, 2015-02, etc. Each directory contains a number of CSV files.
To combine all the files (e.g., for an entire year) into a single CSV file, use the combine.py script. Run it like this:

    python combine.py <output-file> directory [directory] ...

For example, to combine all of the 2015 data:

    python combine.py uk-crime-data-2015.csv 2015*
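Once the combined file is copied into your mounted bucket, reading it back in a Scala notebook cell should look something like this sketch (the /mnt/crime-data prefix and file name are the assumptions from the mount example above):

    // Read the combined CSV into a DataFrame. The first row holds the column names.
    val crimes = spark.read
      .option("header", "true")
      .option("inferSchema", "true")  // let Spark guess the column types
      .csv("/mnt/crime-data/uk-crime-data-2015.csv")

    crimes.printSchema()

Spark can also read the monthly directories directly via a glob (e.g., spark.read.option("header", "true").csv("/mnt/crime-data/2015-*")), but a single combined file keeps the notebook paths simple.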
Drop me an email (email@example.com) or open an issue if you're having problems. I'm happy to help.