# Lab 3: Explore Big Data using Azure Databricks

In this lab you will use Azure Databricks to explore the New York Taxi data files you saved in your data lake in Lab 2. Using a Databricks notebook you will connect to the data lake and query taxi ride details.

The estimated time to complete this lab is 20 minutes.

## Lab Architecture

Build an Azure Databricks notebook to explore the data files you saved in your data lake in the previous exercise. You will use Python and SQL commands to open a connection to your data lake and query data from data files.

Azure Databricks (dbx) is a Spark-as-a-Service offering. Spark is a "big data" tool that lets you perform ETL, stream processing, analytics, and machine learning from a notebook-like (think Jupyter notebook) experience. The most proficient dbx users tend to know Python, Scala, or SQL.

For this lab we are only going to do the most basic queries with dbx to give you a feel for the Art of the Possible.

## Create Azure Databricks Cluster

In this section you are going to create an Azure Databricks cluster that will be used to execute notebooks.

  1. In the Azure Portal, navigate to the MDW-Lab resource group and locate the Azure Databricks resource MDWDatabricks-suffix.
  2. On the MDWDatabricks-suffix blade, click the Launch Workspace button. The Azure Databricks portal will open on a new browser tab.

This is a Databricks Workspace, which allows you to create one-to-many dbx clusters, create/save/share notebooks, and connect to common services like data lakes and databases.

  3. On the Azure Databricks portal, click the Clusters button on the left-hand side menu.
  4. On the Clusters blade, click + Create Cluster.

You can have multiple clusters in a workspace. One cluster may be set up and scaled for streaming data, one for batch jobs that require more horsepower for data engineering, and one for ad hoc analytics. (A scripted way to create clusters is sketched at the end of this section.)

  5. On the Create Cluster blade, type MDWDatabricksCluster in the Cluster Name field. Leave all other fields with their default values.

  6. Click Create Cluster. It should take around 5 minutes for the cluster to be fully operational.
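Cluster creation can also be automated. The sketch below is illustrative only: it calls the Databricks REST API 2.0 clusters/create endpoint, and the workspace URL, personal access token, runtime version, and VM size shown are placeholder assumptions, not values from this lab.

```python
import requests

# Placeholder values -- substitute your own workspace URL and PAT token
workspace_url = "https://<your-region>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "MDWDatabricksCluster",
    "spark_version": "<supported-runtime-version>",  # pick one your workspace lists
    "node_type_id": "Standard_DS3_v2",               # assumed Azure VM size
    "num_workers": 2,
}

# POST to the clusters/create endpoint; a cluster_id is returned on success
resp = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=cluster_spec,
)
print(resp.json())
```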

## Create an Azure Databricks Notebook

In this section you are going to create an Azure Databricks notebook that will be used to explore the taxi data files you copied to your data lake in Lab 2.

  1. On the Azure Databricks portal, click the Home button on the left-hand side menu.

  2. On the Workspace blade, click the down arrow next to your user name and then click Create > Notebook.

  3. On the Create Notebook pop-up window type “NYCTaxiData” in the Name field.

  4. Ensure the Language field is set to Python and the Cluster field is set to MDWDatabricksCluster.

  5. Click Create.

  6. On the Cmd 1 cell, click the Edit button on the top right-hand corner of the cell and then click Show Title.

  7. Type Setup connection to MDWDataLake storage account in the cell title.

  8. On the Cmd 1 cell, you will invoke the Spark API to establish a connection to your MDWDataLake storage account. For this you will need to retrieve the name and key of your MDWDataLake storage account from the Azure Portal.

  9. Use the Python code below and replace [your MDWDataLake storage account name] with mdwdatalakesuffix and [your MDWDataLake storage account key] with the storage account key.

```python
## vars to change
acctname = "mdwdatalakeg3sve"
acctkey = "/A2mGb+x4ZLpy1pGV4JzYA0YgQZ6gV0SaSFeIjvdhXcwTkQIzAwtYP5goo2vW6dYa1i1Ng9hLWwOiKv7XUxDIQ=="

# Build the Spark config key for the storage account and the wasbs URL for
# the nyctaxidata container, then register the account key with Spark
fullacctname = "fs.azure.account.key." + acctname + ".blob.core.windows.net"
wasbs_location = "wasbs://nyctaxidata@" + acctname + ".blob.core.windows.net/"
spark.conf.set(fullacctname, acctkey)
```
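To confirm the connection works before continuing, you can list the container contents with the built-in dbutils helper (available by default in Databricks notebooks):

```python
# List the files in the nyctaxidata container as a quick sanity check
display(dbutils.fs.ls(wasbs_location))
```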
  10. Press Shift + Enter to execute and create a new notebook cell. Set the title of the Cmd 2 cell to Define NYCTaxiData schema and load data into a Data Frame.

  11. In the Cmd 2 cell, define a new StructType object that will contain the definition of the data frame schema.

  12. Using the schema defined above, initialize a new data frame by invoking the Spark API to read the contents of the nyctaxidata container in the MDWDataLake storage account. Use the Python code below:

```python
from pyspark.sql.types import *

# Pickup and dropoff are full timestamps, so TimestampType (not DateType)
# is used for those two columns
nycTaxiDataSchema = StructType([
    StructField("VendorID", IntegerType(), True)
  , StructField("tpep_pickup_datetime", TimestampType(), True)
  , StructField("tpep_dropoff_datetime", TimestampType(), True)
  , StructField("passenger_count", IntegerType(), True)
  , StructField("trip_distance", DoubleType(), True)
  , StructField("RatecodeID", IntegerType(), True)
  , StructField("store_and_fwd_flag", StringType(), True)
  , StructField("PULocationID", IntegerType(), True)
  , StructField("DOLocationID", IntegerType(), True)
  , StructField("payment_type", IntegerType(), True)
  , StructField("fare_amount", DoubleType(), True)
  , StructField("extra", DoubleType(), True)
  , StructField("mta_tax", DoubleType(), True)
  , StructField("tip_amount", DoubleType(), True)
  , StructField("tolls_amount", DoubleType(), True)
  , StructField("improvement_surcharge", DoubleType(), True)
  , StructField("total_amount", DoubleType(), True)])

# Apply the schema with .schema() -- passing it through .options() is
# silently ignored and every column would be read as a string
dfNYCTaxiData = spark.read.format('csv').option('header', 'true').schema(nycTaxiDataSchema).load(wasbs_location)
```
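Before moving on, you can verify the schema was actually applied (rather than every column loading as a string):

```python
# Print the schema; each column should show its declared type
dfNYCTaxiData.printSchema()
```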
  13. Press Shift + Enter to execute the command and create a new cell.
  14. Set the title of the Cmd 3 cell to Display Data Frame Content and use the code below:
```python
display(dfNYCTaxiData)
```
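The display function renders an interactive grid in Databricks; if you just want a quick text preview, the plain DataFrame API works too:

```python
# Text-only preview of the first 5 rows (works in any Spark environment)
dfNYCTaxiData.show(5)
```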
  15. Press Shift + Enter to execute the command and create a new cell. You will see a data grid showing the top 1000 records from the data frame.

  16. Set the title of the Cmd 4 cell to “Create Temp View”.

  17. In the Cmd 4 cell, call the createOrReplaceTempView method of the data frame object to create a temporary view of the data in memory. Use the Python code below:

```python
dfNYCTaxiData.createOrReplaceTempView('NYCTaxiDataTable')
```
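The temp view can also be queried from Python in the same notebook; spark.sql returns a regular data frame:

```python
# Query the temp view from Python; the result is a data frame
spark.sql("select count(*) from NYCTaxiDataTable").show()
```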
  18. Press Shift + Enter to execute the command and create a new cell.

  19. Set the title of the Cmd 5 cell to “Use SQL to count NYC Taxi Data records”.

  20. In the Cmd 5 cell, change the default language to SQL using the %sql magic command.

  21. Write a SQL query to retrieve the total number of records in the NYCTaxiDataTable view. Use the command below:

```sql
%sql
select count(*) from NYCTaxiDataTable
```
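For reference, the same total is available straight from the data frame without SQL:

```python
# Equivalent record count using the DataFrame API
dfNYCTaxiData.count()
```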
  22. Set the title of the Cmd 6 cell to “Use SQL to filter NYC Taxi Data records”.

  23. In the Cmd 6 cell, write a SQL query to filter taxi rides that happened on April 7th, 2018 and had more than 5 passengers. Use the command below:

```sql
%sql

select cast(tpep_pickup_datetime as date) as pickup_date
  , tpep_dropoff_datetime
  , passenger_count
  , total_amount
from NYCTaxiDataTable
where cast(tpep_pickup_datetime as date) = '2018-04-07'
  and passenger_count > 5
```
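If you prefer the DataFrame API over SQL, the same filter can be expressed in Python. A minimal sketch, using the dfNYCTaxiData data frame created earlier:

```python
from pyspark.sql.functions import col, to_date

# Same filter as the SQL above, expressed with the DataFrame API
(dfNYCTaxiData
  .withColumn("pickup_date", to_date(col("tpep_pickup_datetime")))
  .filter((col("pickup_date") == "2018-04-07") & (col("passenger_count") > 5))
  .select("pickup_date", "tpep_dropoff_datetime", "passenger_count", "total_amount")
  .show())
```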
  24. Set the title of the Cmd 7 cell to “Use SQL to aggregate NYC Taxi Data records and visualize data”.

  25. In the Cmd 7 cell, write a SQL query to aggregate records and return the total number of rides by payment type. Use the command below:

```sql
%sql

select case payment_type
            when 1 then 'Credit card'
            when 2 then 'Cash'
            when 3 then 'No charge'
            when 4 then 'Dispute'
            when 5 then 'Unknown'
            when 6 then 'Voided trip'
        end as PaymentType
  , count(*) as TotalRideCount
from NYCTaxiDataTable
group by payment_type
order by TotalRideCount desc
```
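The same aggregation can be sketched with the DataFrame API, using chained when expressions in place of the SQL CASE (codes with no match become null, just like a CASE with no ELSE):

```python
from pyspark.sql.functions import when, col

# Map payment_type codes to human-readable labels
paymentTypeLabel = (when(col("payment_type") == 1, "Credit card")
    .when(col("payment_type") == 2, "Cash")
    .when(col("payment_type") == 3, "No charge")
    .when(col("payment_type") == 4, "Dispute")
    .when(col("payment_type") == 5, "Unknown")
    .when(col("payment_type") == 6, "Voided trip"))

# Count rides per label, most common payment type first
(dfNYCTaxiData
  .groupBy(paymentTypeLabel.alias("PaymentType"))
  .count()
  .withColumnRenamed("count", "TotalRideCount")
  .orderBy(col("TotalRideCount").desc())
  .show())
```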
  26. Press Ctrl + Enter to execute the command. Results will be displayed in a grid in the cell.

  27. Click the Bar chart button to see results as a bar chart.