### Captured EventHub Example Databricks Notebook
##### by Robert Alexander, roalexan@microsoft.com

##### Copyright (c) Microsoft Corporation. All rights reserved.

##### Licensed under the MIT License.

##### Prerequisites
1. An **Azure subscription**. You will be asked for the *subscription id*.
1. A **Service Principal** with read/write access to this subscription. You will be asked for the *app id*, *app key*, and *tenant id*. Click [here](https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal) for help on adding a Service Principal.
1. An **Azure Databricks Service** and **cluster**. Use Python version 2 for the cluster. Click [here](https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal) for help on adding a Databricks Service and cluster.
1. Add the following libraries to your cluster via PyPI. Click [here](https://docs.databricks.com/user-guide/libraries.html) for help on adding a library.
   - **azure-cli**
   - **azure-eventhub**

##### Usage
Enter the required input parameters, then click Run All (or run each step individually, if you prefer). This creates a resource group containing a [Captured EventHub](https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview) configured to automatically write messages sent to the EventHub to an Azure Storage Account container.

##### Cleanup

When you are finished, you can undeploy all the resources created by this notebook by uncommenting and running the last step.

##### The following Azure services will be deployed into a new resource group:
1. Azure Storage Account
1. Event Hubs Namespace

In [2]:
dbutils.widgets.text("subscription_id", "", "")
dbutils.widgets.text("location", "", "")
dbutils.widgets.text("prefix", "", "")
dbutils.widgets.text("tenant_id", "","")
dbutils.widgets.text("app_id", "","")
dbutils.widgets.text("app_key", "","")

# After running this cell, fill in all of the above input parameters before proceeding.

In [3]:
SUBSCRIPTION_ID = dbutils.widgets.get("subscription_id")
PREFIX = dbutils.widgets.get("prefix")
RESOURCE_GROUP_NAME = PREFIX + "EventHub-rg"
LOCATION = dbutils.widgets.get("location")
TENANT_ID = dbutils.widgets.get("tenant_id")
APP_ID = dbutils.widgets.get("app_id")
APP_KEY = dbutils.widgets.get("app_key")

STORAGE_ACCOUNT_NAME = PREFIX + "storageaccount"
STORAGE_CONTAINER_NAME = "container2"
NAMESPACE_NAME = PREFIX + "EventHubNamespace"
EVENT_HUB_NAME = PREFIX + "EventHub"

print('SUBSCRIPTION_ID: ', SUBSCRIPTION_ID)
print('PREFIX: ', PREFIX)
print('RESOURCE_GROUP_NAME: ', RESOURCE_GROUP_NAME)
print('LOCATION: ', LOCATION)
print('TENANT_ID: ', TENANT_ID)
print('APP_ID: ', APP_ID)
print('APP_KEY: ', '*' * len(APP_KEY))  # mask the secret so it is not saved in the notebook output

print('STORAGE_CONTAINER_NAME: ', STORAGE_CONTAINER_NAME)
print('STORAGE_ACCOUNT_NAME: ', STORAGE_ACCOUNT_NAME)
print('NAMESPACE_NAME: ', NAMESPACE_NAME)
print('EVENT_HUB_NAME: ', EVENT_HUB_NAME)
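
Because `STORAGE_ACCOUNT_NAME` is built from `PREFIX`, the prefix has to respect the Azure storage account naming rules: 3 to 24 characters, lowercase letters and digits only. The following is a minimal sketch of a check that could be run before deploying. `is_valid_storage_account_name` is a hypothetical helper written for this illustration, not part of the Azure SDK, and the prefixes are made-up examples.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_STORAGE_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}\Z")


def is_valid_storage_account_name(name):
    """Return True if `name` satisfies the storage account naming rules."""
    return _STORAGE_NAME_PATTERN.match(name) is not None


# Hypothetical prefixes, combined the same way as STORAGE_ACCOUNT_NAME above.
for prefix in ["rba", "RBA", "averyverylongprefix"]:
    name = prefix + "storageaccount"
    print(name, is_valid_storage_account_name(name))
```

Since the fixed suffix `storageaccount` already uses 14 characters, the prefix can be at most 10 lowercase alphanumeric characters.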

In [4]:
# https://docs.microsoft.com/en-us/python/azure/python-sdk-azure-authenticate?view=azure-python

from azure.common.credentials import ServicePrincipalCredentials

credentials = ServicePrincipalCredentials(
    client_id = APP_ID,
    secret = APP_KEY,
    tenant = TENANT_ID
)
print('credentials: ', credentials)

In [5]:
# https://github.com/Azure/azure-sdk-for-python/tree/master/azure-mgmt-resource

from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup
from azure.common.client_factory import get_client_from_cli_profile

resourceManagementClient = get_client_from_cli_profile(ResourceManagementClient, credentials=credentials, subscription_id=SUBSCRIPTION_ID)
resourceManagementClient.resource_groups.create_or_update(
    resource_group_name = RESOURCE_GROUP_NAME,
    parameters = ResourceGroup(location=LOCATION)
)

In [6]:
# https://github.com/Azure-Samples/storage-python-manage
# https://github.com/Azure/azure-sdk-for-python/tree/master/azure-mgmt-storage
# https://docs.microsoft.com/en-us/python/api/overview/azure/storage/management?view=azure-python
# https://blogs.msdn.microsoft.com/jmstall/2014/06/12/azure-storage-naming-rules/

from azure.common.client_factory import get_client_from_cli_profile
from azure.mgmt.storage.storage_management_client import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters
from azure.mgmt.storage.models import Sku
from azure.mgmt.storage.models import SkuName
from azure.mgmt.storage.models import Kind

# Create StorageManagementClient
storageManagementClient = get_client_from_cli_profile(StorageManagementClient, credentials=credentials, subscription_id=SUBSCRIPTION_ID)
print('storageManagementClient: ', storageManagementClient)

# Create StorageAccount
async_create = storageManagementClient.storage_accounts.create(
    resource_group_name = RESOURCE_GROUP_NAME,
    account_name = STORAGE_ACCOUNT_NAME,
    parameters = StorageAccountCreateParameters(
        sku = Sku(name=SkuName.standard_lrs),
        kind = Kind.storage_v2,
        location = LOCATION
    )
)
async_create.wait()

In [7]:
#https://github.com/Azure/azure-sdk-for-python/blob/master/azure-mgmt-eventhub/azure/mgmt/eventhub/operations/event_hubs_operations.py
#https://github.com/Azure/azure-sdk-for-python/tree/master/azure-mgmt-eventhub/tests
#https://docs.microsoft.com/en-us/python/api/azure-mgmt-eventhub/?view=azure-python
#https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview

from azure.common.client_factory import get_client_from_cli_profile
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import EHNamespace
from azure.mgmt.eventhub.models import Eventhub
from azure.mgmt.eventhub.models import CaptureDescription
from azure.mgmt.eventhub.models import Destination
from azure.mgmt.eventhub.models import EncodingCaptureDescription

# Create EventHubManagementClient
eventHubManagementClient = get_client_from_cli_profile(EventHubManagementClient, credentials=credentials, subscription_id=SUBSCRIPTION_ID)
print('eventHubManagementClient: ', eventHubManagementClient)

# Create EventHub NameSpace
async_create = eventHubManagementClient.namespaces.create_or_update(
    resource_group_name = RESOURCE_GROUP_NAME,
    namespace_name = NAMESPACE_NAME,
    parameters = EHNamespace(location=LOCATION)
)
async_create.wait()

# Create (Captured) EventHub
storage_account = storageManagementClient.storage_accounts.get_properties(RESOURCE_GROUP_NAME, STORAGE_ACCOUNT_NAME)
eventHubManagementClient.event_hubs.create_or_update(
    resource_group_name = RESOURCE_GROUP_NAME,
    namespace_name = NAMESPACE_NAME,
    event_hub_name = EVENT_HUB_NAME,
    parameters = Eventhub(
        message_retention_in_days = 2,
        partition_count = 2,
        capture_description = CaptureDescription(
            enabled=True,
            encoding=EncodingCaptureDescription.avro,
            interval_in_seconds = 60,
            size_limit_in_bytes = 1024*1024*10, # must be >= 10 MB
            destination = Destination(
                name="EventHubArchive.AzureBlockBlob",
                storage_account_resource_id = storage_account.id,
                blob_container = STORAGE_CONTAINER_NAME,
                archive_name_format="{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
            )
        )
    )
)
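
The `archive_name_format` above determines the blob path of each capture file. As a rough illustration, the sketch below expands that format for a given partition and timestamp. It assumes lowercased namespace and EventHub names, zero-padded date parts, and an `.avro` suffix, which is how the paths read later in this notebook look. `capture_blob_path` is a hypothetical helper, not an Azure API.

```python
from datetime import datetime


def capture_blob_path(namespace, event_hub, partition_id, timestamp):
    """Expand {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}."""
    return "{}/{}/{}/{:%Y/%m/%d/%H/%M/%S}.avro".format(
        namespace.lower(), event_hub.lower(), partition_id, timestamp
    )


# Example values chosen to match a path that is read later in this notebook.
print(capture_blob_path("rbaEventHubNamespace", "rbaEventHub", 0, datetime(2018, 11, 29, 19, 54, 30)))
```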

In [8]:
# https://github.com/Azure/azure-event-hubs-python
# https://github.com/Azure/azure-event-hubs-python/blob/master/tests/test_send.py - see json.dumps

from azure.eventhub import EventHubClient, Sender, EventData
import time
import random

authorization_rules = list(eventHubManagementClient.namespaces.list_authorization_rules(RESOURCE_GROUP_NAME, NAMESPACE_NAME))
default_authorization_rule_name = authorization_rules[0].name

accessKeys = eventHubManagementClient.namespaces.list_keys(RESOURCE_GROUP_NAME, NAMESPACE_NAME, default_authorization_rule_name)

ADDRESS = "amqps://{0}.servicebus.windows.net/{1}".format(NAMESPACE_NAME, EVENT_HUB_NAME)
USER = accessKeys.key_name
KEY = accessKeys.primary_key

# Create Event Hubs client
client = EventHubClient(ADDRESS, debug=False, username=USER, password=KEY)
sender = client.add_sender(partition="0")
client.run()
start_time = time.time()
try:
    for i in range(1):
        print("Sending message: {}".format(i))
        # Alternative payload: a random movie rating instead of the fixed test string.
        #message = "UserId:" + str(random.randint(1,1000)) + ",MovieId:" + str(random.randint(1,100)) + ",Rating:" + str(random.randint(1,5))
        message = u"testmessage"
        sender.send(EventData(message))
finally:
    # Always stop the client, even if sending fails; any exception still propagates.
    end_time = time.time()
    client.stop()
    run_time = end_time - start_time
    print("Runtime: {} seconds".format(run_time))
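
The commented-out `message = "UserId:..."` line in the loop hints at a more realistic payload: a random movie rating. A minimal sketch of building that payload follows. The helper names `format_rating` and `random_rating_message` are made up for illustration, and the value ranges are taken from the commented-out code.

```python
import random


def format_rating(user_id, movie_id, rating):
    """Build a 'UserId:<n>,MovieId:<n>,Rating:<n>' message string."""
    return "UserId:{},MovieId:{},Rating:{}".format(user_id, movie_id, rating)


def random_rating_message(rng=random):
    """Build a message with a random user (1-1000), movie (1-100), and rating (1-5)."""
    return format_rating(rng.randint(1, 1000), rng.randint(1, 100), rng.randint(1, 5))


print(format_rating(707, 53, 5))
print(random_rating_message())
```

Each resulting string could then be passed to `sender.send(EventData(message))` in place of the fixed test message.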

In [9]:
#dbutils.fs.unmount("/mnt/container2")

In [10]:
accountKey = "fs.azure.account.key.{}.blob.core.windows.net".format(STORAGE_ACCOUNT_NAME)
accessKey = storageManagementClient.storage_accounts.list_keys(resource_group_name=RESOURCE_GROUP_NAME, account_name=STORAGE_ACCOUNT_NAME).keys[0].value

# Mount the drive for native python
inputSource = "wasbs://{}@{}.blob.core.windows.net".format(STORAGE_CONTAINER_NAME, STORAGE_ACCOUNT_NAME)
mountPoint = "/mnt/" + STORAGE_CONTAINER_NAME
extraConfig = {accountKey: accessKey}
print("Mounting: {}".format(mountPoint))
try:
  dbutils.fs.mount(
    source = inputSource,
    mount_point = str(mountPoint),
    extra_configs = extraConfig
  )
  print("Succeeded")
except Exception as e:
  if "Directory already mounted" in str(e):
    print("Directory {} already mounted".format(mountPoint))
  else:
    raise  # re-raise the original exception with its traceback

In [11]:
# https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html#access-dbfs-with-the-databricks-cli

#display(dbutils.fs.ls("dbfs:/mnt/container2/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/49/"))
#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/49/21.avro"))

#display(dbutils.fs.ls("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/29/"))

#dbutils.fs.head("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/49/21.avro")
dbutils.fs.head("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/29/19/54/30.avro")

In [12]:
spark.conf.set("spark.sql.avro.compression.codec", "deflate")

In [13]:
# https://stackoverflow.com/questions/34477089/select-specific-columns-only-form-a-dataframe-in-python
# https://stackoverflow.com/questions/38294897/spark-avro-databricks-package
# https://github.com/Azure/azure-event-hubs-python/blob/master/azure/eventhub/common.py#L185

#{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}

#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/49/21.avro")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/49/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/29/*")

#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/29/14/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/29/*/*/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/*/*/*/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/*/*/*/*/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/*/*/*/*/*/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/*/*/*/*/*/*/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/*/*/*/*/*/*/*")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/*/*/*/*/16/*/*")
df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/29/19/54/30.avro")
#df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/episodes.avro")

#dir(df)
#subset = df['Body']
#df = df[['Body']]

#len(df.index)
#display(count)
#count_row = df.shape[0]  # gives number of row count
#count_row
#df.count()
display(df)

#subset = df.where("Rating = 1")
#display(subset)

#subset = df.where("doctor > 5")
#display(subset)

SequenceNumber,Offset,EnqueuedTimeUtc,SystemProperties,Properties,Body
17,1184,11/29/2018 7:55:09 PM,Map(),Map(),dGVzdG1lc3NhZ2U=
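
The `Body` column holds the raw message bytes, which appear in the output above as base64 text. A small sketch, using only the standard library, of turning the displayed value back into the text that was sent:

```python
import base64

# Body value copied from the output row above.
encoded_body = "dGVzdG1lc3NhZ2U="

decoded = base64.b64decode(encoded_body).decode("utf-8")
print(decoded)  # testmessage
```

Within Spark, casting the column to a string, for example `df.withColumn("Body", df["Body"].cast("string"))`, should give the same readable text for UTF-8 payloads.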


In [14]:
# https://docs.azuredatabricks.net/spark/latest/data-sources/read-avro.html

#df = spark.read.format("com.databricks.spark.avro").load("/tmp/episodes.avro")
df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/container2/rbaeventhubnamespace/rbaeventhub/0/2018/11/28/20/49/21.avro")
display(df)

SequenceNumber,Offset,EnqueuedTimeUtc,SystemProperties,Properties,Body
0,0,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjcwNyxNb3ZpZUlkOjUzLFJhdGluZzo1
1,80,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjI1LE1vdmllSWQ6OCxSYXRpbmc6Mw==
2,152,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjI4LE1vdmllSWQ6NzYsUmF0aW5nOjM=
3,224,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjY3OCxNb3ZpZUlkOjMxLFJhdGluZzox
4,304,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjIzMyxNb3ZpZUlkOjY4LFJhdGluZzoz
5,384,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjYzMSxNb3ZpZUlkOjgsUmF0aW5nOjU=
6,456,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjI5MixNb3ZpZUlkOjc5LFJhdGluZzoz
7,536,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjY4NCxNb3ZpZUlkOjQyLFJhdGluZzo1
8,616,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjM5MCxNb3ZpZUlkOjM0LFJhdGluZzo1
9,696,11/28/2018 8:49:59 PM,Map(),Map(),VXNlcklkOjM2NCxNb3ZpZUlkOjYwLFJhdGluZzoz
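
The rows above come from a run that sent rating messages rather than the fixed test string. Assuming every body has the `UserId:<n>,MovieId:<n>,Rating:<n>` shape, one way to decode a body and split it into fields is sketched below. `parse_rating_body` is a hypothetical helper for illustration only.

```python
import base64


def parse_rating_body(encoded_body):
    """Decode a base64 Body value and parse 'Key:int,Key:int,...' into a dict."""
    text = base64.b64decode(encoded_body).decode("utf-8")
    fields = (pair.split(":", 1) for pair in text.split(","))
    return {key: int(value) for key, value in fields}


# Body of the first output row above.
print(parse_rating_body("VXNlcklkOjcwNyxNb3ZpZUlkOjUzLFJhdGluZzo1"))
```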


In [15]:
# pip install avro-python3
# https://stackoverflow.com/questions/40188721/attributeerror-str-object-has-no-attribute-decode-while-reading-from-avro-u

from avro.datafile import DataFileReader
from avro.io import DatumReader

try:
    # Open a local Avro file and print each record it contains.
    with open("1.avro", "rb") as avro_file:
        reader = DataFileReader(avro_file, DatumReader())
        for record in reader:
            print(record)
except Exception as e:
    print(e)

In [16]:
# https://docs.databricks.com/spark/latest/data-sources/read-avro.html 
# https://docs.azuredatabricks.net/spark/latest/data-sources/read-avro.html
# https://feedback.azure.com/forums/909463-azure-databricks/suggestions/33107896-support-reading-avro-files-from-azure-blob-storage
# https://github.com/databricks/spark-avro
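
Given the capture layout `{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}`, a load path covering a whole day or a whole partition can be built by replacing the trailing components with `*`, as the wildcard paths tried in the earlier read step do. A sketch of that idea, with `capture_glob` as a hypothetical helper and the mount path taken from this notebook:

```python
def capture_glob(root, namespace, event_hub, *fixed_parts):
    """Build a load path with '*' for every unspecified trailing component.

    `fixed_parts` are the leading values of PartitionId, Year, Month, Day,
    Hour, Minute, Second, in that order; the rest become wildcards.
    Pass single-digit date parts as zero-padded strings, e.g. "05".
    """
    depth = 7  # PartitionId through Second
    if len(fixed_parts) > depth:
        raise ValueError("at most {} path components are allowed".format(depth))
    parts = [str(part) for part in fixed_parts] + ["*"] * (depth - len(fixed_parts))
    return "/".join([root.rstrip("/"), namespace.lower(), event_hub.lower()] + parts)


# All capture files for partition 0 on 2018-11-29.
print(capture_glob("dbfs:/mnt/container2", "rbaeventhubnamespace", "rbaeventhub", 0, 2018, 11, 29))
```

The returned string can be passed to `spark.read.format("com.databricks.spark.avro").load(...)` in the same way as the literal paths used earlier.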


  

In [17]:
# When you are finished, you can undeploy all the resources created by this notebook by uncommenting and running this step. This will delete the resource group and all of its resources - namely the Azure Storage Account and Event Hubs Namespace.

#dbutils.fs.unmount("/mnt/container2")
#async_delete = resourceManagementClient.resource_groups.delete(resource_group_name = RESOURCE_GROUP_NAME)
#async_delete.wait()