# Mount an Azure Blob container as a directory in the Linux file system (on the worker node)
This has to be done only once: the mount is remembered between cluster restarts.

The mount point lives in DBFS (Databricks File System), a distributed file system provided by Databricks. Mounts are managed through the `dbutils` utilities.

See the documentation at https://docs.databricks.com/files/index.html to understand the difference between this mount point and the local storage of a cluster.

This file is part of https://github.com/cnoam/iem_teachinglab.git, in the 'databricks' folder.

<br>

**NOTE** In 2024, Databricks Runtime (DBR) hardened its security, so you need to whitelist the method.

This can be done by connecting to a cluster that is in **"Access Mode: Dedicated"** (formerly single user mode).

In [None]:
storage_account = "coursedata2024"  
container = "fwm-stb-data"  


In [0]:
# Run this cell only if you want to make sure the mount point is unmounted.
# WARNING: it looks like DBR remembers lost mounts, which is very weird.
#
# Key takeaway: you do not have to mount a container to use it in Databricks;
# mounting just creates a "friendly" DBFS path. If credentials are set at the
# cluster/session level, you can access wasbs://... or abfss://... paths (or the
# /mnt/... DBFS equivalent) directly, without an explicit mount. That is why
# dbutils.fs.ls("/mnt/...") can still succeed even though dbutils.fs.mounts()
# does not show the mount.
dbutils.fs.unmount(f"/mnt/{storage_account}/{container}")
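
The key takeaway in the cell above (a container can be read without mounting it) can be sketched as follows. The helper names `wasbs_url` and `sas_conf_key` are hypothetical and exist only for illustration; the URL and configuration-key formats are the same ones this notebook passes to `dbutils.fs.mount`. The commented lines at the end assume a Databricks session where `spark` and `secret` are defined and where session-level SAS configuration is allowed, which may depend on the cluster's access mode.

```python
def wasbs_url(storage_account: str, container: str, path: str = "") -> str:
    """Build a wasbs:// URL for a path inside a blob container."""
    return f"wasbs://{container}@{storage_account}.blob.core.windows.net/{path.lstrip('/')}"


def sas_conf_key(storage_account: str, container: str) -> str:
    """Build the configuration key under which the container's SAS token is set."""
    return f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net"


url = wasbs_url("coursedata2024", "fwm-stb-data", "/demographic/")
key = sas_conf_key("coursedata2024", "fwm-stb-data")
print(url)
print(key)

# On a Databricks cluster (assumption, not verified in this notebook):
# spark.conf.set(key, secret)
# spark.read.text(url).show(3)
```

Building the two strings in one place keeps the direct-access path and the mount call consistent if the storage account or container name changes.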

In [0]:



# This code uses SAS token authentication to access the data
secret = "sp=rl&st=2025-03-16T08:29:06Z&se=2025-09-01T15:29:06Z&spr=https&sv=2022-11-02&sr=c&sig=kv1VUVNTHUxwMUUUpw7z3duDFfn6CHEIvbgeGcRygkM%3D"
mount_point = f"/mnt/{storage_account}/{container}"

# Mount only if this mount point is not already registered
if mount_point not in [m.mountPoint for m in dbutils.fs.mounts()]:
    dbutils.fs.mount(
        source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
        mount_point=mount_point,
        extra_configs={f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net": secret},
    )
    print("Mounted ok")
else:
    print(f"{mount_point} is already mounted")
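
Because the SAS token carries its own permissions and validity window, it can help to inspect it before mounting. The sketch below is plain Python and does not contact Azure. The function names `describe_sas_token` and `is_expired` are hypothetical, the permission table lists only the common letters, and the example token reuses the fields of the token above with a placeholder signature.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Common container-level SAS permission letters (not an exhaustive list)
PERMISSION_NAMES = {"r": "read", "a": "add", "c": "create", "w": "write", "d": "delete", "l": "list"}


def parse_sas_time(value: str) -> datetime:
    # Timestamps in this notebook's token look like 2025-03-16T08:29:06Z (UTC)
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def describe_sas_token(token: str) -> dict:
    """Extract the permissions and validity window from a SAS query string."""
    fields = {k: v[0] for k, v in parse_qs(token).items()}
    return {
        "permissions": [PERMISSION_NAMES.get(p, p) for p in fields.get("sp", "")],
        "starts": parse_sas_time(fields["st"]) if "st" in fields else None,
        "expires": parse_sas_time(fields["se"]),
    }


def is_expired(token: str, now: datetime) -> bool:
    """Return True when `now` is at or past the token's expiry time."""
    return now >= describe_sas_token(token)["expires"]


# Same fields as the token used for mounting, with a placeholder signature
example = "sp=rl&st=2025-03-16T08:29:06Z&se=2025-09-01T15:29:06Z&spr=https&sv=2022-11-02&sr=c&sig=PLACEHOLDER"
info = describe_sas_token(example)
print(info["permissions"])
print(is_expired(example, datetime.now(timezone.utc)))
```

A token with `sp=rl` grants only read and list, which is consistent with the write attempt later in this notebook being rejected.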
  



In [None]:
# list the mount points
dbutils.fs.mounts()
#dbutils.fs.unmount(f'/mnt/{storage_account}/{container}/')

In [None]:
dbutils.fs.ls(f"/mnt/{storage_account}/{container}") 

In [0]:
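# Sanity check: list the 'demographic' directory and verify the name of the first file.
# Modify the path or the expected name to check a different file; the check raises on a mismatch.
files = dbutils.fs.ls(f"/mnt/{storage_account}/{container}/demographic/")
if files[0].name != 'SintecMedia.rpt_demodata.date_2015-12-31.2016-01-01.pd.gz':
    raise Exception(f"File name is not expected: {files[0].name}")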
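
The check above looks only at the first entry of the listing. A variant that does not depend on which entry comes first could search the listing by name, as in this minimal sketch. `FakeFileInfo` is a hypothetical stand-in for the entries returned by `dbutils.fs.ls` (only `.name` is used), and `find_by_name` is an illustrative helper, not part of `dbutils`.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (assumption: they expose .name)
FakeFileInfo = namedtuple("FakeFileInfo", ["path", "name", "size"])


def find_by_name(entries, expected_name):
    """Return the first entry whose name matches, or None if it is absent."""
    return next((e for e in entries if e.name == expected_name), None)


listing = [
    FakeFileInfo("dbfs:/mnt/example/a.gz", "a.gz", 10),
    FakeFileInfo("dbfs:/mnt/example/b.gz", "b.gz", 20),
]
match = find_by_name(listing, "b.gz")
print(match.name if match else "not found")
```

In the notebook, the same helper could be called with the result of `dbutils.fs.ls(...)` in place of `listing`.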

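Opening the file with Python's built-in `open()` should fail: `open()` works on local file system paths and does not understand the `dbfs:` URI scheme.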

In [0]:
try:
    open(f"dbfs:/mnt/{storage_account}/{container}/demographic/SintecMedia.rpt_demodata.date_2015-12-31.2016-01-01.pd.gz", "r")
except Exception as ex:
    print("calling open() will fail!\n" + str(ex))
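
Local file APIs need a local path rather than a `dbfs:` URI. Databricks conventionally exposes DBFS under the local `/dbfs` directory, so the translation can be sketched as below. The helper name `dbfs_uri_to_local_path` is hypothetical, and whether `/dbfs` is actually readable depends on the cluster configuration and access mode, which this notebook does not verify.

```python
def dbfs_uri_to_local_path(uri: str) -> str:
    """Map dbfs:/mnt/... to /dbfs/mnt/... (the conventional local view of DBFS)."""
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a dbfs: URI: {uri}")
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


local_path = dbfs_uri_to_local_path("dbfs:/mnt/coursedata2024/fwm-stb-data/demographic/")
print(local_path)

# On a cluster where /dbfs is available (assumption), local tools can then be used:
# import os; print(os.listdir(local_path))
```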

## Read XML file

An external library must be installed on the cluster.

Follow the instructions in https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries. <br>
The library's Maven coordinates are `com.databricks:spark-xml_2.12:0.16.0`.

It can be downloaded from `https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.16.0/spark-xml_2.12-0.16.0.jar` or installed directly using the Maven coordinates.

Update 2025-05-28: this notebook ran successfully on a cluster with runtime "15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)" without any libraries installed.

In [0]:
fname = f"dbfs:/mnt/{storage_account}/{container}/refxml/SintecMedia.rpt_refxml.date_2015-01-01.2016-11-21.xml.gz"
df = spark.read.format("xml").option("compression","gzip").option("rowTag", "mapping").load(fname)
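
To illustrate what `rowTag` does, the sketch below uses plain Python instead of Spark: every element with the chosen tag becomes one row, and XML attributes become columns whose names start with `_` (the default attribute prefix of the XML reader, which is why this notebook refers to a column named `_system-type`). The sample document is invented for illustration and the real file's structure may differ; `rows_from_xml` is a hypothetical helper, not the reader's implementation.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's elements and attributes may differ
sample_xml = b"""<mappings>
  <mapping system-type="H" id="1"><name>alpha</name></mapping>
  <mapping system-type="P" id="2"><name>beta</name></mapping>
</mappings>"""
compressed = gzip.compress(sample_xml)


def rows_from_xml(gz_bytes: bytes, row_tag: str, attribute_prefix: str = "_") -> list:
    """Turn every <row_tag> element into a dict, prefixing attribute names."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    rows = []
    for element in root.iter(row_tag):
        # Attributes become prefixed columns, child elements become plain columns
        row = {attribute_prefix + k: v for k, v in element.attrib.items()}
        row.update({child.tag: child.text for child in element})
        rows.append(row)
    return rows


rows = rows_from_xml(compressed, "mapping")
print(rows[0])
```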

In [0]:
df.show(5)

In [0]:
# Try to write to the mount point. The SAS token grants only read and list (sp=rl),
# so the write should fail with an authorization error.
empty_df = spark.createDataFrame([], df.schema)
try:
    empty_df.write.mode("overwrite").parquet(f"dbfs:/mnt/{storage_account}/fwm-stb-data/refxml/dummy")
except Exception as ex:
    if "This request is not authorized to perform this operation using this permission" not in str(ex):
        # The write failed, but not for the expected reason: surface the original error
        raise
    print("✅ Writing to mount point failed as it should!\n")
else:
    # No exception means the write went through, which must not happen
    raise Exception("❌ Writing to mount point should fail, but it SUCCEEDED!")

In [0]:
df.printSchema()

In [0]:
from pyspark.sql.functions import window, column, desc, col
df.filter(col('_system-type') != 'H').show(4)

In [0]:
# How many records were generated by each system type?
df.groupBy("_system-type").count().show()