# Mount an Azure Data Lake location to a Databricks Cluster

Mounting a Data Lake location to the cluster is the preferred way to connect to the source data securely. 
Note that this is a one-time operation: once the mount point is created in the cluster, anyone who has access to the cluster can use it. 
This is ideal for security, since an admin creates the mount point and the data engineers and scientists simply use it. Nobody has to provide authentication in each notebook each time. 

For individual access to files in a Data Lake directly without a mount point, see [Access ADLS Gen2 directly](https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly)

# Using Blob Endpoint and Storage Account Key

```
dbutils.fs.mount(
  source = "wasbs://blob-container@bdpgen2datalake.blob.core.windows.net/blob-storage",
  mount_point = "/mnt/blob-storage",
  extra_configs = {
    "fs.azure.account.key.bdpgen2datalake.blob.core.windows.net":
      dbutils.secrets.get(scope = "databricks-secret-scope", key = "blob-container-key")
  }
)
```
                                        
                                        
`dbutils` is automatically imported and available in the notebook, so you don't need to import the library explicitly.
* `fs.mount` is the standard method in this library for mounting an external location to the cluster

## Source
Note: it appears that you cannot access the ABFSS endpoint of the data lake using the storage account key; the code throws an error. You will need a Service Principal to do that. The code below therefore uses a storage account key to access the Blob Storage API (**wasbs**) of the data lake. 

1. Prerequisites
	1. Data Lake already defined
	2. Key Vault-backed secret scope already defined
2. Create the mount in Azure Databricks using Python
	1. `dbutils.fs.mount(source = "wasbs://blob-container@bdpgen2datalake.blob.core.windows.net/blob-storage",mount_point = "/mnt/blob-storage",extra_configs = {"fs.azure.account.key.bdpgen2datalake.blob.core.windows.net":dbutils.secrets.get(scope = "databricks-secret-scope",key = "blob-container-key")})`
	2. Standard URL: `"wasbs://blob-container@bdpgen2datalake.blob.core.windows.net/blob-storage"`
	3. In general the following is true, **but it appears this code only works with the blob.core.windows.net and wasbs configs**:
		1. **dfs.core.windows.net** is the endpoint used for Data Lakes
		2. **blob.core.windows.net** is the endpoint used for Blob Storage
		3. **ABFS[S]** is the scheme used for Azure Data Lake Storage Gen2, which is built on a normal Azure storage account (enable Hierarchical namespace when creating the storage account to get Azure Data Lake Storage Gen2).
		4. **WASB[S]** is the scheme used for normal Azure storage.
3. For the conf key (which tells Spark how you are authenticating):
	1. use `fs.azure.account.key.<storage-account-name>.blob.core.windows.net` when you authenticate with an account key
	2. use `fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net` when you authenticate with a SAS token
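
The naming rules in the list above can be captured in a small helper. This is a minimal sketch rather than part of the original notebook: the function name `wasbs_mount_args` and its parameters are hypothetical, and it only builds strings. The credential itself would still come from `dbutils.secrets.get`, and the mount would still be created with `dbutils.fs.mount`.

```python
def wasbs_mount_args(container, account, directory="", auth="key"):
    """Build the WASBS source URL and the matching Spark conf key.

    Hypothetical helper for illustration only.
    `auth` is "key" for a storage account key or "sas" for a SAS token.
    """
    host = f"{account}.blob.core.windows.net"
    source = f"wasbs://{container}@{host}/{directory.strip('/')}"
    if auth == "key":
        conf_key = f"fs.azure.account.key.{host}"
    elif auth == "sas":
        conf_key = f"fs.azure.sas.{container}.{host}"
    else:
        raise ValueError(f"unsupported auth type: {auth!r}")
    return source, conf_key


source, conf_key = wasbs_mount_args("blob-container", "bdpgen2datalake", "blob-storage")
print(source)    # wasbs://blob-container@bdpgen2datalake.blob.core.windows.net/blob-storage
print(conf_key)  # fs.azure.account.key.bdpgen2datalake.blob.core.windows.net
```

In a notebook, `source` would be passed to `dbutils.fs.mount` as is, and `conf_key` would be the single key of the `extra_configs` dictionary, with the secret value as its value.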

```
# Mount the location
dbutils.fs.mount(
  source = "wasbs://rawdata@dianrandddatalake.blob.core.windows.net/MockCSVFiles", # URL of the external location in the data lake that you want to mount
  mount_point = "/mnt/datalake_rawdata_MockCSVFiles", # Path in the Databricks file system where the mount will be accessed from now on; the /mnt/ prefix is the convention
  extra_configs = {"fs.azure.account.key.dianrandddatalake.blob.core.windows.net" : dbutils.secrets.get(scope = "DianGRAndDKeyVault", key = "dianrandddatalake-accountkey")}
)
```

```
# Unmount the location
dbutils.fs.unmount("/mnt/datalake_rawdata_MockCSVFiles")
```
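
Calling `dbutils.fs.unmount` on a path that is not currently mounted typically raises an error, so a guarded version can be convenient in notebooks that are re-run. The sketch below is an illustration, not part of the original notebook: `unmount_if_mounted` is a hypothetical name, and `dbutils` is passed in as an argument on the assumption that the helper may live outside the notebook itself.

```python
def unmount_if_mounted(dbutils, mount_point):
    """Unmount `mount_point` only if it is currently mounted.

    Hypothetical helper. Returns True if an unmount was performed,
    False if the mount point did not exist.
    """
    # dbutils.fs.mounts() lists existing mounts; each entry exposes .mountPoint
    mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())
    if mounted:
        dbutils.fs.unmount(mount_point)
    return mounted
```

Inside a notebook this would be called as `unmount_if_mounted(dbutils, "/mnt/datalake_rawdata_MockCSVFiles")`.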

# Using ADLS Endpoint and Service Principal (Recommended)

* It appears the only way to mount using the ADLS Gen2 API is to use a Service Principal. 
* You can use the code below as a standard in all notebooks, as no sensitive information is stored in it. 
* All sensitive information is retrieved from Key Vault. 
* The code can be converted into a parameterised notebook that is called at the start of other notebooks. 
* The caller then only needs to pass in the new mount point directory as the mount target.
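
One way to parameterise this is to move the string building into a single function and keep only the secret lookups and the `dbutils.fs.mount` call in the notebook. The following is a minimal sketch under that assumption. The function name `build_oauth_mount_args` and its parameters are hypothetical; the URL patterns, config keys, and the `datalake_<container>_<folder>` naming are taken from the notebook cells in this section.

```python
def build_oauth_mount_args(account, container, tenant_id, client_id, client_secret, folder=""):
    """Return (source, mount_point, configs) for an ABFSS mount using OAuth.

    Hypothetical helper for illustration; it performs no I/O and no mounting.
    """
    folder = folder.strip("/")
    source = f"abfss://{container}@{account}.dfs.core.windows.net/{folder}"
    # Mount point convention: /mnt/datalake_<container>[_<folder>]
    suffix = "_" + folder.replace("/", "_") if folder else ""
    mount_point = f"/mnt/datalake_{container}{suffix}"
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return source, mount_point, configs


# Placeholder values only; in a notebook these would come from dbutils.secrets.get
source, mount_point, configs = build_oauth_mount_args(
    account="dianrandddatalake", container="rawdata",
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>",
)
print(source)       # abfss://rawdata@dianrandddatalake.dfs.core.windows.net/
print(mount_point)  # /mnt/datalake_rawdata
```

Keeping the builder free of `dbutils` calls means the calling notebook stays in control of which secret scope and secret names are used.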

## Mount a container as a whole - preferred

```
# Python code to mount and access an Azure Data Lake Storage Gen2 account from Azure Databricks with a Service Principal and OAuth

# Key Vault secret scope name
VarSecretScopeName = "DianGRAndDKeyVault" # This would be a fixed name set by a standard. Ideally there is only one secret scope for automated notebooks

# Define the variables used for creating connection strings - Data Lake related
varAdlsAccountName = dbutils.secrets.get(scope=VarSecretScopeName,key="dianrandddatalake-storageaccountname") # e.g. "dianrandddatalake", the storage account name itself
varAdlsContainerName = "rawdata" # Would be parameterised based on what the notebook is doing; hardcoded for now
varMountPoint = "/mnt/datalake_" + varAdlsContainerName # Would be parameterised based on what the notebook is doing - datalake_<adlsContainerName>

# Names of the Key Vault secrets that store the sensitive information needed for the connection via Service Principal auth
VarSecretClientID = "RandD-ServicePrinciple-ApplicationID" # Name of the generic Key Vault secret containing the Service Principal application (client) ID
VarSecretClientSecret = "RandD-ServicePrinciple-Password" # Name of the generic Key Vault secret containing the Service Principal password
VarSecretTenantID = "RandD-ServicePrinciple-TenantID" # Name of the generic Key Vault secret containing the tenant ID

# Get the actual secrets from Key Vault for the Service Principal
varApplicationId = dbutils.secrets.get(scope=VarSecretScopeName, key=VarSecretClientID) # Application (Client) ID
varAuthenticationKey = dbutils.secrets.get(scope=VarSecretScopeName, key=VarSecretClientSecret) # Application (Client) Secret Key
varTenantId = dbutils.secrets.get(scope=VarSecretScopeName, key=VarSecretTenantID) # Directory (Tenant) ID

# Using the secrets above, build the URL of the storage account and the authentication endpoint for OAuth
varEndpoint = "https://login.microsoftonline.com/" + varTenantId + "/oauth2/token" # Fixed URL pattern for the endpoint
varSource = "abfss://" + varAdlsContainerName + "@" + varAdlsAccountName + ".dfs.core.windows.net/"

# Connect using Service Principal secrets and OAuth
varConfigs = {"fs.azure.account.auth.type": "OAuth", # standard
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider", # standard
           "fs.azure.account.oauth2.client.id": varApplicationId,
           "fs.azure.account.oauth2.client.secret": varAuthenticationKey,
           "fs.azure.account.oauth2.client.endpoint": varEndpoint}

# Mount ADLS storage to DBFS only if the directory is not already mounted.
# dbutils.fs.mounts() returns a list of all existing mount points;
# the check below looks in that list for the mount point we are about to create.
if not any(mount.mountPoint == varMountPoint for mount in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = varSource,
    mount_point = varMountPoint,
    extra_configs = varConfigs)

# Print the mount point used, for troubleshooting
print("Mount Point: " + varMountPoint)
```

## Mount a specific directory inside a container

```
# Python code to mount and access an Azure Data Lake Storage Gen2 account from Azure Databricks with a Service Principal and OAuth

# Key Vault secret scope name
VarSecretScopeName = "DianGRAndDKeyVault" # This would be a fixed name set by a standard. Ideally there is only one secret scope for automated notebooks

# Define the variables used for creating connection strings - Data Lake related
varAdlsAccountName = dbutils.secrets.get(scope=VarSecretScopeName,key="dianrandddatalake-storageaccountname") # e.g. "dianrandddatalake", the storage account name itself
varAdlsContainerName = "rawdata" # Would be parameterised based on what the notebook is doing; hardcoded for now
varAdlsFolderName = "MockCSVFiles" # Would be parameterised based on what the notebook is doing; hardcoded for now
varMountPoint = "/mnt/datalake_rawdata_MockCSVFiles" # Would be parameterised based on what the notebook is doing - datalake_<adlsContainerName>_<adlsFolderName>

# Names of the Key Vault secrets that store the sensitive information needed for the connection via Service Principal auth
VarSecretClientID = "RandD-ServicePrinciple-ApplicationID" # Name of the generic Key Vault secret containing the Service Principal application (client) ID
VarSecretClientSecret = "RandD-ServicePrinciple-Password" # Name of the generic Key Vault secret containing the Service Principal password
VarSecretTenantID = "RandD-ServicePrinciple-TenantID" # Name of the generic Key Vault secret containing the tenant ID

# Get the actual secrets from Key Vault for the Service Principal
varApplicationId = dbutils.secrets.get(scope=VarSecretScopeName, key=VarSecretClientID) # Application (Client) ID
varAuthenticationKey = dbutils.secrets.get(scope=VarSecretScopeName, key=VarSecretClientSecret) # Application (Client) Secret Key
varTenantId = dbutils.secrets.get(scope=VarSecretScopeName, key=VarSecretTenantID) # Directory (Tenant) ID

# Using the secrets above, build the URL of the storage account folder and the authentication endpoint for OAuth
varEndpoint = "https://login.microsoftonline.com/" + varTenantId + "/oauth2/token" # Fixed URL pattern for the endpoint
varSource = "abfss://" + varAdlsContainerName + "@" + varAdlsAccountName + ".dfs.core.windows.net/" + varAdlsFolderName

# Connect using Service Principal secrets and OAuth
varConfigs = {"fs.azure.account.auth.type": "OAuth", # standard
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider", # standard
           "fs.azure.account.oauth2.client.id": varApplicationId,
           "fs.azure.account.oauth2.client.secret": varAuthenticationKey,
           "fs.azure.account.oauth2.client.endpoint": varEndpoint}

# Mount ADLS storage to DBFS only if the directory is not already mounted.
# dbutils.fs.mounts() returns a list of all existing mount points;
# the check below looks in that list for the mount point we are about to create.
if not any(mount.mountPoint == varMountPoint for mount in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = varSource,
    mount_point = varMountPoint,
    extra_configs = varConfigs)
```
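
The `if not any(...)` check above only compares mount point names, so it skips the mount even when an existing mount point with the same name points at a different source. The sketch below shows one way to distinguish those cases. It is an illustration only: `mount_status` is a hypothetical name, and it assumes each entry returned by `dbutils.fs.mounts()` exposes a `source` attribute alongside `mountPoint`, and that the stored source matches the string originally passed to `dbutils.fs.mount` apart from a trailing slash.

```python
def mount_status(dbutils, mount_point, source):
    """Classify a mount point as 'absent', 'ok', or 'conflict'.

    Hypothetical helper. 'conflict' means the mount point exists but
    points at a different source than the one requested.
    """
    for m in dbutils.fs.mounts():
        if m.mountPoint == mount_point:
            same = m.source.rstrip("/") == source.rstrip("/")
            return "ok" if same else "conflict"
    return "absent"
```

A notebook could then mount only when `mount_status(dbutils, varMountPoint, varSource)` returns `"absent"`, and fail loudly on `"conflict"` instead of silently reading from the wrong location.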