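## Access Azure Data Lake Storage Gen2
To access an Azure Data Lake Storage Gen2 storage account, we recommend that you set your account credentials in your notebook’s session configuration.
This example authenticates with the storage account key. The first cell downloads a sample text file to the driver's local disk; the second configures access and copies the file to the storage account.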

In [None]:
%sh
wget -P /tmp https://raw.githubusercontent.com/alagala/labs/azure/databricks/data/master/sample.txt

In [None]:
# Variable declarations. These will be accessible by any calling notebook.
keyVaultScope = "key-vault-secrets"
adlsAccountName = dbutils.secrets.get(keyVaultScope, "ADLS-Gen2-Name")
adlsAccountKey = dbutils.secrets.get(keyVaultScope, "ADLS-Gen2-Key")

fileSystemName = "data"
abfsUri = "abfss://" + fileSystemName + "@" + adlsAccountName + ".dfs.core.windows.net/"

# Register the storage account key (already retrieved from Key Vault above) in the Spark session.
spark.conf.set(
  "fs.azure.account.key." + adlsAccountName + ".dfs.core.windows.net",
  adlsAccountKey)

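# Copy the downloaded sample file from the driver's local disk to the ADLS Gen2 file system.
dbutils.fs.cp("file:///tmp/sample.txt", abfsUri)

# List the file system root to confirm the copy succeeded.
dbutils.fs.ls(abfsUri)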
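
The URI and the configuration key used in the cell above follow a fixed pattern, so they can be factored into small helpers. The sketch below is illustrative only: `abfs_uri` and `account_key_conf` are hypothetical helper names, not part of any Databricks API, and `mystorageacct` is a placeholder account name.

```python
def abfs_uri(file_system: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 file system."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def account_key_conf(account: str) -> str:
    """Name of the Spark setting that holds the account key for `account`."""
    return f"fs.azure.account.key.{account}.dfs.core.windows.net"


# Placeholder values; in the notebook these come from Key Vault.
print(abfs_uri("data", "mystorageacct"))
print(abfs_uri("data", "mystorageacct", "/delta/transactions"))
print(account_key_conf("mystorageacct"))
```

In the notebook, a call such as `spark.conf.set(account_key_conf(adlsAccountName), adlsAccountKey)` could then replace the string concatenation.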

Once this is set up, you can use standard Spark and Databricks APIs to read from the storage account. For example, the following cell counts how often each word occurs in the sample file:

In [None]:
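from pyspark.sql.functions import col, explode, lower, split

df = spark.read.text(abfsUri + "sample.txt")
# Split each lowercased line into words, then count the occurrences of each word.
words = df.select(explode(split(lower(col("value")), " ")).alias("value"))
groupedDf = words.groupBy("value").count()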

groupedDf.show(10)
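
To see what the word count in the cell above is meant to compute without a cluster, the same logic can be sketched in plain Python. This is not Spark code; the two sample lines are made up for illustration and stand in for the rows of the text DataFrame.

```python
from collections import Counter

# Made-up stand-in for the lines of sample.txt.
lines = ["The quick brown fox", "the lazy dog and the fox"]

# Lowercase each line, split on single spaces, and count every word.
word_counts = Counter(
    word
    for line in lines
    for word in line.lower().split(" ")
)

# Show the three most frequent words.
for word, count in word_counts.most_common(3):
    print(word, count)
```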

## Connect to Cosmos DB

Using the Azure Cosmos DB Spark Connector, you will now use Cosmos DB as an input source to retrieve a sample of transaction data. You will start by setting up a static connection to Cosmos DB and reading in a sample of the transaction data that is stored there.

To query Cosmos DB, you first need to create a configuration object that contains the connection information. For details on all of the options, see the [configuration reference](https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references).

The core items you need to provide are:

  - **Endpoint**: Your Cosmos DB URL (e.g. https://youraccount.documents.azure.com:443/).
  - **Masterkey**: The primary or secondary key string for your Cosmos DB account.
  - **Database**: The name of the database.
  - **Collection**: The name of the collection that you wish to query.

> **NOTE**: For this hands-on lab, you have already added the endpoint and master key into Azure Key Vault, so you will retrieve the values from there using `dbutils.secrets.get()`. If you chose to use different database and collection names, you will need to update those in the cell below prior to running it.

The `query_custom` property of the configuration is the query that will be executed against the transactions collection in Cosmos DB. For this example, you are only pulling in a small sample of the transaction data.

Run the cell to add the configuration needed to create a static connection to Cosmos DB.

In [2]:
# https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references
# https://docs.microsoft.com/en-us/azure/cosmos-db/spark-connector

# Write configuration
writeConfig = {
    "Endpoint" : dbutils.secrets.get(keyVaultScope, "Cosmos-DB-Cassandra-URI"),
    "Masterkey" : dbutils.secrets.get(keyVaultScope, "Cosmos-DB-Cassandra-Key"),
    "Database" : "DepartureDelays",
    "Collection" : "flights_fromsea",
    "Upsert" : "true"
}

# Write to Cosmos DB from the flights DataFrame
flights.write.format("com.microsoft.azure.cosmosdb.spark").options(**writeConfig).save()
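
The cell above writes to Cosmos DB. For the read path described earlier, a configuration with a `query_custom` entry could be sketched as follows. The key names follow the connector's configuration reference linked above; the helper name `cosmos_read_config`, the angle-bracket placeholders, and the query text are assumptions for illustration. In the lab, the endpoint and key would come from `dbutils.secrets.get()`.

```python
def cosmos_read_config(endpoint, master_key, database, collection, query):
    """Return an options dict for reading from Cosmos DB (hypothetical helper)."""
    return {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
        "query_custom": query,
    }


read_config = cosmos_read_config(
    endpoint="https://youraccount.documents.azure.com:443/",  # placeholder
    master_key="<your-master-key>",                           # placeholder
    database="<your-database>",                               # placeholder
    collection="<your-collection>",                           # placeholder
    query="SELECT TOP 1000 * FROM c",                         # assumed sample query
)
print(sorted(read_config))

# In a Databricks notebook with the connector installed, where `spark` is predefined:
# df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**read_config).load()
```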

## Persist the transaction data to an Azure Databricks Delta table

[Databricks Delta](https://docs.databricks.com/delta/delta-intro.html) delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks DBFS. The core abstraction of Delta is an optimized Spark table that:

  - Stores data as Parquet files in DBFS.
  - Maintains a transaction log that efficiently tracks changes to the table.

You read and write data stored in the delta format using the same familiar Apache Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories. With the addition of the transaction log and other enhancements, Delta offers significant benefits:

  - **ACID transactions**
    - Multiple writers can simultaneously modify a dataset and see consistent views. For qualifications, see Multi-cluster writes.
    - Writers can modify a dataset without interfering with jobs reading the dataset.
  - **Fast read access**
    - Automatic file management organizes data into large files that can be read efficiently.
    - Statistics can speed up reads by 10-100x, and data skipping avoids reading irrelevant data.
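
The idea behind data skipping can be illustrated with a toy model: if each file records the minimum and maximum of a column, a query with a range predicate only has to open the files whose range overlaps it. The sketch below is plain Python with invented file names and statistics; it illustrates the principle only and does not describe how Delta implements it internally.

```python
# Invented per-file statistics: (file name, min amount, max amount).
file_stats = [
    ("part-0000.parquet", 0, 99),
    ("part-0001.parquet", 100, 499),
    ("part-0002.parquet", 500, 5000),
]


def files_to_scan(stats, lower, upper):
    """Return the files whose [min, max] range overlaps [lower, upper]."""
    return [name for name, lo, hi in stats if hi >= lower and lo <= upper]


# A predicate such as `amount BETWEEN 120 AND 300` overlaps only one file here.
selected = files_to_scan(file_stats, 120, 300)
print(selected)
```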
    
To create your transactions Delta table, you will first write the cleaned dataset contained within the `static_transactions_clean` DataFrame to a folder in ADLS Gen2 in Databricks Delta format.

Let's break down the command in the cell below before running it.

  - `mode("overwrite")`: This tells the write operation to overwrite any existing Delta table stored in the specified location.
  - `format("delta")`: To save the data in Delta format, you specify "delta" with the `format()` option of the `write` command.
  - `partitionBy()`: When creating a new Delta table, you can optionally specify partition columns. Partitioning speeds up queries and DML statements whose predicates involve the partition columns. In this case, we are partitioning on the `ipCountryCode` column, which is the same field used to partition the data stored in Cosmos DB.
  - `save()`: The `save` command accepts a location, which is where the underlying files for the Delta table will be stored. In our case, this is a location in the ADLS Gen2 filesystem, which you provide as an `abfss://` URI.

In [9]:
static_transactions_clean.write.mode("overwrite").format("delta").partitionBy("ipCountryCode").save("abfss://" + fileSystemName + "@" + adlsAccountName + ".dfs.core.windows.net/delta/transactions")
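
Partitioning by `ipCountryCode` means rows are grouped into one directory per distinct value, named in the `column=value` style. The plain-Python sketch below mimics that grouping on a few invented records so the resulting layout is easier to picture. The records, the `transactionId` field, and the helper name `partition_dir` are illustrative assumptions; the real files are written by Spark, not by this code.

```python
from collections import defaultdict

# Invented sample records; only ipCountryCode comes from the lab's schema.
records = [
    {"transactionId": 1, "ipCountryCode": "us"},
    {"transactionId": 2, "ipCountryCode": "de"},
    {"transactionId": 3, "ipCountryCode": "us"},
]


def partition_dir(column, value):
    """Directory name for one partition value, in column=value style."""
    return f"{column}={value}"


# Group record ids by the partition directory they would land in.
layout = defaultdict(list)
for record in records:
    directory = partition_dir("ipCountryCode", record["ipCountryCode"])
    layout[directory].append(record["transactionId"])

for directory in sorted(layout):
    print(f"delta/transactions/{directory}/ -> rows {layout[directory]}")
```

A query that filters on `ipCountryCode` then only needs to read the matching directory.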

Now that you have saved the clean transaction data into an ADLS Gen2 filesystem location in Delta format, you can create a Databricks global table that is backed by the Delta location you created above. Notice that the `LOCATION` specified in the `CREATE TABLE` query is the same as the one you used to write the cleaned transaction data in Delta format above. Doing this allows the table in the Hive metastore to automatically inherit the schema, partitioning, and table properties of the existing data.

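**IMPORTANT**: You will need to add the name of your ADLS Gen2 account into the location value below before running the cell. Also make sure the file system name in the location (`transactions` below) matches the file system you wrote the Delta files to.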

In [11]:
%sql
CREATE TABLE transactions
USING DELTA
LOCATION 'abfss://transactions@<your-adls-gen2-account-name>.dfs.core.windows.net/delta/transactions' -- TODO: Replace <your-adls-gen2-account-name> with your ADLS Gen2 account name.
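
As an alternative to editing the account name by hand, the statement can be assembled in Python and passed to `spark.sql()`. The sketch below only builds the SQL string. The helper name `create_delta_table_sql` and the account name `mystorageacct` are placeholders, and the `IF NOT EXISTS` clause is an addition that is not present in the cell above.

```python
def create_delta_table_sql(table: str, location: str) -> str:
    """Build a CREATE TABLE statement for a Delta table at `location`."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n"
        f"USING DELTA\n"
        f"LOCATION '{location}'"
    )


# Placeholder account name; use the same location you wrote the Delta files to.
location = "abfss://transactions@mystorageacct.dfs.core.windows.net/delta/transactions"
statement = create_delta_table_sql("transactions", location)
print(statement)

# In a Databricks notebook, where `spark` is predefined:
# spark.sql(statement)
```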

You can use the `DESCRIBE DETAIL` SQL command to view information about schema, partitioning, table size, and so on.

In [13]:
%sql
DESCRIBE DETAIL transactions

Finally, you can use Spark SQL to query records in the Hive table.

In [15]:
%sql
SELECT * FROM transactions LIMIT 10