## Accessing Data on a Compute Instance

This notebook shows how to access data that resides in an Azure cloud datastore (Blob, ADLS Gen2, SQL) from an AzureML compute instance.

### Import packages 

In [None]:
import tempfile
import os
import pandas as pd
from azureml.core import Dataset, Workspace, Datastore

### Step 1: Register the Datastore (if you haven't done so already)

_Datastores_ are an AzureML abstraction for cloud-based data stores such as Blob, ADLS, SQL, and Postgres. They are _registered_ in an AzureML workspace as a one-time operation, typically at the start of a project. The benefit of the datastore abstraction is that the credentials for the datastore are stored securely in Azure Key Vault, so you do not have to re-enter them every time you want to access data on the store.

An AzureML workspace comes with a default blob datastore named _workspaceblobstore_. If your data resides in a different store, you can register that store in the AzureML workspace. The easiest way to register a datastore is through the machine learning portal at https://ml.azure.com: go to __Datastores__ in the left-hand menu, select __+New Datastore__, and work through the on-screen instructions.
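
If you prefer to register a datastore from code rather than through the portal, the SDK exposes registration methods on `Datastore`. The following is a minimal sketch for a blob container, not a definitive recipe. `register_blob_datastore` is a hypothetical helper name, and every argument value is a placeholder that you would supply yourself.

```python
def register_blob_datastore(workspace, datastore_name, container_name,
                            account_name, account_key):
    """Register an Azure Blob container as a datastore in the given workspace.

    Hypothetical helper for illustration; all argument values are placeholders.
    """
    # Imported inside the function so this cell stands alone.
    from azureml.core import Datastore

    return Datastore.register_azure_blob_container(
        workspace=workspace,
        datastore_name=datastore_name,
        container_name=container_name,
        account_name=account_name,
        account_key=account_key,
    )
```

Passing the object returned by `Workspace.from_config()` as `workspace` would register the container in the current workspace. Avoid hard-coding the account key in a notebook; read it from a secure location instead.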

### Step 2: Access Data

Enter the datastore name and the path on the datastore from which you want to create a dataset.

In [None]:
DATASTORE_NAME = "" # the default datastore is called workspaceblobstore.
PATH_ON_DATASTORE = ""

ws = Workspace.from_config()
my_datastore = Datastore.get(ws, DATASTORE_NAME)
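
If `DATASTORE_NAME` is left empty, the lookup above has no name to resolve. One possible guard, sketched below under the assumption that falling back to the workspace default is what you want, uses `Workspace.get_default_datastore()`. The helper name `resolve_datastore` is hypothetical.

```python
def resolve_datastore(workspace, datastore_name=""):
    """Return the named datastore, or the workspace default if no name is given.

    Hypothetical helper for illustration.
    """
    if not datastore_name:
        # Falls back to the workspace default (workspaceblobstore unless changed).
        return workspace.get_default_datastore()

    # Imported inside the function so this cell stands alone.
    from azureml.core import Datastore

    return Datastore.get(workspace, datastore_name)
```

With this in place, `my_datastore = resolve_datastore(ws, DATASTORE_NAME)` would behave like the cell above when a name is set.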

#### If your data is tabular then start here

_TabularDataset_ represents data in a tabular format created by parsing the provided file or list of files. Below we create an AzureML Dataset object from the path that you defined above (we assume the data is in Parquet format) and then render the first 10 rows as a pandas dataframe. We also show how to read data from delimited files and from SQL; uncomment the option that best matches your data.

In [None]:
dataset = Dataset.Tabular.from_parquet_files(path=[(my_datastore, PATH_ON_DATASTORE)])
dataset.take(10).to_pandas_dataframe()

# to render the entire dataset into a pandas dataframe use:
# df = dataset.to_pandas_dataframe()

# to render into a spark dataframe use:
# df = dataset.to_spark_dataframe()

# to render a random sample of the dataset into a pandas dataframe use:
# df = dataset.take_sample(probability=0.10).to_pandas_dataframe()

# to create a dataset and pandas dataframe from delimited files use:
# dataset = Dataset.Tabular.from_delimited_files(path=[(my_datastore, PATH_ON_DATASTORE)])
# df = dataset.to_pandas_dataframe()
# df.head(5) # show first 5 rows

# to create a dataset and pandas dataframe from a SQL query use
# (my_datastore must be a SQL datastore; the query is passed together with it):
# q = """SELECT * FROM TABLE"""
# dataset = Dataset.Tabular.from_sql_query((my_datastore, q))
# df = dataset.to_pandas_dataframe()
# df.head(5) # show first 5 rows
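
As an alternative to commenting lines in and out, the file-based options above could be wrapped in one small function. This is only a sketch: `load_tabular` and its `file_format` argument are hypothetical names, and it covers just the Parquet and delimited cases shown in the cell.

```python
def load_tabular(datastore, path_on_datastore, file_format="parquet"):
    """Create a TabularDataset from files on a datastore.

    Hypothetical helper; file_format is either "parquet" or "delimited".
    """
    # Imported inside the function so this cell stands alone.
    from azureml.core import Dataset

    path = [(datastore, path_on_datastore)]
    if file_format == "parquet":
        return Dataset.Tabular.from_parquet_files(path=path)
    if file_format == "delimited":
        return Dataset.Tabular.from_delimited_files(path=path)
    raise ValueError(f"Unsupported file_format: {file_format!r}")
```

For example, `load_tabular(my_datastore, PATH_ON_DATASTORE, "delimited").take(10).to_pandas_dataframe()` would preview the first 10 rows of a set of CSV files.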

#### If your data is file-based then start here

_FileDataset_ references one or more files in datastores or at public URLs, for example a folder of images on which you want to train a deep-learning model. Below we create a file dataset from the path that you defined above and then __mount__ the files onto the compute instance (in a temporary directory, typically under /tmp).

The variable `mount_folder` holds the path of the folder where the data is mounted. Below we use `os.listdir(mount_folder)` to display the files in that directory.

In [None]:
dset = Dataset.File.from_files(path=[(my_datastore, PATH_ON_DATASTORE)])

mount_folder = tempfile.mkdtemp()
mount_context = dset.mount(mount_folder)
mount_context.start()

print('files are mounted at: ' + mount_folder)
os.listdir(mount_folder)


# to unmount the folder from the compute instance use:
# mount_context.stop()

# to mount only the first 10 files, unmount and then remount with:
# mount_context.stop()
# mount_context = dset.take(10).mount(mount_folder)
# mount_context.start()
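
`os.listdir` shows only the top level of the mounted folder. If your files sit in nested subfolders (for example, one folder per image class), a recursive listing may be more useful. The sketch below uses only the standard library; `list_files` is a hypothetical helper name, and the extension filter is an illustrative assumption.

```python
import os


def list_files(root, extensions=None):
    """Return sorted paths, relative to root, of all files under root.

    extensions is an optional set of lowercase suffixes such as {".jpg", ".png"}.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            suffix = os.path.splitext(name)[1].lower()
            if extensions is None or suffix in extensions:
                full_path = os.path.join(dirpath, name)
                matches.append(os.path.relpath(full_path, root))
    return sorted(matches)
```

Calling `list_files(mount_folder, {".jpg", ".png"})` after `mount_context.start()` would list only the image files under the mount point.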

### From this point forward, continue as normal

From this point you can process your data just as you normally would on a local machine or server.