# Work with Data Assets

Data is the foundation on which machine learning models are built. Managing data centrally in the cloud, and making it accessible to teams of data scientists who run experiments and train models on multiple workstations and compute targets, is an important part of any professional data science solution.

In this notebook, you'll explore two Azure Machine Learning objects for working with data: *datastores* and *data assets*.

## Before you start

You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Run the cell below to verify that it is installed.

> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [None]:
pip show azure-ai-ml

## Connect to your workspace

With the required SDK packages installed, now you're ready to connect to your workspace.

To connect to a workspace, you need three identifier parameters: a subscription ID, a resource group name, and a workspace name. Since you're working with a compute instance managed by Azure Machine Learning, you can use the default values to connect to the workspace.

In [None]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()


In [None]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)
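
`MLClient.from_config` reads the workspace identifiers from a `config.json` file, which is typically already present on a compute instance. As a minimal sketch of what that file holds, assuming the usual keys `subscription_id`, `resource_group`, and `workspace_name`, the helper below loads the file and returns the three identifiers. The name `read_workspace_config` is hypothetical and not part of the SDK.

```python
import json
from pathlib import Path

# Keys assumed to be present in a workspace config.json file.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def read_workspace_config(path):
    """Load a workspace config.json and return its three identifiers.

    Hypothetical helper for illustration; it is not part of azure-ai-ml.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing: {', '.join(missing)}")
    # Ignore any extra keys and return only the identifiers.
    return {key: config[key] for key in REQUIRED_KEYS}
```

If you ever need to connect without `from_config`, the returned values could be passed to the `MLClient` constructor as `subscription_id`, `resource_group_name`, and `workspace_name`.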

## List the datastores

When you create the Azure Machine Learning workspace, an Azure Storage Account is created too. The Storage Account includes Blob and file storage, which are automatically connected to your workspace as **datastores**. You can list all datastores connected to your workspace:

In [None]:
stores = ml_client.datastores.list()
for ds in stores:
    print(ds.name)

Note the `workspaceblobstore` datastore, which connects to the **azureml-blobstore-...** container you explored earlier. The `workspacefilestore` datastore connects to the **code-...** file share.

## Create a datastore

Whenever you want to connect another Azure storage service to the Azure Machine Learning workspace, you can create a datastore. Note that creating a datastore only creates the connection between your workspace and the storage; it doesn't create the storage service itself.

To create a datastore and connect to an existing storage service, you'll need to specify:

- The class that indicates which type of storage service you want to connect to. The example below connects to Blob storage (`AzureBlobDatastore`).
- `name`: The display name of the datastore in the Azure Machine Learning workspace.
- `description`: Optional description to provide more information about the datastore.
- `account_name`: The name of the Azure Storage Account.
- `container_name`: The name of the container in the Azure Storage Account that stores the blobs.
- `credentials`: Provide the method of authentication and the credentials to authenticate. The example below uses an account key.

**Important**:
- Replace the **YOUR-STORAGE-ACCOUNT-NAME** placeholder (including the angle brackets) with the name of the Storage Account that was automatically created for you.
- Replace the **XXXX-XXXX** placeholder for `account_key` (including the angle brackets) with the account key of your Azure Storage Account.

Remember that you can retrieve the account key from the [Azure portal](https://portal.azure.com): go to your Storage Account and, on the **Access keys** tab, copy the **Key** value for key1 or key2.

In [None]:
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration

store = AzureBlobDatastore(
    name="blob_images_datastore",
    description="Blob Storage for images training data",
    account_name="<YOUR-STORAGE-ACCOUNT-NAME>",
    container_name="images-data", 
    credentials=AccountKeyConfiguration(
        account_key="<XXXX-XXXX>"
    ),
)

ml_client.create_or_update(store)
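
Pasting an account key into a notebook cell makes it easy to leak the key through version control or shared notebooks. One common alternative, sketched below, is to read the key from an environment variable. The variable name `STORAGE_ACCOUNT_KEY` and the helper `get_account_key` are assumptions made for this example, not names defined by Azure Machine Learning.

```python
import os


def get_account_key(env_var="STORAGE_ACCOUNT_KEY"):
    """Return the storage account key stored in an environment variable.

    The variable name is a placeholder chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the datastore."
        )
    return key
```

You could then pass `account_key=get_account_key()` to `AccountKeyConfiguration` instead of a literal string.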

List the datastores again to verify that a new datastore named `blob_images_datastore` has been created:

In [None]:
stores = ml_client.datastores.list()
for ds in stores:
    print(ds.name)

## Create data assets

To point to a specific folder or file in a datastore, you can create data assets. There are three types of data assets:

- `URI_FILE` points to a specific file.
- `URI_FOLDER` points to a specific folder.
- `MLTABLE` points to an MLTable file, which specifies how to read one or more files within a folder.
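
As a rough illustration of how the three types differ, the sketch below picks a type for a local path: a file maps to `uri_file`, a folder that contains a file named `MLTable` maps to `mltable`, and any other folder maps to `uri_folder`. The function name `guess_asset_type` is hypothetical, and the returned strings are assumed to match the values of the `AssetTypes` constants in **azure-ai-ml**.

```python
from pathlib import Path


def guess_asset_type(path):
    """Suggest a data asset type for a local path (illustrative heuristic only)."""
    p = Path(path)
    if p.is_file():
        return "uri_file"
    if p.is_dir():
        # A folder is treated as an MLTable asset when it holds an MLTable file.
        if (p / "MLTable").is_file():
            return "mltable"
        return "uri_folder"
    raise FileNotFoundError(f"{path} does not exist")
```

This is only a heuristic for local paths; for cloud paths you choose the type yourself when you define the data asset.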

You'll now create a `URI_FOLDER` data asset.

To create a `URI_FOLDER` data asset, you have to specify a path that points to a specific folder. The path can be a local path or cloud path.

In the example below, you'll create a data asset by referencing a *cloud* path. The path doesn't have to exist yet. The folder will be created when data is uploaded to the path.

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

blob_images_datastore_path = 'azureml://datastores/blob_images_datastore/paths/'

my_data = Data(
    path=blob_images_datastore_path,
    type=AssetTypes.URI_FOLDER,
    description="Data asset pointing to images data-asset-path folder in datastore",
    name="images-data-asset"
)

ml_client.data.create_or_update(my_data)
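
The cloud path above follows the pattern `azureml://datastores/<datastore-name>/paths/<path-in-datastore>`. As a small sketch, the hypothetical helper `datastore_uri` below builds such a path from its parts, which can help avoid typos when you reference several folders in the same datastore.

```python
def datastore_uri(datastore_name, relative_path=""):
    """Build an azureml:// path that points into a datastore.

    Hypothetical helper for illustration; it is not part of azure-ai-ml.
    """
    if not datastore_name or "/" in datastore_name:
        raise ValueError("datastore_name must be a non-empty name without slashes")
    # Drop leading slashes so the result never contains 'paths//'.
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


# Reproduces the path used for the data asset in this notebook.
print(datastore_uri("blob_images_datastore"))
# prints azureml://datastores/blob_images_datastore/paths/
```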

In [None]:
datasets = ml_client.data.list()
for asset in datasets:
    print(asset.name)

## Use data in a job

After using a notebook for experimentation, you can use scripts to train machine learning models. A script can be run as a job, and for each job you can specify inputs and outputs.

You can use either **data assets** or **datastore paths** as inputs or outputs of a job. 

The cells below create the **read_imagaes_data.py** script in the **src** folder. The script takes the path of the input data asset as an argument and prints the names of the files and folders it finds at that path.

In [None]:
import os

# create a folder for the script files
script_folder = 'src'
os.makedirs(script_folder, exist_ok=True)
print(script_folder, 'folder created')

In [None]:
%%writefile $script_folder/read_imagaes_data.py
# import libraries
import argparse
import os

def main(args):
    print(f"analyzing data asset: {args.images_data_asset}")
    
    # List all files and directories in the specified path
    contents = os.listdir(args.images_data_asset)

    # Print the contents
    for item in contents:
        print(item)
    

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()
    
    parser.add_argument("--images_data_asset", dest='images_data_asset',
                        type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args

# run script
if __name__ == "__main__":
    # add space in logs
    print("\n\n")
    print("*" * 60)

    # parse args
    args = parse_args()
    
    print(f"args: {args}")

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")
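
Before submitting the script as a job, it can help to exercise the same logic locally. The sketch below is not the script itself: it repeats the argument parsing and the directory listing in two functions that accept explicit arguments, so they can be called from a notebook cell or a test. The temporary folder and the file names are made up for the demonstration.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    """Parse the same --images_data_asset argument the script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--images_data_asset", dest="images_data_asset", type=str)
    return parser.parse_args(argv)


def list_contents(folder):
    """Return the sorted names of the files and folders inside `folder`."""
    return sorted(os.listdir(folder))


# Simulate the job input with a temporary folder holding two empty files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("cat.jpg", "dog.jpg"):
        open(os.path.join(folder, name), "w").close()
    args = parse_args(["--images_data_asset", folder])
    print(list_contents(args.images_data_asset))
    # prints ['cat.jpg', 'dog.jpg']
```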


To submit a job that runs the **read_imagaes_data.py** script, run the cell below. 

The job is configured to use the data asset `images-data-asset` as input. The data asset points to the **images-data** container through the `blob_images_datastore` datastore. The job defines no outputs; the script only prints the contents of the input folder to the job logs.

## InputOutputModes.RO_MOUNT - Read-only mount on the compute target

In [None]:
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import command

# ==============================================================
# Set the mode. The popular modes include:
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target
# ==============================================================
mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target


# configure the job input
my_job_inputs = {
    "images_data_asset": Input(type=AssetTypes.URI_FOLDER,
                            path="images-data-asset:1",
                            mode=mode)
}

# configure job
job = command(
    code="./src",
    command="python read_imagaes_data.py --images_data_asset ${{inputs.images_data_asset}}",
    inputs=my_job_inputs,
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="cpu-cluster-DS3v2",
    display_name="explore-images-data-RO_MOUNT",
    experiment_name="explore-images-data"
)

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)

## InputOutputModes.DOWNLOAD - Download the data to the compute target

In [None]:
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import command

# ==============================================================
# Set the mode. The popular modes include:
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target
# ==============================================================
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
mode = InputOutputModes.DOWNLOAD # Download the data to the compute target


# configure the job input
my_job_inputs = {
    "images_data_asset": Input(type=AssetTypes.URI_FOLDER,
                            path="images-data-asset:1",
                            mode=mode)
}

# configure job
job = command(
    code="./src",
    command="python read_imagaes_data.py --images_data_asset ${{inputs.images_data_asset}}",
    inputs=my_job_inputs,
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="cpu-cluster-DS3v2",
    display_name="explore-images-data-DOWNLOAD",
    experiment_name="explore-images-data"
)

# submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)
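
In both jobs, the `${{inputs.images_data_asset}}` placeholder in the command is replaced at run time with the location where the input is made available on the compute target, whether it was mounted or downloaded. The sketch below only imitates that substitution with a regular expression to show what the script ends up receiving. The actual resolution is done by Azure Machine Learning, and both the helper `render_command` and the path `/mnt/example/images` are invented for the illustration.

```python
import re

# Matches placeholders such as ${{inputs.images_data_asset}}.
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def render_command(command, input_paths):
    """Replace each ${{inputs.<name>}} placeholder with a concrete path."""

    def replace(match):
        name = match.group(1)
        if name not in input_paths:
            raise KeyError(f"no path provided for input '{name}'")
        return input_paths[name]

    return PLACEHOLDER.sub(replace, command)


command = "python read_imagaes_data.py --images_data_asset ${{inputs.images_data_asset}}"
print(render_command(command, {"images_data_asset": "/mnt/example/images"}))
# prints python read_imagaes_data.py --images_data_asset /mnt/example/images
```

Because the script only receives a folder path, the same code works with either mode; what changes is how the data reaches the compute target.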