# Working with Datastores and Datasets in Azure ML

__Lab objectives:__

- create and manage Datastores
- register and manage Datasets

## Preparing the environment

Import the required modules and check the Azure ML SDK version:

In [1]:
import os

import azureml.core
from azureml.core import Workspace, Experiment, Dataset
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference
from msrest.exceptions import HttpOperationError

# Check core SDK version number
print(f'SDK version: {azureml.core.VERSION}')

Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception [Errno 2] No such file or directory: '/anaconda/envs/azureml_py36/lib/python3.6/site-packages/cryptography-3.0.dist-info/METADATA'.


SDK version: 1.14.0


Load the experiment configuration:

In [2]:
%run core.py

config = get_experiment_config('lab_2B')
config

{'experiment_name': 'datasets-experiment',
 'storage_account_name': 'aiclouddata',
 'storage_container_name': 'aml-ws-data',
 'storage_account_key': '<redacted>',
 'core': {'expriments_root_dir': 'expriments/',
  'datastore_name': 'aml_ws_datastore_v2',
  'dataset_name': 'diabetes-data',
  'ml_cluster_name': 'aml-ws-cluster',
  'ml_model_name': 'diabetes-predict-model'}}
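
Displaying `config` directly echoes the value of `storage_account_key` into the notebook output. One way to avoid that is to mask secret-looking values before printing. The sketch below is only an illustration: the `mask_secrets` helper and the suffix list are assumptions, not part of `core.py`, and the example dictionary uses a placeholder key.

```python
from copy import deepcopy

# Assumption: any key whose name ends with one of these suffixes holds a secret.
SECRET_SUFFIXES = ("_key", "_secret", "_password")


def mask_secrets(config: dict) -> dict:
    """Return a copy of `config` with secret-looking values replaced by '***'."""
    masked = deepcopy(config)
    for key, value in masked.items():
        if isinstance(value, dict):
            masked[key] = mask_secrets(value)  # recurse into nested sections
        elif key.endswith(SECRET_SUFFIXES):
            masked[key] = "***"
    return masked


# Hypothetical config with the same shape as the one shown above.
example = {
    "storage_account_name": "aiclouddata",
    "storage_account_key": "not-a-real-key",
    "core": {"datastore_name": "aml_ws_datastore_v2"},
}
print(mask_secrets(example))
```

The original dictionary is left untouched, so the real key stays available for the later registration step.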

## Connecting to the Azure ML Workspace

Connect to the Workspace in Azure ML:

In [3]:
ws = Workspace.from_config()
print(f'Successfully connected to Workspace: {ws.name}.')

Successfully connected to Workspace: aml-workshop.
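
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file, which normally contains the keys `subscription_id`, `resource_group`, and `workspace_name`. If the connection fails, a quick check of that file can help. The following is a minimal sketch using only the standard library; the helper name `missing_workspace_keys` and the placeholder values are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def missing_workspace_keys(config_path: Path) -> set:
    """Return the required keys that are absent from a workspace config file."""
    with open(config_path, encoding="utf-8") as f:
        data = json.load(f)
    return REQUIRED_KEYS - set(data)


# Demonstration with placeholder values written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({
        "subscription_id": "<subscription-id>",
        "resource_group": "<resource-group>",
        "workspace_name": "aml-workshop",
    }), encoding="utf-8")
    missing = missing_workspace_keys(path)
    print(missing or "config.json looks complete")
```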


## Working with Datastores

## Viewing existing Datastores

List the registered Datastores, including the default Datastore:

In [4]:
# List the names of all registered datastores
for ds_name in ws.datastores:
    print(ds_name)

print('\n')

# Get the default datastore
default_ds = ws.get_default_datastore()
print(f'Default Datastore: {default_ds.name}')

aml_ws_datastore_ui
azureml_globaldatasets
workspacefilestore
workspaceblobstore


Default Datastore: aml_ws_datastore_ui
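
The listing and the default datastore are printed separately above. As a small variation, the default can be flagged inline. This is a sketch that works on plain names so it runs without a workspace; `format_datastore_listing` is a hypothetical helper, and the names are copied from the output above.

```python
def format_datastore_listing(names, default_name):
    """Return one line per datastore, flagging the default one."""
    return [
        f"{name} (default)" if name == default_name else name
        for name in names
    ]


# With a live workspace one could pass `ws.datastores` and
# `ws.get_default_datastore().name` instead of these literals.
names = ["aml_ws_datastore_ui", "azureml_globaldatasets",
         "workspacefilestore", "workspaceblobstore"]
for line in format_datastore_listing(names, "aml_ws_datastore_ui"):
    print(line)
```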


## Creating a new Datastore

Set the name of the new Datastore and specify the Azure Storage Account in which it will be located:

In [5]:
datastore_name = config['core']['datastore_name']

# Azure Storage Account Info
storage_account_name = config['storage_account_name']
storage_container_name = config['storage_container_name']
storage_account_key = config['storage_account_key'] # WARN: insert your storage account key here

In [6]:
print(datastore_name, storage_account_name, storage_container_name)

aml_ws_datastore_v2 aiclouddata aml-ws-data


Create the Datastore if it does not already exist:

In [7]:
try:
    new_datastore = Datastore.get(ws, datastore_name)
    print(f'Blob Datastore with name {new_datastore.name} was found!')

except HttpOperationError:
    new_datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=datastore_name,
        account_name=storage_account_name,
        container_name=storage_container_name,
        account_key=storage_account_key)
    print(f'Registered blob datastore with name {new_datastore.name}')

Registered blob datastore with name aml_ws_datastore_v2
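
The cell above follows a "get, and register on failure" pattern. The sketch below isolates that pattern in a generic helper so it can be tried without Azure access. The helper `get_or_create` and the in-memory `registry` are stand-ins invented for this illustration; they are not SDK functions.

```python
def get_or_create(get, create, not_found_errors):
    """Return (resource, created): try `get()`, fall back to `create()`."""
    try:
        return get(), False
    except not_found_errors:
        return create(), True


# Stand-ins for the SDK calls, so the sketch runs without a workspace.
registry = {}


def fake_get():
    return registry["aml_ws_datastore_v2"]  # raises KeyError if missing


def fake_register():
    registry["aml_ws_datastore_v2"] = "datastore-object"
    return registry["aml_ws_datastore_v2"]


first = get_or_create(fake_get, fake_register, KeyError)
second = get_or_create(fake_get, fake_register, KeyError)
print(first, second)
```

The first call registers the resource and the second call finds it. In the notebook, `get` would wrap `Datastore.get(ws, datastore_name)`, `create` would wrap the `Datastore.register_azure_blob_container(...)` call, and `not_found_errors` would be `HttpOperationError`, as in the cell above.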


Print information about the created Datastore:

In [8]:
print(f'Datastore {new_datastore.name} based on {new_datastore.datastore_type} in storage account named {new_datastore.account_name}')

Datastore aml_ws_datastore_v2 based on AzureBlob in storage account named aiclouddata


## Uploading data to the Datastore

Make the new Datastore the default one (for convenience in the following steps):

In [9]:
ws.set_default_datastore(new_datastore.name)
ds = ws.get_default_datastore()

print(ds.name)

aml_ws_datastore_v2


Upload the data (note: create the storage container first if it does not already exist):

In [11]:
ds.upload_files(
    files=['../data/diabetes_train.csv', '../data/diabetes_test.csv'],  # Upload the diabetes csv files in /data
    target_path='diabetes-data/',  # Put it in a folder path in the datastore
    overwrite=True,  # Replace existing files of the same name
    show_progress=True)

Uploading an estimated of 2 files
Uploading ../data/diabetes_train.csv
Uploaded ../data/diabetes_train.csv, 1 files out of an estimated total of 2
Uploading ../data/diabetes_test.csv
Uploaded ../data/diabetes_test.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_5163cd174e924d4eacb97298627f683e
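
Because the upload takes relative local paths, it can be useful to confirm that the files exist before calling `upload_files`. A minimal sketch, assuming nothing beyond the standard library; `find_missing_files` is a hypothetical helper, and the demonstration uses a temporary directory rather than `../data`.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the paths that do not point to an existing regular file."""
    return [p for p in paths if not os.path.isfile(p)]


# Demonstration: one file exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "diabetes_train.csv")
    absent = os.path.join(tmp, "diabetes_test.csv")
    with open(existing, "w", encoding="utf-8") as f:
        f.write("PatientID,Diabetic\n")
    missing = find_missing_files([existing, absent])
    print(missing)
```

An empty result means every listed file is present and the upload can proceed.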

Create a tabular Dataset from the data uploaded to the Datastore:

In [12]:
diabetes_ds = Dataset.Tabular.from_delimited_files(path=(ds, 'diabetes-data/*.csv'))
diabetes_ds

{
  "source": [
    "('aml_ws_datastore_v2', 'diabetes-data/*.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}
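
The path `diabetes-data/*.csv` is a wildcard pattern, so the dataset picks up both uploaded CSV files. The sketch below illustrates the intent of that pattern with `pathlib` matching from the standard library. Azure ML's own matching rules may differ in details, and the last two candidate paths are hypothetical.

```python
from pathlib import PurePosixPath

PATTERN = "diabetes-data/*.csv"

candidates = [
    "diabetes-data/diabetes_train.csv",
    "diabetes-data/diabetes_test.csv",
    "diabetes-data/readme.txt",       # hypothetical: wrong extension
    "diabetes-data/archive/old.csv",  # hypothetical: nested one level deeper
]

# Keep only the paths that match the folder and the .csv extension.
selected = [p for p in candidates if PurePosixPath(p).match(PATTERN)]
print(selected)
```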

## Registering the Dataset

In [13]:
diabetes_db = diabetes_ds.register(workspace=ws,
                                   name=config['core']['dataset_name'],
                                   description='Diabetes Disease Database',
                                   create_new_version=True)

List the registered Datasets:

In [14]:
print('Available datasets:')

# Use a separate loop variable so the datastore object `ds` is not overwritten
for dataset_name in ws.datasets:
    print(f'\t{dataset_name}')

Available datasets:
	diabetes-data
	diabetes-data-ui


## Viewing the Dataset

Load the registered Dataset into a pandas DataFrame and display 10 random rows:

In [15]:
diabetes_db_from_azure = ws.datasets.get(config['core']['dataset_name'])

diabetes_df = diabetes_db_from_azure.to_pandas_dataframe()
diabetes_df.sample(10)

| Unnamed: 0 | PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness | SerumInsulin | BMI | DiabetesPedigree | Age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|---|
| 12624 | 1675945 | 1 | 85 | 52 | 9 | 47 | 22.034829 | 0.8011 | 21 | 0 |
| 10094 | 1645997 | 7 | 114 | 83 | 43 | 152 | 42.163435 | 0.201104 | 60 | 1 |
| 751 | 1169427 | 8 | 126 | 81 | 30 | 22 | 21.838381 | 0.120111 | 23 | 0 |
| 1723 | 1889762 | 0 | 75 | 58 | 35 | 35 | 34.979228 | 0.096909 | 25 | 0 |
| 8155 | 1685002 | 2 | 101 | 78 | 13 | 180 | 28.126878 | 0.088584 | 30 | 1 |
| 241 | 1781692 | 9 | 125 | 99 | 15 | 56 | 29.496814 | 0.839732 | 22 | 1 |
| 5463 | 1917192 | 3 | 107 | 89 | 48 | 172 | 42.775847 | 0.268844 | 22 | 1 |
| 7706 | 1312440 | 6 | 150 | 70 | 34 | 49 | 21.110925 | 0.169408 | 26 | 0 |
| 5324 | 1352538 | 0 | 102 | 73 | 35 | 63 | 21.972607 | 0.759517 | 21 | 0 |
| 5372 | 1214282 | 12 | 97 | 78 | 51 | 256 | 48.286829 | 1.216489 | 47 | 1 |
