# Working with Datastores and Datasets in Azure ML

__Lab objectives:__

- creating and managing datastores (Datastore)
- registering and managing datasets (Datasets)

## Environment setup

Import the required modules and check the Azure ML SDK version:

In [26]:
import os

import azureml.core
from azureml.core import Workspace, Experiment, Dataset
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference
from msrest.exceptions import HttpOperationError

# Check core SDK version number
print(f'SDK version: {azureml.core.VERSION}')

SDK version: 1.19.0


Load the experiment configuration:

In [None]:
%run core.py

config = get_experiment_config('lab_2B')
config

## Connecting to the Azure ML Workspace

Connect to the Azure ML workspace:

In [28]:
ws = Workspace.from_config()
print(f'Successfully connected to Workspace: {ws.name}.')

Successfully connected to Workspace: ai-in-cloud-workspace.
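
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file. The sketch below does not call Azure at all; it only checks that such a file contains the three keys the SDK expects (`subscription_id`, `resource_group`, `workspace_name`). The helper name `missing_workspace_keys` and the placeholder values are assumptions made for illustration.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(config_path):
    """Return the required keys that are absent or empty in a config.json file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with a throwaway file; the values are placeholders, not real IDs.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "config.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump({"subscription_id": "<subscription-id>",
                   "resource_group": "<resource-group>"}, f)
    missing = missing_workspace_keys(sample_path)

print(missing)
```

Running a check like this before `Workspace.from_config()` can make a misconfigured file easier to diagnose than the connection error alone.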


## Working with Datastores

## Viewing existing Datastores

List the registered datastores, including the default datastore:

In [31]:
# ws.set_default_datastore('workspaceblobstore')

In [32]:
# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name)

print('\n')

# Get the default datastore
default_ds = ws.get_default_datastore()
print(f'Default Datastore: {default_ds.name}')

creditcardfraudstore
dogsimagesblob
azureml_globaldatasets
workspaceblobstore
workspacefilestore


Default Datastore: workspaceblobstore
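
The listing above prints the default datastore separately from the other names. As a small sketch, the two pieces of information could be combined so that the default is flagged inline. The helper `format_datastores` is hypothetical; in the notebook its arguments would come from `ws.datastores` and `ws.get_default_datastore().name`.

```python
def format_datastores(names, default_name):
    """Return one line per datastore, flagging the default one."""
    lines = []
    for name in sorted(names):
        marker = " (default)" if name == default_name else ""
        lines.append(f"{name}{marker}")
    return lines


# Names copied from the output above, used here instead of a live workspace.
names = ["creditcardfraudstore", "dogsimagesblob", "azureml_globaldatasets",
         "workspaceblobstore", "workspacefilestore"]
formatted = format_datastores(names, "workspaceblobstore")
for line in formatted:
    print(line)
```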


## Creating a new Datastore

Set the name of the new datastore and provide the details of the Azure Storage account that will host it:

In [33]:
datastore_name = config['core']['datastore_name']

# Azure Storage Account Info
storage_account_name = config['storage_account_name']
storage_container_name = config['storage_container_name']
storage_account_key = config['storage_account_key'] # WARN: insert your storage account key here
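
Rather than pasting the storage account key into the notebook or into a tracked configuration file, one option is to read it from an environment variable. The sketch below assumes a hypothetical variable name, `AZURE_STORAGE_KEY`, and a hypothetical helper, `read_secret`; a plain dict stands in for the real environment so the example runs anywhere.

```python
import os


def read_secret(var_name, environ=os.environ):
    """Return a secret from the environment, failing loudly if it is not set."""
    value = environ.get(var_name, "").strip()
    if not value:
        raise KeyError(f"Environment variable {var_name} is not set")
    return value


# AZURE_STORAGE_KEY is a hypothetical variable name chosen for this example;
# fake_env replaces os.environ so that no real secret is needed to run it.
fake_env = {"AZURE_STORAGE_KEY": "not-a-real-key"}
storage_account_key = read_secret("AZURE_STORAGE_KEY", environ=fake_env)
print("Key loaded:", bool(storage_account_key))
```

In the notebook the call would simply be `read_secret('AZURE_STORAGE_KEY')`, which falls back to `os.environ`.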

In [37]:
print(f'Creating {datastore_name} datastore linked to {storage_container_name} blob container in {storage_account_name} storage account.')

Creating winter_school_2020 datastore linked to aml-ws-data blob container in aiclouddata storage account.


Create the datastore if it does not already exist:

In [39]:
try:
    new_datastore = Datastore.get(ws, datastore_name)
    print(f'Blob Datastore with name {new_datastore.name} was found!')

except HttpOperationError:
    new_datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=datastore_name,
        account_name=storage_account_name,
        container_name=storage_container_name,
        account_key=storage_account_key)
    print(f'Registered blob datastore with name {new_datastore.name}')

Blob Datastore with name winter_school_2020 was found!
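
The cell above follows a get-or-register pattern: look the datastore up first and register it only when the lookup fails. The same control flow can be sketched without the SDK, using an in-memory dict as a stand-in for the workspace. `FakeRegistry`, `NotFoundError`, and `get_or_register` are hypothetical names introduced only for this illustration.

```python
class NotFoundError(Exception):
    """Stand-in for the HttpOperationError caught in the notebook cell."""


class FakeRegistry:
    """In-memory stand-in for the datastores registered in a workspace."""

    def __init__(self):
        self._items = {}

    def get(self, name):
        if name not in self._items:
            raise NotFoundError(name)
        return self._items[name]

    def register(self, name, container):
        self._items[name] = {"name": name, "container": container}
        return self._items[name]


def get_or_register(registry, name, container):
    """Return (datastore, created) without registering the same name twice."""
    try:
        return registry.get(name), False
    except NotFoundError:
        return registry.register(name, container), True


registry = FakeRegistry()
first, created_first = get_or_register(registry, "winter_school_2020", "aml-ws-data")
second, created_second = get_or_register(registry, "winter_school_2020", "aml-ws-data")
print(created_first, created_second)
```

The second call finds the existing entry, which is why re-running the notebook cell reports that the datastore was found instead of registering it again.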


Inspect the created datastore:

In [None]:
print(f'Datastore {new_datastore.name} based on {new_datastore.datastore_type} in storage account named {new_datastore.account_name}')

## Uploading data to the Datastore

Make the created datastore the default one (for convenience in the steps that follow):

In [40]:
ws.set_default_datastore(new_datastore.name)
ds = ws.get_default_datastore()

print(ds.name)

winter_school_2020


Upload the data (note: create the blob container first if it does not already exist):

In [41]:
ds.upload_files(
    files=['../data/diabetes_train.csv', '../data/diabetes_test.csv'],  # Upload the diabetes csv files in /data
    target_path='diabetes-data/',  # Put it in a folder path in the datastore
    overwrite=True,  # Replace existing files of the same name
    show_progress=True)

Uploading an estimated of 2 files
Uploading ../data/diabetes_test.csv
Uploaded ../data/diabetes_test.csv, 1 files out of an estimated total of 2
Uploading ../data/diabetes_train.csv
Uploaded ../data/diabetes_train.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_184a8815b8f04d98a517e0496827e8eb
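
The upload cell is given relative paths, which resolve against the notebook's working directory, so it can help to confirm that the files exist locally before starting the upload. The sketch below uses only the standard library; `split_existing` is a hypothetical helper, and the temporary files stand in for the real CSVs.

```python
import os
import tempfile


def split_existing(paths):
    """Split paths into (existing files, missing files), preserving order."""
    existing = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return existing, missing


with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, "diabetes_train.csv")
    test_path = os.path.join(tmp_dir, "diabetes_test.csv")
    # Only the training file is created, so the test file is reported missing.
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("PatientID,Diabetic\n")
    existing, missing = split_existing([train_path, test_path])

print(len(existing), len(missing))
```

In the notebook, the same check could be run on the list passed as `files=` and the upload skipped when `missing` is not empty.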

Create a tabular Dataset from the data uploaded to the datastore:

In [42]:
diabetes_ds = Dataset.Tabular.from_delimited_files(path=(ds, 'diabetes-data/*.csv'))
diabetes_ds

{
  "source": [
    "('winter_school_2020', 'diabetes-data/*.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}

## Registering the Dataset

In [43]:
diabetes_db = diabetes_ds.register(workspace = ws,
                                   name = config['core']['dataset_name'],
                                   description = 'Diabetes Disease Database (Winter School 2020)',
                                   create_new_version = True)

List the registered datasets:

In [47]:
print('Available datasets:')

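# Use a separate loop variable so that `ds` (the default datastore) is not overwritten
for dataset_name in ws.datasets:
    print(f'\t{dataset_name}')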

Available datasets:
	diabetes-data
	diabetes-batch-data
	credit-card-fraud
	covid19-spread-russia
	covid19-spread
	mnist-dataset
	Pima Indians Diabetes Database


## Viewing the Dataset

Load the registered dataset into a pandas DataFrame and display 10 randomly sampled rows:

In [46]:
diabetes_db_from_azure = ws.datasets.get(config['core']['dataset_name'])

diabetes_df = diabetes_db_from_azure.to_pandas_dataframe()
diabetes_df.sample(10)

| Unnamed: 0 | PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness | SerumInsulin | BMI | DiabetesPedigree | Age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|---|
| 14820 | 1233038 | 6 | 118 | 59 | 11 | 28 | 37.320076 | 0.126788 | 25 | 0 |
| 3602 | 1821648 | 1 | 87 | 49 | 9 | 31 | 37.64182 | 0.255898 | 23 | 0 |
| 4427 | 1615533 | 1 | 79 | 96 | 31 | 139 | 21.793499 | 0.126214 | 26 | 0 |
| 1942 | 1016702 | 0 | 59 | 55 | 7 | 24 | 35.997559 | 0.212584 | 22 | 0 |
| 5353 | 1521530 | 7 | 89 | 82 | 45 | 26 | 33.64065 | 0.253484 | 51 | 0 |
| 273 | 1814728 | 7 | 161 | 75 | 8 | 28 | 21.614417 | 0.148146 | 43 | 0 |
| 10906 | 1915154 | 1 | 159 | 49 | 46 | 176 | 41.34435 | 0.096023 | 21 | 0 |
| 12512 | 1753702 | 0 | 70 | 60 | 34 | 159 | 39.633064 | 0.097683 | 42 | 0 |
| 2023 | 1822332 | 3 | 128 | 67 | 53 | 487 | 46.352321 | 0.116897 | 21 | 1 |
| 4945 | 1101217 | 0 | 166 | 78 | 31 | 176 | 19.241181 | 0.104875 | 22 | 0 |
