# Monitoring Data Drift

Over time, models can become less accurate due to changing trends in feature data. This phenomenon is known as *data drift*, and it's important to monitor your machine learning solution to detect it so you can retrain your models if necessary.

In this lab, you'll configure data drift monitoring for datasets.

## Before you start

In addition to the latest version of the **azureml-sdk** and **azureml-widgets** packages, you'll need the **azureml-datadrift** package to run the code in this notebook. Run the cell below to verify that it is installed.

In [3]:
!pip install azureml-datadrift

Collecting azureml-datadrift
  Downloading azureml_datadrift-1.34.0-py3-none-any.whl (99 kB)
Collecting azureml-dataset-runtime[fuse,pandas]~=1.34.0
  Downloading azureml_dataset_runtime-1.34.0-py3-none-any.whl (3.5 kB)
Collecting matplotlib<=3.2.1,>=3.0.2
  Downloading matplotlib-3.2.1-cp38-cp38-win_amd64.whl (9.2 MB)
Collecting azureml-pipeline-core~=1.34.0
  Downloading azureml_pipeline_core-1.34.0-py3-none-any.whl (312 kB)
Collecting lightgbm
  Downloading lightgbm-3.2.1-py3-none-win_amd64.whl (1.0 MB)
Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting azureml-core~=1.34.0
  Downloading azureml_core-1.34.0-py3-none-any.whl (2.2 MB)
Collecting azureml-telemetry~=1.34.0
  Downloading azureml_telemetry-1.34.0-py3-none-any.whl (30 kB)
Collecting azureml-dataprep<2.23.0a,>=2.22.0a
  Downloading azureml_dataprep-2.22.2-py3-none-any.whl (39.4 MB)
Collecting fusepy<4.0.0,>=3.0.1
  Using cached fusepy-3.0.1-py3-none-any.whl
Collecting azureml-dataprep-rslex~=1.20.0d

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
azureml-widgets 1.33.0 requires azureml-core~=1.33.0, but you have azureml-core 1.34.0 which is incompatible.
azureml-widgets 1.33.0 requires azureml-telemetry~=1.33.0, but you have azureml-telemetry 1.34.0 which is incompatible.
azureml-train-core 1.33.0 requires azureml-core~=1.33.0, but you have azureml-core 1.34.0 which is incompatible.
azureml-train-core 1.33.0 requires azureml-telemetry~=1.33.0, but you have azureml-telemetry 1.34.0 which is incompatible.
azureml-train-automl-client 1.33.0 requires azureml-core~=1.33.0, but you have azureml-core 1.34.0 which is incompatible.
azureml-train-automl-client 1.33.0 requires azureml-dataset-runtime~=1.33.0, but you have azureml-dataset-runtime 1.34.0 which is incompatible.
azureml-train-automl-client 1.33.0 requires azureml-telemetry~=1.33.0, but you have azureml-t
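
The output above reports conflicts between the 1.34.0 packages that were just installed and 1.33.0 packages already present in the environment. One way to see which versions are installed is to query the package metadata. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_versions` and the choice of packages to check are illustrative and not part of the lab.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip output above
packages = ['azureml-core', 'azureml-telemetry', 'azureml-widgets', 'azureml-datadrift']
for name, ver in installed_versions(packages).items():
    print(name, ver if ver is not None else 'not installed')
```

If the reported versions differ, installing the packages at one matching release will typically clear this kind of conflict.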

## Connect to your workspace

With the required SDK packages installed, now you're ready to connect to your workspace.

> **Note**: If you haven't already established an authenticated session with your Azure subscription, you'll be prompted to authenticate by clicking a link, entering an authentication code, and signing into Azure.

In [1]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception (azureml-telemetry 1.34.0 (c:\applications\anaconda\lib\site-packages), Requirement.parse('azureml-telemetry~=1.33.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception (azureml-telemetry 1.34.0 (c:\applications\anaconda\lib\site-packages), Requirement.parse('azureml-telemetry~=1.33.0')).


Ready to work with wsag


## Create a *baseline* dataset

To monitor a dataset for data drift, you must register a *baseline* dataset (usually the dataset used to train your model) to use as a point of comparison with data collected in the future. 

In [2]:
from azureml.core import Datastore, Dataset


# Upload the baseline data
default_ds = ws.get_default_datastore()
default_ds.upload_files(files=['data/diabetes.csv', 'data/diabetes2.csv'],
                       target_path='diabetes-baseline',
                       overwrite=True, 
                       show_progress=True)

# Create and register the baseline dataset
print('Registering baseline dataset...')
baseline_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-baseline/*.csv'))
baseline_data_set = baseline_data_set.register(workspace=ws, 
                           name='diabetes baseline',
                           description='diabetes baseline data',
                           tags = {'format':'CSV'},
                           create_new_version=True)

print('Baseline dataset registered!')

Uploading an estimated of 2 files
Uploading data/diabetes2.csv
Uploaded data/diabetes2.csv, 1 files out of an estimated total of 2
Uploading data/diabetes.csv
Uploaded data/diabetes.csv, 2 files out of an estimated total of 2
Uploaded 2 files
Registering baseline dataset...
Baseline dataset registered!


## Create a *target* dataset

Over time, you can collect new data with the same features as your baseline training data. To compare this new data to the baseline data, you must define a target dataset that includes the features you want to analyze for data drift as well as a timestamp field that indicates the point in time when the new data was current. The timestamp enables you to measure data drift over temporal intervals. It can either be a field in the dataset itself, or be derived from the folder and filename pattern used to store the data. For example, you might store new data in a folder hierarchy that consists of a folder for the year, containing a folder for the month, which in turn contains a folder for the day; or you might just encode the year, month, and day in the file name like this: *data_2020-01-29.csv*, which is the approach taken in the following code:

In [3]:
import datetime as dt
import pandas as pd

print('Generating simulated data...')

# Load the smaller of the two data files
data = pd.read_csv('data/diabetes2.csv')

# We'll generate data for the past 6 weeks
weeknos = reversed(range(6))

file_paths = []
for weekno in weeknos:
    
    # Get the date X weeks ago
    data_date = dt.date.today() - dt.timedelta(weeks=weekno)
    
    # Modify data to create some drift
    data['Pregnancies'] = data['Pregnancies'] + 1
    data['Age'] = round(data['Age'] * 1.2).astype(int)
    data['BMI'] = data['BMI'] * 1.1
    
    # Save the file with the date encoded in the filename
    file_path = 'data/diabetes_{}.csv'.format(data_date.strftime("%Y-%m-%d"))
    data.to_csv(file_path)
    file_paths.append(file_path)

# Upload the files
path_on_datastore = 'diabetes-target'
default_ds.upload_files(files=file_paths,
                       target_path=path_on_datastore,
                       overwrite=True,
                       show_progress=True)

# Use the folder partition format to define a dataset with a 'date' timestamp column
partition_format = path_on_datastore + '/diabetes_{date:yyyy-MM-dd}.csv'
target_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, path_on_datastore + '/*.csv'),
                                                       partition_format=partition_format)

# Register the target dataset
print('Registering target dataset...')
target_data_set = target_data_set.with_timestamp_columns('date').register(workspace=ws,
                                                                          name='diabetes target',
                                                                          description='diabetes target data',
                                                                          tags = {'format':'CSV'},
                                                                          create_new_version=True)

print('Target dataset registered!')

Generating simulated data...
Uploading an estimated of 6 files
Uploading data/diabetes_2021-08-15.csv
Uploaded data/diabetes_2021-08-15.csv, 1 files out of an estimated total of 6
Uploading data/diabetes_2021-08-22.csv
Uploaded data/diabetes_2021-08-22.csv, 2 files out of an estimated total of 6
Uploading data/diabetes_2021-08-29.csv
Uploaded data/diabetes_2021-08-29.csv, 3 files out of an estimated total of 6
Uploading data/diabetes_2021-09-12.csv
Uploaded data/diabetes_2021-09-12.csv, 4 files out of an estimated total of 6
Uploading data/diabetes_2021-09-19.csv
Uploaded data/diabetes_2021-09-19.csv, 5 files out of an estimated total of 6
Uploading data/diabetes_2021-09-05.csv
Uploaded data/diabetes_2021-09-05.csv, 6 files out of an estimated total of 6
Uploaded 6 files
Registering target dataset...
Target dataset registered!
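
To make the file naming scheme more concrete, the sketch below builds weekly file names the same way as the cell above and then recovers the date from each name. It is plain Python for illustration only and does not show how `partition_format` is implemented by the SDK. The helper names `weekly_file_names` and `date_from_file_name`, and the fixed reference date, are assumptions made for this example.

```python
import datetime as dt
import re

# Matches names such as diabetes_2021-09-19.csv and captures the date part
DATE_IN_NAME = re.compile(r'_(\d{4}-\d{2}-\d{2})\.csv$')


def weekly_file_names(reference_date, weeks=6, prefix='diabetes'):
    """Build one file name per week, oldest first, ending at reference_date."""
    names = []
    for weekno in reversed(range(weeks)):
        data_date = reference_date - dt.timedelta(weeks=weekno)
        names.append('{}_{}.csv'.format(prefix, data_date.strftime('%Y-%m-%d')))
    return names


def date_from_file_name(file_name):
    """Recover the date encoded in a file name, or None if there is none."""
    match = DATE_IN_NAME.search(file_name)
    if match is None:
        return None
    return dt.datetime.strptime(match.group(1), '%Y-%m-%d').date()


# A fixed reference date keeps the example reproducible
names = weekly_file_names(dt.date(2021, 9, 19))
for name in names:
    print(name, '->', date_from_file_name(name))
```

Using a fixed reference date instead of `dt.date.today()` means the generated names do not change from one run to the next.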


## Create a data drift monitor

Now you're ready to create a data drift monitor for the diabetes data. The data drift monitor will run periodically or on demand to compare the baseline dataset with the target dataset, to which new data will be added over time.
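
To give a feel for what comparing a baseline with a target can mean for a single numeric feature, here is a small sketch that computes the one-dimensional Wasserstein (earth mover's) distance between two samples of equal size. This is an illustration of the general idea only; it is not presented as the calculation the data drift monitor performs. The function name and the sample values are hypothetical.

```python
def wasserstein_equal_size(baseline, target):
    """1-D Wasserstein distance between two equal-size samples.

    For samples of the same size, this is the mean absolute difference
    between the sorted values.
    """
    if len(baseline) != len(target) or not baseline:
        raise ValueError('samples must be non-empty and the same size')
    pairs = zip(sorted(baseline), sorted(target))
    return sum(abs(b - t) for b, t in pairs) / len(baseline)


# Hypothetical ages: the target is the baseline scaled by 1.2 and rounded,
# mirroring the kind of change applied to the simulated data
baseline_ages = [21, 30, 35, 42, 50]
target_ages = [round(age * 1.2) for age in baseline_ages]

print(wasserstein_equal_size(baseline_ages, baseline_ages))  # identical samples
print(wasserstein_equal_size(baseline_ages, target_ages))    # shifted sample
```

A distance of zero means the two samples have the same distribution of values; larger distances indicate a larger shift.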

### Create a compute target

To run the data drift monitor, you'll need a compute target. Run the following cell to specify a compute cluster (if it doesn't exist, it will be created).

> **Important**: Change the value of *cluster_name* in the code below (*agcluster* in this example) to the name of your compute cluster before running it! Cluster names must be globally unique and between 2 and 16 characters in length. Valid characters are letters, digits, and the - character.

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "agcluster"

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)
    

InProgress...
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
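
The naming rule quoted in the note above (2 to 16 characters; letters, digits, and the - character) can be checked locally before you try to create a cluster. The following is a minimal sketch of such a check. The function name `is_valid_cluster_name` is hypothetical, uniqueness cannot be verified locally, and the service may apply further constraints that are not covered here.

```python
import re

# Rule as stated in the note above: 2 to 16 characters; letters, digits, or '-'
CLUSTER_NAME_PATTERN = re.compile(r'[A-Za-z0-9-]{2,16}')


def is_valid_cluster_name(name):
    """Check a name against the length and character rule quoted above."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None


for candidate in ['agcluster', 'a', 'my_cluster', 'a-very-long-cluster-name']:
    print(candidate, is_valid_cluster_name(candidate))
```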


> **Note**: Compute instances and clusters are based on standard Azure virtual machine images. For this exercise, the *Standard_DS11_v2* image is recommended to achieve the optimal balance of cost and performance. If your subscription has a quota that does not include this image, choose an alternative image; but bear in mind that a larger image may incur higher cost and a smaller image may not be sufficient to complete the tasks. Alternatively, ask your Azure administrator to extend your quota.

### Define the data drift monitor

Now you're ready to use a **DataDriftDetector** class to define the data drift monitor for your data. You can specify the features you want to monitor for data drift, the name of the compute target to be used to run the monitoring process, the frequency at which the data should be compared, the data drift threshold above which an alert should be triggered, and the latency (in hours) to allow for data collection.

In [5]:
from azureml.datadrift import DataDriftDetector

# set up feature list
features = ['Pregnancies', 'Age', 'BMI']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'mslearn-diabates-drift', baseline_data_set, target_data_set,
                                                      compute_target=cluster_name, 
                                                      frequency='Week', 
                                                      feature_list=features, 
                                                      drift_threshold=.3, 
                                                      latency=24)
monitor

{'_logger': <_TelemetryLoggerContextAdapter azureml.datadrift._logging._telemetry_logger.azureml.datadrift.datadriftdetector (DEBUG)>, '_workspace': Workspace.create(name='wsag', subscription_id='6ea869be-bab3-4204-94c3-1fc677f7d2de', resource_group='rgag'), '_frequency': 'Week', '_schedule_start': None, '_schedule_id': None, '_interval': 1, '_state': 'Disabled', '_alert_config': None, '_type': 'DatasetBased', '_id': '24361da5-7f77-4f02-8037-9398b22f37a1', '_compute_target_name': 'agcluster', '_drift_threshold': 0.3, '_baseline_dataset_id': '9602b31a-d59d-4bfe-a835-2432fe604406', '_target_dataset_id': 'a513d7d0-1ff3-4ea7-8bf0-014fb763b8d8', '_feature_list': ['Pregnancies', 'Age', 'BMI'], '_latency': 24, '_name': 'mslearn-diabates-drift', '_latest_run_time': None, '_client': <azureml.datadrift._restclient.datadrift_client.DataDriftClient object at 0x0000022158094D60>}

## Backfill the data drift monitor

You have a baseline dataset and a target dataset that includes simulated weekly data collection for six weeks. You can use this to backfill the monitor so that it can analyze data drift between the original baseline and the target data.

> **Note**: This may take some time to run, as the compute target must be started to run the backfill analysis. The widget may not always update to show the status, so click the link to observe the experiment status in Azure Machine Learning studio!

In [6]:
from azureml.widgets import RunDetails

backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())

RunDetails(backfill).show()
backfill.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'mslearn-diabates-drift-Monitor-Runs_1632059782411',
 'target': 'agcluster',
 'status': 'Completed',
 'startTimeUtc': '2021-09-19T14:10:31.065938Z',
 'endTimeUtc': '2021-09-19T14:17:59.53249Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '8ef9dca9-4d31-47c1-b678-496e7c9715e9',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '9602b31a-d59d-4bfe-a835-2432fe604406'}, 'consumptionDetails': {'type': 'Reference'}}, {'dataset': {'id': 'a513d7d0-1ff3-4ea7-8bf0-014fb763b8d8'}, 'consumptionDetails': {'type': 'Reference'}}],
 'outputDatasets': [],
 'runDefinition': {'script': '_generate_script_datasets.py',
  'useAbsolutePath': False,
  'arguments': ['--baseline_dataset_id',
   '9602b31a-d59d-4bfe-a835-2432fe604406',
   '--target_dataset_id',
   'a513d7d0-1ff3-4ea7-8bf0-014fb763b8d8',
   '--workspace_name',
   'wsag',
   '--work

## Analyze data drift

You can use the following code to examine data drift for the points in time collected in the backfill run.

In [10]:
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])

start_date 2021-08-01
end_date 2021-09-19
frequency Week
Datadrift percentage {'days_from_start': [7, 14, 21, 28, 35, 42], 'drift_percentage': [74.19152901127207, 87.23985219136877, 91.74192122865539, 94.96492628559955, 97.58354951107833, 99.23199438682525]}
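
The metrics shown above can be turned into a more readable summary. The sketch below copies the values from that output, pairs each drift percentage with a calendar date, and lists the points above a threshold. Two assumptions are made here and are not confirmed by the output itself: that `days_from_start` is an offset from `start_date`, and that a threshold of 0.3 corresponds to 30 percent. The helper names are hypothetical.

```python
import datetime as dt

# Values copied from the output above
start_date = dt.date(2021, 8, 1)
drift = {
    'days_from_start': [7, 14, 21, 28, 35, 42],
    'drift_percentage': [74.19152901127207, 87.23985219136877, 91.74192122865539,
                         94.96492628559955, 97.58354951107833, 99.23199438682525],
}


def drift_by_date(start, metrics):
    """Pair each drift percentage with start + days_from_start (assumed meaning)."""
    return [(start + dt.timedelta(days=days), pct)
            for days, pct in zip(metrics['days_from_start'], metrics['drift_percentage'])]


def exceeding(points, threshold_fraction):
    """Keep the points whose percentage is above threshold_fraction * 100."""
    return [(day, pct) for day, pct in points if pct > threshold_fraction * 100]


points = drift_by_date(start_date, drift)
for day, pct in exceeding(points, 0.3):
    print(day, '{:.1f}%'.format(pct))
```

Because the simulated changes compound each week, every point in this run is above the assumed 30 percent level.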


You can also visualize the data drift metrics in [Azure Machine Learning studio](https://ml.azure.com) by following these steps:

1. On the **Datasets** page, view the **Dataset monitors** tab.
2. Click the data drift monitor you want to view.
3. Select the date range over which you want to view data drift metrics (if the column chart does not show multiple weeks of data, wait a minute or so and click **Refresh**).
4. Examine the charts in the **Drift overview** section at the top, which show overall drift magnitude and the drift contribution per feature.
5. Explore the charts in the **Feature detail** section at the bottom, which enable you to see various measures of drift for individual features.

> **Note**: For help understanding the data drift metrics, see [How to monitor datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets#understanding-data-drift-results) in the Azure Machine Learning documentation.

## Explore further

This lab is designed to introduce you to the concepts and principles of data drift monitoring. To learn more about monitoring data drift using datasets, see [Detect data drift on datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets) in the Azure Machine Learning documentation.

You can also collect data from published services and use it as a target dataset for data drift monitoring. See [Collect data from models in production](https://docs.microsoft.com/azure/machine-learning/how-to-enable-data-collection) for details.
