# Monitoring Data Drift

Over time, models can become less effective at predicting accurately due to changing trends in feature data. This phenomenon is known as *data drift*, and it's important to monitor your machine learning solution to detect it so you can retrain your models if necessary.
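
As a rough illustration of what drift looks like numerically, the sketch below measures how far the mean of a new sample has moved from a baseline sample, expressed in baseline standard deviations. This is not the metric that the Azure ML data drift monitor computes. The `mean_shift` helper and the sample values are assumptions made up for this example.

```python
import statistics

def mean_shift(baseline, target):
    """Return the change in mean from baseline to target, in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return (statistics.mean(target) - base_mean) / base_std

# Hypothetical departure delays in minutes (illustrative values only)
baseline_delays = [5, 12, 8, 20, 15, 10, 7, 18]

# The same values scaled up by 10%, imitating a feature that has drifted
drifted_delays = [d * 1.1 for d in baseline_delays]

print('Shift against itself:', mean_shift(baseline_delays, baseline_delays))
print('Shift after drift:', round(mean_shift(baseline_delays, drifted_delays), 3))
```

A value near zero suggests the feature has not moved much; larger absolute values suggest the new data no longer resembles the data the model was trained on.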

In this lab, you'll configure data drift monitoring for datasets.

## Install the DataDriftDetector module

To define a data drift monitor, make sure you have the latest version of the Azure ML SDK installed, along with the **azureml-datadrift** package. Run the following cell to install both:

In [33]:
!pip install --upgrade azureml-sdk[notebooks,automl,explain]
!pip install --upgrade azureml-datadrift
# Restart the kernel after installation is complete!

Requirement already up-to-date: azureml-sdk[automl,explain,notebooks] in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.13.0)
Requirement already up-to-date: azureml-datadrift in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.13.0)


> **Important**: Now you'll need to <u>restart the kernel</u>. In Jupyter, on the **Kernel** menu, select **Restart and Clear Output**. Then, when the output from the cell above has been removed and the kernel is restarted, continue the steps below.

## Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: You may be prompted to authenticate. Just copy the code and click the link provided to sign in to your Azure subscription, and then return to this notebook.

In [34]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

Ready to work with customer_360_ws


## Create a Baseline Dataset

To monitor a dataset for data drift, you must register a *baseline* dataset (usually the dataset used to train your model) to use as a point of comparison with data collected in the future. 

In this case, we will use the existing **flight_delays_data** dataset as our baseline dataset.

In [35]:
# Get the training dataset
baseline_dataset = ws.datasets.get('flight_delays_data')

## Create a Target Dataset

Over time, you can collect new data with the same features as your baseline training data. To compare this new data to the baseline data, you must define a target dataset that includes the features you want to analyze for data drift, as well as a timestamp field that indicates the point in time when the new data was current. This enables you to measure data drift over temporal intervals. The timestamp can either be a field in the dataset itself, or be derived from the folder and filename pattern used to store the data. For example, you might store new data in a folder hierarchy that consists of a folder for the year, containing a folder for the month, which in turn contains a folder for the day. Alternatively, you might encode the year, month, and day in the file name, like this: *data_2020-01-29.csv*. The following code takes the second approach:

In [36]:
import datetime as dt
from azureml.core import Dataset
import os

data_folder = 'data'
os.makedirs(data_folder, exist_ok=True)

print(data_folder, 'Folder ready.')

default_ds = ws.get_default_datastore()

print('Generating simulated data...')



# Use 10 percent of the baseline data for the drift simulation
drift_data = baseline_dataset.random_split(percentage=0.1, seed=123)[0]
drift_data = drift_data.to_pandas_dataframe()

file_path = 'data/flight_delays.csv'
drift_data.head().to_csv(file_path)


# We'll generate data for the past 6 weeks
weeknos = reversed(range(6))

file_paths = []
for weekno in weeknos:
    
    # Get the date X weeks ago
    data_date = dt.date.today() - dt.timedelta(weeks=weekno)
    
    # Modify data to create some drift (changes accumulate from week to week)
    drift_data['Month'] = drift_data['Month'] + 1
    drift_data['CRSDepTime'] = drift_data['CRSDepTime'] + 123
    drift_data['CRSArrTime'] = drift_data['CRSArrTime'] + 325
    drift_data['DepDelay'] = drift_data['DepDelay'] * 1.1
    
    # Save the file with the date encoded in the filename
    file_path = 'data/flight_delays_{}.csv'.format(data_date.strftime("%Y-%m-%d"))
    drift_data.to_csv(file_path)
    file_paths.append(file_path)

# Upload the files
path_on_datastore = 'flight_delays_target'
default_ds.upload_files(files=file_paths,
                       target_path=path_on_datastore,
                       overwrite=True,
                       show_progress=True)

# Use the folder partition format to define a dataset with a 'date' timestamp column
partition_format = path_on_datastore + '/flight_delays_{date:yyyy-MM-dd}.csv'
target_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, path_on_datastore + '/*.csv'),
                                                       partition_format=partition_format)

# Register the target dataset
print('Registering target dataset...')
target_data_set = target_data_set.with_timestamp_columns('date').register(workspace=ws,
                                                                          name='Flight Delays target',
                                                                          description='Flight Delays target data',
                                                                          tags = {'format':'CSV'},
                                                                          create_new_version=True)

print('Target dataset registered!')

data Folder ready.
Generating simulated data...
Uploading an estimated of 6 files
Uploading data/flight_delays_2020-08-14.csv
Uploaded data/flight_delays_2020-08-14.csv, 1 files out of an estimated total of 6
Uploading data/flight_delays_2020-08-28.csv
Uploaded data/flight_delays_2020-08-28.csv, 2 files out of an estimated total of 6
Uploading data/flight_delays_2020-09-11.csv
Uploaded data/flight_delays_2020-09-11.csv, 3 files out of an estimated total of 6
Uploading data/flight_delays_2020-09-18.csv
Uploaded data/flight_delays_2020-09-18.csv, 4 files out of an estimated total of 6
Uploading data/flight_delays_2020-08-21.csv
Uploaded data/flight_delays_2020-08-21.csv, 5 files out of an estimated total of 6
Uploading data/flight_delays_2020-09-04.csv
Uploaded data/flight_delays_2020-09-04.csv, 6 files out of an estimated total of 6
Uploaded 6 files
Registering target dataset...
Target dataset registered!
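
To make the filename convention concrete, the sketch below reproduces it in plain Python: it builds the dated file names for six weekly snapshots and then recovers the date from each name. It only mimics what the `{date:yyyy-MM-dd}` partition format extracts; it is not how the Azure ML SDK implements partition parsing. The helper names and the fixed reference date are assumptions chosen for this example.

```python
import datetime as dt
import re

def dated_filename(date):
    """Build a file name with the date encoded as YYYY-MM-DD."""
    return 'flight_delays_{}.csv'.format(date.strftime('%Y-%m-%d'))

# Regular expression matching the date portion of the file name
FILENAME_PATTERN = re.compile(r'flight_delays_(\d{4}-\d{2}-\d{2})\.csv$')

def date_from_filename(path):
    """Recover the date encoded in a file name produced by dated_filename."""
    match = FILENAME_PATTERN.search(path)
    if match is None:
        raise ValueError('No date found in {!r}'.format(path))
    return dt.datetime.strptime(match.group(1), '%Y-%m-%d').date()

# Fixed reference date so the example is reproducible (the lab uses today's date)
reference_date = dt.date(2020, 9, 18)

for weekno in reversed(range(6)):
    data_date = reference_date - dt.timedelta(weeks=weekno)
    name = dated_filename(data_date)
    print(name, '->', date_from_filename('flight_delays_target/' + name))
```

In this sketch, a file whose name does not carry a date, such as *flight_delays.csv*, raises a `ValueError` rather than being assigned a timestamp.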


## Create a Data Drift Monitor

Now you're ready to create a data drift monitor for the flight delays data. The data drift monitor will run periodically or on-demand to compare the baseline dataset with the target dataset, to which new data will be added over time.

### Create a Compute Target

To run the data drift monitor, you'll need a compute target. Create an Azure Machine Learning compute cluster in your workspace (or use an existing one if you have created it previously).

> **Important**: If your compute cluster has a different name, change *aml-cluster* to that name in the code below before running it!

In [37]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "aml-cluster"

try:
    # Get the cluster if it exists
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2', max_nodes=2)
    training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

training_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Define the Data Drift Monitor

Now you're ready to use the **DataDriftDetector** class to define the data drift monitor for your data. You can specify the features you want to monitor for data drift, the name of the compute target to be used to run the monitoring process, the frequency at which the data should be compared, the data drift threshold above which an alert should be triggered, and the latency (in hours) to allow for data collection.

In [38]:
from azureml.datadrift import DataDriftDetector

# set up feature list
features = ['Month', 'CRSDepTime', 'CRSArrTime', 'DepDelay']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'flight-delays-drift-detector', baseline_dataset, target_data_set,
                                                      compute_target=cluster_name, 
                                                      frequency='Week', 
                                                      feature_list=features, 
                                                      drift_threshold=.3, 
                                                      latency=24)
monitor

{'_workspace': Workspace.create(name='customer_360_ws', subscription_id='d09a2b06-19d3-43a2-adb0-55088c10565e', resource_group='Intelfort_Big_Data_01'), '_frequency': 'Week', '_schedule_start': None, '_schedule_id': None, '_interval': 1, '_state': 'Disabled', '_alert_config': None, '_type': 'DatasetBased', '_id': 'e54cfe6c-2aa1-417d-bdab-e52d03559c71', '_model_name': None, '_model_version': 0, '_services': None, '_compute_target_name': 'aml-cluster', '_drift_threshold': 0.3, '_baseline_dataset_id': '874f5158-224f-45a0-a1d0-c1fab9559330', '_target_dataset_id': '561e0909-ecfe-4448-991d-e756573cd1ff', '_feature_list': ['Month', 'CRSDepTime', 'CRSArrTime', 'DepDelay'], '_latency': 24, '_name': 'diabetes-drift-detector', '_latest_run_time': None, '_client': <azureml.datadrift._restclient.datadrift_client.DataDriftClient object at 0x7f63ec151c18>, '_logger': <_TelemetryLoggerContextAdapter azureml.datadrift._logging._telemetry_logger.azureml.datadrift.datadriftdetector (DEBUG)>}
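
The role of the drift threshold can be sketched as a simple comparison: any period whose drift magnitude exceeds the threshold is flagged. The weekly magnitudes below are invented for illustration and are not results from this monitor. The `weeks_over_threshold` helper is hypothetical and is not part of the Azure ML SDK.

```python
DRIFT_THRESHOLD = 0.3  # same value passed as drift_threshold to the monitor

# Hypothetical drift magnitudes per week (made-up values, not real results)
weekly_drift = {
    '2020-08-14': 0.08,
    '2020-08-21': 0.17,
    '2020-08-28': 0.26,
    '2020-09-04': 0.34,
    '2020-09-11': 0.41,
    '2020-09-18': 0.47,
}

def weeks_over_threshold(drift_by_week, threshold):
    """Return the weeks whose drift magnitude exceeds the threshold, in date order."""
    return [week for week, magnitude in sorted(drift_by_week.items())
            if magnitude > threshold]

flagged = weeks_over_threshold(weekly_drift, DRIFT_THRESHOLD)
print('Weeks above threshold:', flagged)
```

A lower threshold flags drift sooner at the cost of more frequent alerts; a higher one tolerates more change before flagging.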

## Backfill the Monitor

You have a baseline dataset and a target dataset that includes simulated weekly data collection for six weeks. You can use this to backfill the monitor so that it can analyze data drift between the original baseline and the target data.

> **Note**: This may take some time to run, as the compute target must be started to run the backfill analysis. The widget may not always update to show the status, so click the link to observe the experiment status in Azure Machine Learning studio!

In [39]:
from azureml.widgets import RunDetails

backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())

RunDetails(backfill).show()
backfill.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…
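
With a weekly frequency, a six-week backfill range covers six comparison periods. The sketch below shows one way to split such a range into weekly windows. It assumes windows are counted forward from the start of the range; the service's actual windowing may differ. The `weekly_windows` helper and the fixed end date are hypothetical.

```python
import datetime as dt

def weekly_windows(start, end):
    """Split the range from start to end into consecutive windows of at most one week."""
    windows = []
    window_start = start
    while window_start < end:
        window_end = min(window_start + dt.timedelta(weeks=1), end)
        windows.append((window_start, window_end))
        window_start = window_end
    return windows

# Fixed end date so the example is reproducible (the lab uses the current time)
end = dt.datetime(2020, 9, 18)
start = end - dt.timedelta(weeks=6)

for window_start, window_end in weekly_windows(start, end):
    print(window_start.date(), 'to', window_end.date())
```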

## Analyze Data Drift

You can use the following code to examine data drift for the points in time collected in the backfill run.

In [None]:
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])

You can also visualize the data drift metrics in [Azure Machine Learning studio](https://ml.azure.com) by following these steps:

1. On the **Datasets** page, view the **Dataset monitors** tab.
2. Click the data drift monitor you want to view.
3. Select the date range over which you want to view data drift metrics (if the column chart does not show multiple weeks of data, wait a minute or so and click **Refresh**).
4. Examine the charts in the **Drift overview** section at the top, which show overall drift magnitude and the drift contribution per feature.
5. Explore the charts in the **Feature detail** section at the bottom, which enable you to see various measures of drift for individual features.

> **Note**: For help understanding the data drift metrics, see [How to monitor datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets#understanding-data-drift-results) in the Azure Machine Learning documentation.

## Explore Further

This lab is designed to introduce you to the concepts and principles of data drift monitoring. To learn more about monitoring data drift using datasets, see [Detect data drift on datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets) in the Azure Machine Learning documentation.

You can also configure data drift monitoring for services deployed in an Azure Kubernetes Service (AKS) cluster. For more information about this, see [Detect data drift on models deployed to Azure Kubernetes Service (AKS)](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-data-drift) in the Azure Machine Learning documentation.
