# Assertions
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

Frequently, the data we work with while cleaning and preparing it is only a subset of the data we will need to handle in production. It is also common to work on a snapshot of a live dataset that is continuously updated and augmented.

In these cases, some of the assumptions we make during cleaning might turn out to be false. A column that originally contained only numbers within a certain range might contain a wider range of values in later executions. Such violations often result in either broken pipelines or bad data.

AzureML DataPrep supports creating assertions on data, which are evaluated as the pipeline is executed. These assertions let us verify that our assumptions about the data continue to hold and, when they do not, handle the failures cleanly.

To demonstrate, we will load a dataset and then add some assertions based on what we can see in the first few rows.

In [1]:
from azureml.dataprep import smart_read_file

df = smart_read_file('./data/crime0-10.csv')
df.get_profile()

| Column | Type | Min | Max | Count | Missing Count | Error Count | Lower Quartile | Upper Quartile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Longitude | FieldType.DECIMAL | -87.8002 | -87.6445 | 10.0 | 1.0 | 0.0 | -87.725 | -87.6589 | 0.0512571 | -87.6974 |
| ID | FieldType.DECIMAL | 1.01397e+07 | 1.01409e+07 | 10.0 | 0.0 | 0.0 | 10139800.0 | 10140400.0 | 409.806 | 10140100.0 |
| Location Description | FieldType.STRING | ALLEY | VEHICLE NON-COMMERCIAL | 10.0 | 0.0 | 0.0 | | | | |
| Ward | FieldType.DECIMAL | 9 | 49 | 10.0 | 0.0 | 0.0 | 16.0 | 41.0 | 14.1676 | 29.5 |
| Location | FieldType.STRING | | (42.008124017, -87.65955018) | 10.0 | 0.0 | 0.0 | | | | |
| District | FieldType.DECIMAL | 5 | 24 | 10.0 | 1.0 | 0.0 | 8.75 | 17.0 | 6.09872 | 13.2222 |
| Primary Type | FieldType.STRING | ARSON | THEFT | 10.0 | 0.0 | 0.0 | | | | |
| Y Coordinate | FieldType.DECIMAL | 1.82648e+06 | 1.94627e+06 | 10.0 | 1.0 | 0.0 | 1876210.0 | 1932620.0 | 37733.2 | 1898270.0 |
| Latitude | FieldType.DECIMAL | 41.6793 | 42.0081 | 10.0 | 1.0 | 0.0 | 41.816 | 41.9709 | 0.103645 | 41.8766 |
| Date | FieldType.DATE | 2015-07-05 22:10:00+00:00 | 2015-07-05 23:50:00+00:00 | 10.0 | 0.0 | 0.0 | | | | |


The dataset contains latitude and longitude columns. By definition, latitude is constrained to the range [-90, 90] and longitude to [-180, 180]. We can assert that this is the case so that any record arriving with an invalid value is detected.

In [2]:
from azureml.dataprep import f_and, value

# Latitude must lie in [-90, 90] and Longitude in [-180, 180].
# Values that fail the check are tagged with the given error_code.
df = df.assert_value('Latitude', f_and(value <= 90, value >= -90), error_code='InvalidLatitude')
df = df.assert_value('Longitude', f_and(value <= 180, value >= -180), error_code='InvalidLongitude')
df.keep_columns(['Latitude', 'Longitude']).get_profile()

| Column | Type | Min | Max | Count | Missing Count | Error Count | Lower Quartile | Upper Quartile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Longitude | FieldType.DECIMAL | -87.800175 | -87.644545 | 10.0 | 0.0 | 1.0 | -87.724992 | -87.658915 | 0.051257 | -87.69737 |
| Latitude | FieldType.DECIMAL | 41.679311 | 42.008124 | 10.0 | 0.0 | 1.0 | 41.816021 | 41.970902 | 0.103645 | 41.876613 |


Any assertion failures are represented as Errors in the resulting dataset. From the profile above, you can see that the Error Count for both of these columns is 1. We can use a filter to retrieve the error and see what value caused the assertion to fail.

In [3]:
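from azureml.dataprep import col

# Keep only the rows whose Latitude is an error, then inspect the first one.
error_df = df.filter(col('Latitude').is_error())
error = error_df.head(10)['Latitude'][0]
# originalValue holds the value that caused the assertion to fail.
print(error.originalValue)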

None
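
The behavior seen so far can be modeled in a few lines of plain Python. This is only a conceptual sketch, not how AzureML DataPrep is implemented: `AssertionFailure`, `in_range`, and this standalone `assert_value` are hypothetical names, the sample rows are illustrative, and treating a missing value (`None`) as failing the predicate is an assumption chosen to match the output above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class AssertionFailure:
    """Hypothetical stand-in for the error that replaces a failing cell."""
    error_code: str
    original_value: Any


def in_range(low: float, high: float) -> Callable[[Any], bool]:
    """Build a range predicate; a missing value (None) is assumed to fail it."""
    def predicate(v: Any) -> bool:
        return v is not None and low <= v <= high
    return predicate


def assert_value(rows: List[Dict[str, Any]], column: str,
                 predicate: Callable[[Any], bool],
                 error_code: str) -> List[Dict[str, Any]]:
    """Return copies of the rows; failing cells in `column` become AssertionFailure."""
    checked = []
    for row in rows:
        new_row = dict(row)
        cell = new_row[column]
        if not predicate(cell):
            new_row[column] = AssertionFailure(error_code, cell)
        checked.append(new_row)
    return checked


# Illustrative rows: the middle one is missing both coordinates.
rows = [
    {"Latitude": 41.679311, "Longitude": -87.644545},
    {"Latitude": None, "Longitude": None},
    {"Latitude": 42.008124, "Longitude": -87.659550},
]

rows = assert_value(rows, "Latitude", in_range(-90, 90), "InvalidLatitude")
rows = assert_value(rows, "Longitude", in_range(-180, 180), "InvalidLongitude")

# Equivalent of filtering on "is an error" and reading the original value.
failures = [r["Latitude"] for r in rows if isinstance(r["Latitude"], AssertionFailure)]
print(len(failures), failures[0].error_code, failures[0].original_value)
```

The point of the model is that a failed assertion does not stop the pipeline or drop the row: the offending cell is swapped for an error that remembers both the error code and the value that triggered it, so it can be counted, filtered, and inspected later.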


The assertion failed because the value was missing: we had not removed missing values from the data. At this point, we have two options: go back and edit our code so that the error never occurs, or resolve it now. In this case, we will simply filter out the rows that contain the error.

In [4]:
from azureml.dataprep import f_not

# Keep only the rows whose Latitude is not an error.
clean_df = df.filter(f_not(col('Latitude').is_error()))
clean_df.get_profile()

| Column | Type | Min | Max | Count | Missing Count | Error Count | Lower Quartile | Upper Quartile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Longitude | FieldType.DECIMAL | -87.8002 | -87.6445 | 9.0 | 0.0 | 0.0 | -87.725 | -87.6589 | 0.0512571 | -87.6974 |
| ID | FieldType.DECIMAL | 1.01397e+07 | 1.01409e+07 | 9.0 | 0.0 | 0.0 | 10139800.0 | 10140400.0 | 427.717 | 10140000.0 |
| Location Description | FieldType.STRING | ALLEY | VEHICLE NON-COMMERCIAL | 9.0 | 0.0 | 0.0 | | | | |
| Ward | FieldType.DECIMAL | 12 | 49 | 9.0 | 0.0 | 0.0 | 22.0 | 42.25 | 12.94 | 31.7778 |
| Location | FieldType.STRING | (41.6793109, -87.644545209) | (42.008124017, -87.65955018) | 9.0 | 0.0 | 0.0 | | | | |
| District | FieldType.DECIMAL | 5 | 24 | 9.0 | 0.0 | 0.0 | 8.75 | 17.0 | 6.09872 | 13.2222 |
| Primary Type | FieldType.STRING | ARSON | THEFT | 9.0 | 0.0 | 0.0 | | | | |
| Y Coordinate | FieldType.DECIMAL | 1.82648e+06 | 1.94627e+06 | 9.0 | 0.0 | 0.0 | 1876210.0 | 1932620.0 | 37733.2 | 1898270.0 |
| Latitude | FieldType.DECIMAL | 41.6793 | 42.0081 | 9.0 | 0.0 | 0.0 | 41.816 | 41.9709 | 0.103645 | 41.8766 |
| Date | FieldType.DATE | 2015-07-05 22:10:00+00:00 | 2015-07-05 23:50:00+00:00 | 9.0 | 0.0 | 0.0 | | | | |
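
The other option mentioned earlier, editing the pipeline so the error never occurs, amounts to dropping missing values before the range check runs. The following plain-Python sketch shows only that ordering effect. It does not use the DataPrep API: `drop_missing` and `failing_values` are hypothetical helpers, the sample values are illustrative, and counting `None` as a failure is an assumption.

```python
from typing import List, Optional


def drop_missing(values: List[Optional[float]]) -> List[float]:
    """Remove missing entries before any assertion is evaluated."""
    return [v for v in values if v is not None]


def failing_values(values: List[Optional[float]],
                   low: float, high: float) -> List[Optional[float]]:
    """Return the values a range assertion would flag (None is assumed to fail)."""
    return [v for v in values if v is None or not (low <= v <= high)]


latitudes = [41.679311, None, 42.008124]

# Asserting first flags the missing value...
flagged_before = failing_values(latitudes, -90, 90)
# ...while dropping missing values first leaves nothing to flag.
flagged_after = failing_values(drop_missing(latitudes), -90, 90)

print(flagged_before)
print(flagged_after)
```

Which option is preferable depends on whether a missing coordinate should count as bad data worth surfacing or as an expected gap to remove up front.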
