#### Synchronising Computed Metrics in a MetricsRepository to a JSON file stored on DBFS

PyDeequ allows us to persist the metrics computed on dataframes in a so-called MetricsRepository. The following example showcases a repository JSON file managed by python-deequ on DBFS. This can be especially useful:
- for migrating python-deequ applications,
- for managing the MetricsRepository JSON on the application side,
- for enabling explainability and analytics on top of the MetricsRepository JSON.

Note: As of the 1.1.0 release of python-deequ, initialising the repository JSON as a FileSystemMetricsRepository is the only way to run validations against historical metrics. InMemoryMetricsRepository does not support initialisation from historical metrics.

#### 0) Set file location for metrics repository
Write the repository file using the File API format (`/dbfs/...`), but pass the Spark API format (`dbfs:/...`) to [FileSystemMetricsRepository](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/repository/fs/FileSystemMetricsRepository.scala).

In [0]:
metrics_file = '/table_xyz_pydeequ_metrics_repository.json'
metrics_file_api = f"/dbfs/dbfs{metrics_file}"
metrics_spark_api = f"dbfs:/dbfs{metrics_file}"
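
The two path variables above point at the same DBFS file, once through the local `/dbfs/` mount (File API) and once through the `dbfs:/` scheme (Spark API). A minimal sketch of that mapping follows; the helper names `spark_to_file_api` and `file_to_spark_api` are hypothetical and not part of PyDeequ or Databricks.

```python
def spark_to_file_api(path: str) -> str:
    """Map a Spark API path ('dbfs:/a/b') to its File API form ('/dbfs/a/b')."""
    prefix = "dbfs:/"
    if not path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}, got {path!r}")
    return "/dbfs/" + path[len(prefix):].lstrip("/")


def file_to_spark_api(path: str) -> str:
    """Map a File API path ('/dbfs/a/b') to its Spark API form ('dbfs:/a/b')."""
    prefix = "/dbfs/"
    if not path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}, got {path!r}")
    return "dbfs:/" + path[len(prefix):]


# Same file name as in the cell above.
spark_path = "dbfs:/dbfs/table_xyz_pydeequ_metrics_repository.json"
file_path = spark_to_file_api(spark_path)
```

Deriving one path from the other keeps the two variables from drifting apart when the file location changes.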

#### 1) Create the JSON file that stores historical metrics

This JSON structure is retrieved from a previous PyDeequ run.

In [0]:

with open(metrics_file_api, "w", encoding='utf-8') as file:
  file.write("""
  [
    {
      "resultKey": { "dataSetDate": 1702836503289, "tags": {} },
      "analyzerContext": {
        "metricMap": [
          {
            "analyzer": {
              "analyzerName": "Mean",
              "column": "age"
            },
            "metric": {
              "metricName": "DoubleMetric",
              "entity": "Column",
              "instance": "age",
              "name": "Mean",
              "value": 32
            }
          }
        ]
      }
    }
  ]""")
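
As an alternative to a hand-written string, the same seed content could be built as Python objects and serialised with the standard `json` module, which rules out malformed JSON. This is only a sketch: `make_mean_entry` is a hypothetical helper, and the field layout is copied from the structure shown above rather than from a documented schema.

```python
import json


def make_mean_entry(dataset_date_ms: int, column: str, mean_value: float) -> dict:
    """Build one repository entry holding a single Mean metric for `column`."""
    return {
        "resultKey": {"dataSetDate": dataset_date_ms, "tags": {}},
        "analyzerContext": {
            "metricMap": [
                {
                    "analyzer": {"analyzerName": "Mean", "column": column},
                    "metric": {
                        "metricName": "DoubleMetric",
                        "entity": "Column",
                        "instance": column,
                        "name": "Mean",
                        "value": mean_value,
                    },
                }
            ]
        },
    }


# Same timestamp, column and value as the hand-written seed above.
seed = [make_mean_entry(1702836503289, "age", 32)]
seed_json = json.dumps(seed, indent=2)
```

The resulting `seed_json` string could then be passed to `file.write(...)` in place of the literal.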

##### 1.1) Validate the file is written to DBFS

In [0]:
!cat /dbfs/dbfs/table_xyz_pydeequ_metrics_repository.json


  [
    {
      "resultKey": { "dataSetDate": 1702836503289, "tags": {} },
      "analyzerContext": {
        "metricMap": [
          {
            "analyzer": {
              "analyzerName": "Mean",
              "column": "age"
            },
            "metric": {
              "metricName": "DoubleMetric",
              "entity": "Column",
              "instance": "age",
              "name": "Mean",
              "value": 32
            }
          }
        ]
      }
    }
  ]

#### 2) Initialise FileSystemMetricsRepository with the underlying file

In [0]:
from pydeequ.repository import FileSystemMetricsRepository

repository = FileSystemMetricsRepository(spark, metrics_spark_api)



#### 3) Run anomaly checks on new data against the historical metrics in the underlying file

In [0]:

from pydeequ.repository import ResultKey
from pydeequ.verification import VerificationSuite
from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy
from pydeequ.analyzers import Mean

COLUMN_NAME = "age"

df = spark.createDataFrame([{COLUMN_NAME:19},{COLUMN_NAME:21}])

verification_suite = (
    VerificationSuite(spark)
        .onData(df)
        .useRepository(repository)
        .saveOrAppendResult(
            ResultKey(
                spark, 
                ResultKey.current_milli_time()
            )
        )
        .addAnomalyCheck(
            RelativeRateOfChangeStrategy(
                maxRateDecrease=0.8,
                maxRateIncrease=1.2
            ), 
            Mean(COLUMN_NAME)
        )
    )
results = verification_suite.run()


##### 3.1) Validate that historical metrics were taken into consideration when evaluating the anomaly check

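The mean age in the new data is 20, while the mean age stored in the historical metrics is 32. The RelativeRateOfChangeStrategy check should therefore fail: the new value is 62.5% of the previous one, outside the accepted range of 0.8 to 1.2 (a change of at most +/-20%).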
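
The arithmetic behind this expectation can be sketched in plain Python. This is a simplified model, assuming the strategy compares the ratio of the current value to the previous one against the two bounds; `within_relative_rate` is a hypothetical helper, and the actual logic lives in Deequ's Scala implementation.

```python
def within_relative_rate(previous: float, current: float,
                         max_rate_decrease: float, max_rate_increase: float):
    """Return (is_ok, ratio), where ratio = current / previous."""
    ratio = current / previous
    return max_rate_decrease <= ratio <= max_rate_increase, ratio


historical_mean = 32.0        # value stored in the seeded repository file
new_mean = (19 + 21) / 2      # mean of the two rows in the new dataframe

is_ok, ratio = within_relative_rate(
    historical_mean, new_mean, max_rate_decrease=0.8, max_rate_increase=1.2
)
print(f"ratio={ratio}, within bounds={is_ok}")
```

Under this model the ratio falls below the lower bound, which is consistent with the `Failure` status reported below.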

In [0]:
results.checkResultsAsDataFrame(spark_session=spark, verificationResult=results).display()

check,check_level,check_status,constraint,constraint_status,constraint_message
"Anomaly check for Mean(age,None)",Warning,Warning,"AnomalyConstraint(Mean(age,None))",Failure,Value: 20.0 does not meet the constraint requirement!


#### 4) Validate that the repository JSON is updated after the VerificationSuite run

In [0]:
import json
with open(metrics_file_api, "r", encoding="utf-8") as file:
    repository_str = file.read()

json.loads(repository_str)

Out[63]: [{'resultKey': {'dataSetDate': 1702836503289, 'tags': {}},
  'analyzerContext': {'metricMap': [{'analyzer': {'analyzerName': 'Mean',
      'column': 'age'},
     'metric': {'metricName': 'DoubleMetric',
      'entity': 'Column',
      'instance': 'age',
      'name': 'Mean',
      'value': 32.0}}]}},
 {'resultKey': {'dataSetDate': 1704554941399, 'tags': {}},
  'analyzerContext': {'metricMap': [{'analyzer': {'analyzerName': 'Mean',
      'column': 'age'},
     'metric': {'metricName': 'DoubleMetric',
      'entity': 'Column',
      'instance': 'age',
      'name': 'Mean',
      'value': 20.0}}]}}]
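
Once loaded, the repository JSON can be used for simple analytics without Spark. The sketch below flattens the entries into a time-ordered history for one analyzer and column; `metric_history` is a hypothetical helper, `sample_entries` is a trimmed copy of the output above, and `dataSetDate` is assumed to be epoch milliseconds (as produced by `ResultKey.current_milli_time()`).

```python
from datetime import datetime, timezone


def metric_history(entries, analyzer_name, column):
    """Return [(timestamp_utc, value), ...] sorted by time for one analyzer/column."""
    rows = []
    for entry in entries:
        # dataSetDate is assumed to be in epoch milliseconds.
        ts = datetime.fromtimestamp(
            entry["resultKey"]["dataSetDate"] / 1000, tz=timezone.utc
        )
        for item in entry["analyzerContext"]["metricMap"]:
            analyzer = item["analyzer"]
            if (analyzer.get("analyzerName") == analyzer_name
                    and analyzer.get("column") == column):
                rows.append((ts, item["metric"]["value"]))
    return sorted(rows)


# Trimmed copy of the repository content shown above.
sample_entries = [
    {"resultKey": {"dataSetDate": 1704554941399, "tags": {}},
     "analyzerContext": {"metricMap": [
         {"analyzer": {"analyzerName": "Mean", "column": "age"},
          "metric": {"name": "Mean", "instance": "age", "value": 20.0}}]}},
    {"resultKey": {"dataSetDate": 1702836503289, "tags": {}},
     "analyzerContext": {"metricMap": [
         {"analyzer": {"analyzerName": "Mean", "column": "age"},
          "metric": {"name": "Mean", "instance": "age", "value": 32.0}}]}},
]

history = metric_history(sample_entries, "Mean", "age")
for ts, value in history:
    print(ts.isoformat(), value)
```

In the notebook, the result of `json.loads(repository_str)` could be passed in place of `sample_entries`.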