#### Manage the FileSystemMetricsRepository metrics JSON file stored on DBFS between runs

PyDeequ lets us persist metrics in a so-called MetricsRepository. FileSystemMetricsRepository implements MetricsRepository and materializes the repository to a JSON file.

This tutorial demonstrates:
- How to access the data materialized by FileSystemMetricsRepository, metrics.json (section 1),
- How to create a FileSystemMetricsRepository with managed data (section 2.3),
- How to run VerificationSuite using specific managed data (all sections).


This can be especially useful:
- For migrating a python-deequ application from one Databricks Workspace to another,
- To manage the MetricsRepository JSON on the application side,
- To enable explainability and analytics using the MetricsRepository JSON.

Note: As of the 1.1.0 release of Python Deequ, initializing a FileSystemMetricsRepository from the repository JSON is the only way to run validations against historical metrics. InMemoryMetricsRepository does not support initialization from historical metrics.

#### 1) Simulate a historical anomaly run

##### 1.1) Create FileSystemMetricsRepository with autogenerated path

When FileSystemMetricsRepository is initialized without an explicit path, it generates the path automatically.

In [0]:
from pydeequ.repository import FileSystemMetricsRepository

repository = FileSystemMetricsRepository(spark)
print(repository.path)

This path is crucial for materializing the JSON file that holds the metrics for consecutive PyDeequ runs. PyDeequ, in turn, matches check definitions in the persisted data model and uses the underlying metrics to calculate specific anomalies.

##### 1.2) Run anomaly checks

In [0]:
from pydeequ.repository import ResultKey
from pydeequ.verification import VerificationSuite
from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy
from pydeequ.analyzers import Mean

COLUMN_NAME = "age"

df_xyz = spark.createDataFrame([{COLUMN_NAME:19},{COLUMN_NAME:21}])

verification_suite = (
    VerificationSuite(spark)
        .onData(df_xyz)
        .useRepository(repository)
        .saveOrAppendResult(
            ResultKey(
                spark, 
                ResultKey.current_milli_time(),
                {"tag": "historical-run-0"}
            )
        )
        .addAnomalyCheck(
            RelativeRateOfChangeStrategy(
                maxRateDecrease=0.8,
                maxRateIncrease=1.2
            ), 
            Mean(COLUMN_NAME)
        )
    )
results = verification_suite.run()


##### 1.3) Verify metrics.json is persisted to DBFS

Triggering VerificationSuite.run() persists the metrics to the file underlying the FileSystemMetricsRepository.

In [0]:
historical_repository_path = repository.path
# Prefix the repository path with /dbfs to read the file through the local File API.
with open(f"/dbfs{historical_repository_path}", "r") as f:
    print(f.read())

#### 2) Create FileSystemMetricsRepository with managed path

##### 2.1) Define target path

In [0]:
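import os

# Create the target directory through the File API mount (/dbfs/...).
os.makedirs("/dbfs/table_xyz", exist_ok=True)
# Path relative to the DBFS root; reused below for both File API and Spark API paths.
target_metrics_file_path = "table_xyz/metrics.json"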

##### 2.2) Copy historical metrics.json to target path

In [0]:
import shutil

shutil.copyfile(
    src=f"/dbfs{historical_repository_path}",
    dst=f"/dbfs/{target_metrics_file_path}",
)

##### 2.3) Initialize FileSystemMetricsRepository with managed path
Note: Manage the repository file using the File API format (/dbfs/...), but pass the Spark API format (dbfs:/...) to [FileSystemMetricsRepository](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/repository/fs/FileSystemMetricsRepository.scala)

In [0]:
metrics_spark_api = f"dbfs:/{target_metrics_file_path}"

repository = FileSystemMetricsRepository(spark, metrics_spark_api)
print(repository.path)
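
The two path formats from the note in 2.3 are easy to mix up, so it may help to centralize the conversion. The following is a minimal sketch, not part of PyDeequ: the helper names `to_file_api_path` and `to_spark_api_path` are hypothetical, and the code assumes that any path without a `dbfs:/` or `/dbfs/` prefix is relative to the DBFS root.

```python
def to_file_api_path(path: str) -> str:
    """Return the File API form (/dbfs/...) of a DBFS path."""
    if path.startswith("/dbfs/"):
        return path  # already in File API form
    if path.startswith("dbfs:/"):
        path = path[len("dbfs:/"):]
    # Assumption: anything else is relative to the DBFS root.
    return "/dbfs/" + path.lstrip("/")


def to_spark_api_path(path: str) -> str:
    """Return the Spark API form (dbfs:/...) of a DBFS path."""
    if path.startswith("dbfs:/"):
        return path  # already in Spark API form
    if path.startswith("/dbfs/"):
        path = path[len("/dbfs/"):]
    # Assumption: anything else is relative to the DBFS root.
    return "dbfs:/" + path.lstrip("/")


file_api_example = to_file_api_path("table_xyz/metrics.json")
spark_api_example = to_spark_api_path("table_xyz/metrics.json")
print(file_api_example)
print(spark_api_example)
```

With helpers like these, the same relative path can feed both `open(...)` or `shutil.copyfile(...)` and the FileSystemMetricsRepository constructor.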

#### 3) Run anomaly checks

In [0]:
df_xyz = spark.createDataFrame([{COLUMN_NAME:19},{COLUMN_NAME:21},{COLUMN_NAME:50}])

verification_suite = (
    VerificationSuite(spark)
        .onData(df_xyz)
        .useRepository(repository)
        .saveOrAppendResult(
            ResultKey(
                spark, 
                ResultKey.current_milli_time()
            )
        )
        .addAnomalyCheck(
            RelativeRateOfChangeStrategy(
                maxRateDecrease=0.8,
                maxRateIncrease=1.2
            ), 
            Mean(COLUMN_NAME)
        )
    )
results = verification_suite.run()


##### 3.1) Validate that historical metrics were taken into consideration when calculating anomaly

The mean age of the new data is 30, while the mean age of the old data is 20, a relative change of +50% (30 / 20 = 1.5). The RelativeRateOfChangeStrategy check should fail because the accepted rate of change is +/-20%.

In [0]:
results.checkResultsAsDataFrame(spark_session=spark, verificationResult=results).display()
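
The arithmetic behind the expected failure in 3.1 can be reproduced in plain Python. This is a simplified sketch of a ratio test against the two bounds used above, not Deequ's implementation: the function name `is_rate_of_change_anomaly` is hypothetical, and it assumes a single previous value is compared with the current one.

```python
def is_rate_of_change_anomaly(previous, current, max_rate_decrease, max_rate_increase):
    """Flag the current value when current / previous falls outside the bounds."""
    ratio = current / previous
    return ratio < max_rate_decrease or ratio > max_rate_increase


# Means of the "age" column in the historical and the new DataFrame.
old_mean = (19 + 21) / 2
new_mean = (19 + 21 + 50) / 3
ratio = new_mean / old_mean

anomaly = is_rate_of_change_anomaly(
    previous=old_mean,
    current=new_mean,
    max_rate_decrease=0.8,
    max_rate_increase=1.2,
)
print(old_mean, new_mean, ratio, anomaly)
```

Because the ratio exceeds the upper bound of 1.2, the sketch flags the new mean, which matches the failure the check result should report.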

##### 3.2) Validate that repository json is updated after VerificationSuite run

In [0]:
import json
with open(f"/dbfs/{target_metrics_file_path}", "r", encoding="utf-8") as file:
    repository_str = file.read()

json.loads(repository_str)
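
For the explainability and analytics use case, the parsed JSON can be flattened into a metric history. The sketch below is illustrative only: the function name `extract_metric_history` is hypothetical, the key names (`resultKey`, `dataSetDate`, `analyzerContext`, `metricMap`, `metric`) are an assumption about the file layout that should be checked against the actual output of the cell above, and `SAMPLE_REPOSITORY_JSON` is made-up data, not real PyDeequ output.

```python
import json

# Hypothetical sample; inspect the real metrics.json before relying on these keys.
SAMPLE_REPOSITORY_JSON = """
[
  {
    "resultKey": {"dataSetDate": 1700000000000, "tags": {"tag": "historical-run-0"}},
    "analyzerContext": {
      "metricMap": [
        {
          "analyzer": {"analyzerName": "Mean", "column": "age"},
          "metric": {"metricName": "DoubleMetric", "entity": "Column",
                     "instance": "age", "name": "Mean", "value": 20.0}
        }
      ]
    }
  }
]
"""


def extract_metric_history(repository_str, metric_name, instance):
    """Return (dataSetDate, value) pairs for one metric, oldest first."""
    history = []
    for entry in json.loads(repository_str):
        result_key = entry.get("resultKey", {})
        metric_map = entry.get("analyzerContext", {}).get("metricMap", [])
        for item in metric_map:
            metric = item.get("metric", {})
            if metric.get("name") == metric_name and metric.get("instance") == instance:
                history.append((result_key.get("dataSetDate"), metric.get("value")))
    # Entries without a date sort first.
    history.sort(key=lambda pair: pair[0] or 0)
    return history


mean_age_history = extract_metric_history(SAMPLE_REPOSITORY_JSON, "Mean", "age")
print(mean_age_history)
```

In the notebook, `repository_str` from the cell above would take the place of the sample string, provided the real file follows the assumed layout.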