# Data Quality & Drift
This notebook uses AWS Glue and AWS Glue DataBrew to create a data profiling report. The report can be used to track data drift over time by adding data wrangling code that compares statistical summaries (standard deviation, for example) of data quality metrics for features of interest.

**Note:** This code requires the `pyathena` package. The following cell installs `pyathena` if it is not already installed.

In [None]:
try:
    import pyathena
except ImportError as e:
    !pip3 install pyathena==2.3.2

## Imports

In [None]:
from IPython.display import display, Markdown
from datetime import datetime
from pathlib import Path
import sagemaker
import logging
import boto3
import sys
import os

In [None]:
# import from a different path
path = Path(os.path.abspath(os.getcwd()))
package_dir = f'{str(path.parent)}/utils'
print(package_dir)
sys.path.insert(0, package_dir)
import utils
import feature_monitoring_utils

## Setup Logging

In [None]:
# Use the module's __name__ variable (not the literal string '__name__') as the logger name
logger = logging.getLogger(__name__)
logging.basicConfig(format="%(asctime)s,%(filename)s,%(funcName)s,%(lineno)s,%(levelname)s,p%(process)s,%(message)s", level=logging.INFO)


## Setup Config Variables
Read the metadata (feature group name, model endpoint name, etc.) produced by the previous notebooks so that it can be provided as input to the data profiling jobs.

In [None]:
endpoint_name = utils.read_param("endpoint_name")
customer_inputs_fg_name = utils.read_param("customer_inputs_fg_name")
destinations_fg_name = utils.read_param("destinations_fg_name")
customer_inputs_fg_query_string = utils.read_param("customer_inputs_fg_query_string")
query_string = utils.read_param("query_string")
training_job_name = utils.read_param("training_job_name")
logger.info(f"endpoint_name={endpoint_name}, customer_inputs_fg_name={customer_inputs_fg_name},\n"
            f"customer_inputs_fg_query_string={customer_inputs_fg_query_string}, training_job_name={training_job_name}")

In [None]:
# Set up the results bucket location
results_bucket = sagemaker.Session().default_bucket()  # Change this to use a different S3 bucket
results_key = 'aws-databrew-results/Offline-FS'

## Run data profiling jobs
We use the `feature_monitoring_utils` module as a wrapper around AWS Glue DataBrew: `feature_monitoring_prep` sets up the profiling job for the feature group, and `feature_monitoring_run` executes it.

In [None]:
response_brew_prep = feature_monitoring_utils.feature_monitoring_prep(
    customer_inputs_fg_name,
    results_bucket,
    results_key,
    verbose=False
)

In [None]:
# Call the main profile execution function
response_brew_job = feature_monitoring_utils.feature_monitoring_run(
    customer_inputs_fg_name,
    verbose=False
)
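
DataBrew profile jobs run asynchronously on the service side. The notebook does not show whether `feature_monitoring_run` waits for completion, so the following is only a minimal sketch of how one might poll a job run until it reaches a terminal state. It assumes a boto3 DataBrew client (`boto3.client("databrew")`) and its `describe_job_run` call. The `job_name` and `run_id` arguments are hypothetical placeholders, since the notebook does not expose them directly.

```python
import time

# Job run states after which DataBrew stops making progress
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_job_run(client, job_name, run_id, poll_seconds=30.0, max_polls=120):
    """Poll a DataBrew job run until it reaches a terminal state.

    `client` is expected to behave like boto3.client("databrew"), that is,
    to offer describe_job_run(Name=..., RunId=...) returning a dict with a
    "State" key. `job_name` and `run_id` are hypothetical inputs.
    """
    for _ in range(max_polls):
        state = client.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"Job run {run_id} of job {job_name} did not finish after {max_polls} polls"
    )


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# final_state = wait_for_job_run(boto3.client("databrew"), "my-profile-job", "my-run-id")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object instead of a live AWS session.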

In [None]:
# Display the Report S3 location
databrew_profile_console_url = response_brew_job[2]
brew_results_s3 = response_brew_job[4]
logger.info(f"Report is available at the following S3 location:\n{brew_results_s3}\n")

# Display the DataBrew link
print("Please click on the link below to access visualizations in the Glue DataBrew console:")
databrew_link = f'[DataBrew Profile Job Visualizations]({databrew_profile_console_url})'
display(Markdown(databrew_link))
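
To post-process the report programmatically, the S3 location usually has to be split into a bucket and a key first. The sketch below assumes the location is an `s3://bucket/key` URI; the notebook does not confirm the exact format of `brew_results_s3`, and the helper name `split_s3_uri` is made up for this illustration.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split an 's3://bucket/key' URI into (bucket, key).

    Hypothetical helper: it assumes the report location is given in
    s3:// form, which the notebook does not guarantee.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # The URI path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


# Example with a made-up location under the results prefix used above
bucket, key = split_s3_uri("s3://my-bucket/aws-databrew-results/Offline-FS/report.json")
print(bucket, key)
```

The resulting pair could then be passed to an S3 client call such as `download_file(bucket, key, local_path)` to fetch the report.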

<img src="../images/AWS-Glue-DataBrew.png" alt="Data Profile">
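
The introduction suggests tracking drift by comparing statistical summaries, such as standard deviation, between profiling runs. A minimal sketch of that comparison follows. It assumes the per-feature statistics have already been extracted from two reports into plain dictionaries. The feature names, the `"std"` key, the sample values, and the 20% threshold are all illustrative assumptions, not the DataBrew report schema.

```python
from typing import Dict


def relative_change(baseline: float, current: float) -> float:
    """Absolute relative change of `current` with respect to `baseline`."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)


def find_drifted_features(
    baseline_stats: Dict[str, Dict[str, float]],
    current_stats: Dict[str, Dict[str, float]],
    metric: str = "std",      # hypothetical key for the standard deviation
    threshold: float = 0.2,   # flag changes larger than 20% (arbitrary choice)
) -> Dict[str, float]:
    """Return {feature: relative change} for features whose metric moved beyond the threshold."""
    drifted = {}
    for feature, stats in baseline_stats.items():
        # Only compare features and metrics present in both summaries
        if feature not in current_stats:
            continue
        if metric not in stats or metric not in current_stats[feature]:
            continue
        change = relative_change(stats[metric], current_stats[feature][metric])
        if change > threshold:
            drifted[feature] = change
    return drifted


# Made-up summaries from two profiling runs
baseline = {"age": {"std": 10.0}, "income": {"std": 2000.0}}
current = {"age": {"std": 10.5}, "income": {"std": 3000.0}}

drifted = find_drifted_features(baseline, current)
print(drifted)
```

With these sample values only `income` is flagged, because its standard deviation grew by 50% while that of `age` moved by 5%. A threshold on relative change is one simple heuristic; a statistical test on the underlying distributions would be a more robust alternative.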