# Data Quality & Drift
This notebook uses AWS Glue and AWS Glue DataBrew to create a data profiling report. The report can be used to track data drift over time by adding data wrangling code that compares statistical summaries (standard deviation, for example) of data quality metrics for features of interest.

**Note:** This code requires the `pyathena` package. The following cell installs `pyathena` if it is not already installed.

In [1]:
try:
    import pyathena
except ImportError:
    !pip3 install pyathena==2.3.2

Collecting pyathena==2.3.2
  Downloading PyAthena-2.3.2-py3-none-any.whl (37 kB)
Installing collected packages: pyathena
Successfully installed pyathena-2.3.2
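
The `!pip3` line in the cell above is IPython syntax, so the check only works inside a notebook. A minimal sketch of the same idea in plain Python follows. `ensure_package` is a hypothetical helper name, not part of this repository, and it installs with the interpreter that is running the kernel rather than whichever `pip3` is first on the `PATH`.

```python
import importlib.util
import subprocess
import sys


def ensure_package(module_name, pip_spec=None):
    """Install `pip_spec` with pip if `module_name` cannot be imported.

    Returns True if the module was already importable, False if an
    install was attempted. Hypothetical helper, shown for illustration.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    # Use the current interpreter so the package lands in the kernel's environment.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_spec or module_name]
    )
    return False


# Example call for this notebook (commented out so the sketch has no side effects):
# ensure_package("pyathena", "pyathena==2.3.2")
```

Pinning the version in `pip_spec`, as the original cell does, keeps reruns of the notebook reproducible.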


## Imports

In [2]:
from IPython.display import display, Markdown
from datetime import datetime
from pathlib import Path
import sagemaker
import logging
import boto3
import sys
import os

In [3]:
# Make the sibling utils directory importable
package_dir = Path.cwd().parent / 'utils'
print(package_dir)
sys.path.insert(0, str(package_dir))
import utils
import feature_monitoring_utils

/home/ec2-user/SageMaker/feature-store-expedia/utils




## Setup Logging

In [4]:
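# Pass the module's __name__ (not the literal string '__name__') so the logger is named after this module
logger = logging.getLogger(__name__)
logging.basicConfig(format="%(asctime)s,%(filename)s,%(funcName)s,%(lineno)s,%(levelname)s,p%(process)s,%(message)s", level=logging.INFO)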


## Setup Config Variables
Read the metadata (feature group name, model endpoint name, etc.) produced by the previous notebooks so that it can be passed as input to the data profiling jobs.

In [5]:
endpoint_name = utils.read_param("endpoint_name")
customer_inputs_fg_name = utils.read_param("customer_inputs_fg_name")
destinations_fg_name = utils.read_param("destinations_fg_name")
customer_inputs_fg_query_string = utils.read_param("customer_inputs_fg_query_string")
query_string = utils.read_param("query_string")
training_job_name = utils.read_param("training_job_name")
logger.info(f"endpoint_name={endpoint_name}, customer_inputs_fg_name={customer_inputs_fg_name},\n"
            f"customer_inputs_fg_query_string={customer_inputs_fg_query_string}, training_job_name={training_job_name}")

2022-06-10 15:18:21,460,utils.py,read_param,130,INFO,p24251,read_param, fpath=../config/endpoint_name, read endpoint_name=hotel-cluster-prediction-ml-model-2022-06-08-19-12-46-266
2022-06-10 15:18:21,461,utils.py,read_param,130,INFO,p24251,read_param, fpath=../config/customer_inputs_fg_name, read customer_inputs_fg_name=expedia-customer-inputs-2022-6-8-15-0
2022-06-10 15:18:21,462,utils.py,read_param,130,INFO,p24251,read_param, fpath=../config/destinations_fg_name, read destinations_fg_name=expedia-destinations-2022-6-8-15-0
2022-06-10 15:18:21,463,utils.py,read_param,130,INFO,p24251,read_param, fpath=../config/customer_inputs_fg_query_string, read customer_inputs_fg_query_string=SELECT * FROM "expedia-customer-inputs-2022-6-8-15-0-1654700956" limit 10

2022-06-10 15:18:21,464,utils.py,read_param,130,INFO,p24251,read_param, fpath=../config/training_job_name, read training_job_name=hotel-cluster-prediction-ml-model-2022-06-08-19-02-56-809
2022-06-10 15:18:21,465,<ipython-input-5-5b86116
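
The log lines above suggest that `utils.read_param` reads each parameter from its own file under `../config/`. The actual implementation is not shown in this notebook, so the following is only a sketch of that pattern under that assumption. The `write_param` counterpart, the `config_dir` argument, and the whitespace stripping are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_param(name, value, config_dir="../config"):
    """Store one parameter per file (hypothetical counterpart of read_param)."""
    path = Path(config_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(str(value))
    return path


def read_param(name, config_dir="../config"):
    """Return the stripped contents of <config_dir>/<name>, or None if missing."""
    path = Path(config_dir) / name
    if not path.is_file():
        return None
    return path.read_text().strip()


# Round trip in a temporary directory so nothing under ../config is touched.
with tempfile.TemporaryDirectory() as tmp:
    write_param("endpoint_name", "my-endpoint", config_dir=tmp)
    roundtrip = read_param("endpoint_name", config_dir=tmp)
    missing = read_param("does_not_exist", config_dir=tmp)
```

Returning `None` for a missing file is a design choice made for this sketch. Raising an error instead would surface a skipped upstream notebook earlier.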

In [6]:
# Set up the results bucket location
results_bucket = sagemaker.Session().default_bucket()  # Change this to use a different S3 bucket
results_key = 'aws-databrew-results/Offline-FS'

## Run data profiling jobs
We use the `feature_monitoring_prep` function from the `feature_monitoring_utils` module as a wrapper to set up the AWS Glue DataBrew jobs that profile the data.

In [8]:
response_brew_prep = feature_monitoring_utils.feature_monitoring_prep(
    customer_inputs_fg_name,
    results_bucket,
    results_key,
    verbose=False
)

Feature Group S3 URL: s3://expedia-feature-store-offline-195cbf60/expedia-customer-inputs-2022-6-8-15-0/015469603702/sagemaker/us-east-1/offline-store/expedia-customer-inputs-2022-6-8-15-0-1654700956
Feature Group Table Name: expedia-customer-inputs-2022-6-8-15-0-1654700956
CTAS table created successfully: expedia-customer-inputs-2022-6-8-15-0-1654700956-ctas-temp
Start crawling expedia-customer-inputs-2022-6-8-15-0-1654700956-ctas-temp-crawler.


2022-06-10 15:18:55,743,feature_monitoring_utils.py,wait_until_ready,214,INFO,p24251,Crawler expedia-customer-inputs-2022-6-8-15-0-1654700956-ctas-temp-crawler is running.


..........

2022-06-10 15:19:46,567,feature_monitoring_utils.py,wait_until_ready,214,INFO,p24251,Crawler expedia-customer-inputs-2022-6-8-15-0-1654700956-ctas-temp-crawler is stopping.


.............

2022-06-10 15:20:52,674,feature_monitoring_utils.py,wait_until_ready,214,INFO,p24251,Crawler expedia-customer-inputs-2022-6-8-15-0-1654700956-ctas-temp-crawler is ready.


.!

DataBrew Dataset Created:  expedia-customer-inputs-2022-6-8-15-0-dataset
AWS Glue DataBrew Profile Job Created: expedia-customer-inputs-2022-6-8-15-0-profile-job
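
The log above shows the crawler moving through the running, stopping, and ready states while `wait_until_ready` polls it. The body of that function is not shown here, so the following is a sketch of such a polling loop rather than the repository's code. It assumes a client exposing the boto3 Glue `get_crawler(Name=...)` call, whose response carries the state under `["Crawler"]["State"]`. The function name, the timeout, and the `FakeGlue` stand-in are hypothetical.

```python
import time


def wait_until_crawler_ready(glue_client, crawler_name, poll_seconds=5, timeout_seconds=900):
    """Poll the crawler state until it returns to READY or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Crawler {crawler_name} still {state} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)


class FakeGlue:
    """Stand-in client that replays the states seen in the log above."""

    def __init__(self):
        self._states = iter(["RUNNING", "STOPPING", "READY"])
        self.calls = 0

    def get_crawler(self, Name):
        self.calls += 1
        return {"Crawler": {"Name": Name, "State": next(self._states)}}


fake = FakeGlue()
final_state = wait_until_crawler_ready(fake, "demo-crawler", poll_seconds=0)
```

With a real client, the same function would be called as `wait_until_crawler_ready(boto3.client("glue"), crawler_name)` after the crawler has been started.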


In [9]:
# Call the main profile execution function
response_brew_job = feature_monitoring_utils.feature_monitoring_run(
    customer_inputs_fg_name,
    verbose=False
)

Feature Group S3 URL: s3://expedia-feature-store-offline-195cbf60/expedia-customer-inputs-2022-6-8-15-0/015469603702/sagemaker/us-east-1/offline-store/expedia-customer-inputs-2022-6-8-15-0-1654700956
Feature Group Table Name: expedia-customer-inputs-2022-6-8-15-0-1654700956
CTAS table created successfully: expedia-customer-inputs-2022-6-8-15-0-1654700956-ctas-temp
Running DataBrew Profiling Job
......................................!
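
`feature_monitoring_run` hides how the profile job is started and monitored. As a sketch of what such a wrapper might do, assuming the boto3 DataBrew client calls `start_job_run` and `describe_job_run`, the code below starts a run and treats any terminal state other than `SUCCEEDED` as an error. `run_profile_job` and `FakeDataBrew` are hypothetical names used only for illustration.

```python
import time

# States after which a DataBrew job run will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def run_profile_job(databrew_client, job_name, poll_seconds=10):
    """Start a DataBrew job run and block until it reaches a terminal state."""
    run_id = databrew_client.start_job_run(Name=job_name)["RunId"]
    while True:
        state = databrew_client.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Job {job_name} run {run_id} ended in state {state}")
    return run_id


class FakeDataBrew:
    """Stand-in client that replays a fixed sequence of run states."""

    def __init__(self, states):
        self._states = iter(states)

    def start_job_run(self, Name):
        return {"RunId": "run-1"}

    def describe_job_run(self, Name, RunId):
        return {"State": next(self._states)}


ok_run = run_profile_job(
    FakeDataBrew(["STARTING", "RUNNING", "SUCCEEDED"]), "demo-job", poll_seconds=0
)
```

Failing loudly on a non-successful run avoids reading a stale report from an earlier run later in the notebook.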



In [10]:
# Display the Report S3 location
databrew_profile_console_url = response_brew_job[2]
brew_results_s3 = response_brew_job[4]
logger.info(f"Report is available at the following S3 location:\n{brew_results_s3}\n")

# Display the DataBrew link
print("Please click on the link below to access visualizations in Glue DataBrew console:")
databrew_link = f'[DataBrew Profile Job Visualizations]({databrew_profile_console_url})'
display(Markdown(databrew_link))

2022-06-10 15:24:26,411,<ipython-input-10-e4662f4f4412>,<module>,4,INFO,p24251,Report is available at the following S3 location:
s3://sagemaker-us-east-1-015469603702/aws-databrew-results/Offline-FS-reports/expedia-customer-inputs-2022-6-8-15-0-dataset_055636775b80de5f229f2e6e3121542542d93d2dcde421f8cefb8bb8fe805bf2.json



Please click on the link below to access visualizations in Glue DataBrew console:


[DataBrew Profile Job Visualizations](https://us-east-1.console.aws.amazon.com/databrew/home?region=us-east-1#dataset-details?dataset=expedia-customer-inputs-2022-6-8-15-0-dataset&tab=profile-overview)
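
The console link above follows a visible pattern: the region appears twice and the dataset name is passed as a query parameter. A small sketch that rebuilds the link from those two values is shown below. The pattern is inferred only from the single URL printed above, and `databrew_profile_url` is a hypothetical helper, so treat it as illustrative rather than a documented URL format.

```python
from urllib.parse import quote


def databrew_profile_url(region, dataset_name):
    """Build a DataBrew console link to a dataset's profile overview tab."""
    return (
        f"https://{region}.console.aws.amazon.com/databrew/home?region={region}"
        f"#dataset-details?dataset={quote(dataset_name, safe='')}&tab=profile-overview"
    )


url = databrew_profile_url(
    "us-east-1", "expedia-customer-inputs-2022-6-8-15-0-dataset"
)
```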

<img src="../images/AWS-Glue-DataBrew.png">Data Profile</img>
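
The introduction mentions tracking drift by comparing statistical summaries such as standard deviation across profiling runs. The following is a minimal sketch of that comparison. It assumes each profile has already been reduced to a dictionary mapping a feature name to its standard deviation, which is a simplification and not the DataBrew report schema. The function name, the 20% threshold, the feature names, and the numbers are all made up for illustration.

```python
def std_drift(baseline, current, threshold=0.2):
    """Return features whose standard deviation changed by more than
    `threshold`, expressed as a relative change against the baseline."""
    drifted = {}
    for feature, base_std in baseline.items():
        if feature not in current:
            continue  # feature absent from the newer profile; handle separately
        if base_std == 0:
            # Relative change is undefined; flag any move away from zero.
            change = float("inf") if current[feature] != 0 else 0.0
        else:
            change = abs(current[feature] - base_std) / abs(base_std)
        if change > threshold:
            drifted[feature] = change
    return drifted


# Illustrative values only, not taken from the profile report above.
baseline = {"feature_a": 2.0, "feature_b": 10.0}
current = {"feature_a": 2.1, "feature_b": 15.0}
flagged = std_drift(baseline, current)
```

Here `feature_a` changes by 5% and stays below the threshold, while `feature_b` changes by 50% and is flagged. The same comparison could be applied to other summary statistics, such as the mean or the share of missing values.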