# DQDL Wrappers for PyDeequ

This notebook expands on the [basic_example](basic_example.ipynb) by defining a class with a set of 'quality' methods modeled on the boto3 Glue.Client. It provides a custom implementation of DQDL-like functionality with alternate data stores in DynamoDB and S3, while supporting easy future migration to Glue ETL.


### Sample Scripts
First, execute the Initialization and Definition cells below (these will eventually move to a common library sourced here).

#### 1. Generate Rule Recommendations

Rule recommendations utilize [Deequ *constraint suggestion*](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md) functionality to first profile all columns in the data set, and then apply heuristic rules to define a set of suggested constraints.

This library adds a [DQDL-like Rule](https://docs.aws.amazon.com/glue/latest/dg/dqdl.html) to each suggested constraint, and stores the suggested constraints in the **dqsuggestion_runs** DynamoDB table.
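
As a rough illustration of that conversion, the sketch below reproduces the string handling from `run_pydeequ_suggestions` (defined later in this notebook) as a standalone function. The name `to_dqdl_rule` and the sample inputs are assumptions made for illustration; the real `code_for_constraint` strings come from PyDeequ and may differ.

```python
import re

def to_dqdl_rule(code_for_constraint, suggesting_rule):
    """Convert a PyDeequ constraint snippet into a DQDL-like rule string (illustrative)."""
    x = code_for_constraint
    if suggesting_rule.startswith('FractionalCategoricalRangeRule'):
        # keep only the column and value list, dropping the lambda assertion
        x = x[:x.find(', lambda')]
    elif suggesting_rule.startswith('RetainCompletenessRule'):
        # keep the column and the comparison, e.g. '>= 0.9'
        y = x.split(', ')
        x = f"{y[0]} {y[1].replace('lambda x: x ', '')}"
    # remove spaces inside bracketed value lists
    for y in re.findall(r'\[.*?\]', x):
        x = x.replace(y, y.replace(', ', ','))
    # drop the leading '.', capitalize the method name, turn parentheses into spaces
    return re.sub(r'[()]', ' ', x[1].upper() + x[2:]).replace(', ', ' in ').strip()

# hypothetical inputs, shaped like PyDeequ constraint suggestions
samples = [
    ('.isComplete("agency_id")', 'CompleteIfCompleteRule()'),
    ('.isContainedIn("status", ["A", "B"])', 'CategoricalRangeRule()'),
    ('.hasCompleteness("phone", lambda x: x >= 0.9, "It should be above 0.9!")', 'RetainCompletenessRule()'),
]
for code, rule_name in samples:
    print(to_dqdl_rule(code, rule_name))
```

Each resulting rule has the form `Type "column" expression`, which is the layout `parse_dqdl_rule` expects when the Ruleset is evaluated.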


In [42]:
# 1. Generate Rule Recommendations
s3_url = "s3a://wc2h-dtl-prd-datalake/PARQUET/tstg_creditor_agency/Unload.D230831/" # OK
#s3_url = "s3a://wc2h-dtl-prd-datalake/PARQUET/xstg_arclient/Unload.D230831/" # too many cols
#s3_url = "s3a://wc2h-dtl-prd-datalake/PARQUET/xstg_arphone/Unload.D230831/"  # wtf?

arg = {
    "DataSource" : {
        "S3Url" : s3_url
    }
}

dq = SimpleDQ()

run_id = dq.start_data_quality_rule_recommendation_run( **arg )
run_id

DataSource from S3 Url 's3a://wc2h-dtl-prd-datalake/PARQUET/tstg_creditor_agency/Unload.D230831/'


23/09/12 13:30:40 WARN DAGScheduler: Broadcasting large task binary with size 1259.4 KiB

{'RunId': 'dqrecrun-2023-09-12T13:30:38.314605Z-interactive'}
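
The RunId embeds the start timestamp, so a run's start time can be recovered from the identifier alone. A minimal sketch, assuming the `<prefix>-<timestamp>-interactive` layout seen in the output above; the helper name `run_started_on` is hypothetical.

```python
import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # same format string the notebook uses for run timestamps

def run_started_on(run_id):
    """Extract the start timestamp embedded in a RunId (assumes prefix-timestamp-suffix layout)."""
    # drop everything up to the first '-' and everything after the last '-'
    timestamp = run_id.split('-', 1)[1].rsplit('-', 1)[0]
    return datetime.datetime.strptime(timestamp, TS_FORMAT)

started = run_started_on('dqrecrun-2023-09-12T13:30:38.314605Z-interactive')
print(started.isoformat())
```

The notebook builds these timestamps with `datetime.datetime.now()`, so the value is local time on the host even though the string ends in `Z`.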

##### 1a. Create Data Quality Ruleset from Recommendations
Suggested constraints provide detailed information about every column in the dataset. This library stores DQDL-like rules in a more concise format, similar to the Glue Data Quality API. Rulesets stored in the **dqrulesets** DynamoDB table can be edited to keep only the desired constraints or to modify the conditions applied (hint: use DynamoDB Form mode to edit the Ruleset string). Alternatively, Rulesets can be defined from scratch.
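
When editing a Ruleset string by hand, it helps to know how it is assembled and later split. The sketch below mirrors the join used when the Ruleset is stored and the split used at evaluation time; the rule texts are hypothetical examples.

```python
rules_list = [
    'IsComplete "agency_id"',
    'IsContainedIn "status" in ["A","B"]',   # no spaces between list values
    'HasSize >= 100',
]

# stored form: a single string, rules separated by ", \n"
ruleset_text = ", \n".join(rules_list)

# evaluation splits the stored string back into individual rules
recovered = ruleset_text.replace('\n', '').split(', ')
print(recovered == rules_list)

# a hand-edited rule with a space inside the list is split in the wrong place
edited = ruleset_text.replace('["A","B"]', '["A", "B"]')
broken = edited.replace('\n', '').split(', ')
print(len(broken), broken)
```

Because `", "` is the rule separator, any comma followed by a space inside a rule is treated as a rule boundary, which is why value lists must be written without spaces.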

In [43]:
#### 1a. Convert Recommendations to a Ruleset

print(run_id) # from Step 1

tablename = get_tablename_from_datasource( arg['DataSource'] ) # from Step 1

dq = SimpleDQ()

rec = dq.get_data_quality_rule_recommendation_run( **run_id )
rec
rules_list = []
for item in rec['ConstraintSuggestions']:
    rules_list.append( item['dqdl_rule'] )
    
#print(rules_list)

ruleset = {
 "Name": f"{tablename}_generated",
 "ClientToken": "string",
 "Description": "ruleset from pydeequ.suggestions",
 "Ruleset": ", \n".join(rules_list) , # stored as string
 "Tags": {
  "SuggestionRunId": rec["RunId"]
 },
 "DataSource" : rec["DataSource"],    # custom  
 "TargetTable": {
  "CatalogId": "AwsDataCatalog",
  "DatabaseName": "dtl-prd-smpl0-g2", # ToDo from variable
  "TableName": tablename
 }
}

dq.update_data_quality_ruleset(**ruleset)


{'RunId': 'dqrecrun-2023-09-12T13:30:38.314605Z-interactive'}
Getting Suggestion Run dqrecrun-2023-09-12T13:30:38.314605Z-interactive


{'Name': 'tstg_creditor_agency_generated'}

#### 2. Run Ruleset to Evaluate Data Quality
Next, the Ruleset can be evaluated against particular datasets to verify compliance with each constraint. Ruleset Evaluation Runs are logged in the **dqruleset-eval-runs** DynamoDB table, which associates the evaluated DataSource with the Rulesets applied and the Results of the evaluation. Multiple Rulesets can be applied against a DataSource in the same Evaluation Run, generating multiple Results stored in the **dqresults** table.
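
Each Result carries a Tally of rule outcomes and a Score computed as PASS / (PASS + FAIL), so skipped rules do not affect the Score. A small sketch of that arithmetic, using the outcome counts from the sample run in this notebook; the function name `tally_and_score` is hypothetical.

```python
from collections import Counter

def tally_and_score(results):
    """Count rule outcomes and compute Score = PASS / (PASS + FAIL)."""
    tally = {'RULES': len(results), 'PASS': 0, 'FAIL': 0}
    tally.update(Counter(results))          # adds SKIP / ERROR counts when present
    evaluated = tally['PASS'] + tally['FAIL']
    score = tally['PASS'] / evaluated if evaluated else None  # None when nothing was evaluated
    return tally, score

results = ['PASS'] * 64 + ['FAIL'] * 14 + ['SKIP'] * 2
tally, score = tally_and_score(results)
print(tally)
print(score)
```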


In [49]:
#### 2. Run Ruleset to Evaluate Data Quality

dq = SimpleDQ()

ruleset_name = f"{tablename}_generated"  # from Step 1a

ruleset = dq.get_data_quality_ruleset( Name = ruleset_name )
#print(ruleset)

# note the option to evaluate multiple Rulesets by name in same run
ruleset_runspec = {
    'DataSource': ruleset['DataSource'],
    'RulesetNames' : [
        ruleset['Name'],
    ]
}

eval_run_id = dq.start_data_quality_ruleset_evaluation_run( **ruleset_runspec)

dq_eval_run = dq.get_data_quality_ruleset_evaluation_run( **eval_run_id )

# note that a ResultId will be generated for each Ruleset
for result_id in dq_eval_run['ResultIds']:
    dq_result = dq.get_data_quality_result( ResultId = result_id )
    #print(json.dumps(dq_result, default=str, indent=2))
    # report every result, not only the last one fetched
    print( f"Score: {dq_result['Score']}")
    print( f"Tally: {dq_result['Tally']}")


Getting Ruleset tstg_creditor_agency_generated
DataSource from S3 Url 's3a://wc2h-dtl-prd-datalake/PARQUET/tstg_creditor_agency/Unload.D230831/'
Getting Ruleset tstg_creditor_agency_generated
Skipping Check -- Rule Type 'HasDataType' is not implemented.
Skipping Check -- Rule Type 'HasCompleteness' is not implemented.
{'RULES': 80, 'PASS': 64, 'FAIL': 14, 'SKIP': 2}
Getting Eval Run dqrun-2023-09-12T13:37:59.031230Z-interactive
Getting Result dqresult-2023-09-12T13:37:59.262032Z-tstg_creditor_agency_generated
Score: 0.8205128205128205
Tally: {'SKIP': Decimal('2'), 'RULES': Decimal('80'), 'PASS': Decimal('64'), 'FAIL': Decimal('14')}


### Initialization

In [None]:
%%bash
# cold start
pip install pydeequ
pip install 'awswrangler[redshift]'

In [None]:
# cold start for SageMaker on WC2H -- unzip dependencies from local file, since Maven is blocked
import zipfile
import os

S_rootdir = os.getcwd()

with zipfile.ZipFile( f"{S_rootdir}/common/ivy2cache.zip" ) as z:
    z.extractall( f"{os.environ['HOME']}/.ivy2" )

with zipfile.ZipFile( f"{S_rootdir}/common/ivy2cache_33.zip" ) as z:
    z.extractall( f"{os.environ['HOME']}/.ivy2" )



In [1]:
import os 
os.environ['AWS_DEFAULT_REGION'] = 'us-gov-west-1'
#os.environ["SPARK_VERSION"] = '3.0'
os.environ["SPARK_VERSION"] = '3.3'

import awswrangler as wr
import pandas as pd
pd.DataFrame.iteritems = pd.DataFrame.items # (https://stackoverflow.com/questions/75926636/databricks-issue-while-creating-spark-data-frame-from-pandas)


In [2]:
from pyspark.sql import SparkSession, Row, DataFrame

import sagemaker_pyspark
import pydeequ

classpath = ":".join(sagemaker_pyspark.classpath_jars())

from pyspark import SparkConf
conf = (SparkConf()
        .set('fs.s3a.endpoint', 's3-us-gov-west-1.amazonaws.com')
        .set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
        .set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
        .set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
        .set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")        
       )

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .config( conf=conf )
    .getOrCreate())



:: loading settings :: url = jar:file:/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/ec2-user/.ivy2/cache
The jars for the packages stored in: /home/ec2-user/.ivy2/jars
com.amazon.deequ#deequ added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-7a85ebe5-2cb1-49a8-b364-74064c0b02af;1.0
	confs: [default]
	found com.amazon.deequ#deequ;2.0.3-spark-3.3 in central
	found org.scala-lang#scala-reflect;2.12.10 in central
	found org.scalanlp#breeze_2.12;0.13.2 in central
	found org.scalanlp#breeze-macros_2.12;0.13.2 in central
	found com.github.fommil.netlib#core;1.1.2 in central
	found net.sf.opencsv#opencsv;2.3 in central
	found com.github.rwl#jtransforms;2.4.0 in central
	found junit#junit;4.8.2 in central
	found org.apache.commons#commons-math3;3.2 in central
	found org.spire-math#spire_2.12;0.13.0 in central
	found org.spire-math#spire-macros_2.12;0.13.0 in central
	found org.typelevel#machinist_2.12;0.6.1 in central
	found com.chuusai#shapeless_2.12;2.3.2 in central
	found org.typelevel#macro-compat_2.12;1.

23/09/11 20:24:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/09/11 20:24:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/09/11 20:24:53 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


### Function and Class Definitions
These definitions will be moved to a common module so they can be used from either notebooks or ETL jobs.

In [14]:
def parse_dqdl_rule(rule_text): # ToDo move to common module
    """ Transform a DQDL-like rule from string to dict """
    import json
    import re
    
    s = rule_text.split(' ', 2)

    rule = {
        'Type' : s[0],
        'ColName' : '',
        'Expression' : '',
        'Lambda' : None,
        'Text' : rule_text
    }
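    if len(s) > 1 and '"' in s[1]:
        rule['ColName'] = s[1].replace('"','')
        if len(s) == 3:
            rule['Expression'] = s[2]
    else:
        # no quoted column name (e.g. 'HasSize > 100'): everything after the type is the expression
        rule['Expression'] = ' '.join(s[1:])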

    # transform the Expression into a lambda assertion
    if rule['Expression'] == '':
        pass
    
    elif re.search("[<=>]", rule['Expression']):
        xpr = rule['Expression'].split()
        op =  xpr[0]
        val = float(xpr[1])
        if op == "=":
            rule['Lambda'] = lambda x: x == val
        elif op == ">":
            rule['Lambda'] = lambda x: x > val
        elif op == "<":
            rule['Lambda'] = lambda x: x < val
        elif op == ">=":
            rule['Lambda'] = lambda x: x >= val
        elif op == "<=":
            rule['Lambda'] = lambda x: x <= val
        
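    elif rule['Expression'].startswith('between'):
        # e.g. 'between 1 and 10'; convert the bounds to numbers so they compare with metric values
        xpr = rule['Expression'].split()
        lo = float(xpr[1])
        hi = float(xpr[3])
        rule['Lambda'] = lambda x: lo < x < hi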
        
    elif rule['Expression'].startswith('in'): 
        xpr = rule['Expression'].split() # no spaces between list values, please!
        inlist = xpr[1][1:-1].replace('"','').split(',')
        # for 'in' rules, 'Lambda' holds the list of allowed values (passed to isContainedIn), not a function
        rule['Lambda'] = inlist
        
    else:
        print(f"Can't parse expression: {rule['Expression']}")
    
    #print(json.dumps(rule, indent=2, default=str))

    return rule
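
The comparison branch of `parse_dqdl_rule` could also be written as a lookup table. This is an alternative sketch, not the notebook's implementation: `make_assertion` is a hypothetical helper, and it assumes the operator and the value are separated by a space, as in `>= 0.9`.

```python
import operator

# DQDL-like comparison operators mapped to Python functions
OPERATORS = {
    '=': operator.eq,
    '>': operator.gt,
    '<': operator.lt,
    '>=': operator.ge,
    '<=': operator.le,
}

def make_assertion(expression):
    """Build a one-argument predicate from an expression such as '>= 0.9'."""
    op, value = expression.split()
    compare = OPERATORS[op]      # raises KeyError for unsupported operators
    threshold = float(value)
    return lambda x: compare(x, threshold)

at_least_90_percent = make_assertion('>= 0.9')
print(at_least_90_percent(0.95), at_least_90_percent(0.5))
```

A table keeps the supported operators in one place and makes an unsupported operator fail loudly instead of leaving the assertion unset.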

In [3]:
#def run_pydeequ_checks( df, ruleset_name ): # ToDo move to common module
def run_pydeequ_checks( df, ruleset_name, rules_list ): # ToDo move to common module
    from pydeequ.checks import Check,CheckLevel
    from pydeequ.verification import VerificationSuite,VerificationResult
    import datetime
    beg_time = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    dq_result_id = f'dqresult-{beg_time}-{ruleset_name}'
    
    # create check object representing a set of constraints 
    check = Check(spark, CheckLevel.Error, ruleset_name)
    
    parsed_rules = []
    for rule_text in rules_list:
        rule = parse_dqdl_rule(rule_text)
        rule['status'] = 'processed'

        if rule['Type'] == 'HasSize':
            check.hasSize( rule['Lambda'] )
        elif rule['Type'] == 'HasMin':
            check.hasMin( rule['ColName'], rule['Lambda'] )
        elif rule['Type'] == 'IsComplete':
            check.isComplete( rule['ColName'] )        
        elif rule['Type'] == 'IsUnique':
            check.isUnique( rule['ColName'] )        
        elif rule['Type'] == 'IsContainedIn':  
            check.isContainedIn(rule['ColName'], rule['Lambda'])
        elif rule['Type'] == 'IsNonNegative':
            check.isNonNegative( rule['ColName'] )
        else:
            rule['status'] = 'skipped'
            msg = f"Skipping Check -- Rule Type '{rule['Type']}' is not implemented."
            '''
            skipped.append( {
                'constraint' : rule_text,
                'constraint_status' : 'Skipped',
                'constraint_message' : msg
            })'''
            print( msg )
        parsed_rules.append( rule )

    # apply constraints to data frame
    checkResult = VerificationSuite(spark).onData(df).addCheck(check).run()

    # get DeeQu results and customize to resemble Glue Data Quality API
    check_results = checkResult.checkResults # dict
    j=-1
    rule_results = []

    for i in range(len(parsed_rules)):
        #print(i,j)
        rule_result = {
            "Name" : f"Rule_{i}",
            "Description" : parsed_rules[i]["Text"]
        }
        if parsed_rules[i]['status'] == 'processed':
            j += 1
            rule_result.update ( {
                "EvaluationMessage" : check_results[j]["constraint_message"],
                "EvaluatedMetrics" : {
                    "Constraint" : check_results[j]["constraint"]
                }                        
            } )
            if check_results[j]['constraint_status'] == 'Success':
                rule_result['Result'] = 'PASS'
            elif check_results[j]['constraint_status'] == 'Failure':
                rule_result['Result'] = 'FAIL'
            else:
                rule_result['Result'] = 'ERROR'
            
        elif parsed_rules[i]['status'] == 'skipped':
            rule_result.update ( {
                "EvaluationMessage" : f"Rule Type '{parsed_rules[i]['Type']}' is not implemented.",
                "EvaluatedMetrics" : {},
                "Result" : 'SKIP'
            } )   
            
        rule_results.append( rule_result )

    # record the completion time once, after all rules have been processed
    end_time = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")

    df_dq_result = pd.DataFrame.from_dict(rule_results, orient='columns')
    
    tally = { 
        'RULES' : len(df_dq_result),
        'PASS' : 0,
        'FAIL' : 0 }
    ls = list(df_dq_result['Result'])
    x = set(ls)
    for item in x:
        tally.update( {item : ls.count(item)} )

    print(tally)
    evaluated = tally['PASS'] + tally['FAIL']
    score = tally['PASS'] / evaluated if evaluated else 0.0  # avoid ZeroDivisionError when every rule was skipped
    #tally, score

    result = {
        "ResultId" : dq_result_id,
        "Score" : score ,
        "Tally" : tally ,
        "DataSource" : "DataFrame",
        "RulesetName" : ruleset_name,
        "StartedOn" : beg_time,
        "CompletedOn" : end_time,
        "RulesetEvaluationRunId": "unknown",
        "RuleResults": rule_results
    }

    return result # dict
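
The result bookkeeping in `run_pydeequ_checks` assumes one check result per applied constraint, in the order the constraints were added, while skipped rules produce no check result. A simplified sketch of that alignment, using plain dictionaries in place of PyDeequ output (the function name and sample values are hypothetical):

```python
def align_results(parsed_rules, check_results):
    """Pair each parsed rule with its check result; skipped rules consume no result."""
    remaining = iter(check_results)
    status_map = {'Success': 'PASS', 'Failure': 'FAIL'}
    aligned = []
    for i, rule in enumerate(parsed_rules):
        if rule['status'] == 'skipped':
            outcome = 'SKIP'
        else:
            # take the next check result and translate its status
            outcome = status_map.get(next(remaining)['constraint_status'], 'ERROR')
        aligned.append({'Name': f"Rule_{i}", 'Description': rule['Text'], 'Result': outcome})
    return aligned

parsed_rules = [
    {'Text': 'IsComplete "agency_id"', 'status': 'processed'},
    {'Text': 'HasDataType "agency_id" = Integral', 'status': 'skipped'},
    {'Text': 'IsUnique "agency_id"', 'status': 'processed'},
]
check_results = [
    {'constraint_status': 'Success'},
    {'constraint_status': 'Failure'},
]
for row in align_results(parsed_rules, check_results):
    print(row['Name'], row['Result'])
```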

In [4]:
def get_dataframe_from_datasource( DataSource ):
    
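    if not ('GlueTable' in DataSource or 'S3Url' in DataSource):
        raise ValueError("DataSource must contain either 'GlueTable' or 'S3Url'")

    # Two options are implemented for DataSource (GlueTable, S3Url); a SQL option is planned ...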
    if 'GlueTable' in DataSource.keys(): # GlueDQ (like Glue Data Quality API)
        table = DataSource['GlueTable']
        s3_url = wr.catalog.get_table_location( database=table['DatabaseName'], table=table['TableName'])
        s3_url = s3_url + '/*/*'
        print (f"DataSource from GlueTable '{table['DatabaseName']}.{table['TableName']}'")

    elif 'S3Url' in DataSource.keys():   # DataSource Alt #1 (custom)
        s3_url = DataSource['S3Url']
        # table = DataSource['TableName'] # ToDo substring of S3Url
        print (f"DataSource from S3 Url '{s3_url}'")
    '''    
    elif 'SQL' in DataSource.keys():  # ToDo DataSource Alt #2 (custom)  
        s3_url = None
        sql = DataSource['Athena']['SQL']
        dbname = DataSource['Athena']['DataBase']
        
        # ToDo read Athena and/or Redshift (Spectrum) into Spark df
        df_pd = wr.athena.read_sql_query(sql, database=dbname)
        print('Convert Pandas to Spark')
        df = spark.createDataFrame(df_pd) 
    '''    
    if s3_url:
        df = spark.read.parquet( s3_url.replace( 's3://', 's3a://') )
        
    return df


In [41]:
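def get_tablename_from_datasource( DataSource ):
    if not ('GlueTable' in DataSource or 'S3Url' in DataSource):
        raise ValueError("DataSource must contain either 'GlueTable' or 'S3Url'")

    if 'GlueTable' in DataSource.keys(): # GlueDQ (like Glue Data Quality API)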
        tablename = DataSource['GlueTable']['TableName']

    elif 'S3Url' in DataSource.keys():   # DataSource Alt #1 (custom)
        s3_url = DataSource['S3Url']
        tablename = s3_url.split('/')[4] # ToDo improve
    '''    
    elif 'SQL' in DataSource.keys():  # ToDo DataSource Alt #2 (custom)  
    '''    
    return tablename
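
One possible way to address the "ToDo improve" note is to parse the URL instead of indexing the raw split. This is a sketch under the assumption that the path is laid out as `<format>/<table>/<unload>/`, as in the sample URLs in this notebook; `tablename_from_s3_url` and its `position` parameter are hypothetical.

```python
from urllib.parse import urlparse

def tablename_from_s3_url(s3_url, position=1):
    """Return the path segment holding the table name (assumes <bucket>/<format>/<table>/...)."""
    # split the key prefix into non-empty segments, ignoring leading and trailing slashes
    segments = [seg for seg in urlparse(s3_url).path.split('/') if seg]
    if len(segments) <= position:
        raise ValueError(f"Cannot derive a table name from '{s3_url}'")
    return segments[position]

print(tablename_from_s3_url("s3a://wc2h-dtl-prd-datalake/PARQUET/tstg_creditor_agency/Unload.D230831/"))
```

Compared with `s3_url.split('/')[4]`, this raises a descriptive error when the URL is too short rather than an `IndexError`.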

In [5]:
def run_pydeequ_suggestions( df ): # ToDo move to common module
    import re  # used below; imported here so the function does not depend on cell execution order
    from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

    constraint_suggestions = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()
    
    # transform pydeequ code into DQDL rule 
    for item in constraint_suggestions["constraint_suggestions"]:
        x = item["code_for_constraint"]
        if item['suggesting_rule'].startswith('FractionalCategoricalRangeRule'):
            x = x[:x.find(', lambda')]
        elif item['suggesting_rule'].startswith('RetainCompletenessRule'):
            y = x.split(', ')
            x = f"{y[0]} {y[1].replace('lambda x: x ','')}"            
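        # remove spaces inside bracketed value lists so the Ruleset can later be split on ', '
        for y in re.findall(r'\[.*?\]', x): 
            x = x.replace( y, y.replace(', ', ','))
        # drop the leading '.', capitalize the method name, and replace parentheses with spaces
        x = re.sub(r'[()]', ' ', (x[1].upper() + x[2:])).replace( ', ', ' in ').strip() 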
        item.update( {
            'dqdl_rule' : x
        })

    return constraint_suggestions


In [7]:
import datetime
import re

class SimpleDQ: # ToDo move to common module
    ''' Data Quality functionality similar to boto3 Glue.Client '''
    import datetime

    import boto3
    glue_client = boto3.client('glue')
    ddb_resource = boto3.resource('dynamodb')
    
    # DynamoDb Tables for PyDeeQu data stores similar to Glue Data Quality API
    dqrulesets_table = ddb_resource.Table('dtl-prd-SMPL0-dqrulesets') 
    dqruleset_eval_runs_table = ddb_resource.Table('dtl-prd-SMPL0-dqruleset-eval-runs') 
    dqresults_table = ddb_resource.Table('dtl-prd-SMPL0-dqresults') 
    dqsuggestion_runs_table = ddb_resource.Table('dtl-prd-SMPL0-dqsuggestion_runs') 
    #dqsuggestions_table = ddb_resource.Table('dtl-prd-SMPL0-dqsuggestions') 
    
    mode='PyDeeQu'  

    def __init__(self, mode='PyDeeQu', **kwargs) -> None:
        if mode == 'GlueDQ':
            # pass-thru wrapper for boto3 class Glue.Client *data_quality* methods
            print('Glue Data Quality implementation pending availability on Govcloud')
        self.mode = mode

    def batch_get_data_quality_result(self, **kwargs): pass # ToDo 
    def cancel_data_quality_rule_recommendation_run(self, **kwargs): pass # ToDo 
    def cancel_data_quality_ruleset_evaluation_run(self, **kwargs): pass # ToDo

    def create_data_quality_ruleset( self, **kwargs):
        if isinstance(kwargs['Ruleset'], list):
            # the Ruleset is stored as a single string
            kwargs['Ruleset'] = ", ".join(kwargs['Ruleset'])

        if self.mode == "GlueDQ":
            response = self.glue_client.create_data_quality_ruleset(
                Name = kwargs['Name'],       # str Reqd
                Ruleset = kwargs['Ruleset'], # str Reqd 
                Description = kwargs['Description'],
                Tags = kwargs['Tags'],      # dict
                TargetTable = kwargs['TargetTable'], # dict
                ClientToken = kwargs['ClientToken']
            )
            return response   # { 'Name': 'string' }
        elif self.mode == "PyDeeQu":
            now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            kwargs.update( {
                'CreatedOn' : now,
                'LastModifiedOn' : now
            })
            self.dqrulesets_table.put_item(
                Item = kwargs
            )
            return { 'Name' : kwargs['Name'] }

    def delete_data_quality_ruleset(self, **kwargs): pass # ToDo

    def get_data_quality_result(self, **kwargs): 
        print(f"Getting Result {kwargs['ResultId']}")
        response = self.dqresults_table.get_item(
            Key = { 'ResultId' : kwargs['ResultId'] }
        )
        return response['Item']
    
    def get_data_quality_rule_recommendation_run(self, **kwargs): 
        print(f"Getting Suggestion Run {kwargs['RunId']}")
        response = self.dqsuggestion_runs_table.get_item(
            Key = { 'RunId' : kwargs['RunId'] }
        )
        return response['Item']

    def get_data_quality_ruleset(self, **kwargs): 
        print(f"Getting Ruleset {kwargs['Name']}")
        if self.mode == "GlueDQ":
            print('Glue Data Quality implementation pending availability on Govcloud')
            '''response = self.glue_client.get_data_quality_ruleset(
                Name=kwargs['Name']
            )
            return response'''
        elif self.mode == "PyDeeQu":
            response = self.dqrulesets_table.get_item(
                Key = { 'Name' : kwargs['Name'] }
            )
            return response['Item']
       
    def get_data_quality_ruleset_evaluation_run(self, **kwargs): 
        print(f"Getting Eval Run {kwargs['RunId']}")
        if self.mode == "GlueDQ":
            print('Glue Data Quality implementation pending availability on Govcloud')

        elif self.mode == "PyDeeQu":
            response = self.dqruleset_eval_runs_table.get_item(
                Key = { 'RunId' : kwargs['RunId'] }
            )
            return response['Item']
    
    def list_data_quality_results(self, **kwargs): pass # ToDo
    def list_data_quality_rule_recommendation_runs(self, **kwargs): pass # ToDo
    def list_data_quality_ruleset_evaluation_runs(self, **kwargs): pass # ToDo
    def list_data_quality_rulesets(self, **kwargs): pass # ToDo
    
    def start_data_quality_rule_recommendation_run(self, **kwargs):
        if self.mode == "GlueDQ":
            print('Glue Data Quality implementation pending availability on Govcloud')

        elif self.mode == "PyDeeQu":
            # 'lightweight' interactive option for logic that will be implemented in Glue ETL job
            beg_time = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            dqrunid = f'dqrecrun-{beg_time}-interactive'
            
            process_parms = kwargs            
            process_parms.update ({
                'RunId' : dqrunid,
                'Status' : 'RUNNING',
                'StartedOn' : beg_time
            })
            # Store in DynDb table
            self.dqsuggestion_runs_table.put_item( Item = process_parms )
            
            df = get_dataframe_from_datasource( process_parms['DataSource'] )

            dq_suggestions = run_pydeequ_suggestions( df )

            process_parms.update ({
                'ConstraintSuggestions' : dq_suggestions['constraint_suggestions'], 
                'Status' : 'SUCCEEDED',
                'CompletedOn' : datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            })
            # Store in DynDb table
            self.dqsuggestion_runs_table.put_item( Item = process_parms )
            
            return { 'RunId': process_parms['RunId'] }
    
    def start_data_quality_ruleset_evaluation_run(self, **kwargs): 
        import datetime

        if self.mode == "GlueDQ":
            print('Glue Data Quality implementation pending availability on Govcloud')

        elif self.mode == "PyDeeQu":

            # 'lightweight' interactive option for logic that will be implemented in Glue ETL job
            beg_time = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            dqrunid = f'dqrun-{beg_time}-interactive'
            
            process_parms = kwargs
            process_parms.update ({
                'RunId' : dqrunid,
                'ResultIds' : [],
                'Status' : 'RUNNING',
                'StartedOn' : beg_time
            })
            
            # Store in DynDb table
            self.dqruleset_eval_runs_table.put_item( Item = process_parms )
            
            df = get_dataframe_from_datasource( process_parms['DataSource'] )
                
            for ruleset_name in process_parms['RulesetNames']:
                ruleset = self.get_data_quality_ruleset( Name = ruleset_name ) 
                rules_list = ruleset['Ruleset'].replace('\n','').split(', ')
                
                dq_result = run_pydeequ_checks( df, ruleset_name, rules_list ) 
                dq_result.update( {
                    'DataSource' : process_parms['DataSource'],
                    'RulesetEvaluationRunId' : process_parms['RunId']
                })
                
                import json
                from decimal import Decimal
                dq_result = json.loads(json.dumps(dq_result), parse_float=Decimal)
                self.dqresults_table.put_item(
                    Item = dq_result
                )
                                
                process_parms['ResultIds'].append( dq_result['ResultId'] )
            
            process_parms.update ({
                'Status' : 'SUCCEEDED',
                'CompletedOn' : datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            })

            # Store in DynDb table
            self.dqruleset_eval_runs_table.put_item( Item = process_parms )
            
            return { 'RunId': process_parms['RunId'] }
    
    def update_data_quality_ruleset(self, **kwargs): 
        if self.mode == "GlueDQ":
            response = self.glue_client.update_data_quality_ruleset(
                Name = kwargs['Name'],
                Description = kwargs['Description'],
                Ruleset = kwargs['Ruleset']
            )
            return response
        elif self.mode == "PyDeeQu":
            # 'put_item' works for both create and update
            now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")

            kwargs.update( {
                'LastModifiedOn' : now
            })
            self.dqrulesets_table.put_item(
                Item = kwargs
            )
            return { 'Name' : kwargs['Name'] }
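
The evaluation method converts each result with a JSON round trip before calling `put_item`, because the boto3 DynamoDB resource does not accept Python `float` values and expects `Decimal` instead. The sketch below isolates that conversion; `floats_to_decimal` is a hypothetical helper name and the sample item is illustrative.

```python
import json
from decimal import Decimal

def floats_to_decimal(item):
    """Round-trip through JSON so every float becomes a Decimal; ints and strings are unchanged."""
    return json.loads(json.dumps(item), parse_float=Decimal)

result = {'Score': 64 / 78, 'Tally': {'PASS': 64, 'FAIL': 14}, 'RulesetName': 'example'}
converted = floats_to_decimal(result)
print(type(converted['Score']).__name__, type(converted['Tally']['PASS']).__name__)
```

The round trip only works for JSON-serializable values, so anything else (for example `datetime` objects) would need to be converted to strings first, as the notebook already does for its timestamps.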
