# dqRunning
### satisfies Check

```yaml
Check: {
    level: Error, 
    description: 'CheckObject by yaml File'
}

Constraints: [
    addConstraint: {
        check_method: satisfies(),
        columnCondition: "i1c_rating_riscos = 9",
        constraintName: "isRating9",
        assertion: "lambda x: x>0.2"
    }
    
]

```
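
As a rough illustration of what this check evaluates: Deequ's `satisfies` computes the fraction of rows for which `columnCondition` holds and passes that fraction to `assertion`. The plain-Python sketch below mimics that logic without Spark. The `compliance` helper and the sample ratings are hypothetical, and treating nulls as non-matching rows is an assumption made for this sketch.

```python
from typing import Callable, Iterable, Optional


def compliance(values: Iterable[Optional[int]],
               condition: Callable[[int], bool]) -> float:
    """Fraction of rows for which the condition holds.

    Assumption for this sketch: null values count as non-matching rows.
    """
    rows = list(values)
    if not rows:
        raise ValueError("compliance is undefined for an empty dataset")
    matching = sum(1 for v in rows if v is not None and condition(v))
    return matching / len(rows)


# Hypothetical sample standing in for the i1c_rating_riscos column.
sample_ratings = [9, 3, 9, 5, None, 1, 9, 7, 2, 4]

# Equivalent of columnCondition: "i1c_rating_riscos = 9"
ratio = compliance(sample_ratings, lambda rating: rating == 9)

# Equivalent of assertion: "lambda x: x>0.2"
assertion = lambda x: x > 0.2
check_passed = assertion(ratio)

print(f"compliance={ratio:.2f} passed={check_passed}")
```

With this reading, the check passes only when more than 20% of the rows have a rating of 9.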

### Installation

The first step is to install the package with *pip install*.

In [4]:
!pip install dataquality_bnr

### Set up a PySpark session
The library was built for use with __PySpark__ and enables '*unit tests for data*': quality validations executed against large-scale datasets.<br>
Integrating the Spark session with the library requires only two additional settings, `spark.jars` and `spark.jars.excludes`:


In [5]:
from pyspark.sql import SparkSession, Row
from dataquality_bnr.dqSupport import main as dqSup

spark = SparkSession.builder\
        .appName("pdq_helloWorld")\
        .enableHiveSupport()\
        .config("spark.sql.catalogImplementation","hive")\
        .config("spark.jars", dqSup.getDeequJar_path())\
        .config("spark.jars.excludes", dqSup.getDeequJar_excludes())\
        .config("spark.executor.memory","12g")\
        .config("spark.executor.memoryOverhead","8g")\
        .config("spark.shuffle.service.enabled","true")\
        .config("spark.dynamicAllocation.enabled","true")\
        .config("spark.dynamicAllocation.initialExecutors","1")\
        .config("spark.dynamicAllocation.maxExecutors","32")\
        .config("spark.dynamicAllocation.minExecutors","1")\
        .config("spark.executor.cores","2")\
        .config("spark.driver.memory","12g")\
        .config("spark.driver.maxResultSize","8g")\
        .config("spark.network.timeout","8000")\
        .config("spark.hadoop.hive.metastore.client.socket.timeout","900")\
        .config("spark.sql.hive.convertMetastoreParquet","true")\
        .config("spark.sql.broadcastTimeout","36000")\
        .config("spark.ui.killEnabled","true")\
        .config('spark.yarn.queue','root.gapl_plataf_projetos_motores_decisao')\
        .getOrCreate()

In [6]:
spark

### Dataset
We will run the analyzers on a dataset sampled from the `th.thbpd381` table.

In [7]:
sql_query = """
select
    i1c_renda_final,
    i1c_lim_pre_ap_preventivo,
    i1c_rating_riscos,
    i1d_idade,
    i1d_sexo,
    i1c_cli_possui_conta,
    i1c_soc_cd_segm_empr1,
    i1c_soc_cd_ramo_atvd1,
    dat_ref_carga
from th.thbpd381 where dat_ref_carga='2022-01-03'
"""
df_input = spark.sql(sql_query)

In [8]:
df_input.printSchema()

root
 |-- i1c_renda_final: integer (nullable = true)
 |-- i1c_lim_pre_ap_preventivo: integer (nullable = true)
 |-- i1c_rating_riscos: integer (nullable = true)
 |-- i1d_idade: integer (nullable = true)
 |-- i1d_sexo: string (nullable = true)
 |-- i1c_cli_possui_conta: string (nullable = true)
 |-- i1c_soc_cd_segm_empr1: integer (nullable = true)
 |-- i1c_soc_cd_ramo_atvd1: integer (nullable = true)
 |-- dat_ref_carga: string (nullable = true)



### Set up & run()

In [9]:
from dataquality_bnr.dqRunning import main as dqRun

In [10]:
dqView1 = {"viewName" : "dqView_Check",
           "inputData": df_input,
           "infraYaml": "yamlFiles/satisfies_check/a_check/infrastructure.yaml",
           "vsYaml": "yamlFiles/satisfies_check/a_check/verificationSuite_idea.yaml"}

dq_bnr_directory = "/user/x266727/dataquality_bnr-docs/satisfies_check/DQ"

In [None]:
myDq = dqRun.Dq(spark, dq_bnr_directory)

myDq = (myDq
        .addView(dqView1, "Check"))
        
dq_run_return = myDq.run()

## Analyzing Results

In [15]:
import pandas as pd

In [16]:
dq_run_return.get_overall_booleanResult()

False
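
One way to act on this boolean is to stop a downstream pipeline when it is `False`. The sketch below is a minimal, hypothetical guard: `DataQualityError` and `require_quality` are illustrative names and are not part of `dataquality_bnr`.

```python
class DataQualityError(RuntimeError):
    """Raised when the overall data-quality result is not OK (hypothetical)."""


def require_quality(overall_ok: bool, context: str = "") -> None:
    """Raise DataQualityError when the overall boolean result is False."""
    if not overall_ok:
        raise DataQualityError(f"data-quality checks failed {context}".strip())


# In the notebook, the argument would come from
# dq_run_return.get_overall_booleanResult(); False is used here directly.
try:
    require_quality(False, context="for view dqView_Check")
except DataQualityError as err:
    print(f"blocked: {err}")
```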

#### DQ/overall/overall_OK.csv

In [17]:
csvResult_df = dq_run_return.get_overall_csvResult_df(spark)
csvResult_df.toPandas()

| | check | check_level | check_status | constraint | constraint_status | constraint_message | dataset_date | YY_MM_DD | viewName | viewPath |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CheckObject by yaml File | Error | Error | SizeConstraint(Size(None)) | Failure | Value: 541268 does not meet the constraint req... | 1644014715176 | 2022-02-04 | dqView_Check | /user/x266727/dataquality_bnr-docs/satisfies_c... |
| 1 | CheckObject by yaml File | Error | Error | ComplianceConstraint(Compliance(isRating9,i1c_... | Failure | Value: 0.03564408019687105 does not meet the c... | 1644014715176 | 2022-02-04 | dqView_Check | /user/x266727/dataquality_bnr-docs/satisfies_c... |
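
The `dataset_date` value looks like a Unix timestamp in milliseconds, which would be consistent with the `YY_MM_DD` column. That interpretation is an assumption rather than documented behaviour. A small sketch to check it, using a hypothetical helper `epoch_millis_to_date`:

```python
from datetime import datetime, timezone


def epoch_millis_to_date(epoch_ms: int) -> str:
    """Convert a millisecond Unix timestamp to a YYYY-MM-DD string (UTC)."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


# dataset_date taken from the result above.
run_date = epoch_millis_to_date(1644014715176)
print(run_date)
```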


In [18]:
pd.set_option('display.max_colwidth', None)

display(csvResult_df.select("constraint", "constraint_message").toPandas())

pd.set_option('display.max_colwidth', 50)

| | constraint | constraint_message |
|---|---|---|
| 0 | SizeConstraint(Size(None)) | Value: 541268 does not meet the constraint requirement! |
| 1 | ComplianceConstraint(Compliance(isRating9,i1c_rating_riscos = 9,None)) | Value: 0.03564408019687105 does not meet the constraint requirement! |


## Digging Deeper

#### DQ/view/currentResult/checkResult.parquet

In [19]:
dqView_name = dqView1["viewName"]
path = dq_bnr_directory +"/"+ dqView_name +'/currentResult/checkResult.parquet'
print(path)

df = spark.read.parquet(path)
df.drop("YY_MM_DD").drop("dataset_date").show()

/user/x266727/dataquality_bnr-docs/satisfies_check/DQ/dqView_Check/currentResult/checkResult.parquet
+--------------------+-----------+------------+--------------------+-----------------+--------------------+
|               check|check_level|check_status|          constraint|constraint_status|  constraint_message|
+--------------------+-----------+------------+--------------------+-----------------+--------------------+
|CheckObject by ya...|      Error|       Error|SizeConstraint(Si...|          Failure|Value: 541268 doe...|
|CheckObject by ya...|      Error|       Error|ComplianceConstra...|          Failure|Value: 0.03564408...|
|CheckObject by ya...|      Error|       Error|ApproxCountDistin...|          Success|                    |
+--------------------+-----------+------------+--------------------+-----------------+--------------------+



In [20]:
df.select("constraint","constraint_status").toPandas()

| | constraint | constraint_status |
|---|---|---|
| 0 | SizeConstraint(Size(None)) | Failure |
| 1 | ComplianceConstraint(Compliance(isRating9,i1c_... | Failure |
| 2 | ApproxCountDistinctConstraint(ApproxCountDisti... | Success |
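
To isolate only the constraints that did not succeed, the rows can be filtered on `constraint_status`. The sketch below does this in plain Python on rows copied from the output above (truncated values kept as shown); `failed_constraints` is a hypothetical helper. On the Spark DataFrame itself, a filter such as `df.filter(df.constraint_status != "Success")` should give the same selection.

```python
from typing import Dict, List

# Rows copied from the checkResult output above.
check_rows: List[Dict[str, str]] = [
    {"constraint": "SizeConstraint(Size(None))",
     "constraint_status": "Failure"},
    {"constraint": "ComplianceConstraint(Compliance(isRating9,i1c_...",
     "constraint_status": "Failure"},
    {"constraint": "ApproxCountDistinctConstraint(ApproxCountDisti...",
     "constraint_status": "Success"},
]


def failed_constraints(rows: List[Dict[str, str]]) -> List[str]:
    """Return the constraints whose status is anything other than Success."""
    return [r["constraint"] for r in rows if r["constraint_status"] != "Success"]


for name in failed_constraints(check_rows):
    print(name)
```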


#### DQ/view/currentResult/successMetrics.parquet

In [21]:
dqView_name = dqView1["viewName"]
path = dq_bnr_directory +"/"+ dqView_name +'/currentResult/successMetrics.parquet'
print(path)

df = spark.read.parquet(path)
df.show()

/user/x266727/dataquality_bnr-docs/satisfies_check/DQ/dqView_Check/currentResult/successMetrics.parquet
+-------+---------+-------------------+-------------------+-------------+----------+----+------------+
| entity| instance|               name|              value| dataset_date|  YY_MM_DD|tags|check_status|
+-------+---------+-------------------+-------------------+-------------+----------+----+------------+
| Column| i1d_sexo|ApproxCountDistinct|                3.0|1644014715176|2022-02-04|  []|       Error|
|Dataset|        *|               Size|           541268.0|1644014715176|2022-02-04|  []|       Error|
| Column|isRating9|         Compliance|0.03564408019687105|1644014715176|2022-02-04|  []|       Error|
+-------+---------+-------------------+-------------------+-------------+----------+----+------------+
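
Both result files are read from the same `<DQ directory>/<viewName>/currentResult/` layout, so the path building can be factored into one small helper. This is a sketch only: `current_result_path` is a hypothetical name, and the layout is inferred from the two paths printed in this notebook.

```python
import posixpath


def current_result_path(dq_directory: str, view_name: str, artifact: str) -> str:
    """Build <dq_directory>/<view_name>/currentResult/<artifact>."""
    return posixpath.join(dq_directory, view_name, "currentResult", artifact)


dq_directory = "/user/x266727/dataquality_bnr-docs/satisfies_check/DQ"

check_path = current_result_path(dq_directory, "dqView_Check", "checkResult.parquet")
metrics_path = current_result_path(dq_directory, "dqView_Check", "successMetrics.parquet")
print(check_path)
print(metrics_path)
```

`posixpath` is used instead of `os.path` so the separator stays `/` regardless of the operating system running the notebook.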



In [22]:
df.select("instance","value").toPandas()

| | instance | value |
|---|---|---|
| 0 | i1d_sexo | 3.0 |
| 1 | * | 541268.0 |
| 2 | isRating9 | 0.035644 |
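
Since `Compliance` is a fraction of the dataset and `Size` is the row count, the two metrics can be combined to estimate how many rows satisfied `i1c_rating_riscos = 9`. The sketch below uses the full-precision values from `successMetrics.parquet`; the variable names are illustrative.

```python
# Values taken from the successMetrics output above.
dataset_size = 541268.0
compliance_ratio = 0.03564408019687105
required_ratio = 0.2  # threshold from the assertion "lambda x: x>0.2"

# Approximate number of rows where i1c_rating_riscos = 9.
matching_rows = round(dataset_size * compliance_ratio)

print(f"rows with rating 9: {matching_rows}")
print(f"meets assertion: {compliance_ratio > required_ratio}")
```

The observed ratio is well below the 0.2 required by the assertion, which is why the `isRating9` constraint is reported as `Failure`.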


#### pydeequ shutdown_callback_server()
#### spark.stop()
__Important!__
After the jobs finish, make sure the __spark__ session and the __callback_server__ are both shut down, so that no "ghost" process is left hanging.<br>
Read more about __PyDeequ__ and the __callback_server__ at: https://github.com/awslabs/python-deequ

In [None]:
spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
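
To make sure this cleanup also runs when a job raises an exception, the two calls above can be wrapped in a context manager. This is a minimal sketch: `managed_spark` is a hypothetical helper, not part of `dataquality_bnr` or PyDeequ, and it only relies on the two calls shown in the cell above.

```python
from contextlib import contextmanager


@contextmanager
def managed_spark(spark):
    """Yield the session and always shut it down afterwards."""
    try:
        yield spark
    finally:
        try:
            # Stop the py4j callback server used by PyDeequ.
            spark.sparkContext._gateway.shutdown_callback_server()
        finally:
            # Stop the session even if the callback shutdown fails.
            spark.stop()


# Usage in the notebook (not executed here):
# with managed_spark(spark) as session:
#     dq_run_return = dqRun.Dq(session, dq_bnr_directory).addView(dqView1, "Check").run()
```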