# DQ Research module

* Analyzer
* __Profile__
* Suggestions


### Profile

Run the PyDeequ profiler over a Spark DataFrame.

#### Set up the environment

##### Import libraries

In [1]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SparkSession, Row, DataFrame

In [2]:
from dataquality_bnr.dqSupport import main as dqSup

##### Set up a PySpark session

In [3]:
spark = SparkSession.builder\
        .appName("pdq_helloWorld")\
        .enableHiveSupport()\
        .config("spark.sql.catalogImplementation","hive")\
        .config("spark.jars", dqSup.getDeequJar_path())\
        .config("spark.jars.excludes", dqSup.getDeequJar_excludes())\
        .config("spark.executor.memory","12g")\
        .config("spark.executor.memoryOverhead","8g")\
        .config("spark.shuffle.service.enabled","true")\
        .config("spark.dynamicAllocation.enabled","true")\
        .config("spark.dynamicAllocation.initialExecutors","1")\
        .config("spark.dynamicAllocation.maxExecutors","32")\
        .config("spark.dynamicAllocation.minExecutors","1")\
        .config("spark.executor.cores","2")\
        .config("spark.driver.memory","12g")\
        .config("spark.driver.maxResultSize","8g")\
        .config("spark.network.timeout","8000")\
        .config("spark.hadoop.hive.metastore.client.socket.timeout","900")\
        .config("spark.sql.hive.convertMetastoreParquet","true")\
        .config("spark.sql.broadcastTimeout","36000")\
        .config("spark.ui.killEnabled","true")\
        .config('spark.yarn.queue','root.gapl_plataf_projetos_motores_decisao')\
        .getOrCreate()
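
The long chain of `.config(...)` calls above can be hard to review. A minimal sketch, assuming the same settings, keeps them in a dictionary and applies them in a loop. `SPARK_CONFIGS`, `apply_configs` and `RecordingBuilder` are hypothetical names introduced for illustration; they are not part of `dataquality_bnr` or PySpark. The only PySpark behavior relied on is that `SparkSession.builder.config(key, value)` returns the builder, and the stand-in class exists only so the sketch runs without a cluster.

```python
# Hypothetical helper: keep Spark settings in one dict and apply them in a loop.
# Only a subset of the settings from the cell above is repeated here.
SPARK_CONFIGS = {
    "spark.executor.memory": "12g",
    "spark.executor.cores": "2",
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.maxExecutors": "32",
    "spark.driver.memory": "12g",
}


def apply_configs(builder, configs):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in configs.items():
        builder = builder.config(key, value)
    return builder


class RecordingBuilder:
    """Stand-in for SparkSession.builder so the sketch runs without Spark."""

    def __init__(self):
        self.applied = {}

    def config(self, key, value):
        # Record the setting and return self, mirroring the chaining style.
        self.applied[key] = value
        return self


demo_builder = apply_configs(RecordingBuilder(), SPARK_CONFIGS)
print(demo_builder.applied["spark.executor.memory"])
```

With a real session the same helper could be used as `apply_configs(SparkSession.builder.appName("pdq_helloWorld"), SPARK_CONFIGS).getOrCreate()`.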

### Dataset
We will run the profile on a dataset sampled from the `th.thbpd381` table.

In [5]:
sql_query = """
select
    i1c_renda_final,
    i1c_lim_pre_ap_preventivo,
    i1c_rating_riscos,
    i1d_idade,
    i1d_sexo,
    i1c_cli_possui_conta,
    i1c_soc_cd_segm_empr1,
    i1c_soc_cd_ramo_atvd1,
    dat_ref_carga
from th.thbpd381 where dat_ref_carga='2022-01-03'
"""
df_input = spark.sql(sql_query)
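
The query above hard-codes the `dat_ref_carga` snapshot date. As a sketch of one way to make the date a parameter, the hypothetical helper `build_snapshot_query` below (not part of `dataquality_bnr`) validates the date before it is placed in the SQL text. It assumes the date is passed as an ISO string such as `2022-01-03`; the table and column names are not validated and are taken as trusted input.

```python
from datetime import date


def build_snapshot_query(table, columns, ref_date):
    """Build a SELECT over one dat_ref_carga snapshot of the given table."""
    # date.fromisoformat raises ValueError on malformed input, so only a
    # valid date ever reaches the SQL text.
    parsed = date.fromisoformat(ref_date)
    column_list = ",\n    ".join(columns)
    return (
        f"select\n    {column_list}\n"
        f"from {table} where dat_ref_carga='{parsed.isoformat()}'"
    )


demo_query = build_snapshot_query(
    "th.thbpd381", ["i1d_idade", "i1d_sexo", "dat_ref_carga"], "2022-01-03"
)
print(demo_query)
```

The resulting string can be passed to `spark.sql(...)` in the same way as `sql_query`.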

In [6]:
df_input.printSchema()

root
 |-- i1c_renda_final: integer (nullable = true)
 |-- i1c_lim_pre_ap_preventivo: integer (nullable = true)
 |-- i1c_rating_riscos: integer (nullable = true)
 |-- i1d_idade: integer (nullable = true)
 |-- i1d_sexo: string (nullable = true)
 |-- i1c_cli_possui_conta: string (nullable = true)
 |-- i1c_soc_cd_segm_empr1: integer (nullable = true)
 |-- i1c_soc_cd_ramo_atvd1: integer (nullable = true)
 |-- dat_ref_carga: string (nullable = true)



## Profile

*dqResearch.runProfile(spark, df, metrics_file)*
* *spark*: SparkSession
* *df*: DataFrame to be profiled
* *metrics_file*: temporary path, where reads/writes are allowed

In [7]:
from dataquality_bnr.dqResearch import main as dqResearch

In [9]:
metrics_file = "/tmp/x266727/sampleRepository/metricsFile/tempFile.json"

Note: The path that *metrics_file* points to supports the profiling process. The user must therefore have read and write permissions on that directory, even though no file is persisted there after execution.

More information about what *metrics_file* is used for is available in the official PyDeequ repository:
https://github.com/awslabs/python-deequ/blob/master/tutorials/repository.ipynb
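
Since the profiling step needs read and write access to the directory behind *metrics_file*, it can help to check that access before starting a long Spark job. The following is a minimal sketch, assuming the path lives on the local filesystem of the driver; if it resolves to HDFS or another storage layer, this check does not apply. `ensure_writable_dir` is a hypothetical helper, not part of `dataquality_bnr`, and the demonstration uses a throwaway directory rather than the real path.

```python
import os
import tempfile


def ensure_writable_dir(file_path):
    """Create the parent directory of file_path and verify a file can be written there."""
    parent = os.path.dirname(os.path.abspath(file_path))
    os.makedirs(parent, exist_ok=True)
    # Creating and removing a temporary file tests real write access;
    # an error is raised here if the directory is not writable.
    with tempfile.NamedTemporaryFile(dir=parent):
        pass
    return parent


# Demonstration against a throwaway directory instead of the real /tmp path.
with tempfile.TemporaryDirectory() as base:
    demo_file = os.path.join(base, "metricsFile", "tempFile.json")
    checked_dir = ensure_writable_dir(demo_file)
    dir_exists = os.path.isdir(checked_dir)
    file_exists = os.path.exists(demo_file)
```

The check leaves no file behind, which matches the note that nothing is persisted in that directory.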

In [11]:
profile_df = dqResearch.runProfile(spark, df_input, metrics_file)

##### Show results

Collect the DataFrame into driver memory, then display it as a *pandas.DataFrame*.

In [12]:
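sc = spark.sparkContext
# Materialize the profile rows on the driver
inMemory_df = profile_df.collect()
# Rebuild a Spark DataFrame from the collected rows, then convert it to pandas
inMemory_df = sc.parallelize(inMemory_df).toDF()
inMemory_df = inMemory_df.toPandas()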

In [13]:
inMemory_df

| | column | metric | value | research_date |
|---|---|---|---|---|
| 0 | i1c_lim_pre_ap_preventivo | Completeness | 1.000000 | 2022-02-02 |
| 1 | i1c_lim_pre_ap_preventivo | ApproxCountDistinct | 2.000000 | 2022-02-02 |
| 2 | i1c_lim_pre_ap_preventivo | Minimum | 0.000000 | 2022-02-02 |
| 3 | i1c_lim_pre_ap_preventivo | Maximum | 2.000000 | 2022-02-02 |
| 4 | i1c_lim_pre_ap_preventivo | Mean | 0.016509 | 2022-02-02 |
| ... | ... | ... | ... | ... |
| 208 | i1d_sexo | Histogram.ratio.M | 0.609611 | 2022-02-02 |
| 209 | i1d_sexo | Histogram.abs. | 856.000000 | 2022-02-02 |
| 210 | i1d_sexo | Histogram.ratio. | 0.001581 | 2022-02-02 |
| 211 | i1d_sexo | Histogram.abs.F | 210449.000000 | 2022-02-02 |
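
The result is in long format, with one row per (column, metric) pair. As a sketch of how it could be reshaped for inspection, the code below groups the metrics by column and splits histogram metric names such as `Histogram.abs.F` into their kind and category. The rows are copied from the output above; `group_by_column` and `split_histogram_metric` are hypothetical helpers introduced here, and the naming pattern `Histogram.<kind>.<category>` is inferred from the displayed rows rather than from PyDeequ documentation.

```python
from collections import defaultdict

# Rows copied from the output above: (column, metric, value).
rows = [
    ("i1c_lim_pre_ap_preventivo", "Completeness", 1.0),
    ("i1c_lim_pre_ap_preventivo", "ApproxCountDistinct", 2.0),
    ("i1c_lim_pre_ap_preventivo", "Minimum", 0.0),
    ("i1c_lim_pre_ap_preventivo", "Maximum", 2.0),
    ("i1c_lim_pre_ap_preventivo", "Mean", 0.016509),
    ("i1d_sexo", "Histogram.ratio.M", 0.609611),
    ("i1d_sexo", "Histogram.abs.", 856.0),
    ("i1d_sexo", "Histogram.ratio.", 0.001581),
    ("i1d_sexo", "Histogram.abs.F", 210449.0),
]


def group_by_column(rows):
    """Return {column: {metric: value}} from long-format rows."""
    grouped = defaultdict(dict)
    for column, metric, value in rows:
        grouped[column][metric] = value
    return dict(grouped)


def split_histogram_metric(metric):
    """Split 'Histogram.abs.F' into ('abs', 'F'); return None for other metrics."""
    parts = metric.split(".", 2)
    if len(parts) == 3 and parts[0] == "Histogram":
        return parts[1], parts[2]
    return None


profile = group_by_column(rows)
print(profile["i1c_lim_pre_ap_preventivo"]["Mean"])
print(split_histogram_metric("Histogram.abs."))
```

The empty category in `Histogram.abs.` appears to correspond to rows where `i1d_sexo` is an empty string, although the output alone does not confirm this. With the pandas frame itself, `inMemory_df.pivot(index="column", columns="metric", values="value")` should give a comparable wide view, assuming the column names shown above and no duplicate (column, metric) pairs.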
