# DQ Research module

* __Analyzer__
* Profile
* Suggestions


### Analyzers

Run PyDeequ analyzers for the metrics listed in the analyzers.yaml file.

#### Setup environment

##### Import libraries

In [1]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SparkSession, Row, DataFrame

In [2]:
from dataquality_bnr.dqSupport import main as dqSup

##### Set up a PySpark session

In [3]:
spark = SparkSession.builder\
        .appName("pdq_helloWorld")\
        .enableHiveSupport()\
        .config("spark.sql.catalogImplementation","hive")\
        .config("spark.jars", dqSup.getDeequJar_path())\
        .config("spark.jars.excludes", dqSup.getDeequJar_excludes())\
        .config("spark.executor.memory","12g")\
        .config("spark.executor.memoryOverhead","8g")\
        .config("spark.shuffle.service.enabled","true")\
        .config("spark.dynamicAllocation.enabled","true")\
        .config("spark.dynamicAllocation.initialExecutors","1")\
        .config("spark.dynamicAllocation.maxExecutors","32")\
        .config("spark.dynamicAllocation.minExecutors","1")\
        .config("spark.executor.cores","2")\
        .config("spark.driver.memory","12g")\
        .config("spark.driver.maxResultSize","8g")\
        .config("spark.network.timeout","8000")\
        .config("spark.hadoop.hive.metastore.client.socket.timeout","900")\
        .config("spark.sql.hive.convertMetastoreParquet","true")\
        .config("spark.sql.broadcastTimeout","36000")\
        .config("spark.ui.killEnabled","true")\
        .config('spark.yarn.queue','root.gapl_plataf_projetos_motores_decisao')\
        .getOrCreate()

### Dataset
We will run the analyzers on a dataset sampled from the th.thbpd381 table.

In [4]:
sql_query = """
select
    i1c_renda_final,
    i1c_lim_pre_ap_preventivo,
    i1c_rating_riscos,
    i1d_idade,
    i1d_sexo,
    i1c_cli_possui_conta,
    i1c_soc_cd_segm_empr1,
    i1c_soc_cd_ramo_atvd1,
    dat_ref_carga
from th.thbpd381 where dat_ref_carga='2022-01-03'
"""
df_input = spark.sql(sql_query)

In [5]:
df_input.printSchema()

root
 |-- i1c_renda_final: integer (nullable = true)
 |-- i1c_lim_pre_ap_preventivo: integer (nullable = true)
 |-- i1c_rating_riscos: integer (nullable = true)
 |-- i1d_idade: integer (nullable = true)
 |-- i1d_sexo: string (nullable = true)
 |-- i1c_cli_possui_conta: string (nullable = true)
 |-- i1c_soc_cd_segm_empr1: integer (nullable = true)
 |-- i1c_soc_cd_ramo_atvd1: integer (nullable = true)
 |-- dat_ref_carga: string (nullable = true)



## Analyzers

*dqResearch.runAnalyzer(spark, df, yaml_path)*

* *spark*: SparkSession 
* *df*: DataFrame on which the metrics are calculated
* *yaml_path*: path to the .yaml file containing the listed metrics to be calculated
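
For orientation, the sketch below shows how a list of analyzers can be run with PyDeequ's public `AnalysisRunner` API. It is not the implementation of `dqResearch.runAnalyzer`: the function name `run_analyzers_sketch` and its `analyzers` parameter are hypothetical, and the real function also reads the analyzers from `yaml_path` and may shape its output differently.

```python
def run_analyzers_sketch(spark, df, analyzers):
    """Run already-constructed PyDeequ analyzers on df.

    Hypothetical helper for illustration only; returns a Spark DataFrame
    with one row per computed metric.
    """
    # Imported lazily so this block can be loaded without a Spark/Deequ setup.
    from pydeequ.analyzers import AnalysisRunner, AnalyzerContext

    # Build the analysis run step by step: one addAnalyzer call per metric.
    runner = AnalysisRunner(spark).onData(df)
    for analyzer in analyzers:
        runner = runner.addAnalyzer(analyzer)
    result = runner.run()

    # Convert the successfully computed metrics into a Spark DataFrame.
    return AnalyzerContext.successMetricsAsDataFrame(spark, result)
```

With a running session, a call could look like `run_analyzers_sketch(spark, df_input, [Size(), Mean("i1c_renda_final")])`, where `Size` and `Mean` come from `pydeequ.analyzers`.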

In [6]:
from dataquality_bnr.dqResearch import main as dqResearch

In [7]:
yaml_path="yamlFiles/analyzers/def_analyzers.yaml"

The .yaml file should follow this pattern:

```yaml
- Size()
- Maximum('i1c_renda_final')
- Minimum('i1c_renda_final')
- Mean('i1c_renda_final')
- Mean('i1c_lim_pre_ap_preventivo')
- Maximum('i1d_idade')
- StandardDeviation('i1d_idade')
- Histogram('i1c_rating_riscos')
```
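
Each entry in the file is a plain string that looks like a Python call. As a minimal sketch of how such entries could be split into an analyzer name and its arguments without evaluating arbitrary code, the example below uses only the standard library. The helper names `parse_analyzer_entry` and `read_analyzer_entries` are hypothetical, the line handling is deliberately simplistic (a YAML loader would normally do this), and nothing here is taken from how `dqResearch` actually reads the file.

```python
import ast


def parse_analyzer_entry(entry):
    """Split "Name('col', ...)" into ("Name", [args]) without executing it."""
    node = ast.parse(entry.strip(), mode="eval").body
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"not an analyzer call: {entry!r}")
    if node.keywords:
        raise ValueError(f"keyword arguments are not handled in this sketch: {entry!r}")
    # literal_eval accepts only literals (strings, numbers, ...), never code.
    args = [ast.literal_eval(arg) for arg in node.args]
    return node.func.id, args


def read_analyzer_entries(yaml_text):
    """Collect the "- ..." list items of a flat YAML list as parsed entries."""
    entries = []
    for line in yaml_text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            entries.append(parse_analyzer_entry(line[2:]))
    return entries


example = """\
- Size()
- Maximum('i1c_renda_final')
- Histogram('i1c_rating_riscos')
"""
parsed = read_analyzer_entries(example)
for name, args in parsed:
    print(name, args)
```

Parsing with `ast` rather than `eval` keeps a malformed or malicious entry from running code; the resulting names could then be looked up among the analyzers that PyDeequ provides.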

All available analyzer methods are listed in the official __PyDeequ repository__:
https://github.com/awslabs/python-deequ/blob/master/docs/analyzers.md

In [20]:
analysisResult_df = dqResearch.runAnalyzer(spark, df_input, yaml_path)

Size()
Maximum('i1c_renda_final')
Minimum('i1c_renda_final')
Mean('i1c_renda_final')
Mean('i1c_lim_pre_ap_preventivo')
Maximum('i1d_idade')
StandardDeviation('i1d_idade')
Histogram('i1c_rating_riscos')


##### Show results

Collect the DataFrame into memory, then display it as a *pandas.DataFrame*.

In [21]:
analysisResult_df.toPandas()

|   | column | metric | value | research_date |
|---|---|---|---|---|
| 0 | i1c_lim_pre_ap_preventivo | Mean | 0.01650938 | 2022-02-02 |
| 1 | i1d_idade | Maximum | 124.0 | 2022-02-02 |
| 2 | i1d_idade | StandardDeviation | 14.86044 | 2022-02-02 |
| 3 | * | Size | 541268.0 | 2022-02-02 |
| 4 | i1c_rating_riscos | Histogram.bins | 9.0 | 2022-02-02 |
| 5 | i1c_rating_riscos | Histogram.abs.8 | 15510.0 | 2022-02-02 |
| 6 | i1c_rating_riscos | Histogram.ratio.8 | 0.02865494 | 2022-02-02 |
| 7 | i1c_rating_riscos | Histogram.abs.4 | 7235.0 | 2022-02-02 |
| 8 | i1c_rating_riscos | Histogram.ratio.4 | 0.01336676 | 2022-02-02 |
| 9 | i1c_rating_riscos | Histogram.abs.9 | 19293.0 | 2022-02-02 |
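
As a quick sanity check on the histogram rows, the values shown above are consistent with each `Histogram.ratio.<bin>` being the matching `Histogram.abs.<bin>` count divided by `Size`. The sketch below recomputes the ratios in plain Python from a few rows copied by hand from the output; the variable names are illustrative only.

```python
# (column, metric, value) rows copied from the result shown above.
rows = [
    ("*", "Size", 541268.0),
    ("i1c_rating_riscos", "Histogram.abs.8", 15510.0),
    ("i1c_rating_riscos", "Histogram.ratio.8", 0.02865494),
    ("i1c_rating_riscos", "Histogram.abs.4", 7235.0),
    ("i1c_rating_riscos", "Histogram.ratio.4", 0.01336676),
]

# Total number of rows in the analyzed DataFrame.
size = next(value for _, metric, value in rows if metric == "Size")

# Map each histogram bin label to its absolute count and reported ratio.
abs_counts = {
    metric.rsplit(".", 1)[1]: value
    for _, metric, value in rows
    if metric.startswith("Histogram.abs.")
}
ratios = {
    metric.rsplit(".", 1)[1]: value
    for _, metric, value in rows
    if metric.startswith("Histogram.ratio.")
}

# Recompute each ratio as count / size and compare with the reported value.
recomputed = {bin_label: count / size for bin_label, count in abs_counts.items()}
for bin_label in sorted(ratios):
    print(bin_label, round(recomputed[bin_label], 8), ratios[bin_label])
```

The same comparison can be done directly on the pandas DataFrame returned by `toPandas()`; plain tuples are used here only to keep the example free of dependencies.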
