# Data Quality 

After completing the Data Ingestion phase through scraping, the second step is to assess the quality of the collected data.  
The dataframe under study has 18 columns. Each column has its own characteristics, and each is expected to contain values of a specific kind.
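
To make the idea of a per-column expectation concrete, the following is a minimal plain-Python sketch. It does not use the Great Expectations API. The column names `job_title` and `rating`, the sample rows, and the helpers `expect_not_null` and `expect_between` are all hypothetical and serve only as illustration; the actual expectations live in the SuiteSetUp_main notebook.

```python
from typing import Any, Dict, List

# Hypothetical sample rows (assumption): the real dataset has 18 columns.
rows: List[Dict[str, Any]] = [
    {"job_title": "Data Engineer", "rating": 4.1},
    {"job_title": "Data Analyst", "rating": None},
    {"job_title": None, "rating": 7.5},
]


def expect_not_null(rows: List[Dict[str, Any]], column: str) -> Dict[str, Any]:
    """Count the missing values in `column` and report success if there are none."""
    unexpected_count = sum(1 for row in rows if row[column] is None)
    return {
        "column": column,
        "success": unexpected_count == 0,
        "unexpected_count": unexpected_count,
    }


def expect_between(
    rows: List[Dict[str, Any]], column: str, low: float, high: float
) -> Dict[str, Any]:
    """Collect the non-null values in `column` that fall outside [low, high]."""
    unexpected = [
        row[column]
        for row in rows
        if row[column] is not None and not (low <= row[column] <= high)
    ]
    return {
        "column": column,
        "success": not unexpected,
        "unexpected_values": unexpected,
    }


print(expect_not_null(rows, "job_title"))
print(expect_between(rows, "rating", 0, 5))
```

Each check returns a small result dictionary instead of raising, so that many checks can run over the same data and be reported together.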

Once the following are set up:
1. the datasource
2. the suite and its expectations (see the SuiteSetUp_main notebook)
3. the checkpoint and its validation

the sample is validated.

In [13]:
import great_expectations as gx
from ruamel import yaml
from great_expectations.data_context.types.resource_identifiers import ExpectationSuiteIdentifier
import os
import sys
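# Make the project helpers in ../funzioni importable
sys.path.insert(0, '../funzioni')
from funzioni import *  # project helper functions (see ../funzioni/funzioni.py)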

In [2]:
context = gx.get_context()

### Datasource configuration

In [3]:
datasource_config: dict = {
    "name": "glassdoor_scraping",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "class_name": "PandasExecutionEngine",
        "module_name": "great_expectations.execution_engine",
    },
    "data_connectors": {
        "all": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": "../data",
            "default_regex": {
                "pattern": "(.*)\\.csv",
                "group_names": ["data_asset_name"],
            },
            #"batch_spec_passthrough": {
            #    "reader_method": "read_csv",
            #    "reader_options": {
            #        "header": True,
            #        "inferSchema": True,
            #    },
            #},
        }
    },
}

In [4]:
# Check the configuration
context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	all : InferredAssetFilesystemDataConnector

	Available data_asset_names (3 of 30):
		company_overview (1 of 1): ['company_overview.csv']
		scraping_all_20230131 (1 of 1): ['scraping_all_20230131.csv']
		scraping_all_20230201 (1 of 1): ['scraping_all_20230201.csv']

	Unmatched data_references (1 of 1):['jobs']



<great_expectations.datasource.new_datasource.Datasource at 0x7f451310bfd0>
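
The output above lists `.csv` files as data assets and `jobs` as an unmatched reference. The sketch below shows how the `default_regex` pattern from the configuration can produce that split. It approximates the connector's matching with `re.match` from the standard library, which is an assumption about its behavior, and the helper `match_data_reference` is hypothetical.

```python
import re
from typing import Dict, List, Optional

# Same pattern and group names as "default_regex" in the datasource configuration.
PATTERN = re.compile(r"(.*)\.csv")
GROUP_NAMES = ["data_asset_name"]


def match_data_reference(path: str) -> Optional[Dict[str, str]]:
    """Return {group_name: captured_value} if the path matches, otherwise None."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


# A few of the references shown in the output above.
references = ["company_overview.csv", "scraping_all_20230131.csv", "jobs"]

matched: Dict[str, str] = {}
unmatched: List[str] = []
for reference in references:
    groups = match_data_reference(reference)
    if groups is None:
        unmatched.append(reference)
    else:
        matched[reference] = groups["data_asset_name"]

print(matched)
print(unmatched)
```

Under this assumption, the file name without its `.csv` suffix becomes the data asset name, and `jobs` stays unmatched because it has no `.csv` suffix.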

In [None]:
# Add the datasource only if it does not already exist in the Data Context
try:
    context.get_datasource(datasource_config["name"])
except ValueError:
    context.add_datasource(**datasource_config)
else:
    print(
        f"The datasource {datasource_config['name']} already exists in your Data Context!"
    )

### Specifying the suite

See the SuiteSetUp_main notebook to modify the suite.

In [12]:
# Take the first suite registered in the Data Context, then build and open its Data Docs page
suite_names = context.list_expectation_suite_names()
suite_identifier = ExpectationSuiteIdentifier(expectation_suite_name=suite_names[0])
context.build_data_docs(resource_identifiers=[suite_identifier])
context.open_data_docs(resource_identifier=suite_identifier)
print('http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/expectations/' + suite_names[0] + '.html')

http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/expectations/Main.html


### Checkpoint creation and validation

A function was written that takes a data asset (a subgroup of the datasource) as input and creates a checkpoint for each data asset.
Since each data asset corresponds to a day on which the data-ingestion process was run, one checkpoint is created for each of those days.

For the functions, see `../funzioni/funzioni.py`.
The results of each checkpoint validation are available in HTML format.
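
The mapping from a data asset to its checkpoint can be sketched as follows. This is not the code in `../funzioni/funzioni.py`, which is not shown here: the helpers `checkpoint_name_for` and `checkpoint_config_for` are hypothetical, and the naming rule is inferred from the `scraping_all_YYYYMMDD` assets and `checkpoint_YYYYMMDD` names that appear in this notebook's output. The datasource name, data connector name, and suite name are taken from the cells above.

```python
import re
from typing import Any, Dict, Optional


def checkpoint_name_for(data_asset_name: str) -> Optional[str]:
    """Map 'scraping_all_YYYYMMDD' to 'checkpoint_YYYYMMDD'; return None otherwise."""
    match = re.fullmatch(r"scraping_all_(\d{8})", data_asset_name)
    return f"checkpoint_{match.group(1)}" if match else None


def checkpoint_config_for(data_asset_name: str, suite_name: str = "Main") -> Dict[str, Any]:
    """Build a plain-dict checkpoint configuration for one daily data asset."""
    name = checkpoint_name_for(data_asset_name)
    if name is None:
        raise ValueError(f"Not a daily scraping asset: {data_asset_name!r}")
    return {
        "name": name,
        "config_version": 1.0,
        "class_name": "SimpleCheckpoint",
        "validations": [
            {
                "batch_request": {
                    "datasource_name": "glassdoor_scraping",
                    "data_connector_name": "all",
                    "data_asset_name": data_asset_name,
                },
                "expectation_suite_name": suite_name,
            }
        ],
    }


config = checkpoint_config_for("scraping_all_20230131")
print(config["name"])
```

A dictionary of this shape could then be registered with `context.add_checkpoint(**config)` and executed with `context.run_checkpoint(checkpoint_name=config["name"])`, assuming the block-config style API used elsewhere in this notebook.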

In [14]:
run_dataq()

Questa la documentazione sulla suite: Main 
 http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/expectations/Main.html


Consulta i risultati del data_asset_name "scraping_all_20230131" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121253-scraping_all_20230131/20230301T121253.330414Z/a8c2603936f08f68234cfeb52d848e66.html


Consulta i risultati del data_asset_name "scraping_all_20230201" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121257-scraping_all_20230201/20230301T121257.269391Z/6846293c9834bd8f9f0498067de4cc79.html


Consulta i risultati del data_asset_name "scraping_all_20230202" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121301-scraping_all_20230202/20230301T121301.044052Z/5e8fcbd3a476b02c69e7086e74c19643.html


Consulta i risultati del data_asset_name "scraping_all_20230203" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121304-scraping_all_20230203/20230301T121304.959164Z/b99595ff62aa1b77109a89d80ca06516.html


Consulta i risultati del data_asset_name "scraping_all_20230204" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121308-scraping_all_20230204/20230301T121308.984578Z/60a12e85b58c629662f9400945fff6c4.html


Consulta i risultati del data_asset_name "scraping_all_20230205" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121313-scraping_all_20230205/20230301T121313.116252Z/359a3b77c11be2dce60e82e0630966f5.html


Consulta i risultati del data_asset_name "scraping_all_20230206" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121317-scraping_all_20230206/20230301T121317.257335Z/a73990dd45cb2af530da3ccffa2149bf.html


Consulta i risultati del data_asset_name "scraping_all_20230207" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121321-scraping_all_20230207/20230301T121321.445978Z/57debcbab72ee490bf5b0cc7c6214d73.html


Consulta i risultati del data_asset_name "scraping_all_20230208" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121325-scraping_all_20230208/20230301T121325.817822Z/8b5a58ecdfc8f3f2ce58991ab4e96bd6.html


Consulta i risultati del data_asset_name "scraping_all_20230209" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121330-scraping_all_20230209/20230301T121330.086541Z/c062ff09b524767dc467d1c23598650b.html


Consulta i risultati del data_asset_name "scraping_all_20230210" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121334-scraping_all_20230210/20230301T121334.362230Z/8f2773112761e97a56da86f97ca2bfad.html


Consulta i risultati del data_asset_name "scraping_all_20230211" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121338-scraping_all_20230211/20230301T121338.858758Z/fed9054f53dd4148b28cdff1a265776d.html


Consulta i risultati del data_asset_name "scraping_all_20230212" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121343-scraping_all_20230212/20230301T121343.772258Z/4e90ae4b9334a6ab6e7f72dc32d27f65.html


Consulta i risultati del data_asset_name "scraping_all_20230213" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121348-scraping_all_20230213/20230301T121348.751237Z/a835edef57833302c353e9e1d06fbd82.html


Consulta i risultati del data_asset_name "scraping_all_20230214" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121353-scraping_all_20230214/20230301T121353.623452Z/b04c43667dd9ad7593259992ace3ed41.html


Consulta i risultati del data_asset_name "scraping_all_20230215" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121358-scraping_all_20230215/20230301T121358.404775Z/1c201e09ae18cc7a05cf80b5766dd557.html


Consulta i risultati del data_asset_name "scraping_all_20230216" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121403-scraping_all_20230216/20230301T121403.502179Z/c4e91d3d0445ea05aa5a84cca138ff6f.html


Consulta i risultati del data_asset_name "scraping_all_20230217" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121408-scraping_all_20230217/20230301T121408.469436Z/3efa08eb01bba7de1266fd8bb26c5cf3.html


Consulta i risultati del data_asset_name "scraping_all_20230218" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121413-scraping_all_20230218/20230301T121413.492118Z/0793b2276f717bcfe6adfb106b5137ef.html


Consulta i risultati del data_asset_name "scraping_all_20230219" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121418-scraping_all_20230219/20230301T121418.910503Z/e9bd252c14835612b1e4ae4a3ac712c0.html


Consulta i risultati del data_asset_name "scraping_all_20230220" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121424-scraping_all_20230220/20230301T121424.173623Z/565e0e70a3781a19ef552fba61f84a26.html


Consulta i risultati del data_asset_name "scraping_all_20230221" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121429-scraping_all_20230221/20230301T121429.397971Z/e011940fbbbe0aa7eefeed483ce088c6.html


Consulta i risultati del data_asset_name "scraping_all_20230222" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121434-scraping_all_20230222/20230301T121434.575365Z/3664fb024a69ca338ea65b49c7d539d3.html


Consulta i risultati del data_asset_name "scraping_all_20230224" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121440-scraping_all_20230224/20230301T121440.168506Z/4d7cdc08e0ee331cce276f99b1b7442b.html


Consulta i risultati del data_asset_name "scraping_all_20230225" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121445-scraping_all_20230225/20230301T121445.863330Z/44443cc4de44ba37642b43bd85f3a657.html


Consulta i risultati del data_asset_name "scraping_all_20230226" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121451-scraping_all_20230226/20230301T121451.233040Z/3734f5237bcc9347478ba28c3ae377b2.html


Consulta i risultati del data_asset_name "scraping_all_20230227" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121456-scraping_all_20230227/20230301T121456.760728Z/6a90907f0867eb7df7eb6ab73d221a00.html


Consulta i risultati del data_asset_name "scraping_all_20230228" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121502-scraping_all_20230228/20230301T121502.587903Z/fbf67c19836929edb6be46e22e84d420.html


Consulta i risultati del data_asset_name "scraping_all_20230301" al link:
http://localhost:9000/view/great_expectations/uncommitted/data_docs/local_site/validations/Main/20230301-121508-scraping_all_20230301/20230301T121508.078576Z/29ab79bcc15897ac1099d480a1c3cdc1.html
Checkpoint aggiornati o aggiunti:
 ['checkpoint_20230131', 'checkpoint_20230201', 'checkpoint_20230202', 'checkpoint_20230203', 'checkpoint_20230204', 'checkpoint_20230205', 'checkpoint_20230206', 'checkpoint_20230207', 'checkpoint_20230208', 'checkpoint_20230209', 'checkpoint_20230210', 'checkpoint_20230211', 'checkpoint_20230212', 'checkpoint_20230213', 'checkpoint_20230214', 'checkpoint_20230215', 'checkpoint_20230216', 'checkpoint_20230217', 'checkpoint_20230218', 'checkpoint_20230219', 'checkpoint_20230220', 'checkpoint_20230221', 'checkpoint_20230222', 'checkpoint_20230224', 'checkpoint_20230225', 'checkpoint_20230226', 'checkpoint_20230227', 'checkpoint_20230228', 'checkpoint_20230301', 'main_checkpoint']
