# Pandas and Dask Tutorial

## Importing the libraries

In [10]:
import dask.datasets
import pandas as pd
from soda.scan import Scan

## Create Artificial Pandas and Dask DataFrames

In [11]:
# Load timeseries data from dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "a@soda.io"

df_timeseries.head()

|   | timestamp | name | id | x | y | email |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 00:00:00 | Alice | 982 | -0.220556 | 0.880148 | a@soda.io |
| 1 | 2000-01-01 00:00:01 | Edith | 1015 | 0.190462 | 0.784481 | a@soda.io |
| 2 | 2000-01-01 00:00:02 | Oliver | 977 | 0.176636 | -0.738644 | a@soda.io |
| 3 | 2000-01-01 00:00:03 | Yvonne | 988 | -0.192201 | -0.959786 | a@soda.io |
| 4 | 2000-01-01 00:00:04 | Zelda | 1083 | 0.390133 | -0.873293 | a@soda.io |


In [12]:
# Create an artificial pandas dataframe
df_employee = pd.DataFrame(
    {
        "name": ["Bastien", "Titus", "Baturay"],
        "email": ["a@soda.io", "b@soda.io", "c@soda.io"],
    }
)

df_employee.head()

|   | name | email |
|---|---|---|
| 0 | Bastien | a@soda.io |
| 1 | Titus | b@soda.io |
| 2 | Baturay | c@soda.io |


## Create a Soda scan object

In [13]:
scan = Scan()
scan.set_scan_definition_name("dask and pandas tutorial")
scan.set_data_source_name("dask")

INFO:soda.scan:[18:02:18] Soda Core 3.0.21


### Add dataframes to the Soda scan object

In [14]:
# Add the Dask dataframe to the scan; checks YAML refers to it by this dataset name
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries)

# Add the pandas dataframe to the scan; checks YAML refers to it by this dataset name
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee)

### Define checks in YAML format

In the first example, we check the row counts of both dataframes.

In [15]:
# Define checks in YAML format
# Alternatively, load the checks from a YAML file with scan.add_sodacl_yaml_file(<filepath>)
row_count_checks = """
for each dataset T:
  datasets:
    - include %
  checks:
    - row_count > 0
"""
scan.add_sodacl_yaml_str(row_count_checks)
scan.execute()

INFO:soda.scan:Instantiating for each for ['timeseries', 'employee']
INFO:soda.scan:[18:02:19] Scan summary:
INFO:soda.scan:[18:02:19] 2/2 checks PASSED: 
INFO:soda.scan:[18:02:19]     timeseries in dask
INFO:soda.scan:[18:02:19]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:19]     employee in dask
INFO:soda.scan:[18:02:19]       row_count > 0 [PASSED]


0
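The bare `0` after the log output is the value returned by `scan.execute()`. The sketch below shows one way to turn that integer into a readable label. The mapping is an assumption based on how the return values behave in this tutorial (`0` when everything passes, `2` when checks fail), and `describe_exit_code` is a hypothetical helper, not part of Soda; check the documentation of your installed Soda Core version before relying on it.

```python
# Assumed meaning of the integer returned by scan.execute().
# This mapping is NOT taken from this tutorial; verify it for your Soda Core version.
ASSUMED_EXIT_CODES = {
    0: "all checks passed",
    1: "at least one check produced a warning",
    2: "at least one check failed",
    3: "the scan hit a runtime error",
}


def describe_exit_code(code: int) -> str:
    """Return a human-readable label for a scan exit code (hypothetical helper)."""
    return ASSUMED_EXIT_CODES.get(code, f"unknown exit code {code}")


print(describe_exit_code(0))
print(describe_exit_code(2))
```

In a pipeline, a helper like this could be used to log the outcome or to decide whether to stop downstream steps when the code is non-zero.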

Next, we run cross-dataset checks between the pandas and Dask dataframes. The first check verifies that every value in `employee.email` also exists in `timeseries.email`. It is expected to fail because `b@soda.io` and `c@soda.io` do not appear in `timeseries.email`. The second check compares the row counts of the two dataframes and is also expected to fail, because the row counts differ.

In [16]:
cross_table_checks = """
checks for employee:
    - values in (email) must exist in timeseries (email) # expected to fail
    - row_count same as timeseries # expected to fail
"""
scan.add_sodacl_yaml_str(cross_table_checks)
scan.execute()

INFO:soda.scan:Instantiating for each for ['timeseries', 'employee', 'showtables']
INFO:soda.scan:[18:02:22] Using DefaultSampler
INFO:soda.scan:[18:02:22] Scan summary:
INFO:soda.scan:[18:02:22] 7/9 checks PASSED: 
INFO:soda.scan:[18:02:22]     timeseries in dask
INFO:soda.scan:[18:02:22]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:22]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:22]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:22]     employee in dask
INFO:soda.scan:[18:02:22]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:22]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:22]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:22]     showtables in dask
INFO:soda.scan:[18:02:22]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:22] 2/9 checks FAILED: 
INFO:soda.scan:[18:02:22]     employee in dask
INFO:soda.scan:[18:02:22]       values in (email) must exist in timeseries (email) [FAILED]
INFO:soda.scan:[18:02:22]         value: 2
INFO:soda.scan:[18:02:22] 

2
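To make the failing reference check concrete, here is a minimal plain-Python sketch of the idea behind it: collect the values of one column that have no match in a reference column. It illustrates the concept only and is not how Soda implements the check. The function name `values_missing_from_reference` and the five-row stand-in for `timeseries.email` are assumptions made for this example.

```python
# Toy stand-ins for the two email columns used in this tutorial.
employee_emails = ["a@soda.io", "b@soda.io", "c@soda.io"]
timeseries_emails = ["a@soda.io"] * 5  # every timeseries row carries the same email


def values_missing_from_reference(values, reference_values):
    """Return the values that have no match in reference_values, keeping order."""
    reference = set(reference_values)
    return [value for value in values if value not in reference]


missing = values_missing_from_reference(employee_emails, timeseries_emails)
print(missing)       # ['b@soda.io', 'c@soda.io']
print(len(missing))  # 2
```

The two unmatched values correspond to the `value: 2` reported for the failed check in the scan summary.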

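Finally, we add some custom checks for the timeseries data. The same `scan` object is reused throughout this tutorial, so checks added in earlier cells are evaluated again on each `scan.execute()` call, which is why `row_count > 0` shows up repeatedly in the scan summaries.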

In [17]:
timeseries_checks = """
checks for timeseries:
  - invalid_count(email) = 0:
      valid format: email
  - valid_count(email) > 0:
      valid format: email
  - duplicate_count(name) < 4:
      samples limit: 2
  - missing_count(y):
      warn: when > -1
  - missing_percent(x) < 5%
  - missing_count(y) = 0
  - avg(x) between -1 and 1
  - max(x) > 0
  - min(x) < 1:
      filter: x > 0.2
  - freshness(timestamp) < 1d
  - values in (email) must exist in employee (email)
"""
scan.add_sodacl_yaml_str(timeseries_checks)
scan.execute()

INFO:soda.scan:Instantiating for each for ['timeseries', 'employee', 'showtables']
INFO:soda.scan:[18:02:46] Using DefaultSampler
INFO:soda.scan:[18:02:48] Using DefaultSampler
INFO:soda.scan:[18:02:50] Using DefaultSampler
INFO:soda.scan:[18:02:50] Scan summary:
INFO:soda.scan:[18:02:50] 23/30 checks PASSED: 
INFO:soda.scan:[18:02:50]     timeseries in dask
INFO:soda.scan:[18:02:50]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:50]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:50]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:50]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:50]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:50]       values in (email) must exist in employee (email) [PASSED]
INFO:soda.scan:[18:02:50]       row_count > 0 [PASSED]
INFO:soda.scan:[18:02:50]       invalid_count(email) = 0 [PASSED]
INFO:soda.scan:[18:02:50]       valid_count(email) > 0 [PASSED]
INFO:soda.scan:[18:02:50]       missing_count(y) = 0 [PASSED]
INFO:soda.scan:[18:02:50]  

2
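As a rough illustration of what some of the metrics in `timeseries_checks` measure, the sketch below computes a missing count, a missing percentage, and a filtered minimum on a handful of values in plain Python. These helpers are hypothetical and only approximate the intent of `missing_count`, `missing_percent`, and `min(x)` with `filter: x > 0.2`; they are not Soda's implementation. The `x` values are the five rows shown by `df_timeseries.head()`, and one `y` value is replaced with `None` purely to give the missing-value metrics something to count.

```python
def missing_count(values):
    """Count entries that are None (treated as missing in this sketch)."""
    return sum(1 for value in values if value is None)


def missing_percent(values):
    """Return the share of missing entries as a percentage of all entries."""
    if not values:
        return 0.0
    return 100.0 * missing_count(values) / len(values)


def filtered_min(values, predicate):
    """Return the minimum of the non-missing values that satisfy predicate."""
    kept = [value for value in values if value is not None and predicate(value)]
    return min(kept) if kept else None


# x values from the head() output; y has one value replaced by None on purpose.
x = [-0.220556, 0.190462, 0.176636, -0.192201, 0.390133]
y_with_gap = [0.880148, None, -0.738644, -0.959786, -0.873293]

print(missing_count(y_with_gap))           # 1
print(missing_percent(y_with_gap))         # 20.0
print(filtered_min(x, lambda v: v > 0.2))  # 0.390133
```

With the filter applied, only `0.390133` survives among these five rows, so a check such as `min(x) < 1` is evaluated against that value rather than against the unfiltered minimum.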