# Example notebook 02

Using the data generated by the notebook `00_create_data.ipynb`, this notebook takes you through some of the basic functionality of the `Checks` class:

+ [Required keys](#Required-keys)
+ [Looking at the issues output](#Looking-at-the-issues-ouput)
+ [category](#category)
+ [long_description](#long_description)
+ [columns](#columns)
+ [count_condition](#count_condition)
+ [index_position](#index_position)
+ [check_condition](#check_condition)
+ [relevant_columns](#relevant_columns)
+ [idx_flag](#idx_flag)
+ [Defaults](#Defaults)

## Setup
<hr>

Imports and settings options

In [1]:
import IPython.core.display as ICD
import pickle
from datetime import datetime

import pandas as pd

pd.set_option('display.max_rows', 12)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

In [2]:
from data_etl import Checks

## Examples
<hr>

Initialise the class

In [3]:
var_start_time = datetime.now()
ch_checks = Checks(var_start_time, '1')

Set the data. Here we pass it as a dictionary, which is the same format that `DataCuration.tables` gives if multiple tables are defined

In [4]:
dict_data = {
    'df_checks.pkl': pickle.load(open('data/df_checks.pkl', 'rb')),
    'df_checks_issues.pkl': pickle.load(open('data/df_checks_issues.pkl', 'rb'))
}

The `Checks` class also works with a single `DataFrame` input, which is what `DataCuration.tables` gives if the tables are not separated into dictionary format

Each check is defined by a dictionary entry. Each entry has both required and optional inputs, which are looked at below

### Required keys
<hr>

The required keys are:
+ A short description of the check, used as the dictionary key. This is output as the short description if an issue is found, so it needs to make clear what is being checked
+ A `calc_condition` value, a lambda function that returns a boolean series marking the rows where an issue has been found

Initialise the dictionary to contain the checks, this can be useful if you have very complex checks that involve extra user defined functions that you want to keep in a sensible group

In [5]:
dict_checks = dict()

Our simple check has a label of `Number should be greater than 0` and the `calc_condition` is a lambda function which takes three arguments of:
+ `df`, this is the individual table that is being checked at that time
+ `col`, this is the column name passed in from the optional `columns` key, which has an example later in this notebook
+ `**kwargs`, for some checks you may want to check against other information such as records that already exist in a database, the kwargs allows you to pass any information that you need for the checks

In [6]:
dict_checks['Number should be greater than 0'] = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 0
}
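To make the role of `**kwargs` more concrete, here is a minimal sketch that calls a `calc_condition` lambda directly on a small stand-in table. The keyword `minimum` is a hypothetical name chosen for this illustration, and how keyword arguments reach the lambda through `apply_checks` is not shown in this notebook, so treat this as a picture of the lambda only.

```python
import pandas as pd

# Small stand-in table; the column name `number` matches the notebook's data
df = pd.DataFrame({'number': [3, 10, 9, 4, -1]})

# `minimum` is a hypothetical keyword argument used only for this illustration
calc_condition = lambda df, col, **kwargs: df['number'] <= kwargs.get('minimum', 0)

# Calling the lambda directly shows the boolean Series the check produces
flags_default = calc_condition(df, None)
flags_custom = calc_condition(df, None, minimum=3)

print(flags_default.tolist())  # [False, False, False, False, True]
print(flags_custom.tolist())   # [True, False, False, False, True]
```

A `True` value marks a row where the issue is present, which is why the condition is written as the opposite of the check's label.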

Apply the dictionary of checks to the dictionary of tables

In [7]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks)

Look at the issues that were found

In [8]:
ch_checks.df_issues

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
0,1,,,df_checks_issues.pkl,,0,,Number should be greater than 0,,,1,4,2020-05-25 20:34:34.483263


So we can go back to the data and find that problem value

In [9]:
dict_data['df_checks_issues.pkl'].loc[[4], :]

Unnamed: 0,number,category_1,category_2
4,-1,C,c


### Looking at the issues ouput
<hr>

Re-output the example from above

In [10]:
ch_checks.df_issues

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
0,1,,,df_checks_issues.pkl,,0,,Number should be greater than 0,,,1,4,2020-05-25 20:34:34.483263


A quick look at the columns, all the optional inputs have examples later in the notebook:
+ `key_1`, `key_2`, `key_3`, `grouping`, as with the `DataCuration` object, when you initialise it you can set some keys and a grouping value to help identify an individual purpose for the process and an individual run. If you keep the same values across your `DataCuration` and `Checks` objects then it helps to link them together
+ `file`, this is the key that you provided for your data so you know which table it is from, this is useful when you are checking multiple files individually
+ `sub_file`, if you are using spreadsheets with multiple tabs, this gives the opportunity to have an individual label for the source, it uses the internal parameter `__key_separator` to split the input key from the data
    + E.g. Data key 'Spreadsheet name -:- Sheet1' will have a file value of 'Spreadsheet name' and a sub_file value of 'Sheet1' to help identify the source of the issue
+ `step_number`, for some processes it is required that you have checks at different points, the step number allows you to block these checks together in a meaningful way, the value is set against the `Checks` object
+ `category`, an optional input that can be used to indicate if the issue has to be resolved before continuing or is for information purposes or any other labelling that you would find useful
+ `issue_short_desc`, this is the label for the check that defines the unique check, e.g. in our example above 'Number should be greater than 0' is the dictionary key and the `issue_short_desc`
+ `issue_long_desc`, an optional input that can be used to give more detail, for instance if there are invalid values it could note which values are considered invalid
+ `column`, linked to an optional input if the same check is applied to multiple columns individually this will note which column (or set of columns) the check was applied to when an issue was found
+ `issue_count`, an optional input, the default value is to sum the output of `calc_condition` to get a count of the rows with the issue
+ `issue_idx`, an optional input, the default value is to take the index values of the table where an issue was found and create a comma separated list of those indexes
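
As a rough illustration of the `sub_file`, `issue_count` and `issue_idx` descriptions above, the sketch below reproduces them by hand. The separator string `' -:- '` is inferred from the example data key and is an assumption, and the code is not taken from the `Checks` class itself.

```python
import pandas as pd

# Assumed separator, taken from the 'Spreadsheet name -:- Sheet1' example above
key_separator = ' -:- '
file, sub_file = 'Spreadsheet name -:- Sheet1'.split(key_separator)

# Boolean Series as a `calc_condition` might return it: True marks a problem row
condition = pd.Series([True, False, False, False, True])

# Default-style count: the number of True values
issue_count = int(condition.sum())
# Default-style index list: comma separated labels of the flagged rows
issue_idx = ', '.join(str(idx) for idx in condition[condition].index)

print(file, '|', sub_file)  # Spreadsheet name | Sheet1
print(issue_count)          # 2
print(issue_idx)            # 0, 4
```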

All optional inputs also have a default value in the `Checks` class, these will be looked at later in the notebook

If your data has been provided by someone else who is the data owner, these issue / validation logs, potentially modified, can be passed back to them so they can correct the information before you proceed with your work or put the data into a database

### category
<hr>

Set a new step number to differentiate any new issues found

In [11]:
ch_checks.set_step_no(1)

Create a new dictionary for the checks and put in a category label

In [12]:
dict_checks_category = dict()
dict_checks_category['Number should be greater than 2'] = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 2,
    'category': 'severe'
}

Apply the dictionary of checks to the dictionary of tables

In [13]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_category)

Look at the issues that were found

In [14]:
ch_checks.df_issues

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
0,1,,,df_checks_issues.pkl,,0,,Number should be greater than 0,,,1,4,2020-05-25 20:34:34.483263
1,1,,,df_checks_issues.pkl,,1,severe,Number should be greater than 2,,,2,"0, 4",2020-05-25 20:34:34.483263


We have the previous one and a new one with our new step number, and the category is filled in for our new issue

Looking at the data for where the issues are

In [15]:
dict_data['df_checks_issues.pkl'].iloc[[0, 4], :]

Unnamed: 0,number,category_1,category_2
0,1,Z,y
4,-1,C,c


### long_description
<hr>

Set a new step number to differentiate any new issues found

In [16]:
ch_checks.set_step_no(2)

This time we define a new check on the `category_1` column and add in a long description

In [17]:
dict_checks_long_description = {
    'check values in list': {
        'calc_condition': lambda df, col, **kwargs: ~df['category_1'].isin(['A', 'B', 'C', 'D']),
        'long_description': lambda df, col, condition, **kwargs: 
            f"The invalid values are: "
            f"{df.loc[~df['category_1'].isin(['A', 'B', 'C', 'D'])]['category_1'].unique().tolist()}"
    }
}

Apply the dictionary of checks to the dictionary of tables

In [18]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_long_description)

Look at the issues that were found

In [19]:
ch_checks.df_issues

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
0,1,,,df_checks_issues.pkl,,0,,Number should be greater than 0,,,1,4,2020-05-25 20:34:34.483263
1,1,,,df_checks_issues.pkl,,1,severe,Number should be greater than 2,,,2,"0, 4",2020-05-25 20:34:34.483263
2,1,,,df_checks_issues.pkl,,2,,check values in list,"The invalid values are: ['Z', 'Y']",,2,"0, 2",2020-05-25 20:34:34.483263


So the long description is now generated dynamically from the data it is passed. The names short and long are just ways to differentiate between the two descriptions rather than to indicate length
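
A possible variation, sketched here by calling the lambdas directly, reuses the `condition` argument instead of repeating the `isin` test inside the description. It assumes that `condition` holds the output of `calc_condition` for `long_description` in the same way the notebook describes for `count_condition`. The name `valid_values` is introduced only for this example.

```python
import pandas as pd

# Stand-in for the `category_1` column of df_checks_issues.pkl
df = pd.DataFrame({'category_1': ['Z', 'A', 'Y', 'B', 'C']})
valid_values = ['A', 'B', 'C', 'D']

calc_condition = lambda df, col, **kwargs: ~df['category_1'].isin(valid_values)
# Reuse the already computed `condition` instead of repeating the `isin` test
long_description = lambda df, col, condition, **kwargs: (
    f"The invalid values are: {df.loc[condition, 'category_1'].unique().tolist()}"
)

condition = calc_condition(df, None)
message = long_description(df, None, condition)
print(message)  # The invalid values are: ['Z', 'Y']
```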

### columns
<hr>

Set a new step number to differentiate any new issues found

In [20]:
ch_checks.set_step_no(3)

Sometimes the same check will apply to multiple columns but it would be verbose to define the check for each one individually, so you can use the columns key instead

In [21]:
dict_checks_columns = dict()
dict_checks_columns['Letter should be upper case'] = {
    'columns': ['category_1', 'category_2'],
    'calc_condition': lambda df, col, **kwargs: df[col] == df[col].str.upper()
}
dict_checks_columns['Letter should not be one of B, z'] = {
    'columns': ['category_1', 'category_2'],
    'calc_condition': lambda df, col, **kwargs: df[col].isin(['B', 'z'])
}

Apply the dictionary of checks to the dictionary of tables

In [22]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_columns)

Look at the issues that were found

In [23]:
ch_checks.df_issues.loc[ch_checks.df_issues['step_number'] == 3]

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
3,1,,,df_checks.pkl,,3,,Letter should be upper case,,category_1,5,"0, 1, 2, 3, 4",2020-05-25 20:34:34.483263
4,1,,,df_checks.pkl,,3,,"Letter should not be one of B, z",,category_1,1,2,2020-05-25 20:34:34.483263
5,1,,,df_checks.pkl,,3,,"Letter should not be one of B, z",,category_2,1,1,2020-05-25 20:34:34.483263
6,1,,,df_checks_issues.pkl,,3,,Letter should be upper case,,category_1,5,"0, 1, 2, 3, 4",2020-05-25 20:34:34.483263
7,1,,,df_checks_issues.pkl,,3,,"Letter should not be one of B, z",,category_1,1,3,2020-05-25 20:34:34.483263


There are now issues for the df_checks.pkl file, but it's easy to differentiate them from the issues in the other file

A breakdown of these issues:
+ For the check 'Letter should be upper case'
    + The column category_1 errored for all rows in both tables of data, because the `calc_condition` as written is `True` where a value already equals its upper case form
+ For the check 'Letter should not be one of B, z'
    + Both columns errored for df_checks.pkl for one entry each
    + Only column category_1 errored for df_checks_issues.pkl for one entry

We can look at the original table to spot these problem values

In [24]:
ICD.display(dict_data['df_checks.pkl'])
ICD.display(dict_data['df_checks_issues.pkl'])

Unnamed: 0,number,category_1,category_2
0,3,A,a
1,10,A,z
2,9,B,b
3,4,D,d
4,7,C,c


Unnamed: 0,number,category_1,category_2
0,1,Z,y
1,10,A,a
2,9,Y,b
3,4,B,b
4,-1,C,c
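
The per-column behaviour of the `columns` key can be pictured as a loop that evaluates the same lambda once for each listed column. The sketch below does this by hand on a copy of the `df_checks.pkl` values shown above. It is an illustration of the idea under that assumption, not the internals of `apply_checks`.

```python
import pandas as pd

# Values copied from the df_checks.pkl output shown above
df = pd.DataFrame({
    'number': [3, 10, 9, 4, 7],
    'category_1': ['A', 'A', 'B', 'D', 'C'],
    'category_2': ['a', 'z', 'b', 'd', 'c'],
})

check = {
    'columns': ['category_1', 'category_2'],
    'calc_condition': lambda df, col, **kwargs: df[col].isin(['B', 'z']),
}

# Evaluate the same lambda once per listed column, passing the name as `col`
flagged = {}
for col in check['columns']:
    condition = check['calc_condition'](df, col)
    flagged[col] = condition[condition].index.tolist()

print(flagged)  # {'category_1': [2], 'category_2': [1]}
```

The flagged positions agree with the `issue_idx` values reported for `df_checks.pkl` at step 3.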


### count_condition
<hr>

Set a new step number to differentiate any new issues found

In [25]:
ch_checks.set_step_no(4)

Create a new dictionary for the checks and put in a new count condition, we will use the same check as for adding in the category (step 1) so we can compare the outputs

In the lambda for `count_condition` we have an extra argument of `condition`, this is the output of `calc_condition`

In [26]:
dict_checks_count_condition = dict()
dict_checks_count_condition['Number should be greater than 2'] = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 2,
    'category': 'severe',
    'count_condition': lambda df, col, condition, **kwargs: -1
}

Apply the dictionary of checks to the dictionary of tables

In [27]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_count_condition)

Look at the issues that were found

In [28]:
ch_checks.df_issues.loc[ch_checks.df_issues['step_number'].isin([1, 4])]

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
1,1,,,df_checks_issues.pkl,,1,severe,Number should be greater than 2,,,2,"0, 4",2020-05-25 20:34:34.483263
8,1,,,df_checks_issues.pkl,,4,severe,Number should be greater than 2,,,-1,"0, 4",2020-05-25 20:34:34.483263
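
Returning `-1` shows that the count is fully under your control. A more practical variation, sketched below with made-up data, counts the distinct offending values rather than the offending rows. The lambdas are called directly here and the table is a stand-in, not the notebook's data.

```python
import pandas as pd

# Made-up stand-in table with a repeated offending value
df = pd.DataFrame({'number': [1, 1, 9, 4, -1]})

calc_condition = lambda df, col, **kwargs: df['number'] <= 2
# Count distinct offending values instead of offending rows
count_condition = lambda df, col, condition, **kwargs: int(
    df.loc[condition, 'number'].nunique()
)

condition = calc_condition(df, None)
rows_flagged = int(condition.sum())
issue_count = count_condition(df, None, condition)

print(rows_flagged)  # 3
print(issue_count)   # 2
```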


### index_position
<hr>

Set a new step number to differentiate any new issues found

In [29]:
ch_checks.set_step_no(5)

Create a new dictionary for the checks and put in a new index position lambda, this will point you to what row of the data has had an issue

Using the checks from step 0

You may want to have it 1 indexed instead of 0 indexed, or if you know that the end user will have 4 header rows and is looking at it in a spreadsheet then you may want the output to be the spreadsheet row label

You can also choose to keep the default and then make this modification separately before sending out if needed

In [30]:
dict_checks_index_position = dict()
dict_checks_index_position['Number should be greater than 0'] = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 0,
    'index_position': lambda df, col, condition, **kwargs: 
        pd.Series(
            condition.values.tolist(), 
            index=[item + 1 for item in condition.index.tolist()]
        )
}

Apply the dictionary of checks to the dictionary of tables

In [31]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_index_position)

Look at the issues that were found

In [32]:
ch_checks.df_issues.loc[ch_checks.df_issues['step_number'].isin([0, 5])]

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
0,1,,,df_checks_issues.pkl,,0,,Number should be greater than 0,,,1,4,2020-05-25 20:34:34.483263
9,1,,,df_checks_issues.pkl,,5,,Number should be greater than 0,,,1,5,2020-05-25 20:34:34.483263
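
For the spreadsheet case mentioned earlier in this section, a sketch along the following lines shifts the index so that it matches the spreadsheet row labels. The name `header_rows` and the four-row layout are assumptions for this illustration, and the lambdas are called directly rather than through `apply_checks`.

```python
import pandas as pd

# Stand-in for the `number` column of df_checks_issues.pkl
df = pd.DataFrame({'number': [1, 10, 9, 4, -1]})
header_rows = 4  # assumed layout: four header rows above the data

calc_condition = lambda df, col, **kwargs: df['number'] <= 0
# Row 0 of the data sits on spreadsheet row header_rows + 1
index_position = lambda df, col, condition, **kwargs: pd.Series(
    condition.values, index=condition.index + header_rows + 1
)

condition = calc_condition(df, None)
shifted = index_position(df, None, condition)
print(shifted[shifted].index.tolist())  # [9]
```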


### check_condition
<hr>

Set a new step number to differentiate any new issues found

In [33]:
ch_checks.set_step_no(6)

Create a new dictionary for the checks and put in a new check condition key, this will mean a new trigger for an issue being raised is defined

In [34]:
dict_checks_check_condition = dict()
dict_checks_check_condition['Letter should be upper case'] = {
    'columns': ['category_1', 'category_2'],
    'calc_condition': lambda df, col, **kwargs: df[col] == df[col].str.upper(),
    'check_condition': lambda df, col, condition, **kwargs: condition.sum() > 2
}
dict_checks_check_condition['Letter should not be one of B, z'] = {
    'columns': ['category_1', 'category_2'],
    'calc_condition': lambda df, col, **kwargs: df[col].isin(['B', 'z']),
    'check_condition': lambda df, col, condition, **kwargs: condition.sum() > 2
}

Apply the dictionary of checks to the dictionary of tables

In [35]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_check_condition)

Look at the issues that were found

In [36]:
ch_checks.df_issues.loc[ch_checks.df_issues['step_number'].isin([3, 6])]

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
3,1,,,df_checks.pkl,,3,,Letter should be upper case,,category_1,5,"0, 1, 2, 3, 4",2020-05-25 20:34:34.483263
4,1,,,df_checks.pkl,,3,,"Letter should not be one of B, z",,category_1,1,2,2020-05-25 20:34:34.483263
5,1,,,df_checks.pkl,,3,,"Letter should not be one of B, z",,category_2,1,1,2020-05-25 20:34:34.483263
6,1,,,df_checks_issues.pkl,,3,,Letter should be upper case,,category_1,5,"0, 1, 2, 3, 4",2020-05-25 20:34:34.483263
7,1,,,df_checks_issues.pkl,,3,,"Letter should not be one of B, z",,category_1,1,3,2020-05-25 20:34:34.483263
10,1,,,df_checks.pkl,,6,,Letter should be upper case,,category_1,5,"0, 1, 2, 3, 4",2020-05-25 20:34:34.483263
11,1,,,df_checks_issues.pkl,,6,,Letter should be upper case,,category_1,5,"0, 1, 2, 3, 4",2020-05-25 20:34:34.483263


We can see that the 3 issues with only one problem found have not appeared for step 6, this is because we set our check condition to only trigger if more than 2 problems were found
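
The trigger does not have to be an absolute count. As a hedged sketch, the lambda below raises the issue when more than 10% of rows are flagged, and it is compared with the count-based trigger used in this section. The 10% threshold is an arbitrary value for illustration and the lambdas are called directly.

```python
import pandas as pd

# Stand-in for the `category_1` column of df_checks.pkl
df = pd.DataFrame({'category_1': ['A', 'A', 'B', 'D', 'C']})

calc_condition = lambda df, col, **kwargs: df[col].isin(['B', 'z'])
# Raise the issue only when more than 10% of rows are flagged
check_condition = lambda df, col, condition, **kwargs: condition.mean() > 0.1

condition = calc_condition(df, 'category_1')
raised_by_share = bool(check_condition(df, 'category_1', condition))
raised_by_count = bool(condition.sum() > 2)

print(raised_by_share)  # True
print(raised_by_count)  # False
```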

### relevant_columns
<hr>

Set a new step number to differentiate any new issues found

In [37]:
ch_checks.set_step_no(7)

Create a new dictionary for the checks and put in a new relevant columns key, this will mean even though the columns key is not defined there will still be an output in the 'column' column of the issues table 

In [38]:
dict_checks_relevant_columns = dict()
dict_checks_relevant_columns['Number should be greater than 0'] = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 0,
    'relevant_columns': lambda df, col, condition, **kwargs: 'number'
}

Apply the dictionary of checks to the dictionary of tables

In [39]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_relevant_columns)

Look at the issues that were found

In [40]:
ch_checks.df_issues.loc[ch_checks.df_issues['step_number'].isin([0, 7])]

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
0,1,,,df_checks_issues.pkl,,0,,Number should be greater than 0,,,1,4,2020-05-25 20:34:34.483263
12,1,,,df_checks_issues.pkl,,7,,Number should be greater than 0,,number,1,4,2020-05-25 20:34:34.483263


### idx_flag
<hr>

Set a new step number to differentiate any new issues found

In [41]:
ch_checks.set_step_no(8)

Create a new dictionary for the checks and put in a new idx flag key. Sometimes it is easier to define a function where a correct value gives `True` versus one where an incorrect value gives `True`, the idx flag allows you to switch between them

In [42]:
dict_checks_idx_flag = dict()
dict_checks_idx_flag['Number should be greater than 0'] = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 0,
    'idx_flag': False
}

Apply the dictionary of checks to the dictionary of tables

In [43]:
ch_checks.apply_checks(dict_data, dictionary=dict_checks_idx_flag)

Look at the issues that were found

In [44]:
ch_checks.df_issues.loc[ch_checks.df_issues['step_number'].isin([0, 8])]

Unnamed: 0,key_1,key_2,key_3,file,sub_file,step_number,category,issue_short_desc,issue_long_desc,column,issue_count,issue_idx,grouping
0,1,,,df_checks_issues.pkl,,0,,Number should be greater than 0,,,1,4,2020-05-25 20:34:34.483263
13,1,,,df_checks_issues.pkl,,8,,Number should be greater than 0,,,1,"0, 1, 2, 3",2020-05-25 20:34:34.483263


We can see that all the ones which are correct are flagging as wrong for step 8
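
The behaviour seen in the output can be pictured with the small helper below. The function `flagged_index` is hypothetical and written only to mirror the observed results, it is not the implementation in the `Checks` class.

```python
import pandas as pd

# Stand-in for the `number` column of df_checks_issues.pkl
df = pd.DataFrame({'number': [1, 10, 9, 4, -1]})

def flagged_index(condition, idx_flag=True):
    """Return the index labels reported as issues (illustrative only)."""
    # With idx_flag False the rows where the condition is False are reported
    selected = condition if idx_flag else ~condition
    return selected[selected].index.tolist()

is_valid = df['number'] > 0  # True where the value is correct

print(flagged_index(df['number'] <= 0, idx_flag=False))  # [0, 1, 2, 3]
print(flagged_index(is_valid, idx_flag=False))           # [4]
print(flagged_index(~is_valid, idx_flag=True))           # [4]
```

So a condition that is `True` for correct values pairs naturally with `idx_flag` set to `False`, while a condition that is `True` for problem values keeps the default.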

### Defaults
<hr>

For the optional parameters there are defaults that apply if you don't pass a key value pair, these defaults can be overridden using the `set_defaults` function of the `Checks` object

The default values are:
+ `columns`: `[np.nan]`
+ `check_condition`: `lambda df, col, condition, **kwargs: condition.sum() > 0`
+ `count_condition`: `lambda df, col, condition, **kwargs: condition.sum()`
+ `index_position`: `lambda df, col, condition, **kwargs: condition`
+ `relevant_columns`: `lambda df, col, condition, **kwargs: col`
+ `long_description`: `lambda df, col, condition, **kwargs: ""`
+ `idx_flag`: `True`
+ `category`: `np.nan`
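
One way to picture how these defaults combine with a check definition is a plain dictionary merge, sketched below. This is an assumption about the idea rather than the library's actual code, and `float('nan')` stands in for `np.nan` so the example needs no imports.

```python
# Defaults as listed above; float('nan') stands in for np.nan to avoid an import
defaults = {
    'columns': [float('nan')],
    'check_condition': lambda df, col, condition, **kwargs: condition.sum() > 0,
    'count_condition': lambda df, col, condition, **kwargs: condition.sum(),
    'index_position': lambda df, col, condition, **kwargs: condition,
    'relevant_columns': lambda df, col, condition, **kwargs: col,
    'long_description': lambda df, col, condition, **kwargs: "",
    'idx_flag': True,
    'category': float('nan'),
}

# A check that only overrides `category`
check = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 2,
    'category': 'severe',
}

# Keys in the check take precedence over the defaults
resolved = {**defaults, **check}

print(resolved['category'])                   # severe
print(resolved['idx_flag'])                   # True
print(sorted(set(resolved) - set(defaults)))  # ['calc_condition']
```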

There are some checks on the values that you pass to replace the defaults, so an invalid value may raise an error as the example below does

In [45]:
ch_checks.set_defaults(check_condition='check_it')

The passed value for `check_condition` is not a function


ValueError: The passed value for `check_condition` is not a function
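
A validation of this kind could look like the sketch below, which uses Python's built-in `callable`. The function name `validate_default` is hypothetical, only the error message is taken from the output above, and the real checks inside `set_defaults` may differ.

```python
def validate_default(name, value):
    """Reject non-callable values for keys that must be functions (illustrative)."""
    if not callable(value):
        raise ValueError(f"The passed value for `{name}` is not a function")
    return value

# Passing a string where a function is expected raises the error
try:
    validate_default('check_condition', 'check_it')
except ValueError as error:
    message = str(error)

print(message)
```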

---
**GigiSR**