# Example notebook 04

Using the data generated by notebook `00_create_data.ipynb`, this notebook takes you through some of the basic functionality of the `general_functions` module:

+ [Initialise logging](#Initialise-logging)
+ [Import attribute](#Import-attribute)
+ [Check for issues](#Check-for-issues)


## Setup
<hr>

Imports and setting options

In [1]:
from datetime import datetime
import pickle

from data_etl import Checks, Connections, general_functions

### Initialise logging
<hr>

When running interlocking scripts it is useful to have logging, so that if a problem is encountered there is enough information to debug it

This function sets up a log file

In [2]:
general_functions.func_initialise_logging(
    'example_04', 'logs/', '1', None, None, datetime.now()
)
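The internals of `func_initialise_logging` are not shown in this notebook. As a rough idea of what a helper like this might do, here is a minimal sketch using only the standard library `logging` module. The function name `init_file_logging`, its parameters and the file naming scheme are assumptions made for illustration, not the `data_etl` API.

```python
import logging
import os
from datetime import datetime


def init_file_logging(script_name, log_dir, run_id, start_time):
    """Hypothetical helper: send log records to a per-run file in log_dir."""
    os.makedirs(log_dir, exist_ok=True)
    # Assumed naming scheme: <script>_<run id>_<timestamp>.log
    file_name = f"{script_name}_{run_id}_{start_time:%Y%m%d_%H%M%S}.log"
    log_path = os.path.join(log_dir, file_name)

    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s|%(levelname)s|%(name)s|%(message)s")
    )
    # Attach the handler to the root logger so every module's logger uses it
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)
    return log_path
```

Calling `init_file_logging('example_04', 'logs/', '1', datetime.now())` once at the top of a script would mean later `logging.getLogger(__name__).info(...)` calls in any imported module end up in the same file.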

### Import attribute
<hr>

It is often more useful to define the large dictionaries that go into the checks in a separate script, so that they are kept together but don't clutter the main script where the flow of processing is defined

The classes also use this function, as reading objects in from other scripts is a frequent action that keeps the code clear

In [3]:
dict_checks = general_functions.import_attr('.', '04_example', 'dict_checks')
dict_checks

{'Number should be greater than 0': {'calc_condition': <function 04_example.<lambda>(df, col, **kwargs)>},
 'Number should be greater than 2': {'columns': ['number'],
  'calc_condition': <function 04_example.<lambda>(df, col, **kwargs)>,
  'category': 'severe'},
 'check values in list': {'columns': ['category_1'],
  'calc_condition': <function 04_example.<lambda>(df, col, **kwargs)>,
  'long_description': <function 04_example.<lambda>(df, col, condition, **kwargs)>},
 'The category_1 column can only map to certain values': {'calc_condition': <function 04_example.<lambda>(df, col, **kwargs)>,
  'check_condition': <function 04_example.<lambda>(df, col, condition, **kwargs)>,
  'count_condition': <function 04_example.<lambda>(df, col, condition, **kwargs)>,
  'index_position': <function 04_example.<lambda>(df, col, condition, **kwargs)>,
  'relevant_columns': <function 04_example.<lambda>(df, col, condition, **kwargs)>,
  'long_description': <function 04_example.<lambda>(df, col, conditio

The dictionary can then be used as is, or modified first, in the `DataCuration` and `Checks` classes
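How `import_attr` loads the attribute is not shown here. One way to read a single object from a script by folder, script name and attribute name is through `importlib`, sketched below. The helper name `load_attr` is hypothetical and this is not necessarily how `data_etl` implements it.

```python
import importlib.util
import os


def load_attr(folder, script_name, attr_name):
    """Hypothetical helper: load one attribute from <folder>/<script_name>.py."""
    script_path = os.path.join(folder, f"{script_name}.py")
    # Loading by file path works even for names such as '04_example',
    # which a plain `import` statement rejects because they start with a digit
    spec = importlib.util.spec_from_file_location(script_name, script_path)
    module = importlib.util.module_from_spec(spec)
    # Run the script so that its top-level names are defined on the module
    spec.loader.exec_module(module)
    return getattr(module, attr_name)
```

Note that this executes the whole script, so the script holding the dictionaries should contain definitions only and no processing steps.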

### Check for issues
<hr>

The aim of this function is to create a break in the code if there are issues, and to store the issues before erroring out of the script

To use this function we need a class instance with issue entries and a connections class instance to write the issues out to

In [4]:
var_start_time = datetime.now()
ch_checks = Checks(var_start_time, '1')

# Use a context manager so the file handle is closed after loading
with open('data/df_checks_issues.pkl', 'rb') as f:
    dict_data = {'df_checks_issues.pkl': pickle.load(f)}

dict_checks = dict()
dict_checks['Number should be greater than 0'] = {
    'calc_condition': lambda df, col, **kwargs: df['number'] <= 0
}

for step_no in range(5):
    ch_checks.set_step_no(step_no)
    ch_checks.apply_checks(dict_data, dictionary=dict_checks)

ch_checks.df_issues

| | key_1 | key_2 | key_3 | file | sub_file | step_number | category | issue_short_desc | issue_long_desc | column | issue_count | issue_idx | grouping |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | | | df_checks_issues.pkl | | 0 | | Number should be greater than 0 | | | 1 | 4 | 2020-05-26 07:43:04.328680 |
| 1 | 1 | | | df_checks_issues.pkl | | 1 | | Number should be greater than 0 | | | 1 | 4 | 2020-05-26 07:43:04.328680 |
| 2 | 1 | | | df_checks_issues.pkl | | 2 | | Number should be greater than 0 | | | 1 | 4 | 2020-05-26 07:43:04.328680 |
| 3 | 1 | | | df_checks_issues.pkl | | 3 | | Number should be greater than 0 | | | 1 | 4 | 2020-05-26 07:43:04.328680 |
| 4 | 1 | | | df_checks_issues.pkl | | 4 | | Number should be greater than 0 | | | 1 | 4 | 2020-05-26 07:43:04.328680 |


In [5]:
cnxs = Connections()
cnxs.add_cnx(
    cnx_key='df_issues', 
    cnx_type='sqlite3',
    table_name='df_issues',
    file_path='data/00_db.db'
)

Now use the issues table in the function

In [6]:
general_functions.func_check_for_issues(
    ch_checks.get_issue_count(), 
    cnxs, 
    'df_issues', 
    ch_checks.df_issues, 
    ch_checks.get_step_no(),
    override=True
)

The call above has `override=True`, which means it will not error out even if problems are found; the call below omits `override=True` and intentionally errors

In [7]:
general_functions.func_check_for_issues(
    ch_checks.get_issue_count(), 
    cnxs, 
    'df_issues', 
    ch_checks.df_issues, 
    ch_checks.get_step_no()
)

ValueError: There were 5 issues found at step 4
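The store-then-raise behaviour described in this section can be sketched with the standard library alone. The function below is a hypothetical stand-in, not the `data_etl` implementation: its name, its reduced three-column table and the early return when there are no issues are all assumptions. Only the ordering (write first, raise second) and the error message format are taken from the notebook.

```python
import sqlite3


def store_then_raise(issue_count, db_path, table_name, issues, step_no,
                     override=False):
    """Hypothetical stand-in: persist issues, then stop unless overridden."""
    # Assumption: nothing is written or raised when there are no issues
    if issue_count == 0:
        return
    cnx = sqlite3.connect(db_path)
    try:
        # table_name is trusted here; never build SQL from untrusted names
        cnx.execute(
            f"CREATE TABLE IF NOT EXISTS {table_name} "
            "(step_number INTEGER, issue_short_desc TEXT, issue_count INTEGER)"
        )
        cnx.executemany(f"INSERT INTO {table_name} VALUES (?, ?, ?)", issues)
        cnx.commit()
    finally:
        cnx.close()
    # The issues are already stored, so raising here loses nothing
    if not override:
        raise ValueError(
            f"There were {issue_count} issues found at step {step_no}"
        )
```

Here `issues` is a list of `(step_number, issue_short_desc, issue_count)` tuples. Writing before raising is the point of the design: the script stops, but the record of why it stopped survives.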

The benefit of the `override` argument is that you may have a mixture of issues that must be resolved and issues you can live with; it lets you record the issues but carry on regardless
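The notebook does not show how to choose the value of `override`. One possible rule, sketched below, is to carry on only when none of the recorded issues falls in a blocking category such as the `'severe'` category used in `dict_checks`. The function `should_override` and the list-of-dictionaries representation of the issues are hypothetical and used only for illustration.

```python
def should_override(issues, blocking_categories=("severe",)):
    """Hypothetical rule: carry on only if no issue is in a blocking category."""
    return not any(
        issue.get("category") in blocking_categories for issue in issues
    )


# Illustrative records only, not output from the notebook
issues = [
    {"issue_short_desc": "Number should be greater than 0", "category": None},
    {"issue_short_desc": "Number should be greater than 2", "category": "severe"},
]
override = should_override(issues)
```

With these two records `override` is `False`, because the second issue is in the `'severe'` category; with only the first record it would be `True`.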

---
**GigiSR**