# Data sanity check process

In this notebook, we load a dataset together with the content of a `data_format_ref` file.

We then run a data sanity check on this dataset, based on the content of `data_format_ref`.

In [1]:
import pandas as pd

import fedbiomed.common.data_tool.utils as utils
import fedbiomed.common.data_tool.multi_view_dataframe as multiview

from fedbiomed.common.data_tool.warning_logger import WarningReportLogger
from fedbiomed.common.data_tool.pre_processing_checker import PreProcessingChecker

In [2]:
# loading simple dataset

single_dataset_to_check = utils.load_tabular_datasets(r'data/pseudo_adni/pseudo_adni_mod_2.csv')

# loading `data_format_ref`

single_format_file_ref = utils.load_format_file_ref(r'single_format_file_ref_test2')




file found


In [3]:
single_format_file_ref

{'global_thresholds': {'min_nb_samples': None, 'min_nb_missing_samples': None},
 'pseudo_adni_mod_2.csv': {'CDRSB.bl': {'data_format': 'KEY',
   'data_type': 'CHARACTER',
   'values': ["<class 'str'>"],
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None}}}
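
The reference printed above is a nested dictionary: a `global_thresholds` entry, then one entry per view (file), each mapping feature names to their expected properties. As a minimal sketch that relies only on this printed layout, the helper below flattens the reference into `(view, feature, data_format)` tuples. `list_features` and `example_ref` are hypothetical names used for illustration; they are not part of `fedbiomed`.

```python
from typing import Any, Dict, List, Tuple


def list_features(format_ref: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (view, feature, data_format) for every feature in the reference."""
    features = []
    for view, content in format_ref.items():
        if view == 'global_thresholds':
            continue  # not a view: holds dataset-wide thresholds
        for feature, spec in content.items():
            features.append((view, feature, spec['data_format']))
    return features


# Toy reference copied from the output above (imputation fields omitted)
example_ref = {
    'global_thresholds': {'min_nb_samples': None, 'min_nb_missing_samples': None},
    'pseudo_adni_mod_2.csv': {
        'CDRSB.bl': {'data_format': 'KEY',
                     'data_type': 'CHARACTER',
                     'values': ["<class 'str'>"],
                     'is_missing_values': False},
    },
}

features = list_features(example_ref)
print(features)
```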

In [4]:
# Simple routine for data sanity checking (here with a simple tabular dataset)

# select only features in dataset that will be checked
pre_parsed_dataset_to_check = multiview.select_data_from_format_file_ref(single_dataset_to_check,
                                                                         single_format_file_ref)

primary_key = multiview.search_primary_key(single_format_file_ref)

if primary_key is None:
    print('not a multi view file')
    
df_joined = multiview.join_multi_view_dataset(pre_parsed_dataset_to_check, primary_key, False)
warning_report = WarningReportLogger(disclosure=3)

checker = PreProcessingChecker(single_format_file_ref, df_joined,
                               'dataFormat_file',
                               warning_report)

#checker.update_views_features_name(new_views_name)
checker.check_all()

found primary key CDRSB.bl
read data type KEY
Key Variable CDRSB.bl violated uniqueness of data


In [5]:
report = checker.get_warning_logger()
report

Number of error: 0


{'INCORRECT_FORMAT_FILE': [{'view': '',
   'success': True,
   'feature': 'CDRSB.bl',
   'msg': 'Test passed'}],
 'DATA_TYPE_MISMATCH': [{'view': '',
   'success': True,
   'feature': 'CDRSB.bl',
   'msg': 'Test passed'}],
 'INCORRECT_DATA_TYPE': [{'view': '',
   'success': True,
   'feature': 'CDRSB.bl',
   'msg': 'Test passed'}],
 'OUTLIER_DETECTION_LOWER_BOUND': [{'view': '',
   'success': None,
   'feature': 'CDRSB.bl',
   'msg': 'Test skipped'}],
 'OUTLIER_DETECTION_UPPER_BOUND': [{'view': '',
   'success': None,
   'feature': 'CDRSB.bl',
   'msg': 'Test skipped'}],
 'INCORRECT_VALUES_CATEGORICAL_DATA': [{'view': '',
   'success': None,
   'feature': 'CDRSB.bl',
   'msg': 'Test skipped'}],
 'KEY_UNIQUENESS_VIOLATED': [{'view': '',
   'success': False,
   'feature': 'CDRSB.bl',
   'msg': 'Key Variable CDRSB.bl violated uniqueness of data'}]}
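
In the report above, each test name maps to a list of entries whose `success` flag is `True` (passed), `False` (failed) or `None` (skipped). The sketch below keeps only the failed entries. It assumes the report behaves like a plain dictionary with the layout printed above; `failed_tests` and `sample_report` are hypothetical names, not part of `fedbiomed`.

```python
from typing import Dict, List


def failed_tests(report: Dict[str, List[dict]]) -> Dict[str, List[dict]]:
    """Keep only the entries whose 'success' flag is False (None means skipped)."""
    failures = {}
    for test_name, entries in report.items():
        failed = [entry for entry in entries if entry['success'] is False]
        if failed:
            failures[test_name] = failed
    return failures


# Toy report with the same layout as the output above
sample_report = {
    'INCORRECT_FORMAT_FILE': [
        {'view': '', 'success': True, 'feature': 'CDRSB.bl', 'msg': 'Test passed'}],
    'OUTLIER_DETECTION_LOWER_BOUND': [
        {'view': '', 'success': None, 'feature': 'CDRSB.bl', 'msg': 'Test skipped'}],
    'KEY_UNIQUENESS_VIOLATED': [
        {'view': '', 'success': False, 'feature': 'CDRSB.bl',
         'msg': 'Key Variable CDRSB.bl violated uniqueness of data'}],
}

failures = failed_tests(sample_report)
for test_name, entries in failures.items():
    for entry in entries:
        print(test_name, '->', entry['msg'])
```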

## Explanations of the tests run for data sanity checks

- 'N_SAMPLES_BELOW_THRESHOLD': check if the number of samples in a view is below the threshold set in the data format file
- 'INCORRECT_FORMAT_FILE': check if, for a given feature, every field has been set correctly (i.e. check if the data format file has been modified)
- 'DATA_TYPE_MISMATCH': check if there is any inconsistency between the data format (e.g. QUANTITATIVE, CATEGORICAL) and the subtype, if any (e.g. DISCRETE, BOOLEAN, ...)
- 'INCORRECT_DATA_TYPE': check if there is a mismatch between the specified data type and the actual data type of a given feature
- 'MISSING_DATA_ALLOWED': check if missing data are allowed or not. Should raise an exception when using `get_raised_exception` if the test fails
- 'N_MISSING_DATA_ABOVE_THRESHOLD': check, for each feature, if the number of missing data is below the threshold. Should raise an exception when using `get_raised_exception` if the test fails
- 'OUTLIER_DETECTION_LOWER_BOUND': check for outliers (samples below the threshold)
- 'OUTLIER_DETECTION_UPPER_BOUND': check for outliers (samples above the threshold)
- 'INCORRECT_VALUES_CATEGORICAL_DATA': check if the values found in categorical variables match the values listed in the data format file
- 'INCORRECT_DATETIME_DATA': check if the datetime values in a variable are parseable
- 'KEY_UNIQUENESS_VIOLATED': check if the samples of the KEY variable are unique
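
To make two of these tests concrete, here is a minimal sketch of a key uniqueness check and a missing data check written directly with pandas. This is not the `PreProcessingChecker` implementation; the function names and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd


def key_is_unique(df: pd.DataFrame, key: str) -> bool:
    """True if no value of the key column appears more than once."""
    return bool(df[key].is_unique)


def has_missing_data(df: pd.DataFrame, feature: str) -> bool:
    """True if at least one sample of the feature is missing (NaN/None)."""
    return bool(df[feature].isna().any())


# Toy data: the key 'b' is duplicated and one pH value is missing
toy = pd.DataFrame({'pkey': ['a', 'b', 'b'], 'pH': [0.1, None, 0.3]})

print('key unique:', key_is_unique(toy, 'pkey'))
print('missing pH:', has_missing_data(toy, 'pH'))
```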


In [6]:
# raise exceptions (if any were found)

checker.get_raised_exception()

In [7]:
# loading dataset

multi_dataset_to_check = utils.load_tabular_datasets(r'data/csv_folder')

directory found


In [8]:
# loading `data_format_ref`

multi_format_file_ref = utils.load_format_file_ref('int_multi_view_format_test')

In [9]:
multi_format_file_ref

{'global_thresholds': {'min_nb_samples': 10000, 'min_nb_missing_samples': 10},
 'file1': {'a': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': None,
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None},
  'e': {'data_format': 'DATETIME',
   'data_type': 'DATETIME',
   'values': ["<class 'numpy.datetime64'>"],
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None},
  'i': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': None,
   'is_missing_values': True,
   'data_imputation_method': 'LINEAR_INTERPOLATION_IMPUTATION',
   'data_imputation_parameters': None,
   'data_imputation_variables': {'file1': ['a',
     'e',
     'i',
     'o',
     '0',
     '1',
     '2',
     '3',
     'time',
     'pressure',
     'sp02',
     'a.1',
     'e.1',
     'i.1',
     'o.1',
     'gender',

We will extract only the features listed in the `data_format_ref` file.

In [10]:
# select only features in dataset that will be checked
pre_parsed_dataset_to_check = multiview.select_data_from_format_file_ref(multi_dataset_to_check,
                                                                         multi_format_file_ref)

The loaded dataset is composed of multiple CSV files. To combine these files into a single dataset, we will perform a join operation based on a primary key.

Let's find this primary key: it is the only column declared as `KEY` in the format file.

In [11]:
primary_key = multiview.search_primary_key(multi_format_file_ref)

found primary key pkey
found primary key pkey
found primary key pkey
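
The message above is printed once per view, which suggests that each view declares the same key column. As a rough sketch of such a lookup (not the actual `search_primary_key` implementation), the hypothetical helper below scans a format reference for features whose `data_format` is `'KEY'` and checks that all views agree.

```python
from typing import Any, Dict, Optional


def find_primary_key(format_ref: Dict[str, Any]) -> Optional[str]:
    """Return the name of the KEY feature shared by the views, or None."""
    keys = set()
    for view, content in format_ref.items():
        if view == 'global_thresholds':
            continue  # dataset-wide thresholds, not a view
        for feature, spec in content.items():
            if spec.get('data_format') == 'KEY':
                keys.add(feature)
    if len(keys) > 1:
        raise ValueError(f'several key columns found: {sorted(keys)}')
    return keys.pop() if keys else None


# Toy reference: three views that share the key column 'pkey'
toy_ref = {
    'global_thresholds': {'min_nb_samples': 10000},
    'file1': {'pkey': {'data_format': 'KEY'}, 'a': {'data_format': 'UNKNOWN'}},
    'contatct': {'pkey': {'data_format': 'KEY'}},
    'file2': {'pkey': {'data_format': 'KEY'}},
}

print(find_primary_key(toy_ref))
```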


Performing a join operation

In [12]:
# extract view names from format_file_ref
views_names = utils.get_view_names(multi_format_file_ref)

# rename column names before the join operation

pre_parsed_dataset_to_check, new_views_name = multiview.rename_variables_before_joining(
    pre_parsed_dataset_to_check, views_names, primary_key)

In [13]:
# perform a join operation

# join operation (takes place only if a primary key has been specified in the format file)
df_joined = multiview.join_multi_view_dataset(pre_parsed_dataset_to_check, primary_key, False)

In [14]:
df_joined

Unnamed: 0,discrete,city,pkey,a,e,i,o,0,file1.time,pressure,sp02,a.1,gender,blood type,1,file2.time,pH
0,64.0,Lille,qpqorfhylu gmfjy bdj,67,16,54,25,True,2018-01-03 04:00:00,0.986670,0.676041,6,WOMAN,A,False,2018-01-02 06:00:00,
1,26.0,Lille,kkmjozalfyirgsire ui,42,96,69,61,True,2018-01-02 04:00:00,0.996889,0.864713,32,MAN,AB,True,2018-01-01 00:00:00,0.023107
2,61.0,Paris,ezfasuuycdda foisjte,46,8,89,21,False,2018-01-01 09:00:00,0.777026,0.253232,64,MAN,A,False,2018-01-02 10:00:00,0.587685
3,29.0,Paris,faxiqkt xggzmwzoidbg,29,6,77,14,False,2018-01-04 20:00:00,0.877527,0.450811,73,MAN,AB,True,2018-01-03 12:00:00,0.894073
4,99.0,Lille,znwhlj rwzdutnagwasy,96,79,19,33,False,2018-01-04 09:00:00,0.447389,0.639577,85,WOMAN,O,True,2018-01-01 10:00:00,0.026831
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,9.0,Paris,zeqhcikzdodus jn qjf,81,62,52,68,True,2018-01-02 13:00:00,0.953184,0.439232,34,MAN,AB,False,2018-01-02 05:00:00,0.788560
96,98.0,Marseille,iicthcvfmkajbvr gzir,16,49,30,7,False,2018-01-02 21:00:00,0.442283,0.600981,82,MAN,,True,2018-01-05 02:00:00,0.402979
97,21.0,Lille,ztjakcsk bhjoksdz lm,90,14,36,24,False,2018-01-02 06:00:00,0.988543,0.377624,35,MAN,B,False,2018-01-01 12:00:00,
98,42.0,Marseille,sabunaa opt vpulnxj,91,10,69,58,True,2018-01-02 01:00:00,0.059791,0.490360,42,MAN,B,True,2018-01-02 09:00:00,0.651801
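
In the joined table above, the `time` column that exists in both `file1` and `file2` appears as `file1.time` and `file2.time`. The sketch below reproduces this behaviour with plain pandas: columns shared by several views are prefixed with the view name, then the views are merged on the primary key. It is an illustration under assumptions, not the `multiview` implementation; in particular, the inner join and the helper name `prefix_shared_columns` are choices made for this example.

```python
from functools import reduce

import pandas as pd

# Toy views sharing the key 'pkey' and a colliding column 'time'
views = {
    'file1': pd.DataFrame({'pkey': ['k1', 'k2'], 'time': [1, 2]}),
    'file2': pd.DataFrame({'pkey': ['k1', 'k2'], 'time': [3, 4], 'pH': [0.1, 0.2]}),
}


def prefix_shared_columns(views: dict, key: str) -> dict:
    """Prefix with the view name every non-key column found in several views."""
    counts = {}
    for df in views.values():
        for col in df.columns:
            if col != key:
                counts[col] = counts.get(col, 0) + 1
    renamed = {}
    for name, df in views.items():
        mapping = {c: f'{name}.{c}' for c in df.columns
                   if c != key and counts[c] > 1}
        renamed[name] = df.rename(columns=mapping)
    return renamed


renamed = prefix_shared_columns(views, 'pkey')

# Merge all views pairwise on the primary key (inner join: an assumption)
joined = reduce(lambda left, right: left.merge(right, on='pkey', how='inner'),
                renamed.values())
print(list(joined.columns))
```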


In [15]:
# convert joined dataset into multi view dataset
new_feature_name = {v: list(pre_parsed_dataset_to_check[v].columns) for v in views_names}

multiview.create_multi_view_dataframe_from_dataframe(df_joined, new_feature_name, primary_key=primary_key)

length ['a', 'e', 'i', 'o', '0', 'file1.time', 'pressure', 'sp02', 'a.1', 'gender', 'blood type', 'discrete', 'city', '1', 'file2.time', 'pH'] ['file1', 'file1', 'file1', 'file1', 'file1', 'file1', 'file1', 'file1', 'file1', 'file1', 'file1', 'contatct', 'contatct', 'file2', 'file2', 'file2']


views,file1,file1,file1,file1,file1,file1,file1,file1,file1,file1,file1,contatct,contatct,file2,file2,file2,primary_key
feature_name,a,e,i,o,0,file1.time,pressure,sp02,a.1,gender,blood type,discrete,city,1,file2.time,pH,pkey
0,67,16,54,25,True,2018-01-03 04:00:00,0.98667,0.676041,6,WOMAN,A,64.0,Lille,False,2018-01-02 06:00:00,,qpqorfhylu gmfjy bdj
1,42,96,69,61,True,2018-01-02 04:00:00,0.996889,0.864713,32,MAN,AB,26.0,Lille,True,2018-01-01 00:00:00,0.023107,kkmjozalfyirgsire ui
2,46,8,89,21,False,2018-01-01 09:00:00,0.777026,0.253232,64,MAN,A,61.0,Paris,False,2018-01-02 10:00:00,0.587685,ezfasuuycdda foisjte
3,29,6,77,14,False,2018-01-04 20:00:00,0.877527,0.450811,73,MAN,AB,29.0,Paris,True,2018-01-03 12:00:00,0.894073,faxiqkt xggzmwzoidbg
4,96,79,19,33,False,2018-01-04 09:00:00,0.447389,0.639577,85,WOMAN,O,99.0,Lille,True,2018-01-01 10:00:00,0.026831,znwhlj rwzdutnagwasy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,81,62,52,68,True,2018-01-02 13:00:00,0.953184,0.439232,34,MAN,AB,9.0,Paris,False,2018-01-02 05:00:00,0.78856,zeqhcikzdodus jn qjf
96,16,49,30,7,False,2018-01-02 21:00:00,0.442283,0.600981,82,MAN,,98.0,Marseille,True,2018-01-05 02:00:00,0.402979,iicthcvfmkajbvr gzir
97,90,14,36,24,False,2018-01-02 06:00:00,0.988543,0.377624,35,MAN,B,21.0,Lille,False,2018-01-01 12:00:00,,ztjakcsk bhjoksdz lm
98,91,10,69,58,True,2018-01-02 01:00:00,0.059791,0.49036,42,MAN,B,42.0,Marseille,True,2018-01-02 09:00:00,0.651801,sabunaa opt vpulnxj
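
The table above has two header rows, `views` and `feature_name`, which is how pandas displays a DataFrame whose columns are a `MultiIndex`. As a minimal sketch of that structure (the toy data and the `feature_to_view` mapping are assumptions, and this is not the `multiview` code), a flat DataFrame can be given two-level columns as follows.

```python
import pandas as pd

# Flat (already joined) toy dataset
flat = pd.DataFrame({
    'a': [67, 42],
    'pH': [None, 0.023107],
    'pkey': ['k1', 'k2'],
})

# Which view each feature belongs to
feature_to_view = {'a': 'file1', 'pH': 'file2', 'pkey': 'primary_key'}

multi = flat.copy()
multi.columns = pd.MultiIndex.from_tuples(
    [(feature_to_view[col], col) for col in flat.columns],
    names=['views', 'feature_name'],
)

# Selecting on the first level returns all the features of one view
file1_view = multi['file1']
print(file1_view)
```

Selecting a view by its name on the first column level is the main benefit of this layout: each view can be checked separately while the samples stay aligned on the same rows.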


## Perform a data sanity check

In [16]:
warning_report = WarningReportLogger(disclosure=3)

checker = PreProcessingChecker(multi_format_file_ref,
                               df_joined,
                               'multiview_dataFormat_file',
                               warning_report)

checker.update_views_features_name(new_views_name)
checker.check_all()



features names updated
MinimumSamplesViolatedException: Number of samples contained in dataset file1 is below threshold (expected at least 10000 samples, found 100 samples)
MinimumSamplesViolatedException: Number of samples contained in dataset file1 is below threshold (expected at least 10000 samples, found 100 samples)
MissingDataException: Variable blood type must not have missing data, but some were found
MissingDataException: Variable blood type must not have missing data, but some were found
MinimumSamplesViolatedException: Number of samples contained in dataset contatct is below threshold (expected at least 10000 samples, found 100 samples)
MinimumSamplesViolatedException: Number of samples contained in dataset contatct is below threshold (expected at least 10000 samples, found 100 samples)
MinimumSamplesViolatedException: Number of samples contained in dataset file2 is below threshold (expected at least 10000 samples, found 100 samples)
MinimumSamplesViolatedException: Number o

## Get the report

In [17]:
report = checker.get_warning_logger()

Number of error: 5


In [19]:
report

{'N_SAMPLES_BELOW_THRESHOLD': [{'view': 'file1',
   'success': False,
   'feature': 'ALL',
   'msg': 'MinimumSamplesViolatedException: Number of samples contained in dataset file1 is below threshold (expected at least 10000 samples, found 100 samples)'},
  {'view': 'contatct',
   'success': False,
   'feature': 'ALL',
   'msg': 'MinimumSamplesViolatedException: Number of samples contained in dataset contatct is below threshold (expected at least 10000 samples, found 100 samples)'},
  {'view': 'file2',
   'success': False,
   'feature': 'ALL',
   'msg': 'MinimumSamplesViolatedException: Number of samples contained in dataset file2 is below threshold (expected at least 10000 samples, found 100 samples)'}],
 'INCORRECT_FORMAT_FILE': [{'view': '',
   'success': True,
   'feature': 'a',
   'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'e', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'i', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'featur

In [20]:

# convert each test's list of entries into a DataFrame
report2 = {k: pd.DataFrame(v) for k, v in report.items()}

report2.pop('N_SAMPLES_BELOW_THRESHOLD')
report2.pop('INCORRECT_DATETIME_DATA')
multiview.create_multi_view_dataframe_from_dictionary(report2, ['test name', 'index'])

ValueError: Cannot create multi view dataset: different number of samples for each modality have been detectedDetails: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 19 and the array at index 1 has size 17
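
According to the error message, the conversion fails because the tests do not all report the same number of entries (19 for one, 17 for another). One possible workaround, sketched below with plain pandas, is a long-format table with one row per (test, entry) pair, which does not require equal lengths. The toy `sample_report` mimics the layout of `report`; its content is made up for illustration.

```python
import pandas as pd

# Toy report: tests with a different number of entries
sample_report = {
    'MISSING_DATA_ALLOWED': [
        {'view': 'file1', 'success': False, 'feature': 'blood type',
         'msg': 'MissingDataException: Variable blood type must not have missing data, but some were found'},
        {'view': 'file1', 'success': True, 'feature': 'gender', 'msg': 'Test passed'},
    ],
    'OUTLIER_DETECTION_LOWER_BOUND': [
        {'view': 'file1', 'success': None, 'feature': 'gender', 'msg': 'Test skipped'},
    ],
}

# One row per (test, entry) pair, with the test name as an extra column
rows = [
    {'test name': test_name, **entry}
    for test_name, entries in sample_report.items()
    for entry in entries
]
summary = pd.DataFrame(rows)

# Keep only the failed tests (skipped tests have success=None)
failed = summary[summary['success'].eq(False)]
print(failed[['test name', 'view', 'feature']])
```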

In [13]:
report

{'N_SAMPLES_BELOW_THRESHOLD': [{'view': 'file1',
   'success': True,
   'feature': 'ALL',
   'msg': 'Test passed'},
  {'view': 'contatct',
   'success': True,
   'feature': 'ALL',
   'msg': 'Test passed'},
  {'view': 'file2', 'success': True, 'feature': 'ALL', 'msg': 'Test passed'}],
 'INCORRECT_FORMAT_FILE': [{'view': '',
   'success': True,
   'feature': 'e',
   'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': '1', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': '2', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'time', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'pressure', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'e.1', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'gender', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'blood type', 'msg': 'Test passed'},
  {'view': '', 'success': True, 'feature': 'pkey', 'msg': 'Test passed'},
  {'view': ''

## Raise collected errors

This raises all the errors collected in the warning logger.

In [22]:
checker.get_raised_exception()

DataSanityCheckException: MinimumSamplesViolatedException: Number of samples contained in dataset file1 is below threshold (expected at least 10000 samples, found 100 samples)
MissingDataException: Variable blood type must not have missing data, but some were found
MinimumSamplesViolatedException: Number of samples contained in dataset contatct is below threshold (expected at least 10000 samples, found 100 samples)
MinimumSamplesViolatedException: Number of samples contained in dataset file2 is below threshold (expected at least 10000 samples, found 100 samples)
MissingDataException: Variable pH must not have missing data, but some were found
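
The output above shows a single exception whose message concatenates every failure found during the checks. This collect-then-raise pattern can be sketched as follows. `DataSanityCheckError` and `raise_collected` are hypothetical stand-ins written for this example; they are not the classes or functions used by `fedbiomed`.

```python
from typing import List


class DataSanityCheckError(Exception):
    """Hypothetical stand-in for the exception shown in the output above."""


def raise_collected(messages: List[str]) -> None:
    """Raise a single exception listing every collected failure, if any."""
    if messages:
        raise DataSanityCheckError('\n'.join(messages))


# Failure messages collected while running the checks (copied from the output above)
collected = [
    'MissingDataException: Variable blood type must not have missing data, but some were found',
    'MissingDataException: Variable pH must not have missing data, but some were found',
]

try:
    raise_collected(collected)
except DataSanityCheckError as err:
    caught = str(err)
    print(caught)
```

Raising once at the end, rather than at the first failure, lets the user see every problem in the dataset in a single run.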