Related to user story: [SP11-Item04: General Data Wrapper PoC](https://gitlab.inria.fr/fedbiomed/fedbiomed/-/issues/164)

## Tabular dataset

Data pre-processing workflow:

1. Column names should be shared with the researcher.
2. A data format file is filled in by clinicians.
3. Specify whether missing data are allowed for a given column (Exception). The file will be used for data verification during FL pre-processing.
4. Outlier verification for quantitative data (continuous and discrete) and for dates (Critical warning).
5. Missing data imputation by local mean (or optionally NN), or by majority voting for discrete labels. Warnings are given when missing data are found (for a posteriori verification).
6. A critical warning is given when too many values are missing (>50%).
7. Verify that the number of available data is greater than the minimum required (Error).
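
Steps 3, 5, 6 and 7 of the workflow can be sketched for a single column as below. This is only an illustration under stated assumptions, not the Fed-BioMed implementation: the function name `check_and_impute`, its parameters, and the use of `None` to mark a missing entry are all hypothetical. Outlier verification (step 4) is left out.

```python
import warnings
from collections import Counter
from statistics import mean


def check_and_impute(values, allow_missing, quantitative,
                     min_samples=1, max_missing_ratio=0.5):
    """Verify one column, then return a copy with missing entries imputed.

    Assumption: a missing entry is represented by None.
    """
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)

    # Step 7: too few available values is an error (at least one is needed).
    if len(observed) < max(min_samples, 1):
        raise ValueError(f"{len(observed)} values available, {min_samples} required")
    # Step 3: missing data in a column that forbids them raises an exception.
    if n_missing and not allow_missing:
        raise ValueError(f"{n_missing} missing values found but none are allowed")
    if n_missing == 0:
        return list(values)

    # Steps 5 and 6: plain warning, or critical warning above the ratio.
    ratio = n_missing / len(values)
    level = "CRITICAL" if ratio > max_missing_ratio else "WARNING"
    warnings.warn(f"{level}: {n_missing} of {len(values)} values are missing")

    # Step 5: local mean for quantitative data, majority vote otherwise.
    if quantitative:
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```

For instance, `check_and_impute(["Y", "N", None, "Y"], allow_missing=True, quantitative=False)` fills the gap with the majority label `"Y"` and emits a plain warning.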

Critical warnings have different levels of disclosure to the researcher: 1) only the warning, 2) the type of warning, 3) the type of warning and the column affected.
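
The three disclosure levels could be modelled as follows. This is a minimal sketch: the `Disclosure` enum, the function `format_critical_warning` and the message wording are assumptions made for illustration and do not come from the Fed-BioMed code base.

```python
from enum import IntEnum


class Disclosure(IntEnum):
    """Hypothetical names for the three disclosure levels."""
    WARNING_ONLY = 1
    WARNING_TYPE = 2
    WARNING_TYPE_AND_COLUMN = 3


def format_critical_warning(level, warning_type, column):
    """Return the message disclosed to the researcher for a given level."""
    level = Disclosure(level)  # raises ValueError for an unknown level
    if level is Disclosure.WARNING_ONLY:
        return "critical warning"
    if level is Disclosure.WARNING_TYPE:
        return f"critical warning: {warning_type}"
    return f"critical warning: {warning_type} (column: {column})"
```

With level 1 the researcher only learns that something went wrong; the column name is disclosed only at level 3.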

# 1. What is a `data_format_ref` file


A `data_format_ref_file` is a JSON file describing the skeleton of the data.
It should be created by a clinician and sent to the nodes, in order to check the integrity of each node's dataset.
In case there is a mismatch between nodes dataset and `data_format_ref_file` 

Structure of a `data_format_file_ref`:



# 2. Creating a `data_format_ref` file using an existing dataset

In this notebook, we provide a way to create a `data_format_ref_file` from an existing dataset.
The user can then edit it (i.e. change the `data_format_file_ref` file and complete the information it already contains).

## 2.1 Loading tabular dataset

For that, we provide a `load_tabular_datasets` function that can load the following kinds of tabular dataset: `*.csv` files, Excel files (`.xls`, `.xlsx`), and folders of `*.csv` files.
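
To illustrate the idea of returning one entry per view, here is a simplified stand-in written with the standard library only. It is not the code of `load_tabular_datasets`: the name `load_csv_views` is hypothetical, rows are returned as plain dictionaries, every file in a folder is assumed to be a CSV file, and Excel files are not handled.

```python
import csv
from pathlib import Path


def load_csv_views(path):
    """Return {view_name: list of row dicts} for a CSV file or a folder of CSV files."""
    path = Path(path)
    if path.is_dir():
        # Multi-view dataset: every file of the folder is one view.
        files = sorted(p for p in path.iterdir() if p.is_file())
    elif path.is_file():
        # Single-view dataset: the file itself is the only view.
        files = [path]
    else:
        raise FileNotFoundError(path)

    views = {}
    for file in files:
        with file.open(newline="", encoding="utf-8") as handle:
            views[file.name] = list(csv.DictReader(handle))
    return views
```

Keying the result by file name means a single file and a folder are handled the same way downstream: both are dictionaries of views.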

In [1]:
from fedbiomed.common.data_tool.utils import load_tabular_datasets, save_format_file_ref
from fedbiomed.common.data_tool.data_format_ref_cli import get_from_user_multi_view_dataset_fromat_file, edit_format_file_ref

The **load_tabular_datasets** function can load several file types (Excel file, CSV file, and folder containing CSV files).

## Caution! All data are stored in a folder named `/data/` in `fedbiomed/notebooks`

### 2.1.1 Load an Excel file

You can load Excel files as shown in the example below.

In [2]:
!pip install openpyxl #(for opening excel files)
!pip install aenum  # needed to create custom data type



In [4]:
# it can load excel files
load_tabular_datasets(r'./data/excel/Exceltest.xlsx')

file found
err 'utf-8' codec can't decode byte 0x8c in position 15: invalid start byte in file ./data/excel/Exceltest.xlsx


{'Exceltest.xlsx':    ID   Age Eligibility
 0    1   45           Y
 1    2   45           Y
 2    3   33           N
 3    4   54           Y
 4    5   45           Y
 5    6   54         NaN
 6    7   34           N
 7    8   54         NaN
 8    9   45         NaN
 9   10   44           Y}

In [5]:
load_tabular_datasets(r'./data/excel/test_excel.xlsx')

file found
err 'utf-8' codec can't decode byte 0x87 in position 11: invalid start byte in file ./data/excel/test_excel.xlsx


{'test_excel.xlsx':      a  b     val
 0    1  a  10.000
 1    2  b   3.000
 2    3  c   3.900
 3    4  d   8.000
 4    5  a   9.009
 5    6  b  11.000
 6    7  c   4.000
 7    8  d   4.900
 8    9  a   9.000
 9   10  b  10.009
 10  11  c  12.000
 11  12  d   5.000
 12  13  a   5.900
 13  14  b  10.000
 14  15  c  11.009
 15  16  d  13.000
 16  17  a   6.000
 17  18  b   6.900
 18  19  c  11.000}

### 2.1.2 Load a CSV file


This example uses a modified version of the `pseudo_adni_mod` CSV dataset in which the first column (`CDRSB.bl`) has been converted into a column of strings.

In [4]:
# simple csv dataset
pseudo_adni = load_tabular_datasets(r'data/pseudo_adni/pseudo_adni_mod_2.csv')
#single_view_dataset = load_tabular_datasets(r'/user/ybouilla/home/Documents/data/pseudo_adni_mod/pseudo_adni_mod.csv')

file found


### 2.1.3 Load a folder containing CSV files

To load a folder (multi-view dataset), pass the path of the directory (instead of the path of a file).

In [9]:
!tree data/csv_folder

data/csv_folder
├── contatct
├── file1
└── file2

0 directories, 3 files


In [13]:
# folder of dataset
multi_view_dataframe =  load_tabular_datasets('data/csv_folder')


directory found


## 2.2 create the `data_format_ref` file from the loaded dataset

Let's assume the loaded dataset is a reference dataset from which we want to create a `data_format_ref`.


### workflow logic:

```
for each view:
    for each feature within view:
        ask user to select one type among all types present in `Enum` class `DataType` (if user wants to add this feature to `data_format_ref`)
        ask user to indicate whether the variable can have missing data or not
        ask user to indicate which data imputation method to use
        (imputation methods are the ones specified in `Enum` class `ImputationMethods`)
        
```
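
The loop above can be sketched in Python with the questions delegated to a callable, so that it can be driven by a prompt or by scripted answers. The function `build_format_ref`, its `ask` argument and the `IGNORE` choice are hypothetical names. Asking for an imputation method only when missing values are allowed is an assumption of this sketch, not a documented behaviour of the CLI.

```python
def build_format_ref(views, ask):
    """Build a format reference from {view_name: [feature names]}.

    `ask(question, choices)` must return one element of `choices`.
    """
    data_types = ["KEY", "QUANTITATIVE", "CATEGORICAL", "DATETIME",
                  "UNKNOWN", "IGNORE"]
    imputations = ["MEAN_IMPUTATION", "MODE_IMPUTATION", "KNN_IMPUTATION",
                   "LINEAR_INTERPOLATION_IMPUTATION"]
    format_ref = {}
    for view, features in views.items():
        format_ref[view] = {}
        for feature in features:
            data_format = ask(f"specify data type for {feature}", data_types)
            if data_format == "IGNORE":
                continue  # the feature is left out of the format reference
            missing_ok = ask(f"Allow {feature} to have missing values",
                             ["YES", "NO"]) == "YES"
            # Assumption: an imputation method is only needed when
            # missing values are allowed.
            method = None
            if missing_ok:
                method = ask(f"imputation method for {feature}", imputations)
            format_ref[view][feature] = {
                "data_format": data_format,
                "is_missing_values": missing_ok,
                "data_imputation_method": method,
            }
    return format_ref
```

Passing `ask` as an argument keeps the loop testable: a notebook can supply a function based on `input()`, while a test can supply canned answers.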

**Default DataType**:

1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) UNKNOWN (when the user does not want to specify a variable type; this downgrades the data sanity check)


**Data Imputation Methods**

1) MEAN_IMPUTATION
2) MODE_IMPUTATION
3) KNN_IMPUTATION
4) LINEAR_INTERPOLATION_IMPUTATION
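
The two lists above map naturally onto `Enum` classes, from which the numbered menus shown in the prompts can be generated. The sketch below only borrows the member names listed above; the member values and the helpers `menu` and `pick` are assumptions made for illustration.

```python
from enum import Enum


class DataType(Enum):
    KEY = "KEY"
    QUANTITATIVE = "QUANTITATIVE"
    CATEGORICAL = "CATEGORICAL"
    DATETIME = "DATETIME"
    UNKNOWN = "UNKNOWN"


class ImputationMethods(Enum):
    MEAN_IMPUTATION = "MEAN_IMPUTATION"
    MODE_IMPUTATION = "MODE_IMPUTATION"
    KNN_IMPUTATION = "KNN_IMPUTATION"
    LINEAR_INTERPOLATION_IMPUTATION = "LINEAR_INTERPOLATION_IMPUTATION"


def menu(enum_cls):
    """Return the numbered menu lines for an Enum, in definition order."""
    return [f"{i}) {member.name}" for i, member in enumerate(enum_cls, start=1)]


def pick(enum_cls, answer):
    """Map a 1-based menu answer (as typed by the user) to an Enum member."""
    members = list(enum_cls)
    index = int(answer) - 1
    if not 0 <= index < len(members):
        raise ValueError(f"answer must be between 1 and {len(members)}")
    return members[index]
```

Generating the menu from the `Enum` keeps the prompt and the accepted answers in sync when a member is added.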

In [19]:
single_data_format_file_test = get_from_user_multi_view_dataset_fromat_file(pseudo_adni)

+++++++++ Editing Global Thresholds +++++++++++
Do you want to add a threshold for the minimum number of samples each dataset should contain?
1) YES
2) NO
2
Do you want to add a threshold for the minimum number of missing data each feature should contain?
1) YES
2) NO
2
++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++ Now parsing view: pseudo_adni_mod_2.csv +++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++
displaying first 10 values of feature CDRSB.bl in view: pseudo_adni_mod_2.csv (n_feature: 1/16)
0    b
1    a
2    a
3    a
4    a
5    b
6    e
7    a
8    d
9    c
Name: CDRSB.bl, dtype: object
number of differents samples: 8 / total of samples: 1000
specify data type for CDRSB.bl:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
1
data_type DataType.KEY KEY
CATEGORICAL KEY
Datetime ambiguity: Please specify if variable CDRSB.bl is a date or not
1) YES
2) NO
2
found  <class 'str'>
<class 'str'> <class 'str

specify data type for AGE:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
7
Ignoring feature AGE


In [6]:
single_data_format_file_test

{'global_thresholds': {'min_nb_samples': 1000, 'min_nb_missing_samples': 10},
 'pseudo_adni_mod_2.csv': {'CDRSB.bl': {'data_format': 'CATEGORICAL',
   'data_type': 'CHARACTER',
   'values': ["<class 'str'>"],
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None},
  'ADAS11.bl': {'data_format': 'DATETIME',
   'data_type': 'DATETIME',
   'values': ["<class 'numpy.datetime64'>"],
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None},
  'MMSE.bl': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': None,
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None}}}

## single_data_format_file_test

In [16]:
multi_data_format_file_test = get_from_user_multi_view_dataset_fromat_file(multi_view_dataframe)

+++++++++ Editing Global Thresholds +++++++++++
Do you want to add a threshold for the minimum number of samples each dataset should contain?
1) YES
2) NO
1
enter threshold minimal number of samples per dataset
10000
Do you want to add a threshold for the minimum number of missing data each feature should contain?
1) YES
2) NO
1
enter threshold minimal number of missing sample per variable
10
++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++ Now parsing view: file1 +++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++
displaying first 10 values of feature a in view: file1 (n_feature: 1/18)
0    48
1    87
2    46
3    84
4    94
5    18
6    15
7    30
8    54
9    46
Name: a, dtype: int64
number of differents samples: 57 / total of samples: 100
specify data type for a:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
6
data_type DataType.UNKNOWN UNKNOWN
CATEGORICAL UNKNOWN
Allow a to have missing values:
1) YES
2)

specify data type for i.1:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
7
Ignoring feature i.1
displaying first 10 values of feature o.1 in view: file1 (n_feature: 15/18)
0     8
1    10
2    22
3    79
4    90
5     2
6    85
7    64
8    47
9    30
Name: o.1, dtype: int64
number of differents samples: 69 / total of samples: 100
specify data type for o.1:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
7
Ignoring feature o.1
displaying first 10 values of feature gender in view: file1 (n_feature: 16/18)
0      MAN
1      MAN
2    WOMAN
3    WOMAN
4      MAN
5      MAN
6    WOMAN
7    WOMAN
8    WOMAN
9      MAN
Name: gender, dtype: object
number of differents samples: 2 / total of samples: 100
specify data type for gender:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
3
data_type DataType.CATEGORICAL CATEGORICAL
CATEGORICAL CATEGO

specify data type for pH:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
2
data_type DataType.QUANTITATIVE QUANTITATIVE
CATEGORICAL QUANTITATIVE
found  <class 'float'>
found  <class 'numpy.float64'>
float64 <class 'numpy.float64'> [dtype('float64')]
Allow pH to have missing values:
1) YES
2) NO
2
displaying first 10 values of feature pkey in view: file2 (n_feature: 7/7)
0    kkmjozalfyirgsire ui
1    xkdawggpnuulcewuoyzz
2    khuulhwgwnjggrfoefce
3    xxysdmwwmjsmyhaswfdb
4    ldejfuij mnbnf wwmms
5    pvhkafscqfwzofgqziko
6    rgggl wzpfkfbftmtjoo
7    stjrqcvljprtvralmnil
8    kms wptwzta nzdbkncc
9    kzrybjgjxm rde xqmra
Name: pkey, dtype: object
number of differents samples: 100 / total of samples: 100
specify data type for pkey:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
1
data_type DataType.KEY KEY
CATEGORICAL KEY
Datetime ambiguity: Please specify if variable pkey is a da

In [17]:
multi_data_format_file_test 

{'global_thresholds': {'min_nb_samples': 10000, 'min_nb_missing_samples': 10},
 'file1': {'a': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': None,
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None},
  'e': {'data_format': 'DATETIME',
   'data_type': 'DATETIME',
   'values': ["<class 'numpy.datetime64'>"],
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None},
  'i': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': None,
   'is_missing_values': True,
   'data_imputation_method': 'LINEAR_INTERPOLATION_IMPUTATION',
   'data_imputation_parameters': None,
   'data_imputation_variables': {'file1': ['a',
     'e',
     'i',
     'o',
     '0',
     '1',
     '2',
     '3',
     'time',
     'pressure',
     'sp02',
     'a.1',
     'e.1',
     'i.1',
     'o.1',
     'gender',

Data format file to be filled in by clinicians (step 2 in the workflow):

The data format file is a dictionary specifying the type of each feature:


```
{<view_name>: {<feature_name>: {'data_format': <data_format>,
                                'data_type': <data_type>,
                                'type': <values_taken>,
                                'is_missing_values': <True/False>,
                                'data_imputation_method': <name_of_data_imputation_method>,
                                'data_imputation_parameters': <parameters_of_imputation_method>,
                                'lower_bound': <float>,
                                'upper_bound': <float>,
                                'categorical_value': <List[values]>}
               }
}
```

where
* `<view_name>` is the name of the view
* `<feature_name>` is the name of the feature
* `<data_type>` can be categorical, continuous, missing_data or datetime
* `<values_taken>` is the type of the values (e.g. int, str, float, ...)
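
As an illustration of how one feature entry of this dictionary could drive a sanity check, here is a minimal sketch. The function `check_value` is hypothetical, `None` is assumed to mark a missing value, and `is_missing_values: True` is assumed to mean that missing values are allowed.

```python
def check_value(value, spec):
    """Return the list of problems found for one value against one feature entry."""
    problems = []
    if value is None:
        # Assumption: None marks a missing value.
        if not spec.get("is_missing_values", False):
            problems.append("missing value not allowed")
        return problems

    lower = spec.get("lower_bound")
    upper = spec.get("upper_bound")
    if lower is not None and value < lower:
        problems.append(f"below lower_bound {lower}")
    if upper is not None and value > upper:
        problems.append(f"above upper_bound {upper}")

    allowed = spec.get("categorical_value")
    if allowed is not None and value not in allowed:
        problems.append("not an allowed category")
    return problems
```

For example, with `spec = {'is_missing_values': False, 'lower_bound': 0.0, 'upper_bound': 14.0}`, a pH of 7.2 yields no problem, while 15 is reported as above the upper bound.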


Saving `format_file_ref` 

In [20]:
save_format_file_ref(single_data_format_file_test, 'single_format_file_ref_test2')

Format Reference File successfully saved at single_format_file_ref_test2


In [18]:

save_format_file_ref(multi_data_format_file_test, 'int_multi_view_format_test')

Format Reference File successfully saved at int_multi_view_format_test
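
Since the format reference is a JSON file, saving and reloading it amounts to a JSON round trip. The sketch below uses the standard `json` module; the helper names `save_format_ref` and `load_format_ref` are hypothetical and are not the functions of `fedbiomed.common.data_tool.utils`.

```python
import json
from pathlib import Path


def save_format_ref(format_ref, path):
    """Write the format reference dictionary to `path` as JSON."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(format_ref, handle, indent=2)
    return path


def load_format_ref(path):
    """Read a format reference dictionary back from a JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

Python `None` values are written as JSON `null` and read back as `None`, so entries such as `'data_imputation_method': None` survive the round trip.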


## 2.3 edit an existing `data_format_ref` file

The `edit_format_file_ref` CLI allows the user to specify additional information on variables for data sanity checking, including:
 - lower and upper bounds
 - categorical_value (e.g. MALE, FEMALE)
 - data_type (not inferred anymore)
 - data imputation method (not inferred anymore)
 - date format (DOES NOT WORK YET)
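
A non-interactive equivalent of such an edit can be sketched as a function that updates one feature entry of the dictionary. The name `edit_feature` and the set of editable fields are assumptions of this sketch; the key `categorical_values` follows the spelling that appears in the edited dictionary printed in this notebook.

```python
import copy


def edit_feature(format_ref, view, feature, **changes):
    """Return a copy of `format_ref` with `changes` applied to one feature entry."""
    editable = {"data_type", "lower_bound", "upper_bound",
                "categorical_values", "data_imputation_method"}
    unknown = set(changes) - editable
    if unknown:
        raise KeyError(f"unsupported fields: {sorted(unknown)}")

    updated = copy.deepcopy(format_ref)  # leave the caller's dictionary untouched
    entry = updated[view][feature]
    entry.update(changes)

    # Bounds must stay consistent once the edit is applied.
    lower, upper = entry.get("lower_bound"), entry.get("upper_bound")
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("lower_bound must not exceed upper_bound")
    return updated
```

Working on a deep copy means a rejected edit never leaves the original format reference half-modified.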

In [11]:
from fedbiomed.common.data_tool.data_format_ref_cli import edit_format_file_ref
from  fedbiomed.common.data_tool import utils

single_data_format_file_test = edit_format_file_ref(single_data_format_file_test)

+++++++ Now editing format file ref ++++++++++
Edit file: pseudo_adni_mod_2.csv?
1) YES
2) NO
1
Edit variable: CDRSB.bl? (type: CATEGORICAL)
1) YES
2) NO
1
Which field should be modified?
1) data_type
2) Values taken
3) Data Value imputation method
4) Cancel Operation
4
action 4 4 {'1': <function ask_for_data_type at 0x7f2592f2f3a0>, '2': <function ask_for_categorical_values at 0x7f2592f2f4c0>, '3': <function ask_for_data_imputation_method at 0x7f2592f2f430>, '4': <function cancel_operation at 0x7f2592f2f160>}
Edit variable: ADAS11.bl? (type: DATETIME)
1) YES
2) NO
1
Which field should be modified?
1) data_type
2) lower bound
3)upper bound
4) Date format
5) Cancel Operation
2
action 2 5 {'1': <function ask_for_data_type at 0x7f2592f2f3a0>, '2': <function ask_for_lower_bound at 0x7f2592f2f1f0>, '3': <function ask_for_upper_bound at 0x7f2592f2f280>, '4': <function ask_for_date_format at 0x7f2592f2f550>, '5': <function cancel_operation at 0x7f2592f2f160>}
enter lower bound0
Continue Editi

In [10]:
single_data_format_file_test

{'global_thresholds': {'min_nb_samples': 1000, 'min_nb_missing_samples': 10},
 'pseudo_adni_mod_2.csv': {'CDRSB.bl': {'data_format': 'CATEGORICAL',
   'data_type': 'CHARACTER',
   'values': ["<class 'str'>"],
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None,
   'categorical_values': ['1']},
  'ADAS11.bl': {'data_format': 'DATETIME',
   'data_type': 'DATETIME',
   'values': ["<class 'numpy.datetime64'>"],
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None,
   'lower_bound': 0.0},
  'MMSE.bl': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': None,
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None}}}

In [12]:
save_format_file_ref(single_data_format_file_test, 'single_format_file_ref_test3')

Format Reference File successfully saved at single_format_file_ref_test3


In [8]:
from fedbiomed.common.data_tool.data_format_ref_cli import get_from_user_multi_view_dataset_fromat_file, edit_format_file_ref
from  fedbiomed.common.data_tool import utils

multi_data_format_file_test = utils.load_format_file_ref('int_multi_view_format_test')

edit_multi_view_format_file_test = edit_format_file_ref(multi_data_format_file_test)

+++++++ Now editing format file ref ++++++++++
Edit file: file1?
1) YES
2) NO
1


KeyboardInterrupt: Interrupted by user

In [2]:
save_format_file_ref(edit_multi_view_format_file_test, 'edit_multi_view_format_file_test3')

Format Reference File successfully saved at edit_multi_view_format_file_test3


In [4]:
edit_multi_view_format_file_test

{'global_thresholds': {'min_nb_samples': 40, 'min_nb_missing_samples': 20},
 'file1': {'a': {'data_format': 'QUANTITATIVE',
   'data_type': 'DISCRETE',
   'values': 'int64',
   'is_missing_values': True,
   'data_imputation_method': 'LINEAR_INTERPOLATION_IMPUTATION',
   'data_imputation_parameters': None,
   'data_imputation_variables': {'file1': ['e',
     'i',
     'o',
     '0',
     '1',
     '2',
     '3',
     'time',
     'pressure',
     'sp02',
     'a.1',
     'e.1',
     'i.1',
     'o.1',
     'gender',
     'blood type',
     'pkey'],
    'contatct': [],
    'file2': []}},
  'e': {'data_format': 'QUANTITATIVE',
   'data_type': 'DISCRETE',
   'values': 'int64',
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None,
   'data_imputation_variables': None},
  'i': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': ['U', 'N', 'K', 'N', 'O', 'W', 'N'],
   'is_missing_values': False,
   'data_imputation_method': Non

## 2.4 create custom DataType

This may be reworked because it depends on a third-party package, `aenum`.
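
One possible direction that avoids the third-party dependency is the functional API of the standard `enum` module, which can build a new `Enum` from a mapping of names to values. The sketch below is only an assumption about how custom types could be added; `with_custom_member` is a hypothetical helper and the `DataType` class is redefined here so that the example is self-contained.

```python
from enum import Enum


class DataType(Enum):
    KEY = "KEY"
    QUANTITATIVE = "QUANTITATIVE"
    CATEGORICAL = "CATEGORICAL"
    DATETIME = "DATETIME"
    UNKNOWN = "UNKNOWN"


def with_custom_member(base, name, value):
    """Build a new Enum holding the members of `base` plus one custom member."""
    members = {member.name: member.value for member in base}
    if name in members:
        raise ValueError(f"{name} already exists in {base.__name__}")
    members[name] = value
    # Functional API: Enum(class_name, mapping_of_names_to_values)
    return Enum(base.__name__, members)
```

The returned class is a new `Enum`: the original `DataType` is left unchanged, and members of the two classes do not compare equal, so code has to use the extended class consistently.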