Related to user story: [SP11-Item04: General Data Wrapper PoC](https://gitlab.inria.fr/fedbiomed/fedbiomed/-/issues/164)

## Tabular dataset

Data pre-processing workflow:

1. Column names should be shared with the researcher.
2. A data format file is filled in by clinicians.
3. Specify whether missing data are allowed for a given column (Exception). The file will be used for data verification during FL pre-processing.
4. Verify outliers for quantitative data (continuous and discrete) and for dates (Critical warning).
5. Impute missing data by local mean (or optionally NN), or by majority voting for discrete labels. Give warnings when missing data are found (for a posteriori verification).
6. Give a critical warning when too many values are missing (>50%).
7. Verify that the number of available data is greater than the minimum required (Error).

Critical warnings have different levels of disclosure to the researcher: 1) only the warning, 2) the type of warning, 3) the type of warning and the column affected.
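Steps 3, 5, 6 and 7 of the workflow above can be sketched for a single column as follows. This is a minimal illustration using plain Python lists with `None` as the missing-value marker, not the Fed-BioMed implementation. The function name `check_and_impute`, its parameters, and the reading of "minimum required" as an inclusive bound are assumptions; outlier verification (step 4) is not covered.

```python
from collections import Counter
from statistics import mean

MISSING_RATIO_THRESHOLD = 0.5  # step 6: more than 50% missing is a critical warning


def check_and_impute(values, allow_missing, quantitative, min_samples):
    """Return (imputed_values, warnings) for one column; None marks a missing entry.

    Hypothetical helper: name and signature are illustrative only.
    """
    warnings = []
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)

    # Step 7: too few usable samples is an error (bound assumed inclusive)
    if len(present) < min_samples:
        raise ValueError(f"only {len(present)} samples available, {min_samples} required")

    if n_missing == 0:
        return list(values), warnings

    # Step 3: missing data in a column that forbids them raises an exception
    if not allow_missing:
        raise ValueError("missing data found in a column that does not allow them")

    # Steps 5 and 6: warn, and escalate when too many entries are missing
    warnings.append(f"WARNING: {n_missing} missing value(s) imputed")
    if n_missing / len(values) > MISSING_RATIO_THRESHOLD:
        warnings.append("CRITICAL WARNING: more than 50% of values are missing")

    # Step 5: local mean for quantitative data, majority vote for discrete labels
    fill = mean(present) if quantitative else Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values], warnings


ages, age_warnings = check_and_impute([45, None, 33, 54], True, True, min_samples=2)
```

Here `min_samples` should be at least 1 so that a fill value can always be computed from the observed entries.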

# 1. What is a `data_format_ref` file


A `data_format_ref_file` is a JSON file containing the skeleton of the data. 
It should be created by a clinician and sent to the nodes in order to check the integrity of each node's dataset.
In case there is a mismatch between nodes dataset and `data_format_ref_file` 

Structure of a `data_format_file_ref`:



# 2. Creating a `data_format_ref` file by loading a dataset

In this notebook, we provide a way to create a `data_format_ref_file` from an existing dataset. 
The user can then edit it (i.e. change the `data_format_file_ref` file and complete the information it already contains).

## 2.1 Loading tabular dataset

For that, we provide a `load_tabular_datasets` function able to load tabular datasets stored as a `*.csv` file, a `*.xls` file, or a folder of `*.csv` files.

In [1]:
from fedbiomed.common.data_tool.utils import load_tabular_datasets

from fedbiomed.common.data_tool.data_format_ref_cli import get_from_user_multi_view_dataset_fromat_file, edit_format_file_ref

In [3]:
# it can load excel files
load_tabular_datasets(r'../../Exceltest.xlsx')

file found
err 'utf-8' codec can't decode byte 0x8c in position 15: invalid start byte in file ../../Exceltest.xlsx


{'Exceltest.xlsx':    ID   Age Eligibility
 0    1   45           Y
 1    2   45           Y
 2    3   33           N
 3    4   54           Y
 4    5   45           Y
 5    6   54         NaN
 6    7   34           N
 7    8   54         NaN
 8    9   45         NaN
 9   10   44           Y}

In [4]:
# simple csv dataset

single_view_dataset = load_tabular_datasets(r'/user/ybouilla/home/Documents/data/pseudo_adni_mod/pseudo_adni_mod.csv')

file found


In [5]:
single_view_dataset

{'pseudo_adni_mod.csv':      CDRSB.bl  ADAS11.bl  MMSE.bl  RAVLT.immediate.bl  RAVLT.learning.bl  \
 0           1          8     27.0           23.739439                4.0   
 1           0          0     30.0           64.933800                9.0   
 2           0          8     24.0           36.987722                3.0   
 3           0          3     29.0           50.314425                5.0   
 4           0          0     30.0           57.217830                9.0   
 ..        ...        ...      ...                 ...                ...   
 995         1          2     29.0           61.896022                8.0   
 996         0          1     29.0           62.083170                8.0   
 997         3         14     24.0           22.289059                2.0   
 998         0         13     26.0           31.650504                2.0   
 999         0         15     28.0           29.089863                3.0   
 
      RAVLT.forgetting.bl  FAQ.bl  WholeBrain.bl  V

In [7]:
!tree demo_poc_data_wrapper/data/test7

demo_poc_data_wrapper/data/test7
├── contatct
├── file1
└── file2

0 directories, 3 files


In [2]:
# folder of dataset
multi_view_dataframe =  load_tabular_datasets('demo_poc_data_wrapper/data/test7')


directory found
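
The outputs above suggest that the loader returns a dictionary mapping each file name to its table, whether the path is a single file or a folder. The following standard-library sketch reproduces only that shape for CSV input. `load_csv_views` is a hypothetical name, rows are returned as dictionaries of strings rather than DataFrames, and Excel files are not handled.

```python
import csv
from pathlib import Path


def load_csv_views(path):
    """Return {file_name: list_of_row_dicts} for one CSV file or a folder of CSV files.

    Hypothetical stand-in for the loader shown above, limited to CSV content.
    """
    path = Path(path)
    if path.is_dir():
        # One view per regular file found directly inside the folder
        files = sorted(p for p in path.iterdir() if p.is_file())
    elif path.is_file():
        files = [path]
    else:
        raise FileNotFoundError(path)

    views = {}
    for file in files:
        with open(file, newline="", encoding="utf-8") as handle:
            # The first line of each file is taken as the column names
            views[file.name] = list(csv.DictReader(handle))
    return views
```

Keying the result by file name means that a folder such as `demo_poc_data_wrapper/data/test7` would yield one entry per view.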


## 2.2 Creating the `data_format_ref` file from the loaded dataset

Let's assume the loaded dataset is a reference dataset from which we want to create a `data_format_ref`. 


### Workflow

```
for each view:
    for each feature within the view:
        ask the user to select one type among those in the `Enum` class `DataType` (if the user wants to add this feature to `data_format_ref`)
        ask the user whether the variable can have missing data
        ask the user which data imputation method to use
        (imputation methods are those specified in the `Enum` class `ImputationMethods`)
        
```

**Default DataType**:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) UNKNOWN 


**Data Imputation Methods**

1) MEAN_IMPUTATION
2) MODE_IMPUTATION
3) KNN_IMPUTATION
4) LINEAR_INTERPOLATION_IMPUTATION
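
The loop described in the workflow can be sketched without interactive prompts by passing the user's answers through a callback. The member names of `DataType` and `ImputationMethods` come from the lists above; their integer values, the `build_format_ref` function and the `ask` callback are assumptions made for illustration.

```python
from enum import Enum


class DataType(Enum):  # member names from the list above; values are placeholders
    KEY = 1
    QUANTITATIVE = 2
    CATEGORICAL = 3
    DATETIME = 4
    UNKNOWN = 5


class ImputationMethods(Enum):  # values are placeholders as well
    MEAN_IMPUTATION = 1
    MODE_IMPUTATION = 2
    KNN_IMPUTATION = 3
    LINEAR_INTERPOLATION_IMPUTATION = 4


def build_format_ref(views, ask):
    """Build a nested format dictionary.

    views: {view_name: [feature_name, ...]}
    ask(view, feature): returns None to ignore the feature, or a tuple
    (DataType, allow_missing, ImputationMethods or None).
    """
    format_ref = {}
    for view_name, features in views.items():
        format_ref[view_name] = {}
        for feature in features:
            answer = ask(view_name, feature)
            if answer is None:  # the user chose to ignore this column
                continue
            data_type, allow_missing, method = answer
            format_ref[view_name][feature] = {
                "data_format": data_type.name,
                "is_missing_values": allow_missing,
                "data_imputation_method": method.name if method else None,
            }
    return format_ref


# Scripted answers standing in for the interactive prompts
answers = {"a": (DataType.QUANTITATIVE, True, ImputationMethods.KNN_IMPUTATION), "b": None}
ref = build_format_ref({"file1": ["a", "b"]}, lambda view, feature: answers[feature])
```

Separating the questions (`ask`) from the loop makes the same logic usable from a CLI, a notebook widget, or a test.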

In [3]:
multi_data_format_file_test = get_from_user_multi_view_dataset_fromat_file(multi_view_dataframe)

++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++ Now parsing view: file1 +++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++
displaying first 10 values of feature a (n_feature: 1/18)
0    48
1    87
2    46
3    84
4    94
5    18
6    15
7    30
8    54
9    46
Name: a, dtype: int64
number of differents samples: 57 / total of samples: 100
specify data type for a:
1) KEY 
2) QUANTITATIVE 
3) CATEGORICAL 
4) DATETIME 
5) CUSTOM 
6) UNKNOWN 
7) ignore this column
2
QuantitativeDataType.DISCRETE int64
QuantitativeDataType.DISCRETE | [dtype('int64')]
Allow a to have missing values:
1) YES
2) NO
1
Please select the following method for filling missing values (if some are found)
1) MEAN_IMPUTATION
2) MODE_IMPUTATION
3) KNN_IMPUTATION
4) LINEAR_INTERPOLATION_IMPUTATION
5) No method
3
Selected: KNN_IMPUTATION

please specify k value:
2
imput param {'k': '2'}
Please select view that contained desired feature:
1) file1 
2) contatct 
3) file2 
4) select all views
1
Please

KeyboardInterrupt: Interrupted by user

In [5]:
multi_data_format_file_test

{'file1': {'o': {'data_format': 'QUANTITATIVE',
   'data_type': 'DISCRETE',
   'values': 'int64',
   'is_missing_values': True,
   'data_imputation_method': None,
   'data_imputation_parameters': None},
  '0': {'data_format': 'CATEGORICAL',
   'data_type': 'BOOLEAN',
   'values': 'bool',
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None},
  '3': {'data_format': 'UNKNOWN',
   'data_type': 'UNKNOWN',
   'values': 'bool',
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None},
  'time': {'data_format': 'DATETIME',
   'data_type': 'DATETIME',
   'values': 'object',
   'is_missing_values': False,
   'data_imputation_method': None,
   'data_imputation_parameters': None},
  'pressure': {'data_format': 'QUANTITATIVE',
   'data_type': 'CONTINUOUS',
   'values': 'float64',
   'is_missing_values': True,
   'data_imputation_method': 'LINEAR_INTERPOLATION_IMPUTATION',
   'data_imputation_paramete

Data format file to be filled in by clinicians (step 2 in the workflow):

The data format file is a dictionary specifying the type of each feature: 


```
{<view_name>: {<feature_name>: {'data_format': <data_format>,
                                'data_type': <data_type>,
                                'values': <values_taken>,
                                'is_missing_values': <True/False>,
                                'data_imputation_method': <name_of_data_imputation_method>,
                                'data_imputation_parameters': <parameters_of_imputation_method>,
                                'lower_bound': <float>,
                                'upper_bound': <float>,
                                'categorical_value': <List[values]>}
               }
}
```

where
* `<view_name>` is the name of the view
* `<feature_name>` is the name of the feature
* `<data_format>` is the general type selected for the feature (e.g. QUANTITATIVE, CATEGORICAL, DATETIME)
* `<data_type>` can be categorical or continuous or missing_data or datetime
* `<values_taken>` is the type of the value (e.g. int, str, float, ...)
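
To illustrate how one feature entry of this dictionary could drive verification, here is a minimal sketch that checks a column against `is_missing_values`, `lower_bound`, `upper_bound` and `categorical_value`. The function `check_column` is hypothetical, and `None` is assumed to mark a missing entry.

```python
def check_column(values, entry):
    """Return a list of problems found in `values` for one feature entry.

    Hypothetical checker: only the four keys read below are considered.
    """
    problems = []
    lower = entry.get("lower_bound")
    upper = entry.get("upper_bound")
    allowed = entry.get("categorical_value")

    for index, value in enumerate(values):
        if value is None:
            # Missing entries are only a problem when the entry forbids them
            if not entry.get("is_missing_values", False):
                problems.append(f"row {index}: missing value not allowed")
            continue
        if lower is not None and value < lower:
            problems.append(f"row {index}: {value} below lower_bound {lower}")
        if upper is not None and value > upper:
            problems.append(f"row {index}: {value} above upper_bound {upper}")
        if allowed is not None and value not in allowed:
            problems.append(f"row {index}: {value!r} not in categorical_value")
    return problems


age_entry = {"data_format": "QUANTITATIVE", "data_type": "DISCRETE",
             "is_missing_values": False, "lower_bound": 0, "upper_bound": 120}
problems = check_column([45, None, 130], age_entry)
```

In this example the second row is reported as a forbidden missing value and the third as exceeding `upper_bound`.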


In [9]:
from typing import Any, Dict
import json

def save_format_file_ref(format_file_ref: Dict[str, Dict[str, Any]], path: str):
    # save `format_file_ref` into a JSON file
    with open(path, "w") as format_file:
        json.dump(format_file_ref, format_file)
    print(f"Model successfully saved at {path}")
    
    
save_format_file_ref(multi_data_format_file_test, 'multi_data_format_file_test.json')

Model successfully saved at multi_data_format_file_test.json
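
Because the file is plain JSON, it can be read back into the same nested dictionary. The sketch below shows such a round trip in a temporary folder. `read_format_file_ref` is a hypothetical helper written for this example and is not the `load_format_file_ref` function from the package.

```python
import json
import os
import tempfile


def read_format_file_ref(path):
    """Read a format file saved as JSON back into a nested dictionary."""
    with open(path, "r", encoding="utf-8") as format_file:
        return json.load(format_file)


# Small illustrative entry; booleans and None survive the JSON round trip
example = {"file1": {"pressure": {"data_format": "QUANTITATIVE",
                                  "is_missing_values": True,
                                  "data_imputation_method": None}}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "format_ref.json")
    with open(path, "w", encoding="utf-8") as format_file:
        json.dump(example, format_file, indent=2)  # indent keeps the file readable
    restored = read_format_file_ref(path)
```

Note that JSON object keys are always strings, so feature names such as `'0'` or `'3'` stay strings after loading.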


## 2.3 Editing an existing `data_format_ref` file

The `edit_format_file_ref` CLI lets the user specify additional information used for checking, including:
 - lower and upper bounds
 - categorical_value (e.g. MALE, FEMALE)
 - data_type (no longer inferred)
 - data imputation method (no longer inferred)

In [13]:
from fedbiomed.common.data_tool.data_format_ref_cli import get_from_user_multi_view_dataset_fromat_file, edit_format_file_ref
from fedbiomed.common.data_tool import utils


multi_data_format_file_test = utils.load_format_file_ref('multi_format_file')

edit_multi_view_format_file_test = edit_format_file_ref(multi_data_format_file_test)

Now editing format file ref
Edit file: file1?
1) YES
2) NO
1
Edit variable: e?
1) YES
2) NO
1
Which field should be modified?
1) data_type
2) lower bound
3)upper bound
4) Data Value amputation method
5) Cancel Operation
4
action 4 5 {'1': <function ask_for_data_type at 0x7f931e5edca0>, '2': <function ask_for_lower_bound at 0x7f931e5edaf0>, '3': <function ask_for_upper_bound at 0x7f931e5edb80>, '4': <function ask_for_data_amputation_method at 0x7f931e5edd30>, '5': <function cancel_operation at 0x7f931e5eda60>}
Please select the following method for filling missing values (if some are found)
1) MEAN_IMPUTATION
2) MODE_IMPUTATION
3) KNN_IMPUTATION
4) LINEAR_INTERPOLATION_IMPUTATION
5) No method
4
Continue Editing variable: e?
1) YES
2) NO
1
Which field should be modified?
1) data_type
2) lower bound
3)upper bound
4) Data Value amputation method
5) Cancel Operation
4
action 4 5 {'1': <function ask_for_data_type at 0x7f931e5edca0>, '2': <function ask_for_lower_bound at 0x7f931e5edaf0>, '3':

KeyboardInterrupt: Interrupted by user
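
The same kind of edit can be expressed programmatically on the nested dictionary. The following sketch is an assumption about how such an update might look and does not reproduce the CLI's internals. `edit_entry` and `EDITABLE_FIELDS` are hypothetical names.

```python
import copy

# Fields listed at the start of this section as editable (key names assumed)
EDITABLE_FIELDS = {"data_type", "lower_bound", "upper_bound",
                   "categorical_value", "data_imputation_method"}


def edit_entry(format_ref, view, feature, **changes):
    """Return a copy of `format_ref` with `changes` applied to one feature."""
    unknown = set(changes) - EDITABLE_FIELDS
    if unknown:
        raise KeyError(f"fields not editable: {sorted(unknown)}")

    updated = copy.deepcopy(format_ref)  # leave the caller's dictionary untouched
    entry = updated[view][feature]
    entry.update(changes)

    # Reject inconsistent bounds before the file is used for verification
    lower, upper = entry.get("lower_bound"), entry.get("upper_bound")
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("lower_bound must not exceed upper_bound")
    return updated


ref = {"file1": {"e": {"data_format": "QUANTITATIVE", "data_type": "CONTINUOUS"}}}
edited = edit_entry(ref, "file1", "e", lower_bound=0.0, upper_bound=10.0,
                    data_imputation_method="LINEAR_INTERPOLATION_IMPUTATION")
```

Returning a modified copy keeps the original reference available in case the edit has to be cancelled.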

## 2.4 Creating a custom DataType

This part may be reworked because it depends on the third-party package `aenum`.

In [4]:
from demo_poc_data_wrapper.data_type import extend_data_type_properties, extend_data_type


extend_data_type_properties('MYDATATYPE', (False, False, False, False, False))

extend_data_type('MYDATATYPE', 'datatype')
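
As one possible direction for removing the `aenum` dependency, the standard library's functional `Enum` API can build a new enumeration that contains the existing members plus a custom one. This is only a sketch of that idea: `with_extra_member` is a hypothetical helper, the integer values are placeholders, and the property tuple passed to `extend_data_type_properties` is not modelled.

```python
from enum import Enum


class DataType(Enum):  # placeholder values; member names from the default list
    KEY = 1
    QUANTITATIVE = 2
    CATEGORICAL = 3
    DATETIME = 4
    UNKNOWN = 5


def with_extra_member(enum_cls, name, value):
    """Build a new Enum holding every member of `enum_cls` plus one extra member."""
    members = {member.name: member.value for member in enum_cls}
    if name in members:
        raise ValueError(f"{name} already exists in {enum_cls.__name__}")
    members[name] = value
    # Functional API: Enum(class_name, mapping of member names to values)
    return Enum(enum_cls.__name__, members)


ExtendedDataType = with_extra_member(DataType, "MYDATATYPE", "datatype")
```

Unlike an in-place extension, this produces a distinct class, so `ExtendedDataType.KEY` and `DataType.KEY` are different objects and code holding a reference to the old class would need to be updated.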

In [5]:
multi_data_format_file_test = get_from_user_multi_view_dataset_fromat_file(multi_view_dataframe)

++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++ Now parsing view: file1 +++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++
displaying first 10 values of feature a (n_feature: 1/18)
0    48
1    87
2    46
3    84
4    94
5    18
6    15
7    30
8    54
9    46
Name: a, dtype: int64
number of differents samples: 57 / total of samples: 100


KeyboardInterrupt: Interrupted by user