# Handling data

[kahan-summation]: https://en.wikipedia.org/wiki/Kahan_summation_algorithm "Kahan summation algorithm"

## The "dataBox"

```{eval-rst}
.. class:: flx.dataBox

   Handles the processing/storage of data-points.

   .. method:: __init__(M_in, M_out)

      Initialize an empty data-box.
      
      :param M_in: The dimension of the input vector. Value must be positive or *zero*.
      :type M_in: int
      :param M_out: The dimension of the output vector. Value must be positive or *zero*.
      :type M_out: int

   .. py:method:: write2mem(config)

      Allocate memory for storing data-points.

      :param config: The following keys are allowed in `config`:
    
         - ``N_reserve`` (type *unsigned long*): The total number of data-points the memory can hold.
         - ``cols`` (*string* or *list*): ... see parameter ``cols`` in :func:`flx.dataBox.write2file`.

      :type config: dict
      :rtype: None
      
   .. py:method:: extract_col_from_mem(col)

      Return a numpy-array that points to the memory of ``col``.

      :param col: An identifier for the data-column to extract.
      :type col: :type:`dataBox_colID`
      :rtype: numpy.ndarray[float]
      
   .. py:method:: free_mem()

      Free the allocated memory for storing data.

      :rtype: None

   .. py:method:: write2file(config)

      Set the handle to write data-points to a file / output stream.

      :param config: The following keys are allowed in `config`:
    
         - ``fname`` (type *string*): The name of the file to open for output.
         - ``append`` (*bool*, default: *True*): ``True``: Append output to an existing file. ``False``: Overwrite an existing file.
         - ``binary`` (*bool*, default: *True*): ``True``: Output data in binary format. ``False``: Create a human-readable text file.
         - ``cols`` (*string* or *list*): 
         
             If a *string* is provided, the following keywords are accepted:

                 - ``all``: all data columns (i.e., model output and model input) are sent to the output stream. The model output is written first, followed by the model input.
                 - ``all_in``: only the model input is written to the output stream.
                 - ``all_out``: only the model output is written to the output stream.
                 
             If a *list* is provided, the list must be composed of entries of type :type:`dataBox_colID`.             

      :type config: dict
      :rtype: None
     
   .. py:method:: read_from_file(config)

      Import data-points from a file. 
      
      The total number of values stored in the file must be a multiple of ``M_in+M_out``.

      :param config: The following keys are allowed in `config`:
    
         - ``fname`` (type *string*): The name of the file to read from.
         - ``binary`` (*bool*, default: *True*): ``True``: Input data in binary format. ``False``: Input from a human-readable text file.          

      :type config: dict
      :rtype: None
      
   .. py:method:: close_file()

      Closes an open file stream. No more samples will be written to the file.

      :rtype: None

   .. py:method:: register_post_processor(config)

      Registers a new :class:`post-processor<flx.dataBox.postProc>` and returns it.

      :param config: The configuration of the post-processor.
      :type config: :type:`dataBox_postProc`
      :rtype: :class:`flx.dataBox.postProc`
```

```{eval-rst}
.. py:type:: dataBox_colID
   :canonical: int | dict

   Syntax:
       ``COL``

   Description:   
       The configuration used to identify a data-column in a :type:`flx.dataBox`.

   The following types are accepted for `COL`:
   
     - ``int``: An *integer* that specifies the ID of a data-column. The numbering of column IDs for the model output starts with *zero*. The numbering of the column IDs for the model input starts with the total number of output columns. Value must be positive or *zero*.
     - ``dict``: A Python-dictionary that expects the following keys:

         - ``set`` (type *string*): Specifies how the ``id`` is interpreted. The value must either be ``full`` (on full set of data-columns), ``in`` (on data-columns of the input) or ``out`` (on data-columns of the output). 
         - ``id`` (type *int*): The index of the data-column.
```
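
The numbering rules above can be illustrated with a small, self-contained sketch. The helper `resolve_col_id` is hypothetical and not part of fesslix; it only mirrors the documented convention that output columns come first and input columns follow. It also assumes that ``id`` is counted from zero within the selected ``set``.

```python
def resolve_col_id(col, M_in, M_out):
    """Map a dataBox_colID-style value to an index in the full column set.

    Hypothetical helper for illustration only (not a fesslix function).
    """
    if isinstance(col, int):
        set_name, idx = "full", col
    else:
        set_name, idx = col["set"], col["id"]

    # (size, offset) of each set: outputs come first, inputs follow.
    layout = {
        "full": (M_in + M_out, 0),
        "out": (M_out, 0),
        "in": (M_in, M_out),
    }
    if set_name not in layout:
        raise ValueError(f"unknown set '{set_name}'")
    size, offset = layout[set_name]
    if not 0 <= idx < size:
        raise IndexError(f"id {idx} is out of range for set '{set_name}'")
    return offset + idx


# A data-box with two input columns and one output column:
print(resolve_col_id(0, M_in=2, M_out=1))                        # 0 (the output column)
print(resolve_col_id({"set": "in", "id": 1}, M_in=2, M_out=1))  # 2 (second input column)
```

Under these assumptions, ``{'set': 'in', 'id': 0}`` and the plain integer ``M_out`` refer to the same column.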

## Post-processors
### Overview

```{eval-rst}
.. py:type:: dataBox_postProc
   :canonical: dict

   Syntax:
       ``CONFIG``

   Description:   
       The configuration used to initialize a post-processor for a :type:`flx.dataBox`, where `CONFIG` is of type *dict*.

   The following keys are allowed independent of the :type:`type of the post-processor<dataBox_postProc_type>`:
     - ``type`` (:type:`dataBox_postProc_type`): The type of the post-processor (*required*).
     
   Additionally, depending on the specified ``type`` of the post-processor, other keys can be required; see section :ref:`content:basics:data:postproc:types`.
```

```{eval-rst}
.. type:: dataBox_postProc_type
   :canonical: str

   Syntax:
       ``TYPE``

   Description:
       Specifies the type of a :class:`post-processor<flx.dataBox.postProc>` for a :type:`flx.dataBox`.

   The following values/types for :class:`post-processors<flx.dataBox.postProc>` can be used:
     - ``mean_double`` » :ref:`content:basics:data:postproc:types:mean_double`
     - ``mean_pdouble`` » :ref:`content:basics:data:postproc:types:mean_pdouble`
     - ``mean_qdouble`` » :ref:`content:basics:data:postproc:types:mean_qdouble`
     - ``vdouble`` » :ref:`content:basics:data:postproc:types:vdouble`
     - ``reliability`` » :ref:`content:basics:data:postproc:types:reliability`

   The state of a :class:`post-processor<flx.dataBox.postProc>` can be retrieved by means of the function :func:`flx.dataBox.postProc.eval`.
```

```{eval-rst}
.. class:: flx.dataBox.postProc

   A post-processor for a :type:`flx.dataBox`.

   .. method:: eval()

       Syntax:
           ``flx.dataBox.postProc.eval()``

       Description:
           Returns the current state of the post-processor. The states of the different post-processors are documented in section :ref:`content:basics:data:postproc:types`.
        
       :rtype: dict
```

(content:basics:data:postproc:types)=
### Types

(content:basics:data:postproc:types:mean_double)=
#### ``mean_double``

```{eval-rst}
.. py:property:: mean_double

   A :class:`post-processor<flx.dataBox.postProc>` that tracks the mean of the data-column based on a floating-point variable.
   It is a fast post-processor.
   However, for large sums, accuracy can become an issue due to limited floating-point precision.

   Parametrization:
       Parameters of this post-processor can be specified as additional key-value pairs in an object of type :type:`dataBox_postProc`.
       The following parameters are accepted:

         - ``col`` (:type:`dataBox_colID`): An identifier for the data-column to track.

   States:
       When the function :func:`flx.dataBox.postProc.eval` is called on this post-processor, the following states are returned:

       - ``mean`` (*float*): The sample mean of the tracked data-column.
       - ``N`` (*int*): The total number of samples of the tracked data-column.

```

(content:basics:data:postproc:types:mean_pdouble)=
#### ``mean_pdouble``

```{eval-rst}
.. py:property:: mean_pdouble

   A :class:`post-processor<flx.dataBox.postProc>` that tracks the mean of the data-column based on a special floating-point variable that minimizes potential numerical summation errors (based on the *Kahan summation algorithm*).

   Parametrization:
       Parameters of this post-processor can be specified as additional key-value pairs in an object of type :type:`dataBox_postProc`.
       The following parameters are accepted:

         - ``col`` (:type:`dataBox_colID`): An identifier for the data-column to track.

   States:
       When the function :func:`flx.dataBox.postProc.eval` is called on this post-processor, the following states are returned:

       - ``mean`` (*float*): The sample mean of the tracked data-column.
       - ``N`` (*int*): The total number of samples of the tracked data-column.

```
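
To show why compensated summation matters, the following sketch compares a plain running sum with a textbook Kahan summation in pure Python. It is an illustration of the general algorithm, not the fesslix implementation; the function names and the test data are assumptions made for this example.

```python
import math


def naive_sum(values):
    """Plain running sum in double precision."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Textbook Kahan (compensated) summation."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - comp            # apply the correction from the previous step
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost
        total = t
    return total


# One large value followed by many values that are too small
# to change the running sum on their own.
values = [1.0] + [1e-16] * 100_000
reference = math.fsum(values)  # accurately rounded reference sum

print("naive error:", abs(naive_sum(values) - reference))
print("kahan error:", abs(kahan_sum(values) - reference))
```

In this data set the naive sum never leaves 1.0, because each small increment is rounded away, whereas the compensated sum stays close to the reference value.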

(content:basics:data:postproc:types:mean_qdouble)=
#### ``mean_qdouble``

```{eval-rst}
.. py:property:: mean_qdouble

   A :class:`post-processor<flx.dataBox.postProc>` that tracks the mean of the data-column based on a special floating-point variable that minimizes potential numerical summation errors (based on performing the summation in separate bins).

   Parametrization:
       Parameters of this post-processor can be specified as additional key-value pairs in an object of type :type:`dataBox_postProc`.
       The following parameters are accepted:

         - ``col`` (:type:`dataBox_colID`): An identifier for the data-column to track.
         - ``NpV`` (*int*): A number of points. Value must be larger than *zero*.
         - ``ppb`` (*bool*): If ``True``, ``NpV`` is interpreted as the number of points per summation bin. If ``False``, ``NpV`` is interpreted as the total number of samples, and the number of bins is estimated from this number.

   States:
       When the function :func:`flx.dataBox.postProc.eval` is called on this post-processor, the following states are returned:

       - ``mean`` (*float*): The sample mean of the tracked data-column.
       - ``N`` (*int*): The total number of samples of the tracked data-column.

```
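
The idea of summing in separate bins can be sketched as follows. This is a minimal illustration under stated assumptions, not the fesslix implementation: the function `binned_mean` and its argument `points_per_bin` are hypothetical, and only the case that corresponds to ``ppb=True`` (a fixed number of points per bin) is shown.

```python
def binned_mean(values, points_per_bin):
    """Mean via partial sums: each bin accumulates at most `points_per_bin` values.

    Hypothetical sketch; it keeps every partial sum small, so fewer
    low-order bits are lost than in one long running sum.
    """
    if points_per_bin <= 0:
        raise ValueError("points_per_bin must be larger than zero")

    bin_sums = []
    current = 0.0  # partial sum of the bin being filled
    in_bin = 0     # number of values in the current bin
    n = 0          # total number of samples
    for x in values:
        current += x
        in_bin += 1
        n += 1
        if in_bin == points_per_bin:
            bin_sums.append(current)
            current, in_bin = 0.0, 0
    if in_bin:
        bin_sums.append(current)  # last, partially filled bin
    if n == 0:
        raise ValueError("no samples")

    total = 0.0
    for s in bin_sums:
        total += s
    return {"mean": total / n, "N": n}


print(binned_mean([1.0, 2.0, 3.0, 4.0, 5.0], points_per_bin=2))
```

The returned dictionary uses the same keys as the documented states (``mean`` and ``N``) purely for readability.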

(content:basics:data:postproc:types:vdouble)=
#### ``vdouble``

```{eval-rst}
.. py:property:: vdouble

   A :class:`post-processor<flx.dataBox.postProc>` that tracks the mean and the variance of the data-column.
   A special floating-point variable (based on the Kahan summation algorithm) is used to increase floating-point precision.

   Parametrization:
       Parameters of this post-processor can be specified as additional key-value pairs in an object of type :type:`dataBox_postProc`.
       The following parameters are accepted:

         - ``col`` (:type:`dataBox_colID`): An identifier for the data-column to track.

   States:
       When the function :func:`flx.dataBox.postProc.eval` is called on this post-processor, the following states are returned:

       - ``mean`` (*float*): The sample mean of the tracked data-column.
       - ``sd`` (*float*): The sample standard deviation of the tracked data-column.
       - ``var`` (*float*): The sample variance of the tracked data-column.
       - ``N`` (*int*): The total number of samples of the tracked data-column.
       - ``rv_mean`` (:class:`flx.rv`): A random variable quantifying the uncertainty about the mean value of the tracked data-column.

```
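
The relationship between the states ``mean``, ``var``, ``sd`` and ``N`` can be illustrated with a small running-statistics sketch. Note the assumptions: the class `RunningMoments` is hypothetical, it uses Welford's online algorithm rather than the Kahan-based accumulation described above, and it assumes the unbiased sample variance (division by ``N-1``), which this page does not state explicitly. The ``rv_mean`` state is not covered.

```python
import math


class RunningMoments:
    """Tracks mean and variance one sample at a time (Welford's algorithm).

    Hypothetical illustration, not the fesslix implementation.
    """

    def __init__(self):
        self.N = 0
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.N += 1
        delta = x - self._mean
        self._mean += delta / self.N
        self._m2 += delta * (x - self._mean)

    def eval(self):
        # Assumption: unbiased estimator (N-1 in the denominator).
        var = self._m2 / (self.N - 1) if self.N > 1 else float("nan")
        return {"mean": self._mean, "var": var, "sd": math.sqrt(var), "N": self.N}


tracker = RunningMoments()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    tracker.add(x)
print(tracker.eval())
```

In this sketch, ``sd`` is always the square root of ``var``, which is consistent with the values printed in the examples further below.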

(content:basics:data:postproc:types:reliability)=
#### ``reliability``

```{eval-rst}
.. py:property:: reliability

   A :class:`post-processor<flx.dataBox.postProc>` that interprets the values of the data-column as a limit-state function and tracks the reliability.

   Parametrization:
       Parameters of this post-processor can be specified as additional key-value pairs in an object of type :type:`dataBox_postProc`.
       The following parameters are accepted:

         - ``col`` (:type:`dataBox_colID`): An identifier for the data-column to track.

   States:
       When the function :func:`flx.dataBox.postProc.eval` is called on this post-processor, the following states are returned:

       - ``N`` (*int*): The total number of samples of the tracked data-column.
       - ``H`` (*int*): The number of samples of the tracked data-column with a limit-state function value less than or equal to *zero*.
       - ``mean_freq`` (*float*): Frequentist estimate of the mean value.
       - ``mean_bayes`` (*float*): Bayesian estimate of the mean value.
       - ``rv_pf`` (:class:`flx.rv`): A random variable quantifying the uncertainty about the probability of failure of the tracked data-column.

```
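
How the counts ``N`` and ``H`` lead to point estimates can be sketched in a few lines. Several assumptions are made here: the function `reliability_states` is hypothetical, the estimates are taken to refer to the probability of failure, and the Bayesian estimate uses a uniform prior, giving the posterior mean ``(H+1)/(N+2)``. The prior that fesslix actually uses is not stated on this page and may differ.

```python
def reliability_states(g_values):
    """Count failures (g <= 0) and derive simple point estimates.

    Hypothetical illustration, not the fesslix implementation.
    """
    N = 0
    H = 0
    for g in g_values:
        N += 1
        if g <= 0.0:  # failure: limit-state value less than or equal to zero
            H += 1
    if N == 0:
        raise ValueError("no samples")
    return {
        "N": N,
        "H": H,
        "mean_freq": H / N,               # relative frequency of failures
        "mean_bayes": (H + 1) / (N + 2),  # assumption: uniform prior
    }


print(reliability_states([-1.0, 0.0, 2.0, 3.0, 5.0]))
```

With few or no observed failures, the two estimates differ noticeably: the frequentist estimate can be exactly zero, while the Bayesian estimate under this prior is always strictly positive.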

## Examples

```python
import fesslix as flx
flx.load_engine()
import fesslix.model_templates as flx_model_templates

import numpy as np
```

```text
Random Number Generator: MT19937 - initialized with rand()=962222853;
Random Number Generator: MT19937 - initialized with 1000 initial calls.
```


### Write samples to a file and to memory

```python
## ==============================================
## Generate model
## ==============================================
my_model = flx_model_templates.generate_reliability_R_S_example()

## ==============================================
## Set up dataBox
## ==============================================
dBox_1 = flx.dataBox(my_model['sampler'].get_NOX(),len(my_model['model']))

## ----------------------------------------
## set up writing to a file
## ----------------------------------------
dBox_1.write2file( {
    'fname': "mcs_samples.bin",
    'append': False,
    'binary': True,
    'cols': 'all'
    } )

## ----------------------------------------
## set up storing data in memory
## ----------------------------------------
dBox_1.write2mem( {
    'N_reserve': int(1e6),
    'cols': 'all'
    } )

## ----------------------------------------
## register post-processors
## ----------------------------------------
pp_1a = dBox_1.register_post_processor({ 'type':'mean_double', 'col':{ 'set':'in', 'id':0} })
pp_1b = dBox_1.register_post_processor({ 'type':'mean_pdouble', 'col':{ 'set':'in', 'id':0} })
pp_1c = dBox_1.register_post_processor({ 'type':'mean_qdouble', 'col':{ 'set':'in', 'id':0} })
pp_1d = dBox_1.register_post_processor({ 'type':'vdouble', 'col':{ 'set':'in', 'id':0} })

## ==============================================
## Perform the Monte Carlo simulation
## ==============================================
my_model['sampler'].perform_MCS(10000,my_model['model'],dBox_1)

## ==============================================
## Close the file stream of dBox_1
## ==============================================
dBox_1.close_file()

## ==============================================
## Extract a data-column from dBox_1
## ==============================================
data_fvec = dBox_1.extract_col_from_mem( { 'set':'in', 'id':1} )
print( f"mean of S: {np.mean(data_fvec):.2f}" )

## ==============================================
## Evaluate post-processors
## ==============================================

print( "pp_1a:", pp_1a.eval() )
print( "pp_1b:", pp_1b.eval() )
print( "pp_1c:", pp_1c.eval() )
pp_1d_res = pp_1d.eval()
print( "pp_1d:", pp_1d_res, pp_1d_res['rv_mean'].mean() )
```

```text
mean of S: 0.99
pp_1a: {'N': 10000, 'mean': 4.992860617932716}
pp_1b: {'N': 10000, 'mean': 4.992860617932716}
pp_1c: {'mean': 4.992860617932717, 'N': 10000}
pp_1d: {'mean': 4.992860617932716, 'var': 1.0009114343890506, 'sd': 1.000455613402739, 'N': 10000, 'rv_mean': <fesslix.core.rv object at 0x7c832d516ff0>} 4.992860617932716
```


### Import samples from a binary file

```python
## ==============================================
## Set up dataBox and post-processors
## ==============================================
dBox_2 = flx.dataBox(2,1)

pp_2a = dBox_2.register_post_processor({ 'type':'mean_qdouble', 'col':{ 'set':'in', 'id':0} })
pp_2b = dBox_2.register_post_processor({ 'type':'mean_qdouble', 'col':{ 'set':'in', 'id':1} })

dBox_2.write2file( {
    'fname': "mcs_samples.dat",
    'append': False,
    'binary': False,
    'cols': 'all_in'
    } )

## ==============================================
## import data from file
## ==============================================
dBox_2.read_from_file({ 'fname': "mcs_samples.bin", 'binary': True })

## ==============================================
## Evaluate post-processors
## ==============================================
print( "pp_2a:", pp_2a.eval() )
print( "pp_2b:", pp_2b.eval() )
```

```text
pp_2a: {'mean': 4.992860618948937, 'N': 10000}
pp_2b: {'mean': 0.9865246530708391, 'N': 10000}
```


### Import samples from a text-file

```python
## ==============================================
## Set up dataBox and post-processors
## ==============================================
dBox_3 = flx.dataBox(2,0)

pp_3a = dBox_3.register_post_processor({ 'type':'mean_qdouble', 'col':{ 'set':'in', 'id':0} })
pp_3b = dBox_3.register_post_processor({ 'type':'mean_qdouble', 'col':{ 'set':'in', 'id':1} })

## ==============================================
## import data from file
## ==============================================
dBox_3.read_from_file({ 'fname': "mcs_samples.dat", 'binary': False })

## ==============================================
## Evaluate post-processors
## ==============================================
print( "pp_3a:", pp_3a.eval() )
print( "pp_3b:", pp_3b.eval() )
```

```text
pp_3a: {'mean': 4.992860621699999, 'N': 10000}
pp_3b: {'mean': 0.9865246652292301, 'N': 10000}
```
