Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamping Matrices: Data Format Changes and Integration of Apache Arrow IPC Binary Format #2086

Open
hdinia opened this issue Jul 9, 2024 · 1 comment
Assignees
Labels
documentation Improvements or additions to documentation draft enhancement New feature or request
Milestone

Comments

@hdinia
Copy link
Member

hdinia commented Jul 9, 2024

Data Format Changes

Introduction:

The current data format used in our application has some limitations, such as a lack of clarity and the need for expensive calculated columns to be generated on the browser.
To address these issues, we are proposing a new data format that follows best practices such as having a single source of truth and separating the UI presentation layer from the logic layer.
The new format will also improve performance by reducing the need for calculated columns and provide a clearer data structure.
In this specification, we will outline the proposed changes to the data format and provide examples of how it will be implemented.

  1. The actual format:
    • The actual format provide only the data, and the UI information like timestamps or aggregation columns are generated by the frontend on the fly, at each request.
    • This approach is not ideal because it requires the frontend to perform expensive calculations, which could lead to poor performance and a poor user experience.
    • Additionally, it goes against the separation of concerns principle, as the frontend is responsible for both presenting the data and performing logic on it.
{
  "index": [0, 1, 2, 3, 4],
  "columns": [0, 1],
  "data": [
    [0, 0.5],
    [1, 0.75],
    [2, 1],
    [3, 1.25],
    [4, 1.5]
  ]
}

And here's an example of the new format:

{
  "dateTime": ["2022-01-01T00:00:00Z", "2022-01-02T00:00:00Z", "2022-01-03T00:00:00Z"],
  "data": [
    [20, 30],
    [25, 35],
    [22, 32]
  ],
  "aggregates": {
    "min": [20, 30],
    "max": [25, 35],
    "avg": [22, 32]
  },
  "columns": [
    {
      "id": "ts1",
      "title": "TS 1",
      "type": "number",
      "width": 100,
      "editable": false
    },
    {
      "id": "ts2",
      "title": "TS 2",
      "type": "number",
      "width": 100,
      "editable": false
    }
  ],
  "rows": 3,
  "metadata": {
    "defaultValue": 0,
    "kind": "daily",
    "title": "Daily Thermal Time Series"
  }
}

In the new format, the dateTime array contains the timestamps for each row of data, the data array contains the actual data values, the aggregates object contains the minimum, maximum, and average values for each column, the columns array contains metadata for each column (such as its title, data type, and width), and the metadata object contains additional metadata for the entire dataset (such as its default value, kind, and title).

The new format is more explicit and easier to understand, as it separates the data, metadata, and UI presentation logic into distinct sections.
It also allows for more efficient data processing, as the aggregates and column metadata can be calculated once on the server and then reused by the client, rather than being recalculated every time the data is rendered.

  1. Format changes:

    • The new format includes the following properties:
      • dateTime: array of strings in ISO 8601 date format, example: 2024-07-01T00:00:00Z representing the date and time for each row of data:

        "dateTime": [
          "2024-07-01T00:00:00Z",
          "2024-07-02T00:00:00Z",
        ]
      • data: a 2D array of numbers representing the data for each row and column. For example:

        "data": [
          [1, 2, 3, 4, 5],
          [6, 7, 8, 9, 10],
        ]
      • aggregates: an object that includes the following properties:

        • min: an array of numbers representing the minimum value for each column. For example:

          "aggregates": {
            "min": [1, 6, 11, 16, 21],
          }
        • max: an array of numbers representing the maximum value for each column. For example:

          "aggregates": {
            "max": [5, 10, 15, 20, 25],
          }
        • avg: an array of numbers representing the average value for each column. For example:

          "aggregates": {
            "avg": [3, 8, 13, 18, 23],
          }
      • columns: an array of objects that includes the following properties for each column:

      • Note that the columns object shape is specific to GlideDataGrid, see the columns docs

      • id: a string representing the unique identifier for the column. For example:

        "columns": [
          {
            "id": "datetime",
          },
        ]
      • title: a string representing the display name for the column. For example:

        "columns": [
          {
            "title": "Date / Time",
          },
        ]
      • type: a string representing the type for the column. The purpose of this type is to differentiate UI columns that are readonly, and have special styling, from pure data columns that are editable

      • str Enum: "datetime" | "number" | "aggregate" (may change)

      • For example:

        "columns": [
          {
            "type": "datetime",
          },
        ]
      • width: a number representing the width of the column in pixels. For example:

        "columns": [
          {
            "width": 150,
          },
        ]
      • editable: a boolean indicating whether the column is editable. For example:

        "columns": [
          {
            "editable": false,
          },
        ]
      • style: an optional string representing the CSS style for the column. For example:

      • str Enum: "normal" | "highlight" if not provided defaults to "normal".

        "columns": [
          {
            "style": "highlight",
          },
        ]
      • rows: a number representing the number of rows in the data. For example:

      • Required prop for GlideDataGrid, see required props docs

        "rows": 10
      • metadata: an object that includes additional metadata for the data.

      • defaultValue: a number that indicates the default value to fill the matrix when a resize is performed (this value may change depending on the kind of matrix)

      • kind: the kinf of matrix e.g.: "hydroStorage", "waterValues", "allocation". Enables the possibility to switch some features depending on the kind of matrix, or simply identify it.

      • title: displayed title of the matrix

        "metadata": {
          "defaultValue": 1,
          "kind": "hourly",
          "title": "Hourly Temperature Data",
        }
  2. Benefits of changing the format:

    • Respect of good practices: The new format follows the principle of a single source of truth, as the data is stored and managed in a central location (the backend) and is presented to the user in a consistent and accurate way.

    • Performance: By including the aggregates and column metadata in the data payload, the frontend can present the data more efficiently and accurately, without having to perform expensive calculations on the fly.

    • Clarity: The new format provides a straightforward data structure that is easy to understand and work with, both for the frontend and backend developers.

    • Separation of concerns: The new format separates the data from the UI presentation logic, which improves the separation of concerns and makes the codebase more maintainable.

Here's an example of Pydantic classes describing the new format:

class Column(BaseModel):
    id: str
    title: str
    type: str
    width: int
    editable: bool
    style: str = None

class Aggregates(BaseModel):
    min: List[float]
    max: List[float]
    avg: List[float]

class Metadata(BaseModel):
    defaultValue: float
    kind: str
    title: str

class Matrix(BaseModel):
    dateTime: List[str]
    data: List[List[float]]
    aggregates: Aggregates
    columns: List[Column]
    rows: int
    metadata: Metadata

About the Apache Arrow format

Arrow is a columnar memory format designed for efficient in-memory data processing and interchange.
It was developed by Apache Arrow, an open-source project that aims to improve the performance of big data processing applications.

  • Arrow is a columnar memory format designed for efficient in-memory data processing and interchange.
  • In a columnar format, data is stored in columns rather than rows, which can improve performance for certain types of data processing tasks and enable more efficient storage of data.
  • Arrow uses a columnar format to store data in memory, and it also provides a binary format for serializing and deserializing data.
  • The Arrow IPC format is a binary format that is optimized for serializing and deserializing data in a columnar format. It encodes each column using a specific encoding scheme that is optimized for the data type, and then concatenates the encoded columns together to form a single binary blob.
  • The Arrow IPC format provides a efficient and flexible way to serialize and deserialize data in a columnar format, allowing for efficient data processing and interchange across different systems and languages.

Data flow Frontend -> Backend :
(frontend makes a PUT request to update a matrix)

  1. The frontend generates a JSON object that represents the data.
  2. The JSON object is converted to an Arrow Table
  3. The Arrow Table is serialized to a binary format using the tableToIPC method and sent to the backend via a POST request.
  4. The backend deserializes the binary data to an Arrow Table.
  5. FastAPI validates the incoming data using a Pydantic model.
  6. The data is processed using pyarrow, and pandas.
  7. The processed data is stored.

arrow-front-to-back

Data flow Backend -> Frontend:
frontend makes a GET request to read a matrix content)

  1. The frontend sends an HTTP GET request to the backend to read a matrix content.
  2. The backend retrieves the stored data and processes it using tools like pyarrow.compute, and pandas.
  3. The processed data is converted to an Arrow Table.
  4. The Arrow Table is serialized to a binary format.
  5. The serialized binary data is sent to the frontend using an HTTP response.
  6. The frontend deserializes the binary data to an Arrow Table using the tableFromIPC method from the apache-arrow library.
  7. The Arrow Table is converted to a JSON object.
  8. The JSON object is used to populate the frontend UI with the matrix content.

arrow-back-to-front

Resources:

Apache Arrow in JS

Apache Arrow Usage with Pandas

Python Cookbook

@hdinia hdinia added documentation Improvements or additions to documentation enhancement New feature or request draft labels Jul 9, 2024
@makdeuneuv makdeuneuv added this to the v2.18 milestone Jul 10, 2024
@sylvlecl
Copy link
Member

Thanks for the description!
As discussed a few days ago, the proposed data model seems very good to me as the data model on the front side.

Then, in order to advance the dev step by step we would:

  1. As a first step, re-create this data model from the current backend endpoints, and use data glide from it.
    This will allow us to validate the use of data glide without having to do heavy refactorings on the backend

  2. As a second step, improve the way this data model is populated from the backend:

  • By using arrow instead of JSON for the matrix content
  • By computing the index on the backend side
  1. About aggregates computation on the backend side, this would be aded in again another step I think. In particular, it will probably not be possible to have them inside the same arrow structure as the matrix content. So we need to think of how this could be transferred: either another endpoint (but then we should take care of caching the values on the backend side to not reload the file just for computing those values), or finding a way to return in the same body the arrow table and those values.

About matrix updates from front to back

Currently, those updates are specified through operations which have the advantage of not sending the whole matrix to the backend. I thinks in a first step we should keep this mechanism.
Then we should assess if it's OK to send the whole matrix every time as an arrow table or not.
Or maybe having several endpoints for different kinds of operations (updating only a few cells vs. updating a whole subpart of the table).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation draft enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants