DSS Plugin: Visual Edit

Visual Edit provides components to create data validation and editing interfaces, packaged as a Dataiku plugin.

This document serves as developer documentation. As a preliminary step, we recommend reading all pages of the user documentation for data experts and for developers (including the FAQ and Troubleshooting pages, which explain how the plugin works).

Contents

The code was developed in Python (environment specs are in code-env/) and is organized as follows:

  • Data persistence layer based on the Event Sourcing pattern: instead of storing the edited data directly, we use an append-only store to record the full series of actions performed on that data (the "editlog") and replay it to recreate the edited state (see the sketch after this list).
  • Integration of edits within a Dataiku Flow: this is where Recipes and other Dataiku-specific components are specified and implemented, based on the above.
  • Dash Webapp: provides an interface for end users to edit data; the front end is powered by the dash_tabulator component and the back end by the data persistence layer. The webapp is also packaged as a Dataiku Visual Webapp.
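
To make the Event Sourcing idea concrete, here is a minimal, illustrative Python sketch; it is not the plugin's implementation (which lives in python-lib/), and all names in it are hypothetical:

# Minimal, illustrative sketch of the Event Sourcing pattern (hypothetical
# names; the plugin's actual implementation is in python-lib/).
from datetime import datetime, timezone

editlog = []  # append-only store: entries are only ever appended, never mutated

def log_edit(key, column, value, user):
    """Record an edit as an event; the edited data itself is never stored."""
    editlog.append({
        "key": key, "column": column, "value": value,
        "user": user, "date": datetime.now(timezone.utc).isoformat(),
    })

def replay_edits(original_rows):
    """Recreate the edited state by applying all logged events, in order."""
    edited = {row["id"]: dict(row) for row in original_rows}
    for event in editlog:
        edited[event["key"]][event["column"]] = event["value"]
    return list(edited.values())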

Data persistence layer

  • python-lib/DataEditor.py provides a CRUD Python API with methods to log and replay edits; these can run in real-time mode within a webapp, or in batch mode within a data pipeline (see the sketch after this list).
  • The API reference documentation was generated from docstrings by MkDocs (following this tutorial). Updates to the documentation website are manual: run mkdocs build from python-lib/ and move the output (in site/) to ../docs/backend/.
  • python-lib/commons.py provides the core replay logic.
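
As an illustration of how the DataEditor API can be used in both modes, here is a hypothetical sketch; the method names and signatures below are assumptions, so refer to the generated API reference for the real ones:

# Hypothetical usage of the DataEditor CRUD API (method names and
# signatures are assumptions; see the API reference for the real ones).
from DataEditor import DataEditor

de = DataEditor(original_ds_name="matches_uncertain")

# Real-time mode (webapp): log a single edit as it happens.
de.update_row(primary_keys={"id": 42}, column="reviewed", value=True)

# Batch mode (data pipeline): replay the whole editlog to get the edited state.
edited_df = de.get_edited_df()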

Integration of edits within a Dataiku Flow

We recommend reading the components' descriptions, available on your Dataiku instance once you've installed the plugin.

  • custom-recipes/ (Dataiku plugin components: Recipes) leverage the data persistence layer and Custom Fields (see below) to replay edits found in an editlog dataset, and integrate them into a Dataiku Flow.
  • custom-fields/visual-edit-schema/ (Dataiku plugin component: Custom Fields) provides a place to store dataset settings such as primary keys and editable columns, which are used by the Recipes. Currently, these settings are duplicated across the original dataset and the editlog:
    • original dataset:
      • primary_keys is used by the Apply Edits function/recipe, which joins the original and edits datasets (see the sketch after this list)
      • editable_columns is used to present columns in a certain order
      • Note: both properties are returned by the get_original_df() method in commons.py.
    • editlog dataset:
      • primary_keys is used by the Replay Edits function/recipe to unpack key values into one column per primary key and to figure out key names.
  • python-steps/visual-edit-empty-editlog/ (Dataiku plugin component: Scenario step) empties the editlog by deleting all rows (this is only used for testing purposes on a design instance).
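
For illustration, here is a hedged pandas sketch of what the Apply Edits join does conceptually; it is not the plugin's actual code (the real replay/apply logic is in python-lib/commons.py), and all column and variable names are hypothetical:

# Conceptual sketch of the Apply Edits join (illustrative only).
import pandas as pd

primary_keys = ["id"]  # in the plugin, read from the visual-edit-schema custom field

def apply_edits(original_df: pd.DataFrame, edits_df: pd.DataFrame) -> pd.DataFrame:
    """Join the original and edits datasets; edited values override originals."""
    merged = original_df.merge(edits_df, on=primary_keys, how="left",
                               suffixes=("", "_edited"))
    editable_columns = [c for c in edits_df.columns
                        if c not in primary_keys and c in original_df.columns]
    for col in editable_columns:
        # Edited values take priority; untouched cells fall back to the original.
        merged[col] = merged[col + "_edited"].combine_first(merged[col])
    return merged.drop(columns=[c + "_edited" for c in editable_columns])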

Dash Webapp

Structure of webapps/visual-edit/backend.py:

  • Dataiku-specific logic
  • Instantiation of DataEditor
  • A dash_tabulator component added to the layout for the data table, with data coming from DataEditor
  • A callback triggered upon edits, which uses DataEditor to log them (a minimal sketch follows this list).
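
A minimal, hypothetical sketch of this structure follows; it simplifies the real webapps/visual-edit/backend.py considerably, the DataEditor API is the one assumed earlier, and the cellEdited payload keys should be checked against dash_tabulator's documentation:

# Minimal sketch of the backend.py structure above (illustrative only).
import dash
from dash import Input, Output, html
from dash.exceptions import PreventUpdate
from dash_tabulator import DashTabulator
from DataEditor import DataEditor  # hypothetical import path

de = DataEditor(original_ds_name="matches_uncertain")  # hypothetical constructor
df = de.get_edited_df()

app = dash.Dash(__name__)
app.layout = html.Div([
    DashTabulator(
        id="datatable",
        columns=[{"title": c, "field": c, "editor": "input"} for c in df.columns],
        data=df.to_dict("records"),
    ),
    html.Div(id="edit-info"),
])

@app.callback(Output("edit-info", "children"), Input("datatable", "cellEdited"))
def on_cell_edited(cell):
    """Triggered by Tabulator when a cell is edited; logs the edit via DataEditor."""
    if not cell:
        raise PreventUpdate
    # Payload keys ("row", "column", "value") are assumptions to verify.
    de.update_row(primary_keys={"id": cell["row"]["id"]},
                  column=cell["column"], value=cell["value"])
    return "Edit logged."

if __name__ == "__main__":
    app.run_server(debug=True)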

See below for how to run the webapp locally with access to Dash dev tools in the browser, including the callback graph: it provides a visual overview of the components and callbacks, which helps in understanding the logic behind the automatic detection of changes in the original dataset.

Modifying the plugin

Most changes to the code can be tested by running the webapp locally. You can then further test the integration with Dataiku by installing the plugin with your customizations: either change the repo URL used by the plugin installer to fetch from Git, or create a ZIP file from the dss-plugin-visual-edit directory and upload it to the plugin installer.

In the rest of this section, we explain how to run the webapp locally. As a prerequisite, configure your machine to connect to your Dataiku instance (on a Mac, this configuration lives in ~/.dataiku/config.json).
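
This file follows the Dataiku API client configuration format; a minimal example with placeholder values might look like the following (the exact fields are an assumption to check against your setup, though the keys below match the jq queries used later in this section):

{
  "dss_instances": {
    "default": {
      "url": "https://your-dss-host:port/",
      "api_key": "YOUR_API_KEY"
    }
  },
  "default_instance": "default"
}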

Create a Python environment

  • Prerequisite: install pyenv-virtualenv. On a Mac:
brew install pyenv pyenv-virtualenv
  • Create and activate a Python 3.9 virtual environment (to match the Python version used by the plugin, specified in code-env/python/desc.json):
pyenv virtualenv 3.9.19 visual-edit
pyenv activate visual-edit
  • Install plugin requirements and dev requirements:
pip install --upgrade pip
pip install -r code-env/python/spec/requirements.txt
pip install -r code-env/python/spec/requirements.dev.39.txt
  • Install Dataiku internal client (this would be done automatically when creating a code environment within Dataiku):
    • Bash:

      instance_name=$(jq -r '.default_instance' ~/.dataiku/config.json)
      DKU_DSS_URL=$(jq -r --arg instance "$instance_name" '.dss_instances[$instance].url' ~/.dataiku/config.json)
      pip install "$DKU_DSS_URL/public/packages/dataiku-internal-client.tar.gz"
    • Fish:

      set instance_name (jq -r '.default_instance' ~/.dataiku/config.json)
      set DKU_DSS_URL (jq -r --arg instance $instance_name '.dss_instances[$instance].url' ~/.dataiku/config.json)
      pip install $DKU_DSS_URL/public/packages/dataiku-internal-client.tar.gz

Store webapp settings in a JSON file

When the webapp runs locally, we don't have a settings interface where we can specify the original dataset, primary keys, editable columns, linked records, etc. The workaround is to:

  • Store the name of the dataset in an environment variable named ORIGINAL_DATASET;
  • Have the webapp look for these settings in a JSON file with the same name, in webapp-settings/ (you can find examples in that directory; a hypothetical one follows this list).
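
For illustration, such a settings file might look like the following; the field names here are assumptions, so rely on the actual examples in webapp-settings/:

{
  "primary_keys": ["id"],
  "editable_columns": ["reviewed", "comments"],
  "linked_records": []
}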

Solution 1: run the webapp using VS Code Flask launch config (recommended)

  • Copy .vscode/launch.json.example to .vscode/launch.json. Adapt these two environment variables (a hypothetical example of the result follows this list):
    • Project key (DKU_CURRENT_PROJECT_KEY)
    • Dataset name (ORIGINAL_DATASET)
  • Go to "Run and Debug" in VS Code
  • You can set breakpoints in the backend code, and you have access to Dash dev tools in the browser (callback graph, etc.)
  • If you start an interactive Python session in VS Code, you'll be able to use "Jupyter: Variables" next to "Debug Console", and in particular the dataframe inspector.
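
For reference, the adapted launch.json might look like the following sketch; this is a generic VS Code Flask launch configuration with the environment variables from this document filled in, not the actual contents of .vscode/launch.json.example, so treat everything except the two adapted variables as an assumption:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Flask",
      "type": "python",
      "request": "launch",
      "module": "flask",
      "args": ["run"],
      "cwd": "${workspaceFolder}/webapps/visual-edit",
      "env": {
        "FLASK_APP": "backend.py",
        "DKU_CURRENT_PROJECT_KEY": "JOIN_COMPANIES_SIMPLE",
        "ORIGINAL_DATASET": "matches_uncertain",
        "PYTHONPATH": "../../python-lib"
      }
    }
  ]
}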

Solution 2: run the webapp using python backend.py

This requires app.run_server(debug=True) in backend.py. You also need to define some environment variables first:

export DKU_CURRENT_PROJECT_KEY=JOIN_COMPANIES_SIMPLE
export ORIGINAL_DATASET=matches_uncertain
export PYTHONPATH=../../python-lib
python backend.py

Integration tests

Visual Edit is validated against an integration test suite located in dss-plugin-visual-edit/tests. The tests are written in the Gherkin specification language and use a Dataiku library of generic steps. They run on the test instance of Business Solutions.

You can launch the tests from the "Actions" tab of this repository: in the sidebar, click "Run Gherkin tests", then click "Run workflow" on the right and choose the branch that contains the test suite to run.

The plugin is not updated automatically before each test run, so to test a new version you must first update the plugin on the test instance. We plan to automate this in the future.

Tests can also be run locally. Read more in the tests' README.