# Data Space Analysis Example

This notebook analyzes the parametric data in a dataspace and calculates statistics for each plotted trace.

### Imports

Import Python modules for executing the notebook. The ni_data_space_analyzer is used for performing some of the standard analyses. Pandas is used for building and handling the data in the data space. Scrapbook is used for running notebooks and recording data for the SystemLink Notebook Execution Service.

In [1]:
import pandas as pd
import scrapbook as sb

from ni_data_space_analyzer import DataSpaceAnalyzer
from ni_data_space_analyzer.exception import DataSpaceAnalyzerError

### Custom analysis

Apart from the standard statistics, you can add custom analysis functions to compute additional statistics on the trace data. To add one, follow these steps:

- Add the custom analysis in [parameters cell](#parameters) output metadata (see commented lines in the [Metadata](#Metadata) section).
- Add the custom analysis in supported analysis options (see commented lines in the [Supported analysis options](#supported-analysis-options) section).
- Implement the custom analysis logic (see commented lines in the [Perform analysis](#Perform-analysis) section). A sample implementation is provided below for reference.

### Parameters

1. `trace_data` - Data from the traces plotted in the dataspace to be analyzed. It is stored as a notebook execution artifact and referenced by its artifact ID.
2. `analysis_options` - List of analyses to be performed on the traces plotted in the dataspace.
3. `workspace_id` - Workspace ID of the dataspace to be analyzed.

In [2]:
trace_data = {"artifact_id": "<artifact_id>"}
analysis_options = []
workspace_id = ""

### Metadata

These are the parameters that the notebook expects to be passed in by SystemLink. For a notebook designed to perform analysis inside a dataspace, you must tag the parameters cell with 'parameters' and, at minimum, specify the following in the cell metadata using the JupyterLab Property Inspector (double gear icon):

```json
{
  "papermill": {
    "parameters": {
      "analysis_options": [],
      "trace_data": {"artifact_id": "<artifact_id>"},
      "workspace_id": ""
    }
  },
  "systemlink": {
    "interfaces": [],
    "outputs": [
      {
        "display_name": "Min",
        "id": "min",
        "type": "scalar"
      },
      {
        "display_name": "Max",
        "id": "max",
        "type": "scalar"
      },
      {
        "display_name": "Mean",
        "id": "mean",
        "type": "scalar"
      },
      {
        "display_name": "2 STD",
        "id": "2std",
        "type": "scalar"
      },
      {
        "display_name": "-2 STD",
        "id": "-2std",
        "type": "scalar"
      },
      {
        "display_name": "Moving Mean",
        "id": "moving_mean",
        "type": "vector"
      },
      {
        "display_name": "CP",
        "id": "cp",
        "type": "vector"
      },
      {
        "display_name": "CPK",
        "id": "cpk",
        "type": "vector"
      },
      // {
      //   "display_name": "Custom Analysis Scalar",
      //   "id": "custom_analysis_scalar",
      //   "type": "scalar"
      // },
      // {
      //   "display_name": "Custom Analysis Vector",
      //   "id": "custom_analysis_vector",
      //   "type": "vector"
      // }
    ],
    "parameters": [
      {
        "display_name": "Trace data",
        "id": "trace_data",
        "type": "dict[string, string]"
      },
      {
        "display_name": "Analysis Options",
        "id": "analysis_options",
        "type": "string[]"
      },
      {
        "display_name": "Workspace ID",
        "id": "workspace_id",
        "type": "string"
      }
    ],
    "version": 2
  },
  "tags": ["parameters"]
}
```

For more information on how parameterization works, review the [papermill documentation](https://papermill.readthedocs.io/en/latest/usage-parameterize.html#how-parameters-work).


### Supported analysis options

1. Mean: The central value of the data set.
2. 2 STD: Two standard deviations from the mean.
3. -2 STD: Negative two standard deviations from the mean.
4. Min: The minimum value in the data set.
5. Max: The maximum value in the data set.
6. Moving Mean: The central value of the most recent X data points.
7. Cpk: The process capability index. Describes the ability of a process to provide output that will be within the required specifications consistently.
8. Cp: The process capability. The process capability is a measure of the potential for a process to provide output that is within upper and lower specification limits.

9. *(Optional) custom_analysis_scalar: Sample scalar custom analysis.*
10. *(Optional) custom_analysis_vector: Sample vector custom analysis.*
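
The statistics listed above can be sketched with plain pandas. This is a minimal illustration, not the implementation inside `ni_data_space_analyzer`: the function name `summarize_trace`, the `window` size, the explicit `lsl`/`usl` specification limits, and the reading of "2 STD" as mean plus two standard deviations are all assumptions made for the example. Cp and Cpk use the common textbook definitions, and the standard deviation is pandas' default sample standard deviation.

```python
import pandas as pd


def summarize_trace(y: pd.Series, lsl: float, usl: float, window: int = 3) -> dict:
    """Compute the listed statistics for one trace (hypothetical helper)."""
    mean = y.mean()
    std = y.std()  # sample standard deviation (ddof=1), the pandas default
    return {
        "min": y.min(),
        "max": y.max(),
        "mean": mean,
        "2std": mean + 2 * std,    # assumed meaning: mean + 2 sigma
        "-2std": mean - 2 * std,   # assumed meaning: mean - 2 sigma
        "moving_mean": y.rolling(window).mean(),  # one value per data point
        "cp": (usl - lsl) / (6 * std),
        "cpk": min(usl - mean, mean - lsl) / (3 * std),
    }


# Hypothetical trace values and specification limits.
stats = summarize_trace(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]), lsl=0.0, usl=6.0, window=2)
print(stats["mean"], stats["cp"], stats["cpk"])
print(stats["moving_mean"].tolist())
```

Here Cp and Cpk come out as single numbers because one pair of limits is applied to the whole trace. The notebook declares `cp` and `cpk` as vector outputs, so the library may evaluate them per data point.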

In [3]:
supported_analysis = [
    {"id": "min", "type": "scalar"},
    {"id": "max", "type": "scalar"},
    {"id": "mean", "type": "scalar"},
    {"id": "2std", "type": "scalar"},
    {"id": "-2std", "type": "scalar"},
    {"id": "moving_mean", "type": "vector"},
    {"id": "cp", "type": "vector"},
    {"id": "cpk", "type": "vector"},
    # {"id": "custom_analysis_scalar", "type": "scalar"},
    # {"id": "custom_analysis_vector", "type": "vector"},
]

supported_analysis_options = [analysis["id"] for analysis in supported_analysis]
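
When adding a custom analysis, the entry in `supported_analysis` and the matching entry in the `systemlink.outputs` cell metadata must agree on `id` and `type`. The following is a small sketch of a consistency check. The helper `find_metadata_mismatches` and the shortened example lists are hypothetical and not part of the notebook or of SystemLink.

```python
def find_metadata_mismatches(supported: list, metadata_outputs: list) -> list:
    """Return ids from `supported` that are missing from the metadata outputs
    or declared there with a different type (hypothetical helper)."""
    declared = {output["id"]: output["type"] for output in metadata_outputs}
    return [
        entry["id"]
        for entry in supported
        if declared.get(entry["id"]) != entry["type"]
    ]


# Shortened, illustrative copies of the two lists.
example_supported = [
    {"id": "min", "type": "scalar"},
    {"id": "moving_mean", "type": "vector"},
    {"id": "custom_analysis_scalar", "type": "scalar"},
]
example_outputs = [
    {"display_name": "Min", "id": "min", "type": "scalar"},
    {"display_name": "Moving Mean", "id": "moving_mean", "type": "vector"},
]

mismatches = find_metadata_mismatches(example_supported, example_outputs)
print(mismatches)
```

In this example the custom analysis is reported because it was added to the supported list but not to the metadata outputs.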

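### Validate analysis options

This cell validates that every analysis option received as execution input is in the list of analysis options the notebook supports. If any option is unsupported, it raises a `DataSpaceAnalyzerError` that names the invalid options.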

In [4]:
def validate_analysis_options(analysis_options) -> None:
    analysis_options = list(map(str.strip, analysis_options))

    invalid_options = list(set(analysis_options) - set(supported_analysis_options))

    if invalid_options:
        raise DataSpaceAnalyzerError(
            "The analysis failed because the following options are not supported: {0}.".format(
                ", ".join(invalid_options)
            )
        )
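
The check reduces to comparing the requested option ids against the supported ids. The following is a minimal standalone sketch of the same idea that also reports the invalid options in the order they were requested, since a Python `set` does not preserve order. The function name `find_invalid_options` and the sample inputs are assumptions for illustration only.

```python
def find_invalid_options(requested: list, supported: list) -> list:
    """Return the requested options that are not supported, normalized
    (stripped, lower-cased) and in first-seen order (hypothetical helper)."""
    supported_set = set(supported)
    invalid = []
    for option in requested:
        normalized = option.strip().lower()
        if normalized not in supported_set and normalized not in invalid:
            invalid.append(normalized)
    return invalid


# Illustrative input with stray whitespace, mixed case, and a duplicate.
invalid = find_invalid_options([" Min", "median", "CPK ", "median"], ["min", "max", "cpk"])
print(invalid)
```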

### Sample custom analysis

Sample implementation for scalar and vector custom analysis, and how it should be saved.

In [None]:
def compute_custom_analysis_scalar(trace_data_dataframe: pd.DataFrame) -> None:
    """Compute `custom_analysis_scalar` of the dataframe."""
    
    # Perform actual custom_analysis_scalar
    custom_analysis_scalar_sample_result = trace_data_dataframe['y'].describe()['count']
    trace_data_dataframe['custom_analysis_scalar'] = custom_analysis_scalar_sample_result

def compute_custom_analysis_vector(trace_data_dataframe: pd.DataFrame) -> None:
    """Compute `custom_analysis_vector` of the dataframe."""

    # Perform actual custom_analysis_vector
    custom_analysis_vector_sample_result = trace_data_dataframe['y'].rolling(1).median()
    trace_data_dataframe['custom_analysis_vector'] = custom_analysis_vector_sample_result
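
The pattern in both samples is to write the result into a new column of the trace dataframe, named after the analysis id. A scalar result is broadcast to every row, while a vector result holds one value per data point. The sketch below shows this with two other illustrative analyses. The function names, the column names, and the example dataframe with `x` and `y` columns are assumptions, not part of the library.

```python
import pandas as pd


def compute_range_scalar(trace_data_dataframe: pd.DataFrame) -> None:
    """Store max(y) - min(y) as a scalar, repeated on every row."""
    y = trace_data_dataframe["y"]
    trace_data_dataframe["range_scalar"] = y.max() - y.min()


def compute_cumulative_mean_vector(trace_data_dataframe: pd.DataFrame) -> None:
    """Store the running mean of y as a vector, one value per row."""
    trace_data_dataframe["cumulative_mean_vector"] = (
        trace_data_dataframe["y"].expanding().mean()
    )


# Hypothetical trace data.
example = pd.DataFrame({"x": [0, 1, 2, 3], "y": [2.0, 4.0, 6.0, 8.0]})
compute_range_scalar(example)
compute_cumulative_mean_vector(example)
print(example)
```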

### Perform analysis

By default, each `compute_*` call appends its result to the original dataframe. After the requested analyses have run, call the **generate_analysis_output** method of **data_space_analyzer** with the requested analysis options and the supported analysis list to build the analysis results, as shown below.

In [6]:
def perform_analysis(trace_data_dataframe: pd.DataFrame) -> pd.DataFrame:
    data_space_analyzer = DataSpaceAnalyzer(dataframe=trace_data_dataframe)

    for option in analysis_options:
        if option == "min":
            data_space_analyzer.compute_min()
        elif option == "max":
            data_space_analyzer.compute_max()
        elif option == "mean":
            data_space_analyzer.compute_mean()
        elif option == "2std":
            data_space_analyzer.compute_2std()
        elif option == "-2std":
            data_space_analyzer.compute_negative_2std()
        elif option == "moving_mean":
            data_space_analyzer.compute_moving_mean()
        elif option == "cp":
            data_space_analyzer.compute_cp()
        elif option == "cpk":
            data_space_analyzer.compute_cpk()
        # elif option == "custom_analysis_scalar":
        #     compute_custom_analysis_scalar(trace_data_dataframe)
        # elif option == "custom_analysis_vector":
        #     compute_custom_analysis_vector(trace_data_dataframe)

    return data_space_analyzer.generate_analysis_output(
        analysis_options=analysis_options, supported_analysis=supported_analysis
    )
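
As the list of analyses grows, the `if`/`elif` chain can also be expressed as a lookup table from option id to method name. The sketch below illustrates that alternative with a stand-in class. `StubAnalyzer`, `ANALYSIS_METHODS`, and `run_selected` are hypothetical, and only the method names `compute_min`, `compute_max`, and `compute_negative_2std` are taken from the cell above.

```python
class StubAnalyzer:
    """Hypothetical stand-in for DataSpaceAnalyzer that records calls."""

    def __init__(self) -> None:
        self.calls = []

    def compute_min(self) -> None:
        self.calls.append("min")

    def compute_max(self) -> None:
        self.calls.append("max")

    def compute_negative_2std(self) -> None:
        self.calls.append("-2std")


# Option id -> name of the analyzer method that computes it.
ANALYSIS_METHODS = {
    "min": "compute_min",
    "max": "compute_max",
    "-2std": "compute_negative_2std",
}


def run_selected(analyzer, options: list) -> None:
    """Call the analyzer method mapped to each requested option."""
    for option in options:
        method_name = ANALYSIS_METHODS.get(option)
        if method_name is None:
            continue  # unmapped options are skipped, like the if/elif chain
        getattr(analyzer, method_name)()


stub = StubAnalyzer()
run_selected(stub, ["max", "-2std", "unknown"])
print(stub.calls)
```

Custom analysis functions that take the dataframe instead of being analyzer methods would need a second table or a small wrapper, so the explicit chain may remain the simpler choice for a handful of options.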

### Save analysis results

Users can save the analysis results into an artifact using the **save_analysis** method within **data_space_analyzer**. The output will be an artifact ID, representing the compressed and stored analysis data.

In [7]:
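# Normalize once so that validation and perform_analysis compare the same option ids.
analysis_options = [option.strip().lower() for option in analysis_options]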
final_result = []

try:
    validate_analysis_options(analysis_options)
    data_space_analyzer = DataSpaceAnalyzer(pd.DataFrame())
    traces = data_space_analyzer.load_dataset(trace_data)

    for trace in traces:
        trace_data_name = trace["name"]
        trace_data_dataframe = trace["data"]

        analysis_results = perform_analysis(trace_data_dataframe)
        
        final_result.append({"plot_label": trace_data_name, "data": analysis_results})
    
    output_artifact_id = data_space_analyzer.save_analysis(workspace_id, final_result)

except DataSpaceAnalyzerError as e:
    raise Exception(e) from None

### Store the result information so that SystemLink can access it

SystemLink uses scrapbook to store result information from each notebook execution to display to the user in the Execution Details slide-out.
   

In [None]:
sb.glue("result", output_artifact_id)

#### Sample Output format

```json
{
    "output_artifact_id": "ec25561d-6509-49e5-9a78-30e9752733fe"
}
```

`output_artifact_id` - The ID of the artifact file where the output data is compressed and stored.

# Next Steps

1. Publish this notebook to SystemLink with the Data Space Analysis interface by right-clicking it in the JupyterLab File Browser.
2. Analyze the parametric data inside the dataspace manually by clicking the Analyze button.
   