Return JSON from Pipeline DAG structure #2812

Merged
merged 34 commits into from
Oct 12, 2021

Conversation

ParthivNaresh
Contributor

@ParthivNaresh ParthivNaresh commented Sep 20, 2021

Fixes #1969

This allows users to return a serialized JSON representation of a pipeline's DAG structure for their own graphing purposes.
The template for the structure is:

{"Edges": {"from_parent_node_to_child_node" : {"data": what_value_is_passed,
                                               "from": "parent_node",
                                               "to": "child_node"}
                  ...},
 "Nodes": {"component_name": {"class": class_name,
                              "attributes": self.parameters_attributes}
                  ...}
}

dag_json = json.loads(pipeline_.graph_json())
----------------------------------------------
{'Edges': {'from_Imputer_to_Standard Scaler': {'data': 'X modified by Imputer',
                                               'from': 'Imputer',
                                               'to': 'Standard Scaler'},
           'from_One_Hot_Encoder_Special_to_Imputer': {'data': 'X modified by One Hot '
                                                       'Encoder',
                                               'from': 'One Hot Encoder',
                                               'to': 'Imputer'},
           'from_Standard Scaler_to_Linear Regressor': {'data': 'X modified by '
                                                                'Standard '
                                                                'Scaler',
                                                        'from': 'Standard '
                                                                'Scaler',
                                                        'to': 'Linear '
                                                              'Regressor'},
           'from_X_to_One_Hot_Encoder_Special': {'data': 'Original X input',
                                         'from': 'X',
                                         'to': 'One Hot Encoder'},
           'from_y_to_Imputer': {'data': 'Original y input',
                                 'from': 'y',
                                 'to': 'Imputer'},
           'from_y_to_Linear Regressor': {'data': 'Original y input',
                                          'from': 'y',
                                          'to': 'Linear Regressor'},
           'from_y_to_One_Hot_Encoder_Special': {'data': 'Original y input',
                                         'from': 'y',
                                         'to': 'One Hot Encoder'},
           'from_y_to_Standard Scaler': {'data': 'Original y input',
                                         'from': 'y',
                                         'to': 'Standard Scaler'}},
 'Nodes': {'Imputer': {'attributes': {'categorical_fill_value': None,
                                      'categorical_impute_strategy': 'most_frequent',
                                      'numeric_fill_value': None,
                                      'numeric_impute_strategy': 'mean'},
                       'class': 'Imputer'},
           'Linear Regressor': {'attributes': {'fit_intercept': True,
                                               'n_jobs': -1,
                                               'normalize': False},
                                'class': 'Linear Regressor'},
           'One_Hot_Encoder_Special': {'attributes': {'categories': None,
                                              'drop': 'if_binary',
                                              'features_to_encode': None,
                                              'handle_missing': 'error',
                                              'handle_unknown': 'ignore',
                                              'top_n': 10},
                               'class': 'One Hot Encoder'},
           'Standard Scaler': {'attributes': None, 'class': 'Standard Scaler'},
           'X': 'X',
           'y': 'y'}}
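
For readers who want to feed this output into their own graphing tool, here is a minimal sketch of turning the "Edges"/"Nodes" layout above into an adjacency list. It assumes the JSON has exactly the shape shown in this description; the helper name `adjacency_from_dag_json` and the small sample string are hypothetical and not part of evalml.

```python
import json
from collections import defaultdict


def adjacency_from_dag_json(dag_str):
    """Map every node listed under "Nodes" to the sorted list of its children."""
    dag = json.loads(dag_str)
    adjacency = defaultdict(set)
    for edge in dag["Edges"].values():
        adjacency[edge["from"]].add(edge["to"])
    # Nodes without outgoing edges still get an (empty) entry.
    return {node: sorted(adjacency[node]) for node in dag["Nodes"]}


# Hypothetical two-component pipeline written in the template's shape.
sample = json.dumps(
    {
        "Nodes": {
            "X": "X",
            "y": "y",
            "Imputer": {"class": "Imputer", "attributes": None},
            "Linear Regressor": {"class": "Linear Regressor", "attributes": None},
        },
        "Edges": {
            "from_X_to_Imputer": {
                "from": "X", "to": "Imputer", "data": "Original X input"},
            "from_y_to_Imputer": {
                "from": "y", "to": "Imputer", "data": "Original y input"},
            "from_Imputer_to_Linear Regressor": {
                "from": "Imputer", "to": "Linear Regressor",
                "data": "X modified by Imputer"},
            "from_y_to_Linear Regressor": {
                "from": "y", "to": "Linear Regressor", "data": "Original y input"},
        },
    }
)

adjacency = adjacency_from_dag_json(sample)
print(adjacency)
```

An adjacency list like this is usually enough to hand to a layout library; the "data" field can then be used as an edge label.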

codecov bot commented Sep 20, 2021

Codecov Report

Merging #2812 (61846a6) into main (e4dabef) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2812     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28312   28388     +76     
=======================================
+ Hits       28216   28292     +76     
  Misses        96      96             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.9% <ø> (ø)
evalml/pipelines/pipeline_base.py 98.4% <100.0%> (+0.2%) ⬆️
evalml/tests/automl_tests/test_automl.py 99.5% <100.0%> (+0.1%) ⬆️
evalml/tests/conftest.py 98.3% <100.0%> (+0.1%) ⬆️
evalml/tests/pipeline_tests/test_graphs.py 100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -314,7 +314,7 @@ class AutoMLSearch:
Only applicable if patience is not None. Defaults to None.

allowed_component_graphs (dict): A dictionary of lists or ComponentGraphs indicating the component graphs allowed in the search.
- The format should follow { "Name_0": [list_of_components], "Name_1": [ComponentGraph(...)] }
+ The format should follow { "Name_0": [list_of_components], "Name_1": ComponentGraph(...) }
Contributor Author

I don't think we meant to add a list here when passing ComponentGraphs

Returns:
dag_json (str): A serialized JSON representation of a DAG structure.
"""
dag_json = {"Nodes": {"X": "X", "y": "y"}, "Edges": {}}
Contributor Author

Added X and y as default nodes

else component_info[0].name,
"attributes": self.parameters[component_name]
if component_name in self.parameters
else None,
Contributor Author

Some classes don't have any parameters that get passed through like StandardScaler

dag_json["Edges"][f"from_{from_comp}_to_{component_name}"] = {
"from": from_comp,
"to": component_name,
"data": f"X modified by {from_comp}",
Contributor Author

The "data" field is meant to provide a little more detail about what is being passed through that edge, even though it can be inferred from the "from" and "to" fields. Open-source users might find this useful in understanding the output.
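
To make the fragments above easier to follow, here is a self-contained sketch of how the "Nodes" and "Edges" entries could be assembled from a dict-style component graph. It is not the code in this PR: the function name `build_dag_json`, the plain-dict inputs, and the "y modified by ..." label for `.y` parents are assumptions made for illustration.

```python
import json


def build_dag_json(component_graph, parameters):
    """Serialize a dict-style component graph into the Nodes/Edges layout."""
    dag = {"Nodes": {"X": "X", "y": "y"}, "Edges": {}}
    for name, (component_class, *parents) in component_graph.items():
        dag["Nodes"][name] = {
            "class": component_class,
            # Components with no entry in `parameters` get None.
            "attributes": parameters.get(name),
        }
        for parent in parents:
            if parent in ("X", "y"):
                source, data = parent, f"Original {parent} input"
            else:
                # "Imputer.x" -> source "Imputer", kind "x".
                source, _, kind = parent.rpartition(".")
                passed = "X" if kind == "x" else "y"
                data = f"{passed} modified by {source}"
            # Caveat: a component fed by both "A.x" and "A.y" would reuse
            # the same key here, so the second edge overwrites the first.
            dag["Edges"][f"from_{source}_to_{name}"] = {
                "from": source,
                "to": name,
                "data": data,
            }
    return json.dumps(dag)


# Hypothetical inputs in the same style as the fixtures in this PR.
graph = {
    "Imputer": ["Imputer", "X", "y"],
    "Standard Scaler": ["Standard Scaler", "Imputer.x", "y"],
}
params = {"Imputer": {"numeric_impute_strategy": "mean"}}
result = json.loads(build_dag_json(graph, params))
print(sorted(result["Edges"]))
```

Keying edges by a `from_..._to_...` string keeps each edge addressable by name, at the cost of the key collision noted in the comment.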

dag_str = automl.allowed_pipelines[0].graph_json()
dag_json = json.loads(dag_str)
for node_, params_ in automl_parameters_.items():
for key_, val_ in params_.items():
Contributor Author

The specific structure is tested in test_graphs, this is primarily to check the self.parameters as they get passed through the pipeline
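
The kind of check described here can be sketched without running a search. The helper below is hypothetical (it is not the test in this PR) and assumes the serialized graph stores each component's parameters under `["Nodes"][name]["attributes"]`, as in the PR description.

```python
import json


def assert_parameters_in_graph(dag_str, expected_parameters):
    """Check that every expected parameter appears on the matching node."""
    dag = json.loads(dag_str)
    for node, params in expected_parameters.items():
        attributes = dag["Nodes"][node]["attributes"]
        for key, value in params.items():
            assert attributes[key] == value, f"{node}.{key} differs"


# Hypothetical serialized graph with a single component.
sample_dag = json.dumps(
    {
        "Nodes": {
            "X": "X",
            "y": "y",
            "Imputer": {
                "class": "Imputer",
                "attributes": {"numeric_impute_strategy": "mean"},
            },
        },
        "Edges": {},
    }
)

assert_parameters_in_graph(
    sample_dag, {"Imputer": {"numeric_impute_strategy": "mean"}}
)
```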

component_graph = {
"Imputer": ["Imputer", "X", "y"],
"Target Imputer": ["Target Imputer", "X", "y"],
"OneHot_RandomForest": ["One Hot Encoder", "Imputer.x", "Target Imputer.y"],
Contributor Author

Wanted a test fixture that provided a graph with a modification of y

@ParthivNaresh ParthivNaresh marked this pull request as ready for review September 21, 2021 13:20
Contributor

@eccabay eccabay left a comment

This is super interesting! I'm a little worried about the structure of the nodes, where the format of the X and y nodes doesn't match the rest of the dictionary. I can't really think of a better way to represent the data, but I think users might be a little confused by it. Because of this, I'd argue we need to make sure the JSON structure is in the documentation and is easy to both find and read.

Contributor

@chukarsten chukarsten left a comment

Nice work! That test_component_as_json test is mighty thorough. I think we might want to move towards removing that intermediary key from the json, though, as it doesn't really add too much and might actually clutter up the json a little bit. I think also that we owe you a much better specification for the JSON next time. I'll add this to our sprint feedback.

Contributor

@bchen1116 bchen1116 left a comment

Left some comments on potential improvements that could be added! Let me know what you think. Looking good though!

@@ -425,6 +426,58 @@ def feature_importance(self):
df = pd.DataFrame(importance, columns=["feature", "importance"])
return df

def graph_json(self):
Contributor

Nit: Can we make the output formatted? It would be a lot easier to read.
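
One way to address the formatting request, sketched with the standard library only: `json.dumps` accepts an `indent` argument that pretty-prints nested structures. The `dag` dictionary below is a hypothetical stand-in for whatever `graph_json` assembles; whether the merged method uses this option is not shown in this thread.

```python
import json

# Hypothetical stand-in for the dictionary built inside graph_json.
dag = {"Nodes": {"X": "X", "y": "y"}, "Edges": {}}

compact = json.dumps(dag)           # single line, harder to scan
pretty = json.dumps(dag, indent=4)  # one key per line, nested by depth

print(pretty)
```

Both strings parse back to the same object, so indentation changes readability only, not content.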

@@ -425,6 +426,58 @@ def feature_importance(self):
df = pd.DataFrame(importance, columns=["feature", "importance"])
return df

def graph_json(self):
Contributor

Another nit: in "x_edges", can the node that takes in X (the "Drop Columns Transformer") come first in the list?

evalml/pipelines/pipeline_base.py
x_edges.append(("X", component_name))
elif parent == "y":
y_edges.append(("y", component_name))
nodes.update({"X": "X", "y": "y"})
Contributor

Do we need this? Doesn't seem like very useful information to the user, since it's not going to change between different pipelines, right?

Contributor Author

It's not necessary, but since X_edges and y_edges can both include X and y, I'd expect a user looking to implement this in a custom visualization tool to want any node mentioned in the edges to also be mentioned in nodes.
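
The argument for keeping X and y under the nodes can be made concrete with a small consistency check. This sketch assumes the revised output carries a "Nodes" mapping plus "x_edges" and "y_edges" lists of [parent, child] pairs, as the snippet under review suggests; the exact key names and the helper `missing_nodes` are assumptions.

```python
def missing_nodes(dag):
    """Return names that an edge refers to but that are absent from "Nodes"."""
    referenced = set()
    for key in ("x_edges", "y_edges"):
        for parent, child in dag.get(key, []):
            referenced.update((parent, child))
    return sorted(referenced - set(dag["Nodes"]))


# Hypothetical outputs: one keeps X and y as nodes, one drops them.
with_inputs = {
    "Nodes": {"X": "X", "y": "y", "Imputer": {"attributes": None}},
    "x_edges": [["X", "Imputer"]],
    "y_edges": [["y", "Imputer"]],
}
without_inputs = {
    "Nodes": {"Imputer": {"attributes": None}},
    "x_edges": [["X", "Imputer"]],
    "y_edges": [["y", "Imputer"]],
}

print(missing_nodes(with_inputs))
print(missing_nodes(without_inputs))
```

A visualization tool that looks up every edge endpoint in the node table would hit the missing entries in the second case, which is the situation the reply above is trying to avoid.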

@@ -5214,3 +5215,44 @@ def test_automl_chooses_engine(engine_choice, X_y_binary):
automl = AutoMLSearch(
X_train=X, y_train=y, problem_type="binary", engine=engine_choice
)


def test_graph_automl(X_y_multi):
Contributor

I know that this isn't likely using much time or memory but can we mock this test?

Contributor

@chukarsten chukarsten left a comment

Looks good Parthiv!

@ParthivNaresh ParthivNaresh merged commit 2c6bbe3 into main Oct 12, 2021
@freddyaboulton freddyaboulton deleted the JSON-Graph branch May 13, 2022 15:03
Successfully merging this pull request may close these issues.

Pipeline graph: return JSON with nodes and edges