-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return JSON from Pipeline DAG structure #2812
Conversation
# Conflicts: # docs/source/release_notes.rst
Codecov Report
@@ Coverage Diff @@
## main #2812 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 302 302
Lines 28312 28388 +76
=======================================
+ Hits 28216 28292 +76
Misses 96 96
Continue to review full report at Codecov.
|
# Conflicts: # docs/source/release_notes.rst
@@ -314,7 +314,7 @@ class AutoMLSearch: | |||
Only applicable if patience is not None. Defaults to None. | |||
|
|||
allowed_component_graphs (dict): A dictionary of lists or ComponentGraphs indicating the component graphs allowed in the search. | |||
The format should follow { "Name_0": [list_of_components], "Name_1": [ComponentGraph(...)] } | |||
The format should follow { "Name_0": [list_of_components], "Name_1": ComponentGraph(...) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we meant to add a list here when passing ComponentGraphs
evalml/pipelines/pipeline_base.py
Outdated
Returns: | ||
dag_json (str): A serialized JSON representation of a DAG structure. | ||
""" | ||
dag_json = {"Nodes": {"X": "X", "y": "y"}, "Edges": {}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added X and y as default nodes
evalml/pipelines/pipeline_base.py
Outdated
else component_info[0].name, | ||
"attributes": self.parameters[component_name] | ||
if component_name in self.parameters | ||
else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some classes don't have any parameters that get passed through like StandardScaler
evalml/pipelines/pipeline_base.py
Outdated
dag_json["Edges"][f"from_{from_comp}_to_{component_name}"] = { | ||
"from": from_comp, | ||
"to": component_name, | ||
"data": f"X modified by {from_comp}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data
is meant to provide a little more detail as to what's being passed through that edge even though it can be inferred based on the from
and to
fields. OS users might find this useful in understanding this output.
dag_str = automl.allowed_pipelines[0].graph_json() | ||
dag_json = json.loads(dag_str) | ||
for node_, params_ in automl_parameters_.items(): | ||
for key_, val_ in params_.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specific structure is tested in test_graphs
, this is primarily to check the self.parameters
as they get passed through the pipeline
component_graph = { | ||
"Imputer": ["Imputer", "X", "y"], | ||
"Target Imputer": ["Target Imputer", "X", "y"], | ||
"OneHot_RandomForest": ["One Hot Encoder", "Imputer.x", "Target Imputer.y"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted a text fixture that provided a graph with a modification of y
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super interesting! I'm a little worried about the structure of the nodes, where the format of the X and y nodes don't match the rest of the dictionary. I can't really think of a better way to represent the data, but I think users might be a little confused by it. Because of this, I'd argue we need to make sure the JSON structure is in the documentation, and is easy to both find and read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! That test_component_as_json test is mighty thorough. I think we might want to move towards removing that intermediary key from the json, though, as it doesn't really add too much and might actually clutter up the json a little bit. I think also that we owe you a much better specification for the JSON next time. I'll add this to our sprint feedback.
# Conflicts: # evalml/tests/automl_tests/test_automl.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments on potential improvements that could be added! Let me know what you think. Looking good though!
@@ -425,6 +426,58 @@ def feature_importance(self): | |||
df = pd.DataFrame(importance, columns=["feature", "importance"]) | |||
return df | |||
|
|||
def graph_json(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -425,6 +426,58 @@ def feature_importance(self): | |||
df = pd.DataFrame(importance, columns=["feature", "importance"]) | |||
return df | |||
|
|||
def graph_json(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x_edges.append(("X", component_name)) | ||
elif parent == "y": | ||
y_edges.append(("y", component_name)) | ||
nodes.update({"X": "X", "y": "y"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? Doesn't seem like very useful information to the user, since it's not going to change between different pipelines, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessary, but since X_edges
and y_edges
can both include X
and y
, I'd expect a user looking to implement this in a custom visualization tool to want any node mentioned in the edges
to also be mentioned in nodes
.
@@ -5214,3 +5215,44 @@ def test_automl_chooses_engine(engine_choice, X_y_binary): | |||
automl = AutoMLSearch( | |||
X_train=X, y_train=y, problem_type="binary", engine=engine_choice | |||
) | |||
|
|||
|
|||
def test_graph_automl(X_y_multi): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this isn't likely using much time or memory but can we mock this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good Parthiv!
Fixes #1969
This allows users to return a serialized JSON representation of a pipeline's DAG structure for their own graphing purposes.
The template for the structure is