
Migrating Google AutoML example_dags to sys tests #32368

Merged
merged 3 commits on Jul 7, 2023
Changes from 1 commit
@@ -35,17 +35,14 @@
AutoMLTrainModelOperator,
)

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
GCP_AUTOML_LOCATION = os.environ.get("GCP_AUTOML_LOCATION", "us-central1")
GCP_AUTOML_TEXT_CLS_BUCKET = os.environ.get("GCP_AUTOML_TEXT_CLS_BUCKET", "gs://INVALID BUCKET NAME")

# Example values
DATASET_ID = ""
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_automl_classification"
GCP_PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "default")
GCP_AUTOML_LOCATION = "us-central1"
Contributor

Should we hardcode these values? Shouldn't they be fetched from env vars?

Contributor Author

I was following the convention of the other system tests, for example #30003, and also of the existing tests in the automl folder. Does it make sense to hardcode them?

Member

cc: @VladaZakharova, who might be able to help here.

Contributor

The main principle is to keep these tests self-sufficient, but they still need external resources and the infra those resources run on. We can control the resources ourselves, e.g. by creating a bucket on the fly; still, if a user already has certain infra, IMO it is better to give them the flexibility to configure the test for their infra than to force them to read the test and build matching infra. Hardcoding everything would end up making these tests less useful, wdyt?

Contributor Author

Nice point. Any suggestions I can follow? That will help me tweak the DAGs to your use cases.

Contributor

Generally, all the tests look good, except for the first point.

Contributor Author

Thanks. Do you mean to say that we should have it as:

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
GCP_AUTOML_LOCATION = os.environ.get("GCP_AUTOML_LOCATION", "us-central1")
GCP_AUTOML_TEXT_CLS_BUCKET = os.environ.get("GCP_AUTOML_TEXT_CLS_BUCKET", "gs://INVALID BUCKET NAME")

And so on?
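For reference, the convention the migrated tests follow is to derive resource names from SYSTEM_TESTS_ENV_ID and to create the bucket on the fly as part of the DAG, so each test carries its own setup and teardown. A minimal sketch of that pattern, assuming the standard GCS operators; the task ids and bucket layout below are illustrative, not copied from this PR:

import os
from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.gcs import (
    GCSCreateBucketOperator,
    GCSDeleteBucketOperator,
)
from airflow.utils.trigger_rule import TriggerRule

ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_automl_classification"
DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"

with models.DAG(DAG_ID, start_date=datetime(2021, 1, 1), schedule="@once", catchup=False) as dag:
    # TEST SETUP: create the bucket the DAG imports data from,
    # instead of pointing at pre-existing user infra via env vars.
    create_bucket = GCSCreateBucketOperator(
        task_id="create_bucket",
        bucket_name=DATA_SAMPLE_GCS_BUCKET_NAME,
        storage_class="REGIONAL",
        location="us-central1",
    )

    # TEST TEARDOWN: delete the bucket even if upstream tasks failed.
    delete_bucket = GCSDeleteBucketOperator(
        task_id="delete_bucket",
        bucket_name=DATA_SAMPLE_GCS_BUCKET_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_bucket >> delete_bucket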


# Example model
MODEL = {
"display_name": "auto_model_1",
"dataset_id": DATASET_ID,
"text_classification_model_metadata": {},
}

@@ -55,7 +52,10 @@
"text_classification_dataset_metadata": {"classification_type": "MULTICLASS"},
}

IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [GCP_AUTOML_TEXT_CLS_BUCKET]}}

DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
AUTOML_DATASET_BUCKET = f"gs://{DATA_SAMPLE_GCS_BUCKET_NAME}/automl-text/dataset.csv"
IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [AUTOML_DATASET_BUCKET]}}

extract_object_id = CloudAutoMLHook.extract_object_id

@@ -65,24 +65,23 @@
start_date=datetime(2021, 1, 1),
catchup=False,
tags=["example"],
) as example_dag:
) as dag:
create_dataset_task = AutoMLCreateDatasetOperator(
task_id="create_dataset_task", dataset=DATASET, location=GCP_AUTOML_LOCATION
)

dataset_id = cast(str, XComArg(create_dataset_task, key="dataset_id"))
MODEL["dataset_id"] = dataset_id

import_dataset_task = AutoMLImportDataOperator(
task_id="import_dataset_task",
dataset_id=dataset_id,
location=GCP_AUTOML_LOCATION,
input_config=IMPORT_INPUT_CONFIG,
)

MODEL["dataset_id"] = dataset_id

create_model = AutoMLTrainModelOperator(task_id="create_model", model=MODEL, location=GCP_AUTOML_LOCATION)

model_id = cast(str, XComArg(create_model, key="model_id"))

delete_model_task = AutoMLDeleteModelOperator(
@@ -99,10 +98,17 @@
project_id=GCP_PROJECT_ID,
)

# TEST BODY
import_dataset_task >> create_model
# TEST TEARDOWN
delete_model_task >> delete_datasets_task

# Task dependencies created via `XComArgs`:
# create_dataset_task >> import_dataset_task
# create_dataset_task >> create_model
# create_dataset_task >> delete_datasets_task

Comment on lines 106 to +107
Contributor
@Adaverse Jul 6, 2023

Seems like we need a watcher, wdyt?

Contributor Author

Sure, we can have a watcher for each of these tests too.
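For the record, the watcher convention from tests/system/README.md is a short addition at the end of the DAG definition. It is needed because the teardown tasks run with a trigger rule, which would otherwise mask upstream failures from pytest:

from tests.system.utils.watcher import watcher

# This test needs watcher in order to properly mark success/failure
# when "teardown" tasks with trigger rule are part of the DAG.
list(dag.tasks) >> watcher()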

from tests.system.utils import get_test_run # noqa: E402

# Needed to run the example DAG with pytest (see: tests/system/README.md#run_via_pytest)
test_run = get_test_run(dag)
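(get_test_run is what makes the DAG collectible by pytest, so each migrated example can be executed on its own with something like the following; the path is illustrative of where the Google provider system tests live, not quoted from this PR:

pytest tests/system/providers/google/cloud/automl/example_automl_text_classification.py)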
@@ -35,17 +35,14 @@
AutoMLTrainModelOperator,
)

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
GCP_AUTOML_LOCATION = os.environ.get("GCP_AUTOML_LOCATION", "us-central1")
GCP_AUTOML_SENTIMENT_BUCKET = os.environ.get("GCP_AUTOML_SENTIMENT_BUCKET", "gs://INVALID BUCKET NAME")

# Example values
DATASET_ID = ""
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_automl_text_sentiment"
GCP_PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "default")
GCP_AUTOML_LOCATION = "us-central1"

# Example model
MODEL = {
"display_name": "auto_model_1",
"dataset_id": DATASET_ID,
"text_sentiment_model_metadata": {},
}

@@ -55,23 +52,26 @@
"text_sentiment_dataset_metadata": {"sentiment_max": 10},
}

IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [GCP_AUTOML_SENTIMENT_BUCKET]}}
DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
AUTOML_DATASET_BUCKET = f"gs://{DATA_SAMPLE_GCS_BUCKET_NAME}/automl-text/dataset.csv"
IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [AUTOML_DATASET_BUCKET]}}

extract_object_id = CloudAutoMLHook.extract_object_id

# Example DAG for AutoML Natural Language Text Sentiment
with models.DAG(
"example_automl_text_sentiment",
DAG_ID,
start_date=datetime(2021, 1, 1),
catchup=False,
user_defined_macros={"extract_object_id": extract_object_id},
tags=["example"],
) as example_dag:
) as dag:
create_dataset_task = AutoMLCreateDatasetOperator(
task_id="create_dataset_task", dataset=DATASET, location=GCP_AUTOML_LOCATION
)

dataset_id = cast(str, XComArg(create_dataset_task, key="dataset_id"))
MODEL["dataset_id"] = dataset_id

import_dataset_task = AutoMLImportDataOperator(
task_id="import_dataset_task",
@@ -100,11 +100,18 @@
project_id=GCP_PROJECT_ID,
)

# TEST BODY
import_dataset_task >> create_model
# TEST TEARDOWN
delete_model_task >> delete_datasets_task

# Task dependencies created via `XComArgs`:
# create_dataset_task >> import_dataset_task
# create_dataset_task >> create_model
# create_model >> delete_model_task
# create_dataset_task >> delete_datasets_task

from tests.system.utils import get_test_run # noqa: E402

# Needed to run the example DAG with pytest (see: tests/system/README.md#run_via_pytest)
test_run = get_test_run(dag)
@@ -35,19 +35,14 @@
AutoMLTrainModelOperator,
)

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
GCP_AUTOML_LOCATION = os.environ.get("GCP_AUTOML_LOCATION", "us-central1")
GCP_AUTOML_TRANSLATION_BUCKET = os.environ.get(
"GCP_AUTOML_TRANSLATION_BUCKET", "gs://INVALID BUCKET NAME/file"
)

# Example values
DATASET_ID = "TRL123456789"
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_automl_translation"
GCP_PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "default")
GCP_AUTOML_LOCATION = "us-central1"

# Example model
MODEL = {
"display_name": "auto_model_1",
"dataset_id": DATASET_ID,
"translation_model_metadata": {},
}

@@ -60,19 +55,23 @@
},
}

IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [GCP_AUTOML_TRANSLATION_BUCKET]}}

DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
AUTOML_DATASET_BUCKET = f"gs://{DATA_SAMPLE_GCS_BUCKET_NAME}/automl-text/file"
IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [AUTOML_DATASET_BUCKET]}}

extract_object_id = CloudAutoMLHook.extract_object_id


# Example DAG for AutoML Translation
with models.DAG(
"example_automl_translation",
DAG_ID,
start_date=datetime(2021, 1, 1),
schedule="@once",
catchup=False,
user_defined_macros={"extract_object_id": extract_object_id},
tags=["example"],
) as example_dag:
) as dag:
create_dataset_task = AutoMLCreateDatasetOperator(
task_id="create_dataset_task", dataset=DATASET, location=GCP_AUTOML_LOCATION
)
@@ -106,11 +105,19 @@
project_id=GCP_PROJECT_ID,
)

# TEST BODY
import_dataset_task >> create_model
# TEST TEARDOWN
delete_model_task >> delete_datasets_task

# Task dependencies created via `XComArgs`:
# create_dataset_task >> import_dataset_task
# create_dataset_task >> create_model
# create_model >> delete_model_task
# create_dataset_task >> delete_datasets_task


from tests.system.utils import get_test_run # noqa: E402

# Needed to run the example DAG with pytest (see: tests/system/README.md#run_via_pytest)
test_run = get_test_run(dag)
@@ -35,19 +35,14 @@
AutoMLTrainModelOperator,
)

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
GCP_AUTOML_LOCATION = os.environ.get("GCP_AUTOML_LOCATION", "us-central1")
GCP_AUTOML_VIDEO_BUCKET = os.environ.get(
"GCP_AUTOML_VIDEO_BUCKET", "gs://INVALID BUCKET NAME/hmdb_split1.csv"
)

# Example values
DATASET_ID = "VCN123455678"
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_automl_video"
GCP_PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "default")
GCP_AUTOML_LOCATION = "us-central1"

# Example model
MODEL = {
"display_name": "auto_model_1",
"dataset_id": DATASET_ID,
"video_classification_model_metadata": {},
}

@@ -57,24 +52,27 @@
"video_classification_dataset_metadata": {},
}

IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [GCP_AUTOML_VIDEO_BUCKET]}}
DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
AUTOML_DATASET_BUCKET = f"gs://{DATA_SAMPLE_GCS_BUCKET_NAME}/automl-text/hmdb_split1.csv"
IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [AUTOML_DATASET_BUCKET]}}

extract_object_id = CloudAutoMLHook.extract_object_id


# Example DAG for AutoML Video Intelligence Classification
with models.DAG(
"example_automl_video",
DAG_ID,
start_date=datetime(2021, 1, 1),
catchup=False,
user_defined_macros={"extract_object_id": extract_object_id},
tags=["example"],
) as example_dag:
) as dag:
create_dataset_task = AutoMLCreateDatasetOperator(
task_id="create_dataset_task", dataset=DATASET, location=GCP_AUTOML_LOCATION
)

dataset_id = cast(str, XComArg(create_dataset_task, key="dataset_id"))
MODEL["dataset_id"] = dataset_id

import_dataset_task = AutoMLImportDataOperator(
task_id="import_dataset_task",
@@ -103,11 +101,18 @@
project_id=GCP_PROJECT_ID,
)

# TEST BODY
import_dataset_task >> create_model
# TEST TEARDOWN
delete_model_task >> delete_datasets_task

# Task dependencies created via `XComArgs`:
# create_dataset_task >> import_dataset_task
# create_dataset_task >> create_model
# create_model >> delete_model_task
# create_dataset_task >> delete_datasets_task

from tests.system.utils import get_test_run # noqa: E402

# Needed to run the example DAG with pytest (see: tests/system/README.md#run_via_pytest)
test_run = get_test_run(dag)
@@ -35,20 +35,15 @@
AutoMLTrainModelOperator,
)

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
GCP_AUTOML_LOCATION = os.environ.get("GCP_AUTOML_LOCATION", "us-central1")
GCP_AUTOML_TRACKING_BUCKET = os.environ.get(
"GCP_AUTOML_TRACKING_BUCKET",
"gs://INVALID BUCKET NAME/youtube_8m_videos_animal_tiny.csv",
)
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_automl_video_tracking"
GCP_PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "default")
GCP_AUTOML_LOCATION = "us-central1"

# Example values
DATASET_ID = "VOT123456789"

# Example model
MODEL = {
"display_name": "auto_model_1",
"dataset_id": DATASET_ID,
"video_object_tracking_model_metadata": {},
}

@@ -58,24 +53,27 @@
"video_object_tracking_dataset_metadata": {},
}

IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [GCP_AUTOML_TRACKING_BUCKET]}}
DATA_SAMPLE_GCS_BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
AUTOML_DATASET_BUCKET = f"gs://{DATA_SAMPLE_GCS_BUCKET_NAME}/automl-text/youtube_8m_videos_animal_tiny.csv"
IMPORT_INPUT_CONFIG = {"gcs_source": {"input_uris": [AUTOML_DATASET_BUCKET]}}

extract_object_id = CloudAutoMLHook.extract_object_id


# Example DAG for AutoML Video Intelligence Object Tracking
with models.DAG(
"example_automl_video_tracking",
DAG_ID,
start_date=datetime(2021, 1, 1),
catchup=False,
user_defined_macros={"extract_object_id": extract_object_id},
tags=["example"],
) as example_dag:
) as dag:
create_dataset_task = AutoMLCreateDatasetOperator(
task_id="create_dataset_task", dataset=DATASET, location=GCP_AUTOML_LOCATION
)

dataset_id = cast(str, XComArg(create_dataset_task, key="dataset_id"))
MODEL["dataset_id"] = dataset_id

import_dataset_task = AutoMLImportDataOperator(
task_id="import_dataset_task",
@@ -104,11 +102,18 @@
project_id=GCP_PROJECT_ID,
)

# TEST BODY
import_dataset_task >> create_model
# TEST TEARDOWN
delete_model_task >> delete_datasets_task

# Task dependencies created via `XComArgs`:
# create_dataset_task >> import_dataset_task
# create_dataset_task >> create_model
# create_model >> delete_model_task
# create_dataset_task >> delete_datasets_task

from tests.system.utils import get_test_run # noqa: E402

# Needed to run the example DAG with pytest (see: tests/system/README.md#run_via_pytest)
test_run = get_test_run(dag)