Skip to content

[AIRFLOW-729] Add Google Cloud Dataproc cluster creation operator#1971

Closed
bodschut wants to merge 1 commit intoapache:masterfrom
bodschut:feature/dataproc_operator
Closed

[AIRFLOW-729] Add Google Cloud Dataproc cluster creation operator#1971
bodschut wants to merge 1 commit intoapache:masterfrom
bodschut:feature/dataproc_operator

Conversation

@bodschut
Copy link
Contributor

@bodschut bodschut commented Jan 4, 2017

Add an operator to create a Google Cloud Dataproc cluster with a given name in a given google cloud project.

The operator checks if there is already a cluster running with the provided name in the provided project.
If so, the operator finishes successfully. Otherwise, the operator issues a rest API call to initiate
the cluster creation and waits until the creation is successful before exiting. Most of the possible cluster customisation parameters available in dataproc are made available when constructing the operator.

Dear Airflow Maintainers,

Please accept this PR that addresses the following issues:

Testing Done:

  • Unittests not possible due to private google cloud connections. Extensive production testing has been performed.

@codecov-io
Copy link

codecov-io commented Jan 4, 2017

Current coverage is 67.13% (diff: 100%)

Merging #1971 into master will increase coverage by <.01%

@@             master      #1971   diff @@
==========================================
  Files           135        135          
  Lines         10295      10295          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           6911       6912     +1   
+ Misses         3384       3383     -1   
  Partials          0          0          

Powered by Codecov. Last update 43bf89d...deab7dd

@bodschut bodschut force-pushed the feature/dataproc_operator branch from d113330 to 84d3024 Compare January 5, 2017 08:41
Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits. Please check if you adhere to the commit guidelines (max columns)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this smart? Should this not be a required setting instead of having a default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this should not be a default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use logging.info across airflow. Besides can this als be "debug"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @bolkedebruin , I actually don't agree here. Due to how logging is wired log messages under INFO will never be shown or you need to rebuild Airflow (I'm looking into it though...).

So INFO is very appropriate for Operators and Hooks. Results come in a separate logfile anyway, and you want to see how the progress of a Task goes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem as above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem as above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem as above, debug seems more appropiate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem as above

@bolkedebruin
Copy link
Contributor

cc @alexvanboxel

The operator checks if there is already a cluster running with the provided name in the provided project.
If so, the operator finishes successfully. Otherwise, the operator issues a rest API call to initiate
the cluster creation and waits until the creation is successful before exiting.
@bodschut bodschut force-pushed the feature/dataproc_operator branch from 84d3024 to deab7dd Compare January 9, 2017 07:47
@bodschut
Copy link
Contributor Author

bodschut commented Jan 9, 2017

Hi @bolkedebruin

I changed the zone parameter to be required and adopted the logging.info method. However, I also feel that the INFO level is more appropriate to follow the task progress.

Another question: I see that that you adhere to the 90 characters limit for line lengths (original PEP-8 recommendation). However, I believe that sometimes this is bad for code readability... A lot of modern open source projects have already extended the limit to 120 characters (which is also the default in IDE's like pycharm). What do you think of this?

Thanks,
Bob

@alexvanboxel
Copy link
Contributor

@bolkedebruin I think it's good, if we get +1 for @criccomini I'll merge it in before tonight. I want to build a new alpha to run on our environment and would like to have this one and the other included.

@asfgit asfgit closed this in ffbe728 Jan 9, 2017
alekstorm pushed a commit to alekstorm/incubator-airflow that referenced this pull request Jun 1, 2017
The operator checks if there is already a cluster
running with the provided name in the provided
project.
If so, the operator finishes successfully.
Otherwise, the operator issues a rest API call to
initiate
the cluster creation and waits until the creation
is successful before exiting.

Closes apache#1971 from
bodschut/feature/dataproc_operator
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants