
New feature: HTTP interface? #92

Closed
gtoonstra opened this issue Jun 30, 2015 · 2 comments

Comments

@gtoonstra
Contributor

So I went through the codebase and docs of 'airflow' today and I think it's a great fit for one of my projects. I'm the maintainer of "remap", which is a 100% Python implementation of MapReduce, intended for now to run on no more than a dozen nodes. Jobs are kicked off through a REST interface.

Here's where a potential contribution comes in.

I didn't find anything already done with http interfaces, so my idea is to write an HTTPHook, an HTTP operator and a sensor for this. The operator/hook calls a URL resource with an indicated method and potentially some post data. The sensor later on calls another URL to check on progress. This would allow work to be executed asynchronously until some later checkpoint where the sensor needs to check whether something is available or not.

The HTTP library I intend to use is "requests", which should come installed with "pip": http://docs.python-requests.org/en/latest/
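The trigger-then-poll flow described above could be sketched roughly like this. This is a minimal, hypothetical illustration, not Airflow API: `submit` and `check` stand in for the actual REST calls (e.g. `requests.post` / `requests.get` wrappers), and the names are made up.

```python
import time

def run_async_job(submit, check, poke_interval=5, timeout=300, sleep=time.sleep):
    """Kick off a remote job via `submit` (e.g. a POST to the job endpoint),
    then poll `check` (e.g. a GET on a status URL) until it reports done
    or the timeout expires. `sleep` is injectable for testing."""
    job_id = submit()
    waited = 0
    while waited < timeout:
        if check(job_id):
            return job_id
        sleep(poke_interval)
        waited += poke_interval
    raise TimeoutError("job %s did not finish within %ss" % (job_id, timeout))
```

In Airflow terms, the `submit` half would live in the operator and the polling loop in the sensor, letting other work proceed between the two.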

Operators never seem to return values, probably by design: a worker process waits around for the job to complete, so an operator executes its action synchronously.

So, couple of questions:

  1. Is this a welcome contribution? Is someone already working on an HTTP hook?
  2. Can operators return data that a sensor checks later, to support inherently asynchronous systems like this one? Or are operators required to be synchronous, in the sense that they must always wait for the result of an operation, checking inside the operator at intervals until the action completes?
  3. Is the "requests" library an acceptable library (considering license, availability, installability, etc)?

As an extra thought on 2: it's possible that external systems contain key/values that are useful in workflows. Is there a recommended mechanism for loading small pieces of data into a DAG workflow so that it's available from a context, for example when another task is executed?

@mistercrunch
Member

The requests lib is already a dependency of Airflow; you can see it in both requirements.txt and setup.py. It's a great library.

Airflow tasks are expected to be synchronous, or made so by writing some sort of sleep/check routine in your operator. It's also expected to raise an exception as a way to communicate an error.

It may be tricky to generalize an HttpOperator since all systems expect different endpoints and payloads and return different results. Using a PythonOperator that uses the requests lib is the quick way to do this. Maybe an HttpSensor would be generic enough, receiving an endpoint, a payload, and a regex to match against the response.
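The PythonOperator-plus-requests approach might look like the sketch below. The URL, pattern, and function name are illustrative; the HTTP `get` callable is injected (normally it would be `requests.get`) so the logic stands on its own, and failure is communicated by raising, as described above.

```python
import re

def http_task(get, url, expected_pattern=None):
    """A python_callable sketch for a PythonOperator: call `url` via the
    injected `get` (e.g. requests.get), raise on a non-200 status, and
    optionally require a regex match in the response body."""
    resp = get(url)
    if resp.status_code != 200:
        raise RuntimeError("unexpected status %s for %s" % (resp.status_code, url))
    if expected_pattern and not re.search(expected_pattern, resp.text):
        raise RuntimeError("response body did not match %r" % expected_pattern)
    return resp.text
```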

I don't know much about remap, but maybe a RemapOperator would make more sense. In general, that's probably a better approach: instead of receiving a generic endpoint and payload, it can receive something more meaningful like job_name, parameters_dict, or whatever makes sense for that specific external system.

A side note about hooks: they use the Connection model to store connection information as opposed to hard-coding it in scripts. It might be nice to have a thin HttpHook that retrieves that info from the DB and acts as a thin wrapper around the requests lib. I'm not 100% sure about this though; it may not add a whole lot of value (versus confusion)...
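The thin-hook idea might amount to something like this. This is a hedged sketch under assumptions: the stored connection is modeled as any object with `.host`, `.login`, and `.password` attributes (the real Connection model isn't shown here), and the HTTP session is injected (normally a `requests.Session()`).

```python
class ThinHttpHook:
    """Illustrative thin hook: connection details come from a stored
    record rather than being hard-coded in the script, and all actual
    HTTP work is delegated to the injected session."""

    def __init__(self, conn, session):
        self.base_url = conn.host.rstrip("/")
        self.auth = (conn.login, conn.password) if conn.login else None
        self.session = session  # e.g. requests.Session()

    def run(self, endpoint, method="GET", data=None):
        url = "%s/%s" % (self.base_url, endpoint.lstrip("/"))
        return self.session.request(method, url, data=data, auth=self.auth)
```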

@gtoonstra
Contributor Author

So I put together something preliminary, a hook and a sensor, to get an idea of the complexity. You can see this development branch at:

master...gtoonstra:http_protocol_sensor

I agree: the flavours of what's written over HTTP are too rich to create anything generic enough, and attempts at such generic approaches usually start to pollute other areas with logic, for example the proliferation of logic into the DAGs.

In the branch, the hook raises exceptions, but I think most of those should be moved to the operator instead. This is based on the assumption that the operator class decides on success or failure based on the responses of the hook, not the hook itself. In cases such as a database hook where a db was expected and didn't exist, the hook is allowed to raise exceptions itself.

Then all we probably need for now is a SimpleHttpOperator, which is limited to the following:

  • it fails or succeeds on the return code only, not on the response content.
  • it only supports a 'one-shot' call, no cookies, sessions or conversations, so no calls in a sequence.
  • maybe add a 'regex' on the "GET" call to check the state of the object beyond the return code (existence).
  • it has a timeout and never uses a persistent connection.
  • only supports basic auth of login and password.

Anything beyond this simple use requires a specific operator:

  • file upload/download over http, special auth schemes, http conversations and sessions, special mime-types, etc.

So the remap operator probably falls into the latter category.
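Taken together, the constraints listed above might translate into roughly this execute() shape. This is a sketch under the stated assumptions, not the actual branch code: the class and parameter names are made up, and the HTTP `get` callable is injected (normally `requests.get`) so the success/failure decision stays in the operator, as argued above.

```python
import re

class SimpleHttpOperatorSketch:
    """Illustrative only: one-shot call, success decided on the status
    code, optional regex check on a GET response, basic auth via an
    (login, password) tuple, explicit timeout, no sessions or cookies."""

    def __init__(self, get, url, auth=None, timeout=60, response_regex=None):
        self.get = get
        self.url = url
        self.auth = auth
        self.timeout = timeout
        self.response_regex = response_regex

    def execute(self):
        resp = self.get(self.url, auth=self.auth, timeout=self.timeout)
        if not 200 <= resp.status_code < 300:
            raise RuntimeError("HTTP %s calling %s" % (resp.status_code, self.url))
        if self.response_regex and not re.search(self.response_regex, resp.text):
            raise RuntimeError("response did not match %r" % self.response_regex)
        return resp.status_code
```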

mobuchowski pushed a commit to mobuchowski/airflow that referenced this issue Jan 4, 2022 (Signed-off-by: Julien Le Dem <julien@apache.org>):

  • improve marquez_dag unit test
  • test new method
  • improve marquez.DAG tests
  • adress review feedback