Data platform #2615

Closed

youngyjd wants to merge 27 commits into apache:master from lyft:data-platform

Conversation

@youngyjd
Contributor

youngyjd commented Sep 20, 2017

Mistakenly created PR.

jlowin and others added 27 commits May 9, 2016 16:12
The DAG hash function tried (and failed) to hash the list of tasks, then fell back on repr-ing the list, which took forever. Instead, hash tuple(task_dict.keys()). In addition, this replaces two slow list comprehensions with much faster hash lookups (using the new task_dict).
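A rough sketch of that pattern (hypothetical class shape, not the
actual Airflow source):

    class DAG(object):
        def __init__(self, dag_id, tasks=None):
            self.dag_id = dag_id
            # task_dict keys tasks by task_id; lookups against it replace
            # the old list comprehensions that scanned the task list.
            self.task_dict = {t.task_id: t for t in (tasks or [])}

        def __hash__(self):
            # tuple(self.task_dict.keys()) is hashable and cheap to
            # build, unlike repr() of a long list of task objects.
            return hash((self.dag_id, tuple(self.task_dict.keys())))

        def has_task(self, task_id):
            # O(1) dict lookup instead of scanning a list of tasks
            return task_id in self.task_dict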
When the Scheduler is run with `--num-runs`, there can be multiple
Schedulers and Executors all trying to run tasks. For queued tasks,
the Scheduler was previously only trying to run tasks that it itself
had queued, but that doesn't work if the Scheduler is restarting. This
PR reverts that behavior and adds two types of "best effort"
executions: before running a TI, executors check whether it is already
running, and before ending, executors call sync() one last time.
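A minimal sketch of those two guards (hypothetical executor shape, not
Airflow's actual BaseExecutor API):

    class BestEffortExecutor(object):
        def __init__(self):
            self.queued_tasks = {}

        def execute_async(self, key, ti):
            raise NotImplementedError  # subclass submits the work

        def heartbeat(self):
            for key, ti in list(self.queued_tasks.items()):
                ti.refresh_from_db()       # another scheduler/executor
                if ti.state == "running":  # may have started this TI
                    self.queued_tasks.pop(key)
                    continue
                self.queued_tasks.pop(key)
                self.execute_async(key, ti)

        def sync(self):
            """Collect results of finished tasks (backend-specific)."""

        def end(self):
            # Flush outstanding results before this executor exits, so a
            # scheduler restarting under --num-runs does not lose state.
            self.sync()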
The scheduler can encounter a queued task twice before the
task actually starts to run; this change locks the task and avoids
that race condition.
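The lock can be sketched as a SELECT ... FOR UPDATE claim (illustrative
names; `TaskInstance` stands in for the ORM model):

    def try_claim(session, TaskInstance, dag_id, task_id, execution_date):
        """Atomically move a queued TI to running, or report it taken."""
        ti = (session.query(TaskInstance)
                     .filter_by(dag_id=dag_id, task_id=task_id,
                                execution_date=execution_date)
                     .with_for_update()   # row lock until commit/rollback
                     .one_or_none())
        if ti is None or ti.state != "queued":
            session.rollback()            # another scheduler claimed it
            return False
        ti.state = "running"
        session.commit()                  # commit releases the row lock
        return True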
Pulling this patch [1] in from upstream in order to resolve a dependency
conflict in the ETL repo:

airflow 1.7.1.2 -> (funcsigs >= 0.4, funcsigs < 1)
mock 2.0.0 -> (funcsigs >=1)

Without this change, `control run piptools.compile etl` fails to
resolve the dependencies in `requirements.in` because of the conflict.
Tested locally and confirmed that this solves the conflict.

If this change looks good, I'll follow up with a change to pin ETL's
dependency on Airflow to this git SHA.

[1] https://github.com/apache/incubator-airflow/pull/1826/files
Lyft's custom gunicorn is on version 19.1.1, which I believe is causing
the following error in the scheduler process when I tested etl in staging:

    Traceback (most recent call last):
      File "/usr/local/lib/service_venv/bin/airflow", line 4, in <module>
        __import__('pkg_resources').require('airflow==1.7.1.2')
      File
    "/usr/local/lib/service_venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py",
    line 3036, in <module>
        @_call_aside
      File
    "/usr/local/lib/service_venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py",
    line 3020, in _call_aside
        f(*args, **kwargs)
      File
    "/usr/local/lib/service_venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py",
    line 3049, in _initialize_master_working_set
        working_set = WorkingSet._build_master()
      File
    "/usr/local/lib/service_venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py",
    line 656, in _build_master
        return cls._build_from_requirements(__requires__)
      File
    "/usr/local/lib/service_venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py",
    line 669, in _build_from_requirements
        dists = ws.resolve(reqs, Environment())
      File
    "/usr/local/lib/service_venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py",
    line 854, in resolve
        raise DistributionNotFound(req, requirers)
    pkg_resources.DistributionNotFound: The 'gunicorn<19.4.0,>=19.3.0'
    distribution was not found and is required by airflow

Related PR: https://github.com/lyft/etl/pull/6037
* Background

We're patching the socket library in order to use IP addresses instead
of hostnames, because Lyft hostnames are special and don't actually
resolve. (See https://github.com/lyft/etl/pull/5560 for context)
`service_venv` sets `$PATH` such that
`/usr/local/lib/service_venv/bin/` takes precedence over
`/srv/service/current/bin/`, so it's not enough to simply put a patched
version of `airflow` in the ETL repository.

We're ensuring that our patched version of `airflow` takes precedence by
overwriting `/usr/local/lib/service_venv/bin/airflow` with a script that
just calls our patched version at `/srv/service/current/bin/airflow`
(from the ETL repository). See this salt-state [1].

That doesn't work anymore, because with the rollout of frozen_venvs we
can't count on `airflow` being at a hardcoded path.
https://github.com/lyft/etl/pull/6073 was one attempt to solve this by
finding the service_venv root dynamically. That approach doesn't work,
because it depends on `service_venv` being available by the time the
Jinja template is rendered, but `service_venv` won't be available until
it's created in the lyft-python state (which requires that the template
is rendered).

We can avoid this chicken and egg problem if we simply patch the Airflow
code directly, rather than trying to overwrite it on deployment. We
apply our patch iff the `prefer_ip_over_hostname` key under the `lyft`
configuration section is set.
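
A minimal sketch of that conditional patch (assumed helper names; the
actual patch may differ):

    import socket

    from airflow import configuration as conf

    _real_getfqdn = socket.getfqdn

    def _getfqdn_as_ip(name=''):
        # Return an IP address instead of a hostname, since Lyft
        # hostnames don't actually resolve.
        return socket.gethostbyname(_real_getfqdn(name))

    try:
        _prefer_ip = conf.getboolean('lyft', 'prefer_ip_over_hostname')
    except Exception:  # section or key absent: leave socket untouched
        _prefer_ip = False

    if _prefer_ip:
        socket.getfqdn = _getfqdn_as_ip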

* Rollout

On the sharedairflowworker (which is only running test DAGs so far), we
will stop overwriting the airflow command with our patched version and
set the `prefer_ip_over_hostname` key.

If this works, then we can take the same approach for the other ASGs.
Until then, we will continue overwriting the `airflow` command under
service_venv/ with our patched version from ETL.

[1] https://github.com/lyft/etl/blob/master/ops/config/states/etl/init.sls#L50-L59
Support configuration key to toggle using IPs or hostnames
This symlink is left dangling when ETL builds airflow as a dependency,
which is causing deployment to fail. Excerpt from the logs:

    /srv/salt/venv/bin/python /srv/pulldeploy/pulldeploy/deploy.py -p etl
    2017-07-19 20:17:42,202 INFO frozen_venv ./venvs/service directory
    created
    2017-07-19 20:17:57,857 ERROR Command failed: cp -RL
    /srv/venvs/service/* ./venvs/service
    2017-07-19 20:17:57,857 ERROR cp: cannot stat
    '/srv/venvs/service/trusty/service_venv/src/airflow/airflow/www/static/docs':
    No such file or directory
Remove symlink to docs directory
I've examined the schema migrations, and I think that there's nothing
backwards-incompatible. That is, we can decouple the database migrations
from the deployment of the code.

To prepare this commit, I took all of the Alembic changes in the
directory `airflow/migrations` from the upstream master branch and
applied them here [1].

I'll deploy these changes in staging by running the `airflow upgradedb`
command and then verify that the scheduler is able to schedule tasks and
the workers can still execute them when using the code from Airflow 1.7
against the database schema defined in Airflow 1.8.

[1] git checkout upstream-master -- airflow/migrations
[DATA-4816] Take Alembic migrations from upstream
Airflow 1.8 depends on `flask-wtf==0.14`, so in order to run airflow 1.8
and 1.7 side by side we need to force etl to upgrade from
`flask-wtf==0.12` to `flask-wtf==0.14`.
[DATA-4839] Upgrade flask-wtf to 0.14
The airflow 1.7 scheduler deletes any queued TIs belonging to DAGs that
are not in the DagBag. Since we're filtering out DAGs whitelisted for
1.8 (just the long_running_test DAG for now), the scheduler thinks those
TIs belong to a DAG that has been removed and deletes them before they
get picked up by an airflow 1.8 worker.

With this change, we'll simply ignore any TIs from a DAG that we know
is handled by the 1.8 scheduler, as sketched below.
[DATA-4839] Do not delete queued TIs from invisible DAGs
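Sketched out (illustrative names, not the actual scheduler code), the
guard is a whitelist check ahead of the delete:

    # DAG ids scheduled by Airflow 1.8 and filtered out of the 1.7 DagBag.
    DAGS_HANDLED_BY_18 = {"long_running_test"}

    def prune_orphaned_queued_tis(session, queued_tis, dagbag):
        for ti in queued_tis:
            if ti.dag_id in DAGS_HANDLED_BY_18:
                continue            # owned by the 1.8 scheduler; keep it
            if ti.dag_id not in dagbag.dags:
                session.delete(ti)  # DAG really was removed (old behavior)
        session.commit()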
This DAG will get scheduled by 1.8, so don't delete any of its "phantom"
Task Instances which were created by the 1.8 scheduler.
[DPTOOLS-50] Ignore presto_query_logs DAG during 1.8 migration
@mention-bot

@youngyjd, thanks for your PR! By analyzing the history of the files in this pull request, we identified @Fokko, @edgarRd and @bolkedebruin to be potential reviewers.

youngyjd closed this Sep 20, 2017