
Support on packaged dags #8716

@mfumagalli68

Description


I'm trying to use Apache Airflow with packaged DAGs.

I've written my code as a Python package, and it depends on other libraries such as numpy, scipy, etc.

This is the setup.py of my custom Python package:

    from setuptools import setup, find_packages
    from pathlib import Path
    from typing import List
    
    import distutils.text_file
    
    def parse_requirements(filename: str) -> List[str]:
        """Return requirements from requirements file."""
        # Ref: https://stackoverflow.com/a/42033122/
        return distutils.text_file.TextFile(filename=str(Path(__file__).with_name(filename))).readlines()
    
    
    setup(name='classify_business',
          version='0.1',
          python_requires=">=3.6",
          description='desc',
          url='https://urlgitlab/datascience/classifybusiness',
          author='Marco fumagalli',
          author_email='marco.fumagalli@mycompany.com',
          packages = find_packages(),
          license='MIT',
          install_requires=parse_requirements('requirements.txt'),
          zip_safe=False,
          include_package_data=True)

requirements.txt contains the packages (vertica_python, pandas, numpy, etc.) along with the versions my code needs.

I wrote a little shell script based on the one provided in the docs for creating packaged DAGs:

    set -eu -o pipefail
    
    if [ "$#" -ne 5 ]; then
        echo "First param should be /srv/user_name/virtualenvs/name_virtual_env"
        echo "Second param should be name of temp directory"
        echo "Third param should be git url"
        echo "Fourth param should be dag zip name, i.e. dag_zip.zip, to be copied into AIRFLOW__CORE__DAGS_FOLDER"
        echo "Fifth param should be package name, i.e. classify_business"
        exit 1
    fi
    
    
    venv_path=${1}
    dir_tmp=${2}
    git_url=${3}
    dag_zip=${4}
    pkg_name=${5}
    
    python3 -m venv "$venv_path"
    source "$venv_path/bin/activate"
    mkdir "$dir_tmp"
    cd "$dir_tmp"
    
    python3 -m pip install --prefix="$PWD" "git+$git_url"
    
    zip -r "$dag_zip" ./*
    cp "$dag_zip" "$AIRFLOW__CORE__DAGS_FOLDER"
    
    cd ..
    rm -r "$dir_tmp"

The script installs my package along with its dependencies directly from GitLab, zips everything up, and then copies the archive to the dags folder.

This is the content of the directory dir_tmp before it is zipped:

    bin  
    lib  
    lib64  
    predict_dag.py  
    train_dag.py

Airflow doesn't seem to be able to import packages installed in lib or lib64.
I'm getting this error:

    ModuleNotFoundError: No module named 'vertica_python'

I even tried to move my custom package outside of lib:

    bin
    my_custom_package
    lib  
    lib64  
    predict_dag.py  
    train_dag.py

But I'm still getting the same error.
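If I understand zipimport correctly, Python can import from a path inside a zip archive, but only paths that are actually on sys.path are searched, and Airflow only adds the zip root, not the lib/python3.6/site-packages directory inside it. A workaround sketch (the site-packages path is an assumption based on the --prefix layout and Python 3.6): extend sys.path at the top of the DAG file so pure-Python packages in that subdirectory become importable:

```python
import os
import sys

# __file__ points inside the archive, e.g. /dags/dag_zip.zip/predict_dag.py,
# so its directory is the zip root that Airflow already put on sys.path.
zip_root = os.path.dirname(__file__)

# The --prefix install placed dependencies under lib/python3.6/site-packages
# (assumed layout; adjust the Python version to match the venv).
site_pkgs = os.path.join(zip_root, "lib", "python3.6", "site-packages")
if site_pkgs not in sys.path:
    sys.path.insert(0, site_pkgs)
```

This could only help for pure-Python dependencies, though; compiled C extensions cannot be loaded from inside a zip at all.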

Part of the problem, I think, lies in how to make pip install a package into a specific location.
The Airflow example uses --install-option="--install-lib=/path/", but that option is unsupported:

    Location-changing options found in --install-option: ['--install-lib']
    from command line. This configuration may cause unexpected behavior
    and is unsupported. pip 20.2 will remove support for this
    functionality. A possible replacement is using pip-level options like
    --user, --prefix, --root, and --target. You can find discussion regarding
    this at pypa/pip#7309.

Using --prefix leads to a structure like the one above, with the module-not-found error.

Using --target installs every package directly into the specified directory.
In this case I get a pandas-related error:

    C extension: No module named 'pandas._libs.tslibs.conversion' not built

I guess it's related to dynamic libraries that need to be available at the system level?
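I think that's exactly it: zipimport only handles .py/.pyc files, so a compiled extension (.so) inside the archive is invisible to the import system, while a pure-Python module next to it imports fine. A small demonstration (the module names are made up):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a pure-Python module and a fake C extension.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as z:
    z.writestr("puremod.py", "OK = True\n")  # importable from a zip
    z.writestr("fakeext.so", b"")            # never importable from a zip

sys.path.insert(0, zip_path)

import puremod                               # works: zipimport reads .py files
assert puremod.OK

try:
    import fakeext                           # fails: .so files are ignored
except ModuleNotFoundError:
    print("C extensions cannot be imported from a zip")
```

Which would match the pandas error, since pandas._libs.tslibs.conversion is a compiled extension.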

I really don't know how to solve this.

Thanks
