Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-4032]Support staging binary distributions of dependency packages #16633

Merged
merged 23 commits into from Feb 11, 2022

Conversation

AnandInguva
Copy link
Contributor

@AnandInguva AnandInguva commented Jan 27, 2022

This PR changes how we stage packages mentioned in the --requirements_file. Prior to this, the default behavior was to download the source distributions of the packages specified in the --requirements_file during staging. The current behavior would be to download binary distributions(wheels) for packages specified in the --requirements_file without their sub-dependencies. To follow the prior behavior, set --requirements_cache_only_sources. Staging of the packages/pipeline dependencies can be skipped by setting --requirements_cache=skip during job submission.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

ValidatesRunner compliance status (on master branch)

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- Build Status Build Status Build Status Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Python --- Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status ---
XLang Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status ---

Examples testing status on various runners

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- --- --- --- --- --- ---
Java --- Build Status
Build Status
Build Status
--- --- --- --- ---
Python --- --- --- --- --- --- ---
XLang --- --- --- --- --- --- ---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go Java Python
Build Status Build Status Build Status
Build Status
Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@codecov
Copy link

codecov bot commented Jan 27, 2022

Codecov Report

Merging #16633 (9b3a04a) into master (fee1528) will increase coverage by 8.94%.
The diff coverage is 53.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16633      +/-   ##
==========================================
+ Coverage   74.64%   83.59%   +8.94%     
==========================================
  Files         655      452     -203     
  Lines       82325    62274   -20051     
==========================================
- Hits        61454    52058    -9396     
+ Misses      19871    10216    -9655     
+ Partials     1000        0    -1000     
Impacted Files Coverage Δ
...s/python/apache_beam/runners/portability/stager.py 85.54% <51.85%> (-2.82%) ⬇️
...dks/python/apache_beam/options/pipeline_options.py 95.28% <100.00%> (+0.01%) ⬆️
.../python/apache_beam/transforms/periodicsequence.py 96.72% <0.00%> (-1.64%) ⬇️
sdks/python/apache_beam/internal/metrics/metric.py 90.00% <0.00%> (-1.00%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py 93.51% <0.00%> (-0.25%) ⬇️
sdks/go/pkg/beam/core/runtime/exec/fullvalue.go
sdks/go/pkg/beam/core/graph/node.go
sdks/go/pkg/beam/core/runtime/genx/genx.go
sdks/go/pkg/beam/core/graph/coder/map.go
sdks/go/pkg/beam/core/metrics/dumper.go
... and 199 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fee1528...9b3a04a. Read the comment docs.

@AnandInguva
Copy link
Contributor Author

R: @tvalentyn

@AnandInguva
Copy link
Contributor Author

Run Python PreCommit

@AnandInguva
Copy link
Contributor Author

cc @robertwb

@@ -51,7 +51,7 @@ func pipInstallRequirements(files []string, dir, name string) error {
// used without following their dependencies.
args := []string{"install", "-r", filepath.Join(dir, name), "--disable-pip-version-check", "--no-index", "--no-deps", "--find-links", dir}
if err := execx.Execute(pip, args...); err != nil {
return err
fmt.Println("Requirements cache " + dir + " is empty. Downloading packages from PyPI")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Printing this in case if the first Pip call fails(when user skips populating requirements cache) to provide more context if user looks into the logs

sdks/python/apache_beam/options/pipeline_options.py Outdated Show resolved Hide resolved
'the requirements file using the --requirements_file option.'))
'the requirements file using the --requirements_file option.'
'If you want to skip populating requirements cache, please '
'specify --requirements_cache skip. This would install all'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the last sentence.

'--requirements_cache_only_sources',
action='store_true',
help=(
'Enable this flag to populate requirements cache with Source'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'Enable this flag to populate requirements cache with Source'
'Enable this flag to populate requirements cache only with Source'

'Enable this flag to populate requirements cache with Source'
'distributions(sdists) of the dependencies mentioned in the '
'--requirements_file'
'Note: This step would slow down the worker startup time'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note (BEAM-XXXX): This flag may significantly slow down the pipeline submission. It is added to preserve the requirements cache behavior prior to 2.37.0 and will likely be removed in future releases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added BEAM-4032, since it has the discussion thread, and comments on this issue

if setup_options.requirements_cache is None else
os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache') if
(setup_options.requirements_cache is None) and
(setup_options.requirements_cache != SKIP_REQUIREMENTS_CACHE) else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and... part seems not necessary

# TODO(anandinguva): When https://github.com/pypa/pip/issues/10760 is
# addressed download wheel based on glib version in Beam's Python Base image
pip_version = pkg_resources.get_distribution('pip').version
if float(pip_version[0:4]) >= 19.3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not reliable, for example it would fail on pip < 10; let's use an off-the-shelf helper that knows how to parse
and compare versions, see: pkg_resources.parse_version or packaging.version.parse

os.path.join(temp_dir, 'nothing.tar.gz'), 'Fake tarball')
else:
self.create_temp_file(
os.path.join(temp_dir, 'nothing.tar.gz'), 'Fake tarball')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appears in both branches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we download whls and if a whl is not present on PyPI, we download sdist. That is the reason why it is present in both branches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that you can add nothing.tar.gz outside the if since this code is in both branches. But feel free to keep as is if you think it's more clear this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -342,6 +343,47 @@ def test_with_requirements_file_and_cache(self):
self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'abc.txt')))
self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'def.txt')))

def test_with_requirements_file_skipping_cache(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's possible to rename the test method to show what's being tested and expected outcome, this makes test cases easier to follow. for example:

test_requirements_cache_not_populated_when_cache_disabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


resources = self.stager.create_and_stage_job_resources(
options,
populate_requirements_cache=self.populate_requirements_cache,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, you could pass a mock method and verify that it was never called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -723,6 +765,65 @@ def test_remove_dependency_from_requirements(self):
self.assertEqual(['apache_beam\n', 'avro-python3\n', 'numpy\n'],
sorted(lines))

def _create_file(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _create_file(
def _populate_requitements_cache_fake(

@AnandInguva
Copy link
Contributor Author

PTAL: @tvalentyn

@AnandInguva
Copy link
Contributor Author

Run Python 3.8 PostCommit

1 similar comment
@AnandInguva
Copy link
Contributor Author

Run Python 3.8 PostCommit

# See the License for the specific language governing permissions and
# limitations under the License.
###############################################################################
numpy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example already requires numpy in

Given that you are modifying the example, users may be confused why numpy needs to be specified twice: in a requirements.txt file and in setup.py.

Let's try to use a different pipeline for this test. You could create an example where a requirements file is used, or just a test pipeline that does something trivial, like checking whether a particular library is installed and doesn't do any data processing.

This reverts commit d95f2d0.
@AnandInguva
Copy link
Contributor Author

PTAL @tvalentyn. Added a trivial Integration test to check if the specified packages in the requirements.txt are installed on the worker container.

  1. Assert passes if package is installed in the container.
  2. Assert fails if the package is not installed in the container(happens when the package is not specified in the requirements file.)

I will resolve conflicts once this code is good to go

Creates requirements text for the integration test during runtime
@AnandInguva
Copy link
Contributor Author

Run Python 3.8 PostCommit

@tvalentyn
Copy link
Contributor

@AnandInguva
Copy link
Contributor Author

https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/12436/

Yes, I already created a commit on that locally. I used to have a lint checker installed but later I disabled it for some reason. I have to enable it now(it would make my life a lot easier)

@tvalentyn
Copy link
Contributor

Portable precommit fails due to a side effect from my testing on #16008. Sorry about that.

@tvalentyn
Copy link
Contributor

Run Portable_Python PreCommit

@AnandInguva
Copy link
Contributor Author

PTAL @tvalentyn. Thanks

CHANGES.md Outdated Show resolved Hide resolved
@github-actions github-actions bot removed the examples label Feb 10, 2022
@AnandInguva
Copy link
Contributor Author

Run Python 3.8 PostCommit

@AnandInguva
Copy link
Contributor Author

PTAL @tvalentyn

@tvalentyn tvalentyn merged commit 7975127 into apache:master Feb 11, 2022
@tvalentyn
Copy link
Contributor

Merging, thanks! Please update the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants