Skip to content

Increase run-complete-tests parallelism to 4 which is the combinations count#35828

Closed
hussein-awala wants to merge 2 commits intoapache:mainfrom
hussein-awala:ci/complete_tests_parallelism
Closed

Increase run-complete-tests parallelism to 4 which is the combinations count#35828
hussein-awala wants to merge 2 commits intoapache:mainfrom
hussein-awala:ci/complete_tests_parallelism

Conversation

@hussein-awala
Copy link
Member

Currently, the test runs with 3 parallel processes, but there are 4 tests. The last test starts when the 3 other tests finish, which leads to resources and time wasting. Running the 4 tests in parallel will help to reduce the overall time.

@hussein-awala hussein-awala added the full tests needed We need to run full set of tests for this PR to merge label Nov 23, 2023
@potiuk
Copy link
Member

potiuk commented Nov 23, 2023

Few comments:

Let's see. Tried that before but it was introducing a lot of random failures (likely 4 k8s simply use too much memory).

Likely we could think about switching a bit memory utilisation on our self-hosted runners (we allocate a lot of memory for tmpfs to do things more stable/faster - so whole docker installation runs using docker backed by tmpfs - not by "regular" filesystems on our self-hosted runners. Currently 85% of memory (of 64 GB) is used as tmpfs and only remaining 15% is used for regular memory of the system. Maybe some optimisations of images and size could allow us to decrease the % memory used for filesystem and then running 4 k8s will be fine ? It's difficult to play with however because that setting is per image and we will have to switch and test (ideally during weekend/hours when no other committers do stuf).

https://github.com/apache/airflow-ci-infra/blob/main/github-runner-ami/packer/files/mounts_setup.sh

One more comment - doing it this way is good for experimenting, but not good for merging. Hardcoding parallelism should not happen - because when it runs on self-hosted runners it's fine to run 3 but for public runners even 2 is too much - so we have special parallelism for k8s tests that has a bit weird formula to fine-tune behaviour of automated parallelism calculation

option_parallelism_cluster = click.option(
    "--parallelism",
    help="Maximum number of processes to use while running the operation in parallel for cluster operations.",
    type=click.IntRange(1, max(1, (mp.cpu_count() + 1) // 3) if not generating_command_images() else 4),
    default=max(1, (mp.cpu_count() + 1) // 3) if not generating_command_images() else 2,
    envvar="PARALLELISM",
    show_default=True,
)

So eventually if we want to change parallelism - we will have to change default calculation here, or have a conditional PARALLELISM setting in CI depending on type of worker (I'd very much prefer the default calculation as conditionals in GH aations CI tend to look strange and be prone to errors.

@hussein-awala
Copy link
Member Author

doing it this way is good for experimenting, but not good for merging

Yes, it's just to test, then we can discuss the implementation if we find some benefits.

@hussein-awala hussein-awala marked this pull request as draft November 23, 2023 21:43
@potiuk
Copy link
Member

potiuk commented Nov 23, 2023

doing it this way is good for experimenting, but not good for merging

Yes, it's just to test, then we can discuss the implementation if we find some benefits.

And yeah. I would very much love to optimse these tests - they are now on critical path after all the other improvements :)

@potiuk
Copy link
Member

potiuk commented Nov 23, 2023

Yeah... same case - a lot of "connection reset by peer" - which likely means killed by lack of memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:dev-tools full tests needed We need to run full set of tests for this PR to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants