Skip to content

Commit

Permalink
feat: Add parallelism to major chunk of hooks. Check Parallelism se…
Browse files Browse the repository at this point in the history
…ction in README (#620)

### Reasoning 

We have a GH workflow that runs lockflies updates every week (implementation and reasoning [here](https://grem1.in/post/terraform-lockfiles-maxymvlasov/)). 
It usually takes from 2h 30min to  3h 15min. That was fine for us, till we found that our GH runners, based on AWS EC2s, started silently failing after 30min "without recent logs", and that was fixed by crutch which sends a dummy log every 10min.

However, during the debugging, I spent some time describing why hooks were not utilizing all the provided resources. 
And that means a waste of time and money, not only for that corner case but for every huge commit, which can cause opting out by `git commit -n` of using hooks locally for changes that affect many directories.

### Description of your changes

* Add per-hook `--parallelism-limit` setting to `--hook-config`. Defaults to `number of logical CPUs - 1`
* As quick tests show, ~5% of stacks face race condition problem, no matter if any locking mechanism exists or dirs try to init in parallel. I suppose the lock failed as it uses disk when hooks run in memory, so the creation of the lock can take some time as there bunch of caches between Mem and Disk. These milliseconds are enough to allow running a few `t init` in parallel.
* Final implementation uses a retry mechanism for cases when race condition failed to `t init` directory. 

In quick tests, I can say that on big changes:
* Up to 2000% speed increase for `terraform_validate`, and up to 500% - for other affected hooks. 
* When `--parallelism-limit=1` I observed an insignificant increase in time (about 5-10%) compared to v1.86.0 which has no parallelism at all. This may be the cost of maintaining parallelism or the result of external factors since the tests were not conducted in a vacuum.

For small changes, improvements are less significant.

-----
Other significant findings/solutions included to this PR:


* feat: Investigate and fix issue with wrong CPU count for containers (#623)

So, I found that `nproc` always shows how many CPUs available is. K8s "limits" and docker `--cpus` are throttling mechanisms, which do not hide the visibility of all cores.
There are a few workarounds, but  IMO, it is better to implement checks for that than do them

>Workaround for docker - set `--cpuset-cpus`
>Workaraund for K8s - somehow deal with [kubelet static CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#cpu-management-policies), as [recommend in Reddit](https://news.ycombinator.com/item?id=25224714)

* Send all "colorify" logs through stderr, as make able to add user-facing-logs in functions that also need to return same value to the function-caller. Needed for `common::get_cpu_num` err_msg show up

------

* Count --parallelism-ci-cpu-cores only in edge-cases

Details:
#620 (review)

---------

Co-authored-by: George L. Yermulnik <yz@yz.kiev.ua>
  • Loading branch information
MaxymVlasov and yermulnik committed Feb 17, 2024
1 parent d7ec933 commit 6c6eca4
Show file tree
Hide file tree
Showing 15 changed files with 292 additions and 41 deletions.
13 changes: 7 additions & 6 deletions .github/ISSUE_TEMPLATE/bug_report_local_install.md
Expand Up @@ -81,16 +81,17 @@ pre-commit --version 2>/dev/null || echo "pre-commit SKIPPE
terraform --version | head -n 1 2>/dev/null || echo "terraform SKIPPED"
python --version 2>/dev/null || echo "python SKIPPED"
python3 --version 2>/dev/null || echo "python3 SKIPPED"
echo -n "checkov " && checkov --version 2>/dev/null || echo "checkov SKIPPED"
echo -n "checkov " && checkov --version 2>/dev/null || echo "SKIPPED"
infracost --version 2>/dev/null || echo "infracost SKIPPED"
terraform-docs --version 2>/dev/null || echo "terraform-docs SKIPPED"
terragrunt --version 2>/dev/null || echo "terragrunt SKIPPED"
echo -n "terrascan " && terrascan version 2>/dev/null || echo "terrascan SKIPPED"
echo -n "terrascan " && terrascan version 2>/dev/null || echo "SKIPPED"
tflint --version 2>/dev/null || echo "tflint SKIPPED"
echo -n "tfsec " && tfsec --version 2>/dev/null || echo "tfsec SKIPPED"
echo -n "trivy " && trivy --version 2>/dev/null || echo "tfsec SKIPPED"
echo -n "tfupdate " && tfupdate --version 2>/dev/null || echo "tfupdate SKIPPED"
echo -n "hcledit " && hcledit version 2>/dev/null || echo "hcledit SKIPPED"
echo -n "tfsec " && tfsec --version 2>/dev/null || echo "SKIPPED"
echo -n "trivy " && trivy --version 2>/dev/null || echo "SKIPPED"
echo -n "tfupdate " && tfupdate --version 2>/dev/null || echo "SKIPPED"
echo -n "hcledit " && hcledit version 2>/dev/null || echo "SKIPPED"
echo -n "flock " && flock --version 2>/dev/null || echo "SKIPPED"
EOF
-->
Expand Down
1 change: 1 addition & 0 deletions Dockerfile
Expand Up @@ -173,6 +173,7 @@ RUN . /.env && \
RUN . /.env && \
F=tools_versions_info && \
pre-commit --version >> $F && \
echo "flock $(flock 2>&1 | head -n 1)" >> $F && \
./terraform --version | head -n 1 >> $F && \
(if [ "$CHECKOV_VERSION" != "false" ]; then echo "checkov $(checkov --version)" >> $F; else echo "checkov SKIPPED" >> $F ; fi) && \
(if [ "$INFRACOST_VERSION" != "false" ]; then echo "$(./infracost --version)" >> $F; else echo "infracost SKIPPED" >> $F ; fi) && \
Expand Down
65 changes: 65 additions & 0 deletions README.md
Expand Up @@ -38,6 +38,7 @@ If you are using `pre-commit-terraform` already or want to support its developme
* [All hooks: Usage of environment variables in `--args`](#all-hooks-usage-of-environment-variables-in---args)
* [All hooks: Set env vars inside hook at runtime](#all-hooks-set-env-vars-inside-hook-at-runtime)
* [All hooks: Disable color output](#all-hooks-disable-color-output)
* [Many hooks: Parallelism](#many-hooks-parallelism)
* [checkov (deprecated) and terraform\_checkov](#checkov-deprecated-and-terraform_checkov)
* [infracost\_breakdown](#infracost_breakdown)
* [terraform\_docs](#terraform_docs)
Expand Down Expand Up @@ -338,6 +339,70 @@ To disable color output for all hooks, set `PRE_COMMIT_COLOR=never` var. Eg:
PRE_COMMIT_COLOR=never pre-commit run
```

### Many hooks: Parallelism

> All, except deprecated hooks: `checkov`, `terraform_docs_replace` and hooks which can't be paralleled this way: `infracost_breakdown`, `terraform_wrapper_module_for_each`.
> Also, there's a chance that parallelism have no effect on `terragrunt_fmt` and `terragrunt_validate` hooks
By default, parallelism is set to `number of logical CPUs - 1`.
If you'd like to disable parallelism, set it to `1`

```yaml
- id: terragrunt_validate
args:
- --hook-config=--parallelism-limit=1
```

In the same way you can set it to any positive integer.

If you'd like to set parallelism value relative to number of CPU logical cores - provide valid Bash arithmetic expression and use `CPU` as a reference to the number of CPU logical cores


```yaml
- id: terraform_providers_lock
args:
- --hook-config=--parallelism-limit=CPU*4
```

> [!TIP]
> <details><summary>Info useful for parallelism fine-tunning</summary>
>
> <br>
> Tests below were run on repo with 45 Terraform dirs on laptop with 16 CPUs, SSD and 1Gbit/s network. Laptop was slightly used in the process.
>
> Observed results may vary greatly depending on your repo structure, machine characteristics and their usage.
>
> If during fine-tuning you'll find that your results are very different from provided below and you think that this data could help someone else - feel free to send PR.
>
>
> | Hook | Most used resource | Comparison of optimization results / Notes |
> | ------------------------------------------------------------------------------ | ---------------------------------- | --------------------------------------------------------------- |
> | terraform_checkov | CPU heavy | - |
> | terraform_fmt | CPU heavy | - |
> | terraform_providers_lock (3 platforms,<br>`--mode=always-regenerate-lockfile`) | Network & Disk heavy | `defaults (CPU-1)` - 3m 39s; `CPU*2` - 3m 19s; `CPU*4` - 2m 56s |
> | terraform_tflint | CPU heavy | - |
> | terraform_tfsec | CPU heavy | - |
> | terraform_trivy | CPU moderate | `defaults (CPU-1)` - 32s; `CPU*2` - 30s; `CPU*4` - 31s |
> | terraform_validate (t validate only) | CPU heavy | - |
> | terraform_validate (t init + t validate) | Network & Disk heavy, CPU moderate | `defaults (CPU-1)` - 1m 30s; `CPU*2` - 1m 25s; `CPU*4` - 1m 41s |
> | terragrunt_fmt | CPU heavy | N/A? need more info from TG users |
> | terragrunt_validate | CPU heavy | N/A? need more info from TG users |
> | terrascan | CPU moderate-heavy | `defaults (CPU-1)` - 8s; `CPU*2` - 6s |
> | tfupdate | Disk/Network? | too quick in any settings. More info needed |
>
>
> </details>


```yaml
args:
- --hook-config=--parallelism-ci-cpu-cores=N
```

If you don't see code above in your `pre-commit-config.yaml` or logs - you don't need it.
`--parallelism-ci-cpu-cores` used only in edge cases and is ignored in other situations. Check out its usage in [hooks/_common.sh](hooks/_common.sh)

### checkov (deprecated) and terraform_checkov

> `checkov` hook is deprecated, please use `terraform_checkov`.
Expand Down
195 changes: 174 additions & 21 deletions hooks/_common.sh
Expand Up @@ -170,6 +170,75 @@ function common::is_hook_run_on_whole_repo {
fi
}

#######################################################################
# Get the number of CPU logical cores available for pre-commit to use
# Arguments:
# parallelism_ci_cpu_cores (string) Used in edge cases when number of
# CPU cores can't be derived automatically
# Outputs:
# Returns number of CPU logical cores, rounded down to nearest integer
#######################################################################
function common::get_cpu_num {
local -r parallelism_ci_cpu_cores=$1

local millicpu

if [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then
# Inside K8s pod or DinD in K8s
millicpu=$(< /sys/fs/cgroup/cpu/cpu.cfs_quota_us)

if [[ $millicpu -eq -1 ]]; then
# K8s no limits or in DinD
if [[ -n $parallelism_ci_cpu_cores ]]; then
if [[ ! $parallelism_ci_cpu_cores =~ ^[[:digit:]]+$ ]]; then
common::colorify "yellow" "--parallelism-ci-cpu-cores set to" \
"'$parallelism_ci_cpu_cores' which is not a positive integer.\n" \
"To avoid possible harm, parallelism is disabled.\n" \
"To re-enable it, change corresponding value in config to positive integer"

echo 1
return
fi

echo "$parallelism_ci_cpu_cores"
return
fi

common::colorify "yellow" "Unable to derive number of available CPU cores.\n" \
"Running inside K8s pod without limits or inside DinD without limits propagation.\n" \
"To avoid possible harm, parallelism is disabled.\n" \
"To re-enable it, set corresponding limits, or set the following for the current hook:\n" \
" args:\n" \
" - --hook-config=--parallelism-ci-cpu-cores=N\n" \
"where N is the number of CPU cores to allocate to pre-commit."

echo 1
return
fi

echo $((millicpu / 1000))
return
fi

if [[ -f /sys/fs/cgroup/cpu.max ]]; then
# Inside Linux (Docker?) container
millicpu=$(cut -d' ' -f1 /sys/fs/cgroup/cpu.max)

if [[ $millicpu == max ]]; then
# No limits
nproc 2> /dev/null || echo 1
return
fi

echo $((millicpu / 1000))
return
fi

# On host machine or any other case
# `nproc` - Linux/FreeBSD, `sysctl -n hw.ncpu` - macOS/BSD, `echo 1` - fallback
nproc 2> /dev/null || sysctl -n hw.ncpu 2> /dev/null || echo 1
}

#######################################################################
# Hook execution boilerplate logic which is common to hooks, that run
# on per dir basis.
Expand Down Expand Up @@ -219,10 +288,16 @@ function common::per_dir_hook {

# Lookup hook-config for modifiers that impact common behavior
local change_dir_in_unique_part=false

local parallelism_limit
IFS=";" read -r -a configs <<< "${HOOK_CONFIG[*]}"
for c in "${configs[@]}"; do
IFS="=" read -r -a config <<< "$c"
key=${config[0]}

# $hook_config receives string like '--foo=bar; --baz=4;' etc.
# It gets split by `;` into array, which we're parsing here ('--foo=bar' ' --baz=4')
# Next line removes leading spaces, to support >1 `--hook-config` args
key="${config[0]## }"
value=${config[1]}

case $key in
Expand All @@ -233,32 +308,82 @@ function common::per_dir_hook {
change_dir_in_unique_part="delegate_chdir"
fi
;;
--parallelism-limit)
# this flag will limit the number of parallel processes
parallelism_limit="$value"
;;
--parallelism-ci-cpu-cores)
# Used in edge cases when number of CPU cores can't be derived automatically
parallelism_ci_cpu_cores="$value"
;;
esac
done

CPU=$(common::get_cpu_num "$parallelism_ci_cpu_cores")
# parallelism_limit can include reference to 'CPU' variable
local parallelism_disabled=false

if [[ ! $parallelism_limit ]]; then
# Could evaluate to 0
parallelism_limit=$((CPU - 1))
elif [[ $parallelism_limit -eq 1 ]]; then
parallelism_disabled=true
else
# Could evaluate to <1
parallelism_limit=$((parallelism_limit))
fi

if [[ $parallelism_limit -lt 1 ]]; then
# Suppress warning for edge cases when only 1 CPU available or
# when `--parallelism-ci-cpu-cores=1` and `--parallelism_limit` unset
if [[ $CPU -ne 1 ]]; then

common::colorify "yellow" "Observed Parallelism limit '$parallelism_limit'." \
"To avoid possible harm, parallelism set to '1'"
fi

parallelism_limit=1
parallelism_disabled=true
fi

local pids=()

mapfile -t dir_paths_unique < <(echo "${dir_paths[@]}" | tr ' ' '\n' | sort -u)
local length=${#dir_paths_unique[@]}
local last_index=$((${#dir_paths_unique[@]} - 1))

local final_exit_code=0
# preserve errexit status
shopt -qo errexit && ERREXIT_IS_SET=true
# allow hook to continue if exit_code is greater than 0
set +e
local final_exit_code=0

# run hook for each path
for dir_path in $(echo "${dir_paths[*]}" | tr ' ' '\n' | sort -u); do
dir_path="${dir_path//__REPLACED__SPACE__/ }"
# run hook for each path in parallel
for ((i = 0; i < length; i++)); do
dir_path="${dir_paths_unique[$i]//__REPLACED__SPACE__/ }"
{
if [[ $change_dir_in_unique_part == false ]]; then
pushd "$dir_path" > /dev/null
fi

if [[ $change_dir_in_unique_part == false ]]; then
pushd "$dir_path" > /dev/null || continue
fi
per_dir_hook_unique_part "$dir_path" "$change_dir_in_unique_part" "$parallelism_disabled" "${args[@]}"
} &
pids+=("$!")

per_dir_hook_unique_part "$dir_path" "$change_dir_in_unique_part" "${args[@]}"
if [[ $parallelism_disabled == true ]] ||
[[ $i -ne 0 && $((i % parallelism_limit)) -eq 0 ]] || # don't stop on first iteration when parallelism_limit>1
[[ $i -eq $last_index ]]; then

local exit_code=$?
if [ $exit_code -ne 0 ]; then
final_exit_code=$exit_code
fi
for pid in "${pids[@]}"; do
# Get the exit code from the background process
local exit_code=0
wait "$pid" || exit_code=$?

if [[ $change_dir_in_unique_part == false ]]; then
popd > /dev/null
if [ $exit_code -ne 0 ]; then
final_exit_code=$exit_code
fi
done
# Reset pids for next iteration
unset pids
fi

done
Expand Down Expand Up @@ -291,14 +416,15 @@ function common::colorify {

# Params start #
local COLOR="${!1}"
local -r TEXT=$2
shift
local -r TEXT="$*"
# Params end #

if [ "$PRE_COMMIT_COLOR" = "never" ]; then
COLOR=$RESET
fi

echo -e "${COLOR}${TEXT}${RESET}"
echo -e "${COLOR}${TEXT}${RESET}" >&2
}

#######################################################################
Expand All @@ -307,15 +433,19 @@ function common::colorify {
# command_name (string) command that will tun after successful init
# dir_path (string) PATH to dir relative to git repo root.
# Can be used in error logging
# parallelism_disabled (bool) if true - skip lock mechanism
# Globals (init and populate):
# TF_INIT_ARGS (array) arguments for `terraform init` command
# TF_PLUGIN_CACHE_DIR (string) user defined env var with name of the directory
# which can't be R/W concurrently
# Outputs:
# If failed - print out terraform init output
#######################################################################
# TODO: v2.0: Move it inside terraform_validate.sh
function common::terraform_init {
local -r command_name=$1
local -r dir_path=$2
local -r parallelism_disabled=$3

local exit_code=0
local init_output
Expand All @@ -325,9 +455,32 @@ function common::terraform_init {
TF_INIT_ARGS+=("-no-color")
fi

if [ ! -d .terraform/modules ] || [ ! -d .terraform/providers ]; then
init_output=$(terraform init -backend=false "${TF_INIT_ARGS[@]}" 2>&1)
exit_code=$?
recreate_modules=$([[ ! -d .terraform/modules ]] && echo true || echo false)
recreate_providers=$([[ ! -d .terraform/providers ]] && echo true || echo false)

if [[ $recreate_modules == true || $recreate_providers == true ]]; then
# Plugin cache dir can't be written concurrently or read during write
# https://github.com/hashicorp/terraform/issues/31964
if [[ -z $TF_PLUGIN_CACHE_DIR || $parallelism_disabled == true ]]; then
init_output=$(terraform init -backend=false "${TF_INIT_ARGS[@]}" 2>&1)
exit_code=$?
else
# Locking just doesn't work, and the below works quicker instead. Details:
# https://github.com/hashicorp/terraform/issues/31964#issuecomment-1939869453
for i in {1..10}; do
init_output=$(terraform init -backend=false "${TF_INIT_ARGS[@]}" 2>&1)
exit_code=$?

if [ $exit_code -eq 0 ]; then
break
fi
sleep 1

common::colorify "green" "Race condition detected. Retrying 'terraform init' command [retry $i]: $dir_path."
[[ $recreate_modules == true ]] && rm -rf .terraform/modules
[[ $recreate_providers == true ]] && rm -rf .terraform/providers
done
fi

if [ $exit_code -ne 0 ]; then
common::colorify "red" "'terraform init' failed, '$command_name' skipped: $dir_path"
Expand Down
5 changes: 4 additions & 1 deletion hooks/terraform_checkov.sh
Expand Up @@ -34,6 +34,7 @@ function main {
# change_dir_in_unique_part (string/false) Modifier which creates
# possibilities to use non-common chdir strategies.
# Availability depends on hook.
# parallelism_disabled (bool) if true - skip lock mechanism
# args (array) arguments that configure wrapped tool behavior
# Outputs:
# If failed - print out hook checks status
Expand All @@ -43,7 +44,9 @@ function per_dir_hook_unique_part {
local -r dir_path="$1"
# shellcheck disable=SC2034 # Unused var.
local -r change_dir_in_unique_part="$2"
shift 2
# shellcheck disable=SC2034 # Unused var.
local -r parallelism_disabled="$3"
shift 3
local -a -r args=("$@")

checkov -d . "${args[@]}"
Expand Down

0 comments on commit 6c6eca4

Please sign in to comment.