add juicefs diagnose script and fix alluxio diagnose script #2156

Merged
merged 5 commits into from
Sep 28, 2022
Changes from 1 commit
9 changes: 8 additions & 1 deletion docs/en/userguide/troubleshooting.md
@@ -1,6 +1,13 @@
# Troubleshooting

You may encounter various problems during installation or development in Fluid. Usually, logs are useful for debugging. But the Runtime containers where Fluid's underlying Distributed Cache Engine is running, are distributed on different hosts under distributed environment, so it's quite annoying to collect these logs one by one. To make this troublesome work easier, we provided a [shell script](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid.sh) to help users collect logs more quickly. This document describes how to use that script.
You may run into various problems while installing or developing with Fluid. Logs are usually the most useful starting point for debugging, but the Runtime containers that run Fluid's underlying distributed cache engine are spread across different hosts, so collecting their logs one by one is tedious.
To make this easier, Fluid provides shell scripts that collect the logs for you. This document describes how to use them.

Fluid provides a separate diagnostic script for each Runtime, but all of them are used the same way. Download the script for the Runtime you use:

Alluxio: [diagnose-fluid.sh](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid.sh)
Collaborator
I think it's better to rename diagnose-fluid.sh to diagnose-fluid-alluxio.sh.

Member Author
done

JuiceFS: [diagnose-fluid-juicefs.sh](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid-juicefs.sh)
GooseFS: [diagnose-fluid-goosefs.sh](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid-goosefs.sh)
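As a rough illustration of the list above, the sketch below maps a runtime type to its script and prints the download and collect commands instead of running them. The function name `diagnose_script_for` and the runtime name `demo` are made up for this example. The `collect`, `--name` and `--namespace` arguments come from the usage text of the JuiceFS script in this PR; the other scripts are assumed to accept the same ones, since the docs say usage is identical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a runtime type to the script file name listed in the docs.
diagnose_script_for() {
  case "$1" in
    alluxio) echo "diagnose-fluid.sh" ;;
    juicefs) echo "diagnose-fluid-juicefs.sh" ;;
    goosefs) echo "diagnose-fluid-goosefs.sh" ;;
    *)
      echo "unknown runtime: $1" >&2
      return 1
      ;;
  esac
}

base_url="https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools"
script="$(diagnose_script_for juicefs)"

# Print the commands rather than executing them, so the sketch has no side effects.
echo "curl -fsSLO ${base_url}/${script}"
echo "bash ${script} collect --name demo --namespace default"
```

Printing the commands keeps the example safe to run anywhere; drop the `echo` wrappers to download and run the script against a real cluster.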

## Diagnose Fluid using Script

9 changes: 8 additions & 1 deletion docs/zh/userguide/troubleshooting.md
@@ -1,6 +1,13 @@
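# Fluid Troubleshooting

You may run into various problems while deploying or developing Fluid, and logs can help locate the cause. In a distributed environment, however, Fluid's underlying distributed cache engine (Runtime) runs in containers on different hosts, so collecting these containers' logs by hand is inefficient. Fluid therefore provides the shell script [diagnose-fluid.sh](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid.sh) to help users quickly collect logs from the Fluid system and the Runtime containers.
You may run into various problems while deploying or developing Fluid, and logs can help locate the cause. In a distributed environment, however, Fluid's underlying distributed cache engine (Runtime) runs in containers on different hosts, so collecting these containers' logs by hand is inefficient.
Fluid therefore provides shell scripts to help users quickly collect logs from the Fluid system and the Runtime containers.

Fluid provides a separate diagnostic script for each Runtime, but all of them are used the same way. Download the diagnostic script for the Runtime you use: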

Alluxio: [diagnose-fluid.sh](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid.sh)
JuiceFS: [diagnose-fluid-juicefs.sh](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid-juicefs.sh)
GooseFS: [diagnose-fluid-goosefs.sh](https://raw.githubusercontent.com/fluid-cloudnative/fluid/master/tools/diagnose-fluid-goosefs.sh)

## How to collect logs with the script

2 changes: 1 addition & 1 deletion tools/diagnose-fluid-goosefs.sh
@@ -77,7 +77,7 @@ kubectl_resource() {
# runtime, dataset, pv and pvc should have the same name
kubectl describe dataset --namespace ${runtime_namespace} ${runtime_name} &>"${diagnose_dir}/dataset-${runtime_name}.yaml" 2>&1
kubectl describe goosefsruntime --namespace ${runtime_namespace} ${name} &>"${diagnose_dir}/goosefsruntime-${runtime_name}.yaml" 2>&1
kubectl describe pv ${runtime_name} &>"${diagnose_dir}/pv-${runtime_name}.yaml" 2>&1
kubectl describe pv ${runtime_namespace}-${runtime_name} &>"${diagnose_dir}/pv-${runtime_name}.yaml" 2>&1
kubectl describe pvc ${runtime_name} --namespace ${runtime_namespace} &>"${diagnose_dir}/pvc-${runtime_name}.yaml" 2>&1
}
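
The one-line change in this hunk implies that the PersistentVolume created for a dataset is named `<namespace>-<name>` rather than just `<name>`. A minimal sketch of that naming rule follows; the helper `fluid_pv_name` and the runtime name `demo` are hypothetical and not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the PV name the updated script looks up,
# assuming the "<namespace>-<name>" convention implied by the diff.
fluid_pv_name() {
  local namespace="${1:-default}"
  local name="${2:?runtime name is required}"
  printf '%s-%s\n' "${namespace}" "${name}"
}

pv="$(fluid_pv_name default demo)"
# Print the command instead of running it, so no cluster is needed.
echo "kubectl describe pv ${pv}"
```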

154 changes: 154 additions & 0 deletions tools/diagnose-fluid-juicefs.sh
@@ -0,0 +1,154 @@
#!/usr/bin/env bash
set +x

print_usage() {
  echo "Usage:"
  echo " ./diagnose-fluid-juicefs.sh COMMAND [OPTIONS]"
  echo "COMMAND:"
  echo " help"
  echo " Display this help message."
  echo " collect"
  echo " Collect pods logs of controller and runtime."
  echo "OPTIONS:"
  echo " -r, --name name"
  echo " Set the name of runtime."
  echo " -n, --namespace name"
  echo " Set the namespace of runtime."
}

run() {
  echo
  echo "-----------------run $*------------------"
  timeout 10s "$@"
  if [ $? != 0 ]; then
    echo "failed to collect info: $*"
  fi
  echo "------------End of ${1}----------------"
}

helm_get() {
  run helm get all -n ${runtime_namespace} "${1}" &>"$diagnose_dir/helm-${1}.yaml"
}

pod_status() {
  local namespace=${1:-"default"}
  run kubectl get po -owide -n ${namespace} &>"$diagnose_dir/pods-${namespace}.log"
}

fluid_pod_logs() {
  core_component "${fluid_namespace}" "manager" "control-plane=juicefsruntime-controller"
  core_component "${fluid_namespace}" "manager" "control-plane=dataset-controller"
  core_component "${fluid_namespace}" "plugins" "app=csi-nodeplugin-fluid"
  core_component "${fluid_namespace}" "node-driver-registrar" "app=csi-nodeplugin-fluid"
}

runtime_pod_logs() {
  core_component "${runtime_namespace}" "juicefs-worker" "role=juicefs-worker" "release=${runtime_name}"
  core_component "${runtime_namespace}" "juicefs-fuse" "role=juicefs-fuse" "release=${runtime_name}"
}

core_component() {
  # namespace container selectors...
  local namespace="$1"
  local container="$2"
  shift 2
  local selectors="$*"
  local constrains
  local pods
  constrains=$(echo "${selectors}" | tr ' ' ',')
  if [[ -n ${constrains} ]]; then
    constrains="-l ${constrains}"
  fi
  mkdir -p "$diagnose_dir/pods-${namespace}"
  pods=$(kubectl get po -n ${namespace} "${constrains}" | awk '{print $1}' | grep -v NAME)
  for po in ${pods}; do
    kubectl logs "${po}" -c "$container" -n ${namespace} &>"$diagnose_dir/pods-${namespace}/${po}-${container}.log" 2>&1
  done
}

kubectl_resource() {
  # runtime, dataset and pvc share the same name; the pv is named <namespace>-<name>
  kubectl describe dataset --namespace ${runtime_namespace} ${runtime_name} &>"${diagnose_dir}/dataset-${runtime_name}.yaml" 2>&1
  # use runtime_name here: no variable called "name" is ever set in this script
  kubectl describe juicefsruntime --namespace ${runtime_namespace} ${runtime_name} &>"${diagnose_dir}/juicefsruntime-${runtime_name}.yaml" 2>&1
  kubectl describe pv ${runtime_namespace}-${runtime_name} &>"${diagnose_dir}/pv-${runtime_name}.yaml" 2>&1
  kubectl describe pvc ${runtime_name} --namespace ${runtime_namespace} &>"${diagnose_dir}/pvc-${runtime_name}.yaml" 2>&1
}

archive() {
  tar -zcvf "${current_dir}/diagnose_fluid_${timestamp}.tar.gz" "${diagnose_dir}"
  echo "please get diagnose_fluid_${timestamp}.tar.gz for diagnostics"
}

pd_collect() {
  echo "Start collecting, runtime-name=${runtime_name}, runtime-namespace=${runtime_namespace}"
  helm_get "${fluid_name}"
  helm_get "${runtime_name}"
  pod_status "${fluid_namespace}"
  pod_status "${runtime_namespace}"
  runtime_pod_logs
  fluid_pod_logs
  kubectl_resource
  archive
}

collect()
{
  # ensure params
  fluid_name=${fluid_name:-"fluid"}
  fluid_namespace=${fluid_namespace:-"fluid-system"}
  runtime_name=${runtime_name:?"the name of runtime must be set"}
  runtime_namespace=${runtime_namespace:-"default"}

  current_dir=$(pwd)
  timestamp=$(date +%s)
  diagnose_dir="/tmp/diagnose_fluid_${timestamp}"
  mkdir -p "$diagnose_dir"

  pd_collect
}

main() {
  if [[ $# -eq 0 ]]; then
    print_usage
    exit 1
  fi

  action="help"

  while [[ $# -gt 0 ]]; do
    case $1 in
      -h|--help|"-?")
        print_usage
        exit 0
        ;;
      collect|help)
        action=$1
        ;;
      -r|--name)
        runtime_name=$2
        shift
        ;;
      -n|--namespace)
        runtime_namespace=$2
        shift
        ;;
      *)
        echo "Error: unsupported option $1" >&2
        print_usage
        exit 1
        ;;
    esac
    shift
  done

  case ${action} in
    collect)
      collect
      ;;
    help)
      print_usage
      ;;
  esac
}

main "$@"
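
The `run` helper in this script wraps each command in `timeout 10s` and reports every non-zero exit with the same message. If it is useful to tell a hung command apart from one that simply failed, GNU `timeout` exits with status 124 when the limit is reached. The sketch below relies on that behaviour; `run_with_timeout` is a hypothetical name, not something this PR defines.

```shell
#!/usr/bin/env bash
# Hypothetical variant of run(): distinguish a timeout from an ordinary failure.
# Assumes GNU coreutils timeout, which exits with 124 when the limit is hit.
run_with_timeout() {
  local limit="$1"
  shift
  timeout "${limit}" "$@"
  local rc=$?
  if [ "${rc}" -eq 124 ]; then
    echo "timed out after ${limit}: $*" >&2
  elif [ "${rc}" -ne 0 ]; then
    echo "failed (exit ${rc}): $*" >&2
  fi
  return "${rc}"
}

# Example call; "true" stands in for a real kubectl or helm command.
run_with_timeout 10s true
```

Returning the original exit status lets callers decide whether a timeout should abort the collection or just be logged.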
2 changes: 1 addition & 1 deletion tools/diagnose-fluid.sh
@@ -77,7 +77,7 @@ kubectl_resource() {
# runtime, dataset, pv and pvc should have the same name
kubectl describe dataset --namespace ${runtime_namespace} ${runtime_name} &>"${diagnose_dir}/dataset-${runtime_name}.yaml" 2>&1
kubectl describe alluxioruntime --namespace ${runtime_namespace} ${name} &>"${diagnose_dir}/alluxioruntime-${runtime_name}.yaml" 2>&1
kubectl describe pv ${runtime_name} &>"${diagnose_dir}/pv-${runtime_name}.yaml" 2>&1
kubectl describe pv ${runtime_namespace}-${runtime_name} &>"${diagnose_dir}/pv-${runtime_name}.yaml" 2>&1
kubectl describe pvc ${runtime_name} --namespace ${runtime_namespace} &>"${diagnose_dir}/pvc-${runtime_name}.yaml" 2>&1
}
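
Each `kubectl describe` call in this function ends with `&>file 2>&1`. In bash, `&>file` already sends both stdout and stderr to the file, so the trailing `2>&1` is redundant, though harmless. Below is a small sketch of a helper that captures both streams in one place; `capture_to` is an illustrative name, and the demo command stands in for `kubectl describe`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and write stdout and stderr to one file.
capture_to() {
  local out="$1"
  shift
  "$@" >"${out}" 2>&1
}

tmp="$(mktemp)"
# Stand-in for: capture_to "${diagnose_dir}/pv-demo.yaml" kubectl describe pv default-demo
capture_to "${tmp}" bash -c 'echo to-stdout; echo to-stderr >&2'
cat "${tmp}"
rm -f "${tmp}"
```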
