
Wrong filesystem capacity detected #351

Closed
hugochinchilla opened this issue Oct 16, 2018 · 11 comments

@hugochinchilla

commented Oct 16, 2018

What happened:

I'm having problems with my workers reporting the wrong amount of disk capacity on the VM: I'm getting pod evictions while there is plenty of free space on the ephemeral disk. The VM has two disks, sda1 with the system install and sdb2 for ephemeral data.

kubelet seems to detect the size of sda1 as the amount of space available for ephemeral storage.

$ kubectl describe node
...
Capacity:
 cpu:                4
 ephemeral-storage:  3030944Ki
 hugepages-2Mi:      0
 memory:             32940952Ki
 pods:               110
...

Take the 3030944Ki and convert it to bytes: you get 3103686656. Searching for this number in the kubelet logs, I can see it is the exact size of the system partition (sda1). Docker is running with --graph /var/vcap/data/docker/docker, which is on sdb2, not sda1.

$ ps aux | grep dockerd
dockerd --bridge=cni0 --debug=false --default-ulimit=nofile=65536 --group vcap --graph /var/vcap/data/docker/docker --host unix:///var/vcap/sys/run/docker/docker.sock --icc=true --ip-forward=true --ip-masq=false --iptables=false --ipv6=false --log-level=error --log-opt=max-size=128m --log-opt=max-file=2 --mtu=1450 --pidfile /var/vcap/sys/run/docker/docker.pid --selinux-enabled=false --storage-driver=overlay2 --host tcp://127.0.0.1:4243 --userland-proxy=true

Here is the output of df (redacted):

$ df 

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1        3030944 2468276    391096  87% /
/dev/sdb2       20507260 4782664  14659844  25% /var/vcap/data
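As a quick sanity check (a sketch, using only the numbers above): converting df's 1K-block counts to bytes shows the node's reported ephemeral-storage capacity matches sda1 exactly, and sdb2 matches the other cAdvisor figure.

```shell
# 1 Ki = 1024 bytes; compare against the capacities cAdvisor logs for each device.
echo $((3030944 * 1024))   # 3103686656  -> /dev/sda1, the value kubelet reports
echo $((20507260 * 1024))  # 20999434240 -> /dev/sdb2, the value it should report
```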

And the relevant section from the kubelet log:

I1016 08:14:33.044417    7365 server.go:526] Successfully initialized cloud provider: "vsphere" from the config file: ""
I1016 08:14:33.044431    7365 server.go:772] cloud provider determined current node name to be 312c6d1e-82f3-470d-bf10-b5a04a66a0f4
I1016 08:14:33.047753    7365 manager.go:154] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct"
I1016 08:14:33.050821    7365 fs.go:142] Filesystem UUIDs: map[516a9d28-f7e2-4e30-b208-8b3473bcb46e:/dev/sda1 88894553-2ba1-45ba-8b20-9c93f150a743:/dev/sdb1 ca728db0-2dcc-4000-ace7-32d408f150f9:/dev/sdb2]
I1016 08:14:33.050843    7365 fs.go:143] Filesystem partitions: map[/dev/sda1:{mountpoint:/ major:8 minor:1 fsType:ext4 blockSize:0} /dev/sdb2:{mountpoint:/var/vcap/data major:8 minor:18 fsType:ext4 blockSize:0} tmpfs:{mountpoint:/run major:0 minor:22 fsType:tmpfs blockSize:0}]
I1016 08:14:33.054683    7365 manager.go:227] Machine: {NumCores:4 CpuFrequency:2799999 MemoryCapacity:33731534848 HugePages:[{PageSize:2048 NumPages:0}] MachineID:d546192840e7281040ccf3722d167fb7 SystemUUID:4213FECA-BE7C-8906-717A-8938D6F23FA0 BootID:ff5f408d-70dd-49e4-94aa-31ad22282f5a Filesystems:[{Device:tmpfs DeviceMajor:0 DeviceMinor:22 Capacity:3373154304 Type:vfs Inodes:4117619 HasInodes:true} {Device:/dev/sda1 DeviceMajor:8 DeviceMinor:1 Capacity:3103686656 Type:vfs Inodes:195840 HasInodes:true} {Device:/dev/sdb2 DeviceMajor:8 DeviceMinor:18 Capacity:20999434240 Type:vfs Inodes:1310720 HasInodes:true}] DiskMap:map[8:0:{Name:sda Major:8 Minor:0 Size:3221225472 Scheduler:cfq} 8:16:{Name:sdb Major:8 Minor:16 Size:21474836480 Scheduler:cfq}] NetworkDevices:[{Name:eth0 MacAddress:00:50:56:93:f7:5a Speed:10000 Mtu:1500}] Topology:[{Id:0 Memory:33731534848 Cores:[{Id:0 Threads:[0] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:2 Memory:0 Cores:[{Id:0 Threads:[1] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:4 Memory:0 Cores:[{Id:0 Threads:[2] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]} {Id:6 Memory:0 Cores:[{Id:0 Threads:[3] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None}

What you expected to happen:

kubelet to detect sdb2 as the correct storage for ephemeral data.

How to reproduce it (as minimally and precisely as possible):

Deploy CFCR on a vSphere cluster:

bosh -e bosh-1 deploy -d cfcr manifests/cfcr.yml \
  -o manifests/ops-files/misc/single-master.yml \
  -o manifests/ops-files/enable-bbr.yml \
  -o manifests/ops-files/add-hostname-to-master-certificate.yml \
  -o manifests/ops-files/iaas/vsphere/use-vm-extensions.yml \
  -o manifests/ops-files/iaas/vsphere/cloud-provider.yml \
  -o manifests/ops-files/disable-security-context-deny.yml \
  -v api-hostname=ss.kube.habitissimo.net \
  -v vcenter_master_user=admin \
  -v vcenter_master_password=jyKgcs4O \
  -v vcenter_ip=10.58.39.2 \
  -v vcenter_dc="Interxion MAD2" \
  -v vcenter_ds=habitissimo_premium \
  -v vcenter_vms=bosh-1-vms \
  -v director_uuid=9e8ffae0-1a12-400a-8346-486e7b2e08be

bosh -e bosh-1 -d cfcr run-errand apply-specs
bosh -e bosh-1 -d cfcr run-errand smoke-tests

Get the description of a worker node with kubectl describe node and look for ephemeral-storage under Capacity.
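A quicker check than reading the full describe output (a sketch against a live cluster; this uses standard kubectl JSONPath syntax):

```shell
# Print each node's reported ephemeral-storage capacity; an affected worker
# shows roughly 3Gi (sda1) instead of roughly 20Gi (sdb2).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.ephemeral-storage}{"\n"}{end}'
```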

Anything else we need to know?:

Environment:

  • Deployment Info (bosh -d <deployment> deployment):
Name  Release(s)       Stemcell(s)                                     Config(s)        Team(s)  
cfcr  bosh-dns/1.10.0  bosh-vsphere-esxi-ubuntu-xenial-go_agent/97.18  4 cloud/default  -  
      bpm/0.12.3                                                                          
      cfcr-etcd/1.5.0                                                                     
      docker/32.0.0                                                                       
      kubo/0.22.0   
  • Environment Info (bosh -e <environment> environment):
Name      bosh-1  
UUID      9e8ffae0-1a12-400a-8346-486e7b2e08be  
Version   268.0.1 (00000000)  
CPI       vsphere_cpi  
Features  compiled_package_cache: disabled  
          config_server: enabled  
          dns: disabled  
          snapshots: disabled  
User      admin  
  • Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"archive", BuildDate:"2018-08-20T08:45:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider (e.g. aws, gcp, vsphere): vsphere
@cf-gitbot

commented Oct 16, 2018

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/161254182

The labels on this github issue will be updated when the story is started.

@hugochinchilla

Author

commented Oct 16, 2018

May be related to kubernetes/kubernetes#66961.

@hugochinchilla

Author

commented Oct 17, 2018

Ok, I think I've found the problem.

Kubelet is running with /var/lib/kubelet as its root dir; I think it should be using /var/vcap/data/kubelet instead.
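The fix sketched here relies on kubelet's real --root-dir flag; how exactly the kubo-release job spec wires it in is assumed, not quoted from that release.

```shell
# kubelet invocation sketch: move the root dir onto the ephemeral disk so
# cAdvisor derives ephemeral-storage capacity from sdb2 (/var/vcap/data)
# instead of sda1 (/). Remaining flags stay as deployed.
kubelet --root-dir=/var/vcap/data/kubelet
```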

hugochinchilla added a commit to hugochinchilla/kubo-release that referenced this issue Oct 17, 2018

@seanos11

Contributor

commented Nov 22, 2018

Closing based on the resolution of cloudfoundry-incubator/kubo-release#259.

@seanos11 seanos11 closed this Nov 22, 2018

@cf-gitbot cf-gitbot removed the unscheduled label Nov 22, 2018

@seanos11 seanos11 reopened this Nov 22, 2018

@cf-gitbot


commented Nov 22, 2018

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/162153525

The labels on this github issue will be updated when the story is started.

@seanos11

Contributor

commented Nov 22, 2018

Reopened as it is still under review.

@seanos11

Contributor

commented Nov 22, 2018

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/162153525

The labels on this github issue will be updated when the story is started.

My close/reopen created a duplicate tracker item. I have deleted 162153525; https://www.pivotaltracker.com/story/show/161254182 is the one to track.

@instantlinux


commented Jan 8, 2019

I would like this error message improved:

Jan  8 01:07:30 vinson kubelet[1514]: I0108 01:07:30.586617    1514 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 85% which is over the high threshold (85%). Trying to free 481610956 bytes down to the low threshold (80%).

It doesn't give a device name or mount point, so I can't figure out which filesystem it's complaining about (there is plenty of space on the mounted volumes, so perhaps it's looking at the LVM thinpool I'm using for image storage).

Also, the garbage collector barfs when it encounters containers launched directly with docker run rather than by k8s.

@tvs

Member

commented Mar 26, 2019

@instantlinux That seems more within the purview of the Kubernetes community. Please raise an issue there.

@hugochinchilla This should be fixed in the default manifest as of CFCR v0.31.0 (Kubelet's root-dir is set to /var/vcap/data/kubelet)

@tvs tvs closed this Mar 26, 2019

@cf-gitbot cf-gitbot removed the unscheduled label Mar 26, 2019

@instantlinux


commented Mar 26, 2019

Sure thing @tvs, thanks for reminding me. I've reported there as issue #75708.

@hugochinchilla

Author

commented Mar 26, 2019

Thanks for the update, @tvs.
