
bcache and vdo not work #9

Closed
serjponomarev opened this issue Apr 17, 2018 · 9 comments

@serjponomarev

When I create bcache on top of a VDO device, and the VDO logical size is bigger than the original drive, I get this message:

[ 4966.723860] bcache: run_cache_set() invalidating existing data
[ 4966.728316] bcache: register_cache() registered cache device sdf
[ 4966.763060] bcache: bcache_device_init() nr_stripes too large or invalid: 2415919102 (start sector beyond end of disk?)
[ 4966.763101] bcache: register_bdev() error dm-0: cannot allocate memory
[ 4966.763166] bcache: bcache_device_free() (null) stopped

Is this a bug in VDO or in bcache?
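(A command sequence matching this description, with hypothetical device names; a 9T logical size would reproduce the identical nr_stripes value, as a later comment in this thread shows:)

# hypothetical repro: /dev/sdX is the backing disk, /dev/sdY the cache disk
vdo create --name=vdo0 --device=/dev/sdX --vdoLogicalSize=9T
make-bcache -C /dev/sdY -B /dev/mapper/vdo0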

@serjponomarev
Author

@corwin do you have time to solve this issue?

@raeburn
Member

raeburn commented Jul 20, 2018

Please describe in a little more detail what you were doing to trigger the error.
How did you configure the two devices? What were the parameters? What's the logical size of the VDO device?

Also, looking a little bit at the bcache code in the function that reports that error, I see some possible issues with size handling. Is the error dependent on the logical size of the VDO device? Does it work for, say, 15GB logical? (Or if you put an LVM logical volume of 15GB on top of VDO, and put bcache on top of that?)

@serjponomarev
Author

@raeburn I just tried to reproduce this, but now everything works.
My test scenario:
Environment: ESXi 6.7, VMFS datastore on an Intel SSD 3610
Test VM:
Distro: CentOS Linux release 7.5.1804 (Core)
Kernel from elrepo: 4.17.8-1.el7.elrepo.x86_64
kvdo & vdo built from the master branch

In the VM I created two devices: 10G for the cache and 50G for VDO.
I created a VDO device with a logical size of 200G,
created a bcache device (make-bcache -C "10G cache device" -B "200G VDO device" --writeback),
then formatted the bcache device as XFS and mounted it with the discard option; a sketch of this sequence follows.
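(A sketch of that sequence, assuming hypothetical device names /dev/sdb for the 10G cache disk and /dev/sdc for the 50G VDO backing disk:)

vdo create --name=vdo0 --device=/dev/sdc --vdoLogicalSize=200G
make-bcache -C /dev/sdb -B /dev/mapper/vdo0 --writeback
mkfs.xfs /dev/bcache0
mount -o discard /dev/bcache0 /mnt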

I ran two fio tests with --dedupe_percentage=80 (an example invocation follows below), and vdostats --si showed an 80% space saving.
After the first test I removed the test file; space discarding worked well.
During the second test I reset the VM, and after boot the devices were not corrupted or destroyed.
XFS mounted cleanly after the cold reset without any problem.
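(An fio invocation of the kind described, with hypothetical file name and sizing; --dedupe_percentage sets the share of duplicated buffers:)

fio --name=dedupe-test --filename=/mnt/testfile --size=10G --bs=1M --rw=write --direct=1 --dedupe_percentage=80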

This issue can be closed.

@raeburn
Member

raeburn commented Jul 24, 2018

I figured the sizes wrong in my previous message. I think 8TB and 16TB are interesting boundary sizes to look at. One of my co-workers is taking a look right now.
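(The boundary arithmetic, assuming bcache adopts VDO's 4 kB optimal I/O size as its stripe size; bash arithmetic is 64-bit, so the untruncated stripe counts are visible:)

echo $(( 8 * 2**40 / 4096 ))     # 2147483648 = 2^31, the first count past INT_MAX
echo $(( 16 * 2**40 / 4096 ))    # 4294967296 = 2^32, the first count that no longer fits in 32 bits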

@bgurney-rh

I was able to recreate this on Fedora 28, running kernel version 4.17.7-200.fc28.x86_64, and kvdo module version 6.2.0.132.

If I use a 33 GB partition as a cache device, and a VDO volume with a logical size of just over 8 TB ("--vdoLogicalSize=8388609M"), I can see the same error:

kernel: bcache: bcache_device_init() nr_stripes too large or invalid: 2147483902 (start sector beyond end of disk?)

Interestingly, I can also see the same error for a VDO volume of 9 TB:

vdo create --name=vdo1 --device=/dev/sdd1 --vdoLogicalSize=9T
make-bcache -C /dev/sdd2 -B /dev/mapper/vdo1
...
bcache: bcache_device_init() nr_stripes too large or invalid: 2415919102 (start sector beyond end of disk?)

...and for a VDO volume of 25 TB:

vdo create --name=vdo1 --device=/dev/sdd1 --vdoLogicalSize=25T
make-bcache -C /dev/sdd2 -B /dev/mapper/vdo1
...
kernel: bcache: bcache_device_init() nr_stripes too large or invalid: 2415919102 (start sector beyond end of disk?)

(Note that the "nr_stripes" value is identical between the 9 TB VDO volume and the 25 TB VDO volume, which suggests a rollover.)
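(The rollover can be shown by masking the 64-bit stripe count to 32 bits, the way a 32-bit field would store it; the 9 TB and 25 TB counts collide:)

echo $(( (9 * 2**40 / 4096) & 0xFFFFFFFF ))     # 2415919104
echo $(( (25 * 2**40 / 4096) & 0xFFFFFFFF ))    # 2415919104, identical after truncation

(The reported 2415919102 is two stripes lower, consistent with bcache reserving a small data offset at the start of the backing device.)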

@raeburn
Member

raeburn commented Jul 24, 2018

Thanks, Bryan! That confirms my suspicions.

Here are the issues with the bcache initialization (bcache_device_init, drivers/md/bcache/super.c) and VDO that I noticed in the version I looked at:

A "stripe" is chosen to be the optimal I/O size for the underlying device (cached_dev_init). For RAID 5/6 devices, this makes a lot of sense; stripe size will be tens or hundreds of kB or more, and writing a stripe all at once will be more efficient. For VDO, there's no such grouping happening under the covers, but partial or misaligned blocks carry a penalty because they require read-modify-write cycles, so our optimal I/O size is our block size, 4kB.

The bcache initialization code computes the number of stripes (device size divided by stripe size), and caps it at INT_MAX on a 64-bit system (or less, for a 32-bit system). This means 2**31 stripes, or 8TB at a 4kB block size, would exceed the limit.

Also, the quotient from the calculation is stuffed into a 32-bit (type "unsigned") field, and read back out for comparison against the maximum. So if the quotient exceeds 32 bits (16TB, at a 4kB block size), the computed nr_stripes (and thus the sizes of the stripe_sectors_dirty and full_dirty_stripes arrays) will be wrong. I'm suspicious about out-of-bounds references in such cases, but haven't dug into this.

(Bryan mentioned to me that creating a bcache device atop VDO worked if the VDO logical size was 18TB or 34TB, but failed in the cases above. So it seems to be a question of whether the VDO logical size, mod 16TB, is greater or less than 8TB; less, initialization works, greater, it fails. That matches up with my expectations.)
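(A minimal shell model of the size handling described above, based on my reading of the code: mask the quotient to 32 bits as the unsigned field would, then compare against the INT_MAX cap. It reproduces the observed pattern for all four sizes:)

INT_MAX=2147483647
for tib in 9 18 25 34; do
    n=$(( (tib * 2**40 / 4096) & 0xFFFFFFFF ))    # nr_stripes as stored in the 32-bit field
    (( n > INT_MAX )) && echo "$tib TiB: fails (nr_stripes $n)" || echo "$tib TiB: passes (nr_stripes $n)"
done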

I suppose one might argue whether VDO should declare an optimal block size at all, but it seems pretty clear that bcache has a problem with large devices that have small optimal I/O sizes. Bryan and I were speculating that an MD device or other such device with a lot of storage and a tiny chunk size might replicate this problem without VDO in the mix, which might be a little more persuasive that it's not a VDO problem.

Perhaps some other caching driver like dm-writecache might work better...

@serjponomarev
Author

serjponomarev commented Jul 25, 2018

I still have everything working =)
Distro: CentOS Linux release 7.5.1804 (Core)
Kernel from elrepo: 4.17.9-1.el7.elrepo.x86_64
kvdo & vdo builded from master branch
VM: 8 CPU, 64Gb RAM

sdb - 30Gb --> for cache device
sdc - 100Gb --> for VDO device
vdo - 50T

#Create a 50T VDO device
vdo create --name=vdo --device=/dev/sdc --vdoLogicalSize=50T
#Create bcache device
make-bcache --block 4K -C /dev/sdb -B /dev/mapper/vdo

#Disable bcache sequential cutoff and congestion thresholds

echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
for i in `ls /sys/fs/bcache/ | grep -v register`; do echo 0 > /sys/fs/bcache/$i/congested_read_threshold_us; done
for i in `ls /sys/fs/bcache/ | grep -v register`; do echo 0 > /sys/fs/bcache/$i/congested_write_threshold_us; done

#mkfs ext4
mkfs.ext4 -E nodiscard /dev/bcache0

#Mount bcache device
mount /dev/bcache0 /mnt

NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb           8:16   0   30G  0 disk 
└─bcache0   253:0    0   50T  0 disk /mnt
sdc           8:32   0  100G  0 disk 
└─vdo       252:0    0   50T  0 vdo  
  └─bcache0 253:0    0   50T  0 disk /mnt

#Create random 10G file
dd if=/dev/urandom of=io bs=1M count=10240 status=progress

#Write 10 copies of file io to /mnt
for i in {1..10}; do dd if=./io of=/mnt/io$i bs=1M oflag=direct status=progress; done

#Before DD

vdostats --si
Device                    Size      Used Available Use% Space saving%
/dev/mapper/vdo         107.4G      4.6G    102.8G   4%           99%
df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/bcache0     50T   20K   48T   1% /mnt

#After DD

vdostats --si
Device                    Size      Used Available Use% Space saving%
/dev/mapper/vdo         107.4G     15.5G     91.9G  14%           93%
df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/bcache0     50T  101G   48T   1% /mnt
ls -l --block-size=G /mnt
total 101G
-rw-r--r-- 1 root root 10G Jul 25 04:44 io1
-rw-r--r-- 1 root root 10G Jul 25 04:59 io10
-rw-r--r-- 1 root root 10G Jul 25 04:46 io2
-rw-r--r-- 1 root root 10G Jul 25 04:47 io3
-rw-r--r-- 1 root root 10G Jul 25 04:49 io4
-rw-r--r-- 1 root root 10G Jul 25 04:51 io5
-rw-r--r-- 1 root root 10G Jul 25 04:52 io6
-rw-r--r-- 1 root root 10G Jul 25 04:54 io7
-rw-r--r-- 1 root root 10G Jul 25 04:56 io8
-rw-r--r-- 1 root root 10G Jul 25 04:57 io9
drwx------ 2 root root  1G Jul 25 04:09 lost+found

@bgurney-rh

I should clarify the comments from yesterday. From raeburn:

"So it seems to be a question of whether the VDO logical size, [modulo] 16TB, is greater or less than 8TB; less, initialization works, greater, it fails."

In other words, if you create a bcache volume on top of a VDO volume with a logical size of approximately...
...0 TB to 8 TB: the bcache creation succeeds.
...8 TB to 16 TB: the bcache creation fails with "nr_stripes too large or invalid".
...16 TB to 24 TB: the bcache creation succeeds.
...24 TB to 32 TB: the bcache creation fails with "nr_stripes too large or invalid".
...32 TB to 40 TB: the bcache creation succeeds.
...40 TB to 48 TB: the bcache creation fails with "nr_stripes too large or invalid".
...48 TB to 56 TB: the bcache creation succeeds.
...56 TB to 64 TB: the bcache creation fails with "nr_stripes too large or invalid".

...and so on, and so on. The failure appears to happen when "nr_stripes" is between 2147483648 and 4294967295.

bcache seems to interpret the VDO volume as a device with "stripes" of 4096 bytes, and it calculates the number of stripes in the device ("nr_stripes"). An 8 TB "device" with a "stripe size" of 4096 bytes will have 2147483648 "stripes", which overflows a 32-bit signed counter. The counter wraps back to zero at sizes divisible by 16 TB, which results in the success/failure pattern I detailed above.
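(The same rule as a one-line predicate: creation fails when the logical size modulo 16 TiB is 8 TiB or more. Note that 50 TiB mod 16 TiB is 2 TiB, which is why the 50T test in the earlier comment succeeded:)

for tib in 8 9 16 25 50; do
    (( (tib % 16) >= 8 )) && echo "$tib TiB: fails" || echo "$tib TiB: succeeds"
done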

@serjponomarev
Author

serjponomarev commented Jul 25, 2018

@bgurney-rh Many thanks for the clarification. I rechecked and tested; everything works exactly as you described.
@raeburn Thanks for the comment about dm-writecache. I found out that it is already available in Linux 4.18.
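(For reference, a minimal dm-writecache setup sketch, with /dev/mapper/vdo as the origin and /dev/sdb as the SSD cache, both hypothetical here; the table format is "start length writecache {p|s} origin_dev cache_dev block_size #opt_args":)

dmsetup create wcache --table "0 $(blockdev --getsz /dev/mapper/vdo) writecache s /dev/mapper/vdo /dev/sdb 4096 0"

("s" selects an SSD cache device rather than persistent memory; blockdev --getsz reports the origin size in 512-byte sectors, as the table expects.)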
