RHEL/CentOS 6.2 x86_64, kernel panic with flashcache on reboot #58

Closed
kloderik opened this Issue · 8 comments

kloderik

uname -r

2.6.32-220.4.1.el6.x86_64

flashcache built from fresh sources (last commit ID is af3e101)

HDD & SSD configuration:
/dev/sd[abcd] form a software RAID10 array /dev/md0:

cat /proc/mdstat

Personalities : [raid10]
md0 : active raid10 sdd[3] sda[0] sdb[1] sdc[2]
1953522848 blocks super 1.2 4K chunks 2 near-copies [4/4] [UUUU]

/dev/sde is root device
/dev/sdf is SSD used for caching

The operating system drops to a kernel panic in the last stages of the reboot process.
The trace captured with netconsole/nc follows:

kernel BUG at drivers/md/md.c:6657!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/kexec_loaded
CPU 0
Modules linked in: flashcache(U) netconsole configfs microcode k10temp edac_core edac_mce_amd snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 sg r8169 mii xhci_hcd shpchp ext4 mbcache jbd2 raid10 ata_generic pata_acpi pata_jmicron firewire_ohci firewire_core crc_itu_t sd_mod crc_t10dif ahci radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 1115, comm: kcopyd Not tainted 2.6.32-220.4.1.el6.x86_64 #1 Gigabyte Technology Co., Ltd. GA-880GA-UD3H/GA-880GA-UD3H
RIP: 0010:[] [] md_write_start+0x1bb/0x1c0
RSP: 0018:ffff880400fd99a0 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff880400c15c00 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff8803ade24e00 RDI: ffff880400c15c00
RBP: ffff880400fd99f0 R08: 0000000000001000 R09: ffffe8ffffc011e8
R10: 0000000000141650 R11: 0000000000000000 R12: ffff8803ffc9ca80
R13: ffff880400c15c00 R14: ffff8803ade24e00 R15: 0000000000000000
FS: 00007fc34431c700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fc343f00f00 CR3: 00000003ae933000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kcopyd (pid: 1115, threadinfo ffff880400fd8000, task ffff8803ffd05500)
Stack:
ffff880401770360 ffff8804007ea3f8 ffff880400745948 ffff8804007ea3f8
ffff880400fd99e0 ffffffff8126c977 ffff880400745800 0000000000000007
ffff8803ffc9ca80 ffff880400c15c00 ffff880400fd9a70 ffffffffa009d55e
Call Trace:
[] ? kobject_put+0x27/0x60
[] make_request+0x7e/0x570 [raid10]
[] ? throtl_find_tg+0x46/0x60
[] md_make_request+0xd3/0x210
[] ? cache_alloc_refill+0x9e/0x240
[] ? __inc_zone_state+0x11/0x70
[] generic_make_request+0x2b2/0x5c0
[] submit_bio+0x8f/0x120
[] dispatch_io+0x1ff/0x260 [dm_mod]
[] ? list_get_page+0x0/0x30 [dm_mod]
[] ? list_next_page+0x0/0x20 [dm_mod]
[] ? complete_io+0x0/0xa0 [dm_mod]
[] dm_io+0xc5/0x1c0 [dm_mod]
[] ? list_get_page+0x0/0x30 [dm_mod]
[] ? list_next_page+0x0/0x20 [dm_mod]
[] run_io_job+0x6f/0x110 [dm_mod]
[] ? complete_io+0x0/0xa0 [dm_mod]
[] process_jobs+0x5b/0x100 [dm_mod]
[] ? run_io_job+0x0/0x110 [dm_mod]
[] ? do_work+0x0/0xa0 [dm_mod]
[] do_work+0x4c/0xa0 [dm_mod]
[] worker_thread+0x170/0x2a0
[] ? autoremove_wake_function+0x0/0x40
[] ? worker_thread+0x0/0x2a0
[] kthread+0x96/0xa0
[] child_rip+0xa/0x20
[] ? kthread+0x0/0xa0
[] ? child_rip+0x0/0x20
Code: c7 83 b4 01 00 00 00 00 00 00 f0 80 4b 28 02 f0 80 4b 28 04 48 8b bb 40 01 00 00 41 bc 01 00 00 00 e8 9a 74 ff ff e9 5e ff ff ff 0b eb fe 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0
RIP [] md_write_start+0x1bb/0x1c0
RSP
---[ end trace 41faf5005cfb45a9 ]---

How to reproduce:
1. Build & install flashcache on a newly installed CentOS 6.2 x86_64.
2. Create a RAID10 array:
mdadm -C /dev/md0 -l10 -n4 -c4 /dev/sd[abcd]
3. Initialize the cache:
flashcache_create -p back cache_var /dev/sdf /dev/md0
4. Make a new filesystem on the cached volume:
mkfs.ext4 /dev/mapper/cache_var
5. Reboot.
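For reference, a clean manual teardown before step 5 would look something like the sketch below. The device names and the dev.flashcache.sdf+md0 sysctl prefix are assumptions for this particular setup (flashcache names its sysctls after the ssd+disk pair); it is a sketch of the intended ordering, not a tested procedure.

```shell
#!/bin/bash
# Hypothetical manual teardown before a reboot: flush the writeback
# cache and remove the dm device so md is not stopped underneath it.
# Assumptions: cache device "cache_var", SSD /dev/sdf caching /dev/md0,
# and flashcache's dev.flashcache.<ssd>+<disk> sysctl naming.

CACHEDEV=cache_var
SYSCTL_BASE=dev.flashcache.sdf+md0

teardown_cache() {
    umount /dev/mapper/$CACHEDEV 2>/dev/null || true  # whatever mounts it
    sysctl -w ${SYSCTL_BASE}.do_sync=1       # start cleaning dirty blocks
    sysctl -w ${SYSCTL_BASE}.fast_remove=0   # make the remove flush first
    dmsetup remove "$CACHEDEV"               # must precede stopping /dev/md0
}
```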

Mohan Srinivasan
Collaborator

Looks like an interaction between md and kcopyd. Any chance you can try a later Linux kernel to see if this is fixed there? We don't use md and we don't use 2.6.32, so obviously we haven't run into this. I think there are other users that use flashcache with md.

kloderik

Unfortunately, no. There is no newer kernel for RHEL 6 in binary packages (and I don't think there will be a move to a newer branch in the future), and I avoid using non-stock kernels in a production environment. BTW, this bug does not appear in write-through caching mode.
In a couple of days I will be setting up a new machine with similar configuration. I'll post here the results of testing on that machine.

Boopathi Rajaa

I ran into this problem too, when rebooting with flashcache. I tried writing a sysVinit script that unmounts the RAID arrays and flushes the cache to disk on stop, but that alone did not fix the kernel panic. The init script itself works fine: when run manually as "service flashcache stop", the data is properly flushed and a subsequent reboot does not panic. But when the init script is run automatically during shutdown, i.e. via /etc/rc6.d/, the kernel panic occurs.

Mohan Srinivasan
Collaborator

boopathi - Is your stack trace from the crash similar to the one in this issue, or is it something different? Can you paste it here?

Boopathi Rajaa

It looks similar to this one:
http://picpaste.com/abc-42vRRXFM.jpg
I'm not able to access the entire trace through this console.

Boopathi Rajaa

Hey, this fixed my problem. I went through /etc/rc.d/rc, the file that is called when the runlevel changes. That script only runs /etc/rc.d/rcX.d/Kfoo stop if a subsys lock exists, and my flashcache sysVinit script did not maintain any subsys lock or PID file. So at reboot the stop action was skipped: the cache had not been flushed and the cache device was still present. I modified the flashcache init script to create a subsys lock, and it is working fine now; a reboot no longer ends in a kernel panic.
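The workaround described in this comment can be sketched as a minimal sysVinit script. Everything here is hypothetical (device names, the flashcache_load invocation, the sysctl prefix); the one detail this thread establishes is that /etc/rc.d/rc only runs the K* stop script at shutdown if a lock file exists under /var/lock/subsys, so "start" must create one.

```shell
#!/bin/bash
# chkconfig: 345 90 10
# description: load the flashcache device on boot, flush it on shutdown.
# Minimal sketch with assumed names: cache device "cache_var",
# SSD /dev/sdf caching /dev/md0.

LOCKFILE=/var/lock/subsys/flashcache
CACHEDEV=cache_var

start() {
    flashcache_load /dev/sdf "$CACHEDEV"   # reattach the existing cache
    touch "$LOCKFILE"                      # without this, stop is skipped at reboot
}

stop() {
    sysctl -w dev.flashcache.sdf+md0.fast_remove=0   # flush dirty blocks on remove
    dmsetup remove "$CACHEDEV"
    rm -f "$LOCKFILE"
}

case "$1" in
    start) start ;;
    stop)  stop ;;
    *)     echo "Usage: $0 {start|stop}" ;;
esac
```

The lock file is the piece the original script was missing; the start/stop bodies would need to match the actual cache configuration.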

Mohan Srinivasan
Collaborator

Boopathi - Just curious. Did you test enough to verify that this fully fixes your crash? The most important test case is one where there are plenty of dirty blocks in the writeback cache that need to be cleaned.

My next question: if this works around or fixes the crash, is it something we should document in the SA guide? If you think it might be useful and write up a short HOWTO, I can add it to the SA guide with attribution.

Boopathi Rajaa

I'm not 100% sure, because I haven't examined it with all the possible test cases. I use something like this for the init script; it would be better if others tested it with their own configurations.
#83 .

This issue was closed.