[C++] Crash at arrow::internal::CountSetBits #21840

Closed
asfimport opened this issue May 21, 2019 · 34 comments

Comments

@asfimport
Collaborator

asfimport commented May 21, 2019

I've received many crash dumps from a customer's Windows machine. The stack trace shows that it crashed at arrow::internal::CountSetBits.

 

STACK_TEXT:  
000000c9`5354a4c0 00007ff7`2f2830fd : 000000c9`544841c0 00000000`00000000 00000000`00001e00 cccccccc`cccccccc : CortexService!arrow::internal::CountSetBits+0x16d
000000c9`5354a550 00007ff7`2f2834b7 : 000000c9`5337c930 cccccccc`cccccccc cccccccc`cccccccc cccccccc`cccccccc : CortexService!arrow::ArrayData::GetNullCount+0x8d
000000c9`5354a580 00007ff7`2f13df55 : 000000c9`54476080 000000c9`5354a5d8 cccccccc`cccccccc cccccccc`cccccccc : CortexService!arrow::Array::null_count+0x37
000000c9`5354a5b0 00007ff7`2f13fb68 : 000000c9`5354ab40 000000c9`5354a6f8 000000c9`54476080 cccccccc`cccccccc : CortexService!parquet::arrow::`anonymous namespace'::LevelBuilder::Visit<arrow::NumericArray<arrow::FloatType> >+0xa5
000000c9`5354a640 00007ff7`2f12fa34 : 000000c9`5354a6f8 000000c9`54476080 000000c9`5354ab40 cccccccc`cccccccc : CortexService!arrow::VisitArrayInline<parquet::arrow::`anonymous namespace'::LevelBuilder>+0x298
000000c9`5354a680 00007ff7`2f14bf03 : 000000c9`5354ab40 000000c9`5354a6f8 000000c9`54476080 cccccccc`cccccccc : CortexService!parquet::arrow::`anonymous namespace'::LevelBuilder::VisitInline+0x44
000000c9`5354a6c0 00007ff7`2f12fe2a : 000000c9`5354ab40 000000c9`5354ae18 000000c9`54476080 000000c9`5354b208 : CortexService!parquet::arrow::`anonymous namespace'::LevelBuilder::GenerateLevels+0x93
000000c9`5354aa00 00007ff7`2f14de56 : 000000c9`5354b1f8 000000c9`5354afc8 000000c9`54476080 00000000`00001e00 : CortexService!parquet::arrow::`anonymous namespace'::ArrowColumnWriter::Write+0x25a
000000c9`5354af20 00007ff7`2f14e66b : 000000c9`5354b1f8 000000c9`5354b238 000000c9`54445c20 00000000`00000000 : CortexService!parquet::arrow::`anonymous namespace'::ArrowColumnWriter::Write+0x2a6
000000c9`5354b040 00007ff7`2f12f137 : 000000c9`544041f0 000000c9`5354b4d8 000000c9`5354b4a8 00000000`00000000 : CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b
000000c9`5354b400 00007ff7`2f14b4d5 : 000000c9`54431180 000000c9`5354b4d8 000000c9`5354b4a8 00000000`00000000 : CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67
000000c9`5354b450 00007ff7`2f12eef1 : 000000c9`5354b5d8 000000c9`5354b648 00000000`00000000 00000000`00001e00 : CortexService!<lambda_f279b6cdf777bbf919efc3f7f42faf89>::operator()+0x195
000000c9`5354b530 00007ff7`2eb8e31e : 000000c9`54431180 000000c9`5354b760 000000c9`54442fb0 00000000`00001e00 : CortexService!parquet::arrow::FileWriter::WriteTable+0x521
000000c9`5354b730 00007ff7`2eb58ac5 : 000000c9`5307bd88 000000c9`54442fb0 ffffffff`ffffffff ffffffff`ffffffff : CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe
000000c9`5354b860 00007ff7`2eafdce6 : 000000c9`5307bd80 000000c9`5354ba08 000000c9`5354b9e0 000000c9`5354b9d8 : CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545
000000c9`5354b9a0 00007ff7`2eaf8bae : 000000c9`53275600 000000c9`53077220 ffffffff`fffffffe 00000000`00000000 : CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6
FAILED_INSTRUCTION_ADDRESS: 
CortexService!arrow::internal::CountSetBits+16d [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc @ 99]
00007ff7`2f3a4e4d f3480fb800      popcnt  rax,qword ptr [rax]

FOLLOWUP_IP: 
CortexService!arrow::internal::CountSetBits+16d [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc @ 99]
00007ff7`2f3a4e4d f3480fb800      popcnt  rax,qword ptr [rax]

FAULTING_SOURCE_LINE:  c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc

FAULTING_SOURCE_FILE:  c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc

FAULTING_SOURCE_LINE_NUMBER:  99

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  cortexservice!arrow::internal::CountSetBits+16d
ERROR_CODE: (NTSTATUS) 0xc000001d - {EXCEPTION}  Illegal Instruction  An attempt was made to execute an illegal instruction.

EXCEPTION_CODE: (NTSTATUS) 0xc000001d - {EXCEPTION}  Illegal Instruction  An attempt was made to execute an illegal instruction.

APP:  cortexservice.exe

ANALYSIS_VERSION: 6.3.9600.17336 (debuggers(dbg).150226-1500) amd64fre

FAULTING_THREAD:  000000000000169c

BUGCHECK_STR:  APPLICATION_FAULT_INVALID_POINTER_READ_BEFORE_WRITE

PRIMARY_PROBLEM_CLASS:  INVALID_POINTER_READ_BEFORE_WRITE

DEFAULT_BUCKET_ID:  INVALID_POINTER_READ_BEFORE_WRITE

LAST_CONTROL_TRANSFER:  from 00007ff72f2830fd to 00007ff72f3a4e4d

 

 

Environment: Operating System: Windows 7 Professional 64-bit (6.1, Build 7601) Service Pack 1 (7601.win7sp1_ldr_escrow.181110-1429)
Language: English (Regional Setting: English)
System Manufacturer: SAMSUNG ELECTRONICS CO., LTD.
System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520
BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ
Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz
Memory: 2048MB RAM
Available OS Memory: 1962MB RAM
Page File: 1517MB used, 2405MB available
Windows Dir: C:\Windows
DirectX Version: DirectX 11

Reporter: Tham / @thamht4190

Note: This issue was originally created as ARROW-5381. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
It seems that your CPU (Pentium B950) should support the popcnt instruction. Are you running this in a VM?

@asfimport
Collaborator Author

Tham / @thamht4190:

> Are you running this in a VM?

No, it's not a virtual machine.

I've got another machine which has the same crash:

Operating System: Windows 10 Pro 64-bit (10.0, Build 10240) (10240.th1.170602-2340)
Language: English (Regional Setting: English)
System Manufacturer: HP
System Model: HP Laptop 14-bs0xx
BIOS: F.31
Processor: Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz (2 CPUs), ~1.6GHz
Memory: 4096MB RAM
Available OS Memory: 4002MB RAM
Page File: 2189MB used, 2516MB available
Windows Dir: C:\Windows
DirectX Version: 12
DX Setup Parameters: Not found
User DPI Setting: Using System DPI
System DPI Setting: 96 DPI (100 percent)
DWM DPI Scaling: Disabled
Miracast: Available, with HDCP
Microsoft Graphics Hybrid: Not Supported
DxDiag Version: 10.00.10240.16384 64bit Unicode

Can you please take a look?

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Can you download and run this program:
https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo

Among its output will be a line saying "Supports POPCNT instruction". It will tell you whether the CPU supports the required instruction.
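The same question can also be asked from inside the program. Below is a minimal sketch, assuming an x86 target built with MSVC, GCC, or Clang: it reads CPUID leaf 1 and tests ECX bit 23, which is the documented POPCNT feature flag. The function name `CpuHasPopcnt` is made up for this example and is not part of Arrow.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Hypothetical helper (not an Arrow API): returns true if CPUID reports
// POPCNT support (leaf 1, ECX bit 23). On non-x86 targets it returns false,
// because the x86 POPCNT instruction does not exist there.
inline bool CpuHasPopcnt() {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  int regs[4] = {0, 0, 0, 0};  // EAX, EBX, ECX, EDX
  __cpuid(regs, 1);
  return ((static_cast<std::uint32_t>(regs[2]) >> 23) & 1u) != 0;
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  // __get_cpuid returns 0 if the requested leaf is not available.
  if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0) return false;
  return ((ecx >> 23) & 1u) != 0;
#else
  return false;
#endif
}
```

Logging the result at service start-up would put the answer into the same crash report as the stack trace, instead of depending on the customer to run a separate tool.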

@asfimport
Collaborator Author

Tham / @thamht4190:
Thanks for the quick response. I'll send this to my customer and ask them to run it. Their response won't be as fast as yours :)

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Any news @thamht4190?

@asfimport
Collaborator Author

Tham / @thamht4190:
Sorry for the late reply. I haven't got an answer from my customer yet. Let me try again.

@asfimport
Collaborator Author

Tham / @thamht4190:
@pitrou I finally got the Coreinfo output from this customer. It seems his machine supports POPCNT.

popcnt_support.png

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
So this is a bit mysterious. Does it work right on other machines? Does the customer execute their binary in a special way? Perhaps they have some kind of security layer that restricts the instruction set? Something else?

@asfimport
Collaborator Author

Tham / @thamht4190:
Our application is a Windows service running with admin permissions. Users mostly don't work with the service directly, but through a GUI application that connects to it. We have seen this issue only on Windows 7.
It would be better to reproduce this issue so that it can be debugged, but our resources for Windows 7 are limited. If you find out something, please let me know. I'll try to reproduce it as soon as possible :)

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
I don't think Windows 7 is an issue. I build regularly on Windows 7 and never had a POPCNT problem.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
By the way do all your users use the same binary? Is it possible for you to disassemble the CountSetBits function?

@asfimport
Collaborator Author

Tham / @thamht4190:

> I don't think Windows 7 is an issue

Yes, you're right. I set up a Windows 7 machine and could not reproduce the issue. We've just had a customer hit this issue on macOS as well.

> By the way do all your users use the same binary? Is it possible for you to disassemble the CountSetBits function?

OK, I'll attach it in a while.

@asfimport
Collaborator Author

Tham / @thamht4190:
bit-util.asm is the assembly generated for bit-util.cc on Windows. Can you take a look?

@asfimport
Collaborator Author

Tham / @thamht4190:
On the macOS machine, I ran this command:

$ sysctl -n machdep.cpu.features
FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 XSAVE

POPCNT is not in the list, so it seems this machine doesn't support the POPCNT instruction :( Is there any way to avoid this situation?
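For anyone who wants to automate this check over collected support logs, here is a small sketch that looks for a flag as a whole token in a `machdep.cpu.features` style string. The helper name `FeatureListContains` is hypothetical. Matching whole tokens matters, because a substring search for "SSE4" would match "SSE4.1" even though no such flag is listed.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `feature` appears as a complete
// whitespace-separated token in `feature_list`.
inline bool FeatureListContains(const std::string& feature_list,
                                const std::string& feature) {
  std::istringstream tokens(feature_list);
  std::string token;
  while (tokens >> token) {
    if (token == feature) return true;
  }
  return false;
}
```

With the feature list pasted above, a lookup for "SSE4.1" succeeds while lookups for "POPCNT" and "SSE4.2" fail, which matches the reading that this CPU predates both.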

iMac-late2009.png

@asfimport
Collaborator Author

Tham / @thamht4190:
Or at least, could we use try/catch so that Arrow returns an error instead of crashing?

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
The iMac is much too old indeed :(

It won't be easy to return an error (from which function exactly?), as Arrow is a library and not an application. Perhaps we can add some kind of Initialize() function that would return an error if the CPU isn't compatible.
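To make the Initialize() idea concrete, here is a rough sketch of what such a start-up check could look like. Everything in it is hypothetical: `InitResult` and `CheckCpuCompatibility` are invented for illustration, and a real version would presumably return `arrow::Status` rather than a custom struct. The CPU capability is passed in as a parameter so that the policy can be exercised without depending on the host machine.

```cpp
#include <string>

// Hypothetical result type for this sketch only.
struct InitResult {
  bool ok;
  std::string message;
};

// Hypothetical start-up check: `has_popcnt` is the answer from whatever CPU
// feature query the application uses. Returns an error result instead of
// letting the first POPCNT instruction raise an illegal-instruction fault.
inline InitResult CheckCpuCompatibility(bool has_popcnt) {
  if (!has_popcnt) {
    return {false,
            "CPU does not support the POPCNT instruction required by this build"};
  }
  return {true, ""};
}
```

An application would call this once before touching any Arrow code path and report the message to the user, so the failure shows up as a readable error rather than a crash dump.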

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Could you try this patch on Windows?

https://gist.github.com/pitrou/a9803b98b651f766f45c5fd79c47cde1

 

@asfimport
Collaborator Author

Tham / @thamht4190:

> Could you try this patch on Windows?

Ok, let me contact our customer. This will take time again :(

@asfimport
Collaborator Author

Tham / @thamht4190:
After checking with the Customer Support team, I found that we made a mistake: the Coreinfo output came from the wrong customer. We haven't been able to contact the customer who has this issue on Windows 7 yet, so please ignore the images I sent until I can find more information. Sorry!

@asfimport
Collaborator Author

Wes McKinney / @wesm:
You may have to submit some patches to provide compatibility for very old processors if you need to support these machines. All the people that I work with, for example, are using relatively new hardware, so I can't justify investing my own time in such things.

@asfimport
Collaborator Author

Tham / @thamht4190:
@wesm I agree that if the machine is too old, we may not need to support it.

I will check again to confirm whether POPCNT is supported on the Windows 7 machine from which I received the crash report.

> You may have to submit some patches to provide compatibility for very old processors if you need to support these machines.

In case we want to support them, I'm not sure how to do it. Can you share any guide or suggestion? I'll try to implement it.

Update: I found this discussion http://www.cplusplus.com/forum/general/185927/. Using may be a candidate?

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
@thamht4190 Sorry for the delay. I'm not sure we want to carry maintenance baggage for very old CPUs. If you really need this, it may be better that you maintain your own patches against Arrow.

(it's probably not difficult to find a good software implementation of popcount on the Internet, in any case)
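For reference, one widely used software implementation is the SWAR (parallel bit count) technique shown below. This is a sketch of that well-known technique, not code taken from Arrow, and the name `PopCount64Portable` is made up for this example.

```cpp
#include <cstdint>

// Portable 64-bit population count using the classic SWAR technique.
// It uses only shifts, masks, adds, and one multiply, so it runs on CPUs
// that lack the POPCNT instruction.
inline int PopCount64Portable(std::uint64_t x) {
  // Count bits within each 2-bit group.
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  // Sum adjacent 2-bit counts into 4-bit groups.
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  // Sum adjacent 4-bit counts into bytes.
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  // Add all eight byte counts together; the total lands in the top byte.
  return static_cast<int>((x * 0x0101010101010101ULL) >> 56);
}
```

It is slower than the hardware instruction, but the cost is a fixed handful of arithmetic operations per 64-bit word.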

@asfimport
Collaborator Author

Wes McKinney / @wesm:
It may not end up being a good use of my time, but I'm tracking down a refurbished 2009-era MacBook which has a Penryn-based CPU supporting SSE4.1 but not SSE4.2, and lacking the popcnt instruction. If we can support pre-SSE4.2 CPUs without great hardship, that doesn't seem so bad. I'll have a look later, and patches are welcome in the meantime.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
I have a pre-SSE4.2 laptop in hand now and will spend a few minutes to see what's involved with getting the software running properly in this environment.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
I put up a PR to get the project running on 2009-era Intel architecture. I am not sure this will fix the Windows issue, though

We might add a software implementation of __builtin_popcountll just in case

https://github.com/RoaringBitmap/CRoaring/blob/master/include/roaring/portability.h#L162
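As an illustration of where a software fallback would be used, here is a sketch of counting set bits over a bit range of a bitmap without any hardware popcount. It assumes the least-significant-bit-first layout that Arrow validity bitmaps use (bit i lives in byte i / 8 at position i % 8). The names `PopCountByte` and `CountSetBitsPortable` are invented for this example, and this is not Arrow's actual CountSetBits.

```cpp
#include <cstdint>

// Counts set bits in one byte by repeatedly clearing the lowest set bit.
inline int PopCountByte(std::uint8_t b) {
  int n = 0;
  while (b != 0) {
    b = static_cast<std::uint8_t>(b & (b - 1));
    ++n;
  }
  return n;
}

// Hypothetical portable range count: number of set bits in the half-open bit
// range [bit_offset, bit_offset + length) of an LSB-first bitmap.
inline std::int64_t CountSetBitsPortable(const std::uint8_t* data,
                                         std::int64_t bit_offset,
                                         std::int64_t length) {
  std::int64_t count = 0;
  std::int64_t i = bit_offset;
  const std::int64_t end = bit_offset + length;
  // Leading bits, one at a time, until a byte boundary is reached.
  while (i < end && (i % 8) != 0) {
    count += (data[i / 8] >> (i % 8)) & 1;
    ++i;
  }
  // Whole bytes.
  while (end - i >= 8) {
    count += PopCountByte(data[i / 8]);
    i += 8;
  }
  // Trailing bits that do not fill a byte.
  while (i < end) {
    count += (data[i / 8] >> (i % 8)) & 1;
    ++i;
  }
  return count;
}
```

A production version would process 64-bit words instead of bytes, but the leading, aligned, and trailing structure would stay the same.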

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
@thamht4190 Should this issue be kept open?

@asfimport
Collaborator Author

Tham / @thamht4190:
I have used CpuInfo::IsSupported(POPCNT) to see if this computer supports POPCNT, then use to count bit. What is your idea about this approach? I'll try your suggestion as well.
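The general shape of this approach (check the CPU once, then call either the hardware instruction or a software fallback) can be sketched as follows. This is not the actual patch: all function names are hypothetical, and on GCC/Clang x86-64 the compiler builtin `__builtin_cpu_supports` stands in for the CpuInfo check mentioned above. On other compilers or architectures the sketch always uses the software path.

```cpp
#include <cstdint>

// Software fallback: clears the lowest set bit until none remain.
inline int PopCountSoftware(std::uint64_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;
    ++n;
  }
  return n;
}

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__) && \
    !defined(_MSC_VER)
#define SKETCH_HAVE_HW_POPCNT 1
// Only this function is compiled with POPCNT enabled, so the rest of the
// binary stays safe to run on CPUs without the instruction.
__attribute__((target("popcnt"))) inline int PopCountHardware(std::uint64_t x) {
  return __builtin_popcountll(x);
}
inline bool HostHasPopcnt() { return __builtin_cpu_supports("popcnt") != 0; }
#else
#define SKETCH_HAVE_HW_POPCNT 0
#endif

using PopCountFn = int (*)(std::uint64_t);

// Picks the implementation that is safe on the current machine.
inline PopCountFn SelectPopCount() {
#if SKETCH_HAVE_HW_POPCNT
  if (HostHasPopcnt()) return &PopCountHardware;
#endif
  return &PopCountSoftware;
}

// Public entry point: the selection runs once, on first use.
inline int PopCount64(std::uint64_t x) {
  static const PopCountFn fn = SelectPopCount();
  return fn(x);
}
```

The indirect call costs a little compared with an inlined instruction, which is the trade-off for shipping one binary to both old and new CPUs.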

(Sorry, I don't receive any notifications about your comments, so I have to check by opening this page directly and sometimes reply late.)

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
@thamht4190 So you confirm that the machine that had this issue doesn't support POPCNT?

Did you encounter many such cases?

@asfimport
Collaborator Author

Tham / @thamht4190:

> So you confirm that the machine that had this issue doesn't support POPCNT?

@pitrou  Yes, absolutely.

> Did you encounter many such cases?

Like 12 machines until I fixed this issue using .

@asfimport
Collaborator Author

Tham / @thamht4190:
@pitrou Hi, I'm going to upgrade to Arrow 3.0.0. May I ask if this issue has been fixed in this version? If not, I'm not sure where to apply a software implementation of POPCNT, since the code has changed a lot since then. As I don't have any machine without POPCNT support to test on, it may affect our customers when we release. Can you please confirm whether it's fixed?

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
@thamht4190 No, it isn't fixed. The code hasn't changed that much, though; you can still patch these lines:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit_util.h#L25-L29

(this is git master, but you get the idea)

@abcbarryn

Is this ever going to get fixed? This has been broken for at least 4 years now...

@westonpace
Member

The challenge here is generally supporting hardware without any kind of CI infrastructure. Perhaps we can run some very basic tests with qemu (it's too slow to run the full test suite) to get reasonable coverage.

@assignUser
Member

I'd like to mention that popcnt is a 15-year-old instruction at this point. TBH I don't think it is worth the maintenance burden (plus CI issues) to add this, with the use case only ever shrinking as time goes on, but I will defer to @westonpace.

I am also closing this issue in favor of #23013 as it represents the actual fix vs. a long troubleshooting thread.
