LAPACK test segfault on zen/zen2/zen3 at bli_sgemmsup_rd_haswell_asm_1x16n #821

Closed
j-bm opened this issue Jul 25, 2024 · 16 comments

@j-bm
Contributor

j-bm commented Jul 25, 2024

Building blis on OpenBSD (-current, that is to say most recent development version).

Configuration argument: x86_64
compiler: clang version 16.0.6
cpu: cpu0: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics, 3307.99 MHz, 19-74-01

cpu0: cpuid 1 edx=78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2> ecx=f6f83203<SSE3,PCLMUL,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV>

Built LAPACK version 3.8.0 and ran the test code:

$ LIN/xlintsts < stest.in
 Tests of the REAL LAPACK routines
 LAPACK VERSION 3.8.0

 The following parameter values will be used:
    M   :       0     1     2     3     5    10    50
    N   :       0     1     2     3     5    10    50
    NRHS:       1     2    15
    NB  :       1     3     3     3    20
    NX  :       1     0     5     9     1
    RANK:      30    50    90
...[stuff omitted]...


 All tests for SQT routines passed the threshold (    510 tests run)

 SXQ routines passed the tests of the error exits

Program received signal SIGSEGV, Segmentation fault.
0x000008dcc789a771 in bli_sgemmsup_rd_haswell_asm_1x16n ()
(gdb) bt
#0  0x000008dcc789a771 in bli_sgemmsup_rd_haswell_asm_1x16n ()
#1  0x000008dcc789a21a in bli_sgemmsup_rd_haswell_asm_6x16n ()
#2  0x000008dcc7731b57 in bli_gemmsup_ref_var1n ()
#3  0x000008dcc772df15 in bli_gemmsup_int ()
#4  0x000008dcc7735bba in bli_l3_sup_thread_decorator_entry ()
#5  0x000008dcc7735ae7 in bli_l3_sup_thread_decorator ()
#6  0x000008dcc772c00b in bli_gemmsup_ref ()
#7  0x000008dcc7d341fb in bli_gemmsup ()
#8  0x000008dcc7d32f8b in bli_gemm_ex ()
#9  0x000008dcc7d317c3 in sgemm_ ()
#10 0x000008dcc75fe1ca in slqt05_ ()
#11 0x000008dcc75f073f in schklqtp_ ()
#12 0x000008dcc7574b6f in MAIN__ ()
#13 0x000008dcc7574ddb in main ()


(gdb) x/i $pc
=> 0x186c5369771 <bli_sgemmsup_rd_haswell_asm_1x16n+721>:       vmovss (%rax,%r8,1),%xmm1
(gdb)
   0x186c5369777 <bli_sgemmsup_rd_haswell_asm_1x16n+727>:       add    $0x4,%rax
(gdb)
   0x186c536977b <bli_sgemmsup_rd_haswell_asm_1x16n+731>:       vmovss (%rbx),%xmm3
(gdb)
   0x186c536977f <bli_sgemmsup_rd_haswell_asm_1x16n+735>:       vfmadd231ps %ymm0,%ymm3,%ymm4
(gdb)
   0x186c5369784 <bli_sgemmsup_rd_haswell_asm_1x16n+740>:       vmovss (%rbx,%r11,1),%xmm3
(gdb)
   0x186c536978a <bli_sgemmsup_rd_haswell_asm_1x16n+746>:       vfmadd231ps %ymm0,%ymm3,%ymm7
(gdb)
   0x186c536978f <bli_sgemmsup_rd_haswell_asm_1x16n+751>:       vmovss (%rbx,%r11,2),%xmm3
(gdb)
   0x186c5369795 <bli_sgemmsup_rd_haswell_asm_1x16n+757>:       vfmadd231ps %ymm0,%ymm3,%ymm10

Experimenting with the BLIS_ARCH_TYPE environment variable shows that zen, zen2, and zen3 all fail exactly as above. BLIS_ARCH_TYPE=4 (sandybridge) succeeds, as does penryn.

It seems to be the SQZ and STQ tests that fail.
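
The sub-configuration sweep described above can be automated. The following is a minimal sketch, not part of BLIS: the helper name `run_with_arch_ids` is hypothetical, and apart from 4 (sandybridge, as reported above) the numeric IDs are placeholders, since the thread does not list them. A stand-in Python command is used so the sketch runs anywhere; in practice `cmd` would be the real test binary.

```python
import os
import subprocess
import sys

def run_with_arch_ids(cmd, arch_ids, env_var="BLIS_ARCH_TYPE"):
    """Run cmd once per arch id and return {arch_id: returncode}.

    On POSIX systems a child killed by a signal reports a negative
    return code (for example -11 for SIGSEGV).
    """
    results = {}
    for arch_id in arch_ids:
        env = dict(os.environ)
        env[env_var] = str(arch_id)
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[arch_id] = proc.returncode
    return results

# Stand-in for the real test program: exits 0 only when the variable is "4".
# The id 3 is a placeholder, not a documented BLIS arch id.
demo_cmd = [
    sys.executable,
    "-c",
    "import os, sys; sys.exit(0 if os.environ['BLIS_ARCH_TYPE'] == '4' else 1)",
]
demo_results = run_with_arch_ids(demo_cmd, [3, 4])
print(demo_results)
```

Replacing `demo_cmd` with something like `["./tblis.x"]` would give one exit status per sub-configuration, which makes an intermittent failure easier to tabulate over repeated runs.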

@devinamatthews
Member

Would it be possible to extract the specific sgemm parameters leading to this in order to create a MWE?

@j-bm j-bm closed this as completed Jul 26, 2024
@j-bm j-bm reopened this Jul 26, 2024
@j-bm
Contributor Author

j-bm commented Jul 27, 2024

Just a remark -- I deleted my last two comments as the test code in them was incorrect.

Better code to come!

@j-bm
Contributor Author

j-bm commented Jul 27, 2024

Here is a test code with some assertions included.

$ cat sgemmtest.f90
program sgemmtest
   IMPLICIT NONE

   REAL, ALLOCATABLE ::  Q(:, :), A(:, :), R(:, :)

   REAL ONE, ZERO
   PARAMETER(ONE=1.0, ZERO=0.0)

   INTEGER L, M, N

   INTRINSIC MAX, MIN

   M = 50
   N = 10

   L = MAX(M, N, 1)

   ALLOCATE (Q(L, L), A(M, N), R(M, L))

   CALL SLASET('A', M, N, ONE, ONE, A, M)
   CALL SLASET('A', L, L, ONE, ONE, Q, L)

   print *,'sgemmtest:'
   print *,'   M = ',M,' N = ',N,' L = ',L
   print *,'   A(1,1) is ',A(1,1),' Q(1,1) is ',Q(1,1)
   print *,' '
   print *,' R = Q**T * A, except Q is square but we use MxN of it'
   print *,' assert sum(A)==M*N*ONE is ', M*N*ONE == SUM(A)
   print *,' assert sum(Q)==L*L*ONE is ', L*L*ONE == SUM(Q)
   CALL SGEMM('T', 'N', M, N, M, ONE, Q, M, A, M, ZERO, R, M)
   print *,' r11 ',r(1,1), ' r211', r(2,1), ' rml ',r(M,L)
   print *,' matrix of MxN filled with M:'
   print *,' assert sum(R)==M*N*M is ', SUM(R) - M*N*M*ONE == ZERO

   print *,' done'
end

Here is a successful run:

$ export GFORTRAN_UNBUFFERED_ALL=1
$ export MALLOC_OPTIONS=CFG
$ export BLIS_ARCH_DEBUG=1
$ ./tblis.x
 sgemmtest:
    M =           50  N =           10  L =           50
    A(1,1) is    1.00000000      Q(1,1) is    1.00000000

  R = Q**T * A, except Q is square but we use MxN of it
  assert sum(A)==M*N*ONE is  T
  assert sum(Q)==L*L*ONE is  T
libblis: selecting sub-configuration 'zen3'.
  r11    50.0000000      r211   50.0000000      rml    0.00000000
  matrix of MxN filled with M:
  assert sum(R)==M*N*M is  T
  done

Here is an unsuccessful run:

$ egdb -q tblis.x
Reading symbols from tblis.x...
(gdb) run
Starting program: /home/jal/checkblis/tblis.x
 sgemmtest:
    M =           50  N =           10  L =           50
    A(1,1) is    1.00000000      Q(1,1) is    1.00000000

  R = Q**T * A, except Q is square but we use MxN of it
  assert sum(A)==M*N*ONE is  T
  assert sum(Q)==L*L*ONE is  T
libblis: selecting sub-configuration 'zen3'.

Program received signal SIGSEGV, Segmentation fault.
0x00000aad13098881 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0

(gdb) bt
#0  0x00000aad13098881 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0
#1  0x00000aad1309832a in bli_sgemmsup_rd_haswell_asm_6x16n ()
   from /usr/local/lib/libblis.so.0.0
#2  0x00000aad136fac87 in bli_gemmsup_ref_var1n ()
   from /usr/local/lib/libblis.so.0.0
#3  0x00000aad136f8a55 in bli_gemmsup_int ()
   from /usr/local/lib/libblis.so.0.0
#4  0x00000aad136f86aa in bli_l3_sup_thread_decorator_entry ()
   from /usr/local/lib/libblis.so.0.0
#5  0x00000aad136f85d7 in bli_l3_sup_thread_decorator ()
   from /usr/local/lib/libblis.so.0.0
#6  0x00000aad136f9f7b in bli_gemmsup_ref ()
   from /usr/local/lib/libblis.so.0.0
#7  0x00000aad136f82db in bli_gemmsup () from /usr/local/lib/libblis.so.0.0
#8  0x00000aad136f6a1b in bli_gemm_ex () from /usr/local/lib/libblis.so.0.0
#9  0x00000aad137464e3 in sgemm_ () from /usr/local/lib/libblis.so.0.0
#10 0x00000aab12ff8ca5 in sgemmtest () at sgemmtest.f90:30
#11 0x00000aab12ff9059 in main (argc=1, argv=0x7571737f3dd0)
    at sgemmtest.f90:36
#12 0x00000aab12ff7e7b in _start ()

@devinamatthews
Member

@fgvanzee since you're the most familiar with this code, could you take a look?

@fgvanzee
Member

fgvanzee commented Aug 1, 2024

@devinamatthews Sure, I'll see what I can figure out.

@fgvanzee
Member

fgvanzee commented Aug 3, 2024

@j-bm I used your Fortran driver (thanks for providing that!), but was unable to reproduce your issue. :-\

  1. What version/commit of BLIS are you using?
  2. Is it vanilla or from AMD?
  3. Assuming you built from source, how did you configure your copy of BLIS?

@j-bm
Contributor Author

j-bm commented Aug 4, 2024

This is flame/blis version 1.0.0, built from the .tar.gz release file.

Built on OpenBSD-current (as of a few weeks ago) using

o75snap$ egfortran --version
GNU Fortran (GCC) 8.4.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

o75snap$ cc --version
OpenBSD clang version 16.0.6
Target: amd64-unknown-openbsd7.5
Thread model: posix
InstalledDir: /usr/bin

$ gmake showconfig
configuration family:       x86_64
sub-configurations:         skx knl haswell sandybridge penryn zen3 zen2 zen excavator steamroller piledriver bulldozer generic
requisite kernels sets:     skx knl sandybridge penryn zen3 zen2 haswell zen piledriver bulldozer generic
kernel-to-config map:       bulldozer:bulldozer generic:generic haswell:haswell knl:knl penryn:penryn piledriver:piledriver sandybridge:sandybridge skx:skx zen:zen zen2:zen2 zen3:zen3
-------------------------
BLIS version string:        1.0
.so major version:          0
.so minor.build vers:       0
install libdir:             /usr/local/lib
install includedir:         /usr/local/include
install sharedir:           /usr/local/share
debugging status:           off
enable AddressSanitizer?    no
enabled threading model(s): single
enable BLAS API?            yes
enable CBLAS API?           yes
build static library?       yes
build shared library?       yes
ARG_MAX hack enabled?       no


I rebuilt on another CPU (an Intel i3-series instead of the Ryzen) and hit the same failure in the same routine. Both machines run Windows 10/11 with VMware, so all my tests are on a VM guest, not on bare hardware.

Some further notes and trials:

  1. Only happens on OpenBSD. Could not reproduce on Linux (MX 23.3 which is debian based).

  2. It seems to happen only when running with the debugging/testing malloc (enabled by MALLOC_OPTIONS=CFG) on this test program.

  3. The original test with the LAPACK program LIN/xlintsts resulted in segfaults with or without this memory allocation check.

  4. I tried the Microsoft mimalloc on MX Linux, thinking it was some kind of malloc issue not blis/fortran/os. No segfault found. Mimalloc does not have the same testing/debugging features as OpenBSD malloc.

  5. The segfault occurs about one third of the time with the test program. With the LAPACK program LIN/xlintsts, failures occur at different subtests, and because it has so many subtests the program never runs to completion.

  6. Some speculation on what could be happening:

  • malloc issue (does malloc cause the problem or does the debugging form of malloc just uncover an underlying problem?)
  • fortran memory issue (very unlikely)
  • OpenBSD malloc issue (very unlikely)
  • OpenBSD assembler language compatibility issue (I don't know enough about x86 and avx and OpenBSD assembly to say)
  • BLIS overrunning allocated arrays. The answers are correct but does BLIS "stay inside the lines"?
  • clang vs gcc

@j-bm
Contributor Author

j-bm commented Aug 4, 2024

Here is a debugging run with a modified test program which prints the address of the allocated arrays.

code fragment:

   print *,'malloctest:'
   write (*,'(A,Z16)') '  location A is ', loc(A)
   write (*,'(A,Z16)') '  location Q is ', loc(Q)
   write (*,'(A,Z16)') '  location R is ', loc(R)
   CALL SGEMM('T', 'N', M, N, M, ONE, Q, M, A, M, ZERO, R, M)

debugger output:

(gdb) break *bli_sgemmsup_rd_haswell_asm_1x16n+721
Breakpoint 3 at 0xd826ecae9a1
(gdb) c
Continuing.
 malloctest:
  location A is      D82E68897D0
  location Q is      D8214FBA000
  location R is      D822BE43000
libblis: selecting sub-configuration 'zen3'.

Breakpoint 3, 0x00000d826ecae9a1 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0
(gdb) x/i $pc
=> 0xd826ecae9a1 <bli_sgemmsup_rd_haswell_asm_1x16n+721>:
    vmovss (%rax,%r8,1),%xmm1
(gdb) info reg r8 rax rsp
r8             0xc8                200
rax            0xd82e6889f98       14855864623000
rsp            0x734c53fa1830      0x734c53fa1830
(gdb) p 0xd82e6889f98 - 0xD82E68897D0
$1 = 1992

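This shows that $rax is pointing near the end of array A: byte offset 1992 of the 2000 bytes (50 x 10 REALs of 4 bytes each) that A occupies, i.e. the second-to-last element.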

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00000d826ecae9a1 in bli_sgemmsup_rd_haswell_asm_1x16n ()
   from /usr/local/lib/libblis.so.0.0
(gdb) p/x *(0xD82E68897D0)
$2 = 0x3f800000
(gdb) p/x *($rax)
$3 = 0x3f800000

The dereference fails:

(gdb) p/x *($rax+$r8)
Cannot access memory at address 0xd82e688a060

Which suggests a bug, accessing beyond the end of array A.
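
The address arithmetic from the debugger session can be checked in a few lines. This is only a worked example using the values printed above; the 4096-byte page size is an assumption, as is the reading that the 200-byte offset in r8 corresponds to one leading-dimension stride (50 REALs of 4 bytes).

```python
# Values copied from the gdb session above.
A_BASE = 0xD82E68897D0   # loc(A)
RAX = 0xD82E6889F98      # base register of the faulting load
R8 = 0xC8                # index register of the faulting load (200 bytes)

M, N = 50, 10            # dimensions of A in the test program
ELEM_BYTES = 4           # single-precision REAL
PAGE_BYTES = 4096        # assumed page size

a_bytes = M * N * ELEM_BYTES          # total size of A
a_end = A_BASE + a_bytes              # one past the last byte of A

offset = RAX - A_BASE                 # where rax points inside A
elem_index = offset // ELEM_BYTES     # zero-based element index

fault_addr = RAX + R8                 # address read by vmovss (%rax,%r8,1)
bytes_past_end = fault_addr - a_end   # how far beyond A the load starts

last_page_of_a = (a_end - 1) // PAGE_BYTES
fault_page = fault_addr // PAGE_BYTES

print("offset into A:", offset, "of", a_bytes)
print("element index:", elem_index, "of", M * N)
print("fault address:", hex(fault_addr))
print("bytes past end of A:", bytes_past_end)
print("fault on a different page than A's last byte:", fault_page != last_page_of_a)
```

The computed fault address matches the one gdb could not access, and it lies on the page after the one holding the end of A, which is consistent with the fault appearing only when that neighbouring page happens to be unmapped.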

@fgvanzee
Member

fgvanzee commented Aug 6, 2024

@j-bm Thank you for those additional details, they were quite helpful! I think you helped us narrow it down to the last phase of the edge case handling code in the offending s1x16n kernel.

In kernels/haswell/3/sup/bli_gemmsup_rd_haswell_asm_s6x16n.c, line 2214 appears not to belong there and should be deleted. (You can see the same instruction on line 1708 in the s2x16n version of the kernel, so this is very likely a copy-paste bug.)

Please try deleting this line and let us know if it fixes the bug.

    label(.SLOOPKLEFT1)                // EDGE LOOP (scalar)
                                       // NOTE: We must use ymm registers here bc
                                       // using the xmm registers would zero out the
                                       // high bits of the destination registers,
                                       // which would destroy intermediate results.

    vmovss(mem(rax       ), xmm0)
    vmovss(mem(rax, r8, 1), xmm1)     // ***TRY DELETING THIS LINE
    add(imm(1*4), rax)                 // a += 1*cs_a = 1*4;
    
    vmovss(mem(rbx        ), xmm3)
    vfmadd231ps(ymm0, ymm3, ymm4)

    vmovss(mem(rbx, r11, 1), xmm3)
    vfmadd231ps(ymm0, ymm3, ymm7)

    vmovss(mem(rbx, r11, 2), xmm3)
    vfmadd231ps(ymm0, ymm3, ymm10)

    vmovss(mem(rbx, r13, 1), xmm3)
    add(imm(1*4), rbx)                 // b += 1*rs_b = 1*4;
    vfmadd231ps(ymm0, ymm3, ymm13)


    dec(rsi)                           // i -= 1;
    jne(.SLOOPKLEFT1)                  // iterate again if i != 0.

@j-bm
Contributor Author

j-bm commented Aug 7, 2024

Yes, that fixes the issue.

Did that extra instruction do anything important (other than segfaulting)?

@fgvanzee
Member

fgvanzee commented Aug 8, 2024

> Yes, that fixes the issue.

Great news. Thanks for your help!

> Did that extra instruction do anything important (other than segfaulting)?

No, it was 100% a copy-paste bug. I probably started with the 2x16 case and deleted instructions until it became a 1x16, but then forgot to delete that instruction (which would have loaded the second of the two elements of A).
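
To illustrate why the leftover instruction is harmless to the result but can still fault, here is a small Python model of a one-row scalar edge loop. It is a sketch under stated assumptions, not the BLIS kernel: the function name `edge_loop_1row`, the list-based layout, and the use of `IndexError` as a stand-in for an unmapped-page fault are all illustrative. The `K % 8` remainder follows the commit message's statement that the edge loop runs when k is not a multiple of 8.

```python
def edge_loop_1row(a, b_cols, k_start, k_left, a_row_stride, extra_load=False):
    """Model of a scalar edge loop for a single row of A.

    a            : flat storage holding the one row of A being processed
    b_cols       : columns of B, each a list indexed by k
    a_row_stride : distance to the next row of A (needed only by a 2-row kernel)
    extra_load   : if True, also read the element of a second row, as the
                   leftover instruction did; the value is never used
    """
    acc = [0.0] * len(b_cols)
    for k in range(k_start, k_start + k_left):
        a0 = a[k]
        if extra_load:
            _second_row = a[k + a_row_stride]  # out of bounds when only one row exists
        for j, col in enumerate(b_cols):
            acc[j] += a0 * col[k]
    return acc

K = 50                     # k dimension from the test program
k_left = K % 8             # scalar iterations left after the 8-wide steps
k_start = K - k_left

a_row = [1.0] * K
b_cols = [[1.0] * K for _ in range(4)]

fixed = edge_loop_1row(a_row, b_cols, k_start, k_left, a_row_stride=K)

try:
    edge_loop_1row(a_row, b_cols, k_start, k_left, a_row_stride=K, extra_load=True)
    buggy_raised = False
except IndexError:
    buggy_raised = True

print(fixed, buggy_raised)
```

In the real kernel the stray read only faults when the address falls on an unmapped page, which would explain why the results were always correct and the crash was intermittent.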

I'll open a PR with the fix and credit you. I really appreciate your feedback!

@fgvanzee
Member

fgvanzee commented Aug 8, 2024

@j-bm Sorry for getting the line numbers a little wrong btw. I had forgotten that I had inserted printf() calls to signal the entry into and exit from that function (as a sanity check to make sure the right code was being called).

fgvanzee added a commit that referenced this issue Aug 8, 2024
Details:
- Fixed a bug in the bli_sgemmsup_rd_haswell_asm_1x16n() millikernel.
  The kernel was erroneously performing an out-of-bounds read whenever
  the singleton edge case loop executed (that is, whenever the k
  dimension of the millikernel problem was not a multiple of 8). This
  OOB error was the result of a copy-paste bug; when developing the
  s1x16n function, I started from a copy of the s2x16n function, but
  then failed to delete the instruction that reads the second element
  of A in the code that handles the PR loop's edge case. Thanks to
  @j-bm for reporting this bug in Issue #821 and helping narrow down
  the cause to the rax register.
@j-bm
Contributor Author

j-bm commented Aug 8, 2024

Thanks for the quick fix!

@fgvanzee
Member

fgvanzee commented Aug 8, 2024

I'm going to close this issue now. If you encounter any further problems or concerns, please let us know.

@fgvanzee fgvanzee closed this as completed Aug 8, 2024
@BhaskarNallani
Contributor

Hi @fgvanzee ,
Allocating the input matrices for functionality tests with plain malloc(), which returns exactly the requested size without any extra alignment padding, helps find these out-of-bounds memory accesses.
In addition to that, ASAN testing helps further.

@fgvanzee
Member

fgvanzee commented Aug 8, 2024

I completely agree. Thanks for that reminder, @BhaskarNallani!

fgvanzee added a commit that referenced this issue Aug 8, 2024
Details:
- Fixed a bug in the bli_sgemmsup_rd_haswell_asm_1x16n() millikernel.
  The kernel was erroneously performing an out-of-bounds read whenever
  the singleton edge case loop executed (that is, whenever the k
  dimension of the millikernel problem was not a multiple of 8). This
  OOB error was the result of a copy-paste bug; when developing the
  s1x16n function, I started from a copy of the s2x16n function, but
  then failed to delete the instruction that reads the second element
  of A in the code that handles the PR loop's edge case. Thanks to
  @j-bm for reporting this bug in Issue #821 and helping narrow down
  the cause to the rax register.
- CREDITS file update.
fgvanzee added a commit that referenced this issue Aug 20, 2024
Details:
- Previously, if the user enabled CBLAS via 'configure --enable-cblas'
  and then ran 'make', the flattened blis.h header file would be created
  immediately, but the flattened cblas.h header file would not be
  created until 'make install' was run. This was happening because
  nothing in the BLIS build process (except installation) depended on
  the flattened cblas.h (whereas *everything* depends on the flattened
  blis.h, and therefore it was being created first). This behavior can
  be confusing to application developers who could reasonably expect
  that the flattened cblas.h header would be available (to inspect or
  use) prior to running 'make install'.
- This commit fixes the aforementioned issue by (1) adding cblas.h (if
  CBLAS is enabled) as a dependency to all of the build rules for core
  framework object files, and (2) making the flattened blis.h a
  prerequisite for flattening cblas.h. The upshot is that (1) ensures
  that the flattened cblas.h is created around the the same time that
  the flattened blis.h is created, and (2) ensures that the two headers
  are flattened sequentially (first blis.h and then cblas.h) even when
  using 'make -j[n]', which ensures that the output of the two processes
  do not comingle.
- Thanks to Jeff Diamond for reporting this issue.
- (cherry picked from commit 8d9be87)

Fixed out-of-bounds read bug in sup haswell ukr. (#824)

Details:
- Fixed a bug in the bli_sgemmsup_rd_haswell_asm_1x16n() millikernel.
  The kernel was erroneously performing an out-of-bounds read whenever
  the singleton edge case loop executed (that is, whenever the k
  dimension of the millikernel problem was not a multiple of 8). This
  OOB error was the result of a copy-paste bug; when developing the
  s1x16n function, I started from a copy of the s2x16n function, but
  then failed to delete the instruction that reads the second element
  of A in the code that handles the PR loop's edge case. Thanks to
  @j-bm for reporting this bug in Issue #821 and helping narrow down
  the cause to the rax register.
- CREDITS file update.
- (cherry picked from commit a822cb2)

Fixed typo in 4158930; variable renames. (#815)

Details:
- Fixed a typo in the "./configure --help" output for the ScaLAPACK
  compatibility option implemented in 4158930.
- Trivial variable renames.
- (cherry picked from commit 8820f8f)

Fix a bug in the piledriver microkernels. (#814)

Details:
- At some point, the piledriver (and bulldozer and excavator)
  microkernel tests via SDE had been removed from Travis CI testing.
  This PR re-enables them.
- A bug in the piledriver complex gemm microkernels has also been
  fixed. The `beta*C` product was not being correctly added to the `A*B`
  product before writing back out to memory.
- Fixes #811.
- (cherry picked from commit 31ecf82)

Add ScaLAPACK compatibility mode. (#813)

Details:
- Add configure options '--enable-scalapack-compat' and
  '--disabled-scalapack-compat' (default disabled).
- Add a macro BLIS_{ENABLE,DISABLE}_SCALAPACK_COMPAT to bli_config.h.
- This option and macro control any changes to the API necessary to
  maintain compatibility with ScaLAPACK. Currently, this only means
  disabling the complex versions of syr, syr2, and symv. In the
  future, other changes could be controlled by the same flag.
- Complex syr2 wasn't enabled at the same time that complex syr and
  symv were. This is now corrected.
- (cherry picked from commit 4158930)

Update CREDITS

- (cherry picked from commit 5cbec65)

Fix SyntaxWarning messages from python 3.12 (#809)

Details:
- When using regexes in Python, certain characters need backslash
  escaping, e.g.:

    regex = re.compile( '^[\s]*#include (["<])([\w\.\-/]*)([">])' )

  However, technically escape sequences like `\s` are not valid and
  should actually be double-escaped: `\\s`. Python 3.12 now warns about
  such escape sequences, and in a later version these warning will be
  promoted to errors. See also:
  https://docs.python.org/dev/whatsnew/3.12.html#other-language-changes
  The fix here is to use Python's "raw strings" to avoid
  double-escaping. This issue can be checked for all files in the
  current directory with the command:

    python -m compileall -d . -f -q .

  Thanks to @AngryLoki for the fix.
- (cherry picked from commit 729c57c)
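
As a brief illustration of the raw-string fix described in the commit message above, the following sketch compiles the same include-matching pattern as a raw string. The sample input lines are made up for the example.

```python
import re

# Raw string: backslashes reach the regex engine unchanged, so no
# double-escaping is needed and no invalid-escape warning is triggered.
include_re = re.compile(r'^[\s]*#include (["<])([\w\.\-/]*)([">])')

# A raw string and its double-escaped equivalent are the same text.
same_text = (r'\s' == '\\s')

quoted = include_re.match('  #include "blis.h"')
angled = include_re.match('#include <stdio.h>')

print(same_text, quoted.groups(), angled.groups())
```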

Updates to README.md section on downloading.

Details:
- Updated the text in README.md in the "How to Download BLIS" section.
  The new text no longer recommends that the reader use the 'master'
  branch over official releases, as the previous text did. The text was
  tweaked since (a) the 'master' branch is now akin to a development
  branch, and (b) the reader will no longer forgo bugfixes by sticking
  to official releases since we will (going forward) publish bugfix
  releases for the most recent version.
- (cherry picked from 6d0ab74)

Updated RELEASING file; fixes to ReleaseNotes.md.

Details:
- Updated RELEASING file to reflect new release protocols, given the
  more sophisticated policy of maintaining release candidate branches
  separate from 'master' (which is now more akin to a development
  branch). Further refinements to this file will likely follow.
- Fixed typos in ReleaseNotes.md. Thanks to Robert van de Geijn for
  reporting these.
- (cherry picked from 01e151a)

ReleaseNotes.md update.

- (cherry picked from 06dddf1)

CHANGELOG update (1.0)

- (cherry picked from a876918)

Version file update (1.0)

- (cherry picked from c2af113)

Added a script to help create new rc branches.

Details:
- Added a new script, build/start-new-rc.sh, which:
  1. Updates the version file with a new version string.
  2. Commits (locally) the version string update.
  3. Updates the CHANGELOG file with the output of 'git log'.
  4. Commits (locally) the CHANGLOG file update.
  5. Creates a new branch whose name is equal to "<vers>-rc0" where
     <vers> is the new version string.
  6. Reminds the user to execute some final steps if everything looks
     good.
  This new script will help in the future when it's time to start a new
  release candidate branch/lineage off of 'master'. Note that this
  script is based on build/bump-version.sh (which itself may change in
  the future due to changes in the way versions/releases will be handled
  going forward).
- (cherry picked from 5ab286f)
devinamatthews pushed a commit that referenced this issue Nov 3, 2024