
Q2 and Q3 quantization #1004

Closed · sw wants to merge 4 commits

Conversation

@sw (Collaborator) commented Apr 15, 2023

This adds support for 2-bit and 3-bit quantization with an FP16 shared scale and 16 quants per block.

I don't consider it ready to merge, as we might still come up with a different block format. The struct definitions are not portable: for one thing, they rely on a #pragma, and they are unlikely to work on big-endian systems.
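
(For orientation, here is a rough sketch of what such a Q2 block could look like, inferred from QK=16, the 6-byte block size mentioned further down, and the field names used in the AVX2 snippet later in the thread; the actual definitions in the patch may differ.)

#define QK2_0 16               // quants per block

#pragma pack(push, 1)          // the FP16 scale is 2 bytes, so without packing the
typedef struct {               // compiler would pad the block past 6 bytes
    ggml_fp16_t d;             // shared FP16 scale for the block
    uint32_t    qs;            // 16 x 2-bit quants packed into 32 bits
} block_q2_0;                  // 6 bytes per 16 weights
#pragma pack(pop)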

#951 really is a game-changer for 2-bit quantization. I have updated my code to use the Q8 intermediate quantization.

Q2 sample output (we may need to implement a profanity filter, but swearing is appropriate if you have to use PHP):

$ ./main -m models/7B/ggml-model-q2_0.bin -p "Building a website can be done in 10 simple steps:" -s 1681585661
 Building a website can be done in 10 simple steps:
Make sure your copy is up to date
Make sure graphics are updated
Make sure the fonts are refreshed
Make sure the javascript is refreshed
Make sure the PHP is refreshed
P.S.- I'll just tell ya what the fuck went wrong with P.S.- It's not like I'm superman or something...
P.S.- If you don't want to get your eyes glued up by playing Candyland, go read the comments for a minute
T.F.- I'll just tell ya what the fuck went wrong with T.F.- It

Q3 is more sensible, but I haven't played with it a lot.

As mentioned, both new types use FP16 and QK=16. This was easiest to implement for two reasons: I can just use half of a Q8 block, and 16 3-bit values can be mangled in an AVX2 register.
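
(A similarly rough sketch of a matching Q3 block under those constraints: the FP16 scale plus 16 x 3 bits = 48 bits of quants gives 8 bytes per 16 weights, which is roughly consistent with the 3.2G Q3 file size listed below. The real struct in the patch may pack the 48 bits differently.)

#define QK3_0 16               // quants per block

typedef struct {
    ggml_fp16_t d;             // shared FP16 scale for the block
    uint16_t    qs[3];         // 16 x 3-bit quants packed into 48 bits
} block_q3_0;                  // 8 bytes per 16 weights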

So I've essentially invented my own new formats, but I'm aware there's a 3-bit GPTQ format. I'd love to re-use what's already been done, but haven't been able to find a clear definition (or model files) of that.

Looking forward to anyone finding better SIMD optimizations, especially for Q3, which is a pain in the butt...

Model file sizes:

$ ls -gho models/7B/*q*
-rw-rw-r-- 1 2.4G Apr 15 21:04 models/7B/ggml-model-q2_0.bin
-rw-rw-r-- 1 3.2G Apr 15 19:53 models/7B/ggml-model-q3_0.bin
-rw-rw-r-- 1 4.0G Apr 13 21:51 models/7B/ggml-model-q4_0.bin
-rw-rw-r-- 1 4.8G Apr 12 21:43 models/7B/ggml-model-q4_1.bin

Perplexity for 7B (I'm not going to let this run for over a day; can someone with a faster machine help out?):

Q2:
78.25 seconds per pass - ETA 14.24 hours
[1]9.5516,[2]10.8049,[3]11.6886,[4]12.9123,[5]12.7524,[6]12.7123,[7]12.9646,[8]13.1274,[9]13.8018,[10]14.1944,[11]14.7764,

Q3:
141.31 seconds per pass - ETA 25.71 hours
[1]4.8166,[2]5.2200,[3]6.1143,
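
For anyone with a faster machine who wants to take these runs over, the usual invocation is along these lines (assuming the project's standard wikitext-2 raw test file; exact flags may differ between versions):

$ ./perplexity -m models/7B/ggml-model-q2_0.bin -f wiki.test.raw -t 8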

@sw (Collaborator, Author) commented Apr 15, 2023

The build failures on macOS show that I messed up by using AVX2 intrinsics in the AVX path, so this probably won't work on an AVX-only machine without modification.

Edit: removed the AVX parts; I had tested them with the wrong compiler flags.

@ggerganov (Owner) commented:

I'm playing with the smallest GPT-2 models and trying to make them work with 4-bit quantization. They keep breaking down completely even after #951. I guess the small number of parameters requires very high precision in the quantization.

However, I just noticed that if I keep just the last tensor in F16, they suddenly become coherent.
So I made this issue: #1003

Maybe it could further improve your Q2 and Q3 if you keep the last tensor in high precision

@slaren (Collaborator) commented Apr 15, 2023

Probably nothing new, but here is a quick test of the perplexity performance with Q2_0:

Model Time per pass
7B Q4_0 26.27 seconds
7B Q2_0 53.26 seconds
13B Q2_0 100.88 seconds
30B Q2_0 246.48 seconds

@slaren (Collaborator) commented Apr 16, 2023

Here is an AVX2 implementation of ggml_vec_dot_q2_0_q8_0 that operates on two blocks at a time. Doing this, the performance is much closer to Q4_0.

Model Time per pass
7B Q4_0 26.27 seconds
7B Q2_0 34.74 seconds
13B Q2_0 66.76 seconds
30B Q2_0 165.91 seconds
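
// Unpacks two blocks' worth of 2-bit quants (16 per uint32_t) into one byte each,
// preserving their order; packed2 fills the low 128 bits of the result, packed1 the high.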
static inline __m256i bytesFromi2(uint32_t packed1, uint32_t packed2) {
    __m128i bx1 = _mm_set1_epi32(packed1);
    __m128i bx2 = _mm_set1_epi32(packed2);
    __m256i bx = _mm256_set_m128i(bx1, bx2);

    // shift counts to get all bit pairs in lowest position of each byte
    const __m256i shift256 = _mm256_set_epi32(6, 4, 2, 0,
                                              6, 4, 2, 0);
    bx = _mm256_srlv_epi32(bx, shift256);

    const __m256i shufmask = _mm256_set_epi8(15,11,7,3,
                                             14,10,6,2,
                                             13,9,5,1,
                                             12,8,4,0,
                                             15,11,7,3,
                                             14,10,6,2,
                                             13,9,5,1,
                                             12,8,4,0);

    bx = _mm256_shuffle_epi8(bx, shufmask);

    const __m256i mask = _mm256_set1_epi8(3);
    bx = _mm256_and_si256(mask, bx);

    return bx;
}

static void ggml_vec_dot_q2_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int nb = n / QK2_0;

    assert(n % QK2_0 == 0);
    assert(nb % 2 == 0);

    const block_q2_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0f;

#if defined(__AVX2__)
    // Initialize accumulator with zeros
    __m256 acc = _mm256_setzero_ps();

    for (int i = 0; i < nb; i += 2) {
        __m256i bx = bytesFromi2(x[i+1].qs, x[i].qs);

        // Compute combined scale for the block
        const __m128 scale1 = _mm_set1_ps(GGML_FP16_TO_FP32(x[i].d) * y[i/2].d);
        const __m128 scale2 = _mm_set1_ps(GGML_FP16_TO_FP32(x[i+1].d) * y[i/2].d);
        const __m256 scale = _mm256_set_m128(scale2, scale1);

        const __m256i off = _mm256_set1_epi8(2);
        bx = _mm256_sub_epi8(bx, off);

        // Load y vector
        const __m256i by = _mm256_loadu_si256((const __m256i *)y[i/2].qs);

        // Get absolute values of x vectors
        const __m256i ax = _mm256_sign_epi8(bx, bx);

        // Sign the values of the y vectors
        const __m256i sy = _mm256_sign_epi8(by, bx);

        // Perform multiplication and create 16-bit values
        const __m256i dot = _mm256_maddubs_epi16(ax, sy);

        // Convert int16_t to int32_t by adding pairwise
        const __m256i ones = _mm256_set1_epi16(1);
        __m256i i32 = _mm256_madd_epi16(ones, dot);

        // Convert int32_t to float
        __m256 p = _mm256_cvtepi32_ps( i32 );

        // Apply the scale, and accumulate
        acc = _mm256_fmadd_ps( scale, p, acc );
    }

    // Return horizontal sum of the acc vector
    __m128 res = _mm256_extractf128_ps( acc, 1 );
    res = _mm_add_ps( res, _mm256_castps256_ps128( acc ) );
    res = _mm_add_ps( res, _mm_movehl_ps( res, res ) );
    res = _mm_add_ss( res, _mm_movehdup_ps( res ) );

    sumf = _mm_cvtss_f32( res );
#else
...

@ggerganov linked an issue Apr 16, 2023 that may be closed by this pull request
@sw (Collaborator, Author) commented Apr 16, 2023

Here is an AVX2 implementation of ggml_vec_dot_q2_0_q8_0 that operates on two blocks at a time

Thanks @slaren, I just added this. Apparently 2 bits are called a crumb, so I went with that.

I originally wrote it for 128 bits because I thought it would be AVX-compatible, but it turns out that just avoiding 256 in the intrinsic names is not enough; some of the 128-bit intrinsics were only introduced with AVX2.

@sw (Collaborator, Author) commented Apr 16, 2023

Maybe it could further improve your Q2 and Q3 if you keep the last tensor in high precision

That does help, but with a block size of 6 bytes for 16 weights (effectively 3 bits per weight), Q2 is already 3-bit in a sense, and keeping the output tensor in FP16 makes the file bigger again. Q3 might be changed to QK=24, but that doesn't divide evenly into the Q8 block size of 32.

QK=16, output tensor FP16, file size 2.6G:
[1] 8.4455,[2] 9.7838,[3]10.4037,[4]11.3945,[5]11.0683,[6]11.0007,[7]11.1454,[8]11.3229,[9]11.8927,[10]12.2416,[11]12.6852,

QK=16, output tensor Q2, file size 2.4G (final perplexity 12.6438):
[1] 9.5516,[2]10.8049,[3]11.6886,[4]12.9123,[5]12.7524,[6]12.7123,[7]12.9646,[8]13.1274,[9]13.8018,[10]14.1944,[11]14.7764,

QK=32, output tensor FP16, file size 2.2G:
[1]12.0870,[2]14.3404,[3]15.1381,[4]16.0226,[5]15.6099,[6]15.3150,[7]15.7483,[8]15.8426,[9]16.5506,[10]17.1701,[11]17.9920,

QK=32, output tensor Q2, file size 2.0G:
[1]15.3372,[2]17.8944,[3]18.8442,[4]19.8121,[5]19.6249,[6]19.5118,[7]20.2628,[8]20.5298,[9]21.2859,[10]21.9962,[11]23.2782,

I don't know who would want to use the smallest file, except for entertainment. Sometimes it actually talks about building websites, but sometimes it goes off the rails:

Building a website can be done in 10 simple steps:
Ten Tips For Writing A Novel In Thirty Days
I have decided to begin my book reviewing of the Fat, Furry Feline by F. J. (Bed, 2008). This book was written and edited by M. (Bed), using a series of gags about cats with large breasts. The entire feline breed is actually in reference to the cat that is a door mat that has an accompanying set of (forgable, un-forglut, and in, 2010).

Changing the sampling parameters (temperature etc.) may help, but I haven't played with those.
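
For anyone who wants to experiment with that, the relevant sampling flags in main at the time looked roughly like this (the values here are arbitrary examples, not tuned recommendations):

$ ./main -m models/7B/ggml-model-q2_0.bin -p "Building a website can be done in 10 simple steps:" --temp 0.7 --top_k 40 --top_p 0.9 --repeat_penalty 1.3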

@slaren (Collaborator) commented Apr 16, 2023

In case this is useful for future comparisons, currently 7B q2_0 AVX2 has a perplexity of 12.6438.

@pubby commented Apr 16, 2023

Looking forward to anyone finding better SIMD optimizations, especially for Q3, which is a pain in the butt...

I wrote about this at #456 (comment), but I think there's a pretty obvious win from using a different q3_0 representation. The downside is that it's not the GPTQ format.

Code and pull request at sw#1

@sw (Collaborator, Author) commented Apr 17, 2023

Thanks to @pubby, the Q3 code is now faster on AVX2 and should be more amenable to other SIMD optimizations. You'll have to re-quantize the model, though.

@pubby commented Apr 17, 2023

Here's an attempt at porting bytesFromCrumbs to other architectures. I don't have these systems so they may be incorrect and/or slow. Improvements are more than welcome.

ARM NEON:

static inline int8x16_t bytesFromCrumbs(uint32_t packed) {
#  if __ARM_64BIT_STATE
    const uint8x16_t temp = vreinterpretq_u8_u32(vmovq_n_u32(packed));

    // Swizzle to put in the proper order
    const uint8x16_t indices = vcombine_u8(vcreate_u8(0x0D0905010C080400), vcreate_u8(0x0F0B07030E0A0602));
    const uint8x16_t swizzled = vqtbl1q_u8(temp, indices);
#  else
    const uint8x8_t temp_lo = vreinterpret_u8_u16(vmov_n_u16(packed));
    const uint8x8_t temp_hi = vreinterpret_u8_u16(vmov_n_u16(packed >> 16));

    // Swizzle to put in the proper order
    const uint8x8_t indices = vcreate_u8(0x0705030106040200);
    const uint8x8_t swizzled_lo = vtbl1_u8(temp_lo, indices);
    const uint8x8_t swizzled_hi = vtbl1_u8(temp_hi, indices);
    const uint8x16_t swizzled = vcombine_u8(swizzled_lo, swizzled_hi);
#  endif

    // Shift counts left to get all bit pairs in highest position of each byte
    const int8x16_t shift = vreinterpretq_s8_u32(vmovq_n_u32(0x00020406));
    const uint8x16_t shifted = vshlq_u8(swizzled, shift);

    // Then shift right to put in lowest position
    return vreinterpretq_s8_u8(vshrq_n_u8(shifted, 6));
}

WASM:

static inline v128_t bytesFromCrumbs(uint32_t packed) {
    // Shift without SIMD
    const uint32_t packed128[4] = { packed >> 6, packed >> 4, packed >> 2, packed };
    const v128_t shifted = wasm_v128_load(&packed128);

    // We only care about the lowest two bits of each byte
    const v128_t mask = wasm_u8x16_const_splat(3);
    const v128_t masked = wasm_v128_and(shifted, mask);

    // Swizzle to put the crumbs back in their original order
    // (the index array is in memory order, byte 0 first, unlike the _mm256_set_epi8
    // arguments in the AVX2 version, which are listed from the highest byte down)
    const uint8_t swizmask[16] = { 12, 8, 4, 0,
                                   13, 9, 5, 1,
                                   14,10, 6, 2,
                                   15,11, 7, 3 };
    return wasm_i8x16_swizzle(masked, wasm_v128_load(&swizmask));
}

(The bytesFromCrumbs function expands packed 2-bit integers into an 8-bit vector: every 2-bit integer gets its own byte, and the order between integers remains the same.)
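
For reference, a scalar version of the same unpacking (illustrative only, not from the patch), which doubles as a specification for the ports above:

static inline void bytesFromCrumbs_scalar(uint32_t packed, uint8_t out[16]) {
    // crumb i (bits 2*i .. 2*i+1 of 'packed') ends up in byte i, with value 0..3
    for (int i = 0; i < 16; ++i) {
        out[i] = (packed >> (2 * i)) & 3;
    }
}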

@sw (Collaborator, Author) commented Apr 19, 2023

At this point I'm wondering if we should target a specific model size. Is there any environment (wasm, for example) where the 4 GB 7B Q4_0 file is too large?

Q2 probably shouldn't be merged, as it's not really usable.

@ghost commented Apr 20, 2023

Final perplexity for LLaMA 30B Q2_0: 6.9507

[1]5.5177,[2]6.2985,[3]6.9708,[4]8.0269,[5]7.9123,[6]7.8642,[7]8.0664,[8]8.1121,[9]8.4769,[10]8.7209,[11]8.9686,[12]9.0272,[13]8.9574,[14]9.0978,[15]9.2723,[16]8.8251,[17]8.6077,[18]8.6261,[19]8.2110,[20]8.1553,[21]8.0250,[22]7.8321,[23]7.7748,[24]7.6499,[25]7.6338,[26]7.4238,[27]7.1778,[28]7.0810,[29]6.9396,[30]6.7609,[31]6.6959,[32]6.7177,[33]6.6486,[34]6.7021,[35]6.7173,[36]6.7635,[37]6.7516,[38]6.7760,[39]6.8057,[40]6.8891,[41]6.9141,[42]6.9465,[43]6.8917,[44]6.9374,[45]6.9347,[46]6.8974,[47]6.9053,[48]6.8708,[49]6.8891,[50]6.8306,[51]6.8271,[52]6.8127,[53]6.8515,[54]6.8228,[55]6.7945,[56]6.8031,[57]6.7943,[58]6.8206,[59]6.8333,[60]6.8809,[61]6.8704,[62]6.9410,[63]6.9596,[64]6.9626,[65]7.0037,[66]6.9978,[67]7.0110,[68]7.0340,[69]7.0772,[70]7.1199,[71]7.1404,[72]7.1686,[73]7.2482,[74]7.2447,[75]7.2483,[76]7.2525,[77]7.2638,[78]7.2433,[79]7.2776,[80]7.2706,[81]7.2926,[82]7.2969,[83]7.2396,[84]7.2230,[85]7.2163,[86]7.1918,[87]7.1416,[88]7.1082,[89]7.0769,[90]7.0589,[91]7.0746,[92]7.0662,[93]7.0564,[94]7.0498,[95]7.0789,[96]7.0689,[97]7.0662,[98]7.0575,[99]7.0397,[100]7.0310,[101]7.0554,[102]7.0468,[103]7.0648,[104]7.0678,[105]7.0741,[106]7.0914,[107]7.0889,[108]7.0890,[109]7.0803,[110]7.0750,[111]7.0927,[112]7.1181,[113]7.1165,[114]7.1075,[115]7.1169,[116]7.1093,[117]7.1158,[118]7.1380,[119]7.1543,[120]7.2015,[121]7.2236,[122]7.2413,[123]7.2815,[124]7.3005,[125]7.2949,[126]7.3331,[127]7.3764,[128]7.4058,[129]7.3858,[130]7.3895,[131]7.3835,[132]7.3807,[133]7.3744,[134]7.3913,[135]7.3902,[136]7.3851,[137]7.3742,[138]7.3653,[139]7.3534,[140]7.3538,[141]7.3388,[142]7.3374,[143]7.3190,[144]7.2977,[145]7.2939,[146]7.2789,[147]7.2920,[148]7.2946,[149]7.2855,[150]7.2868,[151]7.2915,[152]7.2771,[153]7.2630,[154]7.2591,[155]7.2708,[156]7.2708,[157]7.2883,[158]7.2899,[159]7.2936,[160]7.2982,[161]7.3141,[162]7.2791,[163]7.2627,[164]7.2392,[165]7.2048,[166]7.1751,[167]7.1297,[168]7.0954,[169]7.0795,[170]7.0687,[171]7.0406,[172]7.0216,[173]7.0065,[174]6.9751,[175]6.9491,[176]6.9354,[177]6.9131,[178]6.8908,[179]6.8687,[180]6.8645,[181]6.8417,[182]6.8220,[183]6.8039,[184]6.7999,[185]6.7914,[186]6.7957,[187]6.8000,[188]6.7957,[189]6.8165,[190]6.8203,[191]6.8419,[192]6.8683,[193]6.8867,[194]6.9013,[195]6.9261,[196]6.9405,[197]6.9591,[198]6.9788,[199]6.9811,[200]6.9793,[201]6.9677,[202]6.9847,[203]6.9874,[204]6.9928,[205]7.0002,[206]7.0002,[207]6.9940,[208]6.9978,[209]7.0021,[210]7.0087,[211]7.0157,[212]7.0205,[213]7.0291,[214]7.0318,[215]7.0346,[216]7.0481,[217]7.0638,[218]7.0814,[219]7.0805,[220]7.0773,[221]7.0722,[222]7.0710,[223]7.0601,[224]7.0513,[225]7.0480,[226]7.0690,[227]7.0813,[228]7.0898,[229]7.0987,[230]7.0980,[231]7.1166,[232]7.1067,[233]7.0888,[234]7.0753,[235]7.0602,[236]7.0543,[237]7.0463,[238]7.0491,[239]7.0338,[240]7.0196,[241]7.0263,[242]7.0329,[243]7.0311,[244]7.0205,[245]7.0224,[246]7.0115,[247]7.0014,[248]6.9918,[249]6.9878,[250]6.9932,[251]6.9876,[252]6.9835,[253]6.9754,[254]6.9747,[255]6.9643,[256]6.9442,[257]6.9357,[258]6.9280,[259]6.9269,[260]6.9185,[261]6.9136,[262]6.9074,[263]6.9027,[264]6.8897,[265]6.8917,[266]6.8896,[267]6.8843,[268]6.8944,[269]6.8940,[270]6.8943,[271]6.9033,[272]6.9083,[273]6.9085,[274]6.9125,[275]6.9211,[276]6.9263,[277]6.9436,[278]6.9562,[279]6.9676,[280]6.9694,[281]6.9813,[282]6.9884,[283]7.0045,[284]7.0150,[285]7.0261,[286]7.0435,[287]7.0440,[288]7.0512,[289]7.0415,[290]7.0233,[291]7.0066,[292]6.9897,[293]6.9715,[294]6.9707,[295]6.9718,[296]6.9745,[297]6.9715,[298]6.9723,[299]6.9680,[300]6.9559,[301]6.9546,[302]6.9481,[303]6.9385,[304]6.9293,[305]6.9256,[30
6]6.9127,[307]6.9121,[308]6.9184,[309]6.9028,[310]6.8987,[311]6.8908,[312]6.8928,[313]6.8875,[314]6.8852,[315]6.8667,[316]6.8661,[317]6.8484,[318]6.8283,[319]6.8452,[320]6.8610,[321]6.8688,[322]6.8653,[323]6.8573,[324]6.8564,[325]6.8676,[326]6.8675,[327]6.8694,[328]6.8716,[329]6.8755,[330]6.8782,[331]6.8905,[332]6.8866,[333]6.8947,[334]6.8871,[335]6.8813,[336]6.8822,[337]6.8780,[338]6.8762,[339]6.8714,[340]6.8666,[341]6.8758,[342]6.8781,[343]6.8844,[344]6.8833,[345]6.8834,[346]6.8800,[347]6.8825,[348]6.8869,[349]6.8890,[350]6.8856,[351]6.8856,[352]6.8852,[353]6.8795,[354]6.8817,[355]6.8854,[356]6.8872,[357]6.8832,[358]6.8923,[359]6.8964,[360]6.8920,[361]6.8907,[362]6.8992,[363]6.9100,[364]6.9174,[365]6.9238,[366]6.9237,[367]6.9327,[368]6.9279,[369]6.9285,[370]6.9301,[371]6.9241,[372]6.9296,[373]6.9358,[374]6.9335,[375]6.9318,[376]6.9402,[377]6.9343,[378]6.9368,[379]6.9415,[380]6.9341,[381]6.9282,[382]6.9230,[383]6.9227,[384]6.9220,[385]6.9203,[386]6.9174,[387]6.9168,[388]6.9110,[389]6.9060,[390]6.8980,[391]6.8887,[392]6.8853,[393]6.8845,[394]6.8879,[395]6.8851,[396]6.8767,[397]6.8881,[398]6.8922,[399]6.8997,[400]6.8987,[401]6.9001,[402]6.9035,[403]6.9058,[404]6.9129,[405]6.9019,[406]6.8964,[407]6.8940,[408]6.8944,[409]6.9054,[410]6.9160,[411]6.9287,[412]6.9461,[413]6.9594,[414]6.9672,[415]6.9715,[416]6.9799,[417]6.9927,[418]6.9963,[419]7.0011,[420]7.0101,[421]7.0235,[422]7.0292,[423]7.0363,[424]7.0467,[425]7.0531,[426]7.0602,[427]7.0641,[428]7.0707,[429]7.0761,[430]7.0843,[431]7.1010,[432]7.1057,[433]7.1031,[434]7.0955,[435]7.0960,[436]7.0976,[437]7.1067,[438]7.1152,[439]7.1098,[440]7.1083,[441]7.1027,[442]7.1010,[443]7.1035,[444]7.1059,[445]7.1025,[446]7.1055,[447]7.1079,[448]7.1116,[449]7.1098,[450]7.1103,[451]7.1070,[452]7.0995,[453]7.0911,[454]7.0866,[455]7.0853,[456]7.0886,[457]7.0900,[458]7.0868,[459]7.0859,[460]7.0942,[461]7.0898,[462]7.0858,[463]7.0901,[464]7.0885,[465]7.0859,[466]7.0785,[467]7.0794,[468]7.0801,[469]7.0797,[470]7.0781,[471]7.0733,[472]7.0759,[473]7.0684,[474]7.0698,[475]7.0648,[476]7.0668,[477]7.0588,[478]7.0588,[479]7.0653,[480]7.0713,[481]7.0728,[482]7.0678,[483]7.0614,[484]7.0619,[485]7.0604,[486]7.0527,[487]7.0519,[488]7.0495,[489]7.0424,[490]7.0378,[491]7.0354,[492]7.0276,[493]7.0223,[494]7.0214,[495]7.0189,[496]7.0145,[497]7.0108,[498]7.0093,[499]7.0033,[500]6.9934,[501]6.9843,[502]6.9826,[503]6.9815,[504]6.9718,[505]6.9733,[506]6.9749,[507]6.9745,[508]6.9710,[509]6.9703,[510]6.9747,[511]6.9806,[512]6.9858,[513]6.9878,[514]6.9950,[515]6.9875,[516]6.9849,[517]6.9853,[518]6.9843,[519]6.9866,[520]6.9886,[521]6.9903,[522]6.9920,[523]6.9928,[524]6.9993,[525]7.0033,[526]7.0051,[527]7.0068,[528]7.0012,[529]7.0021,[530]6.9965,[531]6.9969,[532]7.0026,[533]7.0053,[534]7.0032,[535]7.0061,[536]6.9996,[537]6.9975,[538]7.0023,[539]7.0038,[540]7.0088,[541]7.0104,[542]7.0114,[543]7.0149,[544]7.0168,[545]7.0157,[546]7.0160,[547]7.0116,[548]7.0049,[549]7.0047,[550]7.0021,[551]6.9986,[552]6.9972,[553]6.9918,[554]6.9888,[555]6.9866,[556]6.9863,[557]6.9898,[558]6.9866,[559]6.9890,[560]6.9876,[561]6.9878,[562]6.9858,[563]6.9858,[564]6.9923,[565]6.9945,[566]6.9942,[567]6.9911,[568]6.9926,[569]6.9901,[570]6.9946,[571]6.9955,[572]6.9959,[573]6.9958,[574]6.9921,[575]6.9913,[576]6.9905,[577]6.9871,[578]6.9852,[579]6.9848,[580]6.9770,[581]6.9725,[582]6.9735,[583]6.9745,[584]6.9769,[585]6.9698,[586]6.9633,[587]6.9631,[588]6.9690,[589]6.9753,[590]6.9771,[591]6.9791,[592]6.9777,[593]6.9726,[594]6.9734,[595]6.9694,[596]6.9745,[597]6.9716,[598]6.9692,[599]6.9716,[600]6.9706,[601]6.9684,[602]6
.9720,[603]6.9754,[604]6.9759,[605]6.9791,[606]6.9804,[607]6.9802,[608]6.9769,[609]6.9778,[610]6.9836,[611]6.9816,[612]6.9843,[613]6.9813,[614]6.9758,[615]6.9667,[616]6.9704,[617]6.9632,[618]6.9579,[619]6.9525,[620]6.9374,[621]6.9296,[622]6.9265,[623]6.9282,[624]6.9283,[625]6.9290,[626]6.9284,[627]6.9326,[628]6.9336,[629]6.9330,[630]6.9370,[631]6.9436,[632]6.9507,[633]6.9497,[634]6.9539,[635]6.9537,[636]6.9510,[637]6.9491,[638]6.9522,[639]6.9472,[640]6.9464,[641]6.9453,[642]6.9507,[643]6.9510,[644]6.9511,[645]6.9489,[646]6.9526,[647]6.9494,[648]6.9488,[649]6.9482,[650]6.9514,[651]6.9559,[652]6.9564,[653]6.9606,[654]6.9533,[655]6.9507,

@sw (Collaborator, Author) commented Apr 20, 2023

Rebased onto master, but I kept the tensor/ftype numbering, because @TheBloke has published Alpaca/LoRA model files for Q2. These should still work now but I haven't tested that. On the other hand, Q4_2 and Q4_3 will not work on this branch. If and when this gets merged, you will have to re-quantize your Q2/Q3 models.

As for perplexity: thanks to everyone providing numbers; my machine is too slow for that...

But it looks like Q2 isn't really worth it, unless you have some extreme file/RAM size restrictions:
[plot: q2q3-perp (perplexity comparison chart)]

@xloem (Contributor) commented Apr 24, 2023

Regarding wasm, there is indeed a 32-bit memory model for now, so sizeof(size_t) == 4 and large models cannot be allocated.

In practice, on my trial wasm platform (iOS a-shell, where the whole system is wasm), malloc() calls for a little over 1 GiB start returning 0 (and mmap here is just a wrapper around malloc() and pread(), so it doesn't help). 2-bit quantization of LLaMA 7B wouldn't be sufficient compression for this particular wasm runtime without some additional structured pruning and/or ahead-of-time model compilation.

But data larger than 4 GB can't all be referenced in memory at once (or "mmap"ed from a file) because the pointers are 32-bit for now.
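
An illustrative check of the constraint being described (a minimal sketch, not from this thread; plain C, assuming a wasm32 target where pointers and size_t are 32-bit):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // On wasm32, size_t and pointers are 32 bits, so a single allocation or
    // mapping of a 4 GiB+ model cannot even be expressed, regardless of how
    // much memory the runtime would otherwise allow.
    printf("sizeof(size_t) = %zu bytes\n", sizeof(size_t));   // 4 on wasm32
#if UINTPTR_MAX <= 0xFFFFFFFFu
    puts("32-bit address space: a >4 GiB file cannot be referenced at once");
#endif
    return 0;
}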

@ikawrakow mentioned this pull request Apr 29, 2023
@sw (Collaborator, Author) commented Jun 9, 2023

Obsolete thanks to #1684

@sw closed this Jun 9, 2023
Development

Successfully merging this pull request may close these issues.

2-bit integer quantization
5 participants