
Running the Grok-1 Large Language Model on a Mac Studio? Not so fast: you may first need to lower the precision to reduce the startup memory footprint

Author

digoal

Date

2024-03-24

Tags

PostgreSQL , PolarDB , DuckDB , Grok-1 , llama , Mac Studio , 4-bit


Background

《Musk Open-Sources Grok, a 314-Billion-Parameter Large Model》

《Good News: Mac Studio Can Run Musk's Open-Source Grok-1 Large Language Model》

《Downloading and Using Musk's Open-Source Grok-1 Large Language Model》

《De Shuo, Issue 278: Are Large Language Models a "Drug"?》

《Top 20 Open-Source AI Projects of 2023》

So the Mac Studio can become a productivity tool for large models? Not so fast: the memory consumption is huge, and you may need to convert the precision to lower the memory requirement.

xai-org/grok-1#42

Q: Will it work with one GPU?

A1: No way. Not this model, even highly quantized. Unless it's a GH200 data-center edition, which does have 96 GB of VRAM integrated with 480 GB of CPU RAM. Then MAYBE.

A2: A 4-bit quantized model would likely be at least 96 GB, so it might fit on four 24 GB cards.

Someone may have figured it out: https://huggingface.co/eastwind/grok-1-hf-4bit/tree/main

Adding up the safetensors shards in the eastwind Hugging Face repo mentioned above comes to about 90.2 GB on disk. It's not clear how much would be needed to load it for inference; likely more, to allow for overhead. I can't speak to memory usage or output quality, since this is still beyond my hardware.
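
For reference, here is a minimal sketch of how to reproduce that on-disk estimate once the shards have been downloaded locally (the directory name below is just a placeholder, not a path from the repo):

from pathlib import Path  
  
# sum the sizes of the downloaded *.safetensors shards to estimate the on-disk footprint  
shard_dir = Path("grok-1-hf-4bit")  # placeholder: wherever the eastwind shards were downloaded  
total_bytes = sum(p.stat().st_size for p in shard_dir.glob("*.safetensors"))  
print(f"{total_bytes / 1e9:.1f} GB on disk")  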

Q: Killed by the OS when running on a Mac M3 Max with 128 GB of memory.

A: Likely an OOM (out of memory) situation, which is expected with a model whose raw weights are larger than 128 GB (300 GB+ for the full model). You will likely have to wait before running the whole model on a CPU with 128 GB of memory. 4-bit quantization may be able to get the model down to around 96 GB, with some loss of output quality due to quantization effects. That is just speculation right now regarding the 4-bit quantized weight size. #42

Q&A: Besides lowering the precision, you can also spread the computation across multiple machines with MPI; see https://github.com/ggerganov/llama.cpp

MPI build

MPI lets you distribute the computation over a cluster of machines. Because of the serial nature of LLM prediction, this will not yield any end-to-end speed-up, but it does let you run models larger than would fit in a single machine's RAM.

First, you need an MPI library installed on your system. The two most popular (only?) options are MPICH and OpenMPI; either can be installed with a package manager (apt, Homebrew, MacPorts, etc.).

Next, you need to build the project with LLAMA_MPI set to true on all machines; if you build with make, you also need to specify an MPI-capable compiler (when building with CMake, this is configured automatically).

Supplementary background:

What is the high-performance message-passing library Open MPI? https://www.open-mpi.org/

The Open MPI Project is an open-source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from across the high-performance computing community to build the best MPI library available. Open MPI offers advantages to system and software vendors, application developers, and computer science researchers.

Nearly all of Open MPI's main contributors are commercial players from the HPC space (IBM, Cisco, AWS, NVIDIA): https://www-lb.open-mpi.org/papers/sc-2023/Open-MPI-SC23-BOF.pdf

What is HPC? https://www.ibm.com/cn-zh/topics/hpc

HPC is technology that uses clusters of powerful processors, working in parallel, to process massive multidimensional data sets (big data) and solve complex problems at extremely high speeds. HPC systems typically run more than a million times faster than the fastest commodity desktop, laptop, or server systems.

For decades the paradigm of the HPC system has been the supercomputer, a purpose-built computer that embodies millions of processors or processor cores. Supercomputers are still with us; at this writing, the fastest supercomputer is the US-based Frontier (link resides outside ibm.com), with a processing speed of 1.102 exaflops, or quintillion (10^18) floating-point operations per second (flops). But today, more and more organizations are running HPC solutions on clusters of high-speed servers, hosted on premises or in the cloud.

HPC workloads uncover important new insights that advance human knowledge and create significant competitive advantage. For example, HPC is used to sequence DNA, automate stock trading, and run artificial intelligence (AI) algorithms and simulations, such as those enabling self-driving cars, which analyze terabytes of data streaming in real time from IoT sensors, radar, and GPS systems in order to make split-second decisions.

How do you convert the precision?

1. Download the grok-1 weight files (roughly 300+ GB); see 《Downloading and Using Musk's Open-Source Grok-1 Large Language Model》.

2. Refer to the following project.

Unofficial dequantized weight of grok-1 in HF Transformers format.

https://huggingface.co/keyfan/grok-1-hf

Note: If you haven't downloaded the weights yet, please use the fp32 revision instead, which uses float32 precision for the RMSNorm and Router layers for better consistency.

The (fp32) weights are converted using the script here (https://gist.github.com/chu-tianxiang/ec310e15d56949fd0f351cb5f65ee7a1), run inside the grok-1 repo (https://github.com/xai-org/grok-1). Since the dequantized weights take twice as long to download, it's recommended to download the original weights and convert them yourself.

The conversion script, to be run inside the grok-1 repo directory:

Script contents:

import os  
import numpy as np  
import torch  
import jax  
from tqdm import tqdm  
  
from model import LanguageModelConfig, TransformerConfig, QuantizedWeight8bit as QW8Bit  
from runners import InferenceRunner, ModelRunner, sample_from_model  
  
CKPT_PATH = "./checkpoints"  
  
grok_1_model = LanguageModelConfig(  
    vocab_size=128 * 1024,  
    pad_token=0,  
    eos_token=2,  
    sequence_len=8192,  
    embedding_init_scale=1.0,  
    output_multiplier_scale=0.5773502691896257,  
    embedding_multiplier_scale=78.38367176906169,  
    model=TransformerConfig(  
        emb_size=48 * 128,  
        widening_factor=8,  
        key_size=128,  
        num_q_heads=48,  
        num_kv_heads=8,  
        num_layers=64,  
        attn_output_multiplier=0.08838834764831845,  
        shard_activations=True,  
        # MoE.  
        num_experts=8,  
        num_selected_experts=2,  
        # Activation sharding.  
        data_axis="data",  
        model_axis="model",  
    ),  
)  
  
runner = ModelRunner(  
    model=grok_1_model,  
    bs_per_device=0.125,  
    checkpoint_path=CKPT_PATH,  
)  
dummy_data = dict(  
    inputs=np.zeros((1, 256), dtype=np.int32),  
    targets=np.zeros((1, 256), dtype=np.int32),  
)  
runner.transform_forward = True  
runner.initialize(dummy_data, (1, 1), (1, 1))  
params = runner.load_or_init(dummy_data)  
  
new_params = {}  
keys = list(params.keys())  
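# Walk every checkpoint tensor: rename the Haiku-style keys to HF Transformers naming,  
# dequantize the 8-bit weights by multiplying them with their stored scales, transpose  
# linear layers, and split the fused MoE tensors into one weight per expert.  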
for key in tqdm(keys):  
    new_key = key.replace('/', '.').replace('decoder_layer_', 'decoder_layer.').replace('language_model', 'transformer')  
    new_key += '.weight'  
    v = list(params[key].values())[0]  
    if hasattr(v, 'scales'):  
        dtype = torch.float32 if v.scales.dtype == np.float32 else torch.bfloat16  
        # torch cannot convert bfloat16 directly  
        weight = torch.from_numpy(np.asarray(v.weight).astype(np.float32)).to(dtype)  
        scale = torch.from_numpy(np.asarray(v.scales).astype(np.float32)).to(dtype)  
        # row parallel layers have sharded scale  
        if len(scale.shape) >= 2 and scale.shape[-2] != 1:  
            scale = scale[..., None, :]  
            weight = weight.view(*weight.shape[:-2], 8, -1, weight.shape[-1])  
            weight = (weight * scale).view(*weight.shape[:-3], -1, weight.shape[-1])  
        else:  
            weight = weight * scale  
    else:  
        dtype = torch.float32 if v.dtype == np.float32 else torch.bfloat16  
        weight = torch.from_numpy(np.asarray(v).astype(np.float32)).to(dtype)  
    # Transpose linear matrix  
    if len(weight.shape) >= 2 and 'in_out_embed' not in new_key:  
        weight = weight.transpose(-1, -2).contiguous()  
  
    if 'moe' not in new_key:  
        new_params[new_key] = weight  
    else:  
        # split moe  
        for i in range(8):  
            new_key_i = new_key.replace('moe', f'moe.{i}')  
            new_params[new_key_i] = weight[i]  
  
    del params[key]  
  
os.makedirs('hf', exist_ok=True)  # make sure the output directory exists  
torch.save(new_params, 'hf/pytorch_model.bin')  
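
Once the script finishes, a quick sanity check of the converted state dict (a minimal sketch; loading the whole file still requires a large amount of RAM):

import torch  
  
# inspect a few tensors of the converted HF-format checkpoint  
state_dict = torch.load('hf/pytorch_model.bin', map_location='cpu')  
print(f"{len(state_dict)} tensors in total")  
for name, tensor in list(state_dict.items())[:5]:  
    print(name, tuple(tensor.shape), tensor.dtype)  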

3. Either download the already-converted weights from the project below, or use the script above together with the project below as a reference, possibly modifying the conversion script to convert to 4-bit precision.

https://huggingface.co/eastwind/grok-1-hf-4bit/tree/main

Unofficial 4bit Version of Grok-1 Quantised from https://huggingface.co/keyfan/grok-1-hf .

Or see: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/llama.cpp%E9%87%8F%E5%8C%96%E9%83%A8%E7%BD%B2
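
One possible route from the dequantized HF weights to a 4-bit checkpoint is bitsandbytes NF4 quantization through Transformers. The sketch below is only an illustration under assumptions: it needs a CUDA machine with ample RAM/VRAM (bitsandbytes does not run on Apple Silicon), a recent transformers/bitsandbytes that can serialize 4-bit checkpoints, and the NF4/bfloat16 settings are my choice; it is not necessarily how the eastwind 4-bit repo was produced.

import torch  
from transformers import AutoModelForCausalLM, BitsAndBytesConfig  
  
# NF4 4-bit quantization config; bfloat16 as the compute dtype (assumed settings)  
bnb_config = BitsAndBytesConfig(  
    load_in_4bit=True,  
    bnb_4bit_quant_type="nf4",  
    bnb_4bit_compute_dtype=torch.bfloat16,  
)  
  
# load the dequantized HF-format weights and quantize them to 4 bit on the fly  
model = AutoModelForCausalLM.from_pretrained(  
    "keyfan/grok-1-hf",  
    quantization_config=bnb_config,  
    device_map="auto",  
    trust_remote_code=True,  # grok-1 ships custom modeling code  
)  
  
# write the 4-bit weights out as safetensors shards  
model.save_pretrained("grok-1-hf-4bit")  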

gemini: how to convert the dequantized grok-1 weights into 4-bit quantized weights in safetensors format?

You cannot directly dequantize a model's weights from a higher-precision format (e.g., float32) to a lower-precision format (e.g., 4-bit) using SafeTensor. Dequantization refers to removing the quantization information added during the quantization process to recover the original full-precision weights, and that process is generally not reversible.

However, you can achieve a similar outcome by converting the original Grok-1 weights directly to a 4-bit SafeTensor format. Here's how:

1. Install Libraries:

2. Export Grok-1 Weights (if not already done):

from transformers import AutoModelForSequenceClassification  
  
# Choose the model to convert  
model_name = "google/grok-1-base"  
  
# Export the model weights  
model = AutoModelForSequenceClassification.from_pretrained(model_name)  
model.save_pretrained("grok-1-base")  

3. Convert Weights to 4-bit SafeTensor format:

from safetensor import converters  
  
# Convert the model weights to 4-bit format  
converters.convert_checkpoint("grok-1-base", "4bit", output_path="grok-1-base-4bit.safetensor")  

4. Validate Conversion (Optional):

from safetensor import utils  
  
# Validate the converted SafeTensor file (optional)  
utils.validate_safetensor("grok-1-base-4bit.safetensor")  

Explanation:

  1. We export the Grok-1 weights in the original full-precision format (float32) using Hugging Face Transformers.
  2. SafeTensor's converters.convert_checkpoint function directly quantizes the weights to a 4-bit format during the conversion process. This avoids the need for dequantization.
  3. The converted file "grok-1-base-4bit.safetensor" stores the quantized weights in a format compatible with SafeTensor for inference.

Additional Notes:

  • Conversion speed depends on model size and complexity.
  • The converted SafeTensor file can be used for inference with the SafeTensor API: [invalid URL removed]

Alternative Approach (if Grok-1 access restricted):

If you cannot access Grok-1 weights due to Google limitations, consider using other open-source large language models like:

These models might be available pre-quantized or offer conversion tools to achieve a similar 4-bit format for inference with SafeTensor.

Here's how to convert Hugging Face Transformers format to SafeTensor format:

1. Install Libraries:

  • Install Hugging Face Transformers: pip install transformers
  • Install SafeTensor: pip install safetensors

2. Export Transformers Model Weights:

from transformers import AutoModelForSequenceClassification  # Adjust based on your model type  
  
# Choose the model to convert  
model_name = "your_model_name"  # Replace with your specific model identifier  
  
# Export the model weights  
model = AutoModelForSequenceClassification.from_pretrained(model_name)  
model.save_pretrained("model_name")  

3. Convert Weights to SafeTensors:

from safetensor import converters  
  
# Specify target precision (e.g., 4-bit for memory efficiency)  
target_precision = "4bit"  # Options: "float16", "bfloat16", "int8", "4bit"  
  
# Convert the model weights  
converters.convert_checkpoint("model_name", target_precision, output_path="model_name.safetensor")  

4. Validate Conversion (Optional):

from safetensor import utils  
  
# Validate the converted SafeTensor file  
utils.validate_safetensor("model_name.safetensor")  

Explanation:

  1. We import necessary libraries: transformers for model loading and safetensors for conversion.
  2. Define the model name you want to convert.
  3. Load the model using AutoModelForSequenceClassification (adjust the class based on your model type) and save its weights using save_pretrained.
  4. Use converters.convert_checkpoint to perform the conversion. Specify the source model directory ("model_name"), target precision (e.g., 4-bit for memory efficiency), and the desired output filename ("model_name.safetensor").
  5. The validation step (optional) verifies the integrity of the converted SafeTensor file.
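
For comparison, here is a minimal sketch of the same HF-to-safetensors export using the standard Hugging Face APIs (save_pretrained with safe_serialization=True, and safetensors.torch.load_file). gpt2 is used purely as a small stand-in model, and note that the safetensors format itself only stores tensors; it does not perform 4-bit quantization.

from transformers import AutoModelForCausalLM  
from safetensors.torch import load_file  
  
# export: save_pretrained writes model.safetensors when safe_serialization is enabled  
model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model  
model.save_pretrained("gpt2-safetensors", safe_serialization=True)  
  
# reload the tensors directly from the safetensors file  
state_dict = load_file("gpt2-safetensors/model.safetensors")  
print(len(state_dict), "tensors loaded")  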

Additional Notes:

References

xai-org/grok-1#277

Now run grok-1 with less than 420 G VRAM

adds grok-1 support to llama.cpp

ggerganov/llama.cpp#6120

Grok-1 GGUF Quantizations

Unofficial 4bit Version of Grok-1 Quantised from https://huggingface.co/keyfan/grok-1-hf .

https://huggingface.co/keyfan/grok-1-hf

Unofficial dequantized weight of grok-1 in HF Transformers format.  
  
Note: If you haven't downloaded the weights yet, please use the fp32 revision instead, which uses float32 precision for the RMSNorm and Router layers for better consistency.  
  
The (fp32) weights are converted using the script here (https://gist.github.com/chu-tianxiang/ec310e15d56949fd0f351cb5f65ee7a1), run inside the grok-1 repo (https://github.com/xai-org/grok-1). Since the dequantized weights take twice as long to download, it's recommended to download the original weights and convert them yourself.  

Downgrade grok-1 to 4-bit to reduce memory usage

https://cn.bing.com/search?q=ggml-model-Q8_0.gguf&PC=U316&FORM=CHROMN

Quantized deployment with llama.cpp

LLM inference in C/C++. For Apple M-series chips, it's recommended to enable GPU inference via Metal, which significantly improves speed: just change the build command to LLAMA_METAL=1 make. If the binary was built with Metal, simply add -ngl 1 to enable GPU inference.

A fast llama2 decoder in pure Rust.

Open Assistant is a visionary project that aims to democratize access to specialized chat-based large language models. As an open-source alternative to ChatGPT, it aims to spark a revolution in language innovation. Our vision is to foster breakthrough advances in language, much as Stable Diffusion transformed the creation of art and images. By building Open Assistant, we hope to advance language capabilities and ultimately help improve society as a whole.

LeCun reposts with praise: run LLaMA on Apple M1/M2 chips! The 13-billion-parameter model needs only 4 GB of memory

PyTorch training and benchmarking on Mac GPUs

digoal's wechat