Module 'quant_cuda' has no attribute 'vecquant4matmul' #53
Try this:
I tried that and still ran into the same issue.
Are you sure you ran it exactly like I showed?
This is the wrong syntax:
Please show the output of running:
Sure, I tried that and it worked. While trying the demo:
The bug seems to be referenced here
I think the reason for the error is that you have another version of quant-cuda already installed, likely from GPTQ-for-LLaMa. You should first uninstall that existing quant-cuda package.

However, you can't use CUDA AutoGPTQ anyway. You just told auto-gptq to install without CUDA (BUILD_CUDA_EXT=0), and you need to do that because your CUDA Toolkit version doesn't match the version used to compile pytorch. If you want to fix that, you could uninstall CUDA Toolkit 12.1 and install CUDA Toolkit 11.8 instead. Or you could build pytorch from source, so it can use CUDA Toolkit 12.1; there are no pre-built pytorch binaries for CUDA 12.x yet. Instructions for doing that are on the pytorch GitHub, though it takes a while - at least an hour to build.

Or forget about CUDA and use Triton instead. Here is simple example code that uses Triton to load this GPTQ model: https://huggingface.co/TheBloke/stable-vicuna-13B-GPTQ

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/path/to/stable-vicuna-13B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

def get_config(has_desc_act):
    return BaseQuantizeConfig(
        bits=4,          # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
        desc_act=has_desc_act
    )

def get_model(model_base, triton, model_has_desc_act):
    return AutoGPTQForCausalLM.from_quantized(
        quantized_model_dir,
        use_safetensors=True,
        model_basename=model_base,
        device="cuda:0",
        use_triton=triton,
        quantize_config=get_config(model_has_desc_act)
    )

# Prevent printing spurious transformers error
logging.set_verbosity(logging.CRITICAL)

prompt = '''### Human: Write a story about llamas
### Assistant:'''

model = get_model("stable-vicuna-13B-GPTQ-4bit.compat.no-act-order", triton=True, model_has_desc_act=False)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print("### Inference:")
print(pipe(prompt)[0]['generated_text'])
```
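As a quick way to verify the toolkit/pytorch mismatch described above, here is a minimal check using standard torch attributes (the exact output depends on your install):

```python
import torch

# CUDA version this pytorch build was compiled against, e.g. "11.8"
print("torch built with CUDA:", torch.version.cuda)
# True only if a GPU and a compatible driver are visible to pytorch
print("CUDA available:", torch.cuda.is_available())
```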
It seems the repo path is wrong:
My example code assumes the model is downloaded locally. Download the model, then update quantized_model_dir. AutoGPTQ does not yet support loading a model directly from Hugging Face.
@TheBloke That's been my concern from the start. I was trying versions of Alpaca, GPT-J, Bloom, OPT, and Pegasus, but was not able to load them from Hugging Face.
@TheBloke Is there an open PR to do this?
No, no-one is looking at it yet to my knowledge. Remember that AutoGPTQ is still new and under active development; such improvements will come over time. It's easy to download the model first. If you want a quick way to do that, clone the model repo from Hugging Face.
In this example, using my example code, you would then set quantized_model_dir = "/workspace/models/TheBloke_stable-vicuna-13B-GPTQ"
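If it helps, a minimal sketch of one way to do that download programmatically, assuming a recent huggingface_hub package is installed (the repo id and destination directory simply mirror the example above):

```python
from huggingface_hub import snapshot_download

# Download all files in the model repo to a local directory.
snapshot_download(
    repo_id="TheBloke/stable-vicuna-13B-GPTQ",
    local_dir="/workspace/models/TheBloke_stable-vicuna-13B-GPTQ"
)
```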
@TheBloke Sure, will try that. While loading:
I don't think so, no. Are you trying to use the CPU only, or do you have a GPU for inference?
@TheBloke I am trying CPU only, but it does not work even with GPU only. I think it requires 28GB of GPU RAM to load?
CPU only will definitely be terribly slow. For a 13B 4-bit model like stable-vicuna-13B you need around 9GB VRAM. Nothing like 28GB - that would be for an unquantised fp16 model.
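As rough back-of-the-envelope arithmetic (weights only; real usage adds activations, cache and framework overhead), this is where the ~9GB vs ~28GB gap comes from:

```python
# Approximate weight sizes for a 13B-parameter model.
params = 13e9
fp16_gb = params * 2 / 1024**3    # 2 bytes/param   -> ~24 GB before overhead
int4_gb = params * 0.5 / 1024**3  # 0.5 bytes/param -> ~6 GB before overhead
print(f"fp16 weights ~{fp16_gb:.0f} GB, 4-bit weights ~{int4_gb:.0f} GB")
```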
Last column is used VRAM.
I have 14GB of VRAM, but it still is not able to run?
Please show the full output of running the script.
Downloading a quantized model from the HF hub is a feature that I plan to add in v0.2.0. I'm now writing the feature plan for v0.2.0 and v0.3.0; anyone interested can see it in:
After running the model I got this:
This is wrong:
It should be:
Please use the example code I provided as a base.
@TheBloke I had taken it from the Hugging Face model card. Do all your models require AutoGPTQ, or do you have support for Hugging Face transformers?
I have many models that support unquantised transformers inference, but they will need a lot more VRAM. I don't think you can load any of them in fp16 in 16GB, but you could load them in 8bit. Here are my HF models: https://huggingface.co/models?search=thebloke%20hf

A 7B fp16 model may load in 16GB VRAM; a 13B fp16 definitely will not. But either way I recommend you add:

If you are loading unquantised HF models then that is not relevant to AutoGPTQ. For further support with that, please open a Discussion on the HF model page for whichever of my models you try to use.
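For reference, a minimal sketch of 8-bit loading with transformers plus the bitsandbytes and accelerate packages (the model id here is illustrative; substitute whichever HF model you are actually using):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/stable-vicuna-13B-HF"  # illustrative model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# load_in_8bit quantises the weights via bitsandbytes at load time;
# device_map="auto" lets accelerate place layers across GPU and CPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)
```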
Sure, I have used bitsandbytes and accelerate, and have experimented with Alpaca, OPT, and GPT-J using their 8-bit versions.
Unable to load the package after:
Ran into this error: