forked from ROCm/pytorch
Add ROCm5.2/AMDGPU support for PyTorch 1.10 #1
Closed
Conversation
torch.vmap is a prototype feature and should not be in the stable binary. This PR:
- Removes the torch.vmap API
- Removes the documentation entry for torch.vmap
- Changes the vmap tests to use an internal API instead of torch.vmap

Test Plan:
- Tested locally (test_torch, test_autograd, test_type_hints, test_vmap), but also wait for CI.
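For context, vmap vectorizes a function over a leading batch dimension. A minimal pure-Python sketch of the semantics (lists stand in for tensors here; this is not PyTorch's implementation):

```python
def naive_vmap(fn):
    """Return a 'vectorized' version of fn that maps it over the leading
    batch dimension of each argument."""
    def batched(*batched_args):
        # Pair up the i-th element of every argument and apply fn per example.
        return [fn(*args) for args in zip(*batched_args)]
    return batched

dot = lambda x, y: sum(a * b for a, b in zip(x, y))
batched_dot = naive_vmap(dot)

# Two batches of 2-vectors -> two dot products.
print(batched_dot([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [17, 53]
```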
…rch#65835) Summary: Pull Request resolved: pytorch#65721. Closes: pytorch#65696. The bug was introduced in pytorch#55861 and causes a 100x slowdown since 1.9.

ghstack-source-id: 139128267

Test Plan: performance test:
```python
import time
from torch.distributed.distributed_c10d import _object_to_tensor

start = time.time()
_object_to_tensor("x" * 50_000_000)
print("Time:", time.time() - start)
```
Reviewed By: rohan-varma. Differential Revision: D31219794. fbshipit-source-id: 1abec38f9d51361c1eab6ad5efd87b589322e208. Co-authored-by: Yi Wang <wayi@fb.com>
…on for IterableWrapper (pytorch#65220) (pytorch#65924) Summary: Pull Request resolved: pytorch#65220. Fixes pytorch#65221.
- Remove deepcopy from Mapper to support file handles
- Convert `IterableWrapper` to deepcopy the wrapped iterable within each iterator, to prevent in-place modification (different data per epoch)
- Convert `IDP` to `IterableWrapper` in test_datapipe.py
- Refine variable names (avoid using `dp`, which is a module reference)

Test Plan: Imported from OSS. Reviewed By: malfet. Differential Revision: D31021886. Pulled By: ejguan. fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223
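The deepcopy-per-iterator behavior can be sketched in plain Python. `IterableWrapperSketch` is a simplified stand-in for illustration, not the actual torchdata `IterableWrapper`:

```python
import copy

class IterableWrapperSketch:
    """Yields from a deep copy of the wrapped iterable on each iteration,
    so consumers cannot mutate the source data between epochs."""
    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        # Fresh copy per epoch: in-place edits to yielded items
        # do not leak back into self.iterable.
        return iter(copy.deepcopy(self.iterable))

data = [[1], [2]]
dp = IterableWrapperSketch(data)
for item in dp:
    item.append(99)   # mutates the yielded (copied) element only
print(data)           # [[1], [2]] -- source data is unchanged
```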
…rch#65979) Summary: Pull Request resolved: pytorch#65934 see: pytorch#65931, this was a suggested remediation on the linked issue Test Plan: Imported from OSS Reviewed By: malfet, zhouzhuojie Differential Revision: D31313040 Pulled By: suo fbshipit-source-id: a9e2b82a1e879962af768ed3049c73ab77394738 Co-authored-by: Michael Suo <suo@fb.com>
Summary: Fixes pytorch#66030 Pull Request resolved: pytorch#66031 Reviewed By: VitalyFedyunin Differential Revision: D31356243 Pulled By: malfet fbshipit-source-id: d1537bc65bbba5d6497ecb8db7160a397eca81fd
…ytorch#66155) Summary: Reported by cloudhan in pytorch#64733 (comment) Fixes regression introduced by pytorch@047e682 cc malfet seemethere Pull Request resolved: pytorch#65444 Reviewed By: dagitses, seemethere Differential Revision: D31103260 Pulled By: malfet fbshipit-source-id: 9d5454a64cb8a0b96264119cf16582cc5afed284
Compare operator list against RC1 build rather than against nightly
Summary: Fixes pytorch#65988 Pull Request resolved: pytorch#66004 Reviewed By: xta0 Differential Revision: D31340893 Pulled By: malfet fbshipit-source-id: 3bf0be266e9686a73d62e86c5cf0bebeb0416260 Co-authored-by: Tao Xu <taox@fb.com>
…torch#65932)
* Unify the output pathname of archive reader and extractor (pytorch#65424)
  Summary: Pull Request resolved: pytorch#65424. This PR is a re-implementation of https://github.com/facebookexternal/torchdata/pull/93; the same PR has landed in torchdata as https://github.com/facebookexternal/torchdata/pull/157.
  Test Plan: Imported from OSS. Reviewed By: soulitzer. Differential Revision: D31090447. Pulled By: ejguan. fbshipit-source-id: 45af1ad9b24310bebfd6e010f41cff398946ba65
* [DataPipe] Add deprecation warnings for DataPipes that will solely exist in TorchData (pytorch#65827)
  Summary: Pull Request resolved: pytorch#65827. Test Plan: Imported from OSS. Reviewed By: ejguan. Differential Revision: D31272794. Pulled By: NivekT. fbshipit-source-id: 8da8266184b4df050422904cbc5fca6d7c3d2e02
* [DataPipe] Fix an issue where TarArchiveReader closes the stream when read into a buffer (pytorch#65877)
  Summary: Pull Request resolved: pytorch#65877. Fixes pytorch#65808. Test Plan: Imported from OSS. Reviewed By: ejguan. Differential Revision: D31296041. Pulled By: NivekT. fbshipit-source-id: cdcad3a333ae9781d6063678a122a128955b0ff4

Co-authored-by: Erjia Guan <erjia@fb.com>
…ytorch#65495) (pytorch#65755)
* Added option to update parameters using state_dict in AveragedModel (pytorch#65495)
  Summary: While implementing [EMA](pytorch/vision#4381, which extends AveragedModel) in torchvision, update_parameters() from AveragedModel could not be used because it did not handle state_dict(), so a custom update_parameters() had to be defined in the [EMA class](pytorch/vision#4406). This PR handles that scenario, removing the need for the custom update_parameters() implementation. Discussion: pytorch/vision#4406 (review)
  Pull Request resolved: pytorch#65495. Reviewed By: datumbox. Differential Revision: D31176742. Pulled By: prabhat00155. fbshipit-source-id: 326d14876018f21cf602bab5eaba344678dbabe2 (cherry picked from commit 2ea724b)
* Added validation of mode parameter in AveragedModel (pytorch#65921)
  Summary: Discussion: pytorch#65495 (comment)
  Pull Request resolved: pytorch#65921. Reviewed By: albanD. Differential Revision: D31310105. Pulled By: prabhat00155. fbshipit-source-id: 417691832a7c793744830c11e0ce53e3972d21a3 (cherry picked from commit c7748fc)
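The exponential-moving-average update that motivated this change can be sketched in plain Python. The dict-of-floats "state dict" and the `ema_update` helper are illustrative assumptions standing in for `AveragedModel.update_parameters()`, not the torchvision code:

```python
def ema_update(avg_state, new_state, decay=0.9):
    """Update an averaged 'state dict' in place, per parameter:
    avg <- decay * avg + (1 - decay) * new."""
    for name, value in new_state.items():
        avg_state[name] = decay * avg_state[name] + (1.0 - decay) * value

avg = {"weight": 1.0, "bias": 0.0}
ema_update(avg, {"weight": 2.0, "bias": 1.0})
print(avg)  # weight ~= 1.1, bias ~= 0.1 (up to float rounding)
```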
…dModel (pytorch#65495) (pytorch#65755)" (pytorch#66308) This reverts commit 5f1a434.
…65926) Summary: Pull Request resolved: pytorch#63646 Fixes pytorch#63609 Test Plan: Imported from OSS Reviewed By: NivekT Differential Revision: D30451774 Pulled By: ejguan fbshipit-source-id: 550d77494326446d1a42b5da0559e0d384c47413
* [ONNX] Remove argument _retain_param_name from torch.onnx.export() function. (pytorch#61702) (pytorch#64370)
  Summary: Pull Request resolved: pytorch#64370. As of now, the `_retain_param_name` parameter has no description on the PyTorch docs website. According to the code, this argument determines whether we keep the original parameter names of the PyTorch model in the final ONNX graph. If it is False, those original parameter names are replaced with a series of integers starting from 1. Since numeric parameter names are meaningless to users, we remove this argument from torch.onnx.export() to improve the experience of calling the function. The PR still accepts the argument for backward compatibility, while all backend logic now behaves as if `_retain_param_name` were True.
  Test Plan: Imported from OSS. Reviewed By: ezyang. Differential Revision: D30905270. Pulled By: malfet. fbshipit-source-id: ca60757ca17daaff937e9f08da42596086795f4a. Co-authored-by: fatcat-z <zhang-ji@outlook.com>
* [ONNX] Remove strip_doc_string param from torch.onnx.export() function. (pytorch#61712) (pytorch#64371)
  Summary: Pull Request resolved: pytorch#64371. Until now, `strip_doc_string` was described as: "strip_doc_string (bool, default True): do not include the field `doc_string` from the exported model. Otherwise the field will mention the source code locations for the model." Source-code locations are usually useless to users who just want to convert a PyTorch model to ONNX; they only help when debugging the export process. To make export() friendlier by taking fewer parameters, we folded `strip_doc_string` into the `verbose` parameter: when a user sets verbose to True, they want log information for debugging the export, which matches the purpose of `strip_doc_string`. The two arguments have opposite polarity: verbose=True means we want debug output, so `strip_doc_string` should be False, and that is how this PR replaces `strip_doc_string` with `verbose`. The PR still accepts `strip_doc_string` for backward compatibility, while its behavior is now driven by the `verbose` argument.
  Test Plan: Imported from OSS. Reviewed By: ezyang. Differential Revision: D30905268. Pulled By: malfet. fbshipit-source-id: 2f06eb805c01fe15ff7a1b4f6595c937ba716d60. Co-authored-by: fatcat-z <zhang-ji@outlook.com>
* [ONNX] Minor doc improvements and cleanup (pytorch#62514) (pytorch#64373)
  Summary: Pull Request resolved: pytorch#64373.
  * Fix some bad formatting and clarify things in onnx.rst.
  * In `export_to_pretty_string`:
    * Add documentation for previously undocumented args.
    * Document that the `f` arg is ignored and mark it deprecated.
    * Update tests to stop setting `f`.
    * Warn if `_retain_param_name` is set.
  * Use double quotes for string literals in test_operators.py.
  Test Plan: Imported from OSS. Reviewed By: ezyang. Differential Revision: D30905271. Pulled By: malfet. fbshipit-source-id: 3627eeabf40b9516c4a83cfab424ce537b36e4b3
* [ONNX] Deprecate the example_outputs param from torch.onnx.export() function. (pytorch#62815) (pytorch#64380)
  Summary: Pull Request resolved: pytorch#64380.
  * `example_outputs` was used to determine the type and shape of the outputs without tracing the execution of the model, and had to be provided when exporting a ScriptModule or ScriptFunction via export().
  * Since we can work out `example_outputs` internally instead of requiring the user to supply it, we deprecated this argument in export() to improve the experience of calling the function.
  Test Plan: Imported from OSS. Reviewed By: ezyang. Differential Revision: D30905266. Pulled By: malfet. fbshipit-source-id: d00b00d7d02b365d165028288ad915678caa51f2. Co-authored-by: hwangdeyu <dejack953@outlook.com>
* [ONNX] Deprecate use_external_data_format param from torch.onnx.export() function. (pytorch#62257) (pytorch#64382)
  Summary: Pull Request resolved: pytorch#64382.
  * The `use_external_data_format` parameter exists because large models cannot be exported within the 2GB protobuf limit.
  * When `use_external_data_format` is set to True, the model is exported in the ONNX external-data format, in which case some of the model parameters are stored in external binary files and not in the ONNX model file itself.
  * This PR marks the parameter DEPRECATED and checks the model proto size in code instead of relying on the user: if the size is larger than 2GB, then `use_external_data_format = True` is applied automatically.
  Test Plan: Imported from OSS. Reviewed By: ezyang. Differential Revision: D30905265. Pulled By: malfet. fbshipit-source-id: 82b4e17bfa6a8de2bfd700a5282c12f6835603cb. Co-authored-by: hwangdeyu <dejack953@outlook.com>
* Fix clang-tidy error introduced by pytorch#64382 (pytorch#65977)
  Summary: Pull Request resolved: pytorch#65977. Reviewed By: ngimel. Differential Revision: D31423174. Pulled By: malfet. fbshipit-source-id: 0ea560b9a6ddd6431f70bd3ac10ace68e26ab352. Co-authored-by: BowenBao <bowbao@microsoft.com>. Co-authored-by: fatcat-z <zhang-ji@outlook.com>. Co-authored-by: hwangdeyu <dejack953@outlook.com>
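The automatic fallback described above (switch to external data exactly when the serialized proto would exceed the 2GB protobuf limit) can be sketched as follows. `choose_export_format`, `serialized_size`, and the threshold handling are illustrative assumptions, not the actual exporter code:

```python
PROTO_SIZE_LIMIT = 2 * 1024 ** 3  # 2GB protobuf serialization limit

def choose_export_format(serialized_size, use_external_data_format=None):
    """Decide whether to store weights in external files.

    If the caller did not force a choice (the deprecated explicit flag),
    fall back to external data exactly when the single-file proto
    would exceed the 2GB limit.
    """
    if use_external_data_format is not None:
        return use_external_data_format   # deprecated explicit override
    return serialized_size > PROTO_SIZE_LIMIT

print(choose_export_format(10 * 1024 ** 2))   # False: 10MB fits in one file
print(choose_export_format(3 * 1024 ** 3))    # True: 3GB needs external data
```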
* Fix cosine similarity dimensionality check
* Fix shapes in the doc
…rch#66629) Summary: Fixes pytorch#66353 Fixes #{issue number} Pull Request resolved: pytorch#66433 Reviewed By: seemethere, janeyx99 Differential Revision: D31548290 Pulled By: malfet fbshipit-source-id: 3b094bc8195d0392338e0bdc6df2f39587b85bb3
…ix .tolist() for conjugated and negated tensors (pytorch#66082) (pytorch#66576) Summary: Pull Request resolved: pytorch#66082 Fixes pytorch#66024 pytorch#65779 cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved albanD Test Plan: Imported from OSS Reviewed By: Gamrix, albanD Differential Revision: D31615588 Pulled By: anjali411 fbshipit-source-id: c3e65ef0fe301630eb76732ccd7819683c09aa19
pytorch#66642)
* Disable .numpy() and .tolist() for tensor subclasses and fix .tolist() for conjugated and negated tensors (pytorch#66082)
  Summary: Pull Request resolved: pytorch#66082. Fixes pytorch#66024, pytorch#65779. cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved albanD
  Test Plan: Imported from OSS. Reviewed By: Gamrix, albanD. Differential Revision: D31615588. Pulled By: anjali411. fbshipit-source-id: c3e65ef0fe301630eb76732ccd7819683c09aa19
* Apply suggestions from code review
  Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
  Co-authored-by: Nikita Shulga <nshulga@fb.com>
* Handle shared memory cases in MathBitFallback (pytorch#63602)
  Summary: Pull Request resolved: pytorch#63602. This PR fixes the case when a read and a write are performed on memory shared between mutable and (or) non-mutable arguments. Example:
  ```python
  a = torch.tensor([1 + 1j])
  b = a.conj()
  b.add_(a)  # should return tensor([2]) but returned tensor([2-2j])
  ```
  The issue is that the conjugate fallback resolves the conjugation in place for mutable arguments, which is a problem, as shown above, when other input arguments share memory with the mutable argument(s). This PR fixes the issue by:
  1. First scanning the operator's input arguments and building a vector of mutable arguments that have the conj bit set to `True` (and accordingly setting the flag `check_for_alias_with_mut_arg` to `True` or `False`).
  2. Iterating through all the arguments, looking only at the non-mutable ones at this point. If `check_for_alias_with_mut_arg` is `True`, we check whether the current arg tensor aliases any entry in `mutable_inputs`; if it does, we clone the non-mutable tensor arg, otherwise we resolve the conjugation as before.
  3. Looking through the `mutable_inputs` vector (which contains only mutable input tensors with the conj bit set to `True`) and in-place conjugating each entry.
  4. Doing the computation.
  5. Re-conjugating the mutable argument tensors.
  NOTE: `TensorLists` are not fully handled in ConjugateFallback; please see the in-line comment for more details.
  Fixes pytorch#59943. Test Plan: Imported from OSS. Reviewed By: gmagogsfm. Differential Revision: D30466905. Pulled By: anjali411. fbshipit-source-id: 58058e5e6481da04a12d03f743c1491942a6cc9b
* Fix lint (pytorch#66572)
  Summary: Pull Request resolved: pytorch#66572. Test Plan: Imported from OSS. Reviewed By: seemethere. Differential Revision: D31624043. Pulled By: suo. fbshipit-source-id: 9db9cee3140d78c2a2f0c937be84755206fee1dd
  Co-authored-by: anjali411 <chourdiaanjali123@gmail.com>
  Co-authored-by: Michael Suo <suo@fb.com>
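The clone-on-alias logic in step 2 can be sketched in plain Python. The `FakeTensor` class, the storage-identity aliasing test, and `prepare_inputs` are illustrative assumptions standing in for the C++ fallback:

```python
class FakeTensor:
    """Minimal stand-in: a tensor is a storage id plus a conj flag."""
    def __init__(self, storage_id, conj=False):
        self.storage_id = storage_id
        self.conj = conj

    def clone(self):
        return FakeTensor(object(), self.conj)  # fresh, unshared storage

def prepare_inputs(mutable_inputs, other_inputs):
    """Clone any non-mutable input that shares storage with a conjugated
    mutable input, so resolving the conj bit in place stays safe."""
    conj_mut = [t for t in mutable_inputs if t.conj]
    prepared = []
    for t in other_inputs:
        if any(t.storage_id is m.storage_id for m in conj_mut):
            prepared.append(t.clone())   # break the alias before mutation
        else:
            prepared.append(t)
    return prepared

storage = object()
b = FakeTensor(storage, conj=True)   # mutable, conjugated view
a = FakeTensor(storage)              # non-mutable alias of b
(a2,) = prepare_inputs([b], [a])
print(a2 is a)                        # False: the aliasing input was cloned
```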
…ytorch#66662) Summary: Pull Request resolved: pytorch#66182. Closes pytorch#63174. Does a few things:
1. Adds the hostname to the error report.
2. Moves the "root cause" section to the end (presumably, since the logs are being "tailed", we want the root cause to appear at the end).
3. Moves redundant error-info logging to debug.
4. Makes the border max 60 chars in length and left-justifies the header.

NOTE: you have to annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (python exception propagation does not work out of the box for IPC, hence the extra record annotation).

Test Plan: Sample
```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar
============================================================
```
Reviewed By: cbalioglu, aivanou. Differential Revision: D31416492. fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
…ytorch#53177)"
- This reverts commit a0d1e70.
- Reverting this commit, since it is causing a regression for detectron2.
- Please check SWDEV-304968 for more info.
* Add amdgpu repos for rocm install
* Correct ROCm version
Signed-off-by: Wang, Yanyao <yanyao.wang@amd.com> Co-authored-by: Wang, Yanyao <yanyao.wang@amd.com>
* Various hipify-related fixes:
1. Fix the JIT path of building PyTorch extensions.
2. Use absolute paths for all files to allow for absolute paths in includes/ignores.
3. Limit hipification to build_dir for the non-JIT path.
4. Ignore ROCm/PyTorch headers during hipification of header_include_dirs for the JIT path.
5. Update hipify output with clearer status.
6. Don't include files ignored by hipify in the output.
7. Define HIP flags in cflags for the JIT path as well.
8. Ensure includes and ignores are passed in as absolute paths for the PyTorch build; explicitly require relative paths for certain helper functions.
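The path normalization in item 8 can be sketched as below. `normalize_patterns` and its rooting behavior are illustrative assumptions, not the actual hipify utility:

```python
import os

def normalize_patterns(patterns, root):
    """Resolve include/ignore patterns to absolute paths under `root`,
    leaving already-absolute patterns untouched."""
    out = []
    for p in patterns:
        out.append(p if os.path.isabs(p) else os.path.join(root, p))
    return out

print(normalize_patterns(["csrc/*.cu", "/opt/rocm/include/*"], "/src/ext"))
# ['/src/ext/csrc/*.cu', '/opt/rocm/include/*']
```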
… in custom extensions (ROCm#909)
…support is not in PT1.9
Summary: This reverts commit 9e8016d. cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH Pull Request resolved: pytorch#68008 Reviewed By: H-Huang Differential Revision: D32254779 Pulled By: ngimel fbshipit-source-id: 38ec415199f62a1e58000abe3e34ac91898a94ae
Co-authored-by: Wang, Yanyao <yanyao.wang@amd.com>
revert d5ca53c (pytorch#46097). The changes only affect ROCm. Reverts a work-around for a compiler performance issue that is no longer needed.

`python -m pt.cat_test --tag_filter all --device cuda`
```
Forward Execution Time (us), OLD -> NEW
  48.833 ->    8.318
  54.508 ->   23.824
  52.117 ->   14.942
  98.790 ->   74.334
 102.063 ->   76.008
 167.786 ->  123.679
  98.320 ->   67.436
  91.484 ->   59.230
 109.569 ->   76.557
 106.603 ->   87.635
 106.693 ->   88.902
 110.881 ->   94.361
 122.925 ->  123.046
 272.442 ->  271.932
 457.329 ->  456.767
 117.688 ->   87.133
 873.764 ->  865.075
1746.831 -> 1730.252
2619.303 -> 2598.717
  52.063 ->    7.904
  52.275 ->    8.118
  51.896 ->    7.938
  51.745 ->    7.922
  52.575 ->   13.299
  52.090 ->    8.015
```
Pull Request resolved: pytorch#74129. Approved by: https://github.com/ngimel
Properly import LooseVersion (pytorch#69904)
Summary: This fixes a regression introduced by pytorch#57040. Somehow importing `distutils` from `setuptools` caused an import of `distutils.version`, which is not a documented dependency and changed with the release of [setuptools-59.6.0](https://github.com/pypa/setuptools/tree/v59.6.0). We should not rely on that, as `import distutils` never re-imports `distutils.version`, which one can see by inspecting https://github.com/python/cpython/blob/3.9/Lib/distutils/__init__.py or by running:
```
% python3 -c "import distutils;print(distutils.__version__, dir(distutils))"
3.7.5 ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'sys']
% python3 -c "from setuptools import distutils;print(distutils.__version__, dir(distutils))"
3.7.5 ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'archive_util', 'ccompiler', 'cmd', 'config', 'core', 'debug', 'dep_util', 'dir_util', 'dist', 'errors', 'extension', 'fancy_getopt', 'file_util', 'filelist', 'log', 'spawn', 'sys', 'sysconfig', 'util', 'version']
```
Pull Request resolved: pytorch#69904. Reviewed By: albanD, atalman, janeyx99. Differential Revision: D33094453. Pulled By: malfet. fbshipit-source-id: aaf1adb7c6f293c4e376ccff21c64cd6ba625e97
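The underlying pitfall is that importing a package does not automatically import its submodules; an explicit import is the robust fix. A minimal illustration with the stdlib `json.tool` submodule (using importlib rather than distutils, which is deprecated in newer Pythons):

```python
import importlib
import types

import json
# Typically False in a fresh interpreter: `import json` does not
# import the `json.tool` submodule as a side effect.
print(hasattr(json, "tool"))

tool = importlib.import_module("json.tool")  # explicit submodule import
print(hasattr(json, "tool"))                 # True: now bound on the package
print(isinstance(tool, types.ModuleType))    # True
```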
Use ncclAllToAll for ROCm version > 5.0; see ROCm/rccl#503 for details on ncclAllToAll. cc jithunnair-amd amathews-amd. Pull Request resolved: pytorch#75128. Approved by: https://github.com/wenkaidu, https://github.com/yzygitzh, https://github.com/seemethere
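The version gate can be sketched as follows. `parse_rocm_version`, `alltoall_backend`, and the backend names are illustrative assumptions, not the actual c10d dispatch code:

```python
def parse_rocm_version(version_str):
    """Parse 'major.minor[.patch]' into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in version_str.split(".")[:2])

def alltoall_backend(rocm_version):
    """Pick the fused ncclAllToAll collective on ROCm newer than 5.0;
    otherwise fall back to pairwise send/recv."""
    if parse_rocm_version(rocm_version) > (5, 0):
        return "ncclAllToAll"
    return "pairwise_send_recv"

print(alltoall_backend("5.1.3"))  # ncclAllToAll
print(alltoall_backend("4.5.2"))  # pairwise_send_recv
```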
…evert_ncclAllToAll Deactivate ncclAllToAll
Add ROCm5.1.3/AMDGPU support
Fixes nightly libtorch builds. As of ROCm 5.1.x, all *.cmake files are under /opt/rocm/lib/cmake/<package> instead of /opt/rocm/<package>/lib/cmake. Pull Request resolved: pytorch#77087. Approved by: https://github.com/seemethere
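A build script can probe both layouts when locating a package's CMake config. The helper below is an illustrative sketch under that assumption, not the actual PyTorch CMake logic:

```python
import os

def cmake_config_candidates(rocm_root, package):
    """Candidate directories for a package's *.cmake files, newest
    layout first (ROCm 5.1.x+), then the pre-5.1 layout."""
    return [
        os.path.join(rocm_root, "lib", "cmake", package),   # ROCm 5.1.x+
        os.path.join(rocm_root, package, "lib", "cmake"),   # older ROCm
    ]

print(cmake_config_candidates("/opt/rocm", "hip"))
# ['/opt/rocm/lib/cmake/hip', '/opt/rocm/hip/lib/cmake']
```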
test
WBobby pushed a commit that referenced this pull request on Aug 18, 2022
…78136) (pytorch#78204) This prevents `import torch` from accidentally crashing on machines with no Metal devices. Should prevent crashes reported in pytorch#77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:
```
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble
...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi    ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)         ; (void *)0x00007fff7202be40: objc_msgSend
```
which corresponds to the `[m_device maxBufferLength]` call, where `m_device` is not initialized in https://github.com/pytorch/pytorch/blob/2ae3c59e4bcb8e6e75b4a942cacc2d338c88e609/aten/src/ATen/mps/MPSAllocator.h#L171

Pull Request resolved: pytorch#78136. Approved by: https://github.com/seemethere. Co-authored-by: Nikita Shulga <nshulga@fb.com>
WBobby pushed a commit that referenced this pull request on Jan 3, 2023
…78136) This prevents `import torch` from accidentally crashing on machines with no Metal devices. Should prevent crashes reported in pytorch#77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:
```
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble
...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi    ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)         ; (void *)0x00007fff7202be40: objc_msgSend
```
which corresponds to the `[m_device maxBufferLength]` call, where `m_device` is not initialized in https://github.com/pytorch/pytorch/blob/2ae3c59e4bcb8e6e75b4a942cacc2d338c88e609/aten/src/ATen/mps/MPSAllocator.h#L171

Pull Request resolved: pytorch#78136. Approved by: https://github.com/seemethere
WBobby pushed a commit that referenced this pull request on Jan 3, 2023
… of libtorch_python (pytorch#78028) Summary: This moves torch::class_<WorkerInfo> into `rpc_agent.cpp` so it gets registered in libtorch instead of libtorch_python. This is intermediate work toward getting torch::deploy to load an unmodified copy of libtorch; current RPC is incompatible due to duplicate registrations.
```
unknown file: Failure
C++ exception with description "Exception Caught inside torch::deploy embedded library:
Custom class with name __torch__.torch.classes.dist_rpc.WorkerInfo is already registered. Ensure that registration with torch::class_ is only called once.
Exception raised from registerCustomClass at ../aten/src/ATen/core/custom_class.cpp:61 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f3bd9adb92e in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7f3bd9ab7068 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: torch::registerCustomClass(std::shared_ptr<c10::ClassType>) + 0x110 (0x7f3bc2258980 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::detail::class_base::class_base(std::string const&, std::string const&, std::string, std::type_info const&, std::type_info const&) + 0x3b9 (0x7f3bc225a419 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: [0x7f3ba45cfea1]
frame #5: <unknown function> + 0x1b5334 (0x5652bdab9334 in ./test_deploy)
frame #6: <unknown function> + 0x1b4f3e (0x5652bdab8f3e in ./test_deploy)
frame #7: <unknown function> + 0x1b519b (0x5652bdab919b in ./test_deploy)
frame #8: loadSearchFile(char const*) + 0x23e (0x7f3ba62f37f8 in /tmp/torch_deploy9ATEFg)
frame #9: deploy_set_self + 0x51 (0x7f3ba62f38f9 in /tmp/torch_deploy9ATEFg)
frame #10: torch::deploy::Interpreter::Interpreter(torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>) + 0x274 (0x5652bdaaa790 in ./test_deploy)
frame #11: void __gnu_cxx::new_allocator<torch::deploy::Interpreter>::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x81 (0x5652bdaaf58b in ./test_deploy)
frame #12: void std::allocator_traits<std::allocator<torch::deploy::Interpreter> >::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(std::allocator<torch::deploy::Interpreter>&, torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x4a (0x5652bdaae320 in ./test_deploy)
frame #13: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::_M_realloc_insert<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(__gnu_cxx::__normal_iterator<torch::deploy::Interpreter*, std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> > >, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xee (0x5652bdaae4a0 in ./test_deploy)
frame #14: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::emplace_back<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xb6 (0x5652bdaad258 in ./test_deploy)
frame #15: torch::deploy::InterpreterManager::InterpreterManager(unsigned long, std::shared_ptr<torch::deploy::Environment>) + 0x123 (0x5652bdaa83b1 in ./test_deploy)
frame #16: TorchpyTest_InitTwice_Test::TestBody() + 0x65 (0x5652bda075a9 in ./test_deploy)
frame #17: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x65 (0x5652bda944b7 in ./test_deploy)
frame #18: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x5a (0x5652bda8cfe7 in ./test_deploy)
frame #19: testing::Test::Run() + 0x100 (0x5652bda68622 in ./test_deploy)
frame #20: testing::TestInfo::Run() + 0x10f (0x5652bda68fb3 in ./test_deploy)
frame #21: testing::TestSuite::Run() + 0x121 (0x5652bda6980d in ./test_deploy)
frame #22: testing::internal::UnitTestImpl::RunAllTests() + 0x38e (0x5652bda756e6 in ./test_deploy)
frame #23: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x65 (0x5652bda9586b in ./test_deploy)
frame #24: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x5a (0x5652bda8e0f7 in ./test_deploy)
frame #25: testing::UnitTest::Run() + 0xc9 (0x5652bda73fd1 in ./test_deploy)
frame #26: RUN_ALL_TESTS() + 0x11 (0x5652bda169fa in ./test_deploy)
frame #27: main + 0x27 (0x5652bda10ce2 in ./test_deploy)
frame #28: <unknown function> + 0x2d310 (0x7f3bc0431310 in /usr/lib/libc.so.6)
frame #29: __libc_start_main + 0x81 (0x7f3bc04313c1 in /usr/lib/libc.so.6)
frame #30: _start + 0x25 (0x5652bda063b5 in ./test_deploy)
```
Test Plan: CI. Differential Revision: D36564258. Pull Request resolved: pytorch#78028. Approved by: https://github.com/rohan-varma
WBobby pushed a commit that referenced this pull request on Jan 3, 2023
…ytorch#78276) Fixes ROCm#325 **Summary**: Currently, the pytorchbot only allows for rebasing to the master branch. These modifications add functionality for rebasing to the 'viable/strict' branch of pytorch/pytorch by adding a flag to the comment. **Test Plan:** tested manually on personal fork ([#1](swang392#1)), and included a test case in test_tryrebase.py that checks if rebasing to viable/strict branch was successful. Pull Request resolved: pytorch#78276 Approved by: https://github.com/clee2000, https://github.com/janeyx99
WBobby pushed a commit that referenced this pull request on Jan 3, 2023
… to conform with non-quantized counterpart filenames

Summary: Names of analogous files in the quantized directory (previously snake case) were inconsistent with their non-quantized filename counterparts (pascal case). This is the first of a series of PRs that changes all files in the quantized dir (and its sub-directories) to pascal case.

`aten/src/ATen/native/quantized/qconv_unpack.cpp` has not been renamed yet because, for reasons currently unknown, `import torch` produces the error below after the name change (renaming `qlinear_unpack.cpp` also seems to fail some Phabricator CI tests for similar reasons). We suspect that these may be undefined errors and will revisit naming these files in a future PR.

```
terminate called after throwing an instance of 'c10::Error'
what(): Type c10::intrusive_ptr<ConvPackedParamsBase<2> > could not be converted to any of the known types.
Exception raised from operator() at ../aten/src/ATen/core/jit_type.h:1735 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f26745c0c65 in /data/users/dzdang/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7f26745bdcd1 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x1494e24 (0x7f2663b14e24 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xfed0bc (0x7f266366d0bc in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #4: c10::detail::infer_schema::make_function_schema(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x5a (0x7f266366d71a in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #5: c10::detail::infer_schema::make_function_schema(c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x7b (0x7f266366e06b in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1493f32 (0x7f2663b13f32 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe227dd (0x7f26634a27dd in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x14e0a (0x7f268c934e0a in /lib64/ld-linux-x86-64.so.2)
..........................truncated.............
```

Test Plan:
```
python test/test_quantization.py
```

Pull Request resolved: pytorch#77037
Approved by: https://github.com/jerryzh168
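The snake-case-to-pascal-case rename this commit describes can be sketched with a small helper. The helper itself is an illustrative assumption (the actual PR renamed files directly, not via a script); the sample filenames are taken from the commit message.

```python
def snake_to_pascal(filename: str) -> str:
    """Convert a snake_case filename to PascalCase, preserving the extension.

    Illustrative sketch only, e.g. qconv_unpack.cpp -> QconvUnpack.cpp.
    """
    stem, dot, ext = filename.rpartition(".")
    # Capitalize each underscore-separated part of the stem and join them.
    return "".join(part.capitalize() for part in stem.split("_")) + dot + ext

print(snake_to_pascal("qconv_unpack.cpp"))    # QconvUnpack.cpp
print(snake_to_pascal("qlinear_unpack.cpp"))  # QlinearUnpack.cpp
```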
- Tweak `file_diff_from_base` for release/1.10 branch (pytorch/pytorch#66202)
- Call `PyArray_Check` only if NumPy is available (pytorch/pytorch#66433) (pytorch/pytorch#66629)
- Replace `internal::GRAIN_SIZE` by `grain_size` (parameter). (pytorch/pytorch#53177)