merge upstream 20221023 (microsoft#108)
* Fix the layer-past for GPT based models (microsoft#2196)

* Add gradient_average flag support for sparse grads (microsoft#2188)

* Add gradient_average flag support for sparse grads

* formatting fixes

* Add tests

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Adding additional instructions in the compression tutorial on pre-training distillation and quantization for GPT (microsoft#2197)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Log user config exactly (microsoft#2201)

* Fix the tensor-slicing copy for qkv parameters (microsoft#2198)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor Distributed Tests (microsoft#2180)

Refactor Distributed unit tests

* fix table syntax (microsoft#2204)

Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Correctly detect offload configuration (microsoft#2208)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add cuda 11.7 (microsoft#2211)

* add cuda 11.7

* formatting

* use torch 1.9 (microsoft#2215)

* [zero-3] print warning once and support torch parameter (microsoft#2127)

* print warning only once.

* add support for torch param and only warn on gpu 0

* remove type checking. will be done on a new PR with more tests.

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Add support of OPT models (microsoft#2205)

* add opt replace policy

* simplify inf. api

* fix opt replace policy

* fix use-cache & add relu

* Add support of custom MLP act. function

* Revert "simplify inf. api"

This reverts commit 9e910fc.

* fix the inference API (temp. solution)

* fix code formatting

* add unit tests for OPT models.

* refactor pre-attention layer norm configuration

* add support of opt-350m model

* refactor the HF model config initialization

* fix hf model config issue

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* fix typos in readme. (microsoft#2218)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [device abstraction] add device abstraction to allow devices other than CUDA to be used

* Fix regression w. dist_init_required (microsoft#2225)

* add doc for new bert example (microsoft#2224)

* Remove the random-generator from context during inference (microsoft#2228)

* Fix the tensor-slicing copy for qkv parameters

* remove the random-generator from context during inference

* formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* allow saving ckpt w/o ckpt json + bloom copy fix (microsoft#2237)

* Correctly detect zero_offload (microsoft#2213)

* Correctly detect offload configuration

* Correctly detect offload configuration

* Handle deprecated cpu offload setting

* Correctly detect zero_offload setting

* Minor tweak

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* update videos (microsoft#2249)

* Refactor dist tests: Checkpointing (microsoft#2202)

Refactor distributed tests: checkpointing

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Make OPT policy backward compatible with pre-OPT transformers versions (microsoft#2254)

* fix ds-inference without policy (microsoft#2247)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.7.2

* Enable contiguous gradients with Z1+MoE (microsoft#2250)

MoE training with ZeRO stage 1 only works with `contiguous_gradients=True`.
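
For context, a minimal sketch of a ZeRO stage 1 config that satisfies this constraint (values outside the `zero_optimization` block are illustrative placeholders, not taken from the PR):

```python
# Illustrative DeepSpeed config dict for ZeRO stage 1 + MoE training.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,
        "contiguous_gradients": True,  # must remain True when training MoE models
    },
}
```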

* [rebase-202208] additional changes needed when rebasing to 202208

* [rebase] cleanup direct cuda usage after merge

* Correctly detect CPU optimizer usage (microsoft#2257)

* Correctly detect CPU optimizer usage

* Update nv-transformers-v100.yml (microsoft#2259)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [precommit] fix pre-commit issues

* Update half precision header guards (microsoft#2261)

* fix microsoft#2240: wrong time unit in flops_profiler (microsoft#2241)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.7.3

* Add blob storage to CI runners (microsoft#2260)

Add blob storage to CI runners and enable for transformers cache on inference tests

* Update replace_module.py, test-gptj.py related fix (microsoft#2269)

Fix RuntimeError: Boolean value of Tensor with more than one value is ambiguous when running test-gptj.py
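
For readers unfamiliar with this error, a small self-contained illustration of the ambiguity and the usual way to resolve it (a generic sketch, not the actual replace_module.py code):

```python
import torch

scores = torch.tensor([0.1, 0.9])

# Truth-testing a tensor with more than one element raises:
# RuntimeError: Boolean value of Tensor with more than one value is ambiguous
try:
    if scores:
        print("unreachable")
except RuntimeError as err:
    print(err)

# The intent must be made explicit with .any() or .all():
if (scores > 0.5).any():
    print("at least one score above the threshold")
```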

* Fix OrderedDict import for python3.6 (microsoft#2267)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Ds inference/fix mp2 (microsoft#2270)

* Trajepl: nebula load fix (microsoft#2182)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: chenguo <chenguo@microsoft.com>

* prevent torch ext folder mkdir at tmp (microsoft#2274)

* Ds-inference Int8 support through ZeroQuant technology (microsoft#2217)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add a new unit test for cuda ops (microsoft#2278)

Co-authored-by: cmikeh2 <connorholmes@microsoft.com>

* Add to codeowners file (microsoft#2279)

* [pin_memory] make pin_memory select device type

* Memory Access Utility (microsoft#2276)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* Fp32 accuracy bug fix (microsoft#2285)

Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com>

* Refactor universal checkpointing and tensor fragments (microsoft#2253)

* Refactor universal checkpointing and tensor fragments

* Formatting

* [ds-inference] fix progress bar (microsoft#2286)

when loading the non-sharded checkpoint, update the progress bar (fix by @RezaYazdaniAminabadi) - I've tested it and it works.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Offload all gradients to nvme (microsoft#2282)

* fused bias relu unittest (microsoft#2297)

* fix for pytest picking up local deepspeed dir instead of installed deepspeed (microsoft#2299)

* Fix for Zero3 when MP>1 and at least one batch param undefined (microsoft#2289)

Co-authored-by: anthony.301 <anthony.301@mri.cluster>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [downstream] merge from xpu support downstream

* Unit test for bias add kernel (microsoft#2298)

* added unit test

* Update pt_binding.cpp

* formatting

* Update test_bias_add.py

* Update relu.cu with mem_access_utils (microsoft#2306)

* Add tensor parallel inference unit tests (microsoft#2232)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>

* Fix the residual add mp scaling for  GPTNeoX (microsoft#2310)

* Add unit tests for residual_add kernels (microsoft#2307)

* add inference eval scripts (microsoft#2303)

* Upgrade P40 tests to torch 1.8 (microsoft#2316)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO-Inference blog (microsoft#2271)

* ZeRO-Inference blog

* ZeRO-Inference blog

* Format fixes

* Apply feedback

* Feedback

* Update docs/_posts/2022-08-27-zero-inference.md

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update docs/_posts/2022-08-27-zero-inference.md

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Address feedback

* Format fixes

* More tweaks

* long sequence, nvme offload

* Add image

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO-Inference blog - wrap up  (microsoft#2321)

* ZeRO-Inference blog - Update README (microsoft#2322)

* refactor to use mem_access (microsoft#2317)

* add quant unit test (microsoft#2315)

* add quant unit test

* add codeowner

* format fix

* fix undefined symbol: curandSetPseudoRandomGeneratorSeed

* modify ref fn name and add comment

* add comments

* add 4bit quant 16groups

* fix

* modify groups in ref code

* parameterize tensor shape

* single param

* detach tensor

* remove -lcurand flag

* add back -lcurand flag

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* only override forward if using cuda-graph (microsoft#2291)

* Add more options to inference benchmark (microsoft#2325)

* bump to 0.7.4

* MoE residual matmul unit test (microsoft#2323)

MoE residual matmul unit tests

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* [device] port cuda device to literal_device() in new tests

* MoE matmul with mem_access (microsoft#2336)

* Fix formatting

* Remove redundant variable

* Refactor residual add kernels (microsoft#2333)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* [accel_runtime] add pin_memory to accelerator runtime interface.

* mem access for quantize kernel (microsoft#2331)

* mem access for quantize kernel

* format

* format fp32

* modify quant kernel

* modify quant kernel2

* modify format

* format

* fix comments in pytest

* fix comments in pytest

* format

* rerun

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>

* increase min pre-commit versions (microsoft#2346)

* Extend scratch buffer for long prompts (microsoft#2212)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix zero docs (microsoft#2350)

* Inference profiling updates/fixes (microsoft#2348) (microsoft#2349)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Kernel Data Conversion Utility (microsoft#2327)

* Unify macro definitions and constants in a single file

* Conversion utility implementation.

* Fix reversion from formatting

* Bugfixes after testing with correct DeepSpeed

* Inline markers are available on both HIP + CUDA

* Add Onebit Optimizers in __init__ (microsoft#2340)

Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [accelerator abstraction] merge from microsoft#2320

* docs(mixture-of-experts-inference): fix typo in tutorial (microsoft#2345)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* download cifar to blob storage (microsoft#2342)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor gptj_residual_add kernels for better readability (microsoft#2358)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* Updated issue templates (microsoft#2363)

* Update issue templates

* fix cuda invalid config error in dequant kernel (microsoft#2362)

* format

* remove round fn

* Add missing pytest fixture scope (microsoft#2353)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Extend residual_add kernel tests to cover pre_attn_norm (microsoft#2354)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Refactor fused_bias_residual kernels for better readability (microsoft#2356)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Capture error message during sweep tests (microsoft#2351)

* Collect error messages in results.csv

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix an exception when recursively casting dicts to fp16 (microsoft#2370)
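
A sketch of the general shape of such a recursive cast (hypothetical helper, not the exact DeepSpeed code); the subtle part is passing through non-floating tensors and non-tensor leaves instead of calling `.half()` on everything:

```python
import torch

def cast_to_fp16(value):
    # Recursively walk dicts/lists/tuples, casting floating-point tensors
    # to fp16 and leaving ints, bools, masks, and other leaves untouched.
    if torch.is_tensor(value):
        return value.half() if value.is_floating_point() else value
    if isinstance(value, dict):
        return {k: cast_to_fp16(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(cast_to_fp16(v) for v in value)
    return value

batch = {"input_ids": torch.ones(2, 4, dtype=torch.long),
         "pixel_values": torch.rand(2, 3, 8, 8)}
batch = cast_to_fp16(batch)  # only pixel_values is cast to fp16
```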

* Refactor remaining distributed tests (microsoft#2216)

* batch of refactored tests

* more test refactoring

* fp16 test refactor

* more refactors

* added DistributedFixture class

* applied DistributedFixture to first batch of tests as a trial

* added DistributedFixture test and documentation

* last tests

* fixes for refactored tests

* remove subdirs in workflow files

* fix pytest syntax error

* fix another syntax error

* update imports

* use DistFixture with elastic checkpoint test

* missing import

* update to shared class tmpdir for elastic test

* moved test files

* avoid duplicate test file name

* last refactor and moving test files

* formatting

* fix broken import

* testing forked AMD tests

* update abstract method

* use blob storage for accelerate and transformers tests

* upgrade torch for accelerate CI

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix the MLP output tensor's shape (microsoft#2380)

* allow building with latest CUDA (11.8), it is backwards compatible (microsoft#2390)

* pin transformers version for unit tests (microsoft#2402)

* Change type to tuple in replace_wo_policy isinstance check (microsoft#2387)

Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type.

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
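
A toy illustration of the corrected type check (simplified; the real logic lives in `replace_wo_policy`):

```python
def accepts_layers(layers):
    # Layer names arrive as a tuple of module-name strings (or one string),
    # so an isinstance(layers, dict) check never matched; use (tuple, str).
    return isinstance(layers, (tuple, str))

assert accepts_layers(("attention.out_proj", "mlp.fc2"))
assert accepts_layers("mlp.fc2")
assert not accepts_layers({"attention.out_proj": None})
```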

* Checkpoint backwards-compatibility workaround (microsoft#2384)

* Add predicated global load (microsoft#2373)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* change call site of literal_device, on_accel_device and accel_runtime to get_accelerator() call

* add new interface definition from olruwase/accelerator_abstraction

* MII blog post (microsoft#2418)

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* Fix figure reference (microsoft#2419)

* [docs] update news items

* [docs] add mii repo link

* Add SLURM Multinode Runner (microsoft#2404)

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix issue with corrupted output on long generation for GPT (microsoft#2359)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* MII blog title update on Readme

* DeepSpeed-MII title change in website

* Fix GPT Neo-X multi-gpu inference (microsoft#2401)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* MII-Public and MII-Azure subheading in mii post

* CI fixes related to triton (microsoft#2422)

* [docs] update mii blog title (microsoft#2423)

* add SD injection policy (microsoft#2381)

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* [accelerator abstraction] remove name() from interface; device_name() should be used instead.

* merge with master (ec13da6)

* fix checkpoint loading when it is a dictionary (microsoft#2425)

* Make error regex more generic in collect_results.py (microsoft#2415)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fixes microsoft#2389 (microsoft#2411)

truncating expert param storage for checkpointing

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Fix for inference gpt-j test (microsoft#2430)

* fix for gpt-j failing due to tokenizer error

* limit number of gpt-j tokens generated due to low memory

* Fixing bug 2361 (microsoft#2410)

* fixing bug 2361

* adding pytest for config initialization

* changing expected output to FusedAdam

* remove print statement

* running yapf on modified files

* running pre-commit formatting

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Universal checkpoint for zero stage 1 (microsoft#2284)

* Refactor universal checkpointing and tensor fragments

* Formatting

* Support zero stage1; Expand TP dim

* Remove debug prints

* Detect sharded optimizer state

* Format fixes

* Encode reshaping guide

* More symbolic constants

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* only add deps if extra is explicitly called (microsoft#2432)

* Add TestInjectionPolicy inference unittest class for testing custom injection policies (microsoft#2426)

This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies.

This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API.

The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified.

This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see microsoft#2387).

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
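
For reference, explicitly specifying an injection policy follows the pattern from DeepSpeed's inference tutorial; a sketch along those lines (the T5 module names match that tutorial, other values are illustrative):

```python
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.models.t5.modeling_t5 import T5Block

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-small")

# Map the transformer block class to the names of its output linear layers
# so DeepSpeed knows where to place tensor-parallel all-reduces.
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float,
    injection_policy={
        T5Block: ("SelfAttention.o", "EncDecAttention.o", "DenseReluDense.wo")
    },
)
```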

* [memory estimators] new config args sync (microsoft#2431)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* parallelize writing of layer checkpoint files across data parallel instances (microsoft#1419)

* parallelize layer checkpoints across data parallel groups

* use partition_uniform to determine start/end index values

* formatting fix

* config: add option for parallel write of layer checkpoints in pipeline stage

* yapf fixes

* enable parallel layer write according to config param

* avoid extraneous makedir when rank 0 writes all layers

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
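
A rough sketch of how uniform partitioning assigns layer files to data-parallel ranks (a hypothetical stand-in for the `partition_uniform` helper named above):

```python
def partition_uniform(num_items, num_parts):
    # Split num_items into num_parts contiguous chunks, spreading any
    # remainder across the first chunks; returns num_parts + 1 boundaries.
    bounds = [0] * (num_parts + 1)
    chunk, rem = divmod(num_items, num_parts)
    for p in range(num_parts):
        bounds[p + 1] = bounds[p] + chunk + (1 if p < rem else 0)
    return bounds

num_layers, dp_world_size, dp_rank = 24, 4, 1
bounds = partition_uniform(num_layers, dp_world_size)
start, end = bounds[dp_rank], bounds[dp_rank + 1]
# This rank writes only layer files [start, end) instead of rank 0 writing all.
print(f"rank {dp_rank} saves layers {start}..{end - 1}")
```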

* Fix broken link to DeepSpeed Megatron fork (microsoft#2440)

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>

* bump to 0.7.5

* [OpBuilder] Add op builder abstraction

* convert op builder usage in merged code

* merge diff files from upstream

* [OpBuilder] add create_op_builder interface in abstract_accelerator.py

* remove files that is deleted from upstream

* [OpBuilder] add left over op builder usage in tests

* [OpBuilder] fix op builder usage in tests

* [OpBuilder] fix <op builder>.NAME usage in tests to follow op builder abstraction design

* import get_accelerator from deepspeed.accelerator directly

* [OpBuilder] remove unused function and sync with main

* add missing import

* revert changes in device.py to avoid conflict with main

* fix alexnet_model to use /tmp instead of /blob

* Mingzhi/solve pr108 b (microsoft#115)

* move ALL_OPs from __init__.py to all_Op.py to solve circular import

* delete deepspeedexamples

* fix import

* fix regression (microsoft#117)

* fix pin_memory

* fix regression

* fix error

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Mikhail Druzhinin <dipetm@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Kamal Raj <kamalraj97@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Zhihong Chen <gdst_czh@163.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: 叶志晟 <yzs981130@126.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: trajep <trajepl@gmail.com>
Co-authored-by: chenguo <chenguo@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: anthony.301 <anthony.301@mri.cluster>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Saeyeol Lee <78332687+l4d2boomer@users.noreply.github.com>
Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai>
Co-authored-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org>
Co-authored-by: Matt Smith <matt@mjksmith.com>
Co-authored-by: Thomas-MMJ <112830596+Thomas-MMJ@users.noreply.github.com>
Co-authored-by: lekurile <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Andrey Chernykh <andrew.chernyh@gmail.com>
Co-authored-by: Alexander Jipa <alexander.jipa@gmail.com>
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Adam Moody <moody20@llnl.gov>
Co-authored-by: AGUL <mingzhi.liu@intel.com>
Showing 220 changed files with 6,778 additions and 6,542 deletions.
43 changes: 43 additions & 0 deletions .github/ISSUE_TEMPLATE/compression_bug_report.md
@@ -0,0 +1,43 @@
---
name: Bug report (compression)
about: Create a DeepSpeed compression related issue to help us improve
title: "[BUG]"
labels: bug,compression
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**ds_report output**
Please run `ds_report` to give us details about your setup.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**System info (please complete the following information):**
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. two machines with x8 A100s each]
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version
- Any other relevant info about your setup

**Launcher context**
Are you launching your experiment with the `deepspeed` launcher, MPI, or something else?

**Docker context**
Are you using a specific docker image that you can share?

**Additional context**
Add any other context about the problem here.
41 changes: 41 additions & 0 deletions .github/ISSUE_TEMPLATE/inference_bug_report.md
@@ -0,0 +1,41 @@
---
name: Bug report (inference)
about: Create a DeepSpeed inference related issue to help us improve
title: "[BUG]"
labels: bug,inference
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Simple inference script to reproduce
2. What packages are required and their versions
3. How to run the script
4. ...

**Expected behavior**
A clear and concise description of what you expected to happen.

**ds_report output**
Please run `ds_report` to give us details about your setup.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**System info (please complete the following information):**
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. two machines with x8 A100s each]
- (if applicable) what [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) version are you using
- (if applicable) Hugging Face Transformers/Accelerate/etc. versions
- Python version
- Any other relevant info about your setup

**Docker context**
Are you using a specific docker image that you can share?

**Additional context**
Add any other context about the problem here.
@@ -1,8 +1,8 @@
---
-name: Bug report
-about: Create a report to help us improve
+name: Bug report (training)
+about: Create a DeepSpeed training related issue to help us improve
title: "[BUG]"
-labels: bug
+labels: bug,training
assignees: ''

---
6 changes: 3 additions & 3 deletions .github/workflows/amd.yml
@@ -35,7 +35,7 @@ jobs:
which hipcc
hipcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -67,5 +67,5 @@ jobs:
run: |
if [[ -d ./torch-extensions ]]; then rm -rf ./torch-extensions; fi
cd tests
-TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --verbose unit/{autotuning,checkpoint,comm,compression,elasticity,inference,launcher,monitor,ops,profiling,runtime,utils}
-#TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --verbose -m 'sequential' unit/{autotuning,checkpoint,comm,compression,elasticity,inference,launcher,monitor,ops,profiling,runtime,utils}
+TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked -n 4 --verbose unit/
+TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked --verbose -m 'sequential' unit/
6 changes: 3 additions & 3 deletions .github/workflows/nv-accelerate-v100.yml
@@ -31,8 +31,8 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
-pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
+pip uninstall --yes torch torchvision triton
+pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu111
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -58,4 +58,4 @@
# tmp fix: force newer datasets version
pip install "datasets>=2.0.0"
pip list
-TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --verbose tests/deepspeed
+HF_DATASETS_CACHE=/blob/datasets_cache/ TRANSFORMERS_CACHE=/blob/transformers_cache/ TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --verbose tests/deepspeed
4 changes: 2 additions & 2 deletions .github/workflows/nv-inference.yml
@@ -31,7 +31,7 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -51,7 +51,7 @@ jobs:
- name: Install deepspeed
run: |
pip uninstall --yes deepspeed
-pip install .[dev,1bit,autotuning,sparse_attn,inf]
+pip install .[dev,1bit,autotuning,inf]
ds_report
- name: Unit tests
4 changes: 2 additions & 2 deletions .github/workflows/nv-nightly.yml
@@ -24,7 +24,7 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -42,7 +42,7 @@ jobs:
- name: Install deepspeed
run: |
pip uninstall --yes deepspeed
-pip install .[dev,1bit,autotuning,sparse_attn,inf]
+pip install .[dev,1bit,autotuning,inf]
ds_report
- name: Unit tests
8 changes: 4 additions & 4 deletions .github/workflows/nv-torch-latest-v100.yml
@@ -31,7 +31,7 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -53,13 +53,13 @@ jobs:
- name: Install deepspeed
run: |
pip uninstall --yes deepspeed
-pip install .[dev,1bit,autotuning,sparse_attn]
+pip install .[dev,1bit,autotuning]
ds_report
- name: Unit tests
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
if [[ -d ./torch-extensions ]]; then rm -rf ./torch-extensions; fi
cd tests
-TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked --verbose -n 4 unit/{autotuning,checkpoint,comm,compression,elasticity,inference,launcher,monitor,ops,profiling,runtime,utils} --torch_ver="1.12" --cuda_ver="11.3"
-TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked --verbose -m 'sequential' unit/{autotuning,checkpoint,comm,compression,elasticity,inference,launcher,monitor,ops,profiling,runtime,utils} --torch_ver="1.12" --cuda_ver="11.3"
+TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked --verbose -n 4 unit/ --torch_ver="1.12" --cuda_ver="11.3"
+TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked --verbose -m 'sequential' unit/ --torch_ver="1.12" --cuda_ver="11.3"
4 changes: 2 additions & 2 deletions .github/workflows/nv-torch-nightly-v100.yml
@@ -24,7 +24,7 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cu113
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -46,7 +46,7 @@ jobs:
- name: Install deepspeed
run: |
pip uninstall --yes deepspeed
-pip install .[dev,1bit,autotuning,sparse_attn]
+pip install .[dev,1bit,autotuning]
ds_report
- name: Unit tests
4 changes: 2 additions & 2 deletions .github/workflows/nv-torch18-p40.yml
@@ -31,7 +31,7 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install torch==1.8.2 torchvision==0.9.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu101
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -53,7 +53,7 @@ jobs:
- name: Install deepspeed
run: |
pip uninstall --yes deepspeed
-pip install .[dev,1bit,autotuning,sparse_attn]
+pip install .[dev,1bit,autotuning]
ds_report
- name: Unit tests
4 changes: 2 additions & 2 deletions .github/workflows/nv-torch18-v100.yml
@@ -31,7 +31,7 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -53,7 +53,7 @@ jobs:
- name: Install deepspeed
run: |
pip uninstall --yes deepspeed
-pip install .[dev,1bit,autotuning,sparse_attn]
+pip install .[dev,1bit,autotuning]
ds_report
- name: Unit tests
6 changes: 3 additions & 3 deletions .github/workflows/nv-transformers-v100.yml
@@ -31,7 +31,7 @@ jobs:
which nvcc
nvcc --version
pip install --upgrade pip
-pip uninstall --yes torch torchvision
+pip uninstall --yes torch torchvision triton
pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
@@ -54,7 +54,7 @@ jobs:
git clone https://github.com/huggingface/transformers
cd transformers
# if needed switch to the last known good SHA until transformers@master is fixed
-# git checkout 1cc453d33
+git checkout 6268694e2
git rev-parse --short HEAD
# scipy/sklearn required for tests, using the 'dev' extra forces torch re-install
pip install .[testing]
@@ -65,4 +65,4 @@
# force protobuf version due to issues
pip install "protobuf<4.21.0"
pip list
-WANDB_DISABLED=true TORCH_EXTENSIONS_DIR=./torch-extensions RUN_SLOW=1 pytest --color=yes --durations=0 --verbose tests/deepspeed
+HF_DATASETS_CACHE=/blob/datasets_cache/ TRANSFORMERS_CACHE=/blob/transformers_cache/ WANDB_DISABLED=true TORCH_EXTENSIONS_DIR=./torch-extensions RUN_SLOW=1 pytest --color=yes --durations=0 --verbose tests/deepspeed
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -54,3 +54,9 @@ repos:
--check-filenames,
--check-hidden
]

- repo: https://github.com/pycqa/flake8
rev: 4.0.1
hooks:
- id: flake8
args: ['--ignore=E,F403,F405,F541,F841,W', '--select=E9,F,W6', '--per-file-ignores=__init__.py:F401']
2 changes: 1 addition & 1 deletion README.md
@@ -12,11 +12,11 @@
## Latest News
<b> DeepSpeed trained the world's most powerful language models ([MT-530B](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/), [BLOOM](https://huggingface.co/blog/bloom-megatron-deepspeed)); [learn how](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/).</b>

+* [2022/10] [DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference](https://www.deepspeed.ai/2022/10/10/mii.html)
* [2022/09] [ZeRO-Inference: Democratizing massive model inference](https://www.deepspeed.ai/2022/09/09/zero-inference.html)
* [2022/07] [Azure and DeepSpeed empower easy-to-use and high-performance model training](https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/)
* [2022/07] [DeepSpeed Compression: A composable library for extreme compression](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/)
* [2022/03] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
-* [2022/03] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)

---

2 changes: 1 addition & 1 deletion benchmarks/communication/all_gather.py
@@ -1,6 +1,6 @@
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

import time

2 changes: 1 addition & 1 deletion benchmarks/communication/all_reduce.py
@@ -1,6 +1,6 @@
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

import time

2 changes: 1 addition & 1 deletion benchmarks/communication/all_to_all.py
@@ -1,6 +1,6 @@
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

import time

2 changes: 1 addition & 1 deletion benchmarks/communication/broadcast.py
@@ -1,7 +1,7 @@
import torch
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

import time

2 changes: 1 addition & 1 deletion benchmarks/communication/constants.py
@@ -1,4 +1,4 @@
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

DEFAULT_WARMUPS = 5
DEFAULT_TRIALS = 50
2 changes: 1 addition & 1 deletion benchmarks/communication/pt2pt.py
@@ -1,6 +1,6 @@
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

import time

2 changes: 1 addition & 1 deletion benchmarks/communication/utils.py
@@ -3,7 +3,7 @@
import math
import argparse
from benchmarks.communication.constants import *
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

global dist

2 changes: 1 addition & 1 deletion benchmarks/inference/bert-bench.py
@@ -3,7 +3,7 @@
import deepspeed
import argparse
from transformers import pipeline
-from deepspeed.accelerator.real_accelerator import get_accelerator
+from deepspeed.accelerator import get_accelerator

parser = argparse.ArgumentParser()
parser.add_argument("--model", "-m", type=str, help="hf model name")
20 changes: 15 additions & 5 deletions benchmarks/inference/collect_results.py
@@ -75,6 +75,14 @@ def get_generated_text(file_content, gen_text_n):
return {f"generated-text-{key}": val for key, val in matches}


def get_error(file_content):
    # Grab any "Error: ..." lines from the benchmark output.
    matches = re.findall(r"Error:\s+(.+?)\n", file_content)
    if not matches:  # note: `matches is []` is always False; test for emptiness instead
        return False
    else:
        return {"error": val for val in matches}


if __name__ == "__main__":
# List to collect data from all benchmarks
benchmarks_data = []
@@ -112,15 +120,17 @@ def get_generated_text(file_content, gen_text_n):
perf_data = get_perf_data(file_content)
if not perf_data:
print(
f"WARNING: Could not detect benchmark performance data for file {file_path}, skipping"
f"WARNING: Could not detect benchmark performance data for file {file_path}"
)
continue

generated_text = get_generated_text(file_content, args.gen_text_n)
if not generated_text:
-print(
-f"WARNING: Could not detect generated text for file {file_path}, skipping"
-)
+print(f"WARNING: Could not detect generated text for file {file_path}")

error = get_error(file_content)
if error:
print(f"Error found in {file_path}, collecting error info...")
benchmarks_data.append({"branch": branch, **params, **error})
continue

benchmarks_data.append({