Add local tensor-parallel fwd/bwd #143

justheuristic · 2022-12-09T10:27:47Z

This pull request adds an option to run a Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel.

- tensor_parallel works with forward/backward
- tensor_parallel works with inference (use_cache=True)
- exact match output, grad w.r.t. input, inference caches
- works with odd numbers of GPUs
- del block frees memory from all GPUs (for rebalancing)
- stability tests
- add to CI tests
- determine num_blocks
- measure throughput correctly
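For readers unfamiliar with the idea behind the "exact match" and "odd numbers of GPUs" items, here is a minimal pure-Python sketch of column-wise tensor parallelism for a single linear layer. It is not the tensor_parallel library's implementation; `split_sizes`, `matmul` and `column_parallel_matmul` are hypothetical helper names, lists stand in for tensors, and list concatenation stands in for gathering shard outputs from separate GPUs.

```python
from typing import List

Matrix = List[List[float]]


def split_sizes(total: int, num_shards: int) -> List[int]:
    """Split `total` output columns into near-equal parts; sizes may be uneven."""
    base, extra = divmod(total, num_shards)
    return [base + (1 if i < extra else 0) for i in range(num_shards)]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Dense matmul: x is [batch, in_features], w is [in_features, out_features]."""
    num_out = len(w[0])
    return [[sum(xi * w[k][j] for k, xi in enumerate(row)) for j in range(num_out)] for row in x]


def column_parallel_matmul(x: Matrix, w: Matrix, num_shards: int) -> Matrix:
    """Give each shard a slice of output columns, then concatenate shard outputs."""
    outputs: Matrix = [[] for _ in x]
    start = 0
    for size in split_sizes(len(w[0]), num_shards):
        shard = [row[start:start + size] for row in w]  # this shard's weight columns
        partial = matmul(x, shard)  # in a real setup this would run on one GPU
        for out_row, part_row in zip(outputs, partial):
            out_row.extend(part_row)  # stands in for gathering results across devices
        start += size
    return outputs


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0, 3.0], [0.0, 1.0, 1.0, 2.0, 1.0]]
# 5 output columns over 3 shards gives an uneven 2/2/1 split, and the result is unchanged.
assert column_parallel_matmul(x, w, num_shards=3) == matmul(x, w)
```

Each output element is computed by the same sum in the same order whether or not the weight is sharded, which is why this toy version matches exactly. Real GPU kernels may not guarantee that in general.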

Benchmark

https://gist.github.com/justheuristic/149ccfbf903a847cbaa09dbe59965bd9

Overnight sanity checks:

- 8bit approximation error same as in main (mean~=2% q0.9~=5%)
  - TP=1, 2, 3 (see screenshots above)
- forward, grad w.r.t. input and inference exact match with main with TP=1
- >=80% GPU utilization with 3x 1080 Ti, batch = 8 tokens
- throughput measured with and without TP
- TP on 1080 Tis has near-linear speedup comparable to the benchmarks (see first message)
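As an illustration of how figures like "mean~=2% q0.9~=5%" might be obtained, the sketch below computes the mean and the 0.9 quantile of an elementwise relative error. The PR does not say how the error was defined or measured, so the definition used here (|approx - ref| / |ref| with a nearest-rank quantile) and the function name `relative_error_stats` are assumptions.

```python
import math
from typing import Sequence, Tuple


def relative_error_stats(
    reference: Sequence[float], approx: Sequence[float], q: float = 0.9
) -> Tuple[float, float]:
    """Return (mean, q-quantile) of |approx - ref| / |ref|; assumes nonzero references."""
    errors = sorted(abs(a - r) / abs(r) for r, a in zip(reference, approx))
    mean = sum(errors) / len(errors)
    # Nearest-rank quantile; interpolating quantile rules give slightly different values.
    rank = max(0, math.ceil(q * len(errors)) - 1)
    return mean, errors[rank]


# Toy data: the "approximate" outputs drift from the reference by 0% to 9%.
reference = [1.0] * 10
approx = [1.0 + 0.01 * i for i in range(10)]
mean_err, q90_err = relative_error_stats(reference, approx)
print(f"mean={mean_err:.3f} q0.9={q90_err:.3f}")
```

Comparing these two numbers between a branch and main, on the same inputs, is one way to check that a change did not make the 8-bit approximation worse.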

Co-authored-by: Iaroslav Lisniak <48571134+IaroslavLisniak@users.noreply.github.com>
Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Andrei Panferov <andrei.panferov@eqvilent.com>

justheuristic · 2023-01-03T10:47:03Z

src/petals/server/backend.py

@@ -48,44 +64,60 @@ def __init__(self, *args, memory_cache: MemoryCache, backend_dtype: torch.dtype,
            self.kwargs_schema,
        )

+    def get_inference_cache_descriptors(self, batch_size: int, max_length: int) -> Tuple[TensorDescriptor, ...]:


The returned tuple can contain more than 2 descriptors if TP > 1.
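To make the comment concrete, here is a hedged sketch of why the number of cache descriptors can grow with the tensor-parallel degree: if every shard keeps its own key and value cache for its share of attention heads, the result has two entries per shard. `CacheDescriptor` is a hypothetical stand-in for `TensorDescriptor`, and the shapes, argument names and head split are assumptions rather than the code in `backend.py`.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CacheDescriptor:
    """Hypothetical stand-in for a tensor descriptor: shape and device only."""
    shape: Tuple[int, ...]
    device: str


def cache_descriptors(
    batch_size: int,
    max_length: int,
    head_dim: int,
    shard_num_heads: Sequence[int],
    devices: Sequence[str],
) -> Tuple[CacheDescriptor, ...]:
    """Build one (keys, values) descriptor pair per shard."""
    descriptors = []
    for device, num_heads in zip(devices, shard_num_heads):
        # Assumed layout: [batch, heads on this shard, sequence, head_dim].
        shape = (batch_size, num_heads, max_length, head_dim)
        descriptors.extend((CacheDescriptor(shape, device), CacheDescriptor(shape, device)))
    return tuple(descriptors)


# TP=1 yields 2 descriptors; TP=3 with 8 heads split 3/3/2 yields 6.
single = cache_descriptors(1, 128, 64, [8], ["cuda:0"])
triple = cache_descriptors(1, 128, 64, [3, 3, 2], ["cuda:0", "cuda:1", "cuda:2"])
assert len(single) == 2 and len(triple) == 6
```

Callers that unpack the result as exactly `(keys, values)` would need to handle the longer tuple once TP > 1.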

src/petals/cli/run_server.py

src/petals/server/server.py

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>

src/petals/server/throughput.py

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>

justheuristic added 2 commits December 9, 2022 11:59

empty test

af84502

undo

a6f00e2

borzunov changed the title ~~[donotmerege] empty test~~ [do not merge] empty test Dec 9, 2022

justheuristic and others added 15 commits December 10, 2022 21:55

check for empty inputs

2a30d51

check for empty inputs

3715493

update slicer

c1ff01b

Co-authored-by: Iaroslav Lisniak <48571134+IaroslavLisniak@users.noreply.github.com>
Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Andrei Panferov <andrei.panferov@eqvilent.com>

black-isort

a629510

Co-authored-by: Iaroslav Lisniak <48571134+IaroslavLisniak@users.noreply.github.com>
Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Andrei Panferov <andrei.panferov@eqvilent.com>

swap assert

f3c6ab2

black-isort

6f65aae

merge rules

797000b

add a basic test

b6b5f32

remove barrier, add teardown event

a95c3da

safety first

7087380

typo

ed2276a

justheuristic changed the title ~~[do not merge] empty test~~ Added tensor-parallel fwd/bwd Dec 12, 2022

justheuristic changed the title ~~Added tensor-parallel fwd/bwd~~ Added local tensor-parallel fwd/bwd Dec 12, 2022

justheuristic added 10 commits December 12, 2022 19:47

test convolutions

5cfd726

typo

4546adf

black-isort

2e67090

Merge branch 'main' into empty

ec9429d

minimize diff

c1cc0b2

minimize diff

0e3e5f1

minimize diff

7912500

minimize diff

3c560cf

Merge branch 'main' into empty

174b777

extra deflap

ba541ee

justheuristic and others added 12 commits January 3, 2023 05:02

trigger full sync

f159a1f

trigger full sync

6069de1

trigger full sync

97007a9

trigger full sync

00830df

trigger full sync

9711ac7

undo

b98f50b

explicit blocking copy

b75c3d2

better performance (do not broadcast cache between ranks)

ddc3d8c

check cache key

2540ebc

remove debugprint

f4f3455

swap to pypi

a528476

review

a5d64a0

justheuristic commented Jan 3, 2023

review

926cdc3

borzunov reviewed Jan 3, 2023

src/petals/cli/run_server.py (outdated, resolved)

borzunov reviewed Jan 3, 2023

src/petals/server/server.py (outdated, resolved)

borzunov and others added 7 commits January 3, 2023 16:40

review

1110dd5

review

82c463a

Update src/petals/cli/run_server.py

b38a956

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>

Update src/petals/server/server.py

7eb41e5

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>

review

3c88a28

Merge branch 'empty' of github.com:bigscience-workshop/petals into empty

b02edc3

review

1f5cf79

borzunov reviewed Jan 3, 2023

src/petals/server/throughput.py (outdated, resolved)

justheuristic and others added 2 commits January 3, 2023 17:21

Update src/petals/server/throughput.py

8677c57

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>

black

89c54f2

borzunov approved these changes Jan 3, 2023

fix throughput eval

9416950

justheuristic merged commit ae9e71f into main Jan 3, 2023

justheuristic deleted the empty branch January 3, 2023 15:35
