Merged
56 commits
2025ac2
init new test
SaulLu Feb 28, 2022
c33343d
test pad vocab size to
SaulLu Feb 28, 2022
390b4dc
add logs
SaulLu Feb 28, 2022
784b751
log to warning
SaulLu Feb 28, 2022
6f3a472
change TP
SaulLu Feb 28, 2022
1d9649a
fix loop
SaulLu Feb 28, 2022
7fa5c10
revert
SaulLu Feb 28, 2022
bcc6d8d
remove hack size
SaulLu Feb 28, 2022
9e17a4f
this new test should pass
SaulLu Feb 28, 2022
92614bf
test not divisible by num tp
SaulLu Feb 28, 2022
8322f89
Revert "remove hack size"
SaulLu Feb 28, 2022
6d72073
Revert "Revert "remove hack size""
SaulLu Feb 28, 2022
84333d3
Revert "test not divisible by num tp"
SaulLu Feb 28, 2022
b2382d8
Revert "this new test should pass"
SaulLu Feb 28, 2022
d4a15a3
change info to warning
SaulLu Feb 28, 2022
cd5e8b4
change to print
SaulLu Feb 28, 2022
a6ee894
add print
SaulLu Feb 28, 2022
f534c43
test 2
SaulLu Feb 28, 2022
0a1167b
new print
SaulLu Feb 28, 2022
34bfd60
woups
SaulLu Feb 28, 2022
50cb3ca
more
SaulLu Feb 28, 2022
786e02d
woups
SaulLu Feb 28, 2022
20d08a8
comment
SaulLu Feb 28, 2022
915bd6c
raise errors
SaulLu Feb 28, 2022
119a0d2
woups
SaulLu Feb 28, 2022
5c6dec0
pad to save vocab size
SaulLu Feb 28, 2022
de3353f
simplify test
SaulLu Feb 28, 2022
8485770
assert test raised
SaulLu Feb 28, 2022
df24492
print error msg
SaulLu Feb 28, 2022
46fc9da
check msg error
SaulLu Feb 28, 2022
9ffafb1
check error
SaulLu Feb 28, 2022
1eb5baa
woups
SaulLu Feb 28, 2022
56af695
clean
SaulLu Feb 28, 2022
3ea0c6b
simplify
SaulLu Feb 28, 2022
be2e371
remove unused print
SaulLu Feb 28, 2022
8986962
add comment
SaulLu Feb 28, 2022
a72fa03
add test multiple of tp size
SaulLu Feb 28, 2022
1e5b2af
add print
SaulLu Feb 28, 2022
8d8be7e
add check
SaulLu Feb 28, 2022
b2867a7
clean
SaulLu Feb 28, 2022
ef61e89
Update megatron/mpu/layers.py
SaulLu Feb 28, 2022
c10a359
Update megatron/tokenizer/tokenizer.py
SaulLu Feb 28, 2022
fc975b4
chnage micro-batch-size
SaulLu Feb 28, 2022
a2b86b7
use tiny vocab
SaulLu Feb 28, 2022
ae9f83c
fix data dir
SaulLu Feb 28, 2022
ecdda50
fix arg
SaulLu Feb 28, 2022
c170fd9
change micro-batch-size
SaulLu Feb 28, 2022
c82d615
adept input ids
SaulLu Feb 28, 2022
3587b52
assertIn
SaulLu Feb 28, 2022
a90a8f9
change micro batch size
SaulLu Feb 28, 2022
982d88c
Fix test TP
SaulLu Mar 1, 2022
78b7686
unused var
SaulLu Mar 1, 2022
c922204
add test make_vocab_size_divisible_by
SaulLu Mar 1, 2022
806cbb5
fix test_tokenizer_vocab_size_multiple_of_tp_size test
SaulLu Mar 1, 2022
f515b67
Fix padded vocab size on preprocessing scripts (#257)
thomasw21 Mar 1, 2022
02f86f5
documentation
SaulLu Mar 1, 2022
4 changes: 4 additions & 0 deletions megatron/arguments.py
@@ -369,6 +369,10 @@ def _add_network_size_args(parser):
group.add_argument('--make-vocab-size-divisible-by', type=int, default=128,
help='Pad the vocab size to be divisible by this value.'
'This is added for computational efficieny reasons.')
group.add_argument('--pad-vocab-size-to', type=int, default=None,
Member
Suggested change
group.add_argument('--pad-vocab-size-to', type=int, default=None,
group.add_argument('--pad-embedding-size-to', type=int, default=None,

Collaborator Author
IMO, I liked the naming "vocab" because it emphasizes the fact that we're really choosing our tokenizer's vocabulary size.

Member
actually works for me, I just didn't want to indicate we were modifying the tokenizer, but really just the embedding layer. But in Meg-DS they seem to use padded_vocab_size so your solution makes more sense.

help='Pad the vocab size to this value.'
'This value must be greater than the initial size of the tokenizer'
', needs to be divisible by TP size and `make-vocab-size-divisible-by`.')
group.add_argument('--layernorm-epsilon', type=float, default=1e-5,
help='Layer norm epsilon.')
group.add_argument('--apply-residual-connection-post-layernorm',
5 changes: 5 additions & 0 deletions megatron/mpu/layers.py
@@ -217,6 +217,9 @@ def __init__(self, num_embeddings, embedding_dim,


def forward(self, input_):
if torch.any(input_ >= self.num_embeddings):
raise ValueError(f"There is an input id in the input that is greater than the highest possible input id.\nInput: {input_}\nnum_embeddings: {self.num_embeddings}")

@stas00 (Contributor), Feb 28, 2022
Killing the training at run time because the input is broken? This we can't afford to support.

Additionally, this assert can't be acted upon: the operator would have no idea how to fix it, since you're not including the input id and the sample id.

If there is a need to validate data before the training, it should happen separately from the training.

The worst-case scenario that can be supported is probably to skip the bad input, i.e. do the checking at the dataloader retrieval stage. But still, this feels like bad design in the software.

Contributor
The forward should be as lean as possible and not need to check anything, so that it can run as fast as possible.

@thomasw21 (Member), Feb 28, 2022
I don't agree: we need this sanity check so that we're not doing something wrong. Right now, the model would silently bypass this issue if you use TP>2. IMO it should kill the training, as we are doing something VERY bad. (Essentially, if you use TP=1, this does get killed when you call F.embedding on it.)

Contributor
You don't do data sanity checks in forward. All sanity checks should ideally be done beforehand, at the dataloader level.

Which case are you guarding against - all of the data is completely borked or only some inputs are invalid?

Contributor
And practically, let's do an imaginary scenario: you started the training and this assert happens 5 days in. What do you do? I fail to see how this assert is actionable.

@stas00 (Contributor), Feb 28, 2022
Sorry, unfortunately I don't have the resources to dive into this right now, as I have to finish lots of things before the launch. So I trust you will do the best thing you can, and if things break during the training I will ping you and you will know what to do.

It's not great to have this sort of last-minute change that hasn't been thoroughly tested, but what can you do. It's not the only last-minute change; e.g. the whole bf16 optimizer was rewritten last week.

Contributor
Thomas and I discussed this; he helped me clear up an important misunderstanding and he will post an update.

Thank you for working on this, @SaulLu and @DanielHesslow - my apologies that I can't be involved at a deeper level at the moment.

@thomasw21 (Member), Feb 28, 2022
Okay, talked to @stas00:

  • This PR makes sense, in that the model should check for this. @DanielHesslow said it better, but if we get an out-of-bounds index, we need to raise an exception.
  • There's the issue of "if the training set is bad for X reason, what do we do?" After this PR it will throw badly, so we need a strategy. Given the number of possible root causes, @stas00 suggests having a skip mechanism in the data loader. Why? Because we're able to keep training in case the issue "isn't that bad". In the worst-case scenario, we kill the job and relaunch from scratch. But this is orthogonal to this PR IMO. Another approach I'm hoping to pursue is to actually go through the data loader with a check before the training, so we don't get bad surprises.

Contributor
Yes, so the idea is to do a check at the dataloader level, and warn when skipping a bad sample. If it's a lot of samples but not the majority, we continue training while someone fixes the data.

Of course, if the data is very broken then it'll be skipping them all and we can't train.

But the point is: don't crash the training unless you have to.

Contributor
And additionally, we can pre-check all our data outside of the training so that it doesn't have any bad samples.

This PR is mainly for future users, but in our training we should make sure it never hits this assert.
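For reference, a minimal sketch (not part of this PR) of the dataloader-level skip mechanism discussed above; it assumes a plain iterable of token-id sequences, and the function name and logging are illustrative:

```python
import logging

def skip_invalid_samples(sample_iter, vocab_size):
    """Yield only samples whose token ids fall inside [0, vocab_size); warn about and skip the rest."""
    for idx, token_ids in enumerate(sample_iter):
        bad = [t for t in token_ids if t < 0 or t >= vocab_size]
        if bad:
            logging.warning("skipping sample %d: out-of-range token ids %s", idx, bad)
            continue
        yield token_ids
```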

if self.tensor_model_parallel_size > 1:
# Build the mask.
input_mask = (input_ < self.vocab_start_index) | \
@@ -225,7 +228,9 @@ def forward(self, input_):
masked_input = input_.clone() - self.vocab_start_index
masked_input[input_mask] = 0
else:
# input_ is garanted to be in the range [0:self.vocab_end_index - self.vocab_start_index] thanks to the first check
masked_input = input_

# Get the embeddings.
output_parallel = F.embedding(masked_input, self.weight,
self.padding_idx, self.max_norm,
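To illustrate the point above about ids slipping through silently once TP > 1, here is a toy two-rank simulation of the masking logic in this hunk (a standalone sketch, not the actual Megatron class; the sizes, ids, and weights are made up):

```python
import torch
import torch.nn.functional as F

# Toy setup: padded vocab of 8 split across 2 tensor-parallel ranks, 4 rows each.
padded_vocab_size, tp_size, hidden = 8, 2, 3
per_rank = padded_vocab_size // tp_size
weights = [torch.randn(per_rank, hidden) for _ in range(tp_size)]

input_ = torch.tensor([[1, 5, 9]])  # 9 is out of range (valid ids are 0..7)

partial_outputs = []
for rank in range(tp_size):
    start, end = rank * per_rank, (rank + 1) * per_rank
    input_mask = (input_ < start) | (input_ >= end)   # ids this rank does not own
    masked_input = input_.clone() - start
    masked_input[input_mask] = 0                      # the invalid id 9 is zeroed on every rank
    out = F.embedding(masked_input, weights[rank])
    out[input_mask, :] = 0.0
    partial_outputs.append(out)

# The all-reduce across ranks is just a sum here: id 9 silently contributes a zero
# embedding instead of raising, unlike TP=1 where F.embedding would throw an IndexError.
print(sum(partial_outputs))
```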
27 changes: 19 additions & 8 deletions megatron/tokenizer/tokenizer.py
@@ -68,14 +68,25 @@ def build_tokenizer(args):


def _vocab_size_with_padding(orig_vocab_size, args):
"""Pad vocab size so it is divisible by model parallel size and
still having GPU friendly size."""

after = orig_vocab_size
multiple = args.make_vocab_size_divisible_by * \
args.tensor_model_parallel_size
while (after % multiple) != 0:
after += 1
"""Apply the requested rules to change the size of the vocabulary"""
if args.pad_vocab_size_to is not None:
if args.pad_vocab_size_to < orig_vocab_size:
raise ValueError(
f"You asked to pad the vocabulary to {args.pad_vocab_size_to} when the initial vocabulary size is "
f"{orig_vocab_size}. You can only pad to a higher value."
)

if args.make_vocab_size_divisible_by is not None and (args.pad_vocab_size_to % args.make_vocab_size_divisible_by) != 0:
raise ValueError(f"{args.pad_vocab_size_to} is not divisible by {args.make_vocab_size_divisible_by}")

after = args.pad_vocab_size_to
else:
# Pad vocab size so it is divisible by model parallel size and still having GPU friendly size.
after = orig_vocab_size
multiple = args.make_vocab_size_divisible_by * \
args.tensor_model_parallel_size
while (after % multiple) != 0:
after += 1
if args.rank == 0:
print(' > padded vocab (size: {}) with {} dummy tokens '
'(new size: {})'.format(
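A standalone sketch of the padding rules in the hunk above, with illustrative numbers. It mirrors `_vocab_size_with_padding` in simplified form; the TP-size divisibility requirement mentioned in the help text is enforced in a part of the diff not shown in this hunk and is not reproduced here:

```python
def padded_vocab_size(orig_vocab_size, pad_vocab_size_to=None,
                      make_vocab_size_divisible_by=128, tp_size=1):
    if pad_vocab_size_to is not None:
        if pad_vocab_size_to < orig_vocab_size:
            raise ValueError(f"cannot pad vocab of size {orig_vocab_size} down to {pad_vocab_size_to}")
        if pad_vocab_size_to % make_vocab_size_divisible_by != 0:
            raise ValueError(f"{pad_vocab_size_to} is not divisible by {make_vocab_size_divisible_by}")
        return pad_vocab_size_to
    # Default behaviour: round up to a multiple of the divisor times the TP size.
    multiple = make_vocab_size_divisible_by * tp_size
    after = orig_vocab_size
    while after % multiple != 0:
        after += 1
    return after

print(padded_vocab_size(50257, make_vocab_size_divisible_by=128, tp_size=2))  # 50432
print(padded_vocab_size(5000, pad_vocab_size_to=5120))                        # 5120 (i.e. 128 * 40)
```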
114 changes: 108 additions & 6 deletions tests/test_tensor_parallel.py
@@ -25,6 +25,7 @@
class MegDSTestTP(TestCasePlus):
def get_default_args(self):
"""return a dictionary with key as argument name and value as additional arguments"""
data_dir = f"{self.data_dir}/gpt2"
return {
# GPT_ARGS
"--num-layers": "2",
@@ -39,8 +40,9 @@ def get_default_args(self):
"--lr": "0.00015",
"--min-lr": "1.0e-5",
"--train-iters": "5000",
"--tokenizer-type": "PretrainedFromHF",
"--tokenizer-name-or-path": "gpt2",
"--tokenizer-type": "GPT2BPETokenizer",
"--merge-file": f"{data_dir}/gpt2-tiny-merges.txt",
"--vocab-file": f"{data_dir}/gpt2-tiny-vocab.json",
"--data-impl": "mmap",
"--split": "949,50,1",
"--distributed-backend": "nccl",
@@ -111,8 +113,6 @@ def create_model_inputs(tokens):
initialize_megatron()
args = get_args()

args.vocab_size = args.padded_vocab_size = 1024

tokenizer = get_tokenizer()

model, _, _ = setup_model_and_optimizer(gpt_model_provider)
@@ -141,7 +141,6 @@ def create_model_inputs(tokens):
else:
token_ids = torch.tensor(token_ids)


model.micro_batches = 1
model.set_batch_fn(create_model_inputs)
# process batch
@@ -156,7 +155,7 @@ def create_model_inputs(tokens):

output = model.eval_batch(iter([token_ids]), compute_loss = False, reduce_output = None)[0]

output = gather_from_tensor_model_parallel_region(output)[..., :tokenizer.vocab_size]
output = gather_from_tensor_model_parallel_region(output)

if save != None:
args.save = save
Expand All @@ -169,6 +168,7 @@ def test_alibi_tp(self):
cp_dir = self.get_auto_remove_tmp_dir()

command_args = self.get_default_args()
command_args["--pad-vocab-size-to"] = "5120" # This is equal to 128 * 40 which is above the len of gp2-tiny vocabulary
command_args["--position-embedding-type"] = "alibi"
command_args["--tensor-model-parallel-size"] = "1"

@@ -192,5 +192,107 @@ def test_alibi_tp(self):
logging.getLogger().critical(output-output2)
self.assertTrue(np.allclose(output,output2, atol=5e-3, rtol=0), "Different results when running with TP=1 and TP=2")



def test_embedding_matrix_tp(self):
mp.set_start_method('spawn', force=True)
cp_dir = self.get_auto_remove_tmp_dir()

command_args = self.get_default_args()
command_args["--pad-vocab-size-to"] = "5120" # This is equal to 128 * 40 which is above the len of gp2-tiny vocabulary
command_args["--seq-length"] = "4"
command_args["--micro-batch-size"] = "2"
tokens = [[5119, 0, 1, 5100],[0, 1, 5111, 5101]]

command_args["--tensor-model-parallel-size"] = "1"

pool = Pool(1)
# tp_index, tp_size, command_args, token_ids, save, load
result = pool.map(MegDSTestTP.infer_model, [((0, 1, command_args, tokens, cp_dir, None))])
pool.close()
pool.join()

output, _ = result[0]
logging.getLogger().info("First done!")

command_args["--tensor-model-parallel-size"] = "2"

pool = Pool(2)
result = pool.map(MegDSTestTP.infer_model, [((0, 2, command_args, tokens, None, cp_dir)), ((1, 2, command_args, tokens, None, cp_dir))])
pool.close()
pool.join()

output2, _ = result[0]

logging.getLogger().critical(output-output2)
self.assertTrue(np.allclose(output,output2, atol=5e-3, rtol=0), "Different results when running with TP=1 and TP=2")


def test_embedding_matrix_tp_with_invalid_tokens_ids(self):
mp.set_start_method('spawn', force=True)

command_args = self.get_default_args()
command_args["--pad-vocab-size-to"] = "5120" # This is equal to 128 * 40 which is above the len of gp2-tiny vocabulary
command_args["--seq-length"] = "4"
command_args["--micro-batch-size"] = "2"
tokens = [[5120, 0, 1, 2],[0, 1, 3, 4]]

command_args["--tensor-model-parallel-size"] = "1"

pool = Pool(1)
with pytest.raises(Exception) as exc_info:
_ = pool.map(MegDSTestTP.infer_model, [((0, 1, command_args, tokens, None, None))])
pool.close()
pool.join()

self.assertIn("There is an input id in the input that is greater than the highest possible input id" , str(exc_info.value))

logging.getLogger().info("First done!")

command_args["--tensor-model-parallel-size"] = "2"

pool = Pool(2)
with pytest.raises(Exception) as exc_info:
_ = pool.map(MegDSTestTP.infer_model, [((0, 2, command_args, tokens, None, None)), ((1, 2, command_args, tokens, None, None))])
pool.close()
pool.join()

self.assertIn("There is an input id in the input that is greater than the highest possible input id", str(exc_info.value))


def test_tokenizer_vocab_size_multiple_of_tp_size(self):
mp.set_start_method('spawn', force=True)

command_args = self.get_default_args()
command_args["--pad-vocab-size-to"] = "5121" # This is equal to 128 * 40 + 1 which is above the len of gp2-tiny vocabulary
command_args["--micro-batch-size"] = "4"
command_args["--tensor-model-parallel-size"] = "2"
command_args["--make-vocab-size-divisible-by"] = "1"

pool = Pool(2)
with pytest.raises(Exception) as exc_info:
_ = pool.map(MegDSTestTP.infer_model, [((0, 2, command_args, None, None, None)), ((1, 2, command_args, None, None, None))])
pool.close()
pool.join()

self.assertEqual(str(exc_info.value), "5121 is not divisible by 2")

def test_tokenizer_raise_error_make_vocab_size_divisible_by(self):
mp.set_start_method('spawn', force=True)

command_args = self.get_default_args()
command_args["--pad-vocab-size-to"] = "5121" # This is equal to 128 * 40 + 1 which is above the len of gp2-tiny vocabulary
command_args["--micro-batch-size"] = "4"


pool = Pool(2)
with pytest.raises(Exception) as exc_info:
_ = pool.map(MegDSTestTP.infer_model, [((0, 2, command_args, None, None, None)), ((1, 2, command_args, None, None, None))])
pool.close()
pool.join()

self.assertEqual(str(exc_info.value), "5121 is not divisible by 128")


if __name__ == '__main__':
unittest.main()
9 changes: 8 additions & 1 deletion tools/preprocess_data.py
@@ -119,6 +119,14 @@ def get_args():
help='Append an <eod> token to the end of a document.')
group.add_argument("--tokenizer-name-or-path", type=str, default=None,
help="Name or path of the huggingface tokenizer.")
group.add_argument('--make-vocab-size-divisible-by', type=int, default=128,
help='Pad the vocab size to be divisible by this value.'
'This is added for computational efficieny reasons.')
group.add_argument('--pad-vocab-size-to', type=int, default=None,
help='Pad the vocab size to be divisible by this value.'
'Value of the size of the vocabulary of the tokenizer to reach. This value must be greater than'
' the initial size of the tokenizer. If this argument is used the value of '
'`make-vocab-size-divisible-by` will be ignored.')

group = parser.add_argument_group(title='output data')
group.add_argument('--output-prefix', type=str, required=True,
@@ -140,7 +148,6 @@ def get_args():

# some default/dummy values for the tokenizer
args.rank = 0
args.make_vocab_size_divisible_by = 128
args.tensor_model_parallel_size = 1
args.vocab_extra_ids = 0

10 changes: 9 additions & 1 deletion tools/preprocess_data_dist.py
@@ -167,6 +167,15 @@ def get_args():
help='Path to binary output file without suffix')
group.add_argument('--dataset-impl', type=str, default='mmap',
choices=['lazy', 'cached', 'mmap'])
group.add_argument('--make-vocab-size-divisible-by', type=int, default=128,
help='Pad the vocab size to be divisible by this value.'
'This is added for computational efficieny reasons.')
group.add_argument('--pad-vocab-size-to', type=int, default=None,
help='Pad the vocab size to be divisible by this value.'
'Value of the size of the vocabulary of the tokenizer to reach. This value must be greater than'
' the initial size of the tokenizer. If this argument is used the value of '
'`make-vocab-size-divisible-by` will be ignored.')


group = parser.add_argument_group(title='runtime')
group.add_argument('--torch-backend', type=str, default='gloo', choices=['gloo', 'mpi'],
@@ -198,7 +207,6 @@ def get_args():
args.numranks = args.distctx.numranks

# some default/dummy values for the tokenizer
args.make_vocab_size_divisible_by = 128
args.tensor_model_parallel_size = 1
args.vocab_extra_ids = 0

9 changes: 8 additions & 1 deletion tools/preprocess_data_many_cores.py
@@ -185,6 +185,14 @@ def get_args():
help='Append an <eod> token to the end of a document.')
group.add_argument("--tokenizer-name-or-path", type=str, default=None,
help="Name or path of the huggingface tokenizer.")
group.add_argument('--make-vocab-size-divisible-by', type=int, default=128,
help='Pad the vocab size to be divisible by this value.'
'This is added for computational efficieny reasons.')
group.add_argument('--pad-vocab-size-to', type=int, default=None,
help='Pad the vocab size to be divisible by this value.'
'Value of the size of the vocabulary of the tokenizer to reach. This value must be greater than'
' the initial size of the tokenizer. If this argument is used the value of '
'`make-vocab-size-divisible-by` will be ignored.')

group = parser.add_argument_group(title='output data')
group.add_argument('--output-prefix', type=str, required=True,
@@ -206,7 +214,6 @@ def get_args():

# some default/dummy values for the tokenizer
args.rank = 0
args.make_vocab_size_divisible_by = 128
args.tensor_model_parallel_size = 1
args.vocab_extra_ids = 0
