Size of embedding tables in MLPerf checkpoint #369

Open
AlCatt91 opened this issue Dec 14, 2023 · 0 comments
Hello, I am looking at the pre-trained weights for the MLPerf benchmark configuration on Criteo Terabyte that are provided in the README (link). If I understand correctly, this should be the best checkpoint of the configuration run by the script ./bench/run_and_time.sh.
Based on the code snippet

if args.max_ind_range > 0:
    ln_emb = np.array(
        list(
            map(
                lambda x: x if x < args.max_ind_range else args.max_ind_range,
                ln_emb,
            )
        )
    )
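(For reference, the snippet above is just an element-wise cap on the table cardinalities. A minimal standalone sketch, using made-up cardinalities rather than the real Criteo counts:)

```python
import numpy as np

# Made-up table cardinalities for illustration only (not the real Criteo counts).
ln_emb = np.array([45_840_617, 39_043, 17_289, 40_000_000])
max_ind_range = 40_000_000

# Equivalent to the map/lambda above: cap each table size at max_ind_range.
capped = np.minimum(ln_emb, max_ind_range)
print(capped.tolist())  # [40000000, 39043, 17289, 40000000]
```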

since that config uses --max-ind-range=40000000, I was expecting the largest embedding tables (namely, tables 0, 9, 19, 20, 21) to be capped at exactly 40M rows. However, the lengths of these tensors in the state_dict of the downloaded checkpoint vary more than that:

**emb_l.0.weight: torch.Size([39884406, 128])**
emb_l.1.weight: torch.Size([39043, 128])
emb_l.2.weight: torch.Size([17289, 128])
emb_l.3.weight: torch.Size([7420, 128])
emb_l.4.weight: torch.Size([20263, 128])
emb_l.5.weight: torch.Size([3, 128])
emb_l.6.weight: torch.Size([7120, 128])
emb_l.7.weight: torch.Size([1543, 128])
emb_l.8.weight: torch.Size([63, 128])
**emb_l.9.weight: torch.Size([38532951, 128])**
emb_l.10.weight: torch.Size([2953546, 128])
emb_l.11.weight: torch.Size([403346, 128])
emb_l.12.weight: torch.Size([10, 128])
emb_l.13.weight: torch.Size([2208, 128])
emb_l.14.weight: torch.Size([11938, 128])
emb_l.15.weight: torch.Size([155, 128])
emb_l.16.weight: torch.Size([4, 128])
emb_l.17.weight: torch.Size([976, 128])
emb_l.18.weight: torch.Size([14, 128])
**emb_l.19.weight: torch.Size([39979771, 128])**
**emb_l.20.weight: torch.Size([25641295, 128])**
**emb_l.21.weight: torch.Size([39664984, 128])**
emb_l.22.weight: torch.Size([585935, 128])
emb_l.23.weight: torch.Size([12972, 128])
emb_l.24.weight: torch.Size([108, 128])
emb_l.25.weight: torch.Size([36, 128])

How does the hashing work for this model? It cannot simply be taking the categorical value ID modulo 40M, as in the released PyTorch code. Moreover, some of the smaller embedding tables also appear to have been reduced in size, which suggests additional custom filtering/merging of the categorical values?
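To make the expectation concrete: with plain modulo hashing, raw IDs are folded into [0, max_ind_range), so any table whose raw cardinality exceeds 40M should end up using essentially all 40M rows. A minimal sketch of that assumption (the helper name is hypothetical, not from the DLRM code):

```python
import numpy as np

MAX_IND_RANGE = 40_000_000

def fold_ids(raw_ids):
    """Plain modulo hashing: fold raw categorical IDs into [0, MAX_IND_RANGE)."""
    return np.asarray(raw_ids) % MAX_IND_RANGE

# An ID beyond the range wraps around; small IDs pass through unchanged.
print(fold_ids([40_000_001, 123]).tolist())  # [1, 123]
```

Under this scheme a table like emb_l.0 would have exactly 40,000,000 rows, so sizes such as 39,884,406 or 25,641,295 do point to some other reduction step.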

Also, I do not see a test_auc key in the checkpointed dictionary, despite --mlperf-logging being set in ./bench/run_and_time.sh: what is the test AUC of this pre-trained model?
