
Understanding adaptive-span loss #13

Closed
prajjwal1 opened this issue Jan 21, 2020 · 7 comments

Comments

@prajjwal1

prajjwal1 commented Jan 21, 2020

Hi,

Sorry to bother you. I have gone through the paper several times, and I've also looked at the code many times. I just had one query about the adaptive-span loss. Here's what I interpreted:
The parameter self.current_val = nn.Parameter(torch.zeros(*shape) + init_val) is responsible for computing the loss, the mask and the span.
In this case, this parameter will be initialized with zeros, since init_val is kept at 0 in your config (so the mean of all the values of the parameter will be 0).

My question is: how does this parameter get updated?

When I call adaptive_span.get_loss(), it in turn computes
self._loss_coeff * self._max_span * self._mask.current_val.mean(), which will also return 0.
When I call adaptive_span.clamp_param(), nothing happens, since all the values inside the parameter were initialized to 0. These are the only two function calls happening inside the train method.
Can you please point out what I am missing?

@tesatory
Contributor

It's a model parameter, so it will be updated by optimizer.step() like any other parameter.
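A minimal sketch of that mechanism (SpanMask, the shapes, the coefficient and the toy LM loss below are illustrative stand-ins, not the repository's code): the span-loss term is added to the LM loss, so current_val receives a gradient, optimizer.step() moves it away from its zero initialization, and clamp_param() then keeps it in range.

import torch
import torch.nn as nn

class SpanMask(nn.Module):
    # Simplified stand-in for the adaptive-span mask parameter.
    def __init__(self, max_span, loss_coeff, init_val=0.0, shape=(12, 1, 1)):
        super().__init__()
        self._max_span = max_span
        self._loss_coeff = loss_coeff
        # Same initialization as quoted above: zeros + init_val.
        self.current_val = nn.Parameter(torch.zeros(*shape) + init_val)

    def get_loss(self):
        # The loss term quoted above: loss_coeff * max_span * current_val.mean().
        return self._loss_coeff * self._max_span * self.current_val.mean()

    def clamp_param(self):
        self.current_val.data.clamp_(0, 1)

mask = SpanMask(max_span=1024, loss_coeff=2e-6)
optimizer = torch.optim.SGD(mask.parameters(), lr=0.1)

# One training step. The toy "LM loss" pulls current_val up (a longer span
# helps the LM), while get_loss() pushes it down.
lm_loss = ((1.0 - mask.current_val) * 3.0).sum()
total_loss = lm_loss + mask.get_loss()
total_loss.backward()
optimizer.step()      # updates current_val like any other parameter
mask.clamp_param()    # then clamp it back into [0, 1]
print(mask.current_val.mean().item())  # no longer 0 after a single update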

@prajjwal1 changed the title from "Understanding loss term from adaptive-span" to "Understanding adaptive-span loss" on Jan 21, 2020
@prajjwal1
Author

prajjwal1 commented Jan 22, 2020

Thanks for your reply. I wanted to ask:

  1. Do you think adaptive span takes longer to converge compared to standard attention? In my case I'm seeing improvements, but they are quite small. Could this be due to trim_memory? Did you try this on tasks other than char-level LM?
  2. In your experiments, did the adaptive-span loss become non-zero at any point? Although current_val is a parameter and is constantly being updated, the loss stays at a constant 0.
    Thanks for your support.

@tesatory
Contributor

  1. Not sure what "converge" means here. If you mean the span is not growing large enough, you might want to reduce the loss coefficient associated with it. trim_memory shouldn't affect learning. Yes, we used it on word-level LM without a problem.

  2. The loss can be zero if it has too large a weight compared to the LM loss. Try setting --adapt-span-loss to 0.
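To make the second point concrete, a quick back-of-the-envelope comparison using the loss formula quoted earlier (the LM-gradient value of -3.0 and both coefficients are made-up stand-ins):

# Per-element gradient on current_val coming from
# loss_coeff * max_span * current_val.mean(), for N = 12 entries.
max_span, n = 1024, 12
lm_grad = -3.0  # stand-in for the LM-loss pull toward a longer span
for loss_coeff in (2e-6, 0.1):  # hypothetical small vs. deliberately large coefficient
    penalty_grad = loss_coeff * max_span / n  # pushes the span down
    print(loss_coeff, lm_grad + penalty_grad)
# Small coefficient: the net gradient is negative, so current_val (the span) grows.
# Large coefficient: the net gradient is positive, current_val is pushed below 0,
# clamp_param() pins it at 0, and the span loss then reads exactly 0.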

@prajjwal1
Author

Hi,
Thanks for replying. What did you use to calculate FLOPS?

@tesatory
Contributor

We just counted all the flops in the model. For example, a linear layer has d_in x d_out flops.
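A rough sketch of that kind of count (illustrative only; the toy model and its layer sizes are hypothetical, and the count is per token, so batching is ignored):

import torch.nn as nn

def count_linear_flops(model: nn.Module) -> int:
    # Multiply-adds of all nn.Linear layers for a single token/step.
    flops = 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            flops += module.in_features * module.out_features
    return flops

toy = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))
print(count_linear_flops(toy))  # 768*3072 + 3072*768 = 4,718,592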

@prajjwal1
Author

prajjwal1 commented Jan 26, 2020

Thanks for your reply.

  1. In the case where trim_len < 0, trim_memory will perform padding on the input tensor, as specified here. So in my case trim_len < 0, since 1024 is big; here's what happens:
# query.shape -> [128,36,768]
# key.shape -> [128,20,768]
# value.shape -> [128,20,768]
k,v,k_pe = adaptive.trim_memory(q,k,v,k_pe)
# k.shape -> [128,1060,768]
# v.shape -> [128,1060,768]
# k_pe.shape -> [1,64,768] 

So in this case, I don't think memory consumption is being reduced, since the dimensions have now grown many-fold and more FLOPS are required. Am I right, or am I missing something? For now, I've removed this operation (a rough sketch of this pad-or-trim behaviour follows at the end of this comment).

  2. Using the masking function as specified in the paper, my FLOPS have stayed the same:
macs: 12.074G 
params: 237.558M

These results were noted during inference. Did you measure FLOPS (as reported in the paper) during training, since the spans only change during training? My spans do change after some training, but the FLOPS stay the same. Is it because the trimming operation alone is responsible for reducing FLOPS?
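For reference, a rough sketch of the pad-or-trim behaviour described in point 1 (a simplified approximation, not the repository's actual trim_memory; the shapes are taken from the numbers quoted above):

import torch
import torch.nn.functional as F

def pad_or_trim_cache(key, value, max_span, block_size):
    # Keep only the last max_span + block_size cached positions;
    # left-pad with zeros if the cache is shorter than that.
    target_len = max_span + block_size
    cache_len = key.size(1)
    if cache_len > target_len:
        key, value = key[:, -target_len:], value[:, -target_len:]
    elif cache_len < target_len:
        pad = target_len - cache_len
        # F.pad fills dims from the last one backwards: (left, right) pairs.
        key = F.pad(key, (0, 0, pad, 0))
        value = F.pad(value, (0, 0, pad, 0))
    return key, value

k = torch.randn(128, 20, 768)
v = torch.randn(128, 20, 768)
k, v = pad_or_trim_cache(k, v, max_span=1024, block_size=36)
print(k.shape)  # torch.Size([128, 1060, 768]) -- padded up, matching the shapes above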

@tesatory
Contributor

As noted in the paper, FLOPS is the number of FLOPS necessary for computing a one-step prediction. So it's not the training-time FLOPS, where a batch of samples is processed together.
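A back-of-the-envelope version of that per-step count (illustrative; the head count and head dimension are hypothetical): for a single predicted token, each attention head does roughly span * d_head multiply-adds for the attention scores and the same again for the weighted sum of values, so shrinking the span shrinks the count regardless of how a training batch is masked.

def attention_flops_per_step(span, n_heads=12, d_head=64):
    # Rough per-token attention multiply-adds: scores plus weighted value sum, per head.
    return n_heads * 2 * span * d_head

print(attention_flops_per_step(span=1024))  # 1,572,864
print(attention_flops_per_step(span=128))   # 196,608 -- 8x fewer with a shorter span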
