Hi @kernelmachine / @kyleclo, I'm wondering how the News corpus was filtered from the RealNews dataset. I tried extracting docs from RealNews but got 32.80M docs instead of the 11.90M mentioned in the paper. Was any additional filtering applied? Thanks!
Hey @stevezheng23, the number of documents we reported was constrained by the amount of DAPT we wanted to perform for the experiments. That is, we only used enough documents to perform our 12.5K gradient updates with a batch size of 2K. This does not use all of the RealNews dataset.
We didn't perform any intentional filtering, just a random subsampling.
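For anyone trying to get a subset of similar size, here is a minimal sketch of uniform random subsampling over a stream of documents. It is not the exact procedure used for the paper (which is not described here beyond "random subsampling"); it uses reservoir sampling as one possible way to do it without loading all of RealNews into memory. The function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k items drawn uniformly at random from a stream (Algorithm R).

    Hypothetical helper for illustration; not taken from the paper's code.
    If the stream holds fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Toy stand-in for the document stream; in practice this could be
    # an iterator over the lines of the extracted RealNews file.
    docs = (f"doc-{n}" for n in range(1000))
    subset = reservoir_sample(docs, k=10, seed=0)
    print(len(subset))
```

With the real data, `k` would be the target document count (for example 11,900,000) and `items` an iterator over the extracted documents. Since the sampling is random, a subset drawn this way will not match the paper's subset document for document.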