How is News corpus filtered from RealNews dataset #5

stevezheng23 · 2020-05-07T18:19:24Z

Hi @kernelmachine / @kyleclo, how is the News corpus filtered from the RealNews dataset? I tried extracting docs from RealNews, but got 32.80M docs instead of the 11.90M docs mentioned in the paper. Is there any additional filtering applied? Thanks!

kyleclo · 2020-05-07T20:12:35Z

Hey @stevezheng23, the number of documents we reported is constrained by the amount of DAPT we wanted to perform for the experiments. That is, we only used as many documents as were needed to perform our 12.5K gradient updates with a batch size of 2K. This does not use all of the RealNews dataset.

We didn't perform any intentional filtering, just a random subsampling.
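
For anyone trying to reproduce a comparable subset, here is a minimal sketch of uniform random subsampling. It is not the script used for the paper: the function names `reservoir_sample` and `sequence_budget`, the seed, and the reading of "2K" as exactly 2,000 are all assumptions made for illustration. How many documents a given sequence budget consumes depends on document length and maximum sequence length, which this thread does not state.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sequence_budget(num_updates: int, batch_size: int) -> int:
    """Number of training sequences consumed by a fixed number of updates."""
    return num_updates * batch_size


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw a uniform random sample of up to k items from a stream.

    Hypothetical helper: uses reservoir sampling (Algorithm R), so the
    full corpus never has to be held in memory at once.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Assumed values: 12.5K updates, and "2K" taken as exactly 2,000.
    print(sequence_budget(12_500, 2_000))
    # Toy stand-in for a document stream.
    print(reservoir_sample(range(100), k=5, seed=0))
```

In practice `items` would be an iterator over the RealNews documents (for example, lines of a JSONL file), and `k` would be chosen so the sampled documents cover the sequence budget.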

stevezheng23 · 2020-05-07T20:39:42Z

Got it, great to know that random subsampling is applied to get 11.90M docs, thanks!

stevezheng23 closed this as completed May 7, 2020
