
Size of weakly labeled data in paper/Annotate.ipynb and the number of sentences of 2021 PubMed Baseline don't match #1

Open
soochem opened this issue Aug 11, 2021 · 6 comments

Comments


soochem commented Aug 11, 2021

Hello,

I tried to annotate all PubMed sentences and struggled with the large number of sentences and with memory issues.
I changed the download_pubmed.sh code to retrieve the 2021 PubMed baseline, since we could not find the 2020 baseline:

URL=ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline
for i in $(seq -f "%04g" 1 1015); do
  GZFILE=pubmed21n${i}.xml.gz
  echo $URL/$GZFILE
  wget $URL/$GZFILE
  gzip -d $GZFILE
  XMLFILE=pubmed21n${i}.xml
...

We then ran into this issue: we retrieved only files 1-1015 of the 2021 PubMed baseline, and we assumed the data accumulates on top of the 2020 baseline, so we expected our retrieved data to have the same number of lines as yours.
However, we get a much larger number of lines for all_text and (un)labeled_lines than the outputs shown in your Annotate notebook.

Could you please advise on this mismatch in the number of PubMed sentences, and on the expected effect of the larger number of sentences?

Thank you in advance.

@soochem soochem changed the title Issue when using 2021 PubMed Baseline in Annotate.ipynb Size of weakly labeled data in paper/Annotate.ipynb and the number of sentences of 2021 PubMed Baseline don't match Aug 12, 2021
@HMJiangGatech
Contributor

Hi, the entire PubMed dataset is very large. The data used in our paper contains 11M/15M samples; it was subsampled so that it could be loaded into our machine's memory. In general, I believe you can use more data to achieve better performance. I would like to share the exact data we used, but due to policy we cannot redistribute it directly.


soochem commented Aug 19, 2021

Thank you for your helpful comments :)
I was just wondering how you sampled the desired number of weakly labeled samples.

  1. Does "11M/15M" mean the number of tokens in chem_weak.txt and disease_weak.txt (the outputs of Annotate.ipynb), respectively? I used <2% of the 2021 PubMed sentences and got 12M chemical tokens and 10.9M disease tokens.
  2. For Figure 2 in your paper, is "data size" the number of tokens randomly sampled from the 15M disease tokens? That could break the rule "leave the sentence with at least one weak entity label".

Thanks again!

@HMJiangGatech

The way I generated the data was to first use the notebook to generate all of it, and then randomly subsample it by 50%.

  1. "11M/15M" means the number of samples (i.e., sentences).
  2. They are the numbers of randomly subsampled samples, so this does not break the rule.
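The 50% sentence-level subsampling described above can be sketched in shell. This is a minimal sketch under two assumptions not stated in the thread: the weak-label files are CoNLL-style (one token per line, sentences separated by blank lines), and the file names (chem_weak.txt, chem_weak_half.txt) follow the notebook's outputs, which I have not verified.

```shell
# Randomly keep ~50% of sentences from a weakly labeled file.
# awk's paragraph mode (RS="") treats each blank-line-separated
# sentence as a single record, so sentences are kept or dropped
# whole and the "at least one weak entity label" property of each
# kept sentence is preserved.
awk 'BEGIN { RS=""; ORS="\n\n"; srand(42) } rand() < 0.5' \
    chem_weak.txt > chem_weak_half.txt
```

The fixed seed (srand(42)) makes the draw repeatable on a given awk implementation; drop it to get a different subsample each run.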


soochem commented Aug 30, 2021

I think I misunderstood the meaning of the term in your paper. Following your advice, I extracted the correct number of sentences (all_samples). Thank you for the kind reply :)

@zhiyuanpeng

> I think I misunderstood the meaning of the term in your paper. Following your advice, I extracted the correct number of sentences (all_samples). Thank you for the kind reply :)

Hi @soochem, did you get the same number of sentences as NEEDLE? I also used the 2021 PubMed baseline, but I got labeled_lines = 10779835 and unlabeled_lines = 35106422, which are much larger than the values generated from the 2020 baseline. How did you get the correct number of sentences from the 2021 data? Thanks.


soochem commented Nov 25, 2021

@zhiyuanpeng hello, I just used the 1015 text files of the 2021 PubMed baseline via for i in $(seq -f "%04g" 1 1015); do.
The total numbers of unlabeled_lines and labeled_lines are indeed larger than the numbers shown in the authors' code.
I then randomly sampled labeled_lines so that all_samples has the same size as the number of samples reported in the paper.
It may not be possible to obtain a data distribution exactly matching the authors' weak labels, but the model performance is the same as reported in the table.
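One way to draw an exact number of whole sentences, as described above, is sketched below. Assumptions (not confirmed by the thread): the files are CoNLL-style with blank-line-separated sentences, token lines contain no tab characters, GNU shuf is available, and the file names and target count N are illustrative rather than the authors' actual values.

```shell
N=1000  # illustrative target sentence count

# 1) Flatten each sentence onto one line (token lines joined by tabs),
# 2) shuffle the flattened sentences and take exactly N of them,
# 3) unflatten back to one token per line, blank line between sentences.
awk 'BEGIN { RS=""; FS="\n"; OFS="\t" } { $1=$1; print }' disease_weak.txt \
  | shuf -n "$N" \
  | awk '{ gsub(/\t/, "\n"); print $0 "\n" }' > all_samples.txt
```

Because whole sentences are shuffled, each sampled sentence keeps all of its token lines, so the "at least one weak entity label" property survives the sampling.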
