
Size of weakly labeled data in paper/Annotate.ipynb and the number of sentences of 2021 PubMed Baseline don't match #1

Open
soochem opened this issue Aug 11, 2021 · 6 comments

Comments


soochem commented Aug 11, 2021

Hello,

I tried to annotate all PubMed sentences and struggled with the large number of sentences and with memory issues.
I changed the download_pubmed.sh code to retrieve the 2021 PubMed baseline, since we could not find the 2020 baseline:

URL=ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline
for i in $(seq -f "%04g" 1 1015); do
  GZFILE=pubmed21n${i}.xml.gz
  echo $URL/$GZFILE
  wget $URL/$GZFILE
  gzip -d $GZFILE
  XMLFILE=pubmed21n${i}.xml
...

We then ran into this issue: we retrieved only files 1-1015 of the 2021 PubMed baseline, and we assumed the data accumulates on top of the 2020 baseline, so we expected our retrieved data to have the same number of lines as yours.
However, we get a much larger number of lines for all_text and (un)labeled_lines than the outputs shown in your Annotate notebook.

Could you please advise on this mismatch in the number of PubMed sentences, and on the expected effect of the larger number of sentences?

Thank you in advance.

@soochem soochem changed the title Issue when using 2021 PubMed Baseline in Annotate.ipynb Size of weakly labeled data in paper/Annotate.ipynb and the number of sentences of 2021 PubMed Baseline don't match Aug 12, 2021
@HMJiangGatech
Contributor

Hi, the entire PubMed dataset is very large. The data used in our paper contains 11M/15M samples; it was subsampled so that it could be loaded into our machine's memory. In general, I believe you can use more data to achieve better performance. I would like to share the exact data we used, but due to policy we cannot redistribute it directly.


soochem commented Aug 19, 2021

Thank you for your helpful comments :)
I was just wondering how you sampled the desired number of weakly labeled samples.

  1. Does "11M/15M" mean the number of tokens in chem_weak.txt and disease_weak.txt (the outputs of Annotate.ipynb), respectively? I used <2% of the 2021 PubMed sentences and got 12M chemical tokens and 10.9M disease tokens.
  2. For Figure 2 in your paper, is "data size" the number of tokens randomly sampled from the 15M disease tokens? That could break the rule "leave the sentence with at least one weak entity label".

Thanks again!

@HMJiangGatech

The way I generated the data was to first use the notebook to generate all of it, and then randomly subsample it by 50%.

  1. "11M/15M" means the number of samples (i.e., sentences).
  2. They are the numbers of randomly subsampled samples, so this does not break the rule.
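The 50% sentence-level subsampling described above can be sketched in shell. This is a minimal sketch under two assumptions not stated in the thread: the weak-label files are CoNLL-style (one token per line, sentences separated by blank lines), and the file names (chem_weak.txt, chem_weak_half.txt) follow the notebook's outputs, which I have not verified.

```shell
# Randomly keep ~50% of sentences from a weakly labeled file.
# awk's paragraph mode (RS="") treats each blank-line-separated
# sentence as a single record, so sentences are kept or dropped
# whole and the "at least one weak entity label" property of each
# kept sentence is preserved.
awk 'BEGIN { RS=""; ORS="\n\n"; srand(42) } rand() < 0.5' \
    chem_weak.txt > chem_weak_half.txt
```

The fixed seed (srand(42)) makes the draw repeatable on a given awk implementation; drop it to get a different subsample each run.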


soochem commented Aug 30, 2021

I think I misunderstood the meaning of the term in your paper. Following your advice, I extracted the correct number of sentences (all_samples). Thank you for the kind reply :)

@zhiyuanpeng

> I think I misunderstood the meaning of the term in your paper. Following your advice, I extracted the correct number of sentences (all_samples). Thank you for the kind reply :)

Hi @soochem, did you get the same number of sentences as NEEDLE? I also used the 2021 PubMed baseline, but I got labeled_lines = 10779835 and unlabeled_lines = 35106422, which are much larger than the values generated from the 2020 baseline. How did you get the correct number of sentences from the 2021 data? Thanks.


soochem commented Nov 25, 2021

@zhiyuanpeng hello, I just used the 1015 text files of the 2021 PubMed baseline via for i in $(seq -f "%04g" 1 1015); do.
The total numbers of unlabeled_lines and labeled_lines are indeed larger than the numbers shown in the authors' code.
I then randomly sampled labeled_lines so that all_samples has the same size as the number of samples reported in the paper.
It may not be possible to obtain a data distribution exactly matching the authors' weak labels, but the model performance is the same as reported in the table.
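One way to draw an exact number of whole sentences, as described above, is sketched below. Assumptions (not confirmed by the thread): the files are CoNLL-style with blank-line-separated sentences, token lines contain no tab characters, GNU shuf is available, and the file names and target count N are illustrative rather than the authors' actual values.

```shell
N=1000  # illustrative target sentence count

# 1) Flatten each sentence onto one line (token lines joined by tabs),
# 2) shuffle the flattened sentences and take exactly N of them,
# 3) unflatten back to one token per line, blank line between sentences.
awk 'BEGIN { RS=""; FS="\n"; OFS="\t" } { $1=$1; print }' disease_weak.txt \
  | shuf -n "$N" \
  | awk '{ gsub(/\t/, "\n"); print $0 "\n" }' > all_samples.txt
```

Because whole sentences are shuffled, each sampled sentence keeps all of its token lines, so the "at least one weak entity label" property survives the sampling.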
