WARNING:root:Found repetitions in sample 0 #11
I'm running into the same problem. There is nothing but
in the mmd file.
Hi @jacobmarks,
I will look into the failure detection heuristic again. Thank you for bringing this to my attention.
of lanthanum is 7/2, hence the nuclear magnetic moment as determined by this analysis is 2.5 nuclear magnetons. This is in fair agreement with the value 2.8 nuclear magnetons determined from La III hyperfine structures by the writer and N. S. Grace.9
Thanks for looking into this @lukas-blecher! Really appreciate your prompt response.
I am also getting a lot of 'MISSING_PAGE_FAIL:{n}]' in the output file and 'WARNING:root:Skipping page {n} due to repetitions.' in the terminal.
I'm experiencing a similar issue of missing pages during PDF to Markdown conversion, but with some nuances.
System Information
Issue Details
Additional Notes
Let me know if you need any more info.
I actually noticed this as well, that the quality of the output generated by Nougat was better in the Huggingface demo than on my own computer. Maybe this is a false positive though? Remember that Nougat was trained on Arxiv scientific papers in the STEM field, so if you feed it a magazine article in PDF format that only includes text and some graphics, don't be surprised if it fails here and there. Out-of-domain content seems to not be super robust yet. If your doc includes content that is not similar (enough) to the training data, it will not work correctly, just like a dog you've taught to sit will not necessarily stay.
Did you see my doc? It's pretty academic and Arxiv-like. I wonder if it's the architecture of my old hardware that's the issue.
I did see your doc. Looks like it should be processed perfectly fine? Did the shape of the LaTeX-rendered integral sign change drastically at one point? Unlikely to be a hardware issue IMO: the processes being run are the same, so the only difference with better hardware (video card) would be more VRAM and therefore lower inference time, not necessarily a change in output quality. Have you tried the Huggingface demo? How did its output compare with the results you achieved?
Hi, thanks for the detailed report. However, I was unable to reproduce it. My best guess is that it's the GPU. Can you try to convert some of the failed pages with CPU only (set batch size to 0)?
Edit: Was able to reproduce on CPU. Will investigate now, thanks!
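For anyone trying the CPU-only route, the invocation could look like the following sketch. The filename is a placeholder, and the exact flag spellings (`--batchsize`, `-o`) are assumptions based on this thread, so check `nougat --help` for your installed version:

```shell
# Assumption from this thread: a batch size of 0 switches inference to CPU only.
# "paper.pdf" and the flag names are placeholders; verify with `nougat --help`.
nougat paper.pdf --batchsize 0 -o out_dir
```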
I was able to confirm that this is again a case of a false-positive failure detection. I've added the `--no-skipping` flag.
@lukas-blecher what specifically does the `--no-skipping` flag do?
In short, it won't apply the failure detection heuristic described in the paper. I still haven't fully grasped the problem at hand, but for some reason PyTorch gives different values depending on the device you're using to compute. What this also means is that true positives won't be caught either. So you might get a lot of repetitions for out-of-domain PDFs (plus the computation time will be longer, because we aren't stopping in the middle of the generation anymore).
Is this also why the Huggingface demo seems to give me the best results? Where does that run? I'm pretty convinced that it does not run locally. So hardware seems to have an effect on the quality of the output, correct? Also, repetitions for out-of-domain documents are already a known issue, so with this new flag, will this get worse, stay the same, or improve?
The hardware doesn't really change the output, but it does change the repetition / failure detection. So you should get the same results as the HF demo. What will change for out-of-domain documents is that the repetition will not be detected during generation. There are still some rule-based postprocessing functions that will detect some of them, though.
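For intuition, a rule-based repetition check over decoded text might look like this minimal sketch. It is purely illustrative: nougat's real failure detection operates on the model's scores during generation, and the function name and thresholds here are made up.

```python
from collections import Counter

def has_repetitions(text: str, n: int = 10, threshold: int = 3) -> bool:
    """Return True if any word n-gram occurs more than `threshold` times.

    Illustrative only: nougat's actual heuristic inspects decoder scores
    during generation rather than the finished text.
    """
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count > threshold for count in ngrams.values())
```

A page whose output degenerates into a loop ("the cat sat on the mat the cat sat on the mat ...") would trip such a check, while normal prose would not.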
As I still continue to experience missing pages (which I'm sure will improve over time), I find myself turning to Claude (v2) and asking it to convert PDFs into the format I desire. It's very good at it.
FYI, I believe Claude 2 preprocesses PDF files using Mathpix (at least on the official website). |
Well, for whatever reason it was doing an excellent job. IDK if it was my GPU/CUDA compatibility or what. |
Same issue using nougat_api. Can I use `--no-skipping` somehow? How do I change this setting with the API?
Tried applying it to individual pages of the EPR paper https://cds.cern.ch/record/405662/files/PhysRev.47.777.pdf.
While the first page works and I get a printout of the text, pages 2-4 don't work. I get errors like this: