Troubles with multi-document entries #26
Comments
Hi matteosecli, yes, that's a legit complaint, something I overlooked when making this tool. No need for screenshots or PDFs, the issue is quite clear, but it might take some time for me to fix it. In the meantime, you could create a separate entry for the supplementary file in Mendeley, or use another tool if it handles this properly. Sorry for the trouble.
No worries, thank you! 😄
Xunius added a commit that referenced this issue on Jul 30, 2018
A big commit with quite some changes, intended to address the issue of a single doc having multiple attachment PDFs. See issue #26.

Add a `DocAnno` class in menotexport.py to replace the original role of `FileAnno`. A `DocAnno` may have more than one `FileAnno` object, each representing an attachment PDF. Correspondingly, exportpdf.exportAnnoPdf(), exportpdf.copyPdf() and menotexport.extractAnnos() have to do a nested loop through `DocAnno` objects and then `FileAnno` objects.

In the extracthl2.Anno class, add a `path` attribute. In extracthl2.extractHighlights2() and extractnt.extractNotes(), add `path` info to the extracted highlight/note.

In menotexport.py:
* getMetaData() now fetches `user_name`, `path`, `folder` and `tags`, and adds `folder` to the list of tags.
* getHighlights(), getNotes() and getDocNotes() now have a cleaner call signature, with doc id as the only filter. The output dict has one more layer of structure: `path` is used to distinguish highlights/notes made on different PDFs associated with a single doc.
* Deprecate `getOtherDocs()`, `getOtherCanonicalDocs()` and `processCanonicals()`. Use `processDocs()` to process both docs in folders and canonical docs. Some relevant changes in `main()` to achieve this merging.
* Use sqlite filtering in various places to replace the filtering/selecting previously done on pandas.DataFrame; hopefully the pandas dependency can be removed in the future.

In lib.tools, add makedirs(), which creates a dir on disk after removing the ':' symbol, which is illegal on Mac and Windows.

In extracthl2.findStrFromBox2(), read the output of `pdftotext` directly via stdout, which gives quite some speed-up.

Possible issue: the getDocNotes()-related code is not thoroughly tested, particularly when a side-bar note is made on a doc that has no attachments.
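The doc-to-attachment structure described in this commit message can be pictured with a minimal sketch. This is not the code from menotexport.py: the field names, the `add_highlight()` helper and `iter_file_annos()` are hypothetical and only illustrate one `DocAnno` owning several `FileAnno` objects keyed by `path`, plus the nested loop that export code would then need.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class FileAnno:
    """Annotations made on one attachment PDF (hypothetical fields)."""
    path: str
    highlights: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


@dataclass
class DocAnno:
    """One document entry, which may own several attachment PDFs."""
    docid: int
    title: str
    file_annos: Dict[str, FileAnno] = field(default_factory=dict)

    def add_highlight(self, path: str, text: str) -> None:
        # Create the per-file container on first use, keyed by PDF path,
        # so highlights from different attachments never get mixed.
        self.file_annos.setdefault(path, FileAnno(path)).highlights.append(text)


def iter_file_annos(doc_annos: Iterable[DocAnno]) -> Iterator[Tuple[DocAnno, FileAnno]]:
    """Nested loop: every document, then every attachment of that document."""
    for doc in doc_annos:
        for file_anno in doc.file_annos.values():
            yield doc, file_anno


if __name__ == "__main__":
    doc = DocAnno(docid=1, title="Example paper")
    doc.add_highlight("article.pdf", "highlight in the main text")
    doc.add_highlight("supplementary.pdf", "highlight in the supplement")
    for d, f in iter_file_annos([doc]):
        print(d.title, f.path, f.highlights)
```

Keying by `path` is what keeps the highlights of `article.pdf` from ending up in `supplementary.pdf`, which is the symptom reported in this issue.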
Xunius added a commit that referenced this issue on Aug 7, 2018
A big commit with quite some changes, intended to address the issue of a single doc having multiple attachment PDFs. See issue #26. Change notes (ordered in time), following those already listed for the Jul 30 commit:

In lib/extracthl2.py: have to call `tii=tools.deu(tii)` on the output of pdftotext. Previously I thought `pp.communicate()` sped things up, but that was only because some parts failed. After adding `tools.deu()` it is as slow as before.

In menotexport.py:
* In extractAnnos(), move extracthl2.checkPdftotext() out of the loops.
* Import the extracthl2 module at the top.

Since some version before 1.17.9, Mendeley saves side-bar notes (aka General Notes) in the Documents.note column instead of the DocumentNotes table, so the old fetching method won't get anything. See some discussions in issue #29 (#29 (comment)).

Another related issue regarding side-bar notes concerns docs with 0 or more than 1 attachments. In commit a5b3839 I enabled multiple attachment support but haven't fully tested these scenarios because of the above issue. This commit aims at solving this.

* In menotexport.py: notes fetched by the getNotes() function are assigned "isgeneralnote=False", while those fetched by getDocNotes() are assigned "isgeneralnote=True", to distinguish sticky notes from side-bar notes.
* In extracthl2.py: add an "isgeneralnote" attribute to the Anno class.
* In extractnt.py: pass in the "isgeneralnote" argument when creating Anno objects.
* In exportannotation.py and extracttags.py: before exporting note texts to txt files, remove duplicate side-bar notes. This is because when a doc has more than 1 attachment, the same side-bar notes are saved to each exported PDF, but the extracted note texts should keep only 1 copy. To remove duplicates, call the removeDupGeneralNotes() function defined in tools.py.
* In tools.py: definition of removeDupGeneralNotes().
* In menotexport.py: getDocNotes() queries the DocumentNotes table for old versions and also Documents.note for newer versions, and combines the outputs, because I don't know in which Mendeley version the separation occurred. The fetched side-bar notes are filtered by DOI pattern matching. See discussions in issue #29.

A minor fix: stdout printed None for docs without attachments; in such cases, use the doc title instead of the filename.
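The de-duplication step for side-bar notes can be sketched as follows. This is only an illustration under assumptions: the real removeDupGeneralNotes() in tools.py is not shown in this thread, so the `Note` record, its fields and the `remove_dup_general_notes()` name below are hypothetical. The idea is that a side-bar note belongs to the doc rather than to one PDF, so it is kept once per doc, while sticky notes are kept per attachment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    """Hypothetical note record; `isgeneralnote` marks side-bar notes."""
    docid: int
    path: str
    text: str
    isgeneralnote: bool


def remove_dup_general_notes(notes: List[Note]) -> List[Note]:
    """Keep every sticky note, but only the first copy of each side-bar note per doc."""
    seen = set()
    kept = []
    for note in notes:
        if note.isgeneralnote:
            # The same side-bar note is attached to every exported PDF of a
            # doc, so identify it by (doc id, text) and ignore the path.
            key = (note.docid, note.text)
            if key in seen:
                continue
            seen.add(key)
        kept.append(note)
    return kept


if __name__ == "__main__":
    notes = [
        Note(1, "article.pdf", "General remark on the paper", True),
        Note(1, "supplementary.pdf", "General remark on the paper", True),
        Note(1, "article.pdf", "Sticky note on page 2", False),
    ]
    for n in remove_dup_general_notes(notes):
        print(n.path, n.text)
```

Sticky notes are deliberately left alone even if their text repeats, since two attachments can legitimately carry identical sticky notes.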
It's quite common for papers to have "Supplementary Materials", most of the time additional PDF files with extra info. I usually save these extra materials as additional PDF files in the paper entry; however, it seems that Menotexport discards them during the export process. I have an entry with the following 2 PDF files:
article.pdf
supplementary.pdf
For some weird reason, Mendeley has decided that `supplementary.pdf` is the "primary" PDF file of the entry, i.e. the file that gets opened when you double-click on the paper entry in Mendeley. If I want to open the main text, I have to click on the second file in the file list of the entry, in Mendeley's right panel. Annoying, but not a big deal.

However, if I export the document with Menotexport, only the "primary" file (which in this case Mendeley has decided to be `supplementary.pdf`) gets exported. What's even stranger is that both files have their own notes; however, `supplementary.pdf` gets exported with the highlights belonging to `article.pdf` baked into it, which obviously looks like complete nonsense when I open it.

By digging around, I've found that this Python program, instead, does the export in the correct way (i.e., it exports both PDF files, each with its own highlights).
I can provide screenshots of the entry in Mendeley and/or the relevant PDF files, if you need them.