Troubles with multi-document entries #26

matteosecli · 2018-06-23T17:43:43Z

It's quite common for papers to have "Supplementary Materials", most of the time additional PDF files with extra info. I usually save these extra materials as additional PDF files in the paper entry; however, it seems that Menotexport discards them during the export process. I have an entry with the following 2 PDF files:

the article itself, say article.pdf
the supplementary materials, say supplementary.pdf

For some weird reason, Mendeley has decided that supplementary.pdf is the "primary" PDF file of the entry, i.e. the file that gets opened when you double-click on the paper entry in Mendeley. If I want to open the main text, I have to click on the second file in the file list of the entry, in Mendeley's right panel. Annoying, but not a big deal.

However, if I export the document with Menotexport, only the "primary" file (which in this case Mendeley has decided to be supplementary.pdf) gets exported. What's even stranger is that both files have their own notes; however, supplementary.pdf gets exported with the highlights belonging to article.pdf baked into it, which obviously looks like complete nonsense when I open it.

By digging around, I've found that this Python program, instead, does the export in the correct way (i.e., it exports both PDF files and each of them with its own highlights).

I can provide screenshots of the entry in Mendeley and/or the relevant PDF files, if you need them.

Xunius · 2018-06-24T09:30:23Z

Hi matteosecli,

Yes, that's a legit complaint, something I overlooked when making this tool. No need for screenshots or PDFs, the issue appears quite clear, but it might take some time for me to fix it. In the meantime, you could create a separate entry for the supplementary materials in Mendeley, or use some other tool if it does this properly.

Sorry for the trouble.

matteosecli · 2018-06-24T09:52:38Z

No worries, thank you! 😄

A big commit with quite some changes, intended to address the issue of a single doc having multiple attachment PDFs. See issue #26.

Add a `DocAnno` class in menotexport.py to replace the original role of `FileAnno`. A DocAnno may have more than 1 FileAnno objects, each representing an attachment PDF. Correspondingly, exportpdf.exportAnnoPdf(), exportpdf.copyPdf() and menotexport.extractAnnos() have to do a nested loop through DocAnno objects and then FileAnno objects.

In the extracthl2.Anno class, add a `path` attribute. In extracthl2.extractHighlights2() and extractnt.extractNotes(), add `path` info to the extracted highlight/note.

In menotexport.py:
* getMetaData() now fetches `user_name`, `path`, `folder` and `tags`, and adds `folder` to the list of tags.
* getHighlights(), getNotes() and getDocNotes() now have a cleaner call signature, with the doc id as the only filter. The output dict has 1 more layer of structure: `path` is used to distinguish highlights/notes made on different PDFs associated with a single doc.
* Deprecate `getOtherDocs()`, `getOtherCanonicalDocs()` and `processCanonicals()`. Use `processDocs()` to process both docs in folders and canonical docs. Some relevant changes in `main()` to achieve this merging.
* Use sqlite filtering in various places to replace the filtering/selecting done previously on pandas.DataFrame; hopefully the pandas dependency can be removed in the future.

In lib.tools, add makedirs(), which creates a dir on disk after removing the ':' symbol, which is illegal on Mac and Windows.

In extracthl2.findStrFromBox2(), read the output of `pdftotext` directly via stdout, which gives quite some speed-up.

Possible issue: the getDocNotes()-related code is not thoroughly tested, particularly when a side-bar note is made on a doc that has no attachments.
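To make the DocAnno/FileAnno relationship described in the commit message more concrete, here is a minimal Python 3 sketch. Only the class names `DocAnno` and `FileAnno` and the `path` attribute come from the commit message; every other field, the `add()` method and the `list_exports()` helper are hypothetical and are not taken from menotexport's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FileAnno:
    """Annotations made on one attachment PDF (field names are assumptions)."""
    path: str
    highlights: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


@dataclass
class DocAnno:
    """One library entry; holds one FileAnno per attached PDF, keyed by path."""
    docid: int
    title: str
    file_annos: Dict[str, FileAnno] = field(default_factory=dict)

    def add(self, path: str, highlight: Optional[str] = None,
            note: Optional[str] = None) -> None:
        # Hypothetical helper: route each annotation to the PDF it was made on.
        file_anno = self.file_annos.setdefault(path, FileAnno(path))
        if highlight is not None:
            file_anno.highlights.append(highlight)
        if note is not None:
            file_anno.notes.append(note)


def list_exports(doc_annos: List[DocAnno]) -> List[Tuple[int, str, int, int]]:
    """Nested loop: over docs first, then over each doc's attachment PDFs."""
    exports = []
    for doc in doc_annos:
        for path, file_anno in doc.file_annos.items():
            exports.append((doc.docid, path,
                            len(file_anno.highlights), len(file_anno.notes)))
    return exports


doc = DocAnno(docid=1, title="Some paper")
doc.add("article.pdf", highlight="a highlight on the main text")
doc.add("supplementary.pdf", note="a sticky note on the supplement")
for row in list_exports([doc]):
    print(row)
```

Keying the per-file annotations by `path` is what keeps the highlights of article.pdf from ending up in supplementary.pdf, which is the symptom reported at the top of this issue.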

A big commit with quite some changes, intended to address the issue of a single doc having multiple attachment PDFs. See issue #26.

Change notes (ordered in time):

Add a `DocAnno` class in menotexport.py to replace the original role of `FileAnno`. A DocAnno may have more than 1 FileAnno objects, each representing an attachment PDF. Correspondingly, exportpdf.exportAnnoPdf(), exportpdf.copyPdf() and menotexport.extractAnnos() have to do a nested loop through DocAnno objects and then FileAnno objects.

In the extracthl2.Anno class, add a `path` attribute. In extracthl2.extractHighlights2() and extractnt.extractNotes(), add `path` info to the extracted highlight/note.

In menotexport.py:
* getMetaData() now fetches `user_name`, `path`, `folder` and `tags`, and adds `folder` to the list of tags.
* getHighlights(), getNotes() and getDocNotes() now have a cleaner call signature, with the doc id as the only filter. The output dict has 1 more layer of structure: `path` is used to distinguish highlights/notes made on different PDFs associated with a single doc.
* Deprecate `getOtherDocs()`, `getOtherCanonicalDocs()` and `processCanonicals()`. Use `processDocs()` to process both docs in folders and canonical docs. Some relevant changes in `main()` to achieve this merging.
* Use sqlite filtering in various places to replace the filtering/selecting done previously on pandas.DataFrame; hopefully the pandas dependency can be removed in the future.

In lib.tools, add makedirs(), which creates a dir on disk after removing the ':' symbol, which is illegal on Mac and Windows.

In extracthl2.findStrFromBox2(), read the output of `pdftotext` directly via stdout, which gives quite some speed-up.

In lib/extracthl2.py: have to call `tii=tools.deu(tii)` on the output of pdftotext. Previously I thought `pp.communicate()` sped things up, but it was just because some parts failed. After adding `tools.deu()` it is as slow as before.

In menotexport.py:
* In extractAnnos(), move extracthl2.checkPdftotext() out of the loops.
* Import the extracthl2 module at the top.

Since some version before 1.17.9, Mendeley saves side-bar notes (aka General Notes) in the Documents.note column instead of the DocumentNotes table, so the old fetching method won't get anything. See some discussions in issue #29 (#29 (comment)).

Another related issue regarding side-bar notes concerns docs with 0 or more than 1 attachments. In commit a5b3839 I enabled multiple attachment support but haven't fully tested these scenarios because of the above issue. This commit aims at solving this.

In menotexport.py: notes fetched by the getNotes() functions are assigned "isgeneralnote=False", while those fetched by getDocNotes() are assigned "isgeneralnote=True", to distinguish sticky notes from side-bar notes.

In extracthl2.py: add an "isgeneralnote" attribute to the Anno class.

In extractnt.py: pass in the "isgeneralnote" argument when creating Anno objects.

In exportannotation.py and extracttags.py: before exporting note texts to txt files, remove duplicate side-bar notes. This is because when a doc has more than 1 attachment, the same side-bar notes are saved to each exported PDF, but the extracted note texts should keep only 1 copy. To remove duplicates, call the removeDupGeneralNotes() function defined in tools.py.

In tools.py: definition of removeDupGeneralNotes().

In menotexport.py: getDocNotes() queries the DocumentNotes table for old versions and also Documents.note for newer versions, and combines the outputs, because I don't know in which Mendeley version the separation occurred. The fetched side-bar notes are filtered by DOI pattern matching. See discussions in issue #29.

A minor fix: stdout printed None for docs without attachments; in such cases, use the doc title instead of the filename.
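The de-duplication of side-bar notes described above could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the project's removeDupGeneralNotes(): the `Anno` namedtuple is a stand-in for extracthl2.Anno with only three fields, the function name `remove_dup_general_notes` is hypothetical, and treating two side-bar notes as duplicates when their text is identical is a guess at the matching rule.

```python
from collections import namedtuple

# Stand-in for extracthl2.Anno; the real class has more attributes.
Anno = namedtuple("Anno", ["text", "path", "isgeneralnote"])


def remove_dup_general_notes(annos):
    """Keep every sticky note, but only the first copy of each side-bar note.

    Assumption: two side-bar (general) notes are duplicates when their
    text is identical, regardless of which attachment PDF they came from.
    """
    seen_texts = set()
    kept = []
    for anno in annos:
        if anno.isgeneralnote:
            if anno.text in seen_texts:
                continue  # same side-bar note already kept for another PDF
            seen_texts.add(anno.text)
        kept.append(anno)
    return kept


annos = [
    Anno("General remark on the paper", "article.pdf", True),
    Anno("Sticky note on page 2", "article.pdf", False),
    Anno("General remark on the paper", "supplementary.pdf", True),
    Anno("Sticky note on page 2", "supplementary.pdf", False),
]
for anno in remove_dup_general_notes(annos):
    print(anno.path, "|", anno.text)
```

Sticky notes are deliberately left alone even when their text repeats, since they belong to a specific PDF, whereas a side-bar note belongs to the whole entry and is merely copied onto each exported file.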

matteosecli · 2018-08-09T11:09:16Z

Closing as fixed in a5b3839, f8560c4 and e9abd4f.

matteosecli mentioned this issue Jul 22, 2018

Export author in highlights and sticky notes #27

Closed

matteosecli mentioned this issue Aug 4, 2018

Wrong coordinate ordering in "Rectangle Highlight" #29

Closed

matteosecli closed this as completed Aug 9, 2018
