Troubles with multi-document entries #26

Closed
matteosecli opened this issue Jun 23, 2018 · 3 comments

@matteosecli
Contributor

It's quite common for papers to have "Supplementary Materials", most of the time additional PDF files with extra info. I usually save these extra materials as additional PDF files in the paper entry; however, it seems that Menotexport discards them during the export process. I have an entry with the following 2 PDF files:

  • the article itself, say article.pdf
  • the supplementary materials, say supplementary.pdf

For some weird reason, Mendeley has decided that supplementary.pdf is the "primary" PDF file of the entry, i.e. the file that gets opened when you double-click on the paper entry on Mendeley. If I want to open the main text, I have to click on the second file on the file list of the entry, in Mendeley's right panel. Annoying, but not a big deal.

However, if I export the document with Menotexport, only the "primary" file (which in this case Mendeley has decided to be supplementary.pdf) gets exported. Even stranger, both files have their own notes, yet supplementary.pdf gets exported with the highlights belonging to article.pdf baked into it, which obviously looks like complete nonsense when I open it.

By digging around, I've found that this Python program, on the other hand, does the export correctly (i.e., it exports both PDF files, each with its own highlights).

I can provide screenshots of the entry in Mendeley and/or the relevant PDF files, if you need them.

@Xunius
Owner

Xunius commented Jun 24, 2018

Hi matteosecli,

Yes, that's a legit complaint, something I overlooked when making this tool. No need for screenshots or PDFs, the issue is quite clear, but it might take some time for me to fix it. In the meantime, you could create a separate entry for the supplementary in Mendeley, or use some other tool if it does this properly.

Sorry for the trouble.

@matteosecli
Contributor Author

No worries, thank you! 😄

Xunius added a commit that referenced this issue Jul 30, 2018
A big commit with quite some changes, intended to address
the issue of a single doc having multiple attachment PDFs.
See issue #26.

Add a `DocAnno` class in menotexport.py to replace the
original role of `FileAnno`, a DocAnno may have more than
1 FileAnno objs, each representing an attachment PDF.

Correspondingly, exportpdf.exportAnnoPdf(), exportpdf.copyPdf(),
menotexport.extractAnnos() have to do a nested loop through DocAnno objs
and then FileAnno objs.
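
The DocAnno/FileAnno relationship described above can be pictured with a minimal sketch. This is not the code from menotexport.py: the field names (`docid`, `title`, `file_annos`, `highlights`, `notes`) and the helper `iter_annotated_files()` are assumptions made for illustration, and it uses Python 3 dataclasses for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class FileAnno:
    """Annotations made on one attachment PDF (fields are illustrative)."""
    path: str
    highlights: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


@dataclass
class DocAnno:
    """One library entry, holding one FileAnno per attachment PDF."""
    docid: int
    title: str
    file_annos: Dict[str, FileAnno] = field(default_factory=dict)

    def add(self, file_anno: FileAnno) -> None:
        # Key by path so annotations of different PDFs never get mixed up.
        self.file_annos[file_anno.path] = file_anno


def iter_annotated_files(
    doc_annos: List[DocAnno],
) -> Iterator[Tuple[int, str, FileAnno]]:
    """Nested loop: every document, then every attachment of that document."""
    for doc in doc_annos:
        for path, file_anno in doc.file_annos.items():
            yield doc.docid, path, file_anno
```

With this shape, an entry holding article.pdf and supplementary.pdf yields two items, each carrying only its own highlights and notes.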

In extracthl2.Anno class, add `path` attribute.
In extracthl2.extractHighlights2() and extractnt.extractNotes(),
add `path` info to the extracted highlight/note.

In menotexport.py:

    * getMetaData() now fetches `user_name`, `path`, `folder` and `tags`,
    and adds `folder` to the list of tags.
    * getHighlights(), getNotes() and getDocNotes() now have a cleaner
    call signature, with doc id as the only filter. The output dict
    has 1 more layer of structure: use `path` to distinguish
    highlights/notes made on different pdfs associated with a single
    doc.
    * Deprecate `getOtherDocs()`, `getOtherCanonicalDocs()` and
    `processCanonicals()`. Use `processDocs()` to process both docs in
    folders and canonical docs. Some relevant changes in `main()` to
    achieve this merging.
    * Use sqlite filtering in various places to replace the
    filtering/selecting done previously on pandas.DataFrame, hopefully
    the pandas dependency can be removed in the future.
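
A rough sketch of the "filter by doc id in sqlite, then key the result by `path`" idea from the list above. The table `highlights(doc_id, path, page, text)` is a toy schema invented for this example, not Mendeley's actual database layout, and `get_highlights_by_path()` is a hypothetical name.

```python
import sqlite3
from collections import defaultdict


def get_highlights_by_path(conn: sqlite3.Connection, docid: int) -> dict:
    """Return {path: [(page, text), ...]} for one document.

    The filtering happens in SQL (WHERE doc_id = ?) rather than on a
    pandas.DataFrame, and the extra dict layer is keyed by PDF path.
    """
    rows = conn.execute(
        "SELECT path, page, text FROM highlights "
        "WHERE doc_id = ? ORDER BY path, page",
        (docid,),
    )
    grouped = defaultdict(list)
    for path, page, text in rows:
        grouped[path].append((page, text))
    return dict(grouped)
```

Keying by path is what keeps the highlights of article.pdf from being baked into supplementary.pdf.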

In lib.tools, add makedirs() that creates a dir on disk after removing the ':'
symbol, which is illegal on Mac and Windows.
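
For illustration only, a helper of this kind could look like the sketch below. The name `makedirs_no_colon()` and the handling of a Windows drive prefix are assumptions; the actual lib.tools.makedirs() may differ.

```python
import os


def makedirs_no_colon(path: str) -> str:
    """Create `path` on disk with every ':' removed, and return the new path.

    A Windows drive prefix such as 'C:' is split off first so that its
    colon is preserved; on POSIX systems the drive part is simply empty.
    """
    drive, rest = os.path.splitdrive(path)
    safe_path = drive + rest.replace(":", "")
    os.makedirs(safe_path, exist_ok=True)  # no error if it already exists
    return safe_path
```

A folder named after a paper title like "Title: a subtitle" would then be created as "Title a subtitle".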

In extracthl2.findStrFromBox2(), read output of `pdftotext` directly via
stdout, quite some speed up.

Possible issue: the getDocNotes()-related code is not thoroughly tested,
particularly when a side-bar note is made on a doc that has no
attachments.
Xunius added a commit that referenced this issue Aug 7, 2018
A big commit with quite some changes, intended to address
the issue of a single doc having multiple attachment PDFs.
See issue #26.

Change notes (ordered in time):

Add a `DocAnno` class in menotexport.py to replace the
original role of `FileAnno`, a DocAnno may have more than
1 FileAnno objs, each representing an attachment PDF.

Correspondingly, exportpdf.exportAnnoPdf(), exportpdf.copyPdf(),
menotexport.extractAnnos() have to do a nested loop through DocAnno objs
and then FileAnno objs.

In extracthl2.Anno class, add `path` attribute.
In extracthl2.extractHighlights2() and extractnt.extractNotes(),
add `path` info to the extracted highlight/note.

In menotexport.py:

    * getMetaData() now fetches `user_name`, `path`, `folder` and `tags`,
    and adds `folder` to the list of tags.
    * getHighlights(), getNotes() and getDocNotes() now have a cleaner
    call signature, with doc id as the only filter. The output dict
    has 1 more layer of structure: use `path` to distinguish
    highlights/notes made on different pdfs associated with a single
    doc.
    * Deprecate `getOtherDocs()`, `getOtherCanonicalDocs()` and
    `processCanonicals()`. Use `processDocs()` to process both docs in
    folders and canonical docs. Some relevant changes in `main()` to
    achieve this merging.
    * Use sqlite filtering in various places to replace the
    filtering/selecting done previously on pandas.DataFrame, hopefully
    the pandas dependency can be removed in the future.

In lib.tools, add makedirs() that creates a dir on disk after removing the ':'
symbol, which is illegal on Mac and Windows.

In extracthl2.findStrFromBox2(), read output of `pdftotext` directly via
stdout, quite some speed up.

In lib/extracthl2.py:
    have to call `tii=tools.deu(tii)` on the output of pdftotext.
    Previously I thought `pp.communicate()` sped things up, but that was
    just because some parts failed. After adding `tools.deu()` it is
    as slow as before.
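
As a hedged sketch of reading `pdftotext` output from stdout: passing `-` as the output file makes pdftotext write to stdout, and the captured bytes have to be decoded before use. The function names and the choice of a plain UTF-8 decode are assumptions here; the commit message does not spell out what `tools.deu()` does.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page, x, y, width, height):
    """Command that extracts the text of one rectangle on one page to stdout."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),    # first and last page
        "-x", str(x), "-y", str(y),          # top-left corner of the crop box
        "-W", str(width), "-H", str(height), # crop box size
        pdf_path,
        "-",                                 # '-' sends the text to stdout
    ]


def text_from_box(pdf_path, page, x, y, width, height):
    """Run pdftotext and return the cropped text as a str."""
    proc = subprocess.Popen(
        build_pdftotext_cmd(pdf_path, page, x, y, width, height),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode("utf-8", errors="replace"))
    # communicate() returns bytes; decode them before any string matching.
    return out.decode("utf-8", errors="replace")
```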

In menotexport.py:
    In extractAnnos(), move extracthl2.checkPdftotext() out from
    loops

Import extracthl2 module at top.

Starting from some version before 1.17.9, Mendeley saves side-bar notes
(aka General Notes) in the Documents.note column instead of the
DocumentNotes table, so the old fetching method won't get anything.
See the discussion in issue #29 (#29 (comment))

Another related issue regarding side-bar notes concerns docs
with 0 or more than 1 attachments. In commit a5b3839 I
enabled multiple attachment support but hadn't fully tested these
scenarios because of the above issue. This commit aims at solving
this.

In menotexport.py:

    Notes fetched by the getNotes() function are assigned
    "isgeneralnote=False", while those fetched by getDocNotes()
    are assigned "isgeneralnote=True", to distinguish
    sticky notes from side-bar notes.

In extracthl2.py:

    Add to the Anno class "isgeneralnote" attribute.

In extractnt.py:

    Pass in the "isgeneralnote" argument when creating Anno
    objs

In exportannotation.py and extracttags.py:

    Before exporting note texts to txt files, remove duplicate
    side-bar notes. This is because when a doc has >1 attachments,
    the same side-bar notes are saved to each exported pdf. But
    extracted note texts should only keep 1 copy.
    To remove duplicates, call removeDupGeneralNotes() func
    defined in tools.py

In tools.py:

    Def of removeDupGeneralNotes()
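
A minimal sketch of what such a de-duplication step might look like. The `Note` record and the function name `remove_dup_general_notes()` are illustrative assumptions, not the actual removeDupGeneralNotes() from tools.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    docid: int
    path: str            # attachment PDF the note was exported with
    text: str
    isgeneralnote: bool  # True for side-bar notes, False for sticky notes


def remove_dup_general_notes(notes: List[Note]) -> List[Note]:
    """Keep one copy of each side-bar note per doc; keep all sticky notes."""
    seen = set()
    kept = []
    for note in notes:
        if note.isgeneralnote:
            key = (note.docid, note.text)
            if key in seen:
                continue  # same side-bar note, already kept via another PDF
            seen.add(key)
        kept.append(note)
    return kept
```

Sticky notes are never dropped, even if two of them happen to have identical text, since they belong to specific positions in specific PDFs.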

In menotexport.py:

    getDocNotes() queries the DocumentNotes table for old versions
    and also Documents.note for newer versions, and combines the
    outputs, because I don't know in which Mendeley version the
    separation occurred.

    The fetched side-bar notes are filtered by a DOI pattern
    matching. See discussions in issue #29.

    A minor fix: stdout printed None for docs without attachments;
    in such cases, use the doc title instead of the filename.
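
The "query both places and combine" approach described for getDocNotes() can be sketched as follows. Only the table name DocumentNotes and the column Documents.note come from the commit message; the columns `DocumentNotes.documentId`, `DocumentNotes.text` and `Documents.id`, as well as the function name, are assumptions, and the DOI pattern filtering is left out.

```python
import sqlite3


def get_doc_notes(conn: sqlite3.Connection, docid: int) -> list:
    """Collect side-bar notes of one doc from both the old and new locations."""
    notes = []
    # Older Mendeley versions: notes live in the DocumentNotes table
    # (column names assumed for this sketch).
    for (text,) in conn.execute(
        "SELECT text FROM DocumentNotes WHERE documentId = ?", (docid,)
    ):
        if text:
            notes.append(text)
    # Newer versions: the note is stored in the Documents.note column.
    for (text,) in conn.execute(
        "SELECT note FROM Documents WHERE id = ?", (docid,)
    ):
        if text and text not in notes:
            notes.append(text)
    return notes
```

Querying both locations avoids having to know which Mendeley version wrote the database.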
@matteosecli
Contributor Author

Closing as fixed in a5b3839, f8560c4 and e9abd4f.
