Add apkg import/export on backend #1743

Merged: 140 commits merged into ankitects:main on May 2, 2022

Conversation

@RumovZ (Collaborator) commented Mar 27, 2022

This is pretty much untested. My plan is to do the apkg import part next, then add extensive roundtrip tests and fix it up.
What do you think so far? Should I do anything differently?

@dae (Member) commented Mar 28, 2022

The approach I had in mind was a little different, but that's my fault for not communicating it properly - sorry :-(

Rather than copying rows directly and then mutating the resulting database like the current Python code does, it might make things easier to reason about and maintain if we instead deserialize items into our standard structs like Note, Notetype and so on, and then serialize them again to the output DB - eg the equivalent of for_each_note_in_search(|note| output.storage.add_or_update_note(note)). It won't be as efficient as copying rows directly, though we would win back some of the performance by not having to apply updates after rows have been copied (modifying tags/fields, resetting scheduling/flags, etc).
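
As a rough sketch of that shape (for_each_note_in_search is an assumed helper here, and none of this is the PR's actual code):

```rust
// Hypothetical sketch only: deserialize into the standard structs and write
// them straight into the output collection. `for_each_note_in_search` does
// not exist yet (only a card variant does), and transforms such as tag
// stripping or field blanking would slot in before the write.
fn copy_notes(src: &mut Collection, output: &mut Collection, search: &str) -> Result<()> {
    src.for_each_note_in_search(search, |mut note| {
        note.tags.retain(|tag| tag != "marked"); // example transform
        output.storage.add_or_update_note(&note)
    })
}
```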

The gathering and transforms are also something we might want to reuse. For example, at one point in the future it might be nice to have a separate import/export format that is more easily read by third-party apps. Logic like resetting cards, excluding revlog entries, renaming out-of-tree decks or blanking certain fields is something that could be applied to such other formats as well.

A simple approach would be to gather and transform all the data up-front, then have the exporter just responsible for writing out the gathered data. A more memory-friendly approach would be to have the exporter able to deal with the items in a streaming fashion, but that does add more complexity, and decks over 500MB are exceedingly rare, so a decent case could be made for the former approach too.

For either approach, if the apkg exporter was then implemented as an Exporter trait that accepts the gathered+modified data and writes it out, that would make adding future formats easier. We may even want to use it for csv exports as well - as while the scheduling info is currently irrelevant, removing the marked tag and redacting some fields are things we may still want to do in the future.

WDYT?

@dae (Member) commented Mar 28, 2022

(Getting a minimum implementation of apkg importing done will ease testing, and I think it's a good idea to consider both import and export sides of the equation at the same time, so please feel free to focus on that first before considering the suggestions above)

@RumovZ (Collaborator, author) commented Mar 28, 2022

I see, the colpkg is actually a special case, and other formats should have a common gather step. Does the same go for importing? I.e. should importers gather data into native structs, and then have a single routine to commit these structs to db?

You mentioned an export trait. I guess removing scheduling information, and more customisable stuff would be done in the gather step. Then what would be the shared behaviour of the implementors of that trait?
I guess defining a trait doesn't make sense anyway, before there is a second user, but I just want to make sure I understand.

Not sure if I will take care of importing and tests first, because then I'll have to debug this code before reworking it as well. 😅

@dae (Member) commented Mar 28, 2022

With the Python code, there is something a bit like that. apkg is treated specially, but the other importers like CSV, Mnemosyne and SuperMemo are all based on NoteImporter. They gather the data into a common format, which NoteImporter then applies to the DB. Ideally our new code would unify the apkg import case with the others, if it is practical to do so.

For the export trait, I'd just been thinking of a serialization/visitor-like interface, eg:

fn export_notes(&self, notes: Vec<Note>)
fn export_cards(&self, cards: Vec<Card>)

Then each format exporter could decide how it would turn the gathered and mutated cards/notes/etc into the desired format (whether it places all the provided objects in an .anki2 file and zips it up, writes them out to a csv file, etc). Each exporter could ignore parts that aren't relevant (eg no notetype info in a csv), and would probably need to retain state throughout the export (eg a json exporter might need to build up a single giant dict with the info).
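
If a trait does emerge later, it might end up shaped roughly like this (names illustrative only; &mut self because, as noted, exporters will likely need to retain state):

```rust
// Illustrative only - a visitor-like trait over the gathered and already
// transformed data. Formats ignore what they don't need, and finish() writes
// the accumulated result out (zip up the .anki2, close the csv, ...).
trait Exporter {
    fn export_notes(&mut self, notes: Vec<Note>) -> Result<()>;
    fn export_cards(&mut self, cards: Vec<Card>) -> Result<()>;
    fn export_notetypes(&mut self, _notetypes: Vec<Notetype>) -> Result<()> {
        Ok(()) // default no-op for formats that don't care (eg csv)
    }
    fn finish(&mut self) -> Result<()>;
}
```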

Skipping the trait while prototyping sounds wise; we can easily add it later.

@RumovZ (Collaborator, author) commented Mar 29, 2022

Is this more like what you had in mind?
For importing, I imagine the apkg would just use the current gather code. Other formats would need to do more work, but share some routines among them, of course. Then there'd be a preparation step for handling ids, usns etc, which should be pretty much the same for all formats. After that, the insert module can handle the rest.

To be honest, I still don't understand the benefit of an export trait, but maybe it'll dawn on me as I proceed.

@dae mentioned this pull request on Mar 29, 2022
with_scheduling: bool,
) -> Result<()> {
self.decks = col.gather_decks(deck_id)?;
let search = optional_deck_search(deck_id);

dae (Member):

The current Python code allows the caller to pass in either the whole collection, a specific deck, or a list of cards (which is actually whole notes, as the browser uses .selectedCardsAsNotes()). It would be nice to support the selected notes use case as well.

One option would be to require the caller to use search_notes_into_table() prior to calling gather_data(). That covers all three cases, and allows this code to not be concerned with how exactly the notes have been selected, or need the separate siblings search. It also means our queries can reference the search_nids table instead of needing to construct and parse large sqlite statements that use ids_to_string().

We may need some more helper functions here - currently there's a for_each_card_in_search(), but there isn't one for notes, or one for returning cards matched by search_nids.

To preserve the "rewrite decks outside selected deck" functionality for the deck export case, we may still need to accept an optional deck_id argument. And one option would be to call search_notes_into_table() automatically when a deck has been provided.
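
As an illustration of the missing helper (the method name, the row_to_card conversion and the exact SQL are assumptions, but the point is that the query can simply join against search_nids):

```rust
impl SqliteStorage {
    // Assumed sketch: visit every card whose note is in the temporary
    // search_nids table, avoiding ids_to_string() entirely.
    pub(crate) fn for_each_card_of_searched_notes<F>(&self, mut func: F) -> Result<()>
    where
        F: FnMut(Card) -> Result<()>,
    {
        let mut stmt = self.db.prepare_cached(
            "SELECT * FROM cards WHERE nid IN (SELECT nid FROM search_nids)",
        )?;
        let mut rows = stmt.query([])?;
        while let Some(row) = rows.next()? {
            // row_to_card() stands in for whatever conversion the real code uses
            func(row_to_card(row)?)?;
        }
        Ok(())
    }
}
```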

RumovZ (Collaborator, author):

Makes sense. Maybe export_apkg() can just accept a search? The backend code could expose a oneof of reasonable options (a deck id, all, note ids...), and pass on the corresponding search node.
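
Roughly like this, perhaps (the variant names and SearchNode constructors below are placeholders, not the real proto/API):

```rust
// Placeholder sketch of the backend "oneof": each option is reduced to a
// search node before gathering starts.
enum ExportLimit {
    WholeCollection,
    DeckId(DeckId),
    NoteIds(Vec<NoteId>),
}

fn search_for_limit(limit: ExportLimit) -> SearchNode {
    match limit {
        ExportLimit::WholeCollection => SearchNode::WholeCollection,
        // a deck search should also match its subdecks
        ExportLimit::DeckId(did) => SearchNode::from_deck_id(did, true),
        ExportLimit::NoteIds(nids) => SearchNode::from_note_ids(nids),
    }
}
```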

RumovZ (Collaborator, author):

Just remembered why I chose a deck-based approach: The user might want to export a top-level deck where all cards are in subdecks. Then we would also want to export the "empty" parent.
If we accepted a SearchNode, we could check if it's a single deck search, and export empty parent decks accordingly. Not the cleanest solution, but accepting both note ids and a deck id isn't pretty, either.

dae (Member):

Accepting a SearchNode sounds good.

Good point about needing the parents, but that should apply in the specific notes case as well - we must implicitly include all parent decks of the decks that have been used, as the deck tree code in the importing Anki will expect the parents to exist.
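
A sketch of that post-gather step (deck names are assumed to be "::"-separated human names here, and the Deck constructor is approximate):

```rust
use std::collections::HashSet;

// Hypothetical sketch: after gathering, ensure every ancestor of an exported
// deck is present, adding an empty stub where it isn't, so the importing
// client's deck tree code finds the parents it expects.
fn add_missing_parents(decks: &mut Vec<Deck>) {
    let mut have: HashSet<String> = decks.iter().map(|d| d.name.clone()).collect();
    let mut stubs = vec![];
    for deck in decks.iter() {
        // assumes name is the human "::"-separated form
        let parts: Vec<&str> = deck.name.split("::").collect();
        for end in 1..parts.len() {
            let ancestor = parts[..end].join("::");
            if have.insert(ancestor.clone()) {
                stubs.push(ancestor);
            }
        }
    }
    for name in stubs {
        let mut deck = Deck::new_normal(); // approximate constructor
        deck.name = name;
        decks.push(deck);
    }
}
```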

RumovZ (Collaborator, author):

97c9dd4 removes the feature of specifying a deck and moving siblings outside that deck into it. Can we get away with that?
First of all, it won't affect the majority of users who don't have siblings in different decks. And for those who do, including cards outside the selected deck may not be the expected behaviour anyway.
We could show a warning if an export includes decks and cards outside the selected decks.

dae (Member):

I was mainly thinking of the shared decks case, where it's the users who import the deck who tend to encounter the problems/confusion, rather than the person exporting them (similar to our template checks). Showing an error on export if the user has elected to export a particular deck and cards are outside it may even be a better option than moving cards, as it avoids surprising behaviour, and is simpler than attempting to move the cards to a different deck. And for local exports, users could do them via the browse screen if they wanted to avoid the error.

RumovZ (Collaborator, author):

I assume then there'd be a separate routine to check for these things in advance? I think something like this would be required anyway if we want the export process to be more customisable. E.g. we could detect notetype conflicts in advance, and let the user decide on a resolution strategy (mapping or skipping).

dae (Member):

Wouldn't notetype conflict checking need to be done at import time? Maybe you have something else in mind than I'm imagining?

I'm not too fussed about the timing of the check, though it would be nice to at least do prior to all media being exported.

RumovZ (Collaborator, author):

Sorry, I got confused with import/export. For both cases, some routine for preparing/checking that is called before the import/export call by the frontend seems to make sense. Then we can be more naive in the core routines. In this case, we can keep just exporting all decks of matching cards.

dae (Member):

Yep, sounds good 👍

@dae (Member) commented Mar 30, 2022

> Is this more like what you had in mind?

Yep, looking good :-)

> For importing, I imagine the apkg would just use the current gather code. Other formats would need to do more work, but share some routines among them, of course. Then there'd be a preparation step for handling ids, usns etc, which should be pretty much the same for all formats. After that, the insert module can handle the rest.

Sounds good as a high-level plan. The preparation step will be the tricky bit - the existing apkg import logic is somewhat complicated and badly designed. For example:

  • Notes are matched with existing ones based on a string guid instead of id. This was done to reduce the chances of a conflict that a millisecond timestamp would have, but the 64 bit guid should have been larger. It was stored as a string because interoperability with Javascript was a concern at the time.
  • When an existing note is found, importing will refuse to update it if the notetype's fields or template names differ, as we don't want to accidentally write a note out with 2 fields if the notetype says it should have 3.
  • When a new note is to be added, we try to copy the notetype over from the source collection, preserving its id. But if the notetype of the source collection and the current collection have different schemas, we need to copy the source notetype into the collection under a new id, and rewrite the ntid of the note. Instead of assigning a random new number, the code instead tries to increment the ntid by one - the logic being that if an updated deck is imported again later, we could walk the ntids starting at the original one until we locate the copied notetype, allowing it to be reused/updated instead of a new notetype being created each time (see the sketch after this list). But the current way this is done feels a bit complex, and I wonder whether there might be a better way to do it.
  • When importing shared decks, two different decks may have used "banana.jpg" with different file content in each. If that file already exists in the collection with different contents, we need to adjust any notes that reference it to point to the new filename instead.
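
A hedged sketch of the ntid walk described in the third point (schemas_match() and add_notetype_with_id() are placeholders, not real rslib calls):

```rust
// Sketch of the existing "increment the id until it fits" remapping, not a
// proposal: starting at the source notetype's id, reuse the first existing
// notetype whose schema matches, otherwise copy the source notetype in at
// the first free id.
fn remap_notetype(col: &mut Collection, src: &Notetype) -> Result<NotetypeId> {
    let mut candidate = src.id;
    loop {
        match col.get_notetype(candidate)? {
            Some(existing) if schemas_match(&existing, src) => return Ok(candidate),
            Some(_) => candidate = NotetypeId(candidate.0 + 1),
            None => {
                let mut copy = src.clone();
                copy.id = candidate;
                col.add_notetype_with_id(copy)?; // placeholder for an add that keeps the id
                return Ok(candidate);
            }
        }
    }
}
```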

> To be honest, I still don't understand the benefit of an export trait, but maybe it'll dawn on me as I proceed.

We may not need one. I just thought if we ended up with each exporter looking like

fn some_exporter() {
... lots of shared setup code
specific_exporter_logic()
... lots of shared teardown code
}

then in that case, being able to pass in the custom logic with a fn or trait might be convenient. Let's YAGNI until we see a pattern. :-)

if let Ok(normal_mut) = deck.normal_mut() {
    normal_mut.config_id = 1;
} else {
    // TODO: scheduling case

RumovZ (Collaborator, author):

I couldn't figure out how filtered decks are currently converted to regular ones if scheduling is included. 🤔
Do I have to run the below code for all filtered decks in the scheduling case as well?

dae (Member):

Unintuitively, the current code handles this at import time (anki2.py:287, :356). It means renaming an apkg file to colpkg and importing it results in errors. Ideally we'd do it on the export side instead, but since we'll need to keep doing it on the import side to support legacy decks, we could get away with just doing that for now.

In the scheduling=reset case, we reset the config id to 1 so that the user's default deck preset/config is used. In the scheduling=keep case for normal cards, we preserve+copy the deck presets over, so the user can use this mode for moving/copying their cards/decks/presets from one profile to another. In the filtered deck case I guess the current code is discarding the deck config id? In the scheduling=keep case, it might be preferable if we preserved the filtered deck and made sure the home deck was included at the same time, as that would allow users to shuffle filtered decks between profiles as they can do with normal decks. However, the first paragraph may come back to bite us there - if we went down that path, we'd need to make sure we're not breaking legacy exports. Do you have any thoughts/ideas/preferences on what we do here? We could potentially defer this part until the code to deal with the importing side is implemented.
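
If the conversion did move to the export side later, the no-scheduling case could be a small extension of the snippet above (a sketch only; the exact Deck/kind types may differ):

```rust
// Hedged sketch: normal decks get pointed at the default preset, filtered
// decks are converted into plain normal decks with default settings, keeping
// their id and name so their cards still have a home.
fn strip_scheduling(deck: &mut Deck) {
    if let Ok(normal_mut) = deck.normal_mut() {
        normal_mut.config_id = 1;
    } else {
        deck.kind = DeckKind::Normal(NormalDeck {
            config_id: 1,
            ..Default::default()
        });
    }
}
```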

RumovZ (Collaborator, author):

> Unintuitively, the current code handles this at import time (anki2.py:287, :356).

Thanks, somehow I missed that.

> In the filtered deck case I guess the current code is discarding the deck config id?

Did you mean the deck config id? Then yes, its Filtered kind is overwritten with the default Normal kind. I haven't tested it yet, but the intention is to transform the deck into a regular one with default settings.

I'm not sure what the use case is for shuffling decks between profiles, and don't have any preferences here. The current code should work for the legacy import, and produce well-formed collections, so deferring any improvements here sounds good.

dae (Member):

Consider the case where a user has created two separate profiles, and then wants to follow the advice on https://faqs.ankiweb.net/synchronizing-multiple-profiles.html. They need some way of being able to transfer content from one profile to the other in order to be able to merge the profiles, and currently as filtered decks can't be transported, it means the user will need to recreate them and their settings in the importing profile. Not a big deal for the majority of users who either don't use filtered decks, or use them ephemerally. But some users have a bunch of filtered decks with complex queries, and ideally in that case we'd provide a way for them to transport them. But to be fair, it is a niche use case, and probably doesn't come up much.

@RumovZ (Collaborator, author) commented Apr 5, 2022

As discussed above, I was trying to implement a three-step approach (gather, prepare, insert), in contrast to the current code which does everything at once, in order to reduce complexity and increase code reusability. However, it feels like an uphill battle. All ids and names need to be handled manually, while we'd (at least in part) get that for free with the existing add methods.
Also, the approach seems impossible to implement strictly. Two examples: Canonifying tags needs to both register new tags, and look up the case of existing ones, so the note can be adjusted accordingly. So we need to either make changes to the db during the preparation step, or modify the imported data during the insertion step.
Similarly with media: We need to read incoming media while preparing notes, so we can check for equality with existing media, and potentially use an alternative file name (which then needs to be reflected on the notes). So it makes sense to immediately copy the file while it's in memory. On the other hand, I think media should be copied as late as possible, because we can't roll it back if an error occurs. (Or should we copy to temp files, and persist these in bulk at the end of a successful import?)
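
To make the temp-file idea concrete, staging could look something like this (all names invented; the real media code has its own hashing and dedup helpers):

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical staging area: while preparing notes, decide on the final
// filename (renaming when the name is already taken by different content),
// keep the bytes in memory, and only write everything into the media folder
// after the rest of the import has succeeded.
#[derive(Default)]
struct StagedMedia {
    files: HashMap<String, Vec<u8>>, // final filename -> contents
}

impl StagedMedia {
    /// Returns the filename notes should reference.
    fn stage(&mut self, name: &str, data: Vec<u8>, existing: Option<&[u8]>) -> String {
        match existing {
            // same name, same content already in the target collection
            Some(current) if current == data.as_slice() => name.to_string(),
            // name taken by different content: pick an alternative name and
            // let the caller rewrite references in the note text
            Some(_) => {
                let alt = format!("import-{}", name); // naming scheme simplified
                self.files.insert(alt.clone(), data);
                alt
            }
            None => {
                self.files.insert(name.to_string(), data);
                name.to_string()
            }
        }
    }

    /// Persist everything in one go once the import has committed.
    fn persist(self, media_folder: &Path) -> io::Result<()> {
        for (name, data) in self.files {
            fs::write(media_folder.join(name), data)?;
        }
        Ok(())
    }
}
```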

I wonder if this is still the way to go. Should I stick more closely to the current Python code, or do you have another idea for dealing with this? If you think this might still be our best bet, I'll continue. I just want to avoid writing all this new code if it's doomed from the beginning.

Another consideration is undoing. I assume that's a goal at some point? Not sure if this will be easier to achieve with one approach or another.

I've pushed my work so far, hopefully it helps a little bit to understand what I'm rambling about. It's not even remotely finished, though, so there's no point in reviewing anything in detail.

@RumovZ marked this pull request as ready for review on April 30, 2022, 17:35
RumovZ and others added 8 commits on May 1, 2022, 09:26

Shift would have been nice, but the existing shortcuts complicate things.
If the user triggers an import with ctrl+shift+i, shift is unlikely to
have been released by the time our code runs, meaning the user accidentally
triggers the new code. We could potentially wait a while before bringing
up the dialog, but then we're forced to guess at how long it will take the
user to release the key.

One alternative would be to use alt instead of shift, but then we need to
trigger our shortcut when that key is pressed as well, and it could
potentially cause a conflict with an add-on that already uses that
combination.

Ensures each imported note appears on a separate line

This may come down to personal preference, but I feel the other counts
are equally as important, and separating them feels like it makes it
a bit easier to ignore them.
@dae (Member) commented May 2, 2022

Still some testing to do, but I've pushed a few tweaks I've made so far. Please let me know if there's anything you'd prefer done differently or if you spot any problems.

@dae (Member) commented May 2, 2022

Ready from my end, pending your feedback/approval

RumovZ added 2 commits on May 2, 2022, 08:57

Then skip special casing in update_deck(). Also skip updating
description if new one is empty.
@RumovZ (Collaborator, author) commented May 2, 2022

Great!
I've replaced 40ba291, so it hopefully does the same thing, but is easier to follow.

@dae merged commit 5f9451f into ankitects:main on May 2, 2022
@dae (Member) commented May 2, 2022

🚀

Thanks for all your hard work on this Rumo!
