Add apkg import/export on backend #1743

Merged: 140 commits merged into ankitects:main on May 2, 2022

Conversation

@RumovZ (Collaborator) commented Mar 27, 2022

This is pretty much untested. My plan is to do the apkg import part next, then add extensive roundtrip tests and fix it up.
What do you think so far? Should I do anything differently?

@dae (Member) commented Mar 28, 2022

The approach I had in mind was a little different, but that's my fault for not communicating it properly - sorry :-(

Rather than copying rows directly and then mutating the resulting database like the current Python code does, it might make things easier to reason about and maintain if we instead deserialize items into our standard structs like Note, Notetype and so on, and then serialize them again to the output DB - eg the equivalent of for_each_note_in_search(|note| output.storage.add_or_update_note(note)). It won't be as efficient as copying rows directly, though we would win back some of the performance by not having to apply updates after rows have been copied (modifying tags/fields, resetting scheduling/flags, etc).
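
As a rough sketch of that shape (for_each_note_in_search is an assumed helper here, and none of this is the PR's actual code):

```rust
// Hypothetical sketch only: deserialize into the standard structs and write
// them straight into the output collection. `for_each_note_in_search` does
// not exist yet (only a card variant does), and transforms such as tag
// stripping or field blanking would slot in before the write.
fn copy_notes(src: &mut Collection, output: &mut Collection, search: &str) -> Result<()> {
    src.for_each_note_in_search(search, |mut note| {
        note.tags.retain(|tag| tag != "marked"); // example transform
        output.storage.add_or_update_note(&note)
    })
}
```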

The gathering and transforms are also something we might want to reuse. For example, at one point in the future it might be nice to have a separate import/export format that is more easily read by third-party apps. Logic like resetting cards, excluding revlog entries, renaming out-of-tree decks or blanking certain fields is something that could be applied to such other formats as well.

A simple approach would be to gather and transform all the data up-front, then have the exporter just responsible for writing out the gathered data. A more memory-friendly approach would be to have the exporter able to deal with the items in a streaming fashion, but that does add more complexity, and decks over 500MB are exceedingly rare, so a decent case could be made for the former approach too.

For either approach, if the apkg exporter was then implemented as an Exporter trait that accepts the gathered+modified data and writes it out, that would make adding future formats easier. We may even want to use it for csv exports as well - as while the scheduling info is currently irrelevant, removing the marked tag and redacting some fields are things we may still want to do in the future.

WDYT?

@dae (Member) commented Mar 28, 2022

(Getting a minimum implementation of apkg importing done will ease testing, and I think it's a good idea to consider both import and export sides of the equation at the same time, so please feel free to focus on that first before considering the suggestions above)

@RumovZ (Collaborator, author) commented Mar 28, 2022

I see, the colpkg is actually a special case, and other formats should have a common gather step. Does the same go for importing? I.e. should importers gather data into native structs, and then have a single routine to commit these structs to db?

You mentioned an export trait. I guess removing scheduling information, and more customisable stuff would be done in the gather step. Then what would be the shared behaviour of the implementors of that trait?
I guess defining a trait doesn't make sense anyway, before there is a second user, but I just want to make sure I understand.

Not sure if I will take care of importing and tests first, because then I'll have to debug this code before reworking it as well. 😅

@dae (Member) commented Mar 28, 2022

With the Python code, there is something a bit like that. apkg is treated specially, but the other importers like CSV, Mnemosyne and SuperMemo are all based on NoteImporter. They gather the data into a common format, which NoteImporter then applies to the DB. Ideally our new code would unify the apkg import case with the others, if it is practical to do so.

For the export trait, I'd just been thinking of a serialization/visitor-like interface, eg:

fn export_notes(&self, notes: Vec<Note>)
fn export_cards(&self, cards: Vec<Card>)

Then each format exporter could decide how it would turn the gathered and mutated cards/notes/etc into the desired format (whether it places all the provided objects in an .anki2 file and zips it up, writes them out to a csv file, etc). Each exporter could ignore parts that aren't relevant (eg no notetype info in a csv), and would probably need to retain state throughout the export (eg a json exporter might need to build up a single giant dict with the info).
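
If a trait does emerge later, it might end up shaped roughly like this (names illustrative only; &mut self because, as noted, exporters will likely need to retain state):

```rust
// Illustrative only - a visitor-like trait over the gathered and already
// transformed data. Formats ignore what they don't need, and finish() writes
// the accumulated result out (zip up the .anki2, close the csv, ...).
trait Exporter {
    fn export_notes(&mut self, notes: Vec<Note>) -> Result<()>;
    fn export_cards(&mut self, cards: Vec<Card>) -> Result<()>;
    fn export_notetypes(&mut self, _notetypes: Vec<Notetype>) -> Result<()> {
        Ok(()) // default no-op for formats that don't care (eg csv)
    }
    fn finish(&mut self) -> Result<()>;
}
```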

Skipping the trait while prototyping sounds wise; we can easily add it later.

@RumovZ (Collaborator, author) commented Mar 29, 2022

Is this more like what you had in mind?
For importing, I imagine the apkg would just use the current gather code. Other formats would need to do more work, but share some routines among them, of course. Then there'd be a preparation step for handling ids, usns etc, which should be pretty much the same for all formats. After that, the insert module can handle the rest.

To be honest, I still don't understand the benefit of an export trait, but maybe it'll dawn on me as I proceed.

@dae mentioned this pull request on Mar 29, 2022
with_scheduling: bool,
) -> Result<()> {
self.decks = col.gather_decks(deck_id)?;
let search = optional_deck_search(deck_id);

dae (Member):

The current Python code allows the caller to pass in either the whole collection, a specific deck, or a list of cards (which is actually whole notes, as the browser uses .selectedCardsAsNotes()). It would be nice to support the selected notes use case as well.

One option would be to require the caller to use search_notes_into_table() prior to calling gather_data(). That covers all three cases, and allows this code to not be concerned with how exactly the notes have been selected, or need the separate siblings search. It also means our queries can reference the search_nids table instead of needing to construct and parse large sqlite statements that use ids_to_string().

We may need some more helper functions here - currently there's a for_each_card_in_search(), but there isn't one for notes, or one for returning cards matched by search_nids.

To preserve the "rewrite decks outside selected deck" functionality for the deck export case, we may still need to accept an optional deck_id argument. And one option would be to call search_notes_into_table() automatically when a deck has been provided.
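
As an illustration of the missing helper (the method name, the row_to_card conversion and the exact SQL are assumptions, but the point is that the query can simply join against search_nids):

```rust
impl SqliteStorage {
    // Assumed sketch: visit every card whose note is in the temporary
    // search_nids table, avoiding ids_to_string() entirely.
    pub(crate) fn for_each_card_of_searched_notes<F>(&self, mut func: F) -> Result<()>
    where
        F: FnMut(Card) -> Result<()>,
    {
        let mut stmt = self.db.prepare_cached(
            "SELECT * FROM cards WHERE nid IN (SELECT nid FROM search_nids)",
        )?;
        let mut rows = stmt.query([])?;
        while let Some(row) = rows.next()? {
            // row_to_card() stands in for whatever conversion the real code uses
            func(row_to_card(row)?)?;
        }
        Ok(())
    }
}
```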

RumovZ (Collaborator, author):

Makes sense. Maybe export_apkg() can just accept a search? The backend code could expose a oneof of reasonable options (a deck id, all, note ids...), and pass on the corresponding search node.
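
Roughly like this, perhaps (the variant names and SearchNode constructors below are placeholders, not the real proto/API):

```rust
// Placeholder sketch of the backend "oneof": each option is reduced to a
// search node before gathering starts.
enum ExportLimit {
    WholeCollection,
    DeckId(DeckId),
    NoteIds(Vec<NoteId>),
}

fn search_for_limit(limit: ExportLimit) -> SearchNode {
    match limit {
        ExportLimit::WholeCollection => SearchNode::WholeCollection,
        // a deck search should also match its subdecks
        ExportLimit::DeckId(did) => SearchNode::from_deck_id(did, true),
        ExportLimit::NoteIds(nids) => SearchNode::from_note_ids(nids),
    }
}
```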

RumovZ (Collaborator, author):

Just remembered why I chose a deck-based approach: The user might want to export a top-level deck where all cards are in subdecks. Then we would also want to export the "empty" parent.
If we accepted a SearchNode, we could check if it's a single deck search, and export empty parent decks accordingly. Not the cleanest solution, but accepting both note ids and a deck id isn't pretty, either.

dae (Member):

Accepting a SearchNode sounds good.

Good point about needing the parents, but that should apply in the specific notes case as well - we must implicitly include all parent decks of the decks that have been used, as the deck tree code in the importing Anki will expect the parents to exist.
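
A sketch of that post-gather step (deck names are assumed to be "::"-separated human names here, and the Deck constructor is approximate):

```rust
use std::collections::HashSet;

// Hypothetical sketch: after gathering, ensure every ancestor of an exported
// deck is present, adding an empty stub where it isn't, so the importing
// client's deck tree code finds the parents it expects.
fn add_missing_parents(decks: &mut Vec<Deck>) {
    let mut have: HashSet<String> = decks.iter().map(|d| d.name.clone()).collect();
    let mut stubs = vec![];
    for deck in decks.iter() {
        // assumes name is the human "::"-separated form
        let parts: Vec<&str> = deck.name.split("::").collect();
        for end in 1..parts.len() {
            let ancestor = parts[..end].join("::");
            if have.insert(ancestor.clone()) {
                stubs.push(ancestor);
            }
        }
    }
    for name in stubs {
        let mut deck = Deck::new_normal(); // approximate constructor
        deck.name = name;
        decks.push(deck);
    }
}
```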

RumovZ (Collaborator, author):

97c9dd4 removes the feature of specifying a deck and moving siblings outside that deck into it. Can we get away with that?
First of all, it won't affect the majority of users who don't have siblings in different decks. And for those who do, including cards outside the selected deck may not be the expected behaviour anyway.
We could show a warning if an export includes decks and cards outside the selected decks.

dae (Member):

I was mainly thinking of the shared decks case, where it's the users who import the deck who tend to encounter the problems/confusion, rather than the person exporting them (similar to our template checks). Showing an error on export if the user has elected to export a particular deck and cards are outside it may even be a better option than moving cards, as it avoids surprising behaviour, and is simpler than attempting to move the cards to a different deck. And for local exports, users could do them via the browse screen if they wanted to avoid the error.

RumovZ (Collaborator, author):

I assume then there'd be a separate routine to check for these things in advance? I think something like this would be required anyway if we want the export process to be more customisable. E.g. we could detect notetype conflicts in advance, and let the user decide on a resolution strategy (mapping or skipping).

dae (Member):

Wouldn't notetype conflict checking need to be done at import time? Maybe you have something else in mind than I'm imagining?

I'm not too fussed about the timing of the check, though it would be nice to at least do prior to all media being exported.

RumovZ (Collaborator, author):

Sorry, I got confused with import/export. For both cases, some routine for preparing/checking that is called before the import/export call by the frontend seems to make sense. Then we can be more naive in the core routines. In this case, we can keep just exporting all decks of matching cards.

dae (Member):

Yep, sounds good 👍

@dae (Member) commented Mar 30, 2022

> Is this more like what you had in mind?

Yep, looking good :-)

> For importing, I imagine the apkg would just use the current gather code. Other formats would need to do more work, but share some routines among them, of course. Then there'd be a preparation step for handling ids, usns etc, which should be pretty much the same for all formats. After that, the insert module can handle the rest.

Sounds good as a high-level plan. The preparation step will be the tricky bit - the existing apkg import logic is somewhat complicated and badly designed. For example:

  • Notes are matched with existing ones based on a string guid instead of id. This was done to reduce the chances of a conflict that a millisecond timestamp would have, but the 64 bit guid should have been larger. It was stored as a string because interoperability with Javascript was a concern at the time.
  • When an existing note is found, importing will refuse to update it if the notetype's fields or template names differ, as we don't want to accidentally write a note out with 2 fields if the notetype says it should have 3.
  • When a new note is to be added, we try to copy the notetype over from the source collection, preserving its id. But if the notetype of the source collection and the current collection have different schemas, we need to copy the source notetype into the collection under a new id, and rewrite the ntid of the note. Instead of assigning a random new number, the code instead tries to increment the ntid by one - the logic being that if an updated deck is imported again later, we could walk the ntids starting at the original one until we locate the copied notetype, allowing it to be reused/updated instead of a new notetype being created each time (see the sketch after this list). But the current way this is done feels a bit complex, and I wonder whether there might be a better way to do it.
  • When importing shared decks, two different decks may have used "banana.jpg" with different file content in each. If that file already exists in the collection with different contents, we need to adjust any notes that reference it to point to the new filename instead.
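
A hedged sketch of the ntid walk described in the third point (schemas_match() and add_notetype_with_id() are placeholders, not real rslib calls):

```rust
// Sketch of the existing "increment the id until it fits" remapping, not a
// proposal: starting at the source notetype's id, reuse the first existing
// notetype whose schema matches, otherwise copy the source notetype in at
// the first free id.
fn remap_notetype(col: &mut Collection, src: &Notetype) -> Result<NotetypeId> {
    let mut candidate = src.id;
    loop {
        match col.get_notetype(candidate)? {
            Some(existing) if schemas_match(&existing, src) => return Ok(candidate),
            Some(_) => candidate = NotetypeId(candidate.0 + 1),
            None => {
                let mut copy = src.clone();
                copy.id = candidate;
                col.add_notetype_with_id(copy)?; // placeholder for an add that keeps the id
                return Ok(candidate);
            }
        }
    }
}
```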

> To be honest, I still don't understand the benefit of an export trait, but maybe it'll dawn on me as I proceed.

We may not need one. I just thought if we ended up with each exporter looking like

fn some_exporter() {
... lots of shared setup code
specific_exporter_logic()
... lots of shared teardown code
}

then in that case, being able to pass in the custom logic with a fn or trait might be convenient. Let's YAGNI until we see a pattern. :-)

if let Ok(normal_mut) = deck.normal_mut() {
    normal_mut.config_id = 1;
} else {
    // TODO: scheduling case

RumovZ (Collaborator, author):

I couldn't figure out how filtered decks are currently converted to regular ones if scheduling is included. 🤔
Do I have to run the below code for all filtered decks in the scheduling case as well?

dae (Member):

Unintuitively, the current code handles this at import time (anki2.py:287, :356). It means renaming an apkg file to colpkg and importing it results in errors. Ideally we'd do it on the export side instead, but since we'll need to keep doing it on the import side to support legacy decks, we could get away with just doing that for now.

In the scheduling=reset case, we reset the config id to 1 so that the user's default deck preset/config is used. In the scheduling=keep case for normal cards, we preserve+copy the deck presets over, so the user can use this mode for moving/copying their cards/decks/presets from one profile to another. In the filtered deck case I guess the current code is discarding the deck config id? In the scheduling=keep case, it might be preferable if we preserved the filtered deck and made sure the home deck was included at the same time, as that would allow users to shuffle filtered decks between profiles as they can do with normal decks. However, the first paragraph may come back to bite us there - if we went down that path, we'd need to make sure we're not breaking legacy exports. Do you have any thoughts/ideas/preferences on what we do here? We could potentially defer this part until the code to deal with the importing side is implemented.
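
If the conversion did move to the export side later, the no-scheduling case could be a small extension of the snippet above (a sketch only; the exact Deck/kind types may differ):

```rust
// Hedged sketch: normal decks get pointed at the default preset, filtered
// decks are converted into plain normal decks with default settings, keeping
// their id and name so their cards still have a home.
fn strip_scheduling(deck: &mut Deck) {
    if let Ok(normal_mut) = deck.normal_mut() {
        normal_mut.config_id = 1;
    } else {
        deck.kind = DeckKind::Normal(NormalDeck {
            config_id: 1,
            ..Default::default()
        });
    }
}
```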

RumovZ (Collaborator, author):

> Unintuitively, the current code handles this at import time (anki2.py:287, :356).

Thanks, somehow I missed that.

> In the filtered deck case I guess the current code is discarding the deck config id?

Did you mean the deck config id? Then yes, its Filtered kind is overwritten with the default Normal kind. I haven't tested it yet, but the intention is to transform the deck into a regular one with default settings.

I'm not sure what the use case is for shuffling decks between profiles, and don't have any preferences here. The current code should work for the legacy import, and produce well-formed collections, so deferring any improvements here sounds good.

dae (Member):

Consider the case where a user has created two separate profiles, and then wants to follow the advice on https://faqs.ankiweb.net/synchronizing-multiple-profiles.html. They need some way of being able to transfer content from one profile to the other in order to be able to merge the profiles, and currently as filtered decks can't be transported, it means the user will need to recreate them and their settings in the importing profile. Not a big deal for the majority of users who either don't use filtered decks, or use them ephemerally. But some users have a bunch of filtered decks with complex queries, and ideally in that case we'd provide a way for them to transport them. But to be fair, it is a niche use case, and probably doesn't come up much.

@RumovZ (Collaborator, author) commented Apr 5, 2022

As discussed above, I was trying to implement a three-step approach (gather, prepare, insert), in contrast to the current code which does everything at once, in order to reduce complexity and increase code reusability. However, it feels like an uphill battle. All ids and names need to be handled manually, while we'd (at least in part) get that for free with the existing add methods.
Also, the approach seems impossible to implement strictly. Two examples: Canonifying tags needs to both register new tags, and look up the case of existing ones, so the note can be adjusted accordingly. So we need to either make changes to the db during the preparation step, or modify the imported data during the insertion step.
Similarly with media: We need to read incoming media while preparing notes, so we can check for equality with existing media, and potentially use an alternative file name (which then needs to be reflected on the notes). So it makes sense to immediately copy the file while it's in memory. On the other hand, I think media should be copied as late as possible, because we can't roll it back if an error occurs. (Or should we copy to temp files, and persist these in bulk at the end of a successful import?)
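
To make the temp-file idea concrete, staging could look something like this (all names invented; the real media code has its own hashing and dedup helpers):

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical staging area: while preparing notes, decide on the final
// filename (renaming when the name is already taken by different content),
// keep the bytes in memory, and only write everything into the media folder
// after the rest of the import has succeeded.
#[derive(Default)]
struct StagedMedia {
    files: HashMap<String, Vec<u8>>, // final filename -> contents
}

impl StagedMedia {
    /// Returns the filename notes should reference.
    fn stage(&mut self, name: &str, data: Vec<u8>, existing: Option<&[u8]>) -> String {
        match existing {
            // same name, same content already in the target collection
            Some(current) if current == data.as_slice() => name.to_string(),
            // name taken by different content: pick an alternative name and
            // let the caller rewrite references in the note text
            Some(_) => {
                let alt = format!("import-{}", name); // naming scheme simplified
                self.files.insert(alt.clone(), data);
                alt
            }
            None => {
                self.files.insert(name.to_string(), data);
                name.to_string()
            }
        }
    }

    /// Persist everything in one go once the import has committed.
    fn persist(self, media_folder: &Path) -> io::Result<()> {
        for (name, data) in self.files {
            fs::write(media_folder.join(name), data)?;
        }
        Ok(())
    }
}
```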

I wonder if this is still the way to go. Should I stick more closely to the current Python code, or do you have another idea for dealing with this? If you think this might still be our best bet, I'll continue. I just want to avoid writing all this new code if it's doomed from the beginning.

Another consideration is undoing. I assume that's a goal at some point? Not sure if this will be easier to achieve with one approach or another.

I've pushed my work so far, hopefully it helps a little bit to understand what I'm rambling about. It's not even remotely finished, though, so there's no point in reviewing anything in detail.

@RumovZ marked this pull request as ready for review on April 30, 2022, 17:35
RumovZ and others added 8 commits on May 1, 2022, 09:26

Shift would have been nice, but the existing shortcuts complicate things.
If the user triggers an import with ctrl+shift+i, shift is unlikely to
have been released by the time our code runs, meaning the user accidentally
triggers the new code. We could potentially wait a while before bringing
up the dialog, but then we're forced to guess at how long it will take the
user to release the key.

One alternative would be to use alt instead of shift, but then we need to
trigger our shortcut when that key is pressed as well, and it could
potentially cause a conflict with an add-on that already uses that
combination.

Ensures each imported note appears on a separate line

This may come down to personal preference, but I feel the other counts
are equally as important, and separating them feels like it makes it
a bit easier to ignore them.
@dae (Member) commented May 2, 2022

Still some testing to do, but I've pushed a few tweaks I've made so far. Please let me know if there's anything you'd prefer done differently or if you spot any problems.

@dae (Member) commented May 2, 2022

Ready from my end, pending your feedback/approval

RumovZ added 2 commits on May 2, 2022, 08:57

Then skip special casing in update_deck(). Also skip updating
description if new one is empty.
@RumovZ (Collaborator, author) commented May 2, 2022

Great!
I've replaced 40ba291, so it hopefully does the same thing, but is easier to follow.

@dae merged commit 5f9451f into ankitects:main on May 2, 2022
@dae (Member) commented May 2, 2022

🚀

Thanks for all your hard work on this Rumo!
