Add apkg import/export on backend #1743
Conversation
The approach I had in mind was a little different, but that's my fault for not communicating it properly - sorry :-( Rather than copying rows directly and then mutating the resulting database like the current Python code does, it might make things easier to reason about and maintain if we instead deserialize items into our standard structs like Note, Notetype and so on, and then serialize them again to the output DB - eg the equivalent of for_each_note_in_search(|note| output.storage.add_or_update_note(note)). It won't be as efficient as copying rows directly, though we would win back some of the performance by not having to apply updates after rows have been copied (modifying tags/fields, resetting scheduling/flags, etc).

The gathering and transforms are also something we might want to reuse. For example, at some point in the future it might be nice to have a separate import/export format that is more easily read by third-party apps. Logic like resetting cards, excluding revlog entries, renaming out-of-tree decks or blanking certain fields could be applied to such other formats as well.

A simple approach would be to gather and transform all the data up-front, then have the exporter just responsible for writing out the gathered data. A more memory-friendly approach would be to have the exporter deal with the items in a streaming fashion, but that adds more complexity, and decks over 500MB are exceedingly rare, so a decent case could be made for the former approach too.

For either approach, if the apkg exporter was then implemented as an Exporter trait that accepts the gathered+modified data and writes it out, that would make adding future formats easier. We may even want to use it for csv exports as well - while the scheduling info is currently irrelevant there, removing the marked tag and redacting some fields are things we may still want to do in the future. WDYT?
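The struct-based copy could be sketched roughly like this; all of the types and method names below are simplified stand-ins for illustration, not the actual rslib API:

```rust
// Hypothetical, pared-down versions of the real types.
#[derive(Clone, Debug, PartialEq)]
struct Note {
    id: i64,
    tags: Vec<String>,
    fields: Vec<String>,
}

#[derive(Default)]
struct Collection {
    notes: Vec<Note>,
}

impl Collection {
    fn add_or_update_note(&mut self, note: Note) {
        match self.notes.iter_mut().find(|n| n.id == note.id) {
            Some(existing) => *existing = note,
            None => self.notes.push(note),
        }
    }

    // Stand-in for a search-driven iteration helper like
    // for_each_note_in_search().
    fn for_each_note_in_search<F: FnMut(&Note)>(&self, mut f: F) {
        for note in &self.notes {
            f(note);
        }
    }
}

// Transform while copying, instead of mutating the output DB afterwards:
// here we strip the "marked" tag as each note is written out.
fn export_notes(src: &Collection, output: &mut Collection) {
    src.for_each_note_in_search(|note| {
        let mut note = note.clone();
        note.tags.retain(|t| t.as_str() != "marked");
        output.add_or_update_note(note);
    });
}
```

The point of the sketch is that every transform (tag stripping, scheduling resets, etc.) happens on the in-memory struct between deserialization and serialization, rather than as SQL fix-ups afterwards.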
(Getting a minimum implementation of apkg importing done will ease testing, and I think it's a good idea to consider both import and export sides of the equation at the same time, so please feel free to focus on that first before considering the suggestions above)
I see, the colpkg is actually a special case, and other formats should have a common gather step. Does the same go for importing? I.e. should importers gather data into native structs, and then have a single routine to commit these structs to the db? You mentioned an export trait. I guess removing scheduling information, and more customisable stuff, would be done in the gather step. Then what would be the shared behaviour of the implementors of that trait? Not sure if I will take care of importing and tests first, because then I'll have to debug this code before reworking it as well. 😅
With the Python code, there is something a bit like that. apkg is treated specially, but the other importers like CSV, Mnemosyne and SuperMemo are all based on NoteImporter. They gather the data into a common format, which NoteImporter then applies to the DB. Ideally our new code would unify the apkg import case with the others, if it is practical to do so.

For the export trait, I'd just been thinking of a serialization/visitor-like interface, eg:

fn export_notes(&self, notes: Vec<Note>)
fn export_cards(&self, cards: Vec<Card>)

Then each format exporter could decide how it turns the gathered and mutated cards/notes/etc into the desired format (whether it places all the provided objects in an .anki2 file and zips it up, writes them out to a csv file, etc). Each exporter could ignore the parts that aren't relevant (eg no notetype info in a csv), and would probably need to retain state throughout the export (eg a json exporter might need to build up a single giant dict with the info). Skipping the trait while prototyping sounds wise; we can easily add it later.
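A minimal sketch of what such a visitor-style trait might look like, with hypothetical stand-in types (not the real rslib definitions):

```rust
// Simplified stand-ins for the gathered data.
struct Note { id: i64, fields: Vec<String> }
struct Card { id: i64, note_id: i64 }

// Each format exporter receives the already-gathered, already-mutated data
// and decides for itself what to write and what to ignore.
trait Exporter {
    fn export_notes(&mut self, notes: Vec<Note>);
    fn export_cards(&mut self, cards: Vec<Card>);
    fn finish(&mut self) -> String;
}

// Example implementor: a CSV-like exporter that retains state (the output
// buffer) across calls and ignores card/scheduling data entirely.
struct CsvExporter { out: String }

impl Exporter for CsvExporter {
    fn export_notes(&mut self, notes: Vec<Note>) {
        for note in notes {
            self.out.push_str(&note.fields.join(","));
            self.out.push('\n');
        }
    }
    fn export_cards(&mut self, _cards: Vec<Card>) {
        // Scheduling info is irrelevant for CSV; do nothing.
    }
    fn finish(&mut self) -> String {
        std::mem::take(&mut self.out)
    }
}
```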
Is this more like what you had in mind? To be honest, I still don't understand the benefit of an export trait, but maybe it'll dawn on me as I proceed.
rslib/src/import_export/gather.rs
Outdated
with_scheduling: bool,
) -> Result<()> {
    self.decks = col.gather_decks(deck_id)?;
    let search = optional_deck_search(deck_id);
The current Python code allows the caller to pass in either the whole collection, a specific deck, or a list of cards (which is actually whole notes, as the browser uses .selectedCardsAsNotes()). It would be nice to support the selected notes use case as well.
One option would be to require the caller to use search_notes_into_table() prior to calling gather_data(). That covers all three cases, and allows this code to not be concerned with how exactly the notes have been selected, or need the separate siblings search. It also means our queries can reference the search_nids table instead of needing to construct and parse large sqlite statements that use ids_to_string().
We may need some more helper functions here - currently there's a for_each_card_in_search(), but there isn't one for notes, or one for returning cards matched by search_nids.
To preserve the "rewrite decks outside selected deck" functionality for the deck export case, we may still need to accept an optional deck_id argument. And one option would be to call search_notes_into_table() automatically when a deck has been provided.
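To illustrate the query simplification: assuming the matched note ids are already sitting in a temporary search_nids table, export queries can join against it instead of constructing large id lists (table and helper names here are hypothetical sketches, not the actual rslib schema):

```rust
// With search_notes_into_table() having populated search_nids, the export
// queries no longer need ids_to_string()-style "id IN (1,2,3,...)" strings.
fn notes_query() -> String {
    "SELECT n.* FROM notes n WHERE n.id IN (SELECT nid FROM search_nids)".into()
}

// Cards of the matched notes. Siblings come along automatically, because
// the filter is on the note id rather than on individually selected cards.
fn cards_of_searched_notes_query() -> String {
    "SELECT c.* FROM cards c WHERE c.nid IN (SELECT nid FROM search_nids)".into()
}
```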
Makes sense. Maybe export_apkg() can just accept a search? The backend code could expose a oneof of reasonable options (a deck id, all, note ids...), and pass on the corresponding search node.
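A rough sketch of that oneof-to-search mapping; the enum and the exact search strings below are illustrative only, not the real backend definitions:

```rust
// Hypothetical mirror of a protobuf oneof the backend might expose.
enum ExportLimit {
    WholeCollection,
    DeckId(i64),
    NoteIds(Vec<i64>),
}

// Convert the caller's choice into a single search string that
// export_apkg() can consume. "nid:1,2,3" is real Anki search syntax;
// the deck-id form is just a placeholder here.
fn limit_to_search(limit: &ExportLimit) -> String {
    match limit {
        ExportLimit::WholeCollection => String::new(), // empty search matches everything
        ExportLimit::DeckId(did) => format!("did:{}", did),
        ExportLimit::NoteIds(nids) => {
            let ids: Vec<String> = nids.iter().map(|n| n.to_string()).collect();
            format!("nid:{}", ids.join(","))
        }
    }
}
```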
Just remembered why I chose a deck-based approach: The user might want to export a top-level deck where all cards are in subdecks. Then we would also want to export the "empty" parent.
If we accepted a SearchNode, we could check if it's a single deck search, and export empty parent decks accordingly. Not the cleanest solution, but accepting both note ids and a deck id isn't pretty, either.
Accepting a SearchNode sounds good.
Good point about needing the parents, but that should apply in the specific notes case as well - we must implicitly include all parent decks of the decks that have been used, as the deck tree code in the importing Anki will expect the parents to exist.
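Since Anki nests deck names with "::", deriving the implicit parents could look like this sketch (the function name is hypothetical):

```rust
// For a deck name like "a::b::c", return the ancestor names that must be
// implicitly included in the export so the importing client's deck tree
// finds all parents: ["a", "a::b"].
fn parent_names(deck_name: &str) -> Vec<String> {
    let parts: Vec<&str> = deck_name.split("::").collect();
    (1..parts.len())
        .map(|i| parts[..i].join("::"))
        .collect()
}
```

The exporter would run this over every gathered deck name and add any missing ancestors to the output set.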
97c9dd4 removes the feature to specify a deck and to move siblings from outside that deck into it. Can we get away with that?
First of all, it won't affect the majority of users who don't have siblings in different decks. And for those who do, including cards outside the selected deck may not be the expected behaviour anyway.
We could show a warning if an export includes decks and cards outside the selected decks.
I was mainly thinking of the shared decks case, where it's the users who import the deck who tend to encounter the problems/confusion, rather than the person exporting it (similar to our template checks). Showing an error on export, when the user has elected to export a particular deck and cards fall outside it, may even be a better option than moving cards: it avoids surprising behaviour, and is simpler than attempting to move the cards to a different deck. And for local exports, users could export via the browse screen if they wanted to avoid the error.
I assume then there'd be a separate routine to check for these things in advance? I think something like this would be required anyway if we want the export process to be more customisable. E.g. we could detect notetype conflicts in advance, and let the user decide on a resolution strategy (mapping or skipping).
Wouldn't notetype conflict checking need to be done at import time? Maybe you have something else in mind than I'm imagining?
I'm not too fussed about the timing of the check, though it would be nice to at least do prior to all media being exported.
Sorry, I got confused between import and export. For both cases, a preparation/checking routine that the frontend calls before the actual import/export seems to make sense. Then the core routines can be more naive. In this case, we can keep exporting all decks of matching cards.
Yep, sounds good 👍
Yep, looking good :-)
Sounds good as a high-level plan. The preparation step will be the tricky bit - the existing apkg import logic is somewhat complicated and badly designed. For example:
We may not need one. I just thought that if we ended up with each exporter looking like fn some_exporter() { then being able to pass in the custom logic with a fn or trait might be convenient. Let's YAGNI until we see a pattern. :-)
export_apkg() now accepts a search instead of a deck id. Decks are gathered according to the matched notes' cards.
rslib/src/import_export/gather.rs
Outdated
if let Ok(normal_mut) = deck.normal_mut() {
    normal_mut.config_id = 1;
} else {
    // TODO: scheduling case
I couldn't figure out how filtered decks are currently converted to regular ones if scheduling is included. 🤔
Do I have to run the below code for all filtered decks in the scheduling case as well?
Unintuitively, the current code handles this at import time (anki2.py:287, :356). It means renaming an apkg file to colpkg and importing it results in errors. Ideally we'd do it on the export side instead, but since we'll need to keep doing it on the import side to support legacy decks, we could get away with just doing that for now.

In the scheduling=reset case, we reset the config id to 1 so that the user's default deck preset/config is used. In the scheduling=keep case for normal cards, we preserve and copy the deck presets over, so the user can use this mode for moving/copying their cards/decks/presets from one profile to another. In the filtered deck case, I guess the current code is discarding the deck config id?

In the scheduling=keep case, it might be preferable if we preserved the filtered deck and made sure the home deck was included at the same time, as that would allow users to shuffle filtered decks between profiles as they can with normal decks. However, the first paragraph may come back to bite us there - if we went down that path, we'd need to make sure we're not breaking legacy exports.

Do you have any thoughts/ideas/preferences on what we do here? We could potentially defer this part until the code to deal with the importing side is implemented.
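For the scheduling=reset case being discussed, the transform might be sketched like this; the types are simplified stand-ins, not the real rslib deck structs:

```rust
// Hypothetical, minimal mirror of the deck kinds.
enum DeckKind {
    Normal { config_id: i64 },
    Filtered { search: String },
}

// scheduling=reset: normal decks fall back to the default preset (id 1),
// and filtered decks are converted to default normal decks outright,
// which is roughly what the legacy import-side code ends up doing.
fn reset_for_export(kind: &mut DeckKind) {
    match kind {
        DeckKind::Normal { config_id } => *config_id = 1,
        DeckKind::Filtered { .. } => {
            *kind = DeckKind::Normal { config_id: 1 };
        }
    }
}
```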
Unintuitively, the current code handles this at import time (anki2.py:287, :356).
Thanks, somehow I missed that.
In the filtered deck case I guess the current code is discarding the deck config id?
Did you mean the deck config id? Then yes, its Filtered kind is overwritten with the default Normal kind. I haven't tested it yet, but the intention is to transform the deck into a regular one with default settings.
I'm not sure what the use case is for shuffling decks between profiles, and don't have any preferences here. The current code should work for the legacy import, and produce well-formed collections, so deferring any improvements here sounds good.
Consider the case where a user has created two separate profiles, and then wants to follow the advice on https://faqs.ankiweb.net/synchronizing-multiple-profiles.html. They need some way of transferring content from one profile to the other in order to merge the profiles, and currently, as filtered decks can't be transported, the user will need to recreate them and their settings in the importing profile.

Not a big deal for the majority of users, who either don't use filtered decks or use them ephemerally. But some users have a bunch of filtered decks with complex queries, and ideally in that case we'd provide a way to transport them. To be fair, though, it is a niche use case, and probably doesn't come up much.
search_cards_of_notes_into_table
As discussed above, I was trying to implement a three-step approach (gather, prepare, insert), in contrast to the current code, which does everything at once, in order to reduce complexity and increase code reusability. However, it feels like an uphill battle. All ids and names need to be handled manually, while we'd (at least in part) get that for free with the existing add methods.

I wonder if this is still the way to go. Should I stick more closely to the current Python code, or do you have another idea for dealing with this? If you think this might still be our best bet, I'll continue. I just want to avoid writing all this new code if it's doomed from the beginning.

Another consideration is undoing. I assume that's a goal at some point? Not sure if this will be easier to achieve with one approach or the other.

I've pushed my work so far; hopefully it helps a little to understand what I'm rambling about. It's not even remotely finished, though, so there's no point in reviewing anything in detail.
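For reference, the three-step split can be skeletonized as follows; everything here is a placeholder to show the shape of the pipeline, not actual code:

```rust
// Hypothetical container for everything gathered in step one
// (in reality: notes, cards, decks, notetypes, revlog, media refs).
#[derive(Default)]
struct GatheredData {
    notes: Vec<String>,
}

// Step 1: read from the source collection; no mutation yet.
fn gather(matched: &[&str]) -> GatheredData {
    GatheredData { notes: matched.iter().map(|s| s.to_string()).collect() }
}

// Step 2: transforms that don't depend on the destination collection
// (reset scheduling, strip tags, rename out-of-tree decks, ...).
fn prepare(data: &mut GatheredData) {
    data.notes.retain(|n| !n.is_empty());
}

// Step 3: only this step touches the destination, so id/name remapping
// is concentrated in one place.
fn insert(data: GatheredData, dest: &mut Vec<String>) {
    dest.extend(data.notes);
}
```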
Shift would have been nice, but the existing shortcuts complicate things. If the user triggers an import with ctrl+shift+i, shift is unlikely to have been released by the time our code runs, meaning the user accidentally triggers the new code. We could potentially wait a while before bringing up the dialog, but then we're forced to guess at how long it will take the user to release the key. One alternative would be to use alt instead of shift, but then we need to trigger our shortcut when that key is pressed as well, and it could potentially cause a conflict with an add-on that already uses that combination.
+ Deal with long deck names
Ensures each imported note appears on a separate line
This may come down to personal preference, but I feel the other counts are equally important, and separating them makes it a bit easier to ignore them.
Still some testing to do, but I've pushed a few tweaks I've made so far. Please let me know if there's anything you'd prefer done differently or if you spot any problems.
…export" This reverts commit 8f0b2c1. Will use a different approach
Previously the user could end up seeing a message like:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 1: invalid start byte

Unfortunately we can't retroactively fix this for older clients.
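On the Rust side, one general way to avoid surfacing such a hard decode error is a lossy conversion - a sketch only, not necessarily the exact fix that was applied here:

```rust
// Decode bytes that may contain invalid UTF-8 (as in legacy files).
// Invalid sequences become U+FFFD (the replacement character) instead of
// failing outright, so the user never sees a decode exception.
fn decode_legacy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```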
Ready from my end, pending your feedback/approval.
Then skip special casing in update_deck(). Also skip updating description if new one is empty.
Great!
🚀 Thanks for all your hard work on this Rumo!
This is pretty much untested. My plan is to do the apkg import part next, then add extensive roundtrip tests and fix it up.
What do you think so far? Should I do anything differently?