MTNI-217 ⁃ History + bookmarks imports feature #25
Great, soon we are getting there :) I just played around with the extension and found some things that do not yet work.
tl;dr: There are a few bugs and usability issues that we could solve by moving the stub creation to after a user confirms a download. There we could also directly run the duplication check with Gerben's de-duplication framework for each single element, before any writing to the DB happens. This may conflict with @Treora's idea of batch-downloading at least the titles etc., so a user can search the basic information without having to download the content. Raises the question of whether the feature is really necessary for now, or if we can later just add a button in the download options: "only import basic history information (fast/incomplete)".
@oliversauter Thanks for the feedback. Wasn't expecting any so soon, but you brought up some good points of discussion.
Yeah, this was expected in these cases, as the resource usage in this stage is directly proportional to the input size (number of history items), as noted in the slack chat + upstream PR discussion. There are ways to put a constant upper limit on the resource usage by processing the history items in constant-size "pages" or "batches", one at a time (similar to how pagination works in most websites that have big lists of posts, if you've ever looked into that). This effectively trades resource usage for a bit more processing time. Might be good to look into, as regardless of where this stage occurs (on user "import start" press or component mount), it will still have that input-dependent resource overhead.
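To illustrate what I mean by batching, here's a minimal sketch (not the actual implementation; `createPageStubs` is just a placeholder for the per-batch stub/import-doc creation, and it assumes the promise-based `browser.*` WebExtension API):

```js
// Sketch only: the raw HistoryItems still need to be fetched, but the heavier
// per-item processing and DB writes are limited to BATCH_SIZE items at a time,
// so that overhead stays roughly constant instead of growing with history size.
const BATCH_SIZE = 100

async function processHistoryInBatches(createPageStubs) {
    const historyItems = await browser.history.search({
        text: '',            // match everything
        startTime: 0,
        maxResults: 999999,
    })

    for (let i = 0; i < historyItems.length; i += BATCH_SIZE) {
        const batch = historyItems.slice(i, i + BATCH_SIZE)
        await createPageStubs(batch) // placeholder: stub + import doc creation for this batch
    }
}
```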
Yes, I agree that design-wise it's weird having this stage on the component mount, and it would be nicer for it to happen between when the user presses "start import" and when the actual import batcher starts processing. I explored this option before, but kept getting stuck trying to come up with ways of getting those estimate counts to the UI in a nice way without having to do that same processing. Things have changed a little now, and I have a fresher perspective on all of this, so I'll have a brainstorm on it again today. Maybe if that same processing takes place but without the DB writes? (I need to investigate this more.)
The solution we came up with for this one, I think, involved the use of the saved extension installation time as well (not yet done). If I can get that ext. install timestamp (there's a …)

Regarding the deduping, that is currently done for every imported page, but only after all the fetch&analysis for that page is done, and the outcome is omitted from the UI for the sake of not overcomplicating things from the user's POV. I decided to do it after fetch&analysis following Gerben's lead with how page-visit deduping is done. The reason stated (in an inline comment) is that the deduping logic will use the extra data that now exists (text, metadata, etc.) to make a better judgement of whether or not the page already exists. I may have to investigate the deduping stuff more and work out what goes on. Although, it is important to note that there are two different stages where a given page is attempted to be matched up with existing pages in the DB (both in page-visit and imports scenarios):

However, at the moment the re-identification logic is unimplemented (returns …).
From what I understand this would only be a problem if we really store the stubs before a user starts the import, because the item counts need to be displayed as fast as possible? Correct? It seems there are multiple reasons why we have to put the page-stub creation after the user confirms the download, i.e. when a user starts it: performance, and the fact that page stubs are otherwise just created without asking and cannot be deleted again.
I have an idea here: since we really just want to check if the page already exists, why not just check the URL, not the text etc.? If it is the same, we discard it. This way we could use it for page stubs on import as well as for full pages on visit?
Which same processing?
This isn't a problem. It's a solution to the problem we currently have, which is that the amount of memory needed grows linearly with the input size, and our input sizes can be large enough that the needed memory is unacceptable. So basically my proposal is to change the linear growth in memory overhead to a constant overhead, regardless of input size, by reducing the number of items being processed at any given time (this may sound familiar, as it's also the main reasoning behind the batching: limiting the growth of resource usage relative to input size). The time needed currently grows linearly with input size, and employing the solution I proposed shouldn't change that overall growth pattern of time relative to input size, just some constants (that's what I meant by "a bit more processing-time"; it's effectively not an important-enough difference).

tl;dr: space/memory complexity reduced to some constant factor, time complexity remains linear (both relative to input size)

But will leave this part until after everything else, as it currently works and there are other, higher-priority things to do; it's an optimisation task. But will still be moving this whole page stubs/…
So basically skip the smart dedup logic and just do a much simpler check. A page pointed at via a URL could have changed since it was visited last, but if it was visited before while the extension was installed, it should already have been fetched&analysed at that time (which is what we want). If the extension wasn't installed, then that's the point of the user invoking the import process (which, at the moment, cannot get whatever was at the URL at the visit time and instead gets the current contents with an XHR, but I've discussed the possibility of using web archiving stuff before with @Treora; that's a further "maybe" extension to the fetch&analyse logic). So yes, maybe in the context of imports we don't need the deduping stuff and can test on URL existence.

Kinda related note: the batcher currently …
Yeah, I was over-complicating this part. Can just use counts gotten from DB queries, as we discussed in the slack chat earlier today. Previously I was thinking in terms of a single import process; so getting counts from the …
Yep, just an additional state for the results table data rows. Can do so.
Do I understand correctly here that the RAM usage comes from generally loading all history items in preparation for processing? That would happen independently of the stub creation happening before/after a user starts the download.
Just for my understanding: the deduping currently happens ONLY after the page is fetched? It can't be used before that in the process?
Yes, that's what I thought as well.
Yes, that should be right.
The loading of history items into memory isn't independent of the page stub logic; that logic happens for each item (which are all held in memory).
I will have a look over how the old extension does it tomorrow, if needed, but I presume the underlying algorithm would still have linear complexity in relation to the number of history items (the input), unless you did some form of batching, processing the items in constant-size batches/pages/groups (say 100 items at any given time). Basically what is happening is …

To get a better idea of what's going on, and to explain a bit how I came to the conclusion of linear space complexity, here's a rough complexity breakdown of the underlying algorithm relative to the input, which is stored in memory (the number of history items):

Leaving us with linear growth overall (assuming the index update is linear; I'm unsure about what's actually going on in this part, as it's hidden behind a …).

Also let me know if any of the parts of that algorithm don't make sense to you, as to why they are there, or if you see something that could be simplified.
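For illustration only, the rough shape being described looks something like this sketch (not the actual code; `itemToPageStub` and `remapVisitIds` are placeholder names), which makes the linear space usage visible:

```js
// Everything below holds data for ALL history items in memory at once,
// which is where the linear space growth comes from.
async function preImports(db) {
    const historyItems = await browser.history.search({
        text: '', startTime: 0, maxResults: 999999,
    })                                                              // O(n) items in memory

    const pageStubs = historyItems.map(itemToPageStub)              // O(n) stub docs

    const visitItems = await Promise.all(
        historyItems.map(item => browser.history.getVisits({ url: item.url })),
    )                                                               // O(total visits) in memory

    const visitDocs = remapVisitIds(visitItems, pageStubs)          // needs all visits at once for ID remapping

    await db.bulkDocs([...pageStubs, ...visitDocs])                 // one big bulk insert
    // ...followed by the search index update over everything just inserted
}
```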
I need to get acquainted with the deduping logic before I can accurately answer this. Been treating it as a black-box as it's a separate complex module in itself, but pretty tired after spending all day on this now, so will get back to you on this tomorrow once I have a look. From what I understand of the reasoning put in the comments, the deduping logic will give more accurate results given more available page data (text, metadata, etc.), hence why it's done afterwards in the page-visit scenario. Assuming that's correct, then yes it can be done before but deduping accuracy will be impacted.
I just checked the RAM usage on the old extension: it spikes up to 180MB and it processes all 20k elements with all the data in them, but does not store them or make them searchable. It may be better if the page stubs are created 1 by 1, so the indexing also happens incrementally and not in one big batch that kills the RAM, or if we have the ability to hold off reindexing until the import process is finished.
It could be that Gerben's dedup framework is too sophisticated for our requirements at the moment. We may be able to use it more easily when an actual visit happens, where all data is available. In the import we should be more wary about resource use, hence more custom rules need to be in place. Depending on how Gerben designed the dedup framework, these rules can be incorporated. (Rules below in step 5.) I have an idea for a work-flow where everything is linear.
I just tested it with a large input on my personal web-browsing browser (~15k of history), skipping the indexing, but the issue still remains (which is expected, as the overall algorithm space complexity remains unchanged).
I think this is one point of confusion (at least for me). At the moment everything is being processed 1 by 1. That's what gives the overall linear …
Yes, I think for now we should just go with the simple duplicate URL checking scheme with imports, instead of deduping, for the reasons mentioned before (unless we think of something that contradicts our reasoning later). With the workflow you detailed above, that's basically the same thing. Step 4 is still the linear time/space algorithm I detailed above (ignoring the "maybe also x at a time" part, as that really changes everything), although reindexing has been moved to step 6 (which may be a good thing for complexity's sake, but means the user won't be able to search using simple page stubs until after the entire import process is complete; effectively making the page stubs pointless). Step 5 should really be done before/during step 4 to reduce the constants from processing already-saved page stubs (although that may be what you meant, if steps 5 and 6 are meant to happen for every input of step 4). I really think it's better to have a quick call later today, if you have time, and I can better show you what I'm talking about regarding the problem with the linear space (memory) growth as it relates to input size, and why it will always cause a problem here for sufficiently-sized inputs, regardless of how much constant-time processing we remove from the algorithm. This article is also a good one to read if you have time, but I think it's overly long and applies to all algorithms in general. I could summarise it for this particular case in much less time in a video chat. What do you think?
Further things that complicate the imports process (mostly notes to myself, but feel free to comment):
solved using a flag in local storage that says whether or not a previous import invocation was left unfinished
EDIT: This is just a stupid idea; it defeats the purpose of page stubs...
Find a better way to get estimate counts (still need to differentiate between stubs and full docs). UPDATE: Done
I remember in one of the other threads in upstream repo, there are reasons outlined why updating is not wanted. May need to revisit that.
They currently act as the input to the batcher. They have a …
Here's a bit of an outline of the steps that happen with the current implementation of the import process state between both the background script (BG) and the UI. The biggest pain here was the state management and keeping BG state and UI state (redux) in sync over the 2-way runtime connection port messaging interface, which was slightly awkward. More information regarding the runtime port connection: …

Init stage when the user navigates to the imports UI in options:
That runtime port connection stays open after init process (will remain open until either UI page or background script is closed; through nav event, for example). BG and UI will wait on estimates view until user presses "Start import", which triggers the following:
After that, the user can manually trigger …

Anytime while the import process is running (and not paused), BG can send …

In the case of either manual cancel (…) …

Things change a bit when it is determined that a running import process and its BG<->UI connection was lost somewhere earlier. Changes mainly lie in rehydration of both UI and BG state (UI handled via a redux enhancer, BG via existing import docs in the DB) and the skipping of stages that don't need to run again, but the overall flow remains pretty much the same.

EDIT: One thing that I probably should have mentioned here is the import doc, as it's new. It looks something like this at the moment, and (as described above) acts as a sort of input and state for the underlying batcher instance and as import-related metadata for a page/bookmark doc:

```js
{
    _id: 'import/:timestamp/:nonce',
    status: 'pending'|'success'|'fail',
    type: 'history'|'bookmark',
    url: string,
    dataDocId: string, // ID to a page/bookmark doc
}
```
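For reference, the 2-way runtime port connection mentioned above looks roughly like this minimal sketch (the channel name and message shapes are illustrative, not the actual protocol):

```js
// UI side (options page): open a long-lived connection to the background script.
const port = browser.runtime.connect({ name: 'imports' })

port.onMessage.addListener(({ cmd, payload }) => {
    // e.g. dispatch redux actions based on cmd ('INIT', 'NEXT', 'COMPLETE', ...)
})
port.postMessage({ cmd: 'START' })

// Background side: accept the connection and keep the port around to push state updates.
browser.runtime.onConnect.addListener(port => {
    if (port.name !== 'imports') return

    port.onMessage.addListener(({ cmd }) => {
        // handle START / PAUSE / CANCEL commands coming from the UI
    })

    port.postMessage({ cmd: 'INIT', payload: { /* estimate counts */ } })
})
```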
The main problem with the current flow lies in the …

Making users with sufficiently large histories wait that long on the loading screen (step 1 of the "Start import" flow) and having their computer slow down, and possibly crash the browser, isn't really ideal. I've been discussing this a bit with @oliversauter, and we came to some ideas. Page stub generation is still important here before the actual fetch&analyse batching stage (step 9), for reasons discussed earlier, but we could possibly batch it to constrain resource usage to a constant factor. Visits however can't be batched like that in the current algorithm, as step iv needs all visit docs in memory to do the ID remapping (theoretically you could put them all in the DB and then do n queries + updates, but that many DB interactions doesn't sound like an improvement; may be wrong). We discussed the possibility of leaving the browser visitItem ID in the doc and adding a further index on that, so that the cross-references still exist without that remapping process being needed. Then we could batch the visit doc logic as well, but it makes the visit doc structure messy (and inconsistent with visit docs generated through page visits; but I think that's less of a concern). Relating it to the UI, if we can batch these stages to constrain resource usage, we should introduce some additional UI states instead of just leaving the user on the loading screen for that long (can use the progress view and switch on the help message text + add additional rows to the progress table, for example).

@Treora: What do you think about the idea of leaving the browser visitItem IDs in visit docs generated during imports (so that the remapping stage doesn't need to happen and …)?

tl;dr: resource usage is very bad for …
@oliversauter really good progress made this morning based on the ideas we discussed yesterday. To sum it up:
This affords easily adding new types of imports processing logic as we discussed (think history vs bookmarks, and whatever other types we need in the future), by adding a new case into the entrypoint of that module that calls any async function that takes an import doc and returns a status message (all errors are recommended to be thrown in this module). An example exists and is working for …

This module is now set as the …

tl;dr: Should be super easy to add any new types of processing to the import pipeline by adding cases to the …
As per your step 5 thing I linked to, the visit docs are now created here when the page stub is processed (either check for missing visits in the case of an existing page doc with that URL, or just get all the visits for a single URL). Greatly simplifies the …

Of course, this means the re-mapping of visit IDs can't be done, so the browser VisitItem ID is also stored for now. Although I found another use for that: when a page doc is deemed to exist already and we need to check for any missing visit docs in the DB against the browser API, we can check against the stored browser ID, which IMO is more reliable than the timestamp (even though that seems do-able). We'll see if we want to do the remapping at a later stage or make an index on the browser visit ID (it all depends on what we want to do with these references later, I suppose). Will update the PR OP list with the next obvious things to do.
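As a rough illustration, the per-URL visit handling is shaped something like this (a sketch only; the doc shape and field names are illustrative):

```js
// Fetch and store visits for a single URL at a time, so memory per step is bounded
// by that URL's visit count rather than the whole history's.
async function createVisitDocsForUrl(db, url, pageDocId) {
    const visitItems = await browser.history.getVisits({ url })

    const visitDocs = visitItems.map(visit => ({
        _id: `visit/${visit.visitTime}/${visit.visitId}`,
        page: { _id: pageDocId },
        browserVisitId: visit.visitId,                 // kept so re-runs can detect missing visits
        browserReferringVisitId: visit.referringVisitId,
        visitStart: visit.visitTime,
    }))

    // Existing _ids come back as 409 conflicts in the result rather than being overwritten.
    return db.bulkDocs(visitDocs)
}
```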
Wohoo! Great we are getting there!
Would it help if we also store the historyItem ID in the import doc? This way we can also check if an item from the history API is already stored, in case a user starts the import later / re-does it. Or are the import docs deleted after the import has happened?
This case is handled by the way the page stub and import doc IDs are generated: they're based on the history item's ID + last visit timestamp. So say we start an import, the page stubs and import docs are created and stored, then we choose to cancel early. Next time we start, it will attempt to generate the page stubs + import docs again, but they will be rejected as docs already exist with those IDs (from the previous process's run). But good thing you asked that! It got me thinking, as visit docs have a similar ID generation scheme (based on VisitItems instead), so maybe I'm checking on the VisitItem ID unnecessarily 😄 Will add this to my checklist and see if we can get rid of that explicit check.

Import docs specifically have a …
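To illustrate the idempotent-ID idea (the exact format below is illustrative; the real scheme is derived from the history item's ID + last visit timestamp as described above):

```js
// Deterministic _id: re-running the pre-imports stage generates the same IDs,
// so PouchDB rejects the duplicates with a 409 conflict instead of storing them twice.
const generateImportDocId = ({ id, lastVisitTime }) => `import/${lastVisitTime}/${id}`

async function storeIfNew(db, doc) {
    try {
        await db.put(doc)
    } catch (err) {
        if (err.status !== 409) throw err // 409 = already exists; safe to ignore
    }
}
```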
I get where you are coming from though; the DB would automatically reject adding new items with the same ID. But what happens in the case that a person visits a page again between cancelling and restart? From my understanding, a re-import would then create a new page stub because lastVisitTime has changed? Wouldn't it be better to just check the URL as an indicator of whether a page already exists?
Yes, your understanding is correct. But at the moment, as the import docs are persisted between imports, there's a check during the …

However, thinking about it, in that time between cancel and the next import, the page pointed at by that URL could have changed. But given the way we currently do the page fetch in imports (XHR direct to that URL at the time of import), the same result will be fetched regardless of whether a new page/import doc is made (with a new timestamp) or we reuse the existing one (which happens now). I brought up the idea of possibly using a web archiving service to attempt to fetch page content at the history item's timestamp with @Treora before, but we decided against it. May be a future extension to the fetch&analyse logic, depending on what we want.
Played around with integrating bookmarks into the imports flow today and got a draft version working alright. In the pre-imports stage (which we've referred to as …)

Those are the only changes in the pre-imports stage. The main thing here is that at the end of this stage, we are left with …

This is pretty much the same as …

The bookmark doc structure isn't finalised yet, but I've just created something for now shaped similarly to the existing visit and page docs, while retaining all the needed bookmark-specific data that was talked about in #29 upstream (there's not really much special bookmark-specific data). Also updated the background's estimates logic used for initing the UI so that it works with both bookmarks and history. Messy and yucky calculations, but it seems to work fine for big inputs.
@poltak
If I recall correctly I did have visit ids be generated deterministically (by simply prefixing the visitItem id). The reason for this was idempotency (preventing duplication if importing twice), not batching, but I guess that could be done too this way. Using a deterministic mapping should be just as good as leaving the ids the same.

I am unpleasantly surprised about the time importing appears to take for you, I expected much lower times. Batching may be sensible then, at the cost of higher complexity.

One small worry I had about importing visits one by one was broken referrer links. Some visitItems might point to nonexistent others. Not sure if this happens (but best to assume they might), but in any case it need not really be a problem to have these dead referrer links in our database either. If you think we can somehow (but efficiently) prevent these that would be nice though.

Combining visits from multiple browsers may be a problem if they map to the same visit id; should we perhaps add a token to all ids for the browser and/or importing occasion? (a bit like the same timestamp that is added to each under the importedFromBrowserHistory field)

In case you would like to discuss design choices further we could have a call later this week. I have not followed your progress in detail, but it looks like you are going in the right direction.
I ran a few tests and discovered some bugs. I just tried to import the bookmarks for now; this is what happened:
If I visit pages, they are counted towards the "yet to import" urls in the overview.
@oliversauter thanks for the feedback!
The browser history API shouldn't be touched in the case of bookmarks, it just generates the page doc from the URL stored in the bookmark item (BookmarkTreeNode) from the browser bookmarks API. So in theory it should result in the same thing in Overview, but will look into what's actually going on here.
Although, now I'm thinking maybe that existing-data checking step should come earlier (during page stub + import doc gen) to avoid the need to skip during imports at all... will see. Fixing the estimates counts has been on my todo list, so I'll get around to that. You wouldn't believe how much pain has come out of trying to get those seemingly simple counts accurate in all the different cases of a stopped imports process 😛 For now, take the data given in that UI with a grain of salt.
Yeah, my guess is that it is because only visit items are shown in the overview, and the content of the related page object is rendered in there. At least that's how I understood the process from Gerben. So in case a bookmark item has no visit item, it won't be shown.
This is something I am unsure about, because, for example, after importing I have all visits for noisli.com (which I use almost daily) in the result list. They go back to 27 February. So exactly 3 months, the limit of the History API. Didn't you say that there is one step in your process where you get all visit items from the API, just with the URL? Maybe this step is still active in the bookmarks import flow? Also, it shows this at the beginning:
It seems as if we could put fixing these counts at the back of the issue list. It's ok if they are a bit off, at least for now. If there is more important stuff to fix, then this can wait. What do you think?
Yeah, @bwbroersma is working on a new search library during GSoC.

More info on bugs

When importing, these errors are shown for me in the console for many pages:
Yes, visit items are retrieved in the case of both history and bookmarks imports. At the moment, the only difference between history and bookmarks is that there is an extra step at the beginning for bookmarks to create a bookmark doc. Visit + page doc generation happens the same in both cases (the page doc is made without touching the history API in the case of bookmarks).
Yes, most of the UI outstanding stuff isn't a high priority for me. Just plan to get to it when I can.
The remaining non-trivial issues with this feature are all to do with the state handling and juggling things between UI and background states. There's so many different states and cases to handle that mainly stem from the idea of having it "resumable". I'll write some of them out here, mostly as documentation for myself (since it's so hard to keep organised in my head alone), and for transparency-sake.
What happens?
The bi-directional connection is lost with the background script and needs to be re-established. This starts up connection initialisation again in the background after the UI automatically re-connects. Background connection initialisation looks something like:
Problems
Background does not have access to the allowed import types state (UI checkboxes). This means if, say, the import was started for bookmarks only, but there are some existing import docs of type …

Possible solutions
1: …
2: …

Maybe 1 can be tried, but it further complicates and messes up this bi-directional UI<->Background connection state syncing thing.
What happens?
The bi-directional connection will be lost and any currently-in-progress import items will continue until they succeed or fail (for whatever reason). In this process the full page doc may be filled out.

Problems
The observer logic (the logic that triggers on the different import item outcomes) cannot be run, as the connection no longer exists and all connection logic gets cleaned out of memory. The update of the import doc status is currently done in the observer, hence it won't happen and both the UI and the background will never know the outcome of these items, even though the page doc is filled out in the DB. The effect is that in the next import process, these will be started again (as their …).

Possible solutions
1: …
2: …
3: Just live with the skipping of docs that were in progress at the connection-lost time. It just means that n docs will always show up as being skipped in the UI whenever a user interrupts the import process and resumes it later, and the import doc statuses will still eventually be synced up. It's still kinda messy.

1 and 2 essentially just put all the DB mutations in one place instead of two, which does make more sense looking at the problem from a distance. 1 could save future re-imports, but it means the UI will never know the outcome of those items (counts and the progress table will get out of sync, which may confuse the user). 2 will result in n unnecessary extra imports (at most 5, or whatever we end up choosing the concurrency level to be) but should allow the UI to stay in sync. 3 will keep the UI in sync but have the first n results in the progress table always show as "Skipped".

There are more state-related issues, but I'll add them as I encounter them again. Can't remember everything to do with this off the top of my head as it's so messy.
@poltak Thanks for the detailed analysis.
Mhh weird. How can it be that for 300 bookmarks items (which I almost never touch), my history count is reduced from 20k to 30ish or so (when download is finished and I go back to the import start overview)?
I think we should go for this option for the time being. As long as the actual import works, it's an issue we can tackle later then.
How can it be that for 300 bookmarks items (which I almost never touch), my history count is reduced from 20k to 30ish or so (when download is finished and I go back to the import start overview)?
This is the same estimates count view we just agreed to leave until later. Getting those counts right, apart from before the first import is run, is going to be quite difficult, so until that's done, don't take any of the counts in that view as fact. The only way to view what is already stored right now is through the DB.
Ah ok, didn't understand that they are so closely correlated. When I talked about leaving it out, I meant it may not be a problem if the count is off by a few (like when, at the end, the counter is 294/300 even though it is finished).
- now based on URL + import timestamp
- URL is directly related to the import doc; there should only ever be one import doc for a given URL in an import process
- the timestamp I'm still not decided on... now will generate new import docs per import process (cancel/finish and start again later)
- means we have to handle old pending import docs in the case of a cancelled import process (prev _id generation was kinda pointless though)
Create yet another main import UI state (preparation)
- before was reusing init/loading state, but this stage is complicated and long for large inputs
- hence need a slightly different state to show the user (a simple loading bouncy thing for minutes is not really good)
- now we're up to 6 main states for the imports UI (plus all the different combinations) - really needs a cleanup

Write view logic for new imports UI prep state
- just a simple message for now (just that seems like a big improvement from no message)

Merge imports UI init + prep states into single state
- called isLoading in the view
- they were pretty much the same apart from a message that is displayed
- hence put that message in the store and display it on the view if it's present
- means we can put whatever messages on loading state via actions (without introducing new states)

Put loading state + reindexing on import cancel event
- reindex needs to happen so that the progress up until cancel can be usable
- loading state needed as reindexing takes forever, and we can tell the user to sit tight
- changed CMDS.STOP to CMDS.CANCEL since that's what the cmd is doing
- additional action + reducer

Wipe download details progress on import finish
- previously kept them, but wiped things like the progress counts - inconsistency
- this one checks the URL to process against existing page docs in the pre-imports stage (used in both estimates counts and page/import doc creation)
- needed mainly to stop feeding already-visited pages to the import batch (they will still be skipped, but no reason to even create import docs for these guys)
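Roughly what that check looks like, as a sketch (assuming pouchdb-find and a queryable `url` field on page docs; the actual query may differ):

```js
import PouchDB from 'pouchdb'
import PouchDBFind from 'pouchdb-find'

PouchDB.plugin(PouchDBFind)
const db = new PouchDB('worldbrain') // illustrative DB name

// Filter out URLs that already have a page doc, so no import docs get created for them.
async function filterOutExistingUrls(urls) {
    await db.createIndex({ index: { fields: ['url'] } }) // no-op if the index already exists

    const results = await Promise.all(
        urls.map(url =>
            db.find({ selector: { url }, fields: ['_id'], limit: 1 }),
        ),
    )

    return urls.filter((url, i) => results[i].docs.length === 0)
}
```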
- this file changed quite a bit upstream since the last work on it, hence I decided to just overwrite my changes with the upstream changes and rewrite the simpler case of imports page analysis
- the main conflict with upstream now is that the setDocField/Attachment helpers are on the entire file scope instead of inside the background entrypoint
Add isStub flag to page doc to afford differentiation
- used to differentiate stubs and filled-out docs
- may be temporary; at the "fill-out" stage, probably can just get rid of it as it is only used in the context of imports
- similar to hotdog or not hotdog

Write logic for removing page stubs on import cancel
- uses the new isStub flag

Add page stub flag switching in fetchAndAnalyse
- this is done at the same time as other fields are updated in fetchAndAnalyse
- at the moment this means that those page stubs existing for import docs that get skipped or error-out will remain, hence the cleanup step in complete() (happens at end of imports process)

Altered completed pages calc in estimates calc
- now that leftover page stubs are guaranteed not to be in the DB, we can simply use the count of page docs within the DB as the completed count (same as bookmarks)
- note that things still go weird for now as import docs refer to the deleted page docs; this is the next step (and might as well move them to local storage at the same time)
Remove explicit index updates in imports
- need to wait until the new search module is implemented and then update the index using that logic

Rewrite persistent import state in local storage
Big commit; essentially removes the old import doc (stored in pouch) and replaces it with a simplified import item (stored in local storage). The import state data model is simplified quite a bit: it now only contains URL and type fields. This assumes uniqueness of URL in the imports process for page stubs and import items. One import item should always refer to just one page stub. Import doc status is now implicit: if it exists, it's pending; if it doesn't, it's either done or errored (error should only happen for an error in the page data fetch, else it's a bug!). Still need to properly handle removing of state on a cancel, which should get remade on restart, but not including already-processed items.

Remove pending import items check in preimports
- now that import state items are cleaned up between import processes, and preimports only happens at the start of an import process, this check is not needed

Remove import items checks in estimate count calc
- no longer needed as these est counts happen when there are no import state items in storage (due to simplified import doc/item state)

Remove all time-related imports logic
- now everything is done more naively based on URL rather than time, as the imports state and page stubs are cleaned up between imports
- no need to store last import time in local storage; page and bookmark docs will be filtered on URL
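As an illustration, the simplified import state in local storage could be handled something like this (the key name and exact shape are illustrative; the commit only guarantees URL + type per item):

```js
const IMPORT_ITEMS_KEY = 'import_items'

// Import items keyed by URL: { [url]: { type: 'history' | 'bookmark' } }.
// Existence of an entry is the implicit "pending" status.
async function addImportItems(newItems) {
    const { [IMPORT_ITEMS_KEY]: existing = {} } =
        await browser.storage.local.get(IMPORT_ITEMS_KEY)
    await browser.storage.local.set({
        [IMPORT_ITEMS_KEY]: { ...existing, ...newItems },
    })
}

// Removing an entry means it's done (or errored during the page data fetch).
async function removeImportItem(url) {
    const { [IMPORT_ITEMS_KEY]: items = {} } =
        await browser.storage.local.get(IMPORT_ITEMS_KEY)
    delete items[url]
    await browser.storage.local.set({ [IMPORT_ITEMS_KEY]: items })
}
```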
- was in the actual page import stage after earlier discussion, but after more recent discussion we need to put it back, or else there's no point to preimports (as they are needed to show the results in the overview search)
- may alter this algorithm slightly soon to store them as they are mapped from page docs; currently have issues testing with large input sizes as pouch is going crazy
- this logic was flawed, as those page docs are needed if we want to be able to search on the visits generated in pre-imports
- now assume the page stubs are in there and switch on them in the relevant queries (calc ext page doc estimates + filtering import URLs against existing page docs)
- due to the idempotent nature of page doc _ids, they will not be duplicated for history/bookmark items on the next pre-imports stage
- essentially replaces the bulk doc insert of all visit docs with several bulk inserts per page doc, putting a linear space limit relative to the amount of visits for a given URL
- no way to get the trails this way, but the browser ID is still stored
Move import DB upsert to observer
- previously was done for each import item
- this led to the problem of actually performing the upsert after the fact when the user presses pause or cancel
- hence on resume of imports, those previously upserted would be found to already exist, leading to skipping, messing up counts and leading to #CONCURRENCY skips
- also set concurrency to 2 as it seems much more stable than 5 on larger inputs (still not completely happy with the performance of the promise-batcher with large inputs)

Rename import-doc-processor to import-item-processor
- minor refactor commit

Unify bookmarks + history at page fetch import stage
- there really isn't any difference between history and bookmarks at this stage; this stage only fetches the page data for the history/bookmark item
- hence it can be unified here, and the mapping of bookmark items to docs + storing can happen in preimports, along with visits and page stubs (should take a relatively small amount of time)
- given the last change, there is no real way in the UI estimates counts stage to differentiate bookmarks associated with page stubs from those with filled-out page docs (since it counts bookmark docs)
- this commit adds logic for each stored bookmark doc to see if the associated page doc is a stub or a full doc
- could also be done with a map reduce, but given the way Promises work, they would all be done at the same time without batching logic, hence may have big perf impacts with large amounts of bookmarks (done in a for-of loop instead)

Move import-estimates out of imports-preparation module
- refactoring commit
- cleans up a lot of the preimports module
- moves the browser history + bookmarks API fetchers + filtering on existing URLs logic back to the index, as they're shared between preimports and estimates
- interfaces slightly changed to afford more customisation later, but defaults set to avoid any changes in current set-up function invocations
- this commit cleans up the changes that no longer need to exist to files originally from master
- should clean up the diff of this branch, to better focus on the actual additions (there shouldn't really be any big changes from master; most changes in this branch are completely new and separate)

Fix up the two favicon extraction methods
- in page visit, the favicon URL is extracted from the tab
- in imports it isn't
- however due to some changes rebased in from upstream, that slightly changed the `extract-page-content` logic (and thus favicon), I needed to clean up the favicon logic so it isn't attempted to be remotely fetched on page visits
- also change the identifier from `favIcon` to `favIconURI` in `fetch-page-data`

Remove unused import UI code
- originally committed in the first imports UI commit (c482468c5bfb85eaf050c9d88e01ef8d5d4a8e81) that was submitted months back upstream, which I included and built my UI off
- sadly never got around to cleaning up the unused stuff that still hung around
Make promise-batcher stateless
- was originally designed to manage the state of its inputs
- revisiting it, it could be greatly simplified by making it completely stateless and pushing the state management responsibility to the caller
- instead of passing in static input, which it then manages the state of (to afford pausing etc.), another async fn can be passed in to fetch input
- this also allows us to simplify the interface to just "start" and "stop"; "pause" is implicit if the user manages their state
- reducing more unneeded states; great stuff!
- also means I have to alter the `imports-connection-handler` (which currently uses it) slightly due to the changed interface

Give promise-batcher a class makeover
- essentially a refactoring commit; no real interface change
- just syntactic sugar, but nice in this case as it was essentially creating the start-stop interface, holding a bit of state for the RxJS sub and abstracting away RxJS stuff
- for me, at least, it's a lot nicer to express this as a class rather than a closure
- also several minor bits of cleanup and putting things on "private" methods where it makes sense

Simplify import item processing by minimising DB ops
- improves the speed of imports for large inputs by A LOT
- removed checking for existing docs to skip the current input (essentially done in preimports as well when the import items are made; some may get through, but it's really not so important compared to the perf impact)
- remove fetching of the associated page stub; instead we can store the ID of the stub on the import item; a bit of extra space but -1 DB op
- put the update op back inside the import-item processor: means that the counts may get slightly out of sync and the currently running import items will continue to run if the user pauses/cancels. But now the DB will not get overloaded as it was before; the batcher will not move on until the DB has been updated
- also increased the concurrency back to 5 as it works great (even up to 10 worked alright, but I think safer as 5 for now)
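A rough illustration of the stateless batcher interface described above (the real implementation is built on RxJS; this only sketches the start/stop shape with caller-owned state):

```js
class PromiseBatcher {
    constructor({ getInput, processItem, onSuccess, onError, concurrency = 5 }) {
        Object.assign(this, { getInput, processItem, onSuccess, onError, concurrency })
        this.stopped = false
    }

    // The caller owns the input state: getInput() should return whatever is still
    // pending, so "pause" is implicitly stop() now + start() later.
    async start() {
        this.stopped = false
        const input = await this.getInput()

        for (let i = 0; i < input.length && !this.stopped; i += this.concurrency) {
            const batch = input.slice(i, i + this.concurrency)
            await Promise.all(batch.map(item =>
                this.processItem(item)
                    .then(res => this.onSuccess(item, res))
                    .catch(err => this.onError(item, err)),
            ))
        }
    }

    stop() {
        this.stopped = true
    }
}
```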
- no longer need Location to be formed for readability
- all page data put inside `content` field
- still returning that favIconURI

Update imports logic to use new page model
- stub sets `content.title` as `title` for now
- import processor replaces old fields with the `content` field, and no longer cares what's inside

Set imports concurrency at 2
- this was the best tradeoff on my machine; let's see how it fares for others
- my machine's results with different concurrency levels are on the PR discussion
- due to the enhancer saving the current import state, it was skipping this loading screen on init
- it was also skipping the loading screen on cancel/finish due to the finishImportsReducer always setting to IDLE state; altered this reducer to allow conditional state setting
- default timeout arg added to afford customisability
- switch from using xhr.ontimeout and xhr.onload to xhr.onreadystatechange to better handle non-200 HTTP responses
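A sketch of what that request handling looks like (the function name and default timeout are illustrative):

```js
function fetchPageHtml(url, timeout = 10000) {
    return new Promise((resolve, reject) => {
        const xhr = new XMLHttpRequest()
        xhr.open('GET', url)
        xhr.timeout = timeout // ms; on timeout the request errors out with status 0

        xhr.onreadystatechange = () => {
            if (xhr.readyState !== XMLHttpRequest.DONE) return
            if (xhr.status >= 200 && xhr.status < 300) {
                resolve(xhr.responseText)
            } else {
                // Covers non-200 responses as well as timeouts/network errors (status 0)
                reject(new Error(`Request to ${url} failed with status ${xhr.status}`))
            }
        }

        xhr.send()
    })
}
```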
- previously if an error was encountered, it was handled fine in the UI but the page stub wasn't marked off, hence it would be retried next time
- now it makes sure to at least mark off the page stub to avoid future error retries (these errors relate to bad HTTP requests; like the URL no longer pointing to a page)
- both stylelint + eslint
- say a user finishes downloading all bookmarks only; they now have 0 remaining
- before, the user could press the start btn and get into an empty download (finishes immediately after pause + resume)
- now it can't be started
This reverts commit 7672f6e. @oliversauter brought up a good point about losing the internet connection during the imports process. This commit would have then marked off all the failed downloads as full pages, meaning they wouldn't be re-attempted in the next import process. We cannot simply check the internet connection for each import item, hence I think it's safest to revert this behaviour for now. Good suggestion to keep track of failed downloads for a future feature to re-attempt failed ones in a later import process.
- brought into both options and overview
- at the moment just the `LoadingIndicator` has been moved there
- fixes the issue with the `LoadingIndicator`'s CSS not being pulled into overview when it was in the `src/common-ui` module
- a bit of testing with my history and bookmarks gives a lot lower constants than originally planned
Congratulations! Wow, what a massive feature. "That was a birth", as we would say in German :)
This will forever be the legendary PR at WorldBrain 🏆
Contains everything needed for new Imports feature which will appear in options view. Can be split into the following features (most of which work independently):
- `promise-batcher` module (batches and runs promises concurrently)
- `import-doc-processor` module (used as batching function to `promise-batcher`; actual processing logic for each import - calls fetch&analyse) - more info in this post
- `imports-preparation` module (previously `import-history`): pre-import stage which generates page stubs and import docs from browser history - more info in this post

Here are two main UI view states: https://imgur.com/a/EXS1b
Remaining tasks:

UI

Bookmarks

import-doc-processor

Background

Put a better loading/prep state on for now with brief info for the user; prev was just a spinner (which isn't very nice to sit and watch for 2 mins without any further info)

`_attachments` to match page visit structure

`import/` doc structures … (`index.js`)

Misc