
MTNI-217 ⁃ History + bookmarks imports feature #25

Merged: 77 commits, Jul 5, 2017

Conversation

@poltak (Member) commented May 18, 2017

Contains everything needed for the new Imports feature, which will appear in the options view. Can be split into the following features (most of which work independently):

  • Options page Imports UI (allows the user to control the imports process)
  • background and UI bidirectional message-passing interface (keeps background and UI state in sync with events, both user- and batcher-triggered); more info in this post
  • fetch&analyse functionality (fetches + analyses further meta/data on a URL)
  • promise-batcher module (batches and runs promises concurrently)
  • import-doc-processor module (used as the batching function for promise-batcher; actual processing logic for each import, which calls fetch&analyse); more info in this post
  • imports-preparation module (previously import-history): pre-import stage which generates page stubs and import docs from browser history; more info in this post

Here are the two main UI view states: https://imgur.com/a/EXS1b

Remaining tasks:

UI

  • allow progress UI to have multiple 'Downloaded' status states (currently boolean; need at least one more state for skipped/deduped)
  • move import type checkbox state to redux
  • have import type checkbox state synced with background state
  • try to come up with more accurate time/data estimates in derived UI state
  • disable start button if nothing to process

Bookmarks

  • get draft bookmarks logic working with import-doc-processor

Background

  • need another existing-data check at the same time as the pending imports and blacklist checks (pre-imports) to stop unnecessary processing of visited pages (which get skipped anyway)
  • optimise the page-stub generation stage (pre-imports); still takes too long for large inputs (mainly because of the reindexing step)
    Put a better loading/prep state on for now with brief info for the user; previously it was just a spinner (which isn't very nice to sit and watch for 2 minutes without any further info)
  • properly handle import type allowance flags (checkboxes) in all cases (more difficult than first thought)
  • need a DB reindex to happen on the cancel event to make import progress usable (problem is the reindex takes a looong time)
  • attach favicon attachment to page doc via the new _attachments to match the page-visit structure
  • come up with method of getting estimate counts without page stubs logic needing to run
  • store extension installation timestamp
  • finalise these import/ doc structures
  • clean up background connection port logic (put it in relevant modules, not all in index.js)

Misc

  • figure out what to do with a lot of string literals currently used as commands for UI and background to communicate (import from a common place that doesn't pull in UI stuff to background script and vice versa)

@blackforestboi (Member) commented May 18, 2017

@poltak

Great, soon we are getting there :)

I just played around with the extension and found some things that do not yet work.
You may know them already, so consider this post just for documentation purposes.

  1. When opening the import view with a lot of elements to import (in my case 21547), the RAM shoots up to almost 1GB and the browser becomes unresponsive
    I guess it is because all the elements are written into the DB before the counting for the UI happens.
    It feels to me as if the actual writing into the DB should happen only when a user presses import.
    This is also because it is not really in the user's control when things are copied from their (trusted) browser history to a new app, even if it is just the stubs.
    Solution idea:
    Maybe find a way to just count the elements without writing them to the DB, so at least the display can happen. When the user presses the download button there could be a message like "Preparing Download", while in the background all the stubs are created (maybe with PouchDB's batch function).
    After that, the message disappears and the download process continues normally, counting up the URLs that are successfully downloaded.
    I don't have a clear picture of the workflows for creating the stubs and batch download yet though, so I might be wrong :)

  2. Elements that are visited after installing the extension are still counted towards the download
    We already talked about that case in the chat. The outcome was: maybe optimise for removing duplicates later, when showing how many pages are imported.
    Solution idea
    Ideally they wouldn't increase the counter anymore, but that would require making the batch duplicate check before counting the URLs to import. Maybe too early an optimisation.
    Combined with the first problem: we could move the stub creation to directly after the user confirms. Either as a batch or single create. Single create would allow us to use Gerben's deduplication framework to check, for each stub, whether it even needs to be created.
    Without a batch duplication check, this still has the downside of counting duplicates, although the download would not store them. May be a good step in between.

tl;dr

There are a few bugs and usability issues that we could solve by putting the stub creation after a user confirms a download. There we could also directly use Gerben's de-duplication framework to check each single element before any writing to the DB happens.

This may conflict with @Treora's idea of batch-downloading at least the titles etc., so a user can search the basic information without having to download the content. It raises the question of whether that feature is really necessary for now, or if we can later just add a button in the download options: "only import basic history information (fast/incomplete)".

@poltak (Member Author) commented May 19, 2017

@oliversauter Thanks for the feedback. Wasn't expecting any so soon, but you brought up some good points of discussion.

  1. When opening the import view with a lot of elements to import (in my case 21547), the RAM shoots up to almost 1GB and the browser becomes unresponsive

Yeah, this was expected in these cases, as the resource usage in this stage is directly proportionate to the input size (number of history items), as noted in slack chat + upstream PR discussion.

There are ways to put a constant upper limit on the resource usage by processing the history items in constant-size "pages" or "batches", one at a time (similar to how pagination works on most websites that have big lists of posts, if you've ever looked into that).

This effectively trades resource usage for a bit more processing-time. Might be good to look into, as regardless of where this stage occurs (on user "import start" press or component mount), it will still have that input-dependent resource overhead.
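
Just to illustrate the batching idea, a minimal sketch (not what's implemented; processHistoryItem is a stand-in for the per-item page-stub/visit logic, and I'm assuming the promise-based browser API polyfill):

// Page through history in constant-size chunks by moving the endTime window
// backwards, so only one chunk (and its derived docs) is in memory at a time.
const BATCH_SIZE = 100

async function processHistoryInBatches(startTime, processHistoryItem) {
    let endTime = Date.now()

    while (endTime > startTime) {
        const batch = await browser.history.search({
            text: '',
            startTime,
            endTime,
            maxResults: BATCH_SIZE,
        })
        if (!batch.length) break

        // Finish this chunk completely before fetching the next one
        await Promise.all(batch.map(processHistoryItem))

        // Move the window back past the oldest item just handled
        // (items sharing the exact same lastVisitTime could get skipped; fine for a sketch)
        endTime = Math.min(...batch.map(item => item.lastVisitTime || 0)) - 1
    }
}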

When user presses the download button there could be a message like "Preparing Download", while in the background all the stubs are created

Yes, I agree that design-wise it's weird having this stage on the component mount, and it would be nicer for it to happen between when the user presses "start import" and when the actual import batcher starts processing. I explored this option before, but kept getting stuck on trying to come up with ways of getting those estimate counts to the UI in a nice way without having to do that same processing. Things have changed a little bit now, and I have a fresher perspective on all of this, so I'll have another brainstorm on it today. Maybe the same processing could take place but without the DB writes? (I need to investigate this more.)

  2. Elements that are visited after installing the extension are still counted towards the download

The solution we came up with for this one, I think, also involved using the saved extension installation time (not yet done). If I can get that extension install timestamp (there's a browser.runtime.onInstalled event :D), it should be easy to put constraints on the history search time so that, if the last import timestamp doesn't exist yet, we just import and get counts for history from before the install time.
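
Roughly something like this (just a sketch; the storage key names are examples):

// Save the install time once, then use it to bound the history search.
browser.runtime.onInstalled.addListener(async details => {
    if (details.reason === 'install') {
        await browser.storage.local.set({ installTimestamp: Date.now() })
    }
})

async function getImportTimeBounds() {
    const { lastImportTimestamp, installTimestamp } =
        await browser.storage.local.get(['lastImportTimestamp', 'installTimestamp'])

    // First-ever import: only look at history from before the extension was
    // installed (visits after install are already logged as page visits).
    if (lastImportTimestamp == null) {
        return { startTime: 0, endTime: installTimestamp || Date.now() }
    }
    // Later imports: only what happened since the last import.
    return { startTime: lastImportTimestamp, endTime: Date.now() }
}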

Regarding the deduping, that is currently done for every imported page, but only after all the fetch&analysis for that page is done, and the outcome is omitted from the UI for the sake of not overcomplicating things from the user's POV. I decided to do it after fetch&analysis, following Gerben's lead with how page-visit deduping is done. The reason stated (in an inline comment) is that the deduping logic will use the extra data that now exists (text, metadata, etc.) to make a better judgement of whether or not the page already exists. I may have to investigate the deduping stuff more and work out what goes on.

It is important to note, though, that there are two different stages where a given page is attempted to be matched up with existing pages in the DB (in both the page visit and imports scenarios):

  • re-identification (performed on the stub page doc, before fetch&analysis)
  • deduplication (performed on the full page doc, after fetch&analysis)

However, at the moment the re-identification logic is unimplemented (it returns undefined for all inputs). Anyway, I think what actually happens here is maybe out of scope for the imports feature, and maybe we should just focus on what the user should see: e.g. whether we think it's acceptable to have a bit of inaccuracy in those estimate counts + the results table (as we may not be able to tell beforehand whether something will be deemed to already be in the DB, and whether we want/need to tell the user in the results table that a given page/URL was deemed a duplicate).

@blackforestboi (Member) commented:

This effectively trades resource usage for a bit more processing-time.

From what I understand, this would only be a problem if we really store the stubs before a user starts the import, because the item counts need to be displayed as fast as possible? Correct?
A possible solution: I could imagine that we either store all 20k elements in RAM and process them 1 by 1, or we temporarily store the whole array in local storage and then process them from there, also 1 by 1. This way we could also run the dedup right in the process, before fetch&analysis. From what I understand, with that we could also save a lot of CPU and RAM?

It seems that there are multiple reasons why we have to put the page-stub creation after the user confirms the download: performance, and the fact that page stubs are currently just created without asking and cannot be deleted again.

The reason stated (in inline comment) is because the deduping logic will use the extra data that now exists (text, metadata, etc.) to make a better judgement of whether or not the page already exists. I may have to investigate the deduping stuff more and work out what goes on.

I have an idea here: since we really just want to check if the page already exists, why not just check the URL, not text etc.? If it is the same, we discard it. This way we could use it for page stubs on import as well as full pages on visit.
We could just use pouchdb-find to look for the URL and then decide whether the page stub/full page gets stored? Does the page stub need to be stored in order to make that check?
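
Something like this, maybe (rough sketch; assumes page docs carry a url field, and the DB name here is made up):

// Rough sketch of a URL-existence check with pouchdb-find.
import PouchDB from 'pouchdb'
import PouchDBFind from 'pouchdb-find'

PouchDB.plugin(PouchDBFind)
const db = new PouchDB('webmemex')

async function pageExistsForUrl(url) {
    await db.createIndex({ index: { fields: ['url'] } }) // no-op if it already exists
    const { docs } = await db.find({
        selector: { url },
        fields: ['_id'],
        limit: 1,
    })
    return docs.length > 0
}

// e.g. if (await pageExistsForUrl(item.url)) { /* discard as duplicate */ }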

I explored this option before, but kept getting stuck on trying to come up with ways of getting those estimate counts to the UI in a nice way without having to do that same processing.

Which same processing?
Couldn't we just count the elements in the array fetched from the API, in order to get the numbers? This way there is no need to write them to the DB. It's what I did in the old extension: https://github.com/WorldBrain/Research-Engine/blob/master/src/js/import_bookmarks.js#L44

whether something will be deemed already in the DB,

Why not just return a message in the processing list (UI) that shows the URL as a "duplicate" instead of a "success"?

@poltak (Member Author) commented May 19, 2017

From what I understand this would only be a problem if we really store the stubs before a user starts the import

This isn't a problem; it's a solution to the problem we currently have, which is that the amount of memory needed grows linearly with the input size, and our input sizes can be large enough that the needed memory is unacceptable. So basically my proposal is to change the linear growth in memory overhead to a constant overhead, regardless of input size, by reducing the number of items being processed at any given time (this may sound familiar, as it is also the main reasoning behind the batching: limiting the growth of resource usage relative to input size). The time needed currently grows linearly with input size, but employing the solution I proposed shouldn't change the overall growth pattern of time relative to input size, just some constants (that's what I meant by "a bit more processing-time"; it's effectively not an important enough difference).

tl;dr: space/memory complexity reduced to some constant factor, time complexity remains linear (both relative to input size)

But I will leave this part until after everything else, as it currently works and there are other, higher-priority things to do; it's an optimisation task. I will still be moving this whole page-stubs/import-history stage to after the user presses "start import", instead of on component mount.

why not just checking the URL, not text etc. If it is the same, we discard it.

So basically skip the smart dedup logic and just do a much simpler $exists check on url? I thought this was a bit silly at first, but after thinking about it a bit in the context of imports, I think it makes good sense.

A page pointed at via a URL could have changed since it was visited last, but if it was visited before while the extension was installed, it should already have been fetched&analysed at that time (which is what we want). If the extension wasn't installed, then that's the point of the user invoking the import process (which, at the moment, cannot get whatever was at the URL at the visit time and instead gets the current contents with an XHR; I've discussed the possibility of using web archiving stuff with @Treora before, but that's a further "maybe" extension to the fetch&analyse logic).

So yes, maybe in the context of imports we don't need the deduping stuff and can test on URL existence. Kinda related note: the batcher currently uniquifies the input URLs, so if the user has 100 entries with the URL https://www.github.com in their history, only 1 is fetched&analysed (as they're gonna be the same, given they're all fetched at roughly the same time).

...[estimates view data stuff]...

Yeah I was over-complicating this part. Can just use counts gotten from DB queries, as we discussed in the slack chat earlier today. Previously I was thinking in terms of a single import process; so getting counts from the import/ docs that are made for each page stub of a given import process invocation.

Why not just returning a message in the processing list (UI) that shows the URL to be a "duplicate" instead of "success"

Yep, just an additional state for the results table data rows. Can do so.

@blackforestboi (Member) commented:

and our input sizes can be large enough that the needed memory is unacceptable

Do I understand correctly that the RAM usage comes from generally loading all history items in preparation for processing? That would happen independently of whether the stub creation happens before or after a user starts the download.
If it's not the writing process that makes the RAM and CPU go mad, but generally holding/processing the data of 20k elements, why isn't it a problem in the old extension? There we also get all 20k elements from the Chrome API but don't have the problem.

So yes, maybe in the context of imports we don't need the deduping stuff and can test on URL existence.

Just for my understanding: the deduping currently happens ONLY after the page is fetched? It can't be used before that in the process?

A page pointed at via a URL could have changed since it was visited last, but if it was visited before and the extension is installed, it should already have been fetched&analysed at that time (which is what we want)

Yes, that's what I thought as well.

Yep, just an additional state for the results table data rows. Can do so. 👍

@poltak (Member Author) commented May 19, 2017

Do I understand correctly here, that the RAM usage comes from generally loading all history items for preparation of processing?

Yes, that should be right.

That would happen independently of the stub creation happening before/after a user starts the download.

The loading of history items into memory isn't independent of the page stub logic; that logic happens for each item (which are all held in memory).

...why isn't it a problem in the old extension?

I will have a look over how the old extension does it tomorrow, if needed, but I presume the underlying algorithm would still have linear complexity in relation to the number of history items (the input), unless you did some form of batching and processing the items in constant-size batches/pages/groups (say 100 items at any given time).

Basically what is happening is the import-history thing here started from this function: https://github.com/WebMemex/webmemex-extension/blob/edf2252475ee3358021314500051ca3d4b7c2f00/src/browser-history/background/import-history.js#L105

To get a better idea of what's going on, and to explain a bit how I came to the conclusion of the linear space complexity, here's a rough complexity breakdown of the underlying algorithm relative to the input which is stored in memory (the number of history items):

  1. for each history item: (linear)
    1.1. put visit items for that history item's URL in memory (constant)
  2. for each history item: (linear)
    2.1. transform to page doc stub (constant)
    2.2. for each visit item of current history item's URL: (still constant to original input)
      2.2.1. transform to visit doc (constant)
  3. for each visit item, of every history item: (linear)
    3.1. replace referring visit item ID with ID of corresponding transformed visit doc (constant)
  4. add import timestamp to all docs (linear)
  5. insert all docs into db (linear)
  6. update page search index (linear???)

Leaving us with linear growth overall (assuming the index update is linear; I'm unsure about what's actually going on in this part, as it's hidden behind a db.search() call, which I'm not familiar with, but can look more into tomorrow).

Also let me know if any of the parts of that algorithm don't make sense why they are there, or if you see something that could be simplified.

Just for my understanding: the deduping currently happens ONLY after the page is fetched? It can't be used before that in the process?

I need to get acquainted with the deduping logic before I can accurately answer this. Been treating it as a black-box as it's a separate complex module in itself, but pretty tired after spending all day on this now, so will get back to you on this tomorrow once I have a look.

From what I understand of the reasoning put in the comments, the deduping logic will give more accurate results given more available page data (text, metadata, etc.), hence why it's done afterwards in the page-visit scenario. Assuming that's correct, then yes it can be done before but deduping accuracy will be impacted.

@blackforestboi (Member) commented:

Yes, that should be right.

I just checked the RAM usage of the old extension: it spikes up to 180MB, and it processes all 20k elements with all their data, but does not store them or make them searchable.
As you already hinted at in your list, maybe the RAM usage of the WebMemex comes from the indexing? Because right now, all elements are written to the DB and, presumably, then also immediately indexed, and that for all 20k elements.

It may be better if the page stubs are created 1 by 1, so the indexing also happens incrementally and not in one big batch that kills the RAM, or if we have the ability to delay reindexing until the import process is finished.

Assuming that's correct, then yes it can be done before but deduping accuracy will be impacted.

Maybe this could be mitigated with a simplified dedup check that is just valid for importing URLs (not only from here, but generally for importing large batches of URLs), for the reasons you already mentioned (we can't get the state of the original visit anyhow). So if it is already in the DB, we should just discard that URL.

It could be that Gerben's dedup framework is too sophisticated for our requirements at the moment. We may be able to use it more easily when an actual visit happens and all data is available. In the import we should be more wary about resource use, hence more custom rules need to be in place. Depending on how Gerben designed the dedup framework, these rules can be incorporated (rules below in step 5).

I have an idea for a workflow where everything is linear.

  1. user visits import overview
  2. API calls are made and the elements are kept in RAM, so that counting can happen.
  3. User presses start
  4. Items are processed 1 by 1 (maybe also x at a time)
  5. Simple dedup check for URL (rough sketch after this list)
    • if from bookmarks API
      • if page object exists
        • if bookmark object exists
          • discard: duplicate
        • if not
          • add bookmark object and reference with page object
      • if not
        • add page object stub and bookmark object and reference to page object
        • run XHR and add missing data to page object
    • if from history API.
      • if page object exists
        • if all visit objects exist (check with timestamp?)
          • discard: duplicate
        • if not:
          • add missing visit objects and reference with page object
      • if not:
        • add page object and all visit objects and reference with page object
        • run XHR and add missing data to page object
  6. New index happens with each successful URL-processing
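
A rough sketch of that step 5 check (the lookup helpers passed in are hypothetical; they would just be thin URL queries against the DB):

// Sketch of the simple per-item duplicate check from step 5 above.
async function checkDuplicate(item, { pageExists, bookmarkExists, allVisitsExist }) {
    const hasPage = await pageExists(item.url)

    if (item.type === 'bookmark') {
        return hasPage && await bookmarkExists(item.url)
            ? 'duplicate'  // discard
            : 'import'     // add bookmark object (+ page stub if missing), then run the XHR
    }

    // history item
    return hasPage && await allVisitsExist(item.url)
        ? 'duplicate'
        : 'import'         // add missing visit objects (+ page object if missing), then run the XHR
}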

@poltak (Member Author) commented May 20, 2017

As you already hinted at in your list, maybe the RAM usage from the WebMemex comes from the indexing?

I just tested it with a large input on my personal web-browsing browser (~15k of history), skipping the indexing, but the issue still remains (which is expected, as the overall algorithm space complexity remains unchanged).

...if the page stubs are created 1 by 1...

I think this is one point of confusion (at least for me). At the moment everything is being processed 1 by 1. That's what gives the overall linear (O(n)) space and time growth, and where the problem lies. If you mean that instead of processing everything and doing a bulk DB insert + indexing at the end, we should process, DB insert, then reindex for every doc: it can't be done with the visits, as they need to be in memory all at once to re-map their reference IDs (mapping the relationships between visit docs from the browser model to our model). It can be done with page docs, but then it will introduce unnecessary DB ops (inserts for every page doc instead of a bulk insert), and also if the reindex takes place for every history item, and it is a linear op, this increases the overall algorithm's complexity to quadratic time (a whole different level of complexity; generally you want to avoid this in algorithms if possible). Either way, as the visits can't be changed, the overall space complexity will remain the same (as visits are directly proportional to the number of history items), meaning the problem will still exist for users with sufficiently large histories.

It could be that the dedup framework of Gerben is too sophisticated for our requirements at the moment

Yes, I think for now we should just go with the simple duplicate URL checking scheme with imports, instead of deduping for the reasons mentioned before (unless we think of something that contradicts our reasoning later).

With the workflow you detailed above, that's basically the same thing. Step 4 is still the linear time/space algorithm I detailed above (ignoring the "maybe also x at a time" part, as that really changes everything), although reindexing has been moved to step 6 (which may be a good thing for complexity's sake, but it means the user won't be able to search using simple page stubs until after the entire import process is complete, effectively making the page stubs pointless). Step 5 should really be done before/during step 4 to reduce the constants from processing already-saved page stubs (although that may be what you meant, if steps 5 and 6 are meant to happen for every input of step 4).

I really think it's better to have a quick call later today, if you have time, and I can better show you what I'm talking about regarding the problem with the linear space (memory) growth as it relates to input size, and why it will always cause a problem here for sufficiently-sized inputs, regardless of how much constant-time processing we remove from the algorithm. This article also is a good one to read if you have time, but I think it's overly long and applies to all algorithms in general. I could summarise it for this particular case in much less time in a video chat. What do you think?

@poltak (Member Author) commented May 20, 2017

Further things that complicate the imports process (mostly notes to myself, but feel free to comment):

1. Need some way to tell from the background script if an import that was running before was interrupted before completion DONE

Need to be able to skip the import-history process if so, and just go straight into the progress view. Easily done in the UI with redux state, but the background script cannot access that state directly; everything goes through the awkward runtime connection port interface. This can maybe be solved with a flag in local storage, or by checking for the existence of import/ docs (assuming these are all removed on the completion of an import process). If it is deemed that the import process from earlier is incomplete, the background script needs to skip the estimate count and the UI state change.

Solved using a flag in local storage that says whether or not a previous import invocation was left unfinished.
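
The flag handling itself is roughly this (sketch; the key name is illustrative):

// Local storage flag marking an unfinished import invocation.
const IMPORT_IN_PROGRESS_KEY = 'importInProgress'

const setImportInProgressFlag = flag =>
    browser.storage.local.set({ [IMPORT_IN_PROGRESS_KEY]: flag })

async function wasImportInterrupted() {
    const storage = await browser.storage.local.get(IMPORT_IN_PROGRESS_KEY)
    return Boolean(storage[IMPORT_IN_PROGRESS_KEY])
}

// Set on "start import", cleared on cancel/complete. On background connection
// init: if (await wasImportInterrupted()) skip the estimate counts + pre-import
// stage and tell the UI to go straight to the progress view.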

2. Need to rehydrate view with counts state DONE

3. Need to be able to remove page stubs if an import process is cancelled

If they stay in, the estimate counts sent from background script on init will be wrong (there is currently no way to differentiate between stubs and full page docs)

EDIT: This is just a stupid idea; it defeats the purpose of page stubs... Find a better way to get estimate counts (still need to differentiate between stubs and full docs)

UPDATE: Done

  4. fetch&analyse needs to be able to either delete the original page stub or update it with the fetched data

I remember in one of the other threads in upstream repo, there are reasons outlined why updating is not wanted. May need to revisit that.

  5. possibly remove import/ docs on batch completion

They currently act as the input to the batcher. They have a status flag which says if they are done/failed/pending, so it will still work again if they still exist, but they have no purpose (I think; need to think about this more)

@poltak (Member Author) commented May 22, 2017

Here's a bit of an outline of the steps that happen with the current implementation of the import process state, between both the background script (BG) and the UI. The biggest pain here was the state management and keeping BG state and UI state (redux) in sync over the 2-way runtime connection port messaging interface, which was slightly awkward.

More information regarding the runtime port connection:

Init stage when the user navigates to the imports UI in options:

  1. UI: opens runtime port connection with background script on component construction
  2. BG: instantiate batcher instance (no input given yet; can't be run until later)
  3. BG: gets estimation counts for display and sends them to UI in INIT message
  4. UI: receive INIT message and pass estimation payload to redux action for display

That runtime port connection stays open after init process (will remain open until either UI page or background script is closed; through nav event, for example). BG and UI will wait on estimates view until user presses "Start import", which triggers the following:

  1. UI: move into loading state and send START message to BG
  2. BG: grab the last import timestamp from local storage (may or may not exist; doesn't matter)
  3. BG: start and wait on import-history stage, passing in the last import timestamp:
    1. grab all history items between last import timestamp and now (filtered on blacklist and any history items with URLs matching any pending import docs with type history)
    2. for each history item, grab all visits
    3. convert history + visit items to page stubs + visit docs
    4. map the referring visit ID from browser model to our model
    5. for each page stub, create import doc
    6. insert all created docs
    7. reindex DB
  4. BG: send START message to UI
  5. UI: receive START message and move from loading to in-progress state (moves to progress view)
  6. BG: update last import timestamp in local storage with now
  7. BG: set import in-progress flag in local storage (used for when the port connection gets interrupted and needs to manually resume at a future init stage)
  8. BG: grab all pending import docs and pass them into the idle batcher instance for processing
  9. BG: start batcher instance

After that, the user can manually trigger PAUSE, RESUME, STOP events via UI buttons. Those are straightforward: just redux actions that signal UI state changes and send messages along the port to keep BG state in sync (BG responds by calling relevant methods on batcher instance).

Anytime while the import process is running (and not paused), BG can send NEXT messages to the UI which signify the finish of a given import item (either success or fail; differentiated by payload). UI reacts by updating the UI counts and progress table states. If the import process completes successfully, BG sends COMPLETE message to UI and UI reacts by going into stopped state.

In the case of either manual cancel (STOP event) or the natural completion of the import process (COMPLETE event), these lead to the stopped UI state where user can review what happened on the progress view and have a "Return" button which takes them back to the estimates view and marks the completion of that given import process. DB reindexing and clearing of the import in-progress flag in storage also happens in transition to stopped state. On "Return" press, estimation counts are recalc'd in BG and sent to UI in new INIT message.

Things change a bit when it is determined that a running import process and its BG<->UI connection was lost somewhere earlier. The changes mainly lie in the rehydration of both UI and BG state (UI handled via a redux enhancer, BG via existing import docs in the DB) and the skipping of stages that don't need to run again, but the overall flow remains pretty much the same.
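
For reference, the rough shape of that connection and its messages (just a sketch; the command names are the ones described above, everything else here is illustrative):

// UI side (options page): open the port on component construction.
const port = browser.runtime.connect({ name: 'imports' })

port.onMessage.addListener(({ cmd, payload }) => {
    switch (cmd) {
        case 'INIT':     /* dispatch redux action with estimate counts */ break
        case 'START':    /* move from loading into in-progress state */ break
        case 'NEXT':     /* update counts + progress table with payload */ break
        case 'COMPLETE': /* move into stopped state */ break
    }
})

const startImport = () => port.postMessage({ cmd: 'START' })

// Background side (separate script): react to the connection and keep state in sync.
browser.runtime.onConnect.addListener(bgPort => {
    if (bgPort.name !== 'imports') return

    const estimateCounts = {} // would really come from DB queries
    bgPort.postMessage({ cmd: 'INIT', payload: estimateCounts })

    bgPort.onMessage.addListener(async ({ cmd }) => {
        if (cmd === 'START') {
            // ...run the pre-import (import-history) stage, then:
            bgPort.postMessage({ cmd: 'START' })
            // ...start the batcher; its observers post { cmd: 'NEXT', payload }
            // per finished item and { cmd: 'COMPLETE' } at the end.
        }
    })
})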

EDIT: One thing that I probably should have mentioned here is the import doc, as it's new. Looks something like this at the moment, and (as described above) acts as a sort of input and state for the underlying batcher instance and as import-related metadata for a page/bookmark doc:

{
  _id: 'import/:timestamp/:nonce',
  status: 'pending'|'success'|'fail',
  type: 'history'|'bookmark',
  url: string,
  dataDocId: string, // ID to a page/bookmark doc
}

@poltak (Member Author) commented May 22, 2017

The main problem with the current flow lies in the import-history stage (part 3 of the "Start import" flow doc'd above). Processing time and space are proportional to input size (browser history), so when that is sufficiently large, things end up a bit crazy. To give a concrete example: testing on my daily web-browsing browser with 15,755 history items ends up needing to generate 40,914 visit docs along with 15,755 page stubs + import docs. Steps i-vi take ~55s and the final step vii (reindexing) takes ~100s. Steps ii + iii seem to be the most painful part, with all the calls to browser.history.getVisits() then processing all page + visit items. Understandably resource usage goes through the roof and my laptop starts to sound like a plane taking off. The reindexing isn't as resource-intensive but takes the longest amount of time; however it is really important, as without that reindex there's not much reason for the page stubs (from what I understand).

Making users with sufficiently large histories wait that long on the loading screen (step 1 of "Start import" flow) and having their computer slow down, and possibly crash the browser, isn't really ideal. I've been discussing a bit with @oliversauter about this, and came to some ideas.

Page stub generation is still important here before the actual fetch&analyse batching stage (step 9), for reasons discussed earlier, but we could possibly batch it to constrain resource usage to a constant factor.

Visits however can't be batched like that in the current algorithm, as step iv needs all visit docs in memory to do the ID remapping (theoretically you could put them all in DB and then do n queries + updates, but that many DB interactions doesn't sound like an improvement; may be wrong).

We discussed the possibility of leaving the browser visitItem ID in the doc and adding a further index on that, so that the cross-references still exist without that remapping process needed. Then we could batch the visit doc logic as well, but it makes the visit doc structure messy (and inconsistent with visit docs generated through page visits; but I think that's less of a concern).

Relating it to UI, if we can batch these stages to constrain resource usage we should introduce some additional UI states instead of just leaving the user on loading screen for that long (can use progress view and switch on the help message text + add additional rows to the progress table, for example).

@Treora: What do you think about the idea of leaving in the browser visitItem IDs in visit docs generated during imports (so that the remapping stage doesn't need to happen and the import-history stage can be batched)? Or do you have any different ideas that we haven't thought of to handle these cases?

tl;dr resource usage is very bad for import-history stage if your input size is sufficiently large, so we are proposing to batch page-stubs + visits creation. However visits can't simply be batched at the moment due to the ID remapping stage, so we are proposing skipping it by leaving in the browser IDs and creating another index on that :\

@poltak (Member Author) commented May 23, 2017

@oliversauter really good progress made this morning based on the ideas we discussed yesterday. To sum it up:

This affords the ease of adding new types of imports processing logic, as we discussed (think history vs bookmarks, and whatever other types we need in the future), by adding a new case into the entrypoint of that module that calls any async function that takes an import doc and returns a status message (all errors are recommended to be thrown in this module).

An example exists and is working for history types in L147, which calls the processHistoryImport() fn (which is essentially step 5 in the flow you laid out above a few days back).

This module is now set as the promise-batcher function and run for all inputs (which are the import docs), superseding the old history-centric function. No changes needed to promise-batcher.

tl;dr: Should be super easy to add any new types of processing to the import pipeline by adding cases to the import-doc-processor module that adhere to the (ImportDoc) => statusMessage interface.

  • went ahead and removed the visits step from the page-stubs (import-history) stage

As per your step 5 thing I linked to, the visit docs are now done here when the page stub is processed (either check for missing visits in the case of an existing page doc with URL, or just get all the visits for a single URL). Greatly simplifies the import-history logic and removes the craziness of having to process n calls to browser.history.getVisits() at once.

Of course, this means the re-mapping of visit IDs can't be done, so the browser VisitItem ID is also stored for now. Although I found another use for that: when a page doc is deemed to already exist and we need to check for any missing visit docs in the DB against the browser API, we can check against the stored browser ID, which IMO is more reliable than the timestamp (even though that seems do-able). We'll see if we want to do the remapping at a later stage or make an index on the browser visit ID (it all depends what we want to do with these references later, I suppose).
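
In code, roughly (just a sketch; the doc shape here is illustrative, not the final structure):

// Per-URL visit handling during import-doc processing.
async function createVisitDocsForUrl(url, pageDocId, existingBrowserVisitIds = new Set()) {
    const visitItems = await browser.history.getVisits({ url })

    return visitItems
        // skip visits we already have docs for, matched on the browser's VisitItem ID
        .filter(visit => !existingBrowserVisitIds.has(visit.visitId))
        .map(visit => ({
            _id: `visit/${visit.visitTime}/${visit.visitId}`,
            browserVisitId: visit.visitId,              // kept instead of remapping
            referringBrowserVisitId: visit.referringVisitId,
            visitStart: visit.visitTime,
            page: { _id: pageDocId },
        }))
}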

Will update the PR OP list with next obvious things needed to do.

@blackforestboi (Member) commented:

Wohoo! Great we are getting there!

we can check against the stored browser ID

Would it help if we also store the historyItem ID in the import doc? This way we could also check whether an item from the history API is already stored, in case a user starts the import again later or re-does it. Or are the import docs deleted after the import has happened?

@poltak (Member Author) commented May 24, 2017

This way we could also check whether an item from the history API is already stored, in case a user starts the import again later or re-does it

This case is handled by the way the page stub and import doc IDs are generated: they're based on the history item's ID + last visit timestamp. So say we start an import, the page stubs and import docs are created and stored, then we choose to cancel early. Next time we start, it will attempt to generate the page stubs + import docs again, but it will reject them, as docs already exist with those IDs (from the previous process's run).
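
i.e. something along these lines (sketch; the exact _id format here is illustrative):

// Deterministic _ids mean a re-run just produces conflicts instead of duplicates.
const generatePageStubId = item => `page/${item.id}/${item.lastVisitTime}`
const generateImportDocId = item => `import/${item.id}/${item.lastVisitTime}`

// bulkDocs rejects any doc whose _id already exists (409 conflict), so
// stubs/import docs from a previous, cancelled run aren't recreated.
const storeDocs = (db, docs) =>
    db.bulkDocs(docs).then(results =>
        results.filter(res => res.error && res.name === 'conflict')) // these already existed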

But good thing you asked that! It got me thinking as visit docs have a similar ID generation scheme (based on VisitItems instead), so maybe I'm checking on the VisitItem ID unnecessarily 😄 Will add this to my checklist and see if we can get rid of that explicit check.

Import docs specifically have a status key set to either fail, success, or pending. Ones with pending are used as input to the batcher. At the moment they all stay in the DB even if you cancel or complete an import process fully. pending ones are reused on the next run (in the case of a cancel), however this may be wrong in some cases (say someone runs an import, cancels, deletes some of their history, then starts a new import; at the moment that cancelled history is still gonna be imported unless we act on some history delete event). Also not sure yet if there is any purpose for success/fail docs in future runs. Deleting data is usually naughty, but in this case it may be right to delete them if the import is cancelled/finished completely and regen them next run... I need to ponder the outcomes some more.

@blackforestboi (Member) commented May 24, 2017

by the way the page stub and import ID is generated: it's based on the history item's ID + last visit timestamp.

I get where you are coming from though, the DB would automatically reject adding new items with the same ID.

But what happens in the case that a person visits a page again between cancelling and restarting? From my understanding, a re-import would then create a new page stub because lastVisitTime has changed? Wouldn't it be better to just check the URL as an indicator of whether a page already exists?

@poltak (Member Author) commented May 25, 2017

But what happens in the case that a person visits a page again between canceling and restart?

Yes, your understanding is correct. But at the moment, as the import docs are persisted between imports, there's a check during import-history (at the same time as the blacklist check) that checks history item URLs against the URLs of those import docs in the DB with the pending flag. If there's a match, that history item is skipped for page stub + import doc creation (and the existing page stub + import doc is used for the subsequent import process, as it's still pending).

However, thinking about it, in that time between cancel and the next import, the page pointed at by that URL could have changed. But given the way we currently do the page fetch in imports (an XHR direct to that URL at the time of import), the same result will be fetched regardless of whether a new page/import doc is made (with a new timestamp) or we reuse the existing one (which happens now). I brought up the idea with @Treora before of possibly using a web archiving service to attempt to fetch page content at the history item's timestamp, but we decided against it. May be a future extension to the fetch&analyse logic, depending on what we want.

@poltak (Member Author) commented May 25, 2017

Played around with integrating bookmarks into the imports flow today and got a draft version working alright.

In the pre-imports stage (which we've referred to as import-history so far), this is the basic flow in relation to bookmarks:

  1. page stubs are created for each bookmark gathered from the browser bookmarks API (indistinguishable from browser history page stubs, apart from the data used to generate the doc _id)
  2. then we generate import docs of type bookmark from those page stubs (apart from the type, they're the same as history typed import docs; still contain reference to the associated page stub)
  3. all page stubs (history + bookmarks) are uniq'd together on URL to avoid later unnecessary imports/storage

Those are the only changes in the pre-imports stage. Main thing here is that at the end of this stage, we are left with pending import docs of type bookmark and history, and all the page stubs.

bookmark-typed import docs are then handled differently in the import-doc-processor (but mostly the same as history-type imports), whenever the batcher attempts to process one:

  1. checks that a bookmark doc with the same URL doesn't already exist in the DB
  2. create visit docs for all visits to the bookmarked URL (if any visits get created, they will be associated with the page stub, not the yet-to-be-created bookmark doc)
  3. extract the browser bookmark API's ID from the associated page stub _id's nonce
  4. use that to grab the bookmark item (BookmarkTreeNode) from browser.bookmark.get() and transform to bookmark doc and store
  5. fill out the page stub using same fetch&analyse logic as in history case

This is pretty much the same as the history import doc case, apart from the fact that step 1 checks against existing bookmark docs instead of page docs (though it maybe should check against pages as well, for step 5), and steps 3 and 4, which are bookmark-specific. I might extract most of this logic out and just have it run the bookmark-specific stuff first, then call the same logic as in the history case.
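
The bookmark-specific part (steps 3 and 4) looks roughly like this (sketch only; the helper signature and bookmark doc shape are illustrative):

// Bookmark-specific processing for a bookmark-typed import doc.
async function processBookmarkImport(db, pageStub) {
    // the nonce of the page stub _id holds the browser bookmark ID
    const browserBookmarkId = pageStub._id.split('/').pop()

    const [bookmarkNode] = await browser.bookmarks.get(browserBookmarkId)

    const bookmarkDoc = {
        _id: `bookmark/${bookmarkNode.dateAdded}/${browserBookmarkId}`,
        url: bookmarkNode.url,
        title: bookmarkNode.title,
        dateAdded: bookmarkNode.dateAdded,
        page: { _id: pageStub._id }, // reference back to the associated page stub
    }

    await db.put(bookmarkDoc)
    // ...then fall through to the same fetch&analyse logic as the history case
}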

Bookmark doc structure isn't finalised yet, but I've just created something for now shaped similar to the existing visit and page docs, while retaining all the needed bookmark-specific data that was talked about in #29 upstream (there's not really much special bookmark-specific data).

Also updated the background's estimates logic used for initing the UI, so that it works with both bookmarks and history. Messy and yucky calculations, but it seems to work fine for big inputs.

@Treora (Contributor) commented May 27, 2017

@poltak
(I sent the following reply to your message five days ago by email, but somehow it has not appeared in this conversation)

@Treora: What do you think about the idea of leaving in the browser visitItem IDs in visit docs generated during imports (so that the remapping stage doesn't need to happen and the import-history stage can be batched)? Or do you have any different ideas that we haven't thought of to handle these cases?

If I recall correctly, I did have visit ids generated deterministically (by simply prefixing the visitItem id). The reason for this was idempotency (preventing duplication if importing twice), not batching, but I guess that could be done this way too. Using a deterministic mapping should be just as good as leaving the ids the same.

I am unpleasantly surprised about the time importing appears to take for you, I expected much lower times. Batching may be sensible then, at the cost of higher complexity.

One small worry I had about importing visits one by one was broken referrer links. Some visitItems might point to inexistent others. Not sure if this happens (but best to assume they might), but in any case it need not really be a problem to have these dead referrer links in our database either. If you think we can somehow (but efficiently) prevent these that would be nice though.

Combining visits from multiple browsers may be a problem if they map to the same visit id; should we perhaps add a token to all ids for the browser and/or importing occasion? (a bit like the same timestamp that is added to each under the importedFromBrowserHistory field)

In case you would like to discuss design choices further, we could have a call later this week. I have not followed your progress in detail, but it looks like you are going in the right direction.

@blackforestboi (Member) commented May 27, 2017

@poltak

I ran a few tests and discovered some bugs:

I just tried to import the bookmarks for now, this is what happened:

  1. Some documents are not searchable
    My guess is that it is because I was importing bookmarks that have no history item (Chrome only stores the last 3 months). List elements in the overview are generated by a visit item rendering the content of the page. We might have to also render bookmark items.

  2. After importing bookmarks, almost all history items are also "imported". At least they don't show up anymore in the import dialogue (before: 20k, now just 38).
    My guess is that when choosing to import only bookmarks, it still at least processes the elements from the history API, even though it does not download them

@blackforestboi (Member) commented:

  1. Duplicate test not implemented yet?

If I visit pages, they are counted towards the "yet to import" URLs in the overview.
This would be OK to neglect for now, as long as we make sure that they are not imported again, which means we have to throw a duplicate error before that. Is it already checked whether they are duplicates?

@poltak (Member Author) commented May 28, 2017

@oliversauter thanks for the feedback!

  1. I need to look more into this first as it's a bit more complicated and maybe related to the overview rather than imports. Each successful bookmark import should result in 1 page doc, 1 bookmark doc, and n visit docs (the bookmark doc and whatever visit docs exist will have a reference to that page doc).

The browser history API shouldn't be touched in the case of bookmarks, it just generates the page doc from the URL stored in the bookmark item (BookmarkTreeNode) from the browser bookmarks API. So in theory it should result in the same thing in Overview, but will look into what's actually going on here.

  2. This is a known bug in the estimate counts logic (which populates that pre-import estimates dialogue). History is not actually imported in the case of choosing bookmarks; it's just that the way it counts is wrong (in short, it counts docs in the DB, but there is overlap between docs stored for bookmarks and history which the counting logic doesn't take into account).

  3. Basically the same thing: the estimate counts logic is broken. It will schedule these pages for import, although it will end up skipping them during the actual batch imports process once it detects data exists for those URLs (step 5 from that little flow laid out last week), which should get reflected in the UI.

Although now I'm thinking maybe that existing-data checking step should come earlier (during page stub + import doc generation) to avoid the need to skip during imports at all... we'll see.

Fixing the estimates counts has been on my todo list, so I'll get around to that. You wouldn't believe how much pain has come out of trying to get those seemingly simple counts accurate in all the different cases of stopped imports process 😛 For now, take the data given in that UI with a grain of salt.

@blackforestboi (Member) commented:

I need to look more into this first as it's a bit more complicated and maybe related to the overview rather than imports.

Yeah, my guess is that it is because only visit items are shown in the overview, and the content of the related page object is rendered in there. At least that's how I understood the process from Gerben. So in case a bookmark item has no visit item, it won't be shown.

The browser history API shouldn't be touched in the case of bookmarks

This is something I am unsure about because, for example, for noisli.com, which I use almost daily, I have all visits in the result list after importing. They go back to the 27th of February, so exactly 3 months, the limit of the history API. Didn't you say that there is one step in your process where you get all visit items from the API, just with the URL? Maybe this step is still active in the bookmarks import flow?

Also it shows this at the beginning:
[screenshot: screen shot 2017-05-28 at 09 36 57]

out of trying to get those seemingly simple counts accurate

It seems as if we could put fixing these counts at the back of the issue list. It's OK if they are a bit off, at least for now. If there is more important stuff to fix, then this can wait. What do you think?

(from Slack) "but I recall you saying something like you're planning to replace this library for some reason (or some talk about it"

Yeah @bwbroersma is working on a new search library during GSoC.

More info on bugs

When importing, these errors are shown for me in the console for many pages:

[screenshot: screen shot 2017-05-28 at 09 35 16]

@poltak (Member Author) commented May 29, 2017

Didn't you say that there is one step in your process, where you get all visit Items from the API, just with the URL? Maybe this step is still active in the bookmarks import flow?

Yes, visit items are retrieved in the case of both history and bookmarks imports. At the moment, the only difference between history and bookmarks is that there is an extra step at the beginning for bookmarks to create a bookmark doc. Visit + page doc generation happens the same in both cases (the page doc is made without touching the history API in the case of bookmarks).

If there is more important stuff to fix, then [UI estimate counts] can wait. What do you think?

Yes, most of the UI outstanding stuff isn't a high priority for me. Just plan to get to it when I can.

@poltak (Member Author) commented May 29, 2017

The remaining non-trivial issues with this feature are all to do with the state handling and juggling things between UI and background states. There are so many different states and cases to handle, mainly stemming from the idea of having it "resumable". I'll write some of them out here, mostly as documentation for myself (since it's so hard to keep organised in my head alone), and for transparency's sake.

  1. User presses refresh button during imports

What happens?

The bi-directional connection with the background script is lost and needs to be re-established. This starts up connection initialisation again in the background after the UI automatically re-connects. Background connection initialisation looks something like:

  1. instantiate batcher
  2. import is deemed to be in progress
  3. batcher is (re-)init'd with pending import docs
  4. message sent to UI to go into "Paused" state (then user can resume whenever, which triggers the batcher in background to start up)

Problems

The background does not have access to the allowed import types state (UI checkboxes). This means that if, say, the import was started for bookmarks only, but there are some existing import docs of type history with pending status, the background has no way of knowing the checkbox state, hence step 3 will re-init the batcher with both bookmarks and history.

Possible solutions

1:
Straight after the connection is started from the UI side, send a message passing the allowed import types state. This means both background and UI will need to wait for this to arrive and have the batcher init'd with import docs depending on that UI state. The UI needs to stop the user from doing anything until this is done.

2:
Remove all import docs whenever a given import process finishes (cancelled or fully completed). This would avoid the problem completely. The problem with this is that import docs are associated with page stubs (generated during the same pre-imports stage), and those page stubs will still exist between distinct import processes. Hence we'd need some way of regenerating import docs for existing page stubs, although currently there is no way to differentiate full page docs from stubs, and no plans to change this, as page stubs only matter in this way in direct relation to the imports feature (the rest of the extension shouldn't need to care about these differences).

Maybe 1 can be tried, but it further complicates and messes up this bi-directional UI<->Background connection state syncing thing.
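
If we do go with 1, it would be roughly this shape (sketch; the command name and batcher API here are placeholders):

// UI side: push the checkbox state down the port right after connecting.
const sendAllowedTypes = (port, allowTypes) =>
    port.postMessage({ cmd: 'SET_ALLOWED_TYPES', payload: allowTypes })

// Background side: wait for that message before (re-)initing the batcher,
// and only feed it pending import docs of the allowed types.
const initBatcherOnAllowedTypes = (port, batcher, getPendingImportDocs) =>
    port.onMessage.addListener(async ({ cmd, payload }) => {
        if (cmd !== 'SET_ALLOWED_TYPES') return
        const pendingDocs = await getPendingImportDocs()
        batcher.init(pendingDocs.filter(doc => payload[doc.type]))
    })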

  2. User interrupts running imports process in any way (either refreshes page, kills tab, navigates away, closes browser, etc.)

What happens?

Bi-directional connection will be lost and any currently-in-progress import items will continue until they succeed or fail (for whatever reason). In this process the full page doc may be filled out.

Problems

The observer logic (the logic that triggers on the different import item outcomes) cannot be run, as the connection no longer exists and all connection logic gets cleaned out of memory. The update of import doc status is currently done in the observer, hence it won't happen, and both the UI and the background will never know the outcome of these items, even though the page doc is filled out in the DB. The effect is that in the next import process, these will start to be imported again (as their pending import docs still exist), before it realises that page docs already exist for them and then skips them.

Possible solutions

1:
Move the import doc status updating logic into the actual batching function logic. This means that whatever import items are running when the connection is lost will continue and eventually update the corresponding import doc, meaning they won't need to be attempted again in the next import process.

2:
Move the page doc update logic to the observers, with the import doc status update logic. Means that whatever import items are running when the connection is lost will effectively be lost and redone completely in the next import process, without skipping.

3:
Just live with the skipping of docs that were in progress at the time the connection was lost. It just means that n docs will always show up as being skipped in the UI whenever a user interrupts the import process and resumes it later, and the import doc statuses will still eventually be synced up. It's still kinda messy.

1 and 2 essentially just put all the DB mutations in one place instead of two, which does make more sense looking at the problem from a distance. 1 could save future re-imports, but it means the UI will never know the outcome of those items (counts and progress table will get out-of-sync, may confuse user). 2 will result in n unnecessary extra imports (at most 5, or whatever we end up choosing the concurrency level to be) but should allow UI to stay in sync. 3 will keep UI in sync but have the first n results in the progress table always as "Skipped".

There's more state related issues, but I'll add them as I encounter them again. Can't remember everything to do with this off the top of my head as it's so messy.

@blackforestboi (Member) commented:

@poltak Thanks for the detailed analysis.

(the page doc is made without touching the history API in the case of bookmarks).

Mhh, weird. How can it be that for 300 bookmark items (which I almost never touch), my history count is reduced from 20k to 30-ish or so (when the download is finished and I go back to the import start overview)?

Just live with the skipping of docs that were in progress at the connection-lost time. It just means that n docs will always show up as being skipped in the UI whenever a user interrupts the import process and resumes it later, and the import doc status' will still be eventually synced up. It's still kinda messy.

I think we should go for this option for the time being. As long as the actual import works, it's an issue we can tackle later.

@poltak
Member Author

poltak commented May 29, 2017 via email

@blackforestboi
Member

> This is the same estimates count view we just agreed before to leave until later.

Ah ok, didn't understand that they are so closely correlated. When I talked about leaving it out, I meant it may not be a problem if the count is off by a few (like the counter showing 294/300 at the end even though it's finished).
But 20k elements going missing, especially from a count that should not be touched at all, is very confusing, as it implies that the history has been imported as well, even though the user only picked the bookmarks.

poltak added 24 commits July 5, 2017 11:36
- now based on URL + import timestamp
- URL is directly related to the import doc; should only ever be one import doc for a given URL in an import process
- the timestamp I'm still not decided on... now will generate new import docs per import process (cancel/finish and start again later)
- means that we have to handle old pending import docs in the case of a cancelled import process (prev _id generation was kinda pointless though); rough sketch of the new scheme below
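
For illustration, the new `_id` scheme could look roughly like this; the `import/` prefix and the encoding are hypothetical, the point is just URL + per-process timestamp:

```js
// Sketch: one import doc per URL per import process. All docs in a process
// share the timestamp captured when that process starts.
const generateImportDocId = ({ url, importTimestamp }) =>
    `import/${importTimestamp}/${encodeURIComponent(url)}`

const processTimestamp = Date.now()
const importDocId = generateImportDocId({
    url: 'https://example.com/page',
    importTimestamp: processTimestamp,
}) // => 'import/1499218560000/https%3A%2F%2Fexample.com%2Fpage'
```

Putting a second doc with the same `_id` would then simply conflict in Pouch, which is what enforces one import doc per URL within a process.
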
Create yet another main import UI state (preparation)

- before was reusing init/loading state, but this stage is complicated and long for large inputs
- hence need a slightly different state to show user (simple loading bouncy thing for minutes is not really good)
- now we're up to 6 main states for imports UI (plus all the different combinations)
- really needs a cleanup

Write view logic for new imports UI prep state

- just a simple message for now (just that seems like a big improvement from no message)

Merge imports UI init + prep states into single state

- called isLoading in the view
- they were pretty much the same apart from a message that is displayed
- hence put that message in store and display it on view if it's present
- means we can put whatever messages on loading state via actions (without introducing new states); rough sketch below
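
A minimal Redux-style sketch of that idea (action types and field names are made up for illustration): the view keeps a single `isLoading` flag, and whatever message should accompany it just lives in the store.

```js
// Illustrative action creators: start loading with an optional message
const startLoading = (loadingMsg = '') => ({ type: 'imports/START_LOADING', loadingMsg })
const stopLoading = () => ({ type: 'imports/STOP_LOADING' })

const initState = { isLoading: false, loadingMsg: '' }

// Reducer fragment: still a single loading state; the message is just data
function importsReducer(state = initState, action) {
    switch (action.type) {
        case 'imports/START_LOADING':
            return { ...state, isLoading: true, loadingMsg: action.loadingMsg }
        case 'imports/STOP_LOADING':
            return { ...state, isLoading: false, loadingMsg: '' }
        default:
            return state
    }
}
```

The view then only needs `isLoading` plus an optional `loadingMsg` to render, rather than a separate prep state.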

Put loading state + reindexing on import cancel event

- reindex needs to happen so that the progress up until cancel can be usable
- loading state needed as reindexing takes forever, and we can tell the user to sit-tight
- changed CMDS.STOP to CMDS.CANCEL since that's what the cmd is doing
- additional action + reducer

Wipe download details progress on import finish

- previously kept them, but wiped things like the progress counts
- inconsistency
- this one checks the URL to process against existing page docs in the pre-imports stage (used in both estimates counts and page/import docs creation)
- needed mainly to stop feeding in visited pages to import batch (they will still be skipped, but no reason to even create import docs for these guys)
- this file changed quite a bit upstream since I last worked on it, hence I decided to just overwrite my changes with the upstream changes and rewrite the simpler case of imports page analysis
- main conflicts with upstream now is that the setDocField/Attachment helpers are on the entire file scope instead of inside the background entrypoint
Add isStub flag to page doc to afford differentiation

- used to differentiate stubs and filled-out docs
- may be temporary; at the "fill-out" stage, probably can just get rid of it as it is only used in the context of imports
- similar to hotdog or not hotdog

Write logic for removing page stubs on import cancel

- uses the new isStub flag; see the sketch below
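
A rough sketch of what that cleanup could look like with PouchDB; the `page/` `_id` prefix is assumed here purely for illustration:

```js
// Sketch: on import cancel, find page docs still flagged as stubs and
// bulk-delete them
async function removePageStubs(db) {
    const { rows } = await db.allDocs({
        startkey: 'page/',
        endkey: 'page/\ufff0',
        include_docs: true,
    })

    const deletions = rows
        .filter(({ doc }) => doc.isStub)        // only leftover stubs
        .map(({ doc }) => ({ ...doc, _deleted: true }))

    if (deletions.length) {
        await db.bulkDocs(deletions)
    }
}
```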

Add page stub flag switching in fetchAndAnalyse

- this is done at the same time as other fields are updated in fetchAndAnalyse
- at the moment this means that those page stubs existing for import docs that get skipped or error-out will remain, hence the cleanup step in complete() (happens at end of imports process)

Altered completed pages calc in estimates calc

- now that leftover page stubs are guaranteed not to be in DB, we can simply use the count of page docs within the db as the completed count (same as bookmarks)
- note that things still go weird for now as import docs refer to the deleted page docs; this is the next step (and might as well move them to local storage at the same time)
Remove explicit index updates in imports

- need to wait until new search module is implemented and will update index using that logic

Rewrite persistent import state in local storage

Big commit; essentially removes the old import doc (stored in Pouch) and replaces it with a simplified import item (stored in local storage).
Import state data model simplified quite a bit: now only contains URL and type fields.
This assumes uniqueness of URL in the imports process for page stubs and import items. One import item should always refer to just one page stub.
Import doc status is now implicit: if it exists, it's pending, if it doesn't, it's either done or error'd (error should only happen for error in page data fetch, else it's a bug!)

Still need to properly handle removal of this state on a cancel; it should get remade on restart, but without including already-processed items. A rough sketch of the new storage shape is below.
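
Roughly, the local-storage-backed state could look like this. The storage key and helper names are illustrative, and a promise-based `browser` global is assumed (e.g. via webextension-polyfill).

```js
// Sketch: import state kept in extension local storage rather than Pouch
const IMPORT_ITEMS_KEY = 'import_items'

async function getImportItems() {
    const storage = await browser.storage.local.get(IMPORT_ITEMS_KEY)
    return storage[IMPORT_ITEMS_KEY] || []
}

const setImportItems = importItems =>
    browser.storage.local.set({ [IMPORT_ITEMS_KEY]: importItems })

// Each pending item is just { url, type }; marking it off means removing it,
// so "pending" is simply "still present in storage"
async function removeImportItem(url) {
    const items = await getImportItems()
    await setImportItems(items.filter(item => item.url !== url))
}
```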

Remove pending import items check in preimports

- now that import state items are cleaned up between import processes, and preimports only happens at the start of an import process, this check is not needed

Remove import items checks in estimate count calc

- no longer needed as these est counts happen when there are no import state items in storage (due to simplified import doc/item state)

Remove all time-related imports logic

- now everything is done more naively based on URL rather than time, as the imports state and page stubs are cleaned up between imports
- no need to store last import time in local storage; page and bookmark docs will be filtered on URL
- was in actual page import stage after earlier discussion, but after more recent discussion we need to put it back or else there's no point of preimports (as they are needed to show the results in the overview search)
- may alter this algorithm slightly soon to store them as they are mapped from page docs; currently have issues testing with large input sizes as pouch is going crazy
- this logic was flawed as those page docs are needed if we want to be able to search on the visits generated in pre-imports
- now assume the page stubs are in there and switch on them in the relevant queries (calc ext page doc estimates + filtering import URLs against existing page docs)
- due to the idempotent nature of page doc _ids, they will not be duplicated for history/bookmark items on next pre-imports stage
- essentially replaces the bulk doc insert of all visit docs with several bulk inserts per page doc, putting a linear space limit relative to the amount of visits for a given URL (sketched below)
- no way to get the trails this way, but the browser ID is still stored
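
As a sketch of that change (the mapper function and doc shapes are hypothetical), the per-page-doc bulk inserts look something like:

```js
// Sketch: instead of one huge bulkDocs call containing every visit doc, insert
// the visits of one page doc at a time, so memory use is bounded by the number
// of visits for a single URL. visitsForPageDoc() stands in for mapping browser
// history visit items to visit docs.
async function storeVisitsPerPageDoc(db, pageDocs) {
    for (const pageDoc of pageDocs) {
        const visitDocs = await visitsForPageDoc(pageDoc)
        if (visitDocs.length) {
            await db.bulkDocs(visitDocs)
        }
    }
}
```
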
Move import DB upsert to observer

- previously was done for each import item
- this led to the problem of actually performing the upsert after the fact when the user presses pause or cancel
- hence on resume of imports, those previously upserted would be found to already exist, leading to skipping, messed-up counts and #CONCURRENCY skips
- also set concurrency to 2 as it seems much more stable than 5 on larger inputs (still not completely happy with the performance of the promise-batcher with large inputs)

Rename import-doc-processor to import-item-processor

- minor refactor commit

Unify bookmarks + history at page fetch import stage

- there really isn't any difference between history and bookmarks at this stage; this stage only fetches the page data for the history/bookmark item
- hence it can be unified here and the mapping of bookmark items to docs + storing can happen in preimports, along with visits and page stubs (should take relatively small amount of time)
- given the last change, there is no real way in the UI estimates counts stage to differentiate bookmarks associated with page stubs from those with filled-out page docs (since it counts bookmark docs)
- this commit adds logic for each stored bookmark doc to check whether the associated page doc is a stub or a full doc
- could also be done with a map-reduce, but given the way Promises work, they would all run at the same time without batching logic, which may have big perf impacts with large amounts of bookmarks (done in a for-of loop instead; see the sketch below)
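
A sketch of that check; the `bookmarkDoc.pageId` link is assumed for illustration:

```js
// Sketch: count bookmark docs whose associated page doc is already filled out.
// Done sequentially in a for-of loop so the DB lookups aren't all fired at once.
async function countCompletedBookmarks(db, bookmarkDocs) {
    let completedCount = 0

    for (const bookmarkDoc of bookmarkDocs) {
        try {
            const pageDoc = await db.get(bookmarkDoc.pageId)
            if (!pageDoc.isStub) {
                completedCount += 1
            }
        } catch (err) {
            // No page doc found; treat as not yet imported
        }
    }

    return completedCount
}
```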

Move import-estimates out of imports-preparation module

- refactoring commit
- cleans up a lot of the preimports module
- moves the browser history + bookmarks API fetchers + filtering on existing URLs logic back to the index, as they're shared between preimports and estimates
- interfaces slightly changed to afford more customisation later, but defaults set to avoid any changes in current set-up function invocations
- this commit cleans up the changes that no longer need to exist to files originally from master
- should clean up the diff of this branch, to better focus on the actual additions (there shouldn't really be any big changes from master; most changes in this branch are completely new and separate)

Fix up the two favicon extraction methods

- in page visit, favicon URL extracted from the tab
- in imports it isn't
- however, due to some changes rebased in from upstream that slightly changed the `extract-page-content` logic (and thus the favicon), I needed to clean up the favicon logic so it doesn't attempt a remote fetch on page visits
- also change the identifier from `favIcon` to `favIconURI` in `fetch-page-data`

Remove unused import UI code

- originally committed in the first imports UI commit (c482468c5bfb85eaf050c9d88e01ef8d5d4a8e81) that was submitted months back upstream, which I included and built my UI off
- sadly never got around to cleaning up the unused stuff that still hung around
Make promise-batcher stateless

- was originally designed to manage the state of its inputs
- revisiting it, it could be greatly simplified by making it completely stateless and pushing the state management responsibility to the caller
- instead of passing in static input, which it then manages the state (to afford pausing etc), another async fn can be passed in to fetch input
- this also allows us to simplify the interface to just "start" and "stop"; "pause" is implicit if the user manages their state
- reducing more unneeded states; great stuff!
- also means I have to alter the `imports-connection-handler` (which currently uses it) slightly due to the changed interface

Give promise-batcher a class makeover

- essentially a refactoring commit; no real interface change
- just syntactic sugar but nice in this case as it was essentially creating the start-stop interface, holding a bit of state for the RxJS sub and abstracting away RxJS stuff
- for me, at least, it's a lot nicer to express this as a class rather than a closure
- also several minor bits of cleanup and putting things on "private" methods where it makes sense (rough shape sketched below)
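
Putting the last two commits together, the batcher could be shaped roughly like this. RxJS is shown with modern import paths, and the constructor parameter names are illustrative rather than the actual interface.

```js
import { from } from 'rxjs'
import { mergeMap } from 'rxjs/operators'

// Sketch: the stateless batcher as a thin class. The caller supplies an async
// input fetcher, a processing function and an observer; the only state held
// here is the RxJS subscription needed to support stop().
class PromiseBatcher {
    constructor({ getInputBatch, processItem, observer, concurrency = 5 }) {
        this._getInputBatch = getInputBatch
        this._processItem = processItem
        this._observer = observer
        this._concurrency = concurrency
        this._sub = undefined
    }

    async start() {
        const input = await this._getInputBatch() // caller decides what's still pending
        this._sub = from(input)
            .pipe(mergeMap(item => this._processItem(item), this._concurrency))
            .subscribe(this._observer)
    }

    stop() {
        if (this._sub) {
            this._sub.unsubscribe()
            this._sub = undefined
        }
    }
}
```

Since `getInputBatch` is asked for work on every `start()`, "pause" comes for free: the caller stops, and on the next start only still-unprocessed items are handed back.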

Simplify import item processing by minimising DB ops

- improves the speed of imports for large inputs by A LOT
- removed checking for existing docs to skip current input (essentially done in preimports as well when the import items are made; some may get through, but it's really not so important compared to the perf impact)
- remove fetching of associated page stub; instead we can store the ID of the stub on the import item; a bit extra space but -1 db op
- put the update op back inside the import-item processor: means that the counts may get slightly out of sync and the currently running import items will continue to run if user pauses/cancels. But now the DB will not get overloaded as it was before; the batcher will not move on until the DB has been updated
- also increased the concurrency back to 5 as it works great (even up to 10 worked alright, but I think safer as 5 for now)
- no longer need Location to be formed for readability
- all page data put inside `content` field
- still returning that favIconURI

Update imports logic to use new page model

- stub sets `content.title` as `title` for now
- import processor replaces old fields with the `content` field, and no longer cares what's inside (rough doc shapes sketched below)
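
For illustration, the stub and filled-out shapes under the new model look roughly like this; everything outside `content` and `isStub` (the `_id` prefix, field names, favicon placement) is an assumption:

```js
// Page stub created during pre-imports; only a title inside `content` for now
const pageStub = {
    _id: 'page/3e2f1a',              // idempotent _id, so re-runs don't duplicate it
    url: 'https://example.com/',
    isStub: true,
    content: { title: 'Example page' },
}

// After the import item processor runs, the whole `content` field is replaced
// with whatever fetch-page-data returned; the processor doesn't inspect it
const filledOutPageDoc = {
    ...pageStub,
    isStub: false,
    content: {
        title: 'Example page',
        fullText: '(extracted page text)',
        favIconURI: 'https://example.com/favicon.ico', // placement inside content is a guess
    },
}
```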

Set imports concurrency at 2

- this was the best tradeoff on my machine; let's see how it fares for others
- my machine's results with different concurrency levels are in the PR discussion
- due to the enhancer saving the current import state, it was skipping this loading screen on init
- it was also skipping loading screen on cancel/finish due to the finishImportsReducer always setting to IDLE state; altered this reducer to allow conditional state setting
- default timeout arg added to afford customisability
- switch from using xhr.ontimeout and xhr.onload to xhr.onreadystatechange to better handle non-200 HTTP responses (sketched below)
- previously, if an error was encountered, it was handled fine in the UI but the page stub wasn't marked off, hence it would be retried next time
- now it makes sure to at least mark off the page stub to avoid future error retries (these errors relate to bad HTTP requests; like URL no longer points to page)
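
The XHR change could look roughly like this; the function name, default timeout value and error shapes are all illustrative:

```js
// Sketch of the fetch-page-data request wrapper: configurable timeout, with
// onreadystatechange used so non-200 responses (and timeouts, which surface
// as status 0) reject explicitly.
function fetchPageDOM(url, { timeout = 10000 } = {}) {
    return new Promise((resolve, reject) => {
        const xhr = new XMLHttpRequest()
        xhr.open('GET', url)
        xhr.timeout = timeout
        xhr.responseType = 'document'

        xhr.onreadystatechange = function() {
            if (this.readyState !== XMLHttpRequest.DONE) {
                return
            }
            if (this.status === 200) {
                resolve(this.responseXML) // parsed HTML Document
            } else {
                reject(new Error(`Request for ${url} failed with status: ${this.status}`))
            }
        }

        xhr.send()
    })
}
```
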
- say user finishes downloading all bookmarks only. They now have 0 remaining
- before, the user could press the start btn and get into an empty download (it finishes immediately after pause + resume)
- now it can't be started
This reverts commit 7672f6e.

@oliversauter brought up a good point about losing the internet connection during an imports process. This commit would have then marked off all the failed downloads as full pages, meaning they wouldn't be re-attempted on the next import process.

We cannot simply check the internet connection for each import item, hence I think it's safest to revert this behaviour for now. Good suggestion to keep track of failed downloads for a future feature that re-attempts them in a later import process.
- brought into both options and overview
- at the moment just the `LoadingIndicator` has been moved to there
- fixes the issue with the `LoadingIndicator`'s CSS not being pulled into overview when it was in the `src/common-ui` module
- a bit of testing with my history and bookmarks gives much lower constants than originally planned
@poltak poltak merged commit 24ca8dd into master Jul 5, 2017
@blackforestboi
Member

Congratulations! Wow, what a massive feature. "That was a birth", as we would say in German :)

@bohrium272
Member

This will forever be the legendary PR at WorldBrain 🏆
Congratulations @poltak!

@poltak poltak deleted the feature/imports branch July 7, 2017 05:01
@blackforestboi blackforestboi changed the title History + bookmarks imports feature MTNI-217 ⁃ History + bookmarks imports feature Apr 19, 2018