Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness by using queues #139

Merged
merged 22 commits into from
Sep 29, 2020
Merged

Improve robustness by using queues #139

merged 22 commits into from
Sep 29, 2020

Conversation

Ndpnt
Copy link
Member

@Ndpnt Ndpnt commented Sep 21, 2020

Tested with 1040 documents from 566 services, took ~11 minutes for the whole process.

* master: (24 commits)
  Remove unused variable in filter
  Start tracking LibraryThing Community Guidelines
  Update LibraryThing
  Add links
  Improve documentation about configuration
  Document how to obtain logs
  Use long options in ops documentation
  Fix typo
  Improve function name
  Remove dead code
  Get file path and commit id in one git command
  Change the recipient of error emails
  Remove unnecessary check
  Ignore unknown services
  Factorize code
  Directly expose winston instance
  Making the servicesIds parameter mandatory
  Remove useless intermediate variable
  Simplify forEachDocumentOfServices
  Simplify init code
  ...
If there is no changes the commit method will do nothing
@Ndpnt Ndpnt requested a review from MattiSG September 21, 2020 15:52
* master:
  Throw error if the selected content to remove seems too large
  Avoid filter to throw error when country version content is not available
  Revert "Update to use remove filter instead of a dedicated one"
  Update to use remove filter instead of a dedicated one
  Filter out country version blinking in Google Terms of Services
  Bump node-fetch from 2.6.0 to 2.6.1
  Use better mail subject
  Change the recipient of error emails
  Send mail also for warnings
  Ensure errors in filters function won't stop the whole process
  Avoid unnecessary check
  Update Facebook.filters.js
Copy link
Member

@MattiSG MattiSG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this error when running locally (consistent in 2 runs, does not happen on master):

2020-09-24 18:09:32 error ASKfm — Terms of Service                                Error: Could not commit /Users/diplomatie/Documents/cgus/data/snapshots/ASKfm/Terms of Service.html with message "Start tracking ASKfm Terms of Service" due to error: "Error: error: le spécificateur de chemin 'ASKfm/Terms of Service.html' ne correspond à aucun fichier connu de git
"
    at Recorder.commit (file:///Users/diplomatie/Documents/cgus/src/app/history/recorder.js:51:13)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Recorder.record (file:///Users/diplomatie/Documents/cgus/src/app/history/recorder.js:24:17)
    at async Module.recordSnapshot (file:///Users/diplomatie/Documents/cgus/src/app/history/index.js:22:24)
    at async CGUs.recordSnapshot (file:///Users/diplomatie/Documents/cgus/src/app/index.js:151:47)
    at async CGUs.trackDocumentChanges (file:///Users/diplomatie/Documents/cgus/src/app/index.js:89:24)

Worst, this failure stops the entire process halfway. 17 documents are handled after the error, which seems consistent with the queue size.

src/app/history/recorder.js Show resolved Hide resolved
src/app/index.js Outdated
@@ -34,8 +38,25 @@ export default class CGUs extends events.EventEmitter {

async init() {
if (!this._serviceDeclarations) {
this.trackDocumentChangesQueue = async.queue(async ({ serviceDocument }) => this.trackDocumentChanges(serviceDocument), NUMBER_OF_TRACK_DOCUMENT_WORKERS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.trackDocumentChangesQueue = async.queue(async ({ serviceDocument }) => this.trackDocumentChanges(serviceDocument), NUMBER_OF_TRACK_DOCUMENT_WORKERS);
this.trackDocumentChangesQueue = async.queue(async ({ document }) => this.trackDocumentChanges(document), NUMBER_OF_TRACK_DOCUMENT_WORKERS);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really just the document. it's something like that :

{
  serviceId: …
  document: { … }
}

Which name could be better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I embed the serviceId in the document object and in this case I can name it document. What do you think?

src/app/index.js Outdated Show resolved Hide resolved
src/app/index.js Outdated Show resolved Hide resolved
src/app/index.js Outdated
Comment on lines 123 to 126
if (error instanceof InaccessibleContentError) {
return this.emit('inaccessibleContent', serviceId, type, error);
Object.keys(documents).forEach(type => {
callback({
serviceId,
document: {
type,
...documents[type]
}
throw error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we remove this event emission from the _forEachDocumentOf interface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because with queues errors can't be handle the same way.

src/app/index.js Outdated Show resolved Hide resolved
@Ndpnt Ndpnt requested a review from MattiSG September 29, 2020 11:56
Copy link
Member

@MattiSG MattiSG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally:

$ rm -rf data
$ npm run setup
   [new database]
$ npm start
   …
2020-09-29 14:47:38 info                                                          Tracked changes of 137 documents from 51 services.

then if I introduce an error in a filter:

…
2020-09-29 14:48:58 warn  Facebook Payments — Terms of Service                    Content inacessible: The document cannot be accessed or its content can not be selected: The filter function removeHelpButtons failed: Error: filter failure
…

Also works with one specific service:

› npm start Facebook\ Payments

> cgus@0.0.1 start /…/cgus
> node src/index.js "Facebook Payments"

2020-09-29 14:52:56 info                                                          Refiltering 2 documents from 1 services…
2020-09-29 14:52:56 warn  Facebook Payments — Terms of Service                    Content inacessible: The document cannot be accessed or its content can not be selected: The filter function removeHelpButtons failed: Error: filter failure
2020-09-29 14:52:56 info  Facebook Payments — Developer Terms                     No changes after filtering, did not record version.
2020-09-29 14:52:56 info                                                          Refiltered 2 documents from 1 services.

2020-09-29 14:52:56 info                                                          Start tracking changes of 2 documents from 1 services…
2020-09-29 14:52:57 info  Facebook Payments — Developer Terms                     Recorded snapshot with id fa01c05.
2020-09-29 14:52:57 info  Facebook Payments — Terms of Service                    Recorded snapshot with id c865bae.
2020-09-29 14:52:57 warn  Facebook Payments — Terms of Service                    Content inacessible: The document cannot be accessed or its content can not be selected: The filter function removeHelpButtons failed: Error: filter failure
2020-09-29 14:52:57 info  Facebook Payments — Developer Terms                     No changes after filtering, did not record version.
2020-09-29 14:52:57 info                                                          Tracked changes of 2 documents from 1 services.

If I introduce an unexpected error in the code, it fails immediately, logs the stack trace and exits with a >0 status code:

› npm start Apple

> cgus@0.0.1 start /…/cgus
> node src/index.js "Apple"

2020-09-29 14:58:52 info                                                          Refiltering 2 documents from 1 services…
2020-09-29 14:58:54 info  Apple — Privacy Policy                                  No changes after filtering, did not record version.
2020-09-29 14:58:54 info  Apple — Human Rights Policy                             No changes after filtering, did not record version.
2020-09-29 14:58:54 info                                                          Refiltered 2 documents from 1 services.

2020-09-29 14:58:54 info                                                          Start tracking changes of 2 documents from 1 services…
2020-09-29 14:58:54 error Apple — Privacy Policy                                  Error: fetcher failure
    at fetch (file:///…/cgus/src/app/fetcher/index.js:28:9)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async CGUs.trackDocumentChanges (file:///…/cgus/src/app/index.js:88:35)
/…/cgus/node_modules/async/dist/async.js:181
            setImmediate$1(e => { throw e }, err);
                                  ^

Error: fetcher failure
    at fetch (file:///…/cgus/src/app/fetcher/index.js:28:9)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async CGUs.trackDocumentChanges (file:///…/cgus/src/app/index.js:88:35)

Comment on lines +15 to +16
const MAX_PARALLEL_DOCUMENTS_TRACKS = 20;
const MAX_PARALLEL_REFILTERS = 20;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to distinguish the two?

Suggested change
const MAX_PARALLEL_DOCUMENTS_TRACKS = 20;
const MAX_PARALLEL_REFILTERS = 20;
const MAX_PARALLEL_OPERATIONS = 20;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I think it's better because even if it parallelization only improves by a little the refiltering time it improves significantly the tracking time. So the two numbers have different impacts

src/app/index.test.js Outdated Show resolved Hide resolved
This was referenced Sep 29, 2020
* master:
  Remove useless validation
  Improve validate script robustness
  Make arguments parsing more robust
  Avoid try/catch and make Mocha exit with non-zero code on unhandled rejections
  Allow diff between current branch and master
  Exit process with proper exit code when there is a failure
  Fix git command
  Add RSS english documentation
  Improve RSS french documentation
  Update README.fr.md
  Update README.fr.md
  Fix typos
  Document RSS feature
  Remove obsolete code and dependencies
  Specify ansible version in deploy github action
  Update deploy github action
  Do not track images from Google Privacy / ToS
  Limit attempts to restart to avoid looping on an unresolved error
@Ndpnt Ndpnt merged commit 59c61c6 into master Sep 29, 2020
@Ndpnt Ndpnt deleted the improve-robustness branch September 29, 2020 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants