-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve robustness by using queues #139
Conversation
Improve perf x3
* master: (24 commits) Remove unused variable in filter Start tracking LibraryThing Community Guidelines Update LibraryThing Add links Improve documentation about configuration Document how to obtain logs Use long options in ops documentation Fix typo Improve function name Remove dead code Get file path and commit id in one git command Change the recipient of error emails Remove unnecessary check Ignore unknown services Factorize code Directly expose winston instance Making the servicesIds parameter mandatory Remove useless intermediate variable Simplify forEachDocumentOfServices Simplify init code ...
If there is no changes the commit method will do nothing
* master: Throw error if the selected content to remove seems too large Avoid filter to throw error when country version content is not available Revert "Update to use remove filter instead of a dedicated one" Update to use remove filter instead of a dedicated one Filter out country version blinking in Google Terms of Services Bump node-fetch from 2.6.0 to 2.6.1 Use better mail subject Change the recipient of error emails Send mail also for warnings Ensure errors in filters function won't stop the whole process Avoid unnecessary check Update Facebook.filters.js
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had this error when running locally (consistent in 2 runs, does not happen on master
):
2020-09-24 18:09:32 error ASKfm — Terms of Service Error: Could not commit /Users/diplomatie/Documents/cgus/data/snapshots/ASKfm/Terms of Service.html with message "Start tracking ASKfm Terms of Service" due to error: "Error: error: le spécificateur de chemin 'ASKfm/Terms of Service.html' ne correspond à aucun fichier connu de git
"
at Recorder.commit (file:///Users/diplomatie/Documents/cgus/src/app/history/recorder.js:51:13)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async Recorder.record (file:///Users/diplomatie/Documents/cgus/src/app/history/recorder.js:24:17)
at async Module.recordSnapshot (file:///Users/diplomatie/Documents/cgus/src/app/history/index.js:22:24)
at async CGUs.recordSnapshot (file:///Users/diplomatie/Documents/cgus/src/app/index.js:151:47)
at async CGUs.trackDocumentChanges (file:///Users/diplomatie/Documents/cgus/src/app/index.js:89:24)
Worst, this failure stops the entire process halfway. 17 documents are handled after the error, which seems consistent with the queue size.
src/app/index.js
Outdated
@@ -34,8 +38,25 @@ export default class CGUs extends events.EventEmitter { | |||
|
|||
async init() { | |||
if (!this._serviceDeclarations) { | |||
this.trackDocumentChangesQueue = async.queue(async ({ serviceDocument }) => this.trackDocumentChanges(serviceDocument), NUMBER_OF_TRACK_DOCUMENT_WORKERS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.trackDocumentChangesQueue = async.queue(async ({ serviceDocument }) => this.trackDocumentChanges(serviceDocument), NUMBER_OF_TRACK_DOCUMENT_WORKERS); | |
this.trackDocumentChangesQueue = async.queue(async ({ document }) => this.trackDocumentChanges(document), NUMBER_OF_TRACK_DOCUMENT_WORKERS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really just the document. it's something like that :
{
serviceId: …
document: { … }
}
Which name could be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or I embed the serviceId
in the document object and in this case I can name it document
. What do you think?
src/app/index.js
Outdated
if (error instanceof InaccessibleContentError) { | ||
return this.emit('inaccessibleContent', serviceId, type, error); | ||
Object.keys(documents).forEach(type => { | ||
callback({ | ||
serviceId, | ||
document: { | ||
type, | ||
...documents[type] | ||
} | ||
throw error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we remove this event emission from the _forEachDocumentOf
interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because with queues errors can't be handle the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested locally:
$ rm -rf data
$ npm run setup
[new database]
$ npm start
…
2020-09-29 14:47:38 info Tracked changes of 137 documents from 51 services.
then if I introduce an error in a filter:
…
2020-09-29 14:48:58 warn Facebook Payments — Terms of Service Content inacessible: The document cannot be accessed or its content can not be selected: The filter function removeHelpButtons failed: Error: filter failure
…
Also works with one specific service:
› npm start Facebook\ Payments
> cgus@0.0.1 start /…/cgus
> node src/index.js "Facebook Payments"
2020-09-29 14:52:56 info Refiltering 2 documents from 1 services…
2020-09-29 14:52:56 warn Facebook Payments — Terms of Service Content inacessible: The document cannot be accessed or its content can not be selected: The filter function removeHelpButtons failed: Error: filter failure
2020-09-29 14:52:56 info Facebook Payments — Developer Terms No changes after filtering, did not record version.
2020-09-29 14:52:56 info Refiltered 2 documents from 1 services.
2020-09-29 14:52:56 info Start tracking changes of 2 documents from 1 services…
2020-09-29 14:52:57 info Facebook Payments — Developer Terms Recorded snapshot with id fa01c05.
2020-09-29 14:52:57 info Facebook Payments — Terms of Service Recorded snapshot with id c865bae.
2020-09-29 14:52:57 warn Facebook Payments — Terms of Service Content inacessible: The document cannot be accessed or its content can not be selected: The filter function removeHelpButtons failed: Error: filter failure
2020-09-29 14:52:57 info Facebook Payments — Developer Terms No changes after filtering, did not record version.
2020-09-29 14:52:57 info Tracked changes of 2 documents from 1 services.
If I introduce an unexpected error in the code, it fails immediately, logs the stack trace and exits with a >0 status code:
› npm start Apple
> cgus@0.0.1 start /…/cgus
> node src/index.js "Apple"
2020-09-29 14:58:52 info Refiltering 2 documents from 1 services…
2020-09-29 14:58:54 info Apple — Privacy Policy No changes after filtering, did not record version.
2020-09-29 14:58:54 info Apple — Human Rights Policy No changes after filtering, did not record version.
2020-09-29 14:58:54 info Refiltered 2 documents from 1 services.
2020-09-29 14:58:54 info Start tracking changes of 2 documents from 1 services…
2020-09-29 14:58:54 error Apple — Privacy Policy Error: fetcher failure
at fetch (file:///…/cgus/src/app/fetcher/index.js:28:9)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async CGUs.trackDocumentChanges (file:///…/cgus/src/app/index.js:88:35)
/…/cgus/node_modules/async/dist/async.js:181
setImmediate$1(e => { throw e }, err);
^
Error: fetcher failure
at fetch (file:///…/cgus/src/app/fetcher/index.js:28:9)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async CGUs.trackDocumentChanges (file:///…/cgus/src/app/index.js:88:35)
const MAX_PARALLEL_DOCUMENTS_TRACKS = 20; | ||
const MAX_PARALLEL_REFILTERS = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to distinguish the two?
const MAX_PARALLEL_DOCUMENTS_TRACKS = 20; | |
const MAX_PARALLEL_REFILTERS = 20; | |
const MAX_PARALLEL_OPERATIONS = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I think it's better because even if it parallelization only improves by a little the refiltering time it improves significantly the tracking time. So the two numbers have different impacts
* master: Remove useless validation Improve validate script robustness Make arguments parsing more robust Avoid try/catch and make Mocha exit with non-zero code on unhandled rejections Allow diff between current branch and master Exit process with proper exit code when there is a failure Fix git command Add RSS english documentation Improve RSS french documentation Update README.fr.md Update README.fr.md Fix typos Document RSS feature Remove obsolete code and dependencies Specify ansible version in deploy github action Update deploy github action Do not track images from Google Privacy / ToS Limit attempts to restart to avoid looping on an unresolved error
Tested with 1040 documents from 566 services, took ~11 minutes for the whole process.