uses priority queueing when updating prod, adds database and checksum optimizations #145
removes unused tweak code to modify solr records; general code clean for pep8 compliance
This looks very good! I have some suggestions; maybe the biggest one is transforming the status field into three independent ones so that we do not overwrite that information (now that things will happen in parallel, chances are higher). Let me know what you think!
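A minimal sketch of what independent status fields could look like. The field names (solr_status, metrics_status, datalinks_status), the mark_status helper, and the plain dict standing in for a database row are all assumptions for illustration, not the pipeline's actual schema:

```python
# Hypothetical layout: one status field per production target instead of
# a single shared `status` field that each update overwrites.
STATUS_FIELDS = {
    'solr': 'solr_status',
    'metrics': 'metrics_status',
    'links': 'datalinks_status',
}


def mark_status(record, target, status):
    """Set only the status field that belongs to `target` (hypothetical helper)."""
    record[STATUS_FIELDS[target]] = status
    return record


# A dict stands in for a database row in this sketch.
record = {'bibcode': '2020Test....1....A'}
mark_status(record, 'solr', 'success')
mark_status(record, 'metrics', 'metrics-failed')
# The solr outcome is still available after the later metrics failure.
```

With a single status field, the second call would have replaced the solr result; with separate fields, parallel updates to different targets cannot clobber each other's outcome.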
renamed config variable from update_timestamps to set_processed_timestamp because it controlled setting the processed timestamps set when we update a production data store, not the update timestamps that track when data is received from other pipelines
After a second, closer inspection, I found issues with the priorities (they are not actually setting any priority), bugs (changes in behavior with respect to what we have in HEAD), and code that can be deduplicated and simplified. It was easier for me to reason by modifying the code to see whether what I had in mind made sense and could be done; it looks like it is possible. See #148: that PR illustrates what I have in mind. I did not test anything in that PR; it is just for us to talk and maybe incorporate the patterns and changes if they are indeed valid.
I suggested some more tweaks. Also, it is hard to make run.py consistent with so many argument flags, but let's see if these suggestions make --update-processed more consistent.
we cannot use the enum library because it conflicts with enum34 in dev-requirements.txt
I have one major behavior change (i.e., always update the processed field) and one small request (sys.exit(1) if reindex fails). The former is key; otherwise we will not index only the records that need to be indexed.
EDIT: And one more request, to change logger.error to logger.exception!
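The sys.exit(1) request can be sketched as follows. The reindex and main functions here are hypothetical stand-ins, not the actual run.py code; the only point is that a failed reindex ends the process with a non-zero exit code:

```python
import sys


def reindex():
    """Hypothetical stand-in for the reindex step; returns True on success."""
    return False


def main(reindex_func=reindex):
    """Run the reindex step and exit with a non-zero code if it fails."""
    if not reindex_func():
        # A non-zero exit code lets cron jobs or wrapper scripts detect the failure.
        sys.exit(1)
```

Calling main() with a failing reindex raises SystemExit with code 1, which the interpreter turns into the process exit status.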
self.logger.exception('Failed posting individual bibcode %s to metrics', failed_bibcode)
failed_bibcodes.append(failed_bibcode)
if failed_bibcodes and update_processed:
    self.mark_processed(failed_bibcodes, checksums=None, type='metrics', status='metrics-failed')
except Exception as e:
    trans.rollback()
    self.logger.error('DB failure: %s', e)
Use self.logger.exception instead of self.logger.error.
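A small, self-contained illustration of the difference, using only the standard logging module. The logger name and the simulated error are assumptions for the example; the two log calls mirror the ones discussed in this review:

```python
import io
import logging

# Capture log output in memory so the two calls can be compared.
stream = io.StringIO()
logger = logging.getLogger('adsmp-example')  # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

try:
    raise RuntimeError('connection lost')  # simulated database failure
except Exception as e:
    # logger.error records only the formatted message.
    logger.error('DB failure: %s', e)
    error_output = stream.getvalue()
    # logger.exception logs at ERROR level and appends the current traceback.
    logger.exception('DB failure')
    exception_output = stream.getvalue()[len(error_output):]
```

Inside an except block, logger.exception keeps the full traceback in the log, which makes it easier to find where the failure originated.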
and sys.exit on reindex fail
and added sys.exit on reindex fail
fix mock call to datetime, previously it returned a mock not a datetime
I think we are ready to go! There is only one minor forgotten change that was not addressed, where I was asking to change:
ADSMasterPipeline/adsmp/app.py
Line 412 in e425021
self.logger.error('DB failure: %s', e)
with:
self.logger.exception('DB failure')
But apart from that, we should be good to merge. Thanks for all the work done here!
fix bug in code that detected when solr index updating was complete; also wait for queue writing to solr to empty
pick up Roman's changes to remove aff_raw from master
handle case when no metrics info is available
…MasterPipeline into PR145conflicts
removes unused tweak code to modify solr records
use upsert to update metrics database
general code clean for pep8 compliance