Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reindex in a background thread on demand. #105

Merged
merged 1 commit into from Nov 8, 2016

Conversation

matthewwardrop
Copy link
Collaborator

@matthewwardrop matthewwardrop commented Nov 2, 2016

Currently, the index is only updated once immediately after being launched. Reindexing the repository requires restarting the app, which can take some time depending on the size of the repository. This patch changes this behaviour to reindex on demand. That is, every time a request is made, knowledge repo checks whether it should update the index. Knowledge repo will never update the index if another index is already underway, or if it is less than five minutes since the last index. It will otherwise re-index if the KnowledgeRepository(-ies) report a newer version than what is recorded in the index. When an index is being performed, it is done in a background thread. The status of the index is noted in the footer of the page, which indicated how long ago the index was performed (or if the index is currently being performed). An example is shown below:

screen shot 2016-11-02 at 2 19 48 pm

This PR would currently have some race conditions when using SQLite databases due to the frequent use of db_session.flush() in _update_index, and so background threading for the index update is disabled for sqlite databases.

Copy link
Collaborator

@NiharikaRay NiharikaRay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - a couple of style nits, mostly around how we handle Nones.

tag_exists = Tag(name=tag)
db_session.add(tag_exists)
db_session.commit()
Tag(name=tag)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] can you just do this since duplicate tags won't be added?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I crafted the model for Tag objects means that tags are deduplicated as a matter of course. There can never be duplicate tags when done this way.


def seconds_since_index():
last_update = IndexMetadata.get_last_update('lock', 'index')
if last_update is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if last_update:
   return (datetime.datetime.utcnow() - last_update).total_seconds()
return None

return 'Currently indexing'
seconds = seconds_since_index()
if seconds is None:
return "Never"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same None comment from above (None is inherently falsey)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seconds is unlikely to be (but could be) zero; so not changing here.

@@ -104,7 +104,8 @@
</div>

<div class="footer">
Served with ♥ by <a href="https://github.com/airbnb/knowledge-repo">Knowledge Repo</a> <a title='{{version_revision}}' href="https://github.com/airbnb/knowledge-repo/releases/tag/v{{ version }}">{{ version }}</a>
Served with ♥ by <a href="https://github.com/airbnb/knowledge-repo">Knowledge Repo</a> <a title='{{version_revision}}' href="https://github.com/airbnb/knowledge-repo/releases/tag/v{{ version }}">{{ version }}</a><br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha in theory we should make this an actual glyphicon

@classmethod
def set(cls, type, name, value):
m = db_session.query(IndexMetadata).filter(IndexMetadata.type == type).filter(IndexMetadata.name == name).first()
if m is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same None comment

@classmethod
def get_last_update(cls, type, name):
m = db_session.query(IndexMetadata).filter(IndexMetadata.type == type).filter(IndexMetadata.name == name).first()
if m is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same None comment

__tablename__ = 'index_metadata'

id = db.Column(db.Integer, nullable=False, primary_key=True)
type = db.Column(db.String(255), nullable=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you imagining that type and name will be jointly unique?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if so, we should endcode this table that way

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also do we need type and name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; tuples of type and name should be unique. I'll encode this into the table as you suggest. We need type and name for things like:
type='repository_revision', name='<repository uri>', value='<revision>'

It would be possible to cram these into one field, but I felt like that was a bit clumsy. Agreed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup sounds good

seconds_check = seconds_since_index_check()
if is_indexing() or (seconds is not None) and (seconds < 5 * 60) and (seconds_check < 5 * 60):
return False
try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this reads a little strangely without an except block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is actually kosher Python code. It's the neatest way to guarantee that code block always runs even if there are exceptions (which are not caught).

if not current_app.config.get('REPOSITORY_INDEXING_ENABLED', True):
return False

seconds = seconds_since_index()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small efficiency gains if you bail early here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I don't think the early bail is worth it though. We can revisit this if we ever find the performance suffering.

@matthewwardrop matthewwardrop merged commit 7300433 into master Nov 8, 2016
@matthewwardrop matthewwardrop deleted the mw_reindex_ondemand branch November 8, 2016 09:07
@sryza
Copy link
Contributor

sryza commented Nov 8, 2016

@matthewwardrop when you say "so background threading for the index update is disabled for sqlite databases", does this mean that, when using the default database for the index, web server restarts are still required to pick up repository changes?

@matthewwardrop
Copy link
Collaborator Author

Hi @sryza! No... It just means that when the git repository is updated, there will be a synchronous updating of the repository index. Depending on the size of the repository, this could be slow... But I strongly advise against using sqlite databases in production. For testing or for small servers it probably is not terrible.

@sryza
Copy link
Contributor

sryza commented Nov 8, 2016

Great, thanks for the clarification!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants