[experimental] learn keyword preferences #379
Conversation
First of all, this is great. I'm impressed at such a quick turnaround. Here are some possible solutions to the open questions:
Let's keep the learning separate between concurrent pgcli sessions, which is what your implementation does. Reason: the same user might start two sessions, one as a root user performing maintenance tasks and the other as a regular user doing exploration. One suggestion: initially the KeywordCounter starts empty and then learns during the session. But we have a corpus of query strings in our history file. Why not take the latest 100 entries and pre-populate the KeywordCounter?
Yes! Our history files are the way we save the keyword preferences. At the start of a pgcli session we'll read the last 100 or 1000 entries in the history file and use them to train our model (KeywordCounter). I have a few more thoughts; I'll leave them in a bit, once I've had a chance to formulate them in a coherent manner.
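The idea above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: it assumes a plain-text history file with one query per line (the real history format may differ), and uses a made-up keyword subset.

```python
from collections import Counter

# Assumed subset of SQL keywords, for illustration only.
KEYWORDS = {'SELECT', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'BY', 'LIMIT'}

def train_from_history(path, n_recent=100):
    """Pre-populate keyword counts from the last n_recent history entries."""
    counts = Counter()
    try:
        with open(path) as f:
            lines = f.readlines()
    except OSError:
        return counts  # no history file yet: start empty
    # Slicing past the start of the list is safe in Python; a short history
    # simply yields fewer entries.
    for line in lines[-n_recent:]:
        for word in line.upper().split():
            if word in KEYWORDS:
                counts[word] += 1
    return counts
```

A completer could then seed its counter with the result at session start and keep incrementing it as the user issues queries.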
@@ -30,6 +31,7 @@ def __init__(self, smart_completion=True, pgspecial=None):
        super(PGCompleter, self).__init__()
        self.smart_completion = smart_completion
        self.pgspecial = pgspecial
        self.keyword_counter = KeywordCounter(self.keywords)
Instead of calling it self.keyword_counter, let's call it self.recommendation_engine or something more generic. The idea is that in the future we might have different learning algorithms: one based on counting, one based on weighted frequency, or neural networks, or something equally crazy. :)
We should also make KeywordCounter() take an initial corpus of input strings, to train the model with the query history.
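A minimal sketch of that more generic shape might look like this. The class name, method names, and tokenization are all assumptions for illustration; a counting-based engine is just one implementation that could later be swapped for a fancier one.

```python
from collections import Counter

class RecommendationEngine:
    """Hypothetical generic interface: a counting-based implementation
    whose constructor accepts an optional training corpus (query history)."""

    def __init__(self, keywords, corpus=None):
        self.keywords = {k.upper() for k in keywords}
        self.counts = Counter()
        for query in corpus or []:
            self.learn(query)

    def learn(self, query):
        # Naive whitespace tokenization, for illustration only.
        for word in query.upper().split():
            if word in self.keywords:
                self.counts[word] += 1

    def rank(self, candidates):
        # Most frequently used keywords first; ties fall back to
        # alphabetical order so the result is deterministic.
        return sorted(candidates, key=lambda k: (-self.counts[k.upper()], k))
```

Swapping in a weighted-frequency or Markov-based engine would then only require implementing the same learn/rank pair.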
Right now we're only using this learning algorithm for keywords. Why not use it for tables/columns/views etc.? The fuzzy matcher sorts the list based on a tuple with two values: the length of the matching group and the starting position of the match. Let's add a third value to the tuple, which will be the frequency of a table/column/view. This means when the user types
Only the keywords should be pre-populated from the history file. The table/column/view names should only be learned from the start of a session and should not persist between sessions. The reason being, each session will be a specific task which requires a set of tables to be accessed, but that set will vary from session to session. We should also reset these frequency counters for tables/cols/views whenever we change the database. The end. :)
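The three-element sort key described above could be sketched like this. This is not pgcli's actual matcher; the regex construction mimics the usual fuzzy-match pattern, and the usage counts are invented for the example.

```python
import re
from collections import Counter

def fuzzy_sort(word, names, counts):
    """Rank fuzzy matches by (match length, start position, -frequency).

    The third element breaks ties in favor of more frequently used
    names; negating it lets a plain ascending sort do the right thing.
    """
    # Classic fuzzy regex: 'usr' -> u.*?s.*?r (lazy, case-insensitive).
    pat = re.compile('.*?'.join(map(re.escape, word)), re.IGNORECASE)
    matched = []
    for name in names:
        m = pat.search(name)
        if m:
            key = (len(m.group()), m.start(), -counts[name])
            matched.append((key, name))
    return [name for _, name in sorted(matched)]
```

With equal-quality matches, the name used five times this session would now beat the one used twice.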
Good call. This would also be a good time to tweak how sorting is done. Right now
Ok wow this got complicated quickly. Quick overview of the most recent set of changes:
@darikg Wow, this looks great. I have a question about this part: https://github.com/dbcli/pgcli/blob/darikg/learn-keyword-prefs/pgcli/packages/prioritization.py#L40 The comment says that we can't rely on sqlparse to recognize keywords, because sqlparse is database-agnostic. But we're only parsing keywords out of our own history, right? So that would be database-specific. Am I missing something?
@j-bennet So in order to count identifier usages I'm iterating over the input like this:
For symmetry & simplicity it would be nice to be able to also do:
But that won't catch all keywords properly unless sqlparse knows to tokenize them as keywords. So prioritization.py has its own ad hoc keyword tokenization machinery.
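That ad hoc approach might look something like the following sketch. It is not the actual prioritization.py code: the keyword list is an assumed subset, and quoted strings aren't handled (a literal containing a keyword-like word would be miscounted).

```python
import re

# Assumed subset including a Postgres-specific keyword (ILIKE) that a
# database-agnostic tokenizer might not tag as a keyword.
KEYWORDS = {'SELECT', 'FROM', 'WHERE', 'ILIKE', 'RETURNING'}

WORD = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def extract_keywords(sql):
    """Match word-shaped tokens against our own keyword list instead of
    relying on sqlparse's database-agnostic keyword classification."""
    return [w.upper() for w in WORD.findall(sql) if w.upper() in KEYWORDS]
```

Everything the regex finds that isn't in the keyword list would then be a candidate identifier (table/column/view name) for the other counter.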
n_max_recent = 100
n_recent = history and min(n_max_recent, len(history))
if n_recent:
    for recent in history[-n_recent:]:
This could be rewritten as:
n_recent = 100
if history:
    for recent in history[-n_recent:]:
If you ask for the last 100 items and history is only 10 items, Python will give you only the 10 items. No need to be cautious about array out of bounds.
Proof:
>>> a = [1,2,3]
>>> a[-5:]
[1, 2, 3]
>>> a[-2:]
[2, 3]
>>> a[-1:]
[3]
This is quite the PR. :) Nice job. I've left a bunch of suggestions inline. I'm not sure I understand why utils was changed to disable all the database tests, though. Thanks for taking the time to tackle this. The next release of pgcli is gonna kick butt. :)
Ugh I'm really sorry about that.
I was thinking actually we might want to do a new release before merging this, so it has some time to live in master before being released into the wild.
That's a valid request. I can get started with the release process unless one of you is interested in doing it.
@darikg Oh, I see. I thought sqlparse covered all possible keywords, but I guess it only covers a rather general subset and would not catch ones specific to Postgres. Makes sense.
Sorry about the false alarm. The column names are available, but they're buried deeper in the list. This makes me wonder if mixing all the completions together and then sorting based on priority is the right way to go.
That's happening because the prioritizer is loading keyword prevalence from your history and not names, so keywords are showing up with higher priority. There are a couple of ways to handle this.
I went ahead and tried option 2 because it's so easy.
The new implementation works much better. I think we can use this solution for now while we implement the second order suggestions. I did notice one weird bug which is in master as well as this PR:
Until I type
One weird thing with function suggestions as of #357 is that some functions are double-listed -- once in the hardcoded functions list, and once in the database metadata. Not sure if it's related, but I think
I checked
Just found this article: http://nicolewhite.github.io/2015/10/05/improving-cycli-autocomplete-markov-chains.html Markov-chain-based suggestion.
That's awesome. It looks like she's looking only at
Now that 0.20.0 is out, let's get this merged into master and give it a thorough testing. @darikg Can you rebase this branch to bring it up to date?
I'll finish rebasing this soon
Ping! @darikg If you're busy I can take care of rebasing it. Just let me know.
force-pushed from 40970fd to 471b058
squashed & rebased
Thanks @darikg. 🚅
We've had this merged into master for about 2 weeks now. How does everyone feel? Is it working out? When I type
Here's an example:
I don't have a great suggestion to counter this yet, but I'd like to get feedback from the team, as well as some hardcore pgcli users, about the current behavior. @dbcli/pgcli-core I'll start collecting a list of users who have been active in reporting issues or sending occasional PRs. If you have user suggestions, please leave their user names here and we'll contact them seeking feedback.
@amjith that definitely seems wrong. I'll work on getting a test set up for it. But yeah, in general, I'd love more feedback.
Spinning this off from the discussion in #377 since they're (mostly) orthogonal. I was originally a little skeptical of using any learning algorithm, because it operates on the keyword category when it could take greater advantage of syntactical restrictions. Given SELECT * F, FREEZE is a pretty silly suggestion because it's just not valid SQL. On the other hand, basing suggestions on syntax requires syntactically valid input, and a simple learning algorithm could be much more reliable in the middle of editing temporarily invalid queries. In the long run, the learning approach could move to estimating second- or third-order keyword transitions and be pretty powerful.
So this PR offers a basic experiment of the learning approach: measure zeroth-order keyword probabilities, and rank keywords thereby.
Open questions: