Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[experimental] learn keyword preferences #379

Merged
merged 6 commits into from
Nov 10, 2015
Merged

Conversation

darikg
Copy link
Contributor

@darikg darikg commented Oct 7, 2015

Spinning this off from discussion in #377 since they're (mostly) orthogonal. I was originally a little skeptical of using any learning algorithm because

  1. It might be annoying to have different suggestions on different installations
  2. It might preclude improving the base suggestion engine. Right now sqlcomplete suggests only the very vague keyword category when it could take greater advantage of syntactical restrictions. Given SELECT * F, FREEZE is a pretty silly suggestion because it's just not valid sql.

On the other hand, basing suggestions on syntax requires syntactically valid input, and a simple learning algorithm could be much more reliable in the middle of editing temporarily invalid queries. In the long run, the learning approach could move to estimating second- or third-order keyword transitions and be pretty powerful.

So this PR offers a basic experiment of the learning approach. Measure zeroth-order keyword probabilities, and rank keywords thereby.

Open questions:

  1. How should learning be shared between concurrent pgcli sessions? One global state or one state per pgcli instance?
  2. Should we (and if so, how do) we save keyword preferences between sessions?

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 18161db on dbcli:darikg/learn-keyword-prefs into 54acaed on dbcli:master.

@amjith
Copy link
Member

amjith commented Oct 8, 2015

First of all this is great. I'm impressed at such a quick turn around.

Here are some possible solutions to the open questions:

How should learning be shared between concurrent pgcli sessions? One global state or one state per pgcli instance?

Let's keep the learning separate between concurrent pgcli sessions, which is what your implementation does.

Reason: The same user might start two sessions, one as a root user performing maintenance tasks and the other session as a regular user to do exploration.

One suggestion: Initially the KeywordCounter start empty and then learns during the session. But we have a corpus of query strings in our history file. Why not take the latest 100 entries and pre-populate the KeywordCounter?

Should we (and if so, how do) we save keyword preferences between sessions?

Yes! Our history files are the way we save the keyword preferences. At the start of a pgcli session we'll read the last 100 or 1000 entries in the history file and use it to train our model (KeywordCounter).

I have a few more thoughts, I'll leave them in a bit once I've had a chance to formulate them in a coherent manner.

@@ -30,6 +31,7 @@ def __init__(self, smart_completion=True, pgspecial=None):
super(PGCompleter, self).__init__()
self.smart_completion = smart_completion
self.pgspecial = pgspecial
self.keyword_counter = KeywordCounter(self.keywords)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of calling it self.keyword_counter let's call it self.recommendation_engine or something more generic.

The idea is that in the future we might have different learning algorithms, one based on counting, one based on weighted frequency, or neural networks or something equally crazy etc. :)

We should also make KeywordCounter() take in a initial corpus of input strings to train the model with the query history.

@amjith
Copy link
Member

amjith commented Oct 8, 2015

Right now we're only using this learning algorithm for Keywords. Why not use it for tables/columns/views etc?

The fuzzy matcher sorts the list based on a tuple with two values, length of the matching group and the starting position of the match. Let's add a third value to the tuple which will be the frequency of a table/column/view. This means when the user types SELECT * FROM the list that pops up will have the table names listed in the most commonly used names. As soon as they start typing it'll reorder the list based on their typing and use the frequency item in the tuple to break ties.

Only the keywords should be pre-populated from the history file. The table/column/view names should only learn from the start of a session and should not persist between sessions. The reason being, each session will be a specific task which requires a set of tables to be accessed but it'll vary from session to session. We should also reset these frequency counters for tables/cols/views whenever we change the database.

The end. :)

@darikg
Copy link
Contributor Author

darikg commented Oct 8, 2015

Right now we're only using this learning algorithm for Keywords. Why not use it for tables/columns/views etc?

Good call. This would also be a good time to tweak how sorting is done. Right now find_matches sorts within suggestion categories, but not across them. So the completion menu lists all matching tables, then views, then schemas, etc. Instead, sorting should be done after all matches have been found, based on the three-element tuple you just described. So this is kinda dove-tailing with discussion in #377 that we should remove sorting from find_matches

@darikg
Copy link
Contributor Author

darikg commented Oct 10, 2015

Ok wow this got complicated quickly. Quick overview of the most recent set of changes:

  • We have a new package prioritization, with one class PrevalenceCounter
    • PrevalanceCounter counts keyword and identifier name prevalences separately
  • PGCompleter has a member prioritizer, which is initialized to an instance of PrevalenceCounter, but like @amjith said, this could become some other type of prioritizer like a neural network
  • pgcompleter has a new class Match, which is a simple namedtuple (completion, priority)
  • find_matches now returns a list of matches, instead of a list of completions.
    • The list is unsorted.
    • The priority is a nested tuple of (fuzzy_match_tuple, prevalance)
    • find_match arguments start_only and fuzzy are collapsed into a single parameter mode
      • mode = 'keyword' ==> start_only matching and keyword prevalence
      • mode = 'name' ==> fuzzy matching and name prevalence
      • TODO there's a small ambiguity here with hardcoded functions and datatypes, which use start_only matching but are names, not keywords.
    • matches are collapsed across all suggestion types, and then sorted by priority
    • Around this point I got sick of the linebreaks in pgcompleter.get_completions() and broke out everything into submethods get_table_matches, get_schema_matches etc. and use dictionary lookup to dispatch on suggestion type. This gives back two levels of indentation and was a big improvement in readability imho.
  • On pgcli startup, we read the most recent 100 lines from file history and learn keyword (but not name) prevalence. This is done in the completion refresher background thread so it shouldn't hurt startup time
  • Finally, learned prevalences are persisted across completion refreshes, but changing databases wipes out the learned name prevalences

@j-bennet
Copy link
Contributor

@darikg Wow, this looks great. I have a question about this part:

https://github.com/dbcli/pgcli/blob/darikg/learn-keyword-prefs/pgcli/packages/prioritization.py#L40

The comment says that we can't rely on sqlparse to recognize keywords, because sqlparse is database agnostic. But we're only parsing keywords out of our own history, right? so that would be database specific. Am I missing something?

@darikg
Copy link
Contributor Author

darikg commented Oct 10, 2015

@j-bennet So in order to count identifier usages I'm iterating over the input like this:

for parsed in sqlparse.parse(text):
    for token in parsed.flatten():
        if token.ttype in Name:
            self.name_counts[token.value] += 1

For symmetry & simplicity it would be nice to be able to also do:

        elif token.ttype in Keyword:
            self.keyword_counts[token.value] += 1

But that won't catch all keywords properly unless sqlparse knows to tokenize them as keyword. So prioritization.py has its own ad hoc keyword tokenization machinery.

n_max_recent = 100
n_recent = history and min(n_max_recent, len(history))
if n_recent:
for recent in history[-n_recent:]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be rewritten as:

n_recent = 100
if history:
    for recent in history[-n_recent:]:

If you ask for last 100 items and history is only 10 items, python will give you only the 10 items. No need to be cautious about array out of bounds.

Proof:

>>> a = [1,2,3]
>>> a[-5:]
[1, 2, 3]
>>> a[-2:]
[2, 3]
>>> a[-1:]
[3]

@amjith
Copy link
Member

amjith commented Oct 11, 2015

This is quite the PR. :)

Nice job. I've left a bunch of suggestions inline. I'm not sure I understand why the utils was changed to disable all the database tests.

Thanks for taking the time to tackle this. The next release of pgcli is gonna kick butt. :)

@landscape-bot
Copy link

Code Health
Repository health increased by 0.46% when pulling 113e774 on darikg/learn-keyword-prefs into 54acaed on master.

@darikg
Copy link
Contributor Author

darikg commented Oct 11, 2015

I'm not sure I understand why the utils was changed to disable all the database tests.

Ugh I'm really sorry about that.

Thanks for taking the time to tackle this. The next release of pgcli is gonna kick butt. :)

I was thinking actually we might want to do a new release before merging this, so it has some time to live in master before being released to the wild

@amjith
Copy link
Member

amjith commented Oct 11, 2015

I was thinking actually we might want to do a new release before merging this, so it has some time to live in master before being released to the wild.

That's a valid request. I can get started with the release process unless one of you is interested in doing.

@j-bennet
Copy link
Contributor

@darikg Oh I see. I thought sqlparse covers all possible keywords, but I guess it only covers the rather general subset and it would not catch ones specific to postgres. Makes sense.

@amjith
Copy link
Member

amjith commented Oct 12, 2015

Suggestions after the WHERE clause is not suggesting columns names anymore.

1__2_0_python_-__pgcli___users_amjith_dropbox_code_python_pgcli___tmux_

The even more surprising aspect of this find is that we don't have a test for this simple case SELECT * FROM table_name WHERE.

This seems isolated to this branch. Master seems fine.

@amjith
Copy link
Member

amjith commented Oct 12, 2015

Sorry about the false alarm. The column names are available but they're buried deeper in the list.

This makes me wonder if mixing all the completions together and then sorting based on priority is the right way to go.

@darikg
Copy link
Contributor Author

darikg commented Oct 12, 2015

That's happening because the prioritizer is loading keyword prevalence from your history and not names, so they're showing up with higher priority. There's a couple ways to handle this.

  1. Like I mentioned at the very top of this PR, it would be nice to improve keyword suggestion in general -- none of the suggestions in that screenshot are valid SQL. Keywords are suggested in where clauses because of some issues with LIKE Keyword 'LIKE' is not working well mycli#135 and INTERVAL Fix autocomplete after an identifier in a where-clause #340. So we could try to figure out a more sophisticated way of suggesting them. (Possibility: a new suggestion type 'operator' which would suggest and, or, not, like, ilike, interval, etc., separate from keyword).

  2. Force keywords to the bottom of the list, allowing all other completion types to intermingle. Fuzzy matching returns a two-tuple for sorting. Strict matching also returns a two-tuple, but the second element is always zero, because it's unused. So we could just specify that second element to be negative infinity instead.

I went ahead and tried option 2 because it's so easy.

@amjith
Copy link
Member

amjith commented Oct 12, 2015

The new implementation works much better. I think we can use this solution for now while we implement the second order suggestions.

I did notice one weird bug which is in master as well as this PR:

SELECT * FROM table_name WHERE abs

Until I type abs it showed a ton of suggestions that started with abs such as abs, abstime, abstimeeq etc. But as soon as typed s in abs the completion menu went away which is odd. I'm guessing it's a parsing bug. I haven't had time to dig in yet.

@darikg
Copy link
Contributor Author

darikg commented Oct 12, 2015

One weird thing with function suggestions as of #357 is that some functions are double listed -- once in the hardcoded functions list, and once in the database metadata. Not sure if it's related but I think abs is one of those functions so it'd be worth checking to see that PR introduced that bug.

@landscape-bot
Copy link

Code Health
Repository health increased by 0.66% when pulling 6d253f3 on darikg/learn-keyword-prefs into 54acaed on master.

@amjith
Copy link
Member

amjith commented Oct 13, 2015

I checked abs it's not listed in the functions in pgliterals. I think it is probably listed in sqlparse as a reserved word.

@landscape-bot
Copy link

Code Health
Repository health increased by 0.93% when pulling 40970fd on darikg/learn-keyword-prefs into 54acaed on master.

@amjith
Copy link
Member

amjith commented Oct 28, 2015

Just found this article: http://nicolewhite.github.io/2015/10/05/improving-cycli-autocomplete-markov-chains.html

Markov chain based suggestion.

@darikg
Copy link
Contributor Author

darikg commented Oct 28, 2015

That's awesome. It looks like she's looking only at keyword -> keyword transitions. It would be cool to also include

  • keyword -> identifier transitions
    • Suggest columns foo and bar in the SELECT clause, but columns baz and qux in the WHERE clause
  • keyword -> identifier type
    • E.g. if the user tends to alias qualify column names, suggest table aliases in the SELECT clause before column names
  • identifier -> identifier
    • E.g. if you tend to select columns foo and bar together, select foo, suggests bar

@amjith
Copy link
Member

amjith commented Nov 1, 2015

Now that 0.20.0 is out, lets get this merged into master and give it a thorough testing.

@darikg Can you rebase this branch to bring it up to date?

@darikg
Copy link
Contributor Author

darikg commented Nov 2, 2015

I'll finish rebasing this soon

@amjith
Copy link
Member

amjith commented Nov 8, 2015

Ping!

@darikg If you're busy I can take care of rebasing it. Just let me know.

@darikg
Copy link
Contributor Author

darikg commented Nov 8, 2015

squashed & rebased

@landscape-bot
Copy link

Code Health
Repository health increased by 0.52% when pulling 471b058 on darikg/learn-keyword-prefs into f7aef6e on master.

@amjith
Copy link
Member

amjith commented Nov 10, 2015

Thanks @darikg.

🚅

amjith added a commit that referenced this pull request Nov 10, 2015
[experimental] learn keyword preferences
@amjith amjith merged commit 7ee8a90 into master Nov 10, 2015
@amjith amjith deleted the darikg/learn-keyword-prefs branch November 10, 2015 03:55
@amjith
Copy link
Member

amjith commented Nov 26, 2015

We've had this merged into master for about 2 weeks now. How does everyone feel?

Is it working out?

When I type SELECT * FROM table WHERE I get the list of columns at the top, but as soon as I start typing the columns gets drowned out by the list of keywords, which is jarring.

Here's an example:

1__0_2_python2_7_-__pgcli___users_amjith_dropbox_code_python_pgcli___tmux_

I don't have a great suggestion to counter this yet. But I'd like to get feedback from the team as well as some hardcore pgcli users about the current behavior.

@dbcli/pgcli-core

I'll start collecting a list of users who have been active in reporting issues or sending occasional PRs. If you have user suggestions, please leave their user name here and we'll contact them seeking feedback.

@darikg
Copy link
Contributor Author

darikg commented Nov 26, 2015

@amjith that definitely seems wrong. I'll work on getting a test set up for it. But yeah, in general, I'd love more feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants