[experimental] learn keyword preferences #379

darikg · 2015-10-07T22:27:25Z

Spinning this off from the discussion in #377 since they're (mostly) orthogonal. I was originally a little skeptical of using any learning algorithm because:

- It might be annoying to have different suggestions on different installations.
- It might preclude improving the base suggestion engine. Right now sqlcomplete suggests only the very vague keyword category when it could take greater advantage of syntactical restrictions. Given SELECT * F, FREEZE is a pretty silly suggestion because it's just not valid SQL.

On the other hand, basing suggestions on syntax requires syntactically valid input, and a simple learning algorithm could be much more reliable in the middle of editing temporarily invalid queries. In the long run, the learning approach could move to estimating second- or third-order keyword transitions and be pretty powerful.

So this PR offers a basic experiment of the learning approach. Measure zeroth-order keyword probabilities, and rank keywords thereby.
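To make "zeroth-order" concrete, here is a minimal sketch, not the code in this PR: each keyword is counted independently of its context, and candidates are ranked by that count. The class name follows the KeywordCounter mentioned later in this thread, but the update and rank methods are hypothetical.

```python
import re
from collections import Counter


class KeywordCounter:
    """Hypothetical sketch: zeroth-order (context-free) keyword frequencies."""

    def __init__(self, keywords):
        # Store keywords upper-cased so matching is case-insensitive.
        self.keywords = {k.upper() for k in keywords}
        self.counts = Counter()

    def update(self, text):
        # Count every alphabetic word in the input that is a known keyword.
        for word in re.findall(r"[A-Za-z_]+", text):
            if word.upper() in self.keywords:
                self.counts[word.upper()] += 1

    def rank(self, candidates):
        # Most frequently used first; ties broken alphabetically.
        return sorted(candidates, key=lambda k: (-self.counts[k.upper()], k))


counter = KeywordCounter(["SELECT", "FROM", "FREEZE", "FULL"])
counter.update("SELECT * FROM users")
counter.update("select id from orders")
ranked = counter.rank(["FREEZE", "FROM", "FULL"])
print(ranked)
```

With this history, FROM moves ahead of FREEZE for the input SELECT * F, even though nothing here knows about SQL syntax.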

Open questions:

How should learning be shared between concurrent pgcli sessions? One global state or one state per pgcli instance?
Should we (and if so, how do we) save keyword preferences between sessions?

landscape-bot · 2015-10-08T03:15:38Z

Code quality remained the same when pulling 18161db on dbcli:darikg/learn-keyword-prefs into 54acaed on dbcli:master.

amjith · 2015-10-08T04:48:13Z

First of all this is great. I'm impressed at such a quick turn around.

Here are some possible solutions to the open questions:

How should learning be shared between concurrent pgcli sessions? One global state or one state per pgcli instance?

Let's keep the learning separate between concurrent pgcli sessions, which is what your implementation does.

Reason: The same user might start two sessions, one as a root user performing maintenance tasks and the other session as a regular user to do exploration.

One suggestion: Initially the KeywordCounter starts empty and then learns during the session. But we have a corpus of query strings in our history file. Why not take the latest 100 entries and pre-populate the KeywordCounter?

Should we (and if so, how do we) save keyword preferences between sessions?

Yes! Our history files are the way we save the keyword preferences. At the start of a pgcli session we'll read the last 100 or 1000 entries in the history file and use it to train our model (KeywordCounter).

I have a few more thoughts, I'll leave them in a bit once I've had a chance to formulate them in a coherent manner.

amjith · 2015-10-08T05:21:42Z

pgcli/pgcompleter.py

@@ -30,6 +31,7 @@ def __init__(self, smart_completion=True, pgspecial=None):
        super(PGCompleter, self).__init__()
        self.smart_completion = smart_completion
        self.pgspecial = pgspecial
+        self.keyword_counter = KeywordCounter(self.keywords)


Instead of calling it self.keyword_counter let's call it self.recommendation_engine or something more generic.

The idea is that in the future we might have different learning algorithms, one based on counting, one based on weighted frequency, or neural networks or something equally crazy etc. :)

We should also make KeywordCounter() take in an initial corpus of input strings to train the model with the query history.
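A rough sketch of what that could look like, assuming a small learn/score interface that any future engine would share. The class name CountingEngine, the corpus parameter, and both method names are hypothetical and not part of the PR.

```python
from collections import Counter


class CountingEngine:
    """Hypothetical counting-based recommendation engine."""

    def __init__(self, keywords, corpus=()):
        self.keywords = {k.upper() for k in keywords}
        self.counts = Counter()
        # Train on any initial corpus, e.g. recent history entries.
        for text in corpus:
            self.learn(text)

    def learn(self, text):
        # Naive whitespace tokenization; good enough for a sketch.
        for word in text.upper().split():
            if word in self.keywords:
                self.counts[word] += 1

    def score(self, keyword):
        # Higher score means the keyword was seen more often.
        return self.counts[keyword.upper()]


history = ["SELECT 1", "SELECT * FROM t", "DROP TABLE t"]
# Slicing past the start of a short list is safe in Python.
engine = CountingEngine(["SELECT", "FROM", "DROP"], corpus=history[-100:])
print(engine.score("select"), engine.score("FROM"))
```

The completer would then hold this as self.recommendation_engine and only ever call learn and score, so a different algorithm could be dropped in later without touching the completer.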

amjith · 2015-10-08T05:38:05Z

Right now we're only using this learning algorithm for Keywords. Why not use it for tables/columns/views etc?

The fuzzy matcher sorts the list based on a tuple with two values: the length of the matching group and the starting position of the match. Let's add a third value to the tuple, which will be the frequency of a table/column/view. This means when the user types SELECT * FROM, the list that pops up will have the table names ordered by how commonly they're used. As soon as they start typing, it'll reorder the list based on their typing and use the frequency item in the tuple to break ties.

Only the keywords should be pre-populated from the history file. The table/column/view names should only learn from the start of a session and should not persist between sessions. The reason being, each session will be a specific task which requires a set of tables to be accessed but it'll vary from session to session. We should also reset these frequency counters for tables/cols/views whenever we change the database.
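Here is a minimal sketch of the three-element sort key, under the assumption that a lazy regex stands in for the real fuzzy matcher. The function names fuzzy_key and rank and the name_counts mapping are hypothetical.

```python
import re
from collections import Counter


def fuzzy_key(text, candidate, name_counts):
    """Return a sort key, or None if candidate does not match text."""
    # Build a lazy fuzzy pattern: 'usr' -> 'u.*?s.*?r'
    pattern = ".*?".join(map(re.escape, text))
    m = re.search(pattern, candidate)
    if m is None:
        return None
    # Shorter match, earlier start, then higher frequency sort first.
    return (len(m.group()), m.start(), -name_counts[candidate])


def rank(text, candidates, name_counts):
    keyed = [(fuzzy_key(text, c, name_counts), c) for c in candidates]
    return [c for key, c in sorted(kc for kc in keyed if kc[0] is not None)]


name_counts = Counter({"orders": 5, "users": 2})
tables = ["users", "orders", "offers"]
print(rank("", tables, name_counts))   # nothing typed: frequency decides
print(rank("o", tables, name_counts))  # typing narrows; frequency breaks ties
```

Resetting on a database change would then just be name_counts.clear(), while a separate keyword counter could survive it.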

The end. :)

darikg · 2015-10-08T19:45:49Z

Right now we're only using this learning algorithm for Keywords. Why not use it for tables/columns/views etc?

Good call. This would also be a good time to tweak how sorting is done. Right now find_matches sorts within suggestion categories, but not across them. So the completion menu lists all matching tables, then views, then schemas, etc. Instead, sorting should be done after all matches have been found, based on the three-element tuple you just described. So this is kinda dove-tailing with discussion in #377 that we should remove sorting from find_matches

darikg · 2015-10-10T16:26:47Z

Ok wow this got complicated quickly. Quick overview of the most recent set of changes:

- We have a new package prioritization, with one class PrevalenceCounter
  - PrevalenceCounter counts keyword and identifier name prevalences separately
- PGCompleter has a member prioritizer, which is initialized to an instance of PrevalenceCounter, but like @amjith said, this could become some other type of prioritizer like a neural network
- pgcompleter has a new class Match, which is a simple namedtuple (completion, priority)
- find_matches now returns a list of matches, instead of a list of completions.
  - The list is unsorted.
  - The priority is a nested tuple of (fuzzy_match_tuple, prevalence)
  - find_matches arguments start_only and fuzzy are collapsed into a single parameter mode
    - mode = 'keyword' ==> start_only matching and keyword prevalence
    - mode = 'name' ==> fuzzy matching and name prevalence
  - TODO there's a small ambiguity here with hardcoded functions and datatypes, which use start_only matching but are names, not keywords.
  - matches are collapsed across all suggestion types, and then sorted by priority
  - Around this point I got sick of the linebreaks in pgcompleter.get_completions() and broke out everything into submethods get_table_matches, get_schema_matches etc. and use dictionary lookup to dispatch on suggestion type. This gives back two levels of indentation and was a big improvement in readability imho.
- On pgcli startup, we read the most recent 100 lines from file history and learn keyword (but not name) prevalence. This is done in the completion refresher background thread so it shouldn't hurt startup time
- Finally, learned prevalences are persisted across completion refreshes, but changing databases wipes out the learned name prevalences
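The overall shape described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: substring matching stands in for the real matcher, and the meta and prevalence dictionaries, the MATCHERS table, and the function signatures are assumptions.

```python
from collections import namedtuple

Match = namedtuple("Match", ["completion", "priority"])


def find_matches(text, collection, prevalence):
    """Return unsorted matches; substring search stands in for the real matcher."""
    matches = []
    for item in collection:
        start = item.find(text)
        if start >= 0:
            # Nested priority: (match quality tuple, prevalence); larger is better.
            matches.append(Match(item, ((-start,), prevalence.get(item, 0))))
    return matches


def get_table_matches(text, meta, prevalence):
    return find_matches(text, meta["tables"], prevalence)


def get_view_matches(text, meta, prevalence):
    return find_matches(text, meta["views"], prevalence)


# Dispatch on suggestion type instead of a long if/elif chain.
MATCHERS = {"table": get_table_matches, "view": get_view_matches}


def get_completions(text, suggestion_types, meta, prevalence):
    matches = []
    for kind in suggestion_types:
        matches.extend(MATCHERS[kind](text, meta, prevalence))
    # Sort once, across all suggestion types, highest priority first.
    matches.sort(key=lambda m: m.priority, reverse=True)
    return [m.completion for m in matches]


meta = {"tables": ["users", "user_roles"], "views": ["active_users"]}
prevalence = {"user_roles": 3, "active_users": 10}
result = get_completions("user", ["table", "view"], meta, prevalence)
print(result)
```

Because match quality comes first in the nested tuple, prevalence only reorders candidates that match equally well; the heavily used view still sorts last because its match starts later in the name.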

j-bennet · 2015-10-10T19:02:52Z

@darikg Wow, this looks great. I have a question about this part:

https://github.com/dbcli/pgcli/blob/darikg/learn-keyword-prefs/pgcli/packages/prioritization.py#L40

The comment says that we can't rely on sqlparse to recognize keywords, because sqlparse is database agnostic. But we're only parsing keywords out of our own history, right? So that would be database specific. Am I missing something?

darikg · 2015-10-10T21:12:10Z

@j-bennet So in order to count identifier usages I'm iterating over the input like this:

for parsed in sqlparse.parse(text):
    for token in parsed.flatten():
        if token.ttype in Name:
            self.name_counts[token.value] += 1

For symmetry & simplicity it would be nice to be able to also do:

        elif token.ttype in Keyword:
            self.keyword_counts[token.value] += 1

But that won't catch all keywords properly unless sqlparse knows to tokenize them as keyword. So prioritization.py has its own ad hoc keyword tokenization machinery.
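One way such ad hoc tokenization could work is a regex built from the completer's own keyword list. This is only a sketch of the idea, not what prioritization.py does; the KEYWORDS list, keyword_regex, and count_keywords are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword list; the real one would come from the completer.
KEYWORDS = ["SELECT", "FROM", "WHERE", "ILIKE", "LIKE", "VACUUM"]

# \b keeps whole words only, so LIKE does not match inside ILIKE or likes.
keyword_regex = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def count_keywords(text, counts):
    # Tally every whole-word keyword occurrence, case-insensitively.
    for match in keyword_regex.finditer(text):
        counts[match.group(1).upper()] += 1


keyword_counts = Counter()
count_keywords("select * from likes where name ilike 'a%'", keyword_counts)
count_keywords("VACUUM", keyword_counts)
print(keyword_counts)
```

A known limitation of this sketch is that it also counts keywords that appear inside string literals or comments.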

amjith · 2015-10-11T05:05:08Z

pgcli/completion_refresher.py

+        n_max_recent = 100
+        n_recent = history and min(n_max_recent, len(history))
+        if n_recent:
+            for recent in history[-n_recent:]:


This could be rewritten as:

n_recent = 100
if history:
    for recent in history[-n_recent:]:

If you ask for the last 100 items and history is only 10 items, Python will give you only the 10 items. No need to be cautious about array out of bounds.

Proof:

>>> a = [1,2,3]
>>> a[-5:]
[1, 2, 3]
>>> a[-2:]
[2, 3]
>>> a[-1:]
[3]

amjith · 2015-10-11T08:15:13Z

This is quite the PR. :)

Nice job. I've left a bunch of suggestions inline. I'm not sure I understand why the utils was changed to disable all the database tests.

Thanks for taking the time to tackle this. The next release of pgcli is gonna kick butt. :)

landscape-bot · 2015-10-11T09:11:35Z

Repository health increased by 0.46% when pulling 113e774 on darikg/learn-keyword-prefs into 54acaed on master.

5 new problems were found (including 0 errors and 4 code smells).
8 problems were fixed (including 2 errors and 3 code smells).

darikg · 2015-10-11T12:40:05Z

I'm not sure I understand why the utils was changed to disable all the database tests.

Ugh I'm really sorry about that.

Thanks for taking the time to tackle this. The next release of pgcli is gonna kick butt. :)

I was thinking actually we might want to do a new release before merging this, so it has some time to live in master before being released to the wild

amjith · 2015-10-11T13:36:33Z

I was thinking actually we might want to do a new release before merging this, so it has some time to live in master before being released to the wild.

That's a valid request. I can get started with the release process unless one of you is interested in doing it.

j-bennet · 2015-10-11T22:39:07Z

@darikg Oh I see. I thought sqlparse covers all possible keywords, but I guess it only covers the rather general subset and it would not catch ones specific to postgres. Makes sense.

amjith · 2015-10-12T06:45:37Z

Completion after the WHERE clause is not suggesting column names anymore.

The even more surprising aspect of this find is that we don't have a test for this simple case SELECT * FROM table_name WHERE.

This seems isolated to this branch. Master seems fine.

amjith · 2015-10-12T07:32:32Z

Sorry about the false alarm. The column names are available but they're buried deeper in the list.

This makes me wonder if mixing all the completions together and then sorting based on priority is the right way to go.

darikg · 2015-10-12T12:35:46Z

That's happening because the prioritizer is loading keyword prevalence from your history and not names, so they're showing up with higher priority. There's a couple ways to handle this.

1. Like I mentioned at the very top of this PR, it would be nice to improve keyword suggestion in general -- none of the suggestions in that screenshot are valid SQL. Keywords are suggested in where clauses because of some issues with LIKE (Keyword 'LIKE' is not working well, mycli#135) and INTERVAL (Fix autocomplete after an identifier in a where-clause, #340). So we could try to figure out a more sophisticated way of suggesting them. (Possibility: a new suggestion type 'operator' which would suggest and, or, not, like, ilike, interval, etc., separate from keyword.)
2. Force keywords to the bottom of the list, allowing all other completion types to intermingle. Fuzzy matching returns a two-tuple for sorting. Strict matching also returns a two-tuple, but the second element is always zero, because it's unused. So we could just specify that second element to be negative infinity instead.

I went ahead and tried option 2 because it's so easy.
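The sentinel idea can be sketched as below. This is an illustration only: the sort_key function and the candidate list are hypothetical, and the sketch puts the sentinel in the first slot of the tuple, which is an assumption of this example rather than a description of the commit. Python compares tuples lexicographically, so a sentinel in the second slot only breaks ties among entries whose first elements are equal.

```python
NEG_INF = float("-inf")


def sort_key(candidate, text, is_keyword):
    """Hypothetical priority key; larger sorts first."""
    start = candidate.lower().find(text.lower())
    if is_keyword:
        # Sentinel in the first slot: keywords lose to every non-keyword.
        return (NEG_INF, -start)
    # Non-keywords: earlier match position wins.
    return (-len(text), -start)


candidates = [("LIKE", True), ("last_login", False), ("email_list", False)]
ordered = sorted(candidates, key=lambda c: sort_key(c[0], "l", c[1]), reverse=True)
names = [name for name, _ in ordered]
print(names)
```

Here the two column names still sort among themselves by match position, and the keyword lands at the bottom regardless of how well it matches.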

amjith · 2015-10-12T18:30:06Z

The new implementation works much better. I think we can use this solution for now while we implement the second order suggestions.

I did notice one weird bug which is in master as well as this PR:

SELECT * FROM table_name WHERE abs

Until I finished typing abs, it showed a ton of suggestions that started with abs, such as abs, abstime, abstimeeq etc. But as soon as I typed the s in abs, the completion menu went away, which is odd. I'm guessing it's a parsing bug. I haven't had time to dig in yet.

darikg · 2015-10-12T18:42:22Z

One weird thing with function suggestions as of #357 is that some functions are double listed -- once in the hardcoded functions list, and once in the database metadata. Not sure if it's related, but I think abs is one of those functions, so it'd be worth checking to see whether that PR introduced that bug.

landscape-bot · 2015-10-13T08:26:48Z

Repository health increased by 0.66% when pulling 6d253f3 on darikg/learn-keyword-prefs into 54acaed on master.

4 new problems were found (including 0 errors and 3 code smells).
8 problems were fixed (including 2 errors and 3 code smells).

amjith · 2015-10-13T11:48:49Z

I checked abs; it's not listed in the functions in pgliterals. I think it is probably listed in sqlparse as a reserved word.

landscape-bot · 2015-10-14T08:42:17Z

Repository health increased by 0.93% when pulling 40970fd on darikg/learn-keyword-prefs into 54acaed on master.

2 new problems were found (including 0 errors and 1 code smell).
8 problems were fixed (including 2 errors and 3 code smells).

amjith · 2015-10-28T12:15:17Z

Just found this article: http://nicolewhite.github.io/2015/10/05/improving-cycli-autocomplete-markov-chains.html

Markov chain based suggestion.

darikg · 2015-10-28T23:34:52Z

That's awesome. It looks like she's looking only at keyword -> keyword transitions. It would be cool to also include

keyword -> identifier transitions
- Suggest columns foo and bar in the SELECT clause, but columns baz and qux in the WHERE clause
keyword -> identifier type
- E.g. if the user tends to alias qualify column names, suggest table aliases in the SELECT clause before column names
identifier -> identifier
- E.g. if you tend to select columns foo and bar together, select foo, suggests bar
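A first-order transition model along these lines could be sketched as follows. It is a toy illustration with a hypothetical TransitionCounter class working on pre-tokenized input, not a proposal for the actual implementation.

```python
from collections import Counter, defaultdict


class TransitionCounter:
    """Hypothetical first-order model: counts which token follows which."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def learn(self, tokens):
        # Count each adjacent (previous, next) pair.
        for prev, nxt in zip(tokens, tokens[1:]):
            self.transitions[prev][nxt] += 1

    def suggest(self, prev, n=3):
        # Most frequent followers of `prev`, best first.
        return [tok for tok, _ in self.transitions[prev].most_common(n)]


model = TransitionCounter()
model.learn(["SELECT", "foo", "FROM", "t", "WHERE", "baz"])
model.learn(["SELECT", "foo", "FROM", "t", "WHERE", "qux"])
model.learn(["SELECT", "bar", "FROM", "t", "WHERE", "baz"])
print(model.suggest("SELECT"), model.suggest("WHERE"))
```

The same table covers keyword -> keyword, keyword -> identifier, and identifier -> identifier transitions, since it only looks at adjacent tokens; distinguishing identifier types would need the tokens to be tagged first.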

amjith · 2015-11-01T09:07:02Z

Now that 0.20.0 is out, let's get this merged into master and give it a thorough testing.

@darikg Can you rebase this branch to bring it up to date?

darikg · 2015-11-02T23:48:28Z

I'll finish rebasing this soon

amjith · 2015-11-08T12:52:22Z

Ping!

@darikg If you're busy I can take care of rebasing it. Just let me know.

darikg · 2015-11-08T21:05:48Z

squashed & rebased

landscape-bot · 2015-11-08T21:14:51Z

Repository health increased by 0.52% when pulling 471b058 on darikg/learn-keyword-prefs into f7aef6e on master.

1 new problem was found (including 0 errors and 0 code smells).
5 problems were fixed (including 0 errors and 2 code smells).

amjith · 2015-11-10T03:55:29Z

Thanks @darikg.

🚅

amjith · 2015-11-26T04:05:53Z

We've had this merged into master for about 2 weeks now. How does everyone feel?

Is it working out?

When I type SELECT * FROM table WHERE I get the list of columns at the top, but as soon as I start typing, the columns get drowned out by the list of keywords, which is jarring.

Here's an example:

I don't have a great suggestion to counter this yet. But I'd like to get feedback from the team as well as some hardcore pgcli users about the current behavior.

@dbcli/pgcli-core

I'll start collecting a list of users who have been active in reporting issues or sending occasional PRs. If you have user suggestions, please leave their user name here and we'll contact them seeking feedback.

darikg · 2015-11-26T13:06:45Z

@amjith that definitely seems wrong. I'll work on getting a test set up for it. But yeah, in general, I'd love more feedback.

amjith added the in progress label Oct 7, 2015

darikg mentioned this pull request Oct 7, 2015

Don't sort keyword suggestions alphabetically #377

Closed

amjith reviewed Oct 8, 2015

amjith reviewed Oct 11, 2015

darikg added 6 commits November 8, 2015 15:54

New package prioritization and class PrevalenceCounter

9c97d35

PGCompleter initializes a prevalence counter (not used yet)

886b048

pgcompleter.find_matches returns Match (Completion, Priorty) tuples

78649e5

Break up pgcompleter.suggest_type into subfunctions dispatched by suggestion type

13b6c93

Update pgcompleter tests

0cf0b7f

Update pgcli to new pgcompleter

471b058

darikg force-pushed the darikg/learn-keyword-prefs branch from 40970fd to 471b058 Compare November 8, 2015 21:03

amjith added a commit that referenced this pull request Nov 10, 2015

Merge pull request #379 from dbcli/darikg/learn-keyword-prefs

7ee8a90

[experimental] learn keyword preferences

amjith merged commit 7ee8a90 into master Nov 10, 2015

amjith removed the in progress label Nov 10, 2015

amjith deleted the darikg/learn-keyword-prefs branch November 10, 2015 03:55

darikg mentioned this pull request Nov 26, 2015

Really sort keywords after everything else #425

Merged

j-bennet mentioned this pull request Feb 24, 2017

Feature Request: Sort tab completion by frequency dbcli/mycli#351

Open
