/
README
246 lines (192 loc) · 10.5 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
What is gpicker?
----------------
gpicker is a program that allows you to quickly and conveniently pick
file in a (possibly very large) project. You type significant letters
of file name (typically from the start of words) and gpicker provides
you with a list of files you most likely mean to pick. The program
filters and orders project's list of files in real-time as you type.
To place 'correct' matches on top gpicker uses sophisticated scoring
heuristics. This scoring heuristic tries to capture user's intention
and effectively 'un-abbreviate' pattern, rather than simply order
results by edit distance. Scoring is implemented by efficient dynamic
programming algorithm which makes filtration and ordering very
fast. The scoring details are described below.
It was inspired by class finding facility of IntelliJ IDEA and
'Command-T' feature of TextMate, but it's in many ways much better.
It is language-agnostic and supports matching directory names too.
Read on for details.
It is usable as a standalone program, but it was created to be used by
editors/IDEs. Emacs & VIM integration is provided. See comments on top
of gpicker.{el,vim} for installation notes. This README describes
general usage concepts.
There are also out-of-tree integrations for netbeans --
http://github.com/avsej/gpicker-netbeans and gedit --
http://github.com/yltsrc/gedit-gpicker.
http://github.com/alk/supermegadoc uses gpicker to navigate/pick-from
names in documentation system index.
High-level description of gpicker's filtration/sorting approach
---------------------------------------------------------------
Main observation behind gpicker is what we use (or would like to use)
when specifying target name in few key presses. The answer is
abbreviation of 'words' in that target name. For example, for
abbreviating 'active_record' it is 'natural' to use 'ar' (first chars
from each word), 'arec' or 'a-rec' (emphasizing 'r' as start of new
word). So sorting task is to order names in candidate list, so that
the names for which pattern is best abbreviation are placed on top.
Many 'smart' autocompleters focus on edit distance (Levenstein
distance, for example), but really smart autocompleter should try to
'un-abbreviate' pattern instead. Levenstein is good for spellchecking,
but autocompletion is about trying to capture user's intention. And
user abbreviates, not just omits random chars.
Another notable feature of gpicker is separate treatment of path
basename and dirname. Ordinary patterns filter only basenames. In many
cases this is sufficient. But in some projects there are files with
same names in different directories. Notable example is base.rb in
Ruby on Rails sources.
epsilon:~/src/altoros/rails# git ls-files | gpicker-simple base | head
actionpack/lib/action_view/base.rb
railties/lib/rails_generator/base.rb
activemodel/lib/active_model/base.rb
.. many other base.rb entries are skipped ...
For such cases matching dirname of complete path is useful. gpicker
does that in a smart way. Dirname pattern is used only to break ties
between basename matches.
epsilon:~/src/altoros/rails# git ls-files | gpicker-simple ac-reco/base
activerecord/lib/active_record/base.rb
activerecord/lib/active_record/connection_adapters/abstract/database_statements.rb
Empty basename patterns can be used to browse list of files in some
directory. For example 'are/con-ad/' can be used to see contents of
vendor/rails/activerecord/lib/active_record/connection_adapters/.
gpicker doesn't attempt to be too strict at filtering. Most effort in
design of gpicker's internals was spent on sorting candidate entries,
not eliminating them. The goal was to place 'correct' match on top, or
at least on first page. How many results in total we get is
irrelevant, if we can find what we need in first few results.
gpicker-simple
--------------
gpicker is a Gtk+ program. In some cases and on some OSes this is
inconvenient. So there's also gpicker-simple program that doesn't
depend on glib and Gtk+. In default './configure' build gpicker-simple
is just a hardlink to gpicker. But see Makefile.simple if you want to
build dependency-free gpicker-simple. I think it even should be
possible to build gpicker-simple on windows, with few tweaks (like
providing missing getopt(3) implementation).
More details
------------
Based on gpicker's main idea I designed the following:
* first, list of candidate names should be filtered so that only names
where all pattern characters match in correct order (increasing
position) remain
* matches should generally be case-insensitive, though
camel-case-transitions should be handled as beginnings of words
* each candidate name should be assigned some score based on how well
pattern abbreviates that name
* candidate name list is sorted with descending score (with some
additional rules to break ties)
The scoring system adds scores of all matches and is based on the
following observations:
* match on beginning of word is good and should be given some score. So
'a' and 'r' are good abbreviations for 'active_record' because they
match beginnings of words. While 'c' and 'e' (that match too) are not.
* match on beginning of first word is even better (because we usually abbreviate
from first word). So 'a' is better abbreviation for 'active_record'
than 'r' (both match, but 'r' probably queries for something that
begins with 'r', like 'red').
* patterns can have word separators too, to emphasize word start. So
'a-r', most probably, means: I want name with word that starts on
'a' and then has another word that starts on 'r'. Such emphasized
matches on beginnings of words should be rewarded more than 'wild'
matches on beginnings of words. So 'a-r' is better fit for
'active-record' than 'ar' (because 'ar' could mean 'arsenal').
* adjacent matches are more important than non-adjacent matches. 'ar'
fits 'arsenal' well. And it even fits 'paragraph' more or less well
(substring match). While it doesn't fit 'alert' at all (but
matches).
This set of rules encourages matches on beginnings of words and
positions that are adjacent to previous matches. But what scores to
assign in every case ? In particular, it's not clear what's relative
order of scores for adjacent match versus 'wild' match on start of
word. I decided that 'ar' should score more on 'arsenal', than on
'active-record'. I've chosen adjacency score to be a bit less then
doubled wild word beginning match score. Probably, to get this:
arsenal:100800
^^^
active-record-simple:100402
^ ^ ^
arbiters:100400
^^ ^
active-records:100201
^ ^ ^
So the scoring system is:
* 'proper' word beginning match is awarded 0x100000 points. This,
obviously, includes match on beginning of first word
* 'wild' word beginning match -- 0x201
* and 0x400 is given for adjacent match
Performance notes
-----------------
Don't judge gpicker's performance by first run. gpicker scans project
for list of files on every start. This can take a while on first
run. But subsequent runs will hit OS's directory entries cache and
will start almost instantly.
Real-time filtration should be snappy even on largest projects. At
least if your CPU is not very old.
Usage in examples
-----------------
When I want to pick 'vendor/rails/activerecord/lib/active_record/base.rb'
inside rails project I can type 'ba' or 'bas'. But it won't be
displayed on top, because rails has several files named base.rb in
different directories. I can type 'ar/ba' (or even 'ar/b'). This will
match 'ar' against directory name and 'ba' against basename and will
place 'correct' file on top.
To pick source of java.lang.ClassLoader class inside openjdk I can try
'cload'. There a many matches and correct file is not in top 5 (but
it's in top 10). I can try placing emphasis on start of words by
capitalizing them - 'cLoa' or 'CLoa'. But that removes only one extra
match. I can add directory name pattern to better convey my
intention. 'lan/cLo' is sufficient to find correct file on second
place. It can be selected via arrow keys now. And 'cl/la/cLo' (added
another part of directory - 'classes') or 'lan/cLo.j' (added
extension) is enough to place it first. 'clloderj' works too. Notice
skipped 'a' from 'loader'.
There are two ways to emphasize start of words. One is capitalizing
first letters. 'aRe' or 'AR' will give large score to 'active_record',
'active-record', 'active.record' and 'ActiveRecord'.
Second way is placing matching delimiters before first letter.
'b.r' will give large score for 'base.rb', because match of 'r' after
delimiter is considered start of the word. Delimiters '_' and '-' are
interchangeable. So association_proxy.rb can be matched with 'a-pro'
and with 'a_pro' (and with 'aPro' of course). That was done because
'-' is easier to type.
Filtering & scoring details
---------------------------
Only matching entries pass through filter. In simple words, there is
a match between given pattern and name if and only if it is possible
to transform name to pattern by removing characters from it without
changing order of remaining characters.
Matching is case insensitive, though capital letters both in pattern
and in name are additionally treated as word starts. Letters after
delimiters also count as word starts.
Smart ordering of matches is based on matches' scores. Scoring rules
are quite simple. Each character match is given some score and those
are summed for total match score. Match on word start is given
0x100000 points if matching pattern character is on word start too,
and 0x201 points otherwise. Match is given 0x400 points if it's
adjacent to previous match. And all other matches are given 0 points.
The idea is that 'proper' word start matches are most precious. All
other matches score significantly less. Among those substring matches
score about twice as much as non-proper word start matches. And
completely wild matches do not score at all.
If there are several matches for given name then the scoring algorithm
picks leftmost (or rightmost, for directory name matches) with maximal
score.
Directory names and basenames are scored separately and directory name
score is considered only when basename scores are equal.
For typically short patterns it's not unlikely to have a group of
names with maximal score. So great attention is given to ordering
names with equal scores. Current ordering heuristic takes
compactness of match into account (minimization of last matching
character index) and length of names. See code in filtration.c for
exact details.