
Replace the tokenizer with a flex-based scanner #3846

Merged: 17 commits merged into master from the c-ext branch, Oct 31, 2017

Conversation

@kivikakk (Contributor) commented Oct 5, 2017

Preliminary benchmarks put this in at a 12x speedup.

[1] ~/github/linguist  $ bundle exec bin/bench
Rehearsal -----------------------------------------
c:      0.460000   0.000000   0.460000 (  0.464052)
rb:     5.120000   0.020000   5.140000 (  5.142903)
-------------------------------- total: 5.600000sec
            user     system      total        real
c:      0.420000   0.000000   0.420000 (  0.426902)
rb:     5.170000   0.020000   5.190000 (  5.188762)
Tokens extracted by each method:
 * extract_tokens_c: 3041830
 * extract_tokens_rb: 2836080

It doesn't produce identical results, but near enough. (Enough that all the tests should pass.)

/cc @vmg because he luuuuurves C

@vmg (Contributor) commented Oct 5, 2017

Looking good. I assume we're now properly handling all the weird Flex crashes you were seeing?

I also think it'd be neat if the Rakefile had a way to rebuild the lexer (and also to check that we're using the right version of Flex).

@lildude (Member) commented Oct 5, 2017

I've not tested this yet, but I wonder how well this new tokenizer will fare with non-ASCII. From the look of things, it should be 👌, but thought I'd ask to be sure.

For context, an attempt to improve Linguist's support of non-ASCII in the ruby implementation has been started in #3748.

@kivikakk (Contributor, author) commented Oct 6, 2017

I assume we're now properly handling all the weird Flex crashes you were seeing?

Yeah; I constrained our use of features which turn out to be dangerous (LOOKING AT YOU, TRAILING CONTEXT), and everything works as expected now. ✨

I also think it'd be neat if the Rakefile had a way to rebuild the lexer (and also to check that we're using the right version of Flex).

+1, will add.
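Roughly something like this, perhaps (the task name, file paths, and pinned flex version in this sketch are placeholders, not what will actually land in the Rakefile):

# Rakefile sketch; the names and paths below are illustrative only.
REQUIRED_FLEX = "2.6"

desc "Regenerate the flex-based tokenizer"
task :flex do
  version = `flex --version`[/\d+\.\d+/]
  unless version == REQUIRED_FLEX
    abort "flex #{REQUIRED_FLEX}.x is required (found #{version || 'none'})"
  end
  sh "flex -o ext/linguist/lex.linguist_yy.c ext/linguist/tokenizer.l"
end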

I've not tested this yet, but I wonder how well this new tokenizer will fare with non-ASCII.

It'll do as well as it currently does, which is to say Not Hugely Well; non-ASCII stuff will get skipped. It wouldn't be too hard to make it grok things we're likely to see in UTF-8 text, though it'd be a lot harder to do this and only match word-characters (since we'd have to add actual Unicode understanding to our lexer at that stage).
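A rough Ruby-side analogy of that distinction (the scanner itself is flex/C, so this is only an illustration, not project code):

# ASCII-only word matching skips non-ASCII letters entirely...
"naïve café".scan(/[A-Za-z]+/)    # => ["na", "ve", "caf"]

# ...while Unicode-aware word matching keeps them, which requires the engine
# to actually understand UTF-8:
"naïve café".scan(/[[:word:]]+/)  # => ["naïve", "café"]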

* Don't read and split the entire file if we only ever use the first/last n
  lines

* Only consider the first 50KiB when using heuristics/classifying.  This can
  save a *lot* of time; running a large number of regexes over 1MiB of text
  takes a while.

* Memoize File.size/read/stat; re-reading a 500KiB file every time `data` is
  called adds up a lot (a rough sketch of this follows below).
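A Ruby sketch of the memoization idea (class, method, and constant names here are illustrative, not the exact implementation):

class FileBlob
  HEURISTICS_CONSIDER_BYTES = 50 * 1024

  def initialize(path)
    @path = path
  end

  # Stat the file once and remember the answer.
  def size
    @size ||= File.size(@path)
  end

  # Read the file once instead of re-reading it on every call.
  def data
    @data ||= File.read(@path)
  end

  # Hand heuristics/classification only the first 50KiB.
  def classification_data
    data[0, HEURISTICS_CONSIDER_BYTES]
  end
end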
@@ -289,6 +287,44 @@ def lines
end
end

def encoded_newlines_re
@encoded_newlines_re ||= Regexp.union(["\r\n", "\r", "\n"].
@Alhadis (Collaborator) commented on the diff, Oct 9, 2017

Does the \R extension not work here?

I also take it Ruby's regex engine doesn't have the equivalent of Perl's /a modifier?

@kivikakk (Contributor, author) replied:

I'm changing as little code as I can; this is just a refactor from:

https://github.com/github/linguist/blob/0b9c05f989a66e3e92ca9a1e0b236781ef54229f/lib/linguist/blob_helper.rb#L278-L279

\R also catches [\v\f] which we definitely don't want.
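A quick irb illustration of the difference:

# \R is a "generic linebreak" in Ruby's regex engine and matches more than CR/LF:
"\v" =~ /\R/                        # => 0  (vertical tab matches)
"\f" =~ /\R/                        # => 0  (form feed matches)

# ...whereas the union built here covers only the three newline forms we want:
Regexp.union(["\r\n", "\r", "\n"])  # => /\r\n|\r|\n/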

I also take it Ruby's regex engine doesn't have the equivalent of Perl's /a modifier?

~$ ruby -e '//a'
-e:1: unknown regexp option - a

It doesn't, and more to the point, it wouldn't help for our use here, which isn't about Unicode-aware matching so much as avoiding terrible encoding exceptions rising from the deep. /a modifies the meaning of several sequences in the regular expression itself, rather than changing how a regular expression is applied to a given byte-sequence-tagged-with-an-encoding (i.e. a String), whatever the meaning of its contents.

@Alhadis (Collaborator) replied:

Ah, I see. ;) Just thought to ask, since it's used very little in Perl (for good reasons). Thanks!

@kivikakk (Contributor, author) commented:
I'd like to merge this! Anyone feel like doing a final review?

@lildude (Member) left a review comment:

Caveat pre-emptor: I have a copy of Dennis Ritchie's book but I'm far from being a C expert.

From what I do know this looks good to me, and the perf improvement is fantastic!!

@kivikakk (Contributor, author) commented:
@lildude Thank you! The responsibility is mine if this somehow goes belly-up.

@kivikakk kivikakk merged commit 99eaf5f into github-linguist:master Oct 31, 2017
@kivikakk kivikakk deleted the c-ext branch October 31, 2017 00:07
@kivikakk kivikakk mentioned this pull request Oct 31, 2017
kivikakk pushed a commit that referenced this pull request Nov 9, 2017
@smola smola mentioned this pull request Jan 28, 2019
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024