Skip to content

Loading…

Delete primary_extension from language data #985

Merged
merged 1 commit into from

7 participants

@nox

The language attribute is still maintained as the first extension found.

This allows Mercury to be properly detected by Linguist, as per #748.

@PaulBone

Thanks @nox

@nox

Btw, why aren't the pedantic tests parsing the languages index with an actual YAML parser instead of clumsily reading it line by line?

@ghost

"…why aren't the pedantic tests parsing the languages index with an actual YAML parser instead of clumsily reading it line by line."

Because actual software engineering takes time away from twirling lariats, barrel races and steer wrestling.

@aJanuary

You've kept the concept of "one of the file extensions is special", but moved it from being explicit in a separate property to being implicit at the top of the list.

I would suggest keeping the primary_extension property but removing the constraint that it must be unique, or removing the concept of a favoured extension entirely. I'm not familiar enough with the rest of the codebase to suggest which one would be better.

@nox

You've kept the concept of "one of the file extensions is special", but moved it from being explicit in a separate property to being implicit at the top of the list.

Of course I kept it! But the order is irrelevant from Linguist's point of view. This property is special because some closed source code at GitHub depends on it, how could I handle such code without looking at it?

@PaulBone

It sounds like keeping it is a good idea (it may help gist etc) but while
removing the uniqueness requirement.

@nox

Ping?

@nox

Can we have an answer?

Could GitHub get its act together and actually maintain their FOSS projects properly?

@nox

Ping?

@bkeepers
GitHub member

Ping?

64 bytes from github: time=16 days

Could GitHub get its act together and actually maintain their FOSS projects properly?

We're working on getting a team together to be more responsive to this project, but please cut us a little slack. We have a lot of FOSS and not a lot of resources to keep on top of it. We're all just trying to do our best.

This allows Mercury to be properly detected by Linguist, as per #748.

I don't see any tests to show that it fixes this, which makes me a little nervous. It also makes me nervous that it has other adverse effects.

Traditionally, changes to linguist have been really difficult for us to handle, because improvements for detection of one language usually come at the cost of another language. That makes some people really happy and others really upset. So we all need to figure out ways to do the classification even better.

@nox

I don't see any tests to show that it fixes this, which makes me a little nervous. It also makes me nervous that it has other adverse effects.

What? All the tests still run, as shown here by Travis. Among them, there is a test that uses @primary_extension. Please tell me which test to add for you to be satisfied and I will do it.

As for any adverse effects in your own closed code, I can't really do anything about them if I don't get any answer about what they could be.

@bkeepers
GitHub member

Please tell me which test to add for you to be satisfied and I will do it.

Maybe I misunderstood. Does this fix the detection of Mercury?

As for any adverse effects in your own closed code, I can't really do anything about them if I don't get any answer about what they could be.

There's not really any other closed source code to go along with this. We use linguist directly. The problem is just when we change the heuristics, it affects the way repositories are classified. As I said, this makes some people happy and others upset.

So we just want to be very dilligent changes.

@nox

Maybe I misunderstood. Does this fix the detection of Mercury?

Did you actually read the comments in the linked PR?

@nox

I guess you didn't.

@sebgod

Dear Brandon @bkeepers,
you wrote

Traditionally, changes to linguist have been really difficult for us to handle, because improvements for detection of one language usually come at the cost of another language.

Are you sure this is a problem? I mean all the tests are passing in Travis CI, AFAIK all languages are being recognised properly (if the tests are reasonably complete)
In my PR #1049 (based on #748 @PaulBone ) pull request I added enough samples to disambiguate all potential conflicts I could think off.
As I merged the PR with the current upstream/master, the Travis CI is passing, all Obj-C, Moocode, Mercury is properly recognised, I'd propose that you'd first accept my PR if you find it appropriate (the changes are non-intrusive to your detection logic and you mentioned that resources are sparse at the moment), and if all goes well after some testing you merge this PR #985 by @nox once you have finished testing.
I hope then we can finally settle this dispute :+1:

@bkeepers
GitHub member

I want to start with clarifying that this is not a dispute at all. I just want to make sure I understand. I like you all and want to make everything better.

Now, to clarify…

Did you actually read the comments in the linked PR?

Yes. What I'm confused about is, does this PR actually fix it for Mercury? Or is it purely to pave the way for a future fix? If it's fixing something, I'd love to see a test changed so we can be sure we don't break it in the future.

@nox

It fixes Linguist's broken design which relies on a primary extension unique per language. The very reason the other PR was rejected more or less as a WONTFIX. What are you confused about in this?

@aJanuary

I'm confused about why you haven't responded to their point about adding a test for the fix.

@nox

@aJanuary

I'm confused about why you haven't responded to their point about adding a test for the fix.

I'm confused as to what you don't understand in the following sentence:

Please tell me which test to add for you to be satisfied and I will do it.

You can also explain to me why you people are calling this a fix when this is a design improvement that happens to make primary extensions' conflicts disappear.

See @arfon's comment on the other PR:

We're working here within the constraints of the implementation of Linguist I'm afraid.

What I remove is a "constraint of the implementation", also known as a "design bug".

@nox

How do I test Mercury's detection when there is no Mercury thing in Linguist yet because there can't be without that branch?

@bkeepers bkeepers added the Triage label
@sebgod sebgod referenced this pull request
Merged

Mercury noconflict #1049

@nox
nox commented

Ping?

@arfon
GitHub member

Hey @nox, I just wrote a response to @DX-MON over here that I think is relevant for this thread.

While it's possible remove the primary_extension and keep the current test suite passing before we can merge this we would like to assess the impact of this change across a large number of repos. Setting up an environment to assess these changes is going to take us a while longer sorry.

Our focus over the past couple of weeks has been to triage outstanding pull requests that are a straightforward merge (new languages etc), this one is less straightforward I'm afraid.

@nox
nox commented

While it's possible remove the primary_extension and keep the current test suite passing before we can merge this we would like to assess the impact of this change across a large number of repos. Setting up an environment to assess these changes is going to take us a while longer sorry.

I see you are still not admitting that the classification is already broken and that my patch can't make it worse. Still not admitting that the primary extension lookup does not shortcut anything.

You are afraid to admit that this PR is just as straightforward, that's all.

@joneshf

Our focus over the past couple of weeks has been to triage outstanding pull requests that are a straightforward merge (new languages etc)

As an aside, I greatly appreciate this!

@arfon
GitHub member

Thanks @joneshf :smile:

@nox
nox commented

That patch can't misclassify anything because it doesn't change the list of looked-up languages for any file extension.

@nox
nox commented

So I've written a small script that outputs the list of candidate languages for all extensions known by Linguist:

require 'linguist'

Linguist::Language.all
    .map {| name| [name.primary_extension].concat(name.extensions) }
    .flatten.sort.uniq
        .each do |ext|
            print ext, ' -> ',
                Linguist::Language.find_by_filename("foo#{ext}")
                    .map { |lang| lang.name }.join(', '),
                "\n"
        end

Here are the differences between the two lists with and without primary_extension:

$ diff -u with*
--- with-primary-extension  2014-05-01 19:23:34.000000000 +0200
+++ without-primary-extension   2014-05-01 19:23:44.000000000 +0200
@@ -87,7 +87,7 @@
 .chs -> C2hs Haskell
 .cirru -> Cirru
 .ck -> ChucK
-.cl -> OpenCL, Common Lisp
+.cl -> Common Lisp, OpenCL
 .cl2 -> Clojure
 .clixml -> XML
 .clj -> Clojure
@@ -109,7 +109,7 @@
 .cppobjdump -> Cpp-ObjDump
 .cproject -> XML
 .cpy -> COBOL
-.cr -> Crystal, Cirru
+.cr -> Cirru, Crystal
 .creole -> Creole
 .cs -> C#
 .csh -> Tcsh
@@ -301,7 +301,7 @@
 .ltx -> TeX
 .lua -> Lua
 .ly -> LilyPond
-.m -> Objective-C, M, Matlab
+.m -> M, Matlab, Objective-C
 .mak -> Makefile
 .mako -> Mako
 .man -> Groff
@@ -398,7 +398,7 @@
 .pm6 -> Perl6
 .pmod -> Pike
 .po -> Gettext Catalog
-.pod -> Pod, Perl
+.pod -> Perl, Pod
 .podsl -> Common Lisp
 .podspec -> Ruby
 .pogo -> PogoScript
@@ -508,7 +508,7 @@
 .sublime_metrics -> JSON
 .sublime_session -> JSON
 .svg -> XML
-.t -> Turing, Perl
+.t -> Perl, Turing
 .tcc -> C++
 .tcl -> Tcl
 .tcsh -> Tcsh
@@ -536,7 +536,7 @@
 .ui -> XML
 .upc -> Unified Parallel C
 .urdf -> XML
-.v -> Verilog, Coq
+.v -> Coq, Verilog
 .vala -> Vala
 .vapi -> Vala
 .vark -> Gosu

I don't think checking these corner cases is such a difficult job you claim it to be. And if such order differences do indeed change the classification of some projects, it just points out the existence of another design fault.

@nox nox Delete primary_extension from language data
The language attribute is still maintained as the first extension found.

This allows Mercury to be properly detected by Linguist, as per #748.
2d82071
@nox
nox commented

@sebgod Rebased it as you requested. Weird that your comment disappeared.

All tests pass, all those which pass currently on master that is.

$ git ls-files samples/{Mercury,M,Objective-C,Mathematica} | xargs -n 1 ruby -I lib bin/linguist
@sebgod

@nox, my comment appeared twice, so I deleted one, seems like that deleted both. Thanks a lot :)

@arfon arfon merged commit 2d82071 into github:master

1 check failed

Details continuous-integration/travis-ci The Travis CI build failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on May 1, 2014
  1. @nox

    Delete primary_extension from language data

    nox committed
    The language attribute is still maintained as the first extension found.
    
    This allows Mercury to be properly detected by Linguist, as per #748.
This page is out of date. Refresh to see the latest.
Showing with 410 additions and 294 deletions.
  1. +18 −34 lib/linguist/language.rb
  2. +390 −258 lib/linguist/languages.yml
  3. +2 −2 test/test_pedantic.rb
View
52 lib/linguist/language.rb
@@ -24,7 +24,6 @@ class Language
@extension_index = Hash.new { |h,k| h[k] = [] }
@interpreter_index = Hash.new { |h,k| h[k] = [] }
@filename_index = Hash.new { |h,k| h[k] = [] }
- @primary_extension_index = {}
# Valid Languages types
TYPES = [:data, :markup, :programming, :prose]
@@ -80,12 +79,6 @@ def self.create(attributes = {})
@extension_index[extension] << language
end
- if @primary_extension_index.key?(language.primary_extension)
- raise ArgumentError, "Duplicate primary extension: #{language.primary_extension}"
- end
-
- @primary_extension_index[language.primary_extension] = language
-
language.interpreters.each do |interpreter|
@interpreter_index[interpreter] << language
end
@@ -191,8 +184,7 @@ def self.find_by_alias(name)
# Returns all matching Languages or [] if none were found.
def self.find_by_filename(filename)
basename, extname = File.basename(filename), File.extname(filename)
- langs = [@primary_extension_index[extname]] +
- @filename_index[basename] +
+ langs = @filename_index[basename] +
@extension_index[extname]
langs.compact.uniq
end
@@ -299,15 +291,6 @@ def initialize(attributes = {})
@interpreters = attributes[:interpreters] || []
@filenames = attributes[:filenames] || []
- unless @primary_extension = attributes[:primary_extension]
- raise ArgumentError, "#{@name} is missing primary extension"
- end
-
- # Prepend primary extension unless its already included
- if primary_extension && !extensions.include?(primary_extension)
- @extensions = [primary_extension] + extensions
- end
-
# Set popular, and searchable flags
@popular = attributes.key?(:popular) ? attributes[:popular] : false
@searchable = attributes.key?(:searchable) ? attributes[:searchable] : true
@@ -395,20 +378,6 @@ def initialize(attributes = {})
# Returns the extensions Array
attr_reader :extensions
- # Deprecated: Get primary extension
- #
- # Defaults to the first extension but can be overridden
- # in the languages.yml.
- #
- # The primary extension can not be nil. Tests should verify this.
- #
- # This attribute is only used by app/helpers/gists_helper.rb for
- # creating the language dropdown. It really should be using `name`
- # instead. Would like to drop primary extension.
- #
- # Returns the extension String.
- attr_reader :primary_extension
-
# Public: Get interpreters
#
# Examples
@@ -432,6 +401,22 @@ def all_extensions
(extensions + [primary_extension]).uniq
end
+ # Deprecated: Get primary extension
+ #
+ # Defaults to the first extension but can be overridden
+ # in the languages.yml.
+ #
+ # The primary extension can not be nil. Tests should verify this.
+ #
+ # This method is only used by app/helpers/gists_helper.rb for creating
+ # the language dropdown. It really should be using `name` instead.
+ # Would like to drop primary extension.
+ #
+ # Returns the extension String.
+ def primary_extension
+ extensions.first
+ end
+
# Public: Get URL escaped name.
#
# Examples
@@ -573,9 +558,8 @@ def inspect
:group_name => options['group'],
:searchable => options.key?('searchable') ? options['searchable'] : true,
:search_term => options['search_term'],
- :extensions => options['extensions'].sort,
+ :extensions => [options['extensions'].first] + options['extensions'][1..-1].sort,
:interpreters => options['interpreters'].sort,
- :primary_extension => options['primary_extension'],
:filenames => options['filenames'],
:popular => popular.include?(name)
)
View
648 lib/linguist/languages.yml
390 additions, 258 deletions not shown because the diff is too large. Please use a local Git client to view these changes.
View
4 test/test_pedantic.rb
@@ -22,10 +22,10 @@ def test_extensions_are_sorted
file("languages.yml").lines.each do |line|
if line =~ /^ extensions:$/
extensions = []
- elsif extensions && line =~ /^ - \.(\w+)$/
+ elsif extensions && line =~ /^ - \.([\w-]+)( *#.*)?$/
extensions << $1
else
- assert_sorted extensions if extensions
+ assert_sorted extensions[1..-1] if extensions
extensions = nil
end
end
Something went wrong with that request. Please try again.