
Added detection for the M (aka MUMPS) programming language. #148

Closed
wants to merge 1 commit into from
Laurent Parenteau

This adds detection for the M (aka MUMPS) programming language (see https://en.wikipedia.org/wiki/MUMPS).

I have successfully tested this using bundle exec rake test.

I have also run bundle exec linguist on the following projects, which I know have M files in them:

lparenteau/httpm
luisibanez/fis-gtm
luisibanez/VistA-FOIA
Luis Ibanez

+1

This is great,
thanks for preparing this patch.

Here are other projects in M as well:

https://github.com/OSEHR/M-Tools
https://github.com/OSEHR/CacheToGTM

and probably the most important is VistA (The EHR of the Department of Veterans Affairs):
https://github.com/OSEHRA/VistA-FOIA

VistA has about 40 forks now, and the number will increase soon.

K.S. Bhaskar

+1

A free / open source M/MUMPS implementation for Linux on x86 is GT.M (http://fis-gtm.com and http://sf.net/projects/fis-gtm )

Jean-Christophe Fillion-Robin
jcfr commented March 27, 2012

+1 Excellent. This is great news :) Thanks @lparenteau

dnrussell

+1 Sounds great!

gribnick

+1 Highly desirable..

LD Landis

+1 Highly useful addition!

Ben Pringle

+1 This would be great.

ozgunbas

+1
I want it!

Thomas Rozanski

+1, M will be increasingly popular as VistA rolls out

Joseph W. Dougherty

+1, will assist in development of VistA.

Michael Zacharias

+1

cool!

Steve

+1
Excellent!

Sean Woods

+1

Patrick Reynolds

+1

David Whitten

+1
M or MUMPS code is traditionally tagged with a .m extension if it is a single routine,
the .rsa extension signifies a Routine Save Archive
the .gsa extension signifies a Global Save Archive
the .zwr extension signifies a ZWRite global archive.

tuskentower

+1
GT.M also uses the extension .glo for a Global Extract

George Lilly

+1

mmendelson

+1
Very nice work and useful.

Lawrence Tarbox

+1

petercyli

+1
This is great for open source M

Ivan Sopin

+1

bulaza

+1
Would be very helpful for work with M on Github.

Joshua Peek josh commented on the diff March 27, 2012
lib/linguist/blob_helper.rb
@@ -471,6 +474,10 @@ def guess_m_language
       elsif lines.grep(/^%/).any?
         Language['Matlab']
 
+      # M comment
+      elsif lines.grep(/^[ \t]*;/).any?
+        Language['M']
Joshua Peek Owner
josh added a note March 27, 2012

Only checking for comments is a rather crude method.

Sean Woods
seanwoods added a note March 27, 2012

Agreed, a better regex would be these two:

^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

If any non-blank line fails to match one of these two regexes, the program isn't valid MUMPS code.

Edit: I consulted the standard and had to revise.

Source: http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a101004
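
A minimal Ruby sketch of the check described above, assuming the two regexes are applied exactly as written to every non-blank line. The helper name plausible_m? is hypothetical and is not part of Linguist.

```ruby
# Hypothetical helper (not part of Linguist): true when every non-blank
# line matches at least one of the two patterns suggested above.
def plausible_m?(source)
  labelled_or_indented = /^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*/
  numeric_label        = /^\d+[ \t]+;*/

  source.each_line
        .map(&:chomp)
        .reject { |line| line.strip.empty? }
        .all? { |line| line.match?(labelled_or_indented) || line.match?(numeric_label) }
end

plausible_m?("HELLO ;greeting\n W \"hi\",!\n")  # => true
plausible_m?("fox\n\tquit\n")                    # => false
```

As written, the first pattern only accepts a space, tab, % or uppercase letter in the first column, so a lowercase label such as fox is rejected, and an indented line needs trailing whitespace after its first word to match.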

Joshua Peek josh commented on the diff March 27, 2012
lib/linguist/languages.yml
@@ -627,6 +627,14 @@ Lua:
   - .lua
   - .nse
 
+M:
Joshua Peek Owner
josh added a note March 27, 2012

Seems like the primary name ought to be MUMPS rather than M.

LD Landis
ldlandis added a note March 27, 2012
Joshua Peek Owner
josh added a note March 27, 2012

Well, the first Google result for "m language" actually leads you here

http://en.wikipedia.org/wiki/M_(programming_language)

Laurent Parenteau
lparenteau added a note March 28, 2012

M is only the codename for this new Microsoft programming language. It will probably change when / if this gets released.

80n
80n commented March 28, 2012

+1

K.S. Bhaskar

I don't have a strong preference between M and MUMPS, but for what it's worth, the official name is M. Ref: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=29268

80n
80n commented March 28, 2012

If it wasn't clear, my +1 was for the original pull request by lparenteau, not a comment on the M vs MUMPS discussion.

fwiw:
M +1
MUMPS -1E6

fscwitte

+1 (for the original pull request by lparenteau)

Chris Harris
cjh1 commented March 28, 2012

+1

DotMish

+1

Katie

+1

Joshua Peek
Owner
josh commented March 28, 2012

Sorry, but with the name controversy, there being no lexer, and it clashing with another very popular extension (obj-c), this isn't going to work.

Thanks for that patch.

Joshua Peek josh closed this March 28, 2012
oldMster

The name controversy exists only in your mind.....

Luis Ibanez

Josh,

I'm wondering how github is dealing with MATLAB
code, that has also the .m extension.

It would seem that a file name extension clash
with Objective-C is not enough justification for not
classifying the language properly.

Also,
Could you please elaborate on the "lexer" and
how we could help to overcome that challenge ?

Thanks
LD Landis

I agree... that is a poor excuse (there are many conflicts with the .m
suffix alone).

Perhaps there is no lexer, but I would imagine that a regular expression
controls this (sort of saw folks suggesting that anyway). I believe we can
come up with a pattern for file(1) that would mostly accurately identify
(our) M code.

For example (not exactly this, but similar): ^[%A-Za-z][A-Za-z0-9]*[\t ]+;
where the first characters of the first line are [%A-Za-z] optionally followed
by [A-Za-z0-9] followed by spaces/tabs, followed by a semi-colon. Most
M routines have this structure, and I would not complain much if this was
a required "stylization".
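
The first-line pattern above could be tried out with a short Ruby sketch like the following. The helper name m_first_line? is hypothetical, and testing only the first line of the file is an assumption taken from the description.

```ruby
# Hypothetical helper: apply the pattern from the comment above to the
# first line of a file's contents only.
def m_first_line?(source)
  pattern = /^[%A-Za-z][A-Za-z0-9]*[\t ]+;/
  source.lines.first.to_s.match?(pattern)
end

m_first_line?("hello ; greets the world\n write \"hi\",!\n")  # => true
m_first_line?("#import <Foundation/Foundation.h>\n")          # => false
m_first_line?("function y = f(x)\n")                          # => false
```

The pattern is not anchored at the end, so any text may follow the semicolon.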

David Whitten

I agree with you, Larry,
although I think the pattern should allow for characters after the semicolon.

seanwoods earlier suggested:
^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

I don't know what \d is supposed to signify,
The first line of MUMPS routine should match the first pattern
unless it has an argument list on it.
In that case, the tag should allow for a single "(" followed by local variable
names separated by commas and ending with a ")"
You are not allowed to subscript the variables in a formal list,
and the "." is used for actual arguments, not for formal arguments.

Technically, the first line could have MUMPS code on it, but it is such a rare occurrence,
that I've only seen it a few times, and even then in throw-away code.
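
One way to sketch the argument-list rule in Ruby is shown below. The helper name and the decision to accept an empty list "()" are assumptions; the shape of a name is taken from the pattern discussed earlier in the thread.

```ruby
# Hypothetical helper: first line = label, optional formal list, spaces or
# tabs, then a semicolon. Formal arguments are plain names separated by
# commas: no subscripts and no leading "." are accepted.
def m_first_line_with_args?(line)
  name        = /[%A-Za-z][A-Za-z0-9]*/
  formal_list = /\((?:#{name}(?:,#{name})*)?\)/   # "()" accepted: an assumption
  pattern     = /\A#{name}(?:#{formal_list})?[ \t]+;/
  line.match?(pattern)
end

m_first_line_with_args?("add(a,b) ; returns a+b")  # => true
m_first_line_with_args?("fox ; no arguments")      # => true
m_first_line_with_args?("add(a(1)) ; subscript")   # => false
m_first_line_with_args?("add(.a) ; by reference")  # => false
```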

By the way, some of the code of the patch appears to be at this URL.
https://github.com/lparenteau/linguist/blob/e0190a5a6e1ec52dbdb70ef9f62db6e6043bd03c/lib/linguist/blob_helper.rb

The relevant portion is:

# Internal: Guess language of .m files.
#
# Objective-C heuristics:
# * Keywords
#
# Matlab heuristics:
# * Leading function keyword
# * "%" comments
#
# M heuristics:
# * ";" comments
#
# Returns a Language.
def guess_m_language
  # Objective-C keywords
  if lines.grep(/^#import|@(interface|implementation|property|synthesize|end)/).any?
    Language['Objective-C']

  # File function
  elsif lines.first.to_s =~ /^function /
    Language['Matlab']

  # Matlab comment
  elsif lines.grep(/^%/).any?
    Language['Matlab']

  # M comment
  elsif lines.grep(/^[ \t]*;/).any?
    Language['M']

  # Fallback to Objective-C, don't want any Matlab false positives
  else
    Language['Objective-C']
  end
end
Laurent Parenteau

@luisibanez The lexer is only used to do syntax highlighting when viewing source files directly in GitHub. There are many other languages that don't define a lexer as well. This was something I wanted to look at later, but if you are interested, GitHub uses Pygments (http://pygments.org/) for this, so we would need to add a lexer for M to that project, which GitHub will eventually inherit.

As shown by @whitten, a .m file is currently considered to be either an Objective-C file or a Matlab file. My patch adds M to that list. The regex (or other method) used to detect M source code doesn't need to be exact.

@josh I have tested my patch on all the projects found on the Objective-C main page (https://github.com/languages/Objective-C), and of the 2317 .m files present, only 1 was wrongly tagged as M. I have fixed the issue and I think I could add that commit to this pull request if you re-open it. Or should I start a new pull request?

As for the other regexes suggested, I did try them on various M projects and the results weren't as good as looking for M comments. But if GitHub wants to go this way instead, I'm sure we can come up with a better regex.

Sean Woods

@whitten According to the standard (linked in my comment to the patch above), an M tag can be an integer as well. I just tested in GT.M.

The heuristics expressed in this Ruby code aren't very rigorous. Just look at how it detects Matlab.

M is pretty picky about how code needs to be laid out, but it boils down to those regexes. You could also check for strings like $Length(, $Piece(, etc. Alternatively you could look for the very-specific-to-M function call syntax e.g. $$trim^%str (that is, two dollar signs, followed by a tag name including the caret).

M is a pretty simple language. It should be easy to find the elements of M that don't intersect with Objective-C or Matlab.
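
A rough Ruby sketch of that idea follows. The helper name is hypothetical, only the two intrinsic functions named above are checked, and matching them case-insensitively is an assumption.

```ruby
# Hypothetical helper: look for syntax that is specific to M, namely the
# intrinsic functions named above and the $$tag^routine call form.
def m_specific_syntax?(lines)
  intrinsic_call = /\$(?:Length|Piece)\(/i   # case-insensitive: an assumption
  extrinsic_call = /\$\$[%A-Za-z][A-Za-z0-9]*\^[%A-Za-z][A-Za-z0-9]*/

  lines.any? { |line| line.match?(intrinsic_call) || line.match?(extrinsic_call) }
end

m_specific_syntax?([" set x=$$trim^%str(y)"])        # => true
m_specific_syntax?([" write $Piece(s,\",\",2),!"])   # => true
m_specific_syntax?(["y = length(x);"])               # => false
```

A check like this could sit alongside the comment test rather than replace it, since short routines may contain neither form.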

As for the name issue - if it's between "M" and "MUMPS," use "M." This is how the standard is written. If it's against the Microsoft language linked to by Josh, I'd suggest using the syntax to detect the proper format.

David Whitten

Sean, I agree that you are allowed to put a string of numeric digits as a tag.

I didn't suggest that it needed to be an integer, because a string of
numeric digits is not a canonical integer in the M Language

00050 is a valid tag in M, but the canonical integer is 50

%000 is a valid tag as well, by the way.

I assume \d means "decimal integer" ?
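
For reference, in Ruby regular expressions \d matches a single decimal digit, so \d+ matches a run of digits rather than a canonical integer. A small sketch of the distinction, with hypothetical helper names and an assumed shape for name labels:

```ruby
# Hypothetical helpers illustrating digit-string labels versus
# canonical integers.
def label_kind(label)
  return :digits if label.match?(/\A\d+\z/)                   # e.g. 00050
  return :name   if label.match?(/\A[%A-Za-z][A-Za-z0-9]*\z/) # e.g. %000
  :invalid
end

def canonical_integer?(label)
  label == label.to_i.to_s
end

label_kind("00050")          # => :digits
canonical_integer?("00050")  # => false (the canonical form is "50")
label_kind("%000")           # => :name
```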

David
713-870-3834

David Whitten whitten commented on the diff March 28, 2012
test/fixtures/m_simple.m
@@ -0,0 +1,4 @@
+fox
David Whitten
whitten added a note March 28, 2012

I can't tell if this line has a ls (label-separator) or not.

The M language requires such following a label.

The word "fox" is clearly the label, but it isn't clear whether a space or tab character is following it.

This is documented for the current standard at this URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a106007&Edition=1995
i.e.:
6.2.4 Label separator ls

A label separator (ls) precedes the linebody of each line. A ls consists of one or more spaces. The flexible number of spaces allows programmers to enhance the readability of their programs.

ls  ::= SP  ...

this is referenced from the URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Edition=1995&Page=a106003#Def_0002

6.2 Routine body routinebody

The routinebody is a sequence of lines terminated by an eor. Each line starts with one ls which may be preceded by an optional label and formallist. The ls is followed by zero or more li (level-indicator) which are followed by zero or more commands and a terminating eol. If there is a comment it is separated from the last command of a line by one or more spaces.

routinebody ::= line    ... eor
line    ::= │ levelline       |    formalline │
eor ::= CR FF
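
The rule quoted above can be checked mechanically. The Ruby sketch below uses a hypothetical helper and ignores formal lists; accepting a tab as a label separator is an assumption, since the quoted text only mentions spaces.

```ruby
# Hypothetical helper: does the line start with an optional label
# followed by at least one label-separator character?
def label_separator?(line)
  line.chomp.match?(/\A(?:[%A-Za-z][A-Za-z0-9]*|\d+)?[ \t]+/)
end

label_separator?("fox\n")      # => false (label with nothing after it)
label_separator?("fox \n")     # => true
label_separator?("\tquit\n")   # => true
```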
LD Landis

IMO you do not get to be that specific in the classification filter (mapping suffix and file content to language).

That is why my pattern stopped at the semi-colon. Sure stuff can follow, but it may not be useful in classifying language.

Laurent Parenteau

I have created a new pull request (#150) with an improved regex based on the comments, and fixed @whitten's concern regarding the "fox" label.

7queue

+1


Showing 1 unique commit by 1 author.

Mar 27, 2012
Laurent Parenteau Added detection for the new M (aka MUMPS) language. e0190a5
lib/linguist/blob_helper.rb
@@ -457,6 +457,9 @@ def guess_h_language
     # * Leading function keyword
     # * "%" comments
     #
+    # M heuristics:
+    # * ";" comments
+    #
     # Returns a Language.
     def guess_m_language
       # Objective-C keywords
@@ -471,6 +474,10 @@ def guess_m_language
       elsif lines.grep(/^%/).any?
         Language['Matlab']
 
+      # M comment
+      elsif lines.grep(/^[ \t]*;/).any?
+        Language['M']
+
       # Fallback to Objective-C, don't want any Matlab false positives
       else
         Language['Objective-C']
lib/linguist/languages.yml
@@ -627,6 +627,14 @@ Lua:
   - .lua
   - .nse
 
+M:
+  type: programming
+  lexer: Text only
+  aliases:
+  - mumps
+  extensions:
+  - .m
+
 Makefile:
   extensions:
   - .mak
test/fixtures/m_simple.m
@@ -0,0 +1,4 @@
+fox
+	; The quick brown fox jumps over the lazy dog
+	write "The quick brown fox jumps over the lazy dog",!
+	quit
test/test_blob.rb
@@ -307,6 +307,7 @@ def test_language
     assert_equal Language['Objective-C'], blob("hello.m").language
     assert_equal Language['Matlab'], blob("matlab_function.m").language
     assert_equal Language['Matlab'], blob("matlab_script.m").language
+    assert_equal Language['M'], blob("m_simple.m").language
 
     # .r disambiguation
     assert_equal Language['R'],           blob("hello-r.R").language