Added detection for the M (aka MUMPS) programming language. #148

lparenteau · 2012-03-27T16:02:48Z

This adds detection for the M (aka MUMPS) programming language (see https://en.wikipedia.org/wiki/MUMPS).

I have successfully tested this using bundle exec rake test.

I have also called bundle exec linguist on the following projects, which I know have M files in them:

lparenteau/httpm
luisibanez/fis-gtm
luisibanez/VistA-FOIA

luisibanez · 2012-03-27T22:12:19Z

+1

This is great,
thanks for preparing this patch.

Here are other projects in M as well:

https://github.com/OSEHR/M-Tools
https://github.com/OSEHR/CacheToGTM

and probably the most important is VistA (The EHR of the Department of Veterans Affairs):
https://github.com/OSEHRA/VistA-FOIA

VistA has about 40 forks now, and the number will increase soon.

ksbhaskar · 2012-03-27T22:42:20Z

+1

A free / open source M/MUMPS implementation for Linux on x86 is GT.M (http://fis-gtm.com and http://sf.net/projects/fis-gtm )

jcfr · 2012-03-27T22:44:34Z

+1 Excellent. This is great news :) Thanks @lparenteau

dnrussell · 2012-03-27T22:57:35Z

+1 Sounds great!

ghost · 2012-03-27T22:57:42Z

+1 Highly desirable..

ldlandis · 2012-03-27T23:07:08Z

+1 Highly useful addition!

Pringley · 2012-03-27T23:17:05Z

+1 This would be great.

ozgunbas · 2012-03-27T23:18:25Z

+1
I want it!

rozant · 2012-03-27T23:19:09Z

+1, M will be increasingly popular as VistA rolls out

JDougherty · 2012-03-27T23:21:39Z

+1, will assist in development of VistA.

igotmumps · 2012-03-27T23:23:20Z

+1

cool!

owensw · 2012-03-27T23:28:11Z

+1
Excellent!

seanwoods · 2012-03-27T23:29:23Z

+1

cpatrick · 2012-03-28T00:14:36Z

+1

whitten · 2012-03-28T01:09:42Z

+1
M or MUMPS code is traditionally tagged with a .m extension if it is a single routine,
the .rsa extension signifies a Routine Save Archive
the .gsa extension signifies a Global Save Archive
the .zwr extension signifies a ZWRite global archive.

thalesmello · 2012-03-28T01:45:36Z

+1

tuskentower · 2012-03-28T02:00:59Z

+1
GT.M also uses the extension .glo for a Global Extract

glilly · 2012-03-28T02:12:30Z

+1

mmendelson · 2012-03-28T02:14:38Z

+1
Very nice work and useful.

ltarbox · 2012-03-28T02:22:45Z

+1

petercyli · 2012-03-28T02:43:51Z

+1
This is great for open source M

ivansopin · 2012-03-28T03:12:19Z

+1

0xAlexei · 2012-03-28T03:13:14Z

+1
Would be very helpful for work with M on Github.

josh · 2012-03-28T03:23:22Z

lib/linguist/blob_helper.rb

@@ -471,6 +474,10 @@ def guess_m_language
      elsif lines.grep(/^%/).any?
        Language['Matlab']

+      # M comment
+      elsif lines.grep(/^[ \t]*;/).any?
+        Language['M']


Only checking for comments is a rather crude method.

seanwoods · 2012-03-28

Agreed, a better regex would be these two:

^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

If any non-blank line matches neither of these two regexes, the program isn't valid MUMPS code.

Edit: I consulted the standard and had to revise.

Source: http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a101004
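For anyone who wants to try the two regexes on real files, here is a minimal sketch of the check described above. The helper name `every_line_matches?` and the constant names are made up for illustration and are not part of Linguist.

```ruby
# Hypothetical helper (not part of Linguist): checks every non-blank line
# of a source string against the two regexes quoted above.
LABELLED_LINE = /^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*/
NUMERIC_LINE  = /^\d+[ \t]+;*/

def every_line_matches?(source)
  source.each_line
        .map(&:chomp)
        .reject { |line| line.strip.empty? }
        .all? { |line| line.match?(LABELLED_LINE) || line.match?(NUMERIC_LINE) }
end

every_line_matches?("HELLO ; greet\n\n set x=1\n")         # M-style lines
every_line_matches?("#import <Foundation/Foundation.h>\n")  # Objective-C line
```

As written, the first regex requires whitespace after the leading word, so a line such as ` quit` with nothing after it does not match. That may or may not be what was intended.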

80n · 2012-03-28T09:50:50Z

+1

ksbhaskar · 2012-03-28T11:20:54Z

I don't have a strong preference between M and MUMPS, but for what it's worth, the official name is M. Ref: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=29268

80n · 2012-03-28T11:53:11Z

If it wasn't clear, my +1 was for the original pull request by lparenteau, not a comment on the M vs MUMPS discussion.

fwiw:
M +1
MUMPS -1E6

fscwitte · 2012-03-28T11:56:30Z

+1 (for the original pull request by lparenteau)

cjh1 · 2012-03-28T12:19:21Z

+1

DotMish · 2012-03-28T12:31:40Z

+1

jamestjoyce · 2012-03-28T12:35:13Z

+1

Sharkles · 2012-03-28T13:13:13Z

+1

josh · 2012-03-28T13:53:06Z

Sorry, but with the name controversy, there being no lexer, and it clashing with another very popular extension (obj-c), this isn't going to work.

Thanks for that patch.

msires · 2012-03-28T14:48:27Z

The name controversy exists only in your mind.....

luisibanez · 2012-03-28T16:25:58Z

Josh,

I'm wondering how github is dealing with MATLAB
code, which also has the .m extension.

It would seem that a file name extension clash
with Objective-C is not enough justification for not
classifying the language properly.

Also,
Could you please elaborate on the "lexer" and
how we could help to overcome that challenge?

Thanks

ldlandis · 2012-03-28T16:50:28Z

I agree... that is a poor excuse (there are many conflicts with the .m
suffix alone).

Perhaps there is no lexer, but I would imagine that a regular expression
controls this (sort of saw folks suggesting that anyway). I believe we can
come up with a pattern for file(1) that would mostly accurately identify
(our) M code.

For example (not exactly this, but similar): ^[%A-Za-z][A-Za-z0-9]*[\t ]+;
where the first characters of the first line are [%A-Za-z] optionally followed
by [A-Za-z0-9] followed by spaces/tabs, followed by a semi-colon. Most
M routines have this structure, and I would not complain much if this was
a required "stylization".
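A rough sketch of that first-line check in Ruby, assuming the pattern is applied only to the first line of the file. The method name `m_first_line?` is hypothetical and not something in Linguist.

```ruby
# Hypothetical check (not Linguist code): tests only the first line of a
# file against the pattern suggested above.
FIRST_LINE = /^[%A-Za-z][A-Za-z0-9]*[\t ]+;/

def m_first_line?(source)
  first = source.lines.first.to_s # nil.to_s is "" for an empty file
  first.match?(FIRST_LINE)
end

m_first_line?("XUS ;SF/STAFF - SIGNON\n S X=1\n") # label, space, semicolon
m_first_line?("#import <UIKit/UIKit.h>\n")        # Objective-C
m_first_line?("function y = f(x)\n")              # Matlab
```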

whitten · 2012-03-28T17:29:01Z

I agree with you, Larry,
although I think the pattern should allow for characters after the semicolon.

seanwoods earlier suggested:
^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

I don't know what \d is supposed to signify.
The first line of MUMPS routine should match the first pattern
unless it has an argument list on it.
In that case, the tag should allow for a single "(" followed by local variable
names separated by commas and ending with a ")"
You are not allowed to subscript the variables in a formal list,
and the "." is used for actual arguments, not for formal arguments.

Technically, the first line could have MUMPS code on it, but it is such a rare occurrence,
that I've only seen it a few times, and even then in throw-away code.
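To make the argument-list case concrete, here is a sketch of a first-line pattern that also accepts a parenthesised list of plain variable names. All constant and method names are invented for the example, and accepting an all-digit tag is an assumption carried over from the regexes quoted above.

```ruby
# Hypothetical pattern (not Linguist code) for a routine's first line:
# a tag, an optional formal list of unsubscripted names, then spaces/tabs.
NAME        = /[%A-Za-z][A-Za-z0-9]*/
LABEL       = /(?:#{NAME}|\d+)/ # assumption: all-digit tags are accepted too
FORMAL_LIST = /\((?:#{NAME}(?:,#{NAME})*)?\)/
FIRST_LINE_WITH_ARGS = /\A#{LABEL}(?:#{FORMAL_LIST})?[ \t]+/

def routine_first_line?(line)
  line.match?(FIRST_LINE_WITH_ARGS)
end

routine_first_line?("ADD(A,B) ; sum")      # formal list of two names
routine_first_line?("ADD(A(1)) ; invalid") # subscripted name is rejected
routine_first_line?("ADD(.A) ; invalid")   # "." is for actual arguments only
```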

By the way, some of the code of the patch appears to be at this URL.
https://github.com/lparenteau/linguist/blob/e0190a5a6e1ec52dbdb70ef9f62db6e6043bd03c/lib/linguist/blob_helper.rb

The relevant portion is:

# Internal: Guess language of .m files.
#
# Objective-C heuristics:
# * Keywords
#
# Matlab heuristics:
# * Leading function keyword
# * "%" comments
#
# M heuristics:
# * ";" comments
#
# Returns a Language.
def guess_m_language
  # Objective-C keywords
  if lines.grep(/^#import|@(interface|implementation|property|synthesize|end)/).any?
    Language['Objective-C']

  # File function
  elsif lines.first.to_s =~ /^function /
    Language['Matlab']

  # Matlab comment
  elsif lines.grep(/^%/).any?
    Language['Matlab']

  # M comment
  elsif lines.grep(/^[ \t]*;/).any?
    Language['M']

  # Fallback to Objective-C, don't want any Matlab false positives
  else
    Language['Objective-C']
  end
end

lparenteau · 2012-03-28T17:53:55Z

@luisibanez The lexer is only used to do syntax highlighting when viewing source files directly in GitHub. There are many other languages that don't define a lexer as well. This was something I wanted to look at later, but if you are interested, GitHub uses Pygments (http://pygments.org/) for this, so we would need to add a lexer for M to that project, which GitHub will eventually inherit.

As shown by @whitten, a .m file is currently considered to be either an Objective-C file or a Matlab file. My patch adds M to that list. The regex (or other method) used to detect M source code doesn't need to be exact.

@josh I have tested my patch on all the projects found in the Objective-C main page (https://github.com/languages/Objective-C), and of the 2317 .m files present, only 1 was wrongly tagged as M. I have fixed the issue and I think I could add that commit to this pull request if you re-open it. Or should I start a new pull request?

As for the other regexes suggested, I did try them on various M projects and the results weren't as good as looking for M comments. But if GitHub wants to go this way instead, I'm sure we can come up with a better regex.
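A tally like the one described above could be reproduced with something along these lines. This is a standalone sketch: it copies the branch order of the patch but returns plain strings instead of `Language` objects, and `tally_guesses` is a hypothetical helper, not something in Linguist.

```ruby
# Standalone re-statement of the patch's branch order, returning strings.
OBJC_KEYWORDS = /^#import|@(interface|implementation|property|synthesize|end)/

def guess_m_language(lines)
  if lines.grep(OBJC_KEYWORDS).any?
    'Objective-C'
  elsif lines.first.to_s =~ /^function /
    'Matlab'
  elsif lines.grep(/^%/).any?
    'Matlab'
  elsif lines.grep(/^[ \t]*;/).any?
    'M'
  else
    'Objective-C'
  end
end

# Hypothetical helper: counts how many sources land in each language.
def tally_guesses(sources)
  sources.each_with_object(Hash.new(0)) do |source, counts|
    counts[guess_m_language(source.lines)] += 1
  end
end

tally_guesses(["#import <x.h>\n", "% note\n", " ; comment\n", "int x;\n"])
```

One kind of input that would trip the comment heuristic is a C-style file with no Objective-C keywords and a line holding only a semicolon, for example an empty statement on its own line.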

seanwoods · 2012-03-28T18:41:07Z

@whitten According to the standard (linked in my comment to the patch above), an M tag can be an integer as well. I just tested in GT.M.

The heuristics expressed in this Ruby code aren't very rigorous. Just look at how it detects Matlab.

M is pretty picky about how code needs to be laid out, but it boils down to those regexes. You could also check for strings like $Length(, $Piece(, etc. Alternatively you could look for the very-specific-to-M function call syntax e.g. $$trim^%str (that is, two dollar signs, followed by a tag name including the caret).
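As an illustration of those two ideas, a sketch that looks for the intrinsic functions named above and for the `$$tag^routine` call form. The names are hypothetical, only `$Length(` and `$Piece(` are covered, and matching them case-insensitively is an assumption.

```ruby
# Hypothetical detectors (not Linguist code) for M-specific syntax.
INTRINSIC = /\$(?:Length|Piece)\(/i # assumption: case-insensitive match
EXTRINSIC = /\$\$[%A-Za-z][A-Za-z0-9]*\^[%A-Za-z][A-Za-z0-9]*/

def m_specific_syntax?(source)
  source.match?(INTRINSIC) || source.match?(EXTRINSIC)
end

m_specific_syntax?(' set x=$$trim^%str(y)')        # extrinsic call with caret
m_specific_syntax?(' write $Piece(x,"^",2)')       # intrinsic function
m_specific_syntax?('NSString *s = @"price: $5";')  # Objective-C string
```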

M is a pretty simple language. It should be easy to find the elements of M that don't intersect with Objective-C or Matlab.

As for the name issue - if it's between "M" and "MUMPS," use "M." This is how the standard is written. If it's against the Microsoft language linked to by Josh, I'd suggest using the syntax to detect the proper format.

whitten · 2012-03-28T19:44:53Z

Sean, I agree that you are allowed to put a string of numeric digits as a tag.

I didn't suggest that it needed to be an integer, because a string of
numeric digits is not a canonical integer in the M Language.

00050 is a valid tag in M, but the canonical integer is 50

%000 is a valid tag as well, by the way.

I assume \d means "decimal integer"?
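In Ruby regexes `\d` matches a single decimal digit, so `\d+` matches a run of digits such as 00050 as text, without treating it as a number. A small sketch of how the two kinds of tag could be told apart follows; `label_kind` is a hypothetical helper, not part of the patch.

```ruby
# Hypothetical helper: classifies the tag at the start of a line.
def label_kind(line)
  case line
  when /\A\d+(?=[ \t(])/ then :numeric                 # e.g. 00050
  when /\A[%A-Za-z][A-Za-z0-9]*(?=[ \t(])/ then :name  # e.g. %000 or fox
  end
end

label_kind("00050 ; numeric tag") # digits only
label_kind("%000 ; also a tag")   # "%" followed by digits
"00050".to_i                      # the integer value drops the leading zeros
```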

David

whitten · 2012-03-28T20:22:38Z

test/fixtures/m_simple.m

@@ -0,0 +1,4 @@
+fox


I can't tell if this line has an ls (label-separator) or not.
The M language requires such following a label.
The word "fox" is clearly the label, but it isn't clear whether a space or tab character is following it.

This is documented for the current standard at this URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a106007&Edition=1995
i.e.:
6.2.4 Label separator ls

A label separator (ls) precedes the linebody of each line. A ls consists of one or more spaces. The flexible number of spaces allows programmers to enhance the readability of their programs.

ls ::= SP ...

this is referenced from the URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Edition=1995&Page=a106003#Def_0002

6.2 Routine body routinebody

The routinebody is a sequence of lines terminated by an eor. Each line starts with one ls which may be preceded by an optional label and formallist. The ls is followed by zero or more li (level-indicator) which are followed by zero or more commands and a terminating eol. If there is a comment it is separated from the last command of a line by one or more spaces.

routinebody ::= line ... eor
line ::= │ levelline | formalline │
eor ::= CR FF
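A fixture line like "fox" could be checked for a trailing label separator with a sketch such as this one. It follows the quoted grammar literally (spaces only), so treating a tab as a separator is deliberately left out, and the names are hypothetical.

```ruby
# Hypothetical check: a tag, an optional parenthesised list, then one or
# more spaces (ls ::= SP ...). Tabs are not accepted in this sketch.
LABEL_WITH_LS = /\A(?:[%A-Za-z][A-Za-z0-9]*|\d+)(?:\([^)]*\))? +/

def label_followed_by_ls?(line)
  line.match?(LABEL_WITH_LS)
end

label_followed_by_ls?("fox\n")           # tag with nothing after it
label_followed_by_ls?("fox ; comment\n") # tag, space, comment
```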

ldlandis · 2012-03-28T20:35:01Z

IMO you do not get to be that specific in the classification filter (mapping suffix and file content to language).

That is why my pattern stopped at the semi-colon. Sure stuff can follow, but it may not be useful in classifying language.

lparenteau · 2012-03-29T01:05:06Z

I have created a new pull request (#150) with an improved regex based on the comments, and fixed @whitten's concern regarding the "fox" label.

7queue · 2014-03-27T02:34:53Z

+1

shameer · 2014-04-25T19:09:57Z

+1

vietanhvu3001 · 2015-06-25T18:25:52Z

+1

xlijun · 2016-11-15T05:30:20Z

+1

James3678 · 2017-03-30T19:47:31Z

+1

pchaigno · 2017-03-31T06:32:45Z

This pull request was closed 5 years ago. Since then Linguist has evolved a lot and it now has support for M. If this is not working for you, please open a new issue.

Added detection for the new M (aka MUMPS) language.

e0190a5

josh reviewed Mar 28, 2012

josh closed this Mar 28, 2012

whitten reviewed Mar 28, 2012
