This adds detection for the M (aka MUMPS) programming language (see https://en.wikipedia.org/wiki/MUMPS).
I have successfully tested this using bundle exec rake test.
bundle exec rake test
I have also run bundle exec linguist on the following projects, which I know have M files in them:
bundle exec linguist
Added detection for the new M (aka MUMPS) language.
This is great,
thanks for preparing this patch.
Here are other projects in M as well:
and probably the most important is VistA (The EHR of the Department of Veterans Affairs):
VistA has about 40 forks now, and the number will increase soon.
A free / open source M/MUMPS implementation for Linux on x86 is GT.M (http://fis-gtm.com and http://sf/net/projects/fis-gtm )
+1 Excellent. This is great news :) Thanks @lparenteau
+1 Sounds great!
+1 Highly desirable..
+1 Highly useful addition!
+1 This would be great.
I want it!
+1, M will be increasingly popular as VistA rolls out
+1, will assist in development of VistA.
M or MUMPS code is traditionally tagged with a .m extension if it is a single routine,
the .rsa extension signifies a Routine Save Archive
the .gsa extension signifies a Global Save Archive
the .zwr extension signifies a ZWRite global archive.
GT.M also uses the extension .glo for a Global Extract
Very nice work and useful.
This is great for open source M
Would be very helpful for work with M on Github.
Only checking for comments is a rather crude method.
Agreed, a better regex would be these two:
^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
Every non-blank line must satisfy these two regexes; otherwise the program isn't valid MUMPS code.
Edit: I consulted the standard and had to revise.
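A minimal sketch of how such a per-line check could be run, assuming Ruby as in the patch. Only the one regex quoted above is available in this thread, so the pattern list has a single entry; the helper name `all_lines_match?` and the sample sources are made up for illustration.

```ruby
# Pattern quoted in the comment above. The comment mentions two regexes but
# only this one appears in the thread, so the list has a single entry.
LINE_PATTERNS = [
  /^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*/
].freeze

# Hypothetical helper: true when every non-blank line matches at least one
# of the given patterns.
def all_lines_match?(source, patterns = LINE_PATTERNS)
  source.each_line
        .reject { |line| line.strip.empty? }
        .all? { |line| patterns.any? { |pattern| line.match?(pattern) } }
end

# Made-up samples for illustration only.
m_source    = "HELLO ;greeting\n\tWRITE 1\n\n\tQUIT ;done\n"
objc_source = "#import <Foundation/Foundation.h>\n"

puts all_lines_match?(m_source)
puts all_lines_match?(objc_source)
```

Two things are worth noting about the quoted pattern as written: the trailing `;*` also matches zero semicolons, so it adds no constraint, and a label that starts with a lowercase letter does not match because the first character class only allows a space, a tab, `%` or `A-Z`.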
Seems like the primary name ought to be MUMPS rather than M.
Well, the first Google search result for "m language" actually leads you here
M is only the codename for this new Microsoft programming language. It will probably change when / if this gets released.
I don't have a strong preference between M and MUMPS, but for what it's worth, the official name is M. Ref: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=29268
If it wasn't clear, my +1 was for the original pull request by lparenteau, not a comment on the M vs MUMPS discussion.
+1 (for the original pull request by lparenteau)
Sorry, but with the name controversy, there being no lexer, and it clashing with another very popular extension (obj-c), this isn't going to work.
Thanks for that patch.
The name controversy exists only in your mind.....
I'm wondering how GitHub is dealing with MATLAB
code, which also has the .m extension.
It would seem that a file name extension clash
with Objective-C is not enough justification for not
classifying the language properly.
Could you please elaborate on the "lexer" and
how we could help to overcome that challenge?
I agree... that is a poor excuse (there are many conflicts with the .m
Perhaps there is no lexer, but I would imagine that a regular expression
controls this (sort of saw folks suggesting that anyway). I believe we can
come up with a pattern for file(1) that would mostly accurately identify
(our) M code.
For example (not exactly this, but similar): ^[%A-Za-z][A-Za-z0-9]*[\t ]+;
where the first characters of the first line are [%A-Za-z] optionally followed
by [A-Za-z0-9] followed by spaces/tabs, followed by a semi-colon. Most
M routines have this structure, and I would not complain much if this was
a required "stylization".
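As a rough sketch, the first-line check described above could look like this in Ruby. The regex is the one given in the comment; the helper name `m_first_line?` and the sample inputs are hypothetical.

```ruby
# Pattern from the comment above: a label, then spaces/tabs, then a semicolon.
FIRST_LINE_PATTERN = /^[%A-Za-z][A-Za-z0-9]*[\t ]+;/

# Hypothetical helper: inspects only the first line of the source.
def m_first_line?(source)
  source.lines.first.to_s.match?(FIRST_LINE_PATTERN)
end

# Made-up samples for illustration only.
puts m_first_line?("ZTMR ;Timer routine\n QUIT\n")
puts m_first_line?("%ZIS ;device handler\n")
puts m_first_line?("function y = f(x)\n")
puts m_first_line?("#import <Foundation/Foundation.h>\n")
```

Because the pattern stops at the semicolon, anything may follow the comment marker on that line.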
I agree with you, Larry,
although I think the pattern should allow for characters after the semicolon.
seanwoods earlier suggested:
^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
I don't know what \d is supposed to signify,
The first line of a MUMPS routine should match the first pattern
unless it has an argument list on it.
In that case, the tag should allow for a single "(" followed by local variable
names separated by commas and ending with a ")"
You are not allowed to subscript the variables in a formal list,
and the "." is used for actual arguments, not for formal arguments.
Technically, the first line could have MUMPS code on it, but it is such a rare occurrence,
that I've only seen it a few times, and even then in throw-away code.
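A sketch of a first-line pattern extended with the argument list described above, written in Ruby. The constant names are invented for illustration, and accepting an empty list `()` is an assumption rather than something stated in this thread.

```ruby
# A name: optional leading %, then letters and digits.
M_NAME = /[%A-Za-z][A-Za-z0-9]*/

# A formal list: "(" names separated by commas ")". No "." prefix and no
# subscripts are accepted. Allowing "()" is an assumption.
M_FORMAL_LIST = /\((?:#{M_NAME}(?:,#{M_NAME})*)?\)/

# Label, optional formal list, spaces/tabs, then a semicolon.
FIRST_LINE_WITH_ARGS = /^#{M_NAME}(?:#{M_FORMAL_LIST})?[\t ]+;/

# Hypothetical helper for trying the pattern on a single line.
def m_tag_line?(line)
  line.match?(FIRST_LINE_WITH_ARGS)
end

# Made-up samples for illustration only.
puts m_tag_line?("ADD(X,Y) ;sum two values")
puts m_tag_line?("MAIN ;entry point")
puts m_tag_line?("ADD(.X) ;dot is for actual arguments")
puts m_tag_line?("ADD(X(1)) ;subscripted formal")
```

Like the simpler pattern, this stops at the semicolon and does not try to recognise MUMPS code on the first line.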
By the way, some of the code of the patch appears to be at this URL.
The relevant portion is:
# Internal: Guess language of .m files.
# Objective-C heuristics:
# * Keywords
# Matlab heuristics:
# * Leading function keyword
# * "%" comments
# M heuristics:
# * ";" comments
# Returns a Language.
# Objective-C keywords
# File function
elsif lines.first.to_s =~ /^function /
# Matlab comment
# M comment
elsif lines.grep(/^[ \t]*;/).any?
# Fallback to Objective-C, don't want any Matlab false positives
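For readers who want to try the decision order, here is a self-contained sketch that follows the comments in the fragment above. It is not the patch itself: it takes an array of lines and returns plain strings instead of Language objects, and the Objective-C keyword and Matlab comment regexes are assumptions, since those lines are not shown in the fragment.

```ruby
# Hypothetical stand-in for the quoted method: takes an array of lines and
# returns a language name as a String.
def guess_m_language(lines)
  # Objective-C keywords (assumed pattern; the original regex is not shown)
  if lines.grep(/^\s*(#import|@(interface|implementation|end))/).any?
    "Objective-C"
  # File function
  elsif lines.first.to_s =~ /^function /
    "Matlab"
  # Matlab comment (assumed pattern; the original regex is not shown)
  elsif lines.grep(/^\s*%/).any?
    "Matlab"
  # M comment
  elsif lines.grep(/^[ \t]*;/).any?
    "M"
  # Fallback to Objective-C, don't want any Matlab false positives
  else
    "Objective-C"
  end
end

# Made-up samples for illustration only.
puts guess_m_language(["HELLO\n", " ; say hello\n", " WRITE \"hi\"\n"])
puts guess_m_language(["#import <Foundation/Foundation.h>\n"])
puts guess_m_language(["function y = f(x)\n"])
```

With the assumed Matlab comment pattern, an M routine whose label starts with `%` would be caught by the Matlab branch before the M branch is reached, so the order of the checks matters in this sketch.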
@luisibanez The lexer is only used to do syntax highlighting when viewing a source file directly on GitHub. There are many other languages that don't define a lexer either. This was something I wanted to look at later, but if you are interested, GitHub uses Pygments (http://pygments.org/) for this, so we would need to add a lexer for M to that project, which GitHub will eventually inherit.
As shown by @whitten, a .m file is currently considered to be either an Objective-C file or a Matlab file. My patch adds M to that list. The regex (or other method) used to detect M source code doesn't need to be exact.
@josh I have tested my patch on all the projects found on the Objective-C main page (https://github.com/languages/Objective-C), and of the 2317 .m files present, only 1 was wrongly tagged as M. I have fixed the issue and I think I could add that commit to this pull request if you re-open it. Or should I start a new pull request?
As for the other regexes suggested, I did try them on various M projects and the results weren't as good as looking for M comments. But if GitHub wants to go this way instead, I'm sure we can come up with a better regex.
@whitten According to the standard (linked in my comment to the patch above), an M tag can be an integer as well. I just tested in GT.M.
The heuristics expressed in this Ruby code aren't very rigorous. Just look at how it detects Matlab.
M is pretty picky about how code needs to be laid out, but it boils down to those regexes. You could also check for strings like $Length(, $Piece(, etc. Alternatively you could look for the very-specific-to-M function call syntax e.g. $$trim^%str (that is, two dollar signs, followed by a tag name including the caret).
M is a pretty simple language. It should be easy to find the elements of M that don't intersect with Objective-C or Matlab.
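A small sketch of what checking for those M-specific constructs might look like in Ruby. The two intrinsic names are the ones mentioned above; matching them case-insensitively and the helper name `m_specific_syntax?` are assumptions made for this example.

```ruby
# Intrinsic function calls named in the comment above. Matching them
# case-insensitively is an assumption of this sketch.
INTRINSIC_CALL = /\$(?:Length|Piece)\(/i

# Extrinsic call syntax: two dollar signs, a tag name, a caret, a routine name.
EXTRINSIC_CALL = /\$\$[%A-Za-z][A-Za-z0-9]*\^[%A-Za-z][A-Za-z0-9]*/

# Hypothetical helper: true when the source contains either construct.
def m_specific_syntax?(source)
  source.match?(INTRINSIC_CALL) || source.match?(EXTRINSIC_CALL)
end

# Made-up samples for illustration only.
puts m_specific_syntax?(" SET X=$$trim^%str(Y)\n")
puts m_specific_syntax?(" WRITE $Piece(X,\"^\",2)\n")
puts m_specific_syntax?("[self setNeedsDisplay];\n")
```

A check like this would complement a comment-based heuristic rather than replace it, since a short routine may use neither construct.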
As for the name issue - if it's between "M" and "MUMPS," use "M." This is how the standard is written. If it's against the Microsoft language linked to by Josh, I'd suggest using the syntax to detect the proper format.
Sean, I agree that you are allowed to put a string of numeric digits as a tag.
I didn't suggest that it needed to be an integer, because a string of
numeric digits is not a canonical integer in the M language.
00050 is a valid tag in M, but the canonical integer is 50.
%000 is a valid tag as well, by the way.
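To make the distinction concrete, here is a short Ruby sketch, assuming a label is either a name (optionally starting with %) or a string of digits. The helper names are hypothetical.

```ruby
# Assumed label shape: either a name (optionally starting with %) or a
# string of digits, as described in the comments above.
LABEL_PATTERN = /\A(?:[%A-Za-z][A-Za-z0-9]*|[0-9]+)\z/

def valid_label?(text)
  text.match?(LABEL_PATTERN)
end

# True only when the text is already in canonical integer form
# (for example, no leading zeros).
def canonical_integer?(text)
  text == text.to_i.to_s
end

%w[00050 50 %000 fox].each do |label|
  puts "#{label}: label=#{valid_label?(label)} canonical=#{canonical_integer?(label)}"
end
```

So a detection regex that wants to accept numeric tags needs to allow any run of digits, not only canonical integers.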
I assume \d means "decimal integer" ?
I can't tell if this line has a ls (label-separator) or not.
The M language requires such following a label.
The word "fox" is clearly the label, but it isn't clear whether a space or tab character is following it.
This is documented for the current standard at this URL:
6.2.4 Label separator ls
A label separator (ls) precedes the linebody of each line. A ls consists of one or more spaces. The flexible number of spaces allows programmers to enhance the readability of their programs.
ls ::= SP ...
this is referenced from the URL:
6.2 Routine body routinebody
The routinebody is a sequence of lines terminated by an eor. Each line starts with one ls which may be preceded by an optional label and formallist. The ls is followed by zero or more li (level-indicator) which are followed by zero or more commands and a terminating eol. If there is a comment it is separated from the last command of a line by one or more spaces.
routinebody ::= line ... eor
line ::= │ levelline | formalline │
eor ::= CR FF
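A rough Ruby sketch of splitting a line at the label separator, taking the quoted definition literally (one or more spaces). The regex, the helper name `split_line`, and the decision to ignore formal lists and level indicators are simplifications made for this example.

```ruby
# Optional label, then ls (one or more spaces, per the definition quoted
# above), then the rest of the line. Formal lists are not handled here.
LINE_WITH_LS = /\A(?<label>[%A-Za-z][A-Za-z0-9]*|[0-9]+)?(?<ls> +)(?<body>.*)\z/

# Hypothetical helper: returns the label (or nil) and the line body,
# or nil when no label separator is found.
def split_line(line)
  m = LINE_WITH_LS.match(line.chomp)
  return nil unless m

  { label: m[:label], body: m[:body] }
end

# Made-up samples for illustration only.
p split_line("fox ;comment")
p split_line(" WRITE 1")
p split_line("fox;comment")
p split_line("fox\t;comment")
```

Under this literal reading a tab after the label is not a separator, which is why it matters whether a space or a tab follows the label; an implementation that also accepts tabs would need `[ \t]+` in place of the space run.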
IMO you do not get to be that specific in the classification filter (mapping suffix and file content to language).
That is why my pattern stopped at the semi-colon. Sure stuff can follow, but it may not be useful in classifying language.
I have created a new pull request (#150) with an improved regex based on the comments, and fixed @whitten's concern regarding the "fox" label.