Added detection for the M (aka MUMPS) programming language. #148

Closed
@lparenteau

This adds detection for the M (aka MUMPS) programming language (see https://en.wikipedia.org/wiki/MUMPS).

I have successfully tested this using bundle exec rake test.

I have also run bundle exec linguist on the following projects, which I know have M files in them:

lparenteau/httpm
luisibanez/fis-gtm
luisibanez/VistA-FOIA
@luisibanez

+1

This is great,
thanks for preparing this patch.

Here are other projects in M as well:

https://github.com/OSEHR/M-Tools
https://github.com/OSEHR/CacheToGTM

and probably the most important is VistA (The EHR of the Department of Veterans Affairs):
https://github.com/OSEHRA/VistA-FOIA

VistA has about 40 forks now, and the number will increase soon.

@ksbhaskar

+1

A free / open source M/MUMPS implementation for Linux on x86 is GT.M (http://fis-gtm.com and http://sf.net/projects/fis-gtm)

@jcfr
jcfr commented Mar 27, 2012

+1 Excellent. This is great news :) Thanks @lparenteau

@dnrussell

+1 Sounds great!

@gribnick

+1 Highly desirable..

@ldlandis

+1 Highly useful addition!

@Pringley

+1 This would be great.

@ozgunbas

+1
I want it!

@rozant
rozant commented Mar 27, 2012

+1, M will be increasingly popular as VistA rolls out

@JDougherty

+1, will assist in development of VistA.

@igotmumps

+1

cool!

@owensw
owensw commented Mar 27, 2012

+1
Excellent!

@seanwoods

+1

@cpatrick

+1

@whitten
whitten commented Mar 28, 2012

+1
M or MUMPS code is traditionally tagged with a .m extension if it is a single routine.
The .rsa extension signifies a Routine Save Archive,
the .gsa extension signifies a Global Save Archive, and
the .zwr extension signifies a ZWRite global archive.

@thalesmello

+1

@tuskentower

+1
GT.M also uses the extension .glo for a Global Extract

@glilly
glilly commented Mar 28, 2012

+1

@mmendelson

+1
Very nice work and useful.

@ltarbox
ltarbox commented Mar 28, 2012

+1

@petercyli

+1
This is great for open source M

@ivansopin

+1

@bulaza
bulaza commented Mar 28, 2012

+1
Would be very helpful for work with M on Github.

@josh josh commented on the diff Mar 28, 2012
lib/linguist/blob_helper.rb
@@ -471,6 +474,10 @@ def guess_m_language
elsif lines.grep(/^%/).any?
Language['Matlab']
+ # M comment
+ elsif lines.grep(/^[ \t]*;/).any?
+ Language['M']
@josh
josh Mar 28, 2012 GitHub member

Only checking for comments is a rather crude method.

@seanwoods
seanwoods Mar 28, 2012

Agreed, a better regex would be these two:

^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

If any non-blank line fails to match one of these two regexes, the program isn't valid MUMPS code.

Edit: I consulted the standard and had to revise.

Source: http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a101004
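
For illustration, the rule above (every non-blank line must match one of the two regexes) could be sketched in Ruby roughly as follows. The method name plausible_m? is hypothetical and not part of the patch; the two patterns are copied unchanged from the comment.

```ruby
# Patterns quoted from the comment above, unchanged.
LABEL_LINE   = /^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*/
NUMERIC_LINE = /^\d+[ \t]+;*/

# Hypothetical helper (not in the patch): true when every non-blank
# line matches at least one of the two patterns.
def plausible_m?(source)
  source.each_line
        .map(&:chomp)
        .reject { |line| line.strip.empty? }
        .all? { |line| line.match?(LABEL_LINE) || line.match?(NUMERIC_LINE) }
end

puts plausible_m?("FOX ;demo\n write 1\n")
puts plausible_m?("#import <Foundation/Foundation.h>\n")
```

Note that, as quoted, the first pattern rejects a label starting with a lowercase letter and a line such as " quit" with no trailing whitespace, so it would likely need loosening before real use.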

@josh josh commented on the diff Mar 28, 2012
lib/linguist/languages.yml
@@ -627,6 +627,14 @@ Lua:
- .lua
- .nse
+M:
@josh
josh Mar 28, 2012 GitHub member

Seems like the primary name ought to be MUMPS rather than M.

@ldlandis
ldlandis Mar 28, 2012


Dear Joshua,

Except that for various legal reasons, the name MUMPS was officially
changed to M.

Something about vendors not liking the old name, and Massachusetts General
Hospital
wanting to keep ownership of the name MUMPS.

Cheers,
--ldl



@josh
josh Mar 28, 2012 GitHub member

Well, the first Google result for "m language" actually leads you here

http://en.wikipedia.org/wiki/M_(programming_language)

@lparenteau
lparenteau Mar 28, 2012

M is only the codename for this new Microsoft programming language. It will probably change when / if this gets released.

@80n
80n commented Mar 28, 2012

+1

@ksbhaskar

I don't have a strong preference between M and MUMPS, but for what it's worth, the official name is M. Ref: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=29268

@80n
80n commented Mar 28, 2012

If it wasn't clear, my +1 was for the original pull request by lparenteau, not a comment on the M vs MUMPS discussion.

fwiw:
M +1
MUMPS -1E6

@fscwitte

+1 (for the original pull request by lparenteau)

@cjh1
cjh1 commented Mar 28, 2012

+1

@DotMish
DotMish commented Mar 28, 2012

+1

@jamestjoyce

+1

@Sharkles

+1

@josh
Member
josh commented Mar 28, 2012

Sorry, but with the name controversy, there being no lexer, and it clashing with another very popular extension (obj-c), this isn't going to work.

Thanks for that patch.

@josh josh closed this Mar 28, 2012
@msires
msires commented Mar 28, 2012

The name controversy exists only in your mind.....

@luisibanez

Josh,

I'm wondering how github is dealing with MATLAB
code, that has also the .m extension.

It would seem that a file name extension clash
with Objective-C is not enough justification for not
classifying the language properly.

Also,
Could you please elaborate on the "lexer" and
how we could help to overcome that challenge ?

Thanks
@ldlandis

I agree... that is a poor excuse (there are many conflicts with the .m
suffix alone).

Perhaps there is no lexer, but I would imagine that a regular expression
controls this (sort of saw folks suggesting that anyway). I believe we can
come up with a pattern for file(1) that would mostly accurately identify
(our) M code.

For example (not exactly this, but similar): ^[%A-Za-z][A-Za-z0-9]*[\t ]+;
where the first characters of the first line are [%A-Za-z] optionally followed
by [A-Za-z0-9] characters, followed by spaces/tabs, followed by a semicolon. Most
M routines have this structure, and I would not complain much if this was
a required "stylization".
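
A minimal Ruby sketch of this first-line check, assuming the pattern is applied only to the first line of the file. The method name m_first_line? is hypothetical; the regex is the one given above.

```ruby
# Pattern quoted from the comment above: label, then spaces/tabs, then a semicolon.
FIRST_LINE_PATTERN = /^[%A-Za-z][A-Za-z0-9]*[\t ]+;/

# Hypothetical helper (not in the patch): test only the first line of the source.
def m_first_line?(source)
  first = source.each_line.first.to_s # "" when the source is empty
  first.match?(FIRST_LINE_PATTERN)
end

puts m_first_line?("httpm ; HTTP server\n quit\n")
puts m_first_line?("function y = f(x)\n")
```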

@whitten
whitten commented Mar 28, 2012

I agree with you, Larry,
although I think the pattern should allow for characters after the semicolon.

seanwoods earlier suggested:
^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

I don't know what \d is supposed to signify,
The first line of MUMPS routine should match the first pattern
unless it has an argument list on it.
In that case, the tag should allow for a single "(" followed by local variable
names separated by commas and ending with a ")"
You are not allowed to subscript the variables in a formal list,
and the "." is used for actual arguments, not for formal arguments.

Technically, the first line could have MUMPS code on it, but it is such a rare occurrence,
that I've only seen it a few times, and even then in throw-away code.
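
As a rough sketch, the first-line pattern could be extended in Ruby to accept an optional argument list as described: a single "(", names separated by commas, and a closing ")", with no subscripts and no ".". The constant and method names are hypothetical, and accepting an empty "()" is an assumption.

```ruby
# A name: "%" or a letter, followed by letters or digits.
NAME = /[%A-Za-z][A-Za-z0-9]*/
# Formal list: "(" names separated by commas ")". Empty "()" is allowed here by assumption.
FORMAL_LIST = /\((?:#{NAME}(?:,#{NAME})*)?\)/
# Label, optional formal list, spaces/tabs, then the comment semicolon.
FIRST_LINE_WITH_ARGS = /\A#{NAME}(?:#{FORMAL_LIST})?[ \t]+;/

# Hypothetical helper (not in the patch).
def m_header?(line)
  line.match?(FIRST_LINE_WITH_ARGS)
end

puts m_header?("trim(str,chars) ; strip characters")
puts m_header?("trim(str(1)) ; subscripted formal")
```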

By the way, some of the code of the patch appears to be at this URL.
https://github.com/lparenteau/linguist/blob/e0190a5a6e1ec52dbdb70ef9f62db6e6043bd03c/lib/linguist/blob_helper.rb

The relevant portion is:

# Internal: Guess language of .m files.
#
# Objective-C heuristics:
# * Keywords
#
# Matlab heuristics:
# * Leading function keyword
# * "%" comments
#
# M heuristics:
# * ";" comments
#
# Returns a Language.
def guess_m_language
  # Objective-C keywords
  if lines.grep(/^#import|@(interface|implementation|property|synthesize|end)/).any?
    Language['Objective-C']

  # File function
  elsif lines.first.to_s =~ /^function /
    Language['Matlab']

  # Matlab comment
  elsif lines.grep(/^%/).any?
    Language['Matlab']

  # M comment
  elsif lines.grep(/^[ \t]*;/).any?
    Language['M']

  # Fallback to Objective-C, don't want any Matlab false positives
  else
    Language['Objective-C']
  end
end
@lparenteau

@luisibanez The lexer is only used for syntax highlighting when viewing a source file directly on GitHub. Many other languages don't define a lexer either. This was something I wanted to look at later, but if you are interested, GitHub uses Pygments (http://pygments.org/) for this, so we would need to add a lexer for M to that project, which GitHub will eventually inherit.

As shown by @whitten, a .m file is currently considered to be either an Objective-C file or a Matlab file. My patch adds M to that list. The regex (or other method) used to detect M source code doesn't need to be exact.

@josh I have tested my patch on all the projects found on the Objective-C main page (https://github.com/languages/Objective-C), and of the 2317 .m files present, only 1 was wrongly tagged as M. I have fixed the issue and I think I could add that commit to this pull request if you re-open it. Or should I start a new pull request?

As for the other regexes suggested, I did try them on various M projects and the results weren't as good as looking for M comments. But if GitHub wants to go this way instead, I'm sure we can come up with a better regex.

@seanwoods

@whitten According to the standard (linked in my comment to the patch above), an M tag can be an integer as well. I just tested in GT.M.

The heuristics expressed in this Ruby code aren't very rigorous. Just look at how it detects Matlab.

M is pretty picky about how code needs to be laid out, but it boils down to those regexes. You could also check for strings like $Length(, $Piece(, etc. Alternatively you could look for the very-specific-to-M function call syntax e.g. $$trim^%str (that is, two dollar signs, followed by a tag name including the caret).

M is a pretty simple language. It should be easy to find the elements of M that don't intersect with Objective-C or Matlab.

As for the name issue - if it's between "M" and "MUMPS," use "M." This is how the standard is written. If it's against the Microsoft language linked to by Josh, I'd suggest using the syntax to detect the proper format.
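
A hedged Ruby sketch of the token-based idea above: look for the intrinsic functions named ($Length(, $Piece() or an extrinsic call of the form $$tag^routine. The method name is hypothetical, only the two intrinsics mentioned are covered, and treating function names as case-insensitive is an assumption.

```ruby
# Intrinsic functions named in the comment above; case-insensitivity is assumed.
INTRINSIC_CALL = /\$(?:Length|Piece)\(/i
# Extrinsic call: two dollar signs, a tag name, a caret, a routine name.
EXTRINSIC_CALL = /\$\$[%A-Za-z][A-Za-z0-9]*\^[%A-Za-z][A-Za-z0-9]*/

# Hypothetical helper (not in the patch): true if the source contains
# syntax unlikely to appear in Objective-C or Matlab.
def m_specific_syntax?(source)
  source.match?(INTRINSIC_CALL) || source.match?(EXTRINSIC_CALL)
end

puts m_specific_syntax?(" set x=$$trim^%str(y)")
puts m_specific_syntax?("[self.view addSubview:button];")
```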

@whitten
whitten commented Mar 28, 2012

Sean, I agree that you are allowed to put a string of numeric digits as a tag.

I didn't suggest that it needed to be an integer, because a string of
numeric digits is not a canonical integer in the M Language

00050 is a valid tag in M, but the canonical integer is 50

%000 is a valid tag as well, by the way.

I assume \d means "decimal integer" ?

David
713-870-3834
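
On the \d question: in Ruby regexes \d matches a single decimal digit, so \d+ matches any string of digits, leading zeros included, rather than a canonical integer. A small sketch, with hypothetical constant and method names, of a tag check that accepts both forms mentioned above:

```ruby
# A tag made only of digits, e.g. 00050 (not necessarily a canonical integer).
NUMERIC_TAG = /\A\d+\z/
# A tag starting with "%" or a letter, followed by letters or digits, e.g. %000.
NAME_TAG = /\A[%A-Za-z][A-Za-z0-9]*\z/

# Hypothetical helper (not in the patch).
def valid_tag?(tag)
  tag.match?(NUMERIC_TAG) || tag.match?(NAME_TAG)
end

puts valid_tag?("00050")
puts valid_tag?("%000")
puts "00050".to_i # the canonical integer value of the digit string
```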

@whitten whitten commented on the diff Mar 28, 2012
test/fixtures/m_simple.m
@@ -0,0 +1,4 @@
+fox
@whitten
whitten Mar 28, 2012

I can't tell if this line has a ls (label-separator) or not.
The M language requires such following a label.
The word "fox" is clearly the label, but it isn't clear whether a space or tab character is following it.

This is documented for the current standard at this URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a106007&Edition=1995
i.e.:
6.2.4 Label separator ls

A label separator (ls) precedes the linebody of each line. A ls consists of one or more spaces. The flexible number of spaces allows programmers to enhance the readability of their programs.

ls  ::= SP  ...

this is referenced from the URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Edition=1995&Page=a106003#Def_0002

6.2 Routine body routinebody

The routinebody is a sequence of lines terminated by an eor. Each line starts with one ls which may be preceded by an optional label and formallist. The ls is followed by zero or more li (level-indicator) which are followed by zero or more commands and a terminating eol. If there is a comment it is separated from the last command of a line by one or more spaces.

routinebody ::= line    ... eor
line    ::= levelline | formalline
eor ::= CR FF
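
To make the "fox" concern concrete, here is a rough Ruby sketch that splits a line into label, formal list, and line body following the text quoted above, where ls is one or more spaces. The names are hypothetical and the formal list is matched loosely; a line with no ls, such as a bare "fox", is rejected.

```ruby
# Optional label (name or digit string) with optional formal list,
# then ls (one or more spaces, per the quoted text), then the line body.
M_LINE = /\A(?:(?<label>[%A-Za-z][A-Za-z0-9]*|\d+)(?<formallist>\([^)]*\))?)?(?<ls> +)(?<body>.*)\z/

# Hypothetical helper (not in the patch): returns nil when the line has no ls.
def split_line(line)
  m = M_LINE.match(line.chomp)
  return nil unless m

  { label: m[:label], formallist: m[:formallist], body: m[:body] }
end

p split_line("fox ;comment")
p split_line(" write 1")
p split_line("fox")
```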
@ldlandis

IMO you do not get to be that specific in the classification filter (mapping suffix and file content to language).

That is why my pattern stopped at the semi-colon. Sure stuff can follow, but it may not be useful in classifying language.

@lparenteau

I have created a new pull request (#150) with an improved regex based on the comments, and fixed @whitten's concern regarding the "fox" label.

@7queue
7queue commented Mar 27, 2014

+1

@shameer
shameer commented Apr 25, 2014

+1

@sillyg00se

+1
