Added detection for the M (aka MUMPS) programming language. #148

Closed
wants to merge 1 commit into from

Conversation

lparenteau
Contributor

This adds detection for the M (aka MUMPS) programming language (see https://en.wikipedia.org/wiki/MUMPS).

I have successfully tested this using bundle exec rake test.

I have also run bundle exec linguist on the following projects, which I know have M files in them:

lparenteau/httpm
luisibanez/fis-gtm
luisibanez/VistA-FOIA

@luisibanez

+1

This is great,
thanks for preparing this patch.

Here are other projects in M as well:

https://github.com/OSEHR/M-Tools
https://github.com/OSEHR/CacheToGTM

and probably the most important is VistA (The EHR of the Department of Veterans Affairs):
https://github.com/OSEHRA/VistA-FOIA

VistA has about 40 forks now, and the number will increase soon.

@ksbhaskar

+1

A free / open source M/MUMPS implementation for Linux on x86 is GT.M (http://fis-gtm.com and http://sf.net/projects/fis-gtm).

@jcfr

jcfr commented Mar 27, 2012

+1 Excellent. This is great news :) Thanks @lparenteau

@dnrussell

+1 Sounds great!

@ghost

ghost commented Mar 27, 2012

+1 Highly desirable..

@ldlandis

+1 Highly useful addition!

@Pringley

+1 This would be great.

@ozgunbas

+1
I want it!

@rozant

rozant commented Mar 27, 2012

+1, M will be increasingly popular as VistA rolls out

@JDougherty

+1, will assist in development of VistA.

@igotmumps

+1

cool!

@owensw

owensw commented Mar 27, 2012

+1
Excellent!

@seanwoods

+1

@cpatrick

+1

@whitten
Contributor

whitten commented Mar 28, 2012

+1
M or MUMPS code is traditionally given a .m extension if it is a single routine.
The .rsa extension signifies a Routine Save Archive,
the .gsa extension signifies a Global Save Archive, and
the .zwr extension signifies a ZWRite global archive.

@thalesmello

+1

@tuskentower

+1
GT.M also uses the extension .glo for a Global Extract

@glilly

glilly commented Mar 28, 2012

+1

@mmendelson

+1
Very nice work and useful.

@ltarbox

ltarbox commented Mar 28, 2012

+1

@petercyli

+1
This is great for open source M

@ivansopin

+1

@0xAlexei

+1
Would be very helpful for work with M on Github.

@@ -471,6 +474,10 @@ def guess_m_language
elsif lines.grep(/^%/).any?
Language['Matlab']

# M comment
elsif lines.grep(/^[ \t]*;/).any?
Language['M']

Only checking for comments is a rather crude method.


Agreed, a better regex would be these two:

^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

If any non-blank line fails to match one of these two regexes, the program isn't valid MUMPS code.

Edit: I consulted the standard and had to revise.

Source: http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a101004
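As a rough sketch of the check described in this comment, the two regexes can be applied to every non-blank line. The helper name `all_lines_look_like_m?` and the sample strings are made up for illustration and are not part of the patch. Note that, as posted, the patterns need whitespace after the first word and do not accept a lowercase label, so this is only an approximation of the standard.

```ruby
# Patterns quoted verbatim from the review comment above.
LABEL_OR_INDENT = /^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*/
NUMERIC_LABEL   = /^\d+[ \t]+;*/

# Hypothetical helper (not in Linguist): true when every non-blank
# line of the source matches at least one of the two patterns.
def all_lines_look_like_m?(source)
  non_blank = source.each_line.reject { |line| line.strip.empty? }
  non_blank.all? do |line|
    line.match?(LABEL_OR_INDENT) || line.match?(NUMERIC_LABEL)
  end
end

# Illustrative inputs only.
m_sample    = "HELLO ; greet the user\n write \"Hello\",!\n set x=1\n"
objc_sample = "#import <Foundation/Foundation.h>\n@interface Foo : NSObject\n@end\n"

puts all_lines_look_like_m?(m_sample)
puts all_lines_look_like_m?(objc_sample)
```

Requiring every line to match is stricter than the `lines.grep(...).any?` style used in the patch, which only needs a single matching line.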

@80n

80n commented Mar 28, 2012

+1

@ksbhaskar

I don't have a strong preference between M and MUMPS, but for what it's worth, the official name is M. Ref: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=29268

@80n

80n commented Mar 28, 2012

If it wasn't clear, my +1 was for the original pull request by lparenteau, not a comment on the M vs MUMPS discussion.

fwiw:
M +1
MUMPS -1E6

@fscwitte

+1 (for the original pull request by lparenteau)

@cjh1

cjh1 commented Mar 28, 2012

+1

@DotMish

DotMish commented Mar 28, 2012

+1

@jamestjoyce

+1

@Sharkles

+1

@josh
Contributor

josh commented Mar 28, 2012

Sorry, but with the name controversy, there being no lexer, and it clashing with another very popular extension (obj-c), this isn't going to work.

Thanks for that patch.

@josh josh closed this Mar 28, 2012
@msires

msires commented Mar 28, 2012

The name controversy exists only in your mind.....

@luisibanez

Josh,

I'm wondering how GitHub is dealing with MATLAB
code, which also has the .m extension.

It would seem that a file name extension clash
with Objective-C is not enough justification for not
classifying the language properly.

Also,
could you please elaborate on the "lexer" and
how we could help to overcome that challenge?

Thanks

@ldlandis

I agree... that is a poor excuse (there are many conflicts with the .m
suffix alone).

Perhaps there is no lexer, but I would imagine that a regular expression
controls this (I sort of saw folks suggesting that anyway). I believe we can
come up with a pattern for file(1) that would identify (our) M code
with reasonable accuracy.

For example (not exactly this, but similar): ^[%A-Za-z][A-Za-z0-9]*[\t ]+;
where the first character of the first line is [%A-Za-z], optionally followed
by characters from [A-Za-z0-9], followed by spaces/tabs, followed by a semicolon. Most
M routines have this structure, and I would not complain much if this were
a required "stylization".
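A minimal sketch of how this first-line pattern could be tried out in Ruby. The method name `m_first_line?` and the sample inputs are assumptions for illustration; `\A` is used instead of `^` so that only the very start of the first line is tested.

```ruby
# The pattern from the comment above, anchored at the start of the string.
FIRST_LINE_PATTERN = /\A[%A-Za-z][A-Za-z0-9]*[\t ]+;/

# Hypothetical helper: does the first line look like "LABEL<spaces>;comment"?
def m_first_line?(source)
  first_line = source.lines.first.to_s
  FIRST_LINE_PATTERN.match?(first_line)
end

# Illustrative inputs only.
puts m_first_line?("fox ; the quick brown fox\n write 1\n")
puts m_first_line?("function y = f(x)\n")
puts m_first_line?("#import <Foundation/Foundation.h>\n")
```

Because only the first line is inspected, a routine whose first line has no comment would be missed, which is the "required stylization" trade-off mentioned above.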

@whitten
Contributor

whitten commented Mar 28, 2012

I agree with you, Larry,
although I think the pattern should allow for characters after the semicolon.

seanwoods earlier suggested:
^[ \t%A-Z][A-Za-z0-9]+[ \t]+;*
^\d+[ \t]+;*

I don't know what \d is supposed to signify.
The first line of a MUMPS routine should match the first pattern
unless it has an argument list on it.
In that case, the tag should allow for a single "(" followed by local variable
names separated by commas and ending with a ")".
You are not allowed to subscript the variables in a formal list,
and the "." is used for actual arguments, not for formal arguments.

Technically, the first line could have MUMPS code on it, but it is such a rare occurrence,
that I've only seen it a few times, and even then in throw-away code.
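One way the formal-list rule described above might be folded into the first-line pattern, as a sketch only. The constant and method names are invented here, and the pattern follows the wording of this comment (one or more plain names, no subscripts, no leading "."); an empty "()" list is deliberately not handled.

```ruby
# A name: "%" or a letter, followed by letters or digits (assumption based on the thread).
NAME = /[%A-Za-z][A-Za-z0-9]*/

# "(" name [, name ...] ")" with no subscripts and no "." prefix.
FORMAL_LIST = /\(#{NAME}(?:,#{NAME})*\)/

# Label, optional formal list, label separator, then a comment that may
# carry any characters after the semicolon.
FIRST_LINE_WITH_ARGS = /\A#{NAME}(?:#{FORMAL_LIST})?[ \t]+;.*/

# Hypothetical helper, not part of the patch.
def m_first_line_with_args?(line)
  FIRST_LINE_WITH_ARGS.match?(line)
end

puts m_first_line_with_args?("trim(str,chars) ; strip characters")
puts m_first_line_with_args?("trim(str(1)) ; subscripted formal")
puts m_first_line_with_args?("trim(.str) ; dot belongs to actual arguments")
```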

By the way, some of the code of the patch appears to be at this URL.
https://github.com/lparenteau/linguist/blob/e0190a5a6e1ec52dbdb70ef9f62db6e6043bd03c/lib/linguist/blob_helper.rb

The relevant portion is:

# Internal: Guess language of .m files.
#
# Objective-C heuristics:
# * Keywords
#
# Matlab heuristics:
# * Leading function keyword
# * "%" comments
#
# M heuristics:
# * ";" comments
#
# Returns a Language.
def guess_m_language
  # Objective-C keywords
  if lines.grep(/^#import|@(interface|implementation|property|synthesize|end)/).any?
    Language['Objective-C']

  # File function
  elsif lines.first.to_s =~ /^function /
    Language['Matlab']

  # Matlab comment
  elsif lines.grep(/^%/).any?
    Language['Matlab']

  # M comment
  elsif lines.grep(/^[ \t]*;/).any?
    Language['M']

  # Fallback to Objective-C, don't want any Matlab false positives
  else
    Language['Objective-C']
  end
end

@lparenteau
Contributor Author

@luisibanez The lexer is only used for syntax highlighting when viewing a source file directly on GitHub. Many other languages don't define a lexer either. This was something I wanted to look at later, but if you are interested, GitHub uses Pygments (http://pygments.org/) for this, so we would need to add a lexer for M to that project, which GitHub will eventually inherit.

As shown by @whitten, a .m file is currently considered to be either an Objective-C file or a Matlab file. My patch adds M to that list. The regex (or other method) used to detect M source code doesn't need to be exact.

@josh I have tested my patch on all the projects found on the Objective-C main page (https://github.com/languages/Objective-C), and of the 2317 .m files present, only 1 was wrongly tagged as M. I have fixed the issue and I think I could add that commit to this pull request if you re-open it. Or should I start a new pull request?

As for the other regexes suggested, I did try them on various M projects and the results weren't as good as looking for M comments. But if GitHub wants to go this way instead, I'm sure we can come up with a better regex.

@seanwoods

@whitten According to the standard (linked in my comment on the patch above), an M tag can be an integer as well. I just tested this in GT.M.

The heuristics expressed in this Ruby code aren't very rigorous. Just look at how it detects Matlab.

M is pretty picky about how code needs to be laid out, but it boils down to those regexes. You could also check for strings like $Length(, $Piece(, etc. Alternatively you could look for the very-specific-to-M function call syntax e.g. $$trim^%str (that is, two dollar signs, followed by a tag name including the caret).

M is a pretty simple language. It should be easy to find the elements of M that don't intersect with Objective-C or Matlab.
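A hedged sketch of the token-based idea above: look for a couple of intrinsic functions and for the `$$tag^routine` call form. The names below are illustrative, only `$Length(` and `$Piece(` are covered (abbreviations such as `$L(` are not), and case-insensitive matching of the function names is an assumption of this sketch.

```ruby
# Intrinsic functions named in the comment; /i is an assumption of this sketch.
INTRINSIC_FUNCTION = /\$(?:Length|Piece)\(/i

# Extrinsic call: two dollar signs, a tag name, a caret, a routine name.
EXTRINSIC_CALL = /\$\$[%A-Za-z][A-Za-z0-9]*\^[%A-Za-z][A-Za-z0-9]*/

# Hypothetical helper: true when any line contains M-specific syntax.
def m_specific_syntax?(source)
  source.each_line.any? do |line|
    line.match?(INTRINSIC_FUNCTION) || line.match?(EXTRINSIC_CALL)
  end
end

puts m_specific_syntax?(" set x=$$trim^%str(y)\n")
puts m_specific_syntax?(" write $PIECE(x,\",\",2)\n")
puts m_specific_syntax?("[self setLength:5];\n")
```

A check like this could sit alongside the comment heuristic rather than replace it, since short routines may not use any of these constructs.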

As for the name issue - if it's between "M" and "MUMPS," use "M." This is how the standard is written. If it's against the Microsoft language linked to by Josh, I'd suggest using the syntax to detect the proper format.

@whitten
Contributor

whitten commented Mar 28, 2012

Sean, I agree that you are allowed to put a string of numeric digits as a tag.

I didn't suggest that it needed to be an integer, because a string of
numeric digits is not a canonical integer in the M Language

00050 is a valid tag in M, but the canonical integer is 50

%000 is a valid tag as well, by the way.

I assume \d means "decimal integer" ?

David
713-870-3834
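On the `\d` question: in Ruby (and most regex dialects) `\d` matches a single decimal digit, so `\d+` matches a string of digits rather than a canonical integer. A small sketch, with a made-up helper name, showing that the tags mentioned above are accepted as written:

```ruby
# A label is either a name ("%" or a letter, then letters/digits)
# or a string of decimal digits. Pattern is an assumption based on this thread.
LABEL = /\A(?:[%A-Za-z][A-Za-z0-9]*|\d+)\z/

# Hypothetical helper, not part of the patch.
def valid_label?(text)
  LABEL.match?(text)
end

puts valid_label?("00050")   # digit string, leading zeros kept as part of the tag
puts "00050".to_i            # the canonical integer value differs from the tag text
puts valid_label?("%000")
puts valid_label?("5abc")
```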

@@ -0,0 +1,4 @@
fox

I can't tell if this line has an ls (label-separator) or not.
The M language requires such following a label.
The word "fox" is clearly the label, but it isn't clear whether a space or tab character is following it.

This is documented for the current standard at this URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Page=a106007&Edition=1995
i.e.:
6.2.4 Label separator ls

A label separator (ls) precedes the linebody of each line. A ls consists of one or more spaces. The flexible number of spaces allows programmers to enhance the readability of their programs.

ls  ::= SP  ...

this is referenced from the URL:
http://71.174.62.16/Demo/AnnoStd?Frame=Main&Edition=1995&Page=a106003#Def_0002

6.2 Routine body routinebody

The routinebody is a sequence of lines terminated by an eor. Each line starts with one ls which may be preceded by an optional label and formallist. The ls is followed by zero or more li (level-indicator) which are followed by zero or more commands and a terminating eol. If there is a comment it is separated from the last command of a line by one or more spaces.

routinebody ::= line    ... eor
line    ::= │ levelline       |    formalline │
eor ::= CR FF
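To make the label-separator question concrete, here is a sketch that splits the start of a line into label, formal list, and ls, and reports whether an ls is present. The names are hypothetical; the quoted 1995 text defines ls as spaces only, and accepting tabs as well is an assumption made here because the thread mentions both.

```ruby
# Optional label, optional formal list, then zero or more spaces/tabs.
# Every part is optional, so the pattern always matches at the start.
LINE_START = /\A(?<label>[%A-Za-z][A-Za-z0-9]*|\d+)?(?<formallist>\([^)]*\))?(?<ls>[ \t]*)/

# Hypothetical helper: report the label (if any) and whether an ls follows it.
def label_separator_report(line)
  m = LINE_START.match(line.chomp)
  { label: m[:label], has_ls: !m[:ls].empty? }
end

p label_separator_report("fox")
p label_separator_report("fox ; a comment")
p label_separator_report(" write 1")
```

A line consisting of a bare "fox" is reported with `has_ls: false`, which is the situation questioned in this review comment.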

@ldlandis

IMO you do not get to be that specific in the classification filter (mapping suffix and file content to language).

That is why my pattern stopped at the semi-colon. Sure stuff can follow, but it may not be useful in classifying language.

@lparenteau
Contributor Author

I have created a new pull request (#150) with an improved regex based on the comments, and fixed @whitten's concern regarding the "fox" label.

@7queue

7queue commented Mar 27, 2014

+1

@shameer

shameer commented Apr 25, 2014

+1

@vietanhvu3001

+1

@xlijun

xlijun commented Nov 15, 2016

+1

@James3678

+1

@pchaigno
Contributor

This pull request was closed 5 years ago. Since then, Linguist has evolved a lot and it now has support for M. If this is not working for you, please open a new issue.
