Add an entry for Bluespec's other syntax #6476
For the heuristics, you can add them here: linguist/lib/linguist/heuristics.yml, lines 123 to 132 in e855ef2.
When there are more than two languages with the same extension, having a good heuristic in place isn't a luxury, it's close to a necessity. Otherwise you end up depending on the Bayesian classifier, which isn't that reliable.
lib/linguist/languages.yml
ace_mode: haskell
codemirror_mode: haskell
codemirror_mime_type: text/x-haskell
language_id: 626c75650
Did you run the script to generate the language_id? Unless the ids were in hexadecimal all along and I didn't know, there shouldn't be a "c" in there.
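The reviewer's point can be checked mechanically. The following is a minimal sketch, assuming only what the comment implies: that a language_id should consist of decimal digits. The helper name `looks_like_language_id` and the value `"12345"` are made up for illustration and are not part of Linguist.

```python
def looks_like_language_id(value: str) -> bool:
    """Return True if value is a non-empty string of ASCII decimal digits."""
    return value.isascii() and value.isdigit()


# The value quoted in the review contains a "c", so it is not decimal.
flagged = "626c75650"
print(looks_like_language_id(flagged))

# A hypothetical all-digit value, used only for contrast.
print(looks_like_language_id("12345"))
```

A check like this only catches malformed values; it says nothing about whether an ID is unique or was produced by the project's own script.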
I'm returning to these changes after a long pause, so I don't recall. I'll run the script to generate a new ID and push another commit for it. Thanks!
Thank you. I have pushed two commits to address the two concerns that you raised: one to regenerate the ID and one to add Bluespec BH to the heuristics for I added a pattern for a construct that all BH files must have. I put it last in the heuristic: I don't think it should match files in the other languages, but I can't be 100% sure; whereas the existing patterns for those languages are unlikely to match BH files, so it's OK for BH's pattern to be last.
I tested that the new heuristic is being used by removing the third sample (which was added to help the Bayesian classifier): Linguist still correctly identified the files in my repo (which it hadn't done with just two samples and no heuristic).
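The ordering argument above can be illustrated with a small sketch. It assumes that rules for an extension are tried in order and the first match wins, which is what the comment implies. The rule list, the function name `disambiguate`, and both patterns are simplified stand-ins for illustration, not the entries in heuristics.yml.

```python
import re

# Hypothetical, simplified rules for files ending in .bs.
# Order matters: earlier rules are tried first.
RULES = [
    ("Bikeshed", re.compile(r"^<pre class=metadata>", re.MULTILINE)),
    ("Bluespec BH", re.compile(r"^package\s+\w+", re.MULTILINE)),
]


def disambiguate(source):
    """Return the language of the first rule whose pattern matches, else None."""
    for language, pattern in RULES:
        if pattern.search(source):
            return language
    return None


bh_file = "package Foo(bar) where\n"
mixed_file = "<pre class=metadata>\nTitle: X\n</pre>\npackage notes\n"

print(disambiguate(bh_file))
print(disambiguate(mixed_file))
```

Because the BH rule is last, a file that happens to match both patterns is attributed to the earlier language, so a slightly over-broad BH pattern cannot take files away from the languages listed before it.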
Looks good to me. I'll let a maintainer/collaborator take it from here.
I see that the "Classifier cross-validation" job failed with this new line:
If I understand correctly, this is testing whether each sample can be recognized by a classifier trained only on the other samples. I would guess that this example differs significantly from the other two BH examples. I am able to run the test locally, so I'll see if adding more samples helps. There are plenty more BH libraries in the It would help if I could run the test script on only the
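The procedure described here is leave-one-out cross-validation. The sketch below shows the idea with a deliberately crude token-overlap scorer; it is not Linguist's Bayesian classifier, and every sample string and function name is invented for illustration.

```python
from collections import Counter


def tokens(text):
    return text.split()


def train(samples):
    """Build per-language token counts from (language, text) pairs."""
    model = {}
    for language, text in samples:
        model.setdefault(language, Counter()).update(tokens(text))
    return model


def classify(model, text):
    """Pick the language whose training tokens overlap most with text."""
    def score(language):
        counts = model[language]
        return sum(counts[t] for t in tokens(text)) / sum(counts.values())
    return max(model, key=score)


def leave_one_out(samples):
    """Hold out each sample in turn, train on the rest, and record misses."""
    failures = []
    for i, (language, text) in enumerate(samples):
        model = train(samples[:i] + samples[i + 1:])
        guess = classify(model, text)
        if guess != language:
            failures.append((i, language, guess))
    return failures


# Toy stand-ins for sample files.
SAMPLES = [
    ("Bluespec BH", "package A where import B x :: Bit 8"),
    ("Bluespec BH", "package C where import D y :: Bool"),
    ("Bikeshed", "<pre class=metadata> Title: Foo Status: ED </pre>"),
    ("Bikeshed", "<pre class=metadata> Title: Bar Status: LS </pre>"),
]

# A BH sample that shares little vocabulary with the other BH samples.
OUTLIER = ("Bluespec BH", "interface Foo = get :: ActionValue a Title: z")

print(leave_one_out(SAMPLES))
print(leave_one_out(SAMPLES + [OUTLIER]))
```

In this toy setup the outlier is the only sample that fails when held out, which mirrors the guess in the comment: a sample that looks unlike its siblings is hard to recognize from them alone, and adding more samples of the same kind is the natural fix.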
I have pushed an additional sample for FYI, I was able to add this at line 84 of the Ruby script, so that it would only test
With that, it ran quickly, and all of the
Yup.
You found what I'd have recommended 😁
Hopefully two should be enough. 🤞
The cross-validation is now passing, but now the "Run Tests" jobs are failing, because I don't see anything in the
You didn't miss anything. The docs don't cover this scenario as it's quite a rare occurrence that is picked up by the tests.
It's too late for this PR since the sample-related issue is resolved, but I had to do this recently as well when having to fix a classifier issue with namely The syntax to apply the extensions filter is to pass the
@lildude is that something that would be worth merging in
@DecimalTurn whilst it's likely to be rarely used, I think it's a useful addition so feel free to submit a PR.
Description
The Bluespec language is already part of Linguist. It was added in 2013 in PR #739 and then updated in 2015 in PR #1945 to use its own grammar (rather than the SystemVerilog grammar).

However, Bluespec has two syntaxes: BSV (based on SystemVerilog) and BH (based on Haskell). These are just skins on essentially the same language. But they use different filename extensions (.bsv versus .bs) and require different highlighting/detection.

The Bluespec compiler was proprietary from 2000 to 2020 (when it was made open source on GitHub). The BH syntax has existed the whole time, but it was not advertised publicly. Only since 2017-ish, and particularly since the compiler became open source, has BH been used by the wider public. So there is less .bs code on GitHub than .bsv, but it is growing.

Lately, I have seen people writing git attributes like the following:
because they want proper syntax highlighting of BH files on GitHub. I don't like lying to linguist/GitHub in this way, and I would like the files to be reported as Bluespec in repo statistics and searches.
I am also seeing that repos containing Bluespec BH are reported by GitHub as having a high percentage of Bikeshed files -- this was a language added last year and also uses the .bs extension. For these reasons, I think it's important to include Bluespec BH along with Bluespec BSV in Linguist's known languages.

This PR includes two commits, because the way that I wanted to write the entries at first did not work. So the second commit is a compromise. What didn't work is that I tried to make entries for Bluespec BSV and Bluespec BH and declare each as part of the group Bluespec. However, it seems like a group name has to be one of the languages. In this case, neither BSV nor BH is the primary language, so I don't want either one to be the group name. Ultimately, I made it work by keeping Bluespec BSV named just Bluespec and making Bluespec BH a subordinate in the Bluespec group.

That seems to work. However, the github-linguist command-line tool doesn't report a breakdown of Bluespec BH separately. I don't know how to test that it properly detected the syntax highlighting for BH files; but since the extensions differ, I assume it got it right.

For the code samples, I took the two existing Bluespec BSV sample files and transliterated them to the BH syntax. However, with only these two samples, I found that Linguist was still identifying some BH files as Bikeshed. I added a third example, and now things seem to work. The third example is a real-world library that came from the open-source Bluespec compiler repo, which has a liberal (MIT-like) license, as long as the copyright/license is at the top of the file, so you'll see that in a comment at the top of the file -- unless there's a better way to do that? Can the license be in a separate file in the same directory or something?

There is not currently a dedicated grammar tool for BH, so I have indicated to use Haskell where needed. (Haskell highlighting is what people use when editing BH in emacs and vim, for example.)
Below, I have indicated a search for BH files that turns up 1000+ results. I can't see how many repositories they are in, though, to know if it's in the 100s. But I hope that I have justified the need above, and that it makes sense to include BH if BSV is already included.
The checklist mentions adding a heuristic for distinguishing from other languages with the same extension. I have not done that, because I don't know where that would be written. BH files always have a top-level package <name> [([<exports>])] where declaration, which distinguishes them from Bikeshed -- I could add that, if necessary? When I created a search link for BH files in GitHub (in the checklist below), many Bikeshed files turned up containing the common BH keywords, so I instead used a regexp looking for the package declaration, and that worked for filtering out the Bikeshed results. So, if necessary, a regexp like that could be used.

Checklist:
.bs extension (using regexp to find true BH files):
.bsv extension (and filtering for Bluespec language):
CGetPut.bs from the Bluespec Compiler's standard libraries
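
The top-level package <name> [([<exports>])] where declaration mentioned in the description can be matched with a regular expression. The following is a rough sketch of one possible pattern, not the regexp used in the search link or in the final heuristic; the name `BH_PACKAGE` and the sample strings are invented for illustration, and the pattern assumes the export list, when present, opens right after the package name.

```python
import re

# package <name> [(<exports>)] where, anchored at the start of a line.
# DOTALL lets the optional export list span several lines.
BH_PACKAGE = re.compile(
    r"^package\s+[A-Za-z_][\w.']*\s*(\(.*?\))?\s*where\b",
    re.MULTILINE | re.DOTALL,
)


def has_bh_package_declaration(source):
    """Return True if source contains a BH-style package declaration."""
    return BH_PACKAGE.search(source) is not None


with_exports = "package CGetPut(CGet, CPut) where\n"
multi_line = "-- header comment\npackage Foo(\n  a,\n  b\n  ) where\n"
bikeshed_like = "<pre class=metadata>\nTitle: package management\n</pre>\n"

print(has_bh_package_declaration(with_exports))
print(has_bh_package_declaration(multi_line))
print(has_bh_package_declaration(bikeshed_like))
```

Anchoring on the start of a line and requiring the trailing where keeps prose that merely mentions the word "package" from matching, which is the property that made the regexp search useful for filtering out Bikeshed results.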