Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an entry for Bluespec's other syntax #6476

Merged
merged 6 commits into from
Sep 7, 2023

Conversation

quark17
Copy link
Contributor

@quark17 quark17 commented Jul 13, 2023

Description

The Bluespec language is already part of Linguist. It was added in 2013 in PR #739 and then updated in 2015 in PR #1945 to use its own grammar (rather than the SystemVerilog grammar).

However, Bluespec has two syntaxes: BSV (based on SystemVerilog) and BH (based on Haskell). These are just skins on essentially the same language. But they use different filename extensions (.bsv versus .bs) and require different highlighting/detection.

The Bluespec compiler was proprietary from 2000 to 2020 (when it was made open source, github repo here). The BH existed for all of time, but it was not advertised publicly. Only since 2017-ish and particularly since being open source, has BH been used by the wider public. So there is less .bs code on GitHub than .bsv, but it is growing.

Lately, I have seen people writing git attributes like the following:

*.bs linguist-language=haskell

because they want proper syntax highlighting of BH files on GitHub. I don't like lying to linguist/GitHub in this way, and I would like the files to be reported as Bluespec in repo statistics and searches.

I am also seeing that repos containing Bluespec BH are reported by GitHub as having a high percentage of Bikeshed files -- this was a language added last year and also uses the .bs extension. For these reasons, I think it's important to include Bluespec BH along with Bluespec BSV in Linguist's known languages.

This PR includes two commits, because the way that I wanted to write the entries at first did not work. So the second commit a compromise. What didn't work is that I tried to make entries for Bluespec BSV and Bluespec BH and declare each as part of the group Bluespec. However, it seems like a group name has to be one of the languages. In this case, neither BSV nor BH is the primary language, so I don't want either one to be the group name. Ultimately, I made it work by keeping Bluespec BSV named just Bluespec and making Bluespec BH a subordinate in the Bluespec group.

That seems to work. However, the github-linguist command-line tool doesn't report a breakdown of Bluespec BH separately. I don't know how to test that it properly detected the syntax highlighting for BH files; but since the extensions differ, I assume it got it right.

For the code samples, I took the two existing Bluespec BSV samples files and I transliterated them to the BH syntax. However, with only these two samples, I found that Linguist was still identifying some BH files as Bikeshed. I added a third example, and now things seem to work. The third example is a real world library that came from the open-source Bluespec compiler repo, which has a liberal (MIT-like) license, as long as the copyright/license is at the top of the file, so you'll see that in a comment at the top of the file -- unless there's a better way to do that? Can the license be in a separate file in the same directory or something?

There is not currently a dedicated grammar tool for BH, so I have indicated to use Haskell where needed. (Haskell highlighting is what people use when editing BH in emacs and vim, for example.)

Below, I have indicated a search for BH files that turns up 1000+. I can't see how many repositorites it's in, though, to know if it's in 100s. But I hope that I have justified the need above, and that it makes sense to include if BSV is already included.

The checklist mentions adding a heuristic for distinguishing from other languages with the same extension. I have not done that, because I don't know where that would be written. BH files always have a top-level package <name> [([<exports>])] where declaration, which distinguishes it from Bikeshed -- I could add that, if necessary? When I created a search link for BH files in github (in the checklist below), many Bikeshed files turned up containing the common BH keywords, so I instead used a regexp looking for the package declaration, and that worked for filtering out the Bikeshed results. So, if necessary, a regexp like that could be used.

Checklist:

@quark17 quark17 requested a review from a team as a code owner July 13, 2023 04:02
@DecimalTurn
Copy link
Contributor

The checklist mentions adding a heuristic for distinguishing from other languages with the same extension. I have not done that, because I don't know where that would be written.

For the heuristics, you can add them here:

- extensions: ['.bs']
rules:
- language: Bikeshed
pattern: '^(?i:<pre\s+class)\s*=\s*(''|\"|\b)metadata\b\1[^>\r\n]*>'
- language: BrighterScript
pattern:
- (?i:^\s*(?=^sub\s)(?:sub\s*\w+\(.*?\))|(?::\s*sub\(.*?\))$)
- (?i:^\s*(end\ssub)$)
- (?i:^\s*(?=^function\s)(?:function\s*\w+\(.*?\)\s*as\s*\w*)|(?::\s*function\(.*?\)\s*as\s*\w*)$)
- (?i:^\s*(end\sfunction)$)

I could add that, if necessary?

When there's more than 2 languages with the same extension, having a good heuristic in place isn't luxury, it's close to a necessity. Otherwise you end up depending on the bayesian classifier which isn't that reliable.

ace_mode: haskell
codemirror_mode: haskell
codemirror_mime_type: text/x-haskell
language_id: 626c75650
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you run the script to generate the language_id? Unless the ids were in hexadecimal all along and I didn't know, there shouldn't be a "c" in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm returning to these changes after a long pause, so I don't recall. I'll run the script to generate a new ID and push another commit for it. Thanks!

@quark17
Copy link
Contributor Author

quark17 commented Jul 13, 2023

Thank you. I have pushed two commits to address the two concerns that you raised: one to regenerate the ID and one to add Bluespec BH to the heuristics for .bs files.

I added a pattern for a construct that all BH files must have. I put it last in the heuristic: I don't think it should match files in the other languages, but I can't be 100% sure; whereas, the existing patterns for those languages are unlikely for BH files, so it's OK for BH's pattern to be last. I tested that the new heuristic is being used, by removing the third sample (which was added to help the bayesian classifier) and saw that Linguist still correctly identified the files in my repo (which it hadn't done with just two samples and no heuristic).

@DecimalTurn
Copy link
Contributor

Looks good to me. I'll let a maintainer/collaborator take it from here.

@quark17
Copy link
Contributor Author

quark17 commented Jul 21, 2023

I see that the "Classifier cross-validation" job failed with this new line:

Bluespec BH/CGetPut.bs BAD (Bikeshed)

If I understand correctly, this is testing whether each sample can be recognized by a classifier trained only on the other samples. I would guess that this example differs significantly from the other two BH examples. I am able to run the test locally, so I'll see if adding more samples helps. There are plenty more BH libraries in the bsc repo and a couple BH examples in the bsc-contrib repo; so I can easily add more.

It would help if I could run the test script on only the .bs files (or only the Bluespec BH directory) -- any advice on how to do that? I don't know Ruby, but I see that there's a section of the eval function that returns nil for extensions that don't appear for multiple languages; I'll see if I can return nil if the sample[:extname] isn't equal to .bs or something like that? Alternatively, the eval function is being applied to all the elements of $samples, so is there an easy way to populate that with just one directory?

@quark17
Copy link
Contributor Author

quark17 commented Jul 21, 2023

I have pushed an additional sample for Bluespec BH. This file is COBS.bs from the bsc-contrib repo. The repo uses the BSD-3-Clause license as indicated at the end of the README, unless otherwise specified by the contributed files, and this file does not. I have added two comment lines at the top of the file, indicating the source repo and the short identifier for the license -- I didn't want to lead the classifier astray with large license text at the start of each file.

FYI, I was able to add this at line 84 of the Ruby script, so that it would only test .bs files:

return nil if sample[:extname] != ".bs"

With that, it ran quickly, and all of the .bs files reported GOOD. I'm happy to add more BH samples, though, if you think it's necessary. I see that BikeShed only has two and BrighterScript has three.

Bikeshed/example.bs GOOD
Bikeshed/example2.bs GOOD
Bluespec BH/CGetPut.bs GOOD
Bluespec BH/COBS.bs GOOD
Bluespec BH/TL.bs GOOD
Bluespec BH/TbTL.bs GOOD
BrighterScript/Main.bs GOOD
BrighterScript/RowListExample.bs GOOD
BrighterScript/SimpleGrid.bs GOOD

@lildude
Copy link
Member

lildude commented Jul 21, 2023

If I understand correctly, this is testing whether each sample can be recognized by a classifier trained only on the other samples.

Yup.

It would help if I could run the test script on only the .bs files (or only the Bluespec BH directory) -- any advice on how to do that?

You found what I'd have recommended 😁

I'm happy to add more BH samples, though, if you think it's necessary. I see that BikeShed only has two and BrighterScript has three.

Hopefully two should be enough. 🤞

@quark17
Copy link
Contributor Author

quark17 commented Jul 21, 2023

The cross-validation is now passing, but now the "Run Tests" jobs are failing, because vendor/README.md wasn't updated with an entry for Bluespec BH. The failure message says to run scripts/list-grammars, so I did that and I see that it added a new line to the README -- so I've committed and pushed that. Hopefully the tests will pass now.

I don't see anything in the CONTRIBUTING.md instructions about running that script. It mentions script/add-grammar, but I wasn't adding a new grammar, just reusing an existing grammar from another language. Did I miss something, or should the instructions be updated to say what to do in that case?

@lildude
Copy link
Member

lildude commented Jul 21, 2023

I don't see anything in the CONTRIBUTING.md instructions about running that script. It mentions script/add-grammar, but I wasn't adding a new grammar, just reusing an existing grammar from another language. Did I miss something, or should the instructions be updated to say what to do in that case?

You didn't miss anything. The docs don't cover this scenario as it's quite a rare occurrence that is picked up by the tests.

@DecimalTurn
Copy link
Contributor

It would help if I could run the test script on only the .bs files (or only the Bluespec BH directory) -- any advice on how to do that?

It's too late for this PR since the sample-related issue is resolved, but I had to do this recently as well when having to fix a classifier issue with namely .bas and .cls files. The branch with that feature is this here https://github.com/DecimalTurn/linguist/tree/cross-validation-filter

The syntax to apply the extensions filter is to pass the --extensions option with extensions (including the dot) seperated by a comma.

bundle exec script/cross-validation --extensions=.bas,.cls
or
bundle exec script/cross-validation --extensions=".bas, .cls"

@lildude is that something that would be worth merging in master?

@lildude
Copy link
Member

lildude commented Jul 22, 2023

@DecimalTurn whilst it's likely to be rarely used, I think it's a useful addition so feel free to submit a PR.

@lildude lildude added this pull request to the merge queue Sep 7, 2023
Merged via the queue into github-linguist:master with commit e302011 Sep 7, 2023
5 checks passed
@quark17 quark17 deleted the update-bluespec branch December 20, 2023 01:10
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants