Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added HAProxy configuration language #4303

Merged
merged 5 commits into from Nov 7, 2018

Conversation

NickMRamirez
Copy link
Contributor

@NickMRamirez NickMRamirez commented Oct 23, 2018

Adding a new language for HAProxy configuration files, e.g. haproxy.cfg. The ".cfg" file extension is already used by the INI language, but I have added samples to both sample folders to differentiate them.

Checklist:

Copy link
Contributor

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

I've left a few comments inline.

lib/linguist/languages.yml Show resolved Hide resolved
lib/linguist/languages.yml Outdated Show resolved Hide resolved
samples/HAProxy/haproxy.cfg Show resolved Hide resolved
@pchaigno
Copy link
Contributor

Here is an overview of how syntax highlighted files will appear on GitHub.com with the new grammar: Lightshow link.

@pchaigno
Copy link
Contributor

@NickMRamirez Could you add one or two other sample files? I'm still seeing a lot of misclassifications by the Bayesian classifier between HAProxy and INI files. I'm expecting that it should be able to do a better job with more training data since HAProxy files contain a set of easily identifiable keywords. Files of the same size as the one you created, or a bit larger should be fine; try to avoid huge files, as the Bayesian then keeps all tokens in memory.

@NickMRamirez
Copy link
Contributor Author

Added more sample files. Let me know if I should add more.

@pchaigno
Copy link
Contributor

pchaigno commented Nov 5, 2018

I've downloaded 2347 files from your search query and ran Linguist on it. As far as I can see, it detects all HAProxy files perfectly. I'd say we're good to go!

@NickMRamirez
Copy link
Contributor Author

Great to hear! Thanks for your work on this!

@pchaigno pchaigno requested a review from lildude November 6, 2018 14:16
@lildude
Copy link
Member

lildude commented Nov 7, 2018

@pchaigno regarding:

I'm still seeing a lot of misclassifications by the Bayesian classifier between HAProxy and INI files. I'm expecting that it should be able to do a better job with more training data since HAProxy files contain a set of easily identifiable keywords.

Shouldn't we be using a heuristic first? This PR doesn't have a heuristic but introduces a duplicate extension.

@pchaigno
Copy link
Contributor

pchaigno commented Nov 7, 2018

@lildude Adding more samples was enough to improve the Bayesian classifier's accuracy significantly. We went from 1 to 4 samples and now don't have miss-classifications anymore (in the 2347 files corpus I downloaded). In the past, we've kept heuristic rules for cases where the Bayesian classifier is unable to identify recurring keywords. Here, we have a set of recurring keywords (frontend, backend, server, listen, etc.) which the classifier can easily identify as long as we have enough samples.

@lildude
Copy link
Member

lildude commented Nov 7, 2018

@pchaigno 👍 Thanks for the explanation and validation work. 🙇

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Welcome to Linguist and thanks for your contribution.

@lildude lildude merged commit 5371f7b into github-linguist:master Nov 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants