Some .for files are text. #2123
There isn't currently a heuristic rule to distinguish Text from FORTRAN and Forth. Because of #2042 if we add |
@pchaigno Sorry, I added Text to the heuristic rule for I'll update. Now, the heuristic rule will only return a result for fairly obvious matches of Forth or FORTRAN. For ambiguous cases, it will fall back to the Bayesian classifier, and the Text sample will catch the "weekly" file. Do you still object?
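For readers following along, the behaviour described in this comment (answer only on obvious matches, otherwise defer to the classifier) could be sketched roughly as below. This is an illustration only, not Linguist's actual code: the two patterns, the method names `obvious_language` and `detect`, and the stand-in classifier are all assumptions made up for this sketch.

```ruby
# Hypothetical sketch; the patterns and method names are assumptions,
# not Linguist's real heuristics or API.

# A Forth colon definition at the start of a line, e.g. ": square dup * ;"
FORTH_PATTERN = /^: \S+/

# An indented PROGRAM / SUBROUTINE / END statement, as in fixed-form FORTRAN
FORTRAN_PATTERN = /^[ \t]+(program|subroutine|end)\b/i

# Returns "Forth" or "FORTRAN" only when exactly one pattern matches;
# returns nil for anything ambiguous so the caller can fall back.
def obvious_language(data)
  forth   = data.match?(FORTH_PATTERN)
  fortran = data.match?(FORTRAN_PATTERN)
  return "Forth"   if forth && !fortran
  return "FORTRAN" if fortran && !forth
  nil
end

# Uses the heuristic when it gives an answer; otherwise defers to a
# caller-supplied classifier (any object responding to #call).
def detect(data, classifier)
  obvious_language(data) || classifier.call(data)
end
```

With a stand-in classifier such as `->(_data) { "Text" }`, a data-only file that matches neither pattern would be passed through to the classifier instead of being forced into Forth or FORTRAN.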
I'm not objecting, I just think we should be careful :-)
I don't think that can happen (at this moment), because the heuristics never return Text. But you are right that we should be careful. Unless this PR is shot down, I'll request 1000 samples and check the results before and after.
Oh, you're right; I misread you :/
I'm all for checking thoroughly! :-) Just awaiting the words of the elders. Also, these files seem to come in at a rate of a few a week, and they cause the entire repo to be misclassified as Forth. They are starting to overwhelm true Forth repositories.
I'm curious what these |
Me too. I believe the original file came from here: http://www.cpc.ncep.noaa.gov/data/indices/ I emailed the contact at that page, but haven't got any reply yet. There are a few other |
Tested with 1000 Tested with 1000
I checked that those Forth files were indeed misidentified. |
Received word from the original author:
|
According to http://www.weather.gov/disclaimer, the sample file is in the public domain,
|
Another new repository with plenty of Before my proposed change, it's classified as 48% Forth and 33% Python. After, 62% Python and no Forth. |
The Travis CI build is erroring now, because of
I'm not sure there's anything I can do about that? |
@aroben - I had a look at fixing this earlier and didn't find an obvious solution. It looks like we need an Perhaps we should add the |
I can't wait until we come up with a way to return "unknown" for files that don't have a high score from the classifier. I don't love that we are classifying this as |
Me too! Oh oh, and also a way to search for files and repos without an assigned language.
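The "unknown" idea raised above could look roughly like the following. It is a minimal sketch under stated assumptions: `best_or_unknown`, the shape of the `scores` hash, and the 0.8 threshold are all hypothetical and do not come from Linguist.

```ruby
# Hypothetical helper, not part of Linguist.
# scores: Hash mapping a language name to a probability-like score.
# threshold: assumed cut-off below which no language is reported.
def best_or_unknown(scores, threshold = 0.8)
  language, score = scores.max_by { |_name, value| value }
  # An empty hash yields nil here, which also counts as "unknown".
  return "unknown" if score.nil? || score < threshold
  language
end
```

With made-up scores, `best_or_unknown({ "Forth" => 0.5, "Python" => 0.3 })` falls below the threshold and reports "unknown", whereas a clear winner such as `{ "Python" => 0.95 }` is returned as-is.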
Agreed. From looking at various But anyway, I was thinking maybe something like this:
|
This looks good to me. @bkeepers do you think we should add |
Looks good to me 👍
No, it doesn't seem like this should be text to me. It's data, right?
Data encoded in plain text :-\
Seriously though, I don't care that much. I could be convinced either way.
Sample file wksst8110.for is from the Climate Prediction Center at the National Weather Service of the USA, and is in the public domain.
I merged the two commits into one. Leaving out |
@larsbrinkhoff - probably won't cut a new gem until early next week.
That's fine. Thanks!
As of late, many .for files have been created which are neither FORTRAN nor Forth: https://github.com/search?q=extension%3Afor+weekly&type=Code
The contents of these files are identical. I added one as a Text sample.