Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some .for files are text. #2123

Merged
merged 1 commit into from
Mar 17, 2015
Merged

Some .for files are text. #2123

merged 1 commit into from
Mar 17, 2015

Conversation

larsbrinkhoff
Copy link
Contributor

As of late, many .for files have been created which are neither FORTRAN nor Forth:
https://github.com/search?q=extension%3Afor+weekly&type=Code

The contents of these files are identical. I added it as a Textsample.

@pchaigno
Copy link
Contributor

There isn't currently a heuristic rule to distinguish Text from FORTRAN and Forth. Because of #2042 if we add .for for Text the heuristic for FORTRAN and Forth won't be used anymore. It seems risky for only 107 files :/

@larsbrinkhoff
Copy link
Contributor Author

@pchaigno Sorry, I added Text to the heuristic rule for .for, but lost it before I made this pull request. So good catch.

I'll update. Now, the heuristic rule will only return a result for fairly obvious matches of Forth or FORTRAN. For ambiguous cases, it will fall back to the bayesian classifier. And the Text sample will catch the "weekly" file.

Do you still object?

@pchaigno
Copy link
Contributor

I'm not objecting, I just think we should be careful :-)
And we probably should fix (or at least clarify) the situation with #2042 before. Currently, if you add Text to the Forth and FORTRAN heuristic, .F files may be recognized as Text :/

@larsbrinkhoff
Copy link
Contributor Author

.F files may be recognized as Text :/

I don't think that can happen (at this moment), because the heuristics never return Text.

But you are right that we should be careful. Unless this PR is shot down, I'll request 1000 samples and check the results before and after.

@pchaigno
Copy link
Contributor

I don't think that can happen (at this moment), because the heuristics never return Text.

Oh, you're right; I misread you :/
If it works with the classifier, perfect! You'll have to test that the new Text sample doesn't break other Text detections though.

@larsbrinkhoff
Copy link
Contributor Author

I'm all for checking thouroughly! :-) Just awaiting the words of the elders.

Also, these files seem to come in at a rate of a few a week, and they cause the entire repo to be misclassified as Forth. They are starting to overwhelm true Forth repositories.

@bkeepers
Copy link
Contributor

I'm curious what these .for files actually are. It looks like some data format. I wonder if we should be adding another definition for it.

@larsbrinkhoff
Copy link
Contributor Author

Me too. I believe the original file came from here: http://www.cpc.ncep.noaa.gov/data/indices/

I emailed the contact at that page, but haven't got any reply yet.

There are a few other .for files there. As far as I can see, they are columnar data separated by whitespace, and possibly including title and column headers. My best guess is that it's "formatted data".

@larsbrinkhoff
Copy link
Contributor Author

Tested with 1000 Text samples: no change in classification.

Tested with 1000 .forsamples:

Before After
FORTRAN 577 577
Text 0 415
Forth 423 8

I checked that those Forth files were indeed misidentified.

@larsbrinkhoff
Copy link
Contributor Author

Received word from the original author:

".for" is short for "formatted".

@larsbrinkhoff
Copy link
Contributor Author

According to http://www.weather.gov/disclaimer, the sample file is in the public domain,

as long as you do not: 1) claim it is your own (e.g., by claiming copyright for NWS information -- see below), 2) use it in a manner that implies an endorsement or affiliation with NOAA/NWS, or 3) modify its content and then present it as official government material. You also cannot present information of your own in a way that makes it appear to be official government information.

@larsbrinkhoff
Copy link
Contributor Author

Another new repository with plenty of .for files:
https://github.com/OpenSceneryX/Library/search?q=extension%3Afor

Before my proposed change, it's classified as 48% Forth and 33% Python. After, 62% Python and no Forth.

@larsbrinkhoff
Copy link
Contributor Author

Ping @bkeepers @arfon. Any thoughts?

@arfon
Copy link
Contributor

arfon commented Mar 2, 2015

Ping @bkeepers @arfon. Any thoughts?

I think we should probably merge this as it's a growing issue (178 files now). I'd really like to get #2042 resolved soon too.

@larsbrinkhoff
Copy link
Contributor Author

The Travis CI build is erroring now, because of

apt-get --assume-yes download libicu48 libicu-dev
Err Downloading libicu48 4.8.1.1-3ubuntu0.1
404 Not Found [IP: 91.189.91.15 80]
Err Downloading libicu-dev 4.8.1.1-3ubuntu0.1
404 Not Found [IP: 91.189.91.15 80]
ar p vendor/apt/.deb data.tar.gz
tar -vzxC vendor/debs --strip-components=2
ar: vendor/apt/
.deb: No such file or directory

I'm not sure there's anything I can do about that?

@arfon
Copy link
Contributor

arfon commented Mar 7, 2015

The Travis CI build is erroring now, because of

@aroben - I had a look at fixing this earlier and didn't find an obvious solution. It looks like we need an apt-get update before the install but we can't do the recommended as we can't use sudo on container builds.

Perhaps we should add the libicu48 deb to a local directory and install from there?

@bkeepers
Copy link
Contributor

I can't wait until we come up with a way to return "unknown" for files that don't have a high score from the classifier.

I don't love that we are classifying this as Text. Should we add a Formatted entry? Is there a spec for the file format anywhere?

@larsbrinkhoff
Copy link
Contributor Author

I can't wait until we come up with a way to return "unknown"

Me too! Oh oh, and also a way to search for files and repos without an assigned language.

I don't love that we are classifying this as Text. Should we add a Formatted entry? Is there a spec for the file format anywhere?

Agreed. From looking at various .for files, I'm quite sure there is no spec. Or maybe the spec would be "numbers formatted into whatever format looks pretty, plus some text that looks nice".

But anyway, I was thinking maybe something like this:

Formatted Data:
  type: data
  group: Text # maybe?
  extensions:
  - .for
  # maybe .f too, sigh

@larsbrinkhoff
Copy link
Contributor Author

@arfon @bkeepers Added a commit to introduce a new Formatted language. Does this look ok?

@arfon
Copy link
Contributor

arfon commented Mar 17, 2015

This looks good to me. @bkeepers do you think we should add group: Text?

@bkeepers
Copy link
Contributor

Looks good to me 👍

do you think we should add group: Text?

No, it doesn't seem like this should be text to me. It's data, right?

@arfon
Copy link
Contributor

arfon commented Mar 17, 2015

It's data, right?

Data encoded in plain text :-\

@arfon
Copy link
Contributor

arfon commented Mar 17, 2015

Seriously though, I don't care that much. I could be convinced either way.

Sample file wksst8110.for is from the Climate Prediction Center at the
National Weather Service of the USA, and is in the public domain.
@larsbrinkhoff
Copy link
Contributor Author

I merged the two commits into one. Leaving out group: Text.

arfon added a commit that referenced this pull request Mar 17, 2015
@arfon arfon merged commit b5472ab into github-linguist:master Mar 17, 2015
@arfon
Copy link
Contributor

arfon commented Mar 17, 2015

@larsbrinkhoff - probably won't cut a new gem until early next week.

@larsbrinkhoff
Copy link
Contributor Author

That's fine. Thanks!

@arfon arfon mentioned this pull request Mar 18, 2015
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants