.inc file extension #2180

Merged
merged 11 commits into from Jul 8, 2015


pchaigno
Contributor

@pchaigno pchaigno commented Jul 5, 2015

There are more than 10 million .inc files on GitHub, and the extension is used by many languages. Thus, to add that extension, we need to identify all languages using it (with at least hundreds of examples each).

I will update this post with the list of languages as we go. I might eventually turn this issue into a pull request to add support of the extension.

Language | Number of files
PHP | 9,241,000
Pascal | 206,000
Assembly | at least 69,000
MySQL | 49,000
C++ | 8,000
HTML | 5,000*
SourcePawn | 2,300*
Clarion | at least 165*

* I checked these by downloading 1,000 samples from the search results and using simple heuristics to triage them.

@pchaigno changed the title from "inc file extension" to ".inc file extension" on Feb 28, 2015
@xPaw
Contributor

xPaw commented Mar 10, 2015

SourcePawn also uses .inc files.

They're most likely to be in an include folder relative to a .sp or .sma file.

@pchaigno
Contributor Author

@xPaw How common would you say the native and forward keywords are in SourcePawn? (I'm trying to identify keywords I could use to find SourcePawn files.)

@xPaw
Contributor

xPaw commented Mar 10, 2015

They're very common, but not guaranteed to be in every file. Maybe @dvander can say more.

@larsbrinkhoff
Contributor

By the way, the number of .inc files has grown from 8M to 9.6M in 24 days.

@larsbrinkhoff
Contributor

Let's not forget the samples in #1268

@dvander

dvander commented Mar 25, 2015

One of "native", "forward", "stock", or "public" would be in almost any SourcePawn .inc file.
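For anyone triaging samples by hand, the keyword hint above can be turned into a quick check. This is only a minimal sketch: the function name `looks_like_sourcepawn` and the choice to match keywords at the start of a line are assumptions made for illustration, not how Linguist classifies files.

```python
import re

# Keywords named in the comment above. Matching them only at the start of a
# line is an assumption of this sketch, not a rule taken from Linguist.
SOURCEPAWN_KEYWORDS = ("native", "forward", "stock", "public")
_DECLARATION = re.compile(
    r"^\s*(?:%s)\b" % "|".join(SOURCEPAWN_KEYWORDS), re.MULTILINE
)


def looks_like_sourcepawn(text: str) -> bool:
    """Return True if any line begins with one of the declaration keywords."""
    return _DECLARATION.search(text) is not None
```

Note that `public` also starts lines in PHP and C++ code, so a check like this is only useful for narrowing down candidates, not for a final decision.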

@larsbrinkhoff
Contributor

This search will find more assembly files: extension:inc NOT php mov OR jmp OR lda OR globl.
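The same filter can be applied locally to files that have already been downloaded. The sketch below loosely mirrors the query, reading it as "does not contain php, and contains at least one of the four tokens"; that reading of the operator precedence, and the function name `matches_assembly_query`, are assumptions.

```python
import re

# Tokens taken from the search query above.
ASM_TOKENS = {"mov", "jmp", "lda", "globl"}


def matches_assembly_query(text: str) -> bool:
    """Approximate the query: no 'php' word, and at least one assembly token."""
    # Lower-case the text and split it into alphabetic words, so that
    # '.globl' yields 'globl' and '<?php' yields 'php'.
    words = set(re.findall(r"[a-z_]+", text.lower()))
    return "php" not in words and bool(words & ASM_TOKENS)
```

For example, `matches_assembly_query("    mov eax, 1\n    jmp done")` is true, while any text containing the word `php` is rejected regardless of the other tokens.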

@pchaigno
Contributor Author

> This search will find more assembly files: extension:inc NOT php mov OR jmp OR lda OR globl.

@larsbrinkhoff Thanks!

@pchaigno
Contributor Author

pchaigno commented Jul 5, 2015

So, I finally turned this into a pull request and here are the results:

PHP SourcePawn Assembly C++ Pascal HTML SQL
862 PHP samples 98.79% 0.89% 0.32%
952 SourcePawn samples 0.05% 98.89% 1.06%
898 Assembly samples 0.29% 99.24% 0.27% 0.20%
787 C++ samples 0.26% 99.74%
808 Pascal samples 0.19% 0.16% 1.20% 98.46%
897 HTML samples 0.56% 1.30% 0.64% 1.81% 0.42% 95.27%
938 SQL samples 0.26% 99.74%

It looks to me like heuristic rules won't be needed for .inc :-)

@arfon This is ready for review ;)

@arfon
Contributor

arfon commented Jul 7, 2015

Epic effort @pchaigno - this is awesome ⚡

Given the coverage here of languages (and the classifier accuracy) I think this should be good to go. Would love a 👍 from @bkeepers here too for completeness.

@bkeepers
Contributor

bkeepers commented Jul 7, 2015

Wow, I'm impressed the results are that accurate. I'm 👍.

This actually got me thinking that it would be interesting to add an extended test suite that runs against these other corpuses and ensures they match a certain threshold.
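A rough sketch of what such a threshold check could look like is below. It assumes the classifier output is available as (expected language, predicted language) pairs; the function names and the 0.95 default threshold are hypothetical and not part of Linguist.

```python
from collections import Counter


def per_language_accuracy(results):
    """Compute accuracy per expected language.

    `results` is an iterable of (expected_language, predicted_language) pairs.
    """
    totals = Counter()
    correct = Counter()
    for expected, predicted in results:
        totals[expected] += 1
        if expected == predicted:
            correct[expected] += 1
    return {lang: correct[lang] / totals[lang] for lang in totals}


def languages_below_threshold(results, threshold=0.95):
    """Return the languages whose accuracy falls below `threshold` (hypothetical default)."""
    accuracy = per_language_accuracy(results)
    return sorted(lang for lang, acc in accuracy.items() if acc < threshold)
```

An extended test would then fail whenever `languages_below_threshold(results)` is non-empty for a given corpus.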

@pchaigno
Contributor Author

pchaigno commented Jul 7, 2015

> Wow, I'm impressed the results are that accurate.

Yep, I am too. I added more samples than usual, so that's one way to explain it, but I'm afraid we could be overfitting. That said, only C++/SourcePawn/Pascal really needed more samples. C++ vs. SourcePawn needed more samples because only a handful of keywords are different. If there's overfitting somewhere, I'd expect it to be for Pascal...

> This actually got me thinking that it would be interesting to add an extended test suite that runs against these other corpuses and ensures they match a certain threshold.

Sounds like a good idea to me. That would be in a separate PR right? Should we open an issue to track progress on that idea?

@bkeepers
Contributor

bkeepers commented Jul 7, 2015

> Sounds like a good idea to me. That would be in a separate PR right? Should we open an issue to track progress on that idea?

👍

arfon added a commit that referenced this pull request Jul 8, 2015
@arfon merged commit 79a428a into github-linguist:master on Jul 8, 2015
@arfon
Contributor

arfon commented Jul 8, 2015

👍 thanks @pchaigno.

@cperciva

cperciva commented Aug 5, 2015

Please add Makefile to the set of languages which can have a .inc suffix. For that matter, a file named "Makefile.inc" should probably be automatically classified as "Makefile" without even considering other options.

Right now GitHub is telling me that Tarsnap/spiped's Makefile.inc is a SourcePawn file! https://github.com/Tarsnap/spiped/search?l=sourcepawn
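The "filename first" behaviour requested here could look roughly like the sketch below, where an exact filename match decides the language before the extension is even considered. The `FILENAME_RULES` table and the function name are hypothetical; the `.inc` candidate list is taken from the results table earlier in this thread.

```python
import os

# Hypothetical table: exact filenames that decide the language on their own.
FILENAME_RULES = {"Makefile.inc": "Makefile"}

# Languages sharing the .inc extension, per the results table in this thread.
INC_CANDIDATES = ["PHP", "SourcePawn", "Assembly", "C++", "Pascal", "HTML", "SQL"]


def candidate_languages(path):
    """Return candidate languages, giving exact filenames priority over extensions."""
    name = os.path.basename(path)
    if name in FILENAME_RULES:
        # An exact filename match short-circuits the extension lookup.
        return [FILENAME_RULES[name]]
    if name.endswith(".inc"):
        return list(INC_CANDIDATES)
    return []
```

With this ordering, `candidate_languages("lib/Makefile.inc")` yields only `Makefile`, so the content classifier never gets the chance to pick SourcePawn.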
