Detect language of file based on shebang #45

adhikasp · 2016-11-19T04:07:32Z

In some projects (mainly those that specifically target the Linux platform), files without an extension are common. They mostly use a shebang (#!/bin/sh or similar) to specify the language/interpreter used.

AFAIK, coala-quickstart only detects the language used based on the file extension, so it fails to detect extensionless files. Also, in the .coafile the file paths need to be stated explicitly (e.g. files = **.sh, image/base/*).
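
To illustrate the gap, here is a minimal sketch of extension-only detection. `EXTS`, `language_by_extension`, and both file paths are hypothetical stand-ins, not coala-quickstart's actual table or API:

```python
import os

# Hypothetical stand-in for the extension-to-language table; the real
# mapping used by coala-quickstart may differ.
EXTS = {".sh": ["Shell"], ".py": ["Python"]}


def language_by_extension(path):
    """Return the languages registered for the file's extension, if any."""
    _, ext = os.path.splitext(path)
    return EXTS.get(ext, [])


# A file with an extension is recognised ...
print(language_by_extension("scripts/build.sh"))
# ... but an extensionless script yields an empty list, even if its
# first line is "#!/bin/sh".
print(language_by_extension("image/base/some-script"))
```

`os.path.splitext` returns an empty extension for the second path, so a lookup keyed on the extension has nothing to match.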

Project example
https://github.com/discourse/discourse_docker, see this folder


hemangsk · 2016-12-09T18:46:22Z

Hey! can I take this up?

adtac · 2016-12-10T06:16:14Z

@hemangsk sure, but please describe how you'd do this before you write the code in case any of us have suggestions/modifications - it's much easier for both sides! :)

Assigning you 👍

hemangsk · 2016-12-13T16:53:50Z

@adtac Thanks!
I came up with this solution: in coala-quickstart > generation > Utilities.py, the functions get_extension() and split_by_language() have the similar task of separating the given files by language and extension. So inside the loop that iterates through the list of project_files, we can add calls to new utility functions get_language_from_hashbang() and get_extension_from_hashbang().
These will read the first line of an extension-less file and check whether it starts with '#!', confirming it's a hashbang. From that we can obtain the language used in the file, and hence the extension, from the exts dictionary or via the pygments approach [https://github.com/coala/coala/pull/3162].

For example, if the first line is:

first_line = '#!/bin/bash'
lang = first_line[7:]  # 'bash'
ext = exts[lang]

Would this be the right approach, and can I start working on it?

jayvdb · 2016-12-13T17:26:14Z

Sounds good. get_language_from_hashbang will be the interesting/challenging part. Would be good if you can describe how you will do that.

adtac · 2016-12-20T06:07:13Z

One more thing to look into is #!/usr/bin/env python - that should have the same effect as a #!/usr/bin/python shebang ;)
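
A rough sketch of how both shebang forms could be reduced to the same interpreter name. The function name `interpreter_from_shebang` is hypothetical, and the handling of `env` options is an assumption made for illustration rather than something decided in this thread:

```python
import posixpath


def interpreter_from_shebang(first_line):
    """Return the interpreter name from a shebang line, or None."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    name = posixpath.basename(parts[0])
    if name == "env":
        # Assumption: skip env options (e.g. -S) and VAR=value
        # assignments, then take the first remaining word.
        rest = [p for p in parts[1:]
                if not p.startswith("-") and "=" not in p]
        if not rest:
            return None
        name = posixpath.basename(rest[0])
    return name


print(interpreter_from_shebang("#!/usr/bin/env python\n"))
print(interpreter_from_shebang("#!/usr/bin/python\n"))
```

Both calls return the same name. Mapping an interpreter name such as `python3` or `sh` to a language would still need a separate lookup.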

hemangsk · 2017-01-04T17:55:04Z

Sorry for the delay! Here's the approach I've come up with for get_language_from_hashbang(),
in coala-quickstart > generation > Utilities.py:

import os
import re
from collections import defaultdict

# `exts` is the extension-to-languages dictionary already used in Utilities.py.


def split_by_language(project_files):
    lang_files = defaultdict(lambda: set())
    for file in project_files:
        name, ext = os.path.splitext(file)
        if ext in exts:
            for lang in exts[ext]:
                lang_files[lang.lower()].add(file)
                lang_files["all"].add(file)

        # Check for hashbang
        elif name and not ext:
            # Stays None when the first line is not a hashbang.
            language = None
            with open(file, 'r') as data:
                hashbang = data.readline()
                if re.match(r'(^#![(a-z)|\/]*[ ][a-z]*)|(^#![(a-z)|\/]*)', hashbang):
                    language = get_language_from_hashbang(hashbang)
                try:
                    for ext in exts:
                        for lang in exts[ext]:
                            # Compare case-insensitively: hashbangs are lowercase.
                            if language == lang.lower():
                                lang_files[lang.lower()].add(file)
                                lang_files["all"].add(file)
                except KeyError:
                    pass  # Handling error
            data.close()
    return lang_files

And get_language_from_hashbang(hashbang)

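def get_language_from_hashbang(hashbang):
    language = None
    # Drop the trailing newline left by readline().
    hashbang = hashbang.strip()
    # "#!/usr/bin/env python": the language is the word after the space.
    if re.match(r'(^#![(a-z)|\/]*[ ][a-z]*)', hashbang):
        language = hashbang.split(' ')[1]
    # "#!/bin/bash": the language is the last path component.
    elif re.match(r'(^#![(a-z)|\/]*)', hashbang):
        language = hashbang.split('/')[-1]
    return language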

The shortcomings of this approach that I've found so far and am working on are:

The regex can be improved (using backtracking?)
The nested for loop in the try block is not time-efficient
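
One possible way around the nested loop, sketched under the assumption that `exts` maps each extension to a list of language names (as the loop above suggests), is to invert the mapping once and then look languages up directly. `EXTS`, `build_language_index`, and `LANGUAGE_INDEX` are hypothetical names used only for this illustration:

```python
from collections import defaultdict

# Hypothetical stand-in for the `exts` dictionary mentioned above.
EXTS = {".py": ["Python"], ".sh": ["Shell"], ".bash": ["Shell"]}


def build_language_index(exts):
    """Map each lowercased language name to the extensions that use it."""
    index = defaultdict(set)
    for ext, languages in exts.items():
        for lang in languages:
            index[lang.lower()].add(ext)
    return index


# Built once; each later check is a single dictionary lookup.
LANGUAGE_INDEX = build_language_index(EXTS)

print("python" in LANGUAGE_INDEX)
print(sorted(LANGUAGE_INDEX["shell"]))
```

With such an index, checking whether a detected language is known no longer requires walking every extension for every file.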

jayvdb · 2017-01-06T12:20:21Z

The get_language_from_hashbang return value can be memoized.
But performance is not a consideration, as this is typically run once per project lifetime.
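
For reference, memoizing could look roughly like the sketch below, which uses the standard library's `functools.lru_cache`. The function name and its deliberately simplified parsing (no `env` handling) are assumptions for illustration, not the code from this thread:

```python
import functools


@functools.lru_cache(maxsize=None)
def language_from_hashbang_cached(hashbang):
    """Return the last path component of the interpreter, or None.

    Results are cached per distinct hashbang string, so files sharing
    the same first line are parsed only once.
    """
    if not hashbang.startswith("#!"):
        return None
    tokens = hashbang[2:].split()
    if not tokens:
        return None
    return tokens[0].rsplit("/", 1)[-1]


language_from_hashbang_cached("#!/bin/bash\n")
language_from_hashbang_cached("#!/bin/bash\n")  # served from the cache
print(language_from_hashbang_cached.cache_info())
```

Since many scripts in a project share the same first line, the cache stays small. As noted above, the saving is unlikely to matter in practice.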

adtac · 2017-01-06T17:21:20Z

Looks neat 👍

And unless my eyes fail me, the data.close() is outside the with open(...) as data ;) I know, this is just a prototype. Just saying :P
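
A small self-contained check of that point: the `with` statement closes the file when its block exits, so a later `close()` call is redundant. The temporary file below exists only for this demonstration:

```python
import os
import tempfile

# Write a throwaway script so the example is self-contained.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("#!/bin/sh\necho hello\n")
    path = tmp.name

with open(path, "r") as data:
    first_line = data.readline()

# The file is already closed once the `with` block exits.
print(data.closed)
os.remove(path)
```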

hemangsk · 2017-01-07T20:11:22Z

Thanks for the feedback @jayvdb @adtac :) I'm on it

Add a get_language_from_hashbang function which checks whether hashbang exists in a file and returns the language used in that file. This get_language_from_hashbang() is used in split_by_language() and language_percentage() Fixes coala#45

sils · 2017-02-01T22:26:42Z

@hemangsk any news?

hemangsk · 2017-02-02T02:37:16Z

I'll do the second iteration today asap :)


jayvdb added the difficulty/medium label Nov 21, 2016

adtac assigned hemangsk Dec 10, 2016

hemangsk mentioned this issue Jan 16, 2017

Utilites.py: Add language detection from hashbang #72

Merged

gitmate-bot added the status/STALE label Sep 1, 2017

gitmate-bot closed this as completed in #72 Jul 23, 2018
