Detect language of file based on shebang #45

Closed
adhikasp opened this issue Nov 19, 2016 · 11 comments

@adhikasp
Member

adhikasp commented Nov 19, 2016

In some projects (mainly those that specifically target Linux), files without an extension are common. They typically use a shebang (#!/bin/sh or similar) to specify the language/interpreter used.

AFAIK, coala-quickstart detects the language only from the file extension, so it fails to detect extensionless files. Also, in the .coafile the path to such files needs to be stated explicitly (e.g. files = **.sh, image/base/*).

Project example
https://github.com/discourse/discourse_docker, see this folder

@hemangsk
Member

hemangsk commented Dec 9, 2016

Hey! can I take this up?

@adtac
Member

adtac commented Dec 10, 2016

@hemangsk sure, but please describe how you'd do this before you write the code in case any of us have suggestions/modifications - it's much easier for both sides! :)

Assigning you 👍

@hemangsk
Member

@adtac Thanks!
I came up with this solution: in coala-quickstart > generation > Utilities.py, the functions get_extension() and split_by_language() have the similar task of separating the given files by language and extension. So inside the loop that iterates through the list of project_files, we can add calls to new utility functions get_language_from_hashbang() and get_extension_from_hashbang().
These will read the first line of an extensionless file and check whether it starts with '#!', confirming it's a hashbang. From that we can obtain the language used in the file, and hence the extension, from the exts dictionary or via the pygments approach [https://github.com/coala/coala/pull/3162].

For example, if the first line is

first_line = '#!/bin/bash'
lang = first_line.rsplit('/', 1)[-1]  # 'bash'
ext = exts[lang]

Would this be the right approach, and can I start working on it?

@jayvdb
Member

jayvdb commented Dec 13, 2016

Sounds good. get_language_from_hashbang will be the interesting/challenging part. Would be good if you can describe how you will do that.

@adtac
Member

adtac commented Dec 20, 2016

One more thing to look into is #!/usr/bin/env python - that should have the same effect as a #!/usr/bin/python shebang ;)
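
A minimal sketch of that normalization, assuming a hypothetical helper named interpreter_from_shebang (it is not part of coala-quickstart): when the program named in the shebang is env, the interpreter is taken from env's first non-option argument, so both forms yield the same name.

```python
import os


def interpreter_from_shebang(first_line):
    """Hypothetical helper: return the interpreter name, or None."""
    if not first_line.startswith('#!'):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    program = os.path.basename(parts[0])
    # For '#!/usr/bin/env python', the interpreter is env's argument.
    if program == 'env':
        args = [p for p in parts[1:] if not p.startswith('-')]
        return args[0] if args else None
    return program


# Both shebang styles resolve to the same interpreter name.
print(interpreter_from_shebang('#!/usr/bin/env python'))
print(interpreter_from_shebang('#!/usr/bin/python'))
```

Splitting on whitespace rather than matching a fixed regex also keeps trailing interpreter options (such as the -e in #!/bin/sh -e) out of the returned name.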

@hemangsk
Member

hemangsk commented Jan 4, 2017

Sorry for the delay! Here's the approach I've come up with for get_language_from_hashbang(),
in coala-quickstart > generation > Utilities.py:

import os
import re
from collections import defaultdict

# exts is the existing extension -> languages mapping used in Utilities.py.


def split_by_language(project_files):
    lang_files = defaultdict(set)
    for file in project_files:
        name, ext = os.path.splitext(file)
        if ext in exts:
            for lang in exts[ext]:
                lang_files[lang.lower()].add(file)
                lang_files["all"].add(file)
        elif name and not ext:
            # Extensionless file: check its first line for a hashbang.
            # The with block closes the file, so no close() call is needed.
            with open(file, 'r') as data:
                hashbang = data.readline().strip()
            if not hashbang.startswith('#!'):
                continue
            language = get_language_from_hashbang(hashbang)
            if not language:
                continue
            for ext in exts:
                for lang in exts[ext]:
                    if language.lower() == lang.lower():
                        lang_files[lang.lower()].add(file)
                        lang_files["all"].add(file)
    return lang_files

And get_language_from_hashbang(hashbang)

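def get_language_from_hashbang(hashbang):
    # Return the interpreter name from a hashbang line, or None.
    match = re.match(r'#!\s*(\S+)(?:\s+(\S+))?', hashbang)
    if not match:
        return None
    interpreter = os.path.basename(match.group(1))
    # '#!/usr/bin/env python' names the interpreter as env's argument.
    if interpreter == 'env' and match.group(2):
        return match.group(2)
    return interpreter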

Shortcomings of this approach that I've found so far and am working on:

  • The regex can be improved (using backtracking?)
  • The nested for loop over exts is not time efficient
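
One way to avoid the nested loop is to invert the mapping once and then do a single dictionary lookup per file. The following is only a sketch: build_language_index is a hypothetical helper, and the small exts dictionary is illustrative stand-in data, not the project's real table.

```python
from collections import defaultdict

# Illustrative stand-in for the extension -> languages mapping (assumed data).
exts = {'.py': ['Python'], '.sh': ['Shell'], '.bash': ['Shell']}


def build_language_index(ext_map):
    """Hypothetical helper: map lowercase language name -> set of extensions."""
    index = defaultdict(set)
    for ext, langs in ext_map.items():
        for lang in langs:
            index[lang.lower()].add(ext)
    return index


# Built once; each later check is a single lookup instead of a nested loop.
language_index = build_language_index(exts)
print('python' in language_index)
print(sorted(language_index['shell']))
```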

@jayvdb
Member

jayvdb commented Jan 6, 2017

The get_language_from_hashbang return value can be memoized.
But performance is not a consideration, as this is run once per project lifetime typically.
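
For illustration only, memoization could be done with functools.lru_cache from the standard library. The function below is a hypothetical stand-in, not the project's implementation; it assumes the interpreter name is the last path component, or env's first argument.

```python
import functools
import os


@functools.lru_cache(maxsize=None)
def language_from_hashbang_cached(hashbang):
    """Hypothetical memoized variant; not the project's implementation."""
    parts = hashbang[2:].split() if hashbang.startswith('#!') else []
    if not parts:
        return None
    name = os.path.basename(parts[0])
    # Treat '#!/usr/bin/env X' like a direct '#!/usr/bin/X' line.
    if name == 'env' and len(parts) > 1:
        return parts[1]
    return name


language_from_hashbang_cached('#!/bin/bash')
language_from_hashbang_cached('#!/bin/bash')  # served from the cache
print(language_from_hashbang_cached.cache_info().hits)
```

Since many scripts in a project share the same first line, repeated lines are resolved without re-parsing.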

@adtac
Member

adtac commented Jan 6, 2017

Looks neat 👍

And unless my eyes fail me, the data.close() is outside the with open(...) as data ;) I know, this is just a prototype. Just saying :P
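
A small self-contained check of that point, using a throwaway temporary file (the file contents here are made up for the example): once the with block exits, the file object is already closed, so a separate close() call is unnecessary.

```python
import os
import tempfile

# Write a throwaway script so the example is self-contained.
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write('#!/bin/sh\necho hello\n')
    path = tmp.name

with open(path, 'r') as data:
    first_line = data.readline()

# Leaving the with block has already closed the file.
print(data.closed)
print(first_line.strip())

os.remove(path)
```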

@hemangsk
Member

hemangsk commented Jan 7, 2017

Thanks for the feedback @jayvdb @adtac :) I'm on it

hemangsk added a commit to hemangsk/coala-quickstart that referenced this issue Jan 16, 2017
Add a get_language_from_hashbang function
which checks whether hashbang exists in a
file and returns the language used in that
file.

This get_language_from_hashbang()
is used in split_by_language()
and language_percentage()

Fixes coala#45
@sils
Member

sils commented Feb 1, 2017

@hemangsk any news?

@hemangsk
Member

hemangsk commented Feb 2, 2017 via email

hemangsk added a commit to hemangsk/coala-quickstart that referenced this issue Feb 2, 2017
hemangsk added a commit to hemangsk/coala-quickstart that referenced this issue Jul 21, 2018