Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python docstrings are counted as code #62

Closed
olivren opened this issue Mar 4, 2019 · 7 comments
Closed

Python docstrings are counted as code #62

olivren opened this issue Mar 4, 2019 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@olivren
Copy link

olivren commented Mar 4, 2019

I tried scc 2.2.0 on Windows, on a python file, and it appears that Python docstrings are counted as code instead of comments.

'''This is a module docstring'''

class C:
  '''
  This is a class docstring
  '''
  
  def f():
    """This is a function docstring.
    simple quotes and double quotes are equivalent
    """
    pass

lines: 12
code: 10
comments: 0
blanks: 2

This should be:
lines: 12
code: 3
comments: 7
blanks:2

@boyter
Copy link
Owner

boyter commented Mar 4, 2019

So the reason scc and the other tools do this is due to Python using triple quotes as strings.

>>> '''this is string'''.split()
['this', 'is', 'string']

Knowing the difference between a docstring or a string in the file I don't think is possible without parsing the code using an AST which is going to be incredibly slow compared to how all of the tools currently work.

I am not sure how python treats them in the interpreter but I belive it actually processes them every run which would mean that according to Python they are lines of code as well.

I belive this was raised enough with tokei which produced the treat_doc_strings_as_comments = true setting in its toml config to stem the issue requests. Not sure if I agree with this approach myself.

If you can think of a way to reliably identify when it is a doc string vs an actual string, perhaps with some test cases I would be happy to implement this though. Some thoughts I have to help with this.

If the triple quote string starts following a newline with only white-space characters in front and ends followed by only a newline or white-space characters it is a comment

Not sure if that is exclusive enough to catch all cases though.

@olivren
Copy link
Author

olivren commented Mar 5, 2019

The method you describe is indeed the best heuristic I can think of. If you want to be be even more strict, you can also add a contition that the previous line does not end with \, which indicates a line continuation. That would prevent this snippet to be considered a docstring.

message = \
"""
hello
world
"""

Here is how I would rank the various strategies, from the worst to the best:

  1. Interpret all the triple quoted strings as code (current behavior)
  2. Interpret all the triple quoted strings as comments
  3. Your proposed heuristic
  4. Your proposed heuristic with the additional handling of line continuation

Of course it's possible to use a real parser, but it needs to work with various versions of Python, and be resilient to malformed code. It would probably need one of these heuristics as a fallback, anyway. And even then, I doubt a real parser would be a significant improvement over strategy number 4.

For my own usage (on my own Python code base), a tool that uses strategy 1 is perfectly useless, while strategy 2 is perfectly fine.

@boyter boyter added the enhancement New feature or request label Mar 5, 2019
@boyter boyter self-assigned this Mar 5, 2019
@boyter
Copy link
Owner

boyter commented Mar 5, 2019

I am not a huge fan of option 2 personally. I would rather implement this properly. I just need to figure out how to change the JSON language structure to support this and I will start looking at implementing.

@boyter
Copy link
Owner

boyter commented Mar 6, 2019

Attempting to implement on this branch https://github.com/boyter/scc/tree/issue62

@boyter
Copy link
Owner

boyter commented Apr 30, 2019

Since this is related to #71 I am implementing a more generic solution there and discarding the work on the branch. I will remove it eventually once I have things working.

@boyter
Copy link
Owner

boyter commented May 6, 2019

The JSON changes needed to support this are sitting here #76

Once all 3 pending PR's are merged in I will resume work on this issue.

lukas-brenning added a commit to lukas-brenning/scc that referenced this issue Aug 21, 2019
Add test for issue62 (boyter#62)
lukas-brenning added a commit to lukas-brenning/scc that referenced this issue Aug 21, 2019
Add test for issue62 (boyter#62)
lukas-brenning added a commit to lukas-brenning/scc that referenced this issue Aug 21, 2019
Add test for issue62 (boyter#62)
@boyter
Copy link
Owner

boyter commented Aug 23, 2019

Merged in, should be all good now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants