Python docstrings are counted as code #62

olivren · 2019-03-04T10:04:42Z

I tried scc 2.2.0 on Windows, on a python file, and it appears that Python docstrings are counted as code instead of comments.

'''This is a module docstring'''

class C:
  '''
  This is a class docstring
  '''
  
  def f():
    """This is a function docstring.
    simple quotes and double quotes are equivalent
    """
    pass

lines: 12
code: 10
comments: 0
blanks: 2

This should be:
lines: 12
code: 3
comments: 7
blanks:2

boyter · 2019-03-04T21:14:04Z

So the reason scc and the other tools do this is due to Python using triple quotes as strings.

>>> '''this is string'''.split()
['this', 'is', 'string']

Knowing the difference between a docstring or a string in the file I don't think is possible without parsing the code using an AST which is going to be incredibly slow compared to how all of the tools currently work.

I am not sure how python treats them in the interpreter but I belive it actually processes them every run which would mean that according to Python they are lines of code as well.

I belive this was raised enough with tokei which produced the treat_doc_strings_as_comments = true setting in its toml config to stem the issue requests. Not sure if I agree with this approach myself.

If you can think of a way to reliably identify when it is a doc string vs an actual string, perhaps with some test cases I would be happy to implement this though. Some thoughts I have to help with this.

If the triple quote string starts following a newline with only white-space characters in front and ends followed by only a newline or white-space characters it is a comment

Not sure if that is exclusive enough to catch all cases though.

olivren · 2019-03-05T13:41:54Z

The method you describe is indeed the best heuristic I can think of. If you want to be be even more strict, you can also add a contition that the previous line does not end with \, which indicates a line continuation. That would prevent this snippet to be considered a docstring.

message = \
"""
hello
world
"""

Here is how I would rank the various strategies, from the worst to the best:

Interpret all the triple quoted strings as code (current behavior)
Interpret all the triple quoted strings as comments
Your proposed heuristic
Your proposed heuristic with the additional handling of line continuation

Of course it's possible to use a real parser, but it needs to work with various versions of Python, and be resilient to malformed code. It would probably need one of these heuristics as a fallback, anyway. And even then, I doubt a real parser would be a significant improvement over strategy number 4.

For my own usage (on my own Python code base), a tool that uses strategy 1 is perfectly useless, while strategy 2 is perfectly fine.

boyter · 2019-03-05T21:05:19Z

I am not a huge fan of option 2 personally. I would rather implement this properly. I just need to figure out how to change the JSON language structure to support this and I will start looking at implementing.

boyter · 2019-03-06T21:05:07Z

Attempting to implement on this branch https://github.com/boyter/scc/tree/issue62

boyter · 2019-04-30T22:19:53Z

Since this is related to #71 I am implementing a more generic solution there and discarding the work on the branch. I will remove it eventually once I have things working.

boyter · 2019-05-06T08:11:30Z

The JSON changes needed to support this are sitting here #76

Once all 3 pending PR's are merged in I will resume work on this issue.

Add test for issue62 (boyter#62)

boyter · 2019-08-23T01:15:17Z

Merged in, should be all good now.

olivren mentioned this issue Mar 4, 2019

Count for python is wildly inaccurate cgag/loc#111

Open

boyter added the enhancement New feature or request label Mar 5, 2019

boyter self-assigned this Mar 5, 2019

boyter mentioned this issue Apr 30, 2019

C# line count is broken if verbatim string with backslash is present #71

Closed

boyter mentioned this issue May 6, 2019

Change JSON structure and code to support custom string types #76

Merged

lukas-brenning added a commit to lukas-brenning/scc that referenced this issue Aug 21, 2019

Add docString handling

989daf8

Add test for issue62 (boyter#62)

lukas-brenning added a commit to lukas-brenning/scc that referenced this issue Aug 21, 2019

Add docString handling

33f8895

Add test for issue62 (boyter#62)

lukas-brenning added a commit to lukas-brenning/scc that referenced this issue Aug 21, 2019

Add docString handling

eecb4b2

Add test for issue62 (boyter#62)

lukas-brenning mentioned this issue Aug 21, 2019

Feature/issue62 #102

Merged

boyter closed this as completed Aug 23, 2019

boyter mentioned this issue Feb 9, 2021

Python Code Counts Innacurate #226

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python docstrings are counted as code #62

Python docstrings are counted as code #62

olivren commented Mar 4, 2019 •

edited

Loading

boyter commented Mar 4, 2019

olivren commented Mar 5, 2019 •

edited

Loading

boyter commented Mar 5, 2019

boyter commented Mar 6, 2019

boyter commented Apr 30, 2019

boyter commented May 6, 2019

boyter commented Aug 23, 2019

Python docstrings are counted as code #62

Python docstrings are counted as code #62

Comments

olivren commented Mar 4, 2019 • edited Loading

boyter commented Mar 4, 2019

olivren commented Mar 5, 2019 • edited Loading

boyter commented Mar 5, 2019

boyter commented Mar 6, 2019

boyter commented Apr 30, 2019

boyter commented May 6, 2019

boyter commented Aug 23, 2019

olivren commented Mar 4, 2019 •

edited

Loading

olivren commented Mar 5, 2019 •

edited

Loading