
Linguist is reporting my project as a Jupyter Notebook #3316

Closed
adam704a opened this issue Nov 3, 2016 · 18 comments

Comments

@adam704a commented Nov 3, 2016

As you can see, I have some notebooks, but this is mostly a Python project.

https://github.com/ICTatRTI/researchnet

Did I do something wrong?

@TotalVerb commented Nov 4, 2016

Jupyter notebooks have an inflated number of lines of code, since they store a lot of metadata. So it doesn't take many notebooks to "take over" a project.

@Alhadis (Collaborator) commented Nov 4, 2016

Does anybody actually write these files out by hand? Because it sounds like they're generated primarily from a webapp:

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

And if that's the case, well, I'd say these generated files should be marked as exactly that: generated.

/cc @pchaigno /resident Python-guy

@TotalVerb commented Nov 4, 2016

Whatever action is taken, it would be best to maintain the searchability and identifiability of notebook-only repositories.

@TotalVerb commented Nov 4, 2016

Possibly the best course of action is to modify the lines of code reported into an "equivalent lines of code" measure which takes into account the unavoidable boilerplate. For instance, the source line consisting of the single character π may turn into this monstrosity in the .ipynb file:

  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "π = 3.1415926535897..."
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "π"
   ]
  },

@Alhadis (Collaborator) commented Nov 4, 2016

That thing is about as long as the value's floating-point component itself.

All we'd need to mark these things as generated is to match against a common pattern that's consistently used in webapp-created notebooks. Usually it's something like Generated by AppName 1.1.1.1.1.1-betasemverasfuck0 or what-have-you.
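
For illustration, a minimal Python sketch of that kind of single-line marker check, assuming a hypothetical "Generated by ..." banner near the top of the file (Linguist's real rules are written in Ruby, and stock Jupyter notebooks carry no such banner):

import re

# Hypothetical generator banner, e.g. 'Generated by AppName 1.2.3'.
# Anchored and specific, to leave no room for misidentification.
GENERATED_BANNER = re.compile(r'^\s*"?Generated by \S+ \d+(\.\d+)*\S*"?,?\s*$')

def looks_generated(path, max_lines=10):
    """Return True if one of the first few lines matches the banner."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f):
            if lineno >= max_lines:
                return False
            if GENERATED_BANNER.match(line):
                return True
    return False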

@TotalVerb commented Nov 4, 2016

You could maybe match against

 "metadata": {
   // [ stuff in here varies ]
 },
 "nbformat": 4,
 "nbformat_minor": 1

but wouldn't this make notebook-only repositories classify incorrectly?
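
As a sketch of the structural version of that match: since every notebook is valid JSON, a detector could parse the file and key off nbformat rather than a brittle line pattern (the nbformat and nbformat_minor keys are real notebook fields; the function name is illustrative, not Linguist's API):

import json

def is_nbformat4_notebook(path):
    """Detect an nbformat-4 notebook by its top-level JSON keys,
    so the varying "metadata" contents can't break the match."""
    try:
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
    except (OSError, ValueError):
        return False  # unreadable or not JSON: not a notebook
    return (
        isinstance(doc, dict)
        and doc.get("nbformat") == 4
        and "nbformat_minor" in doc
        and isinstance(doc.get("cells"), list)
    )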

@Alhadis (Collaborator) commented Nov 4, 2016

Marking them as generated simply omits them from the language-statistics bar. We already have a number of generated-file detection routines that filter files that would otherwise unfairly skew a repository's stats. Here's the logic for detecting generated PostScript, for example. You can imagine how many projects would be incorrectly classified as PostScript if we left every .eps file unchecked.

And while that snippet you've posted might work, it should ideally be 100% unambiguous, i.e., leave no room for misidentification. The existing rules that test against single-line patterns are all very specific:

Et cetera.

@TotalVerb commented Nov 4, 2016

The difference between PostScript and Jupyter, though, is that all Jupyter notebooks are "generated" (either by the web app or by IPython's CLI). And unlike PostScript, human effort generally needs to go into every cell of a Jupyter notebook; it's just that each cell ends up taking a lot of lines of code.

Here are some empty, newly-created notebooks with Julia and Python kernels.

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 0.5.0",
   "language": "julia",
   "name": "julia-0.5"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

and

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

@TotalVerb commented Nov 4, 2016

For what it's worth, I personally think a good solution would be to estimate how many lines of a Jupyter notebook are "source" and how many are "generated". Source lines (which are all written by a human) generally look like this:

   "source": [
    "import Base: +\n",
    "\n",
    "+{T<:Number}(x::DualNumber{T}, y::DualNumber{T}) = DualNumber{T}(x.re + y.re, x.ep + y.ep)\n",
    "\n",
    "DualNumber(10.0, 17.0) + DualNumber(5.0, 9.0)"
   ]

Can Linguist already handle partial file identifications?
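
A minimal sketch of that source-versus-generated split, assuming the nbformat-4 layout shown above (the helper is illustrative, not an existing Linguist API):

import json

def notebook_line_counts(path):
    """Return (source_lines, total_lines) for a notebook: only the
    "source" entries of each cell count as human-written lines."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    nb = json.loads(text)
    source_lines = 0
    for cell in nb.get("cells", []):
        src = cell.get("source", [])
        # nbformat stores sources as a list of line strings,
        # though a single string is also legal.
        source_lines += len(src) if isinstance(src, list) else len(src.splitlines())
    return source_lines, len(text.splitlines())

The ratio of the two numbers would give the "equivalent lines of code" discount suggested earlier in the thread.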

@soniclavier commented Nov 29, 2016

I am having the same problem: I have about 3-4 .ipynb files out of 144 files (mainly Java and Scala) in my repo. If there were an option to make Linguist report based on file count rather than size, it would be helpful.

For now, I added *.ipynb linguist-vendored to the .gitattributes file in my repository.

@caged (Contributor) commented May 3, 2017

👋 It looks like the original repo is no longer classified as an IPython notebook, and I don't see a .gitattributes file in the repo. Can someone clarify whether this is still an issue?

@lildude (Member) commented Oct 31, 2017

As @caged mentioned, things appear to be working now on the original repo. As there hasn't been an update since 3 May, I'm closing this on the basis that it has been resolved.

@lildude closed this Oct 31, 2017

@pierluigiferrari commented Nov 8, 2017

@lildude, @caged I can confirm that things are not working regarding Jupyter notebooks. It's still the same issue as before: a Jupyter notebook consists of Python code that the author wrote, plus generated code that makes it an interactive environment displayable in a web browser. The generated code usually makes up far more lines than the Python code the author wrote.

The first problem is that, for the purpose of what Linguist is trying to achieve (i.e., a breakdown of the programming languages the author used in the repo), "Jupyter Notebook" should not be considered a language at all. For all intents and purposes it's just a container that holds Python code.

The second problem is that simply ignoring Jupyter notebooks in the statistics also ignores all of the actually relevant Python code inside them.

@lildude (Member) commented Nov 8, 2017

Thanks for confirming this and for the explanation, @pierluigiferrari. Having looked into it, I now have a better understanding, and given your two points, I don't think this is something that can easily, if ever, be addressed automatically.

The biggest limiting factor I can see is that Jupyter notebooks combine written and generated language within the same file. Linguist doesn't support partial file classification and isn't likely to ever do so, as I'd imagine it would be incredibly resource-intensive and probably highly unreliable when it comes to differentiating between human- and computer-written code within the same file. Our current classifier is already hugely inefficient as it is.

The next limiting factor is preference. Some want Jupyter notebooks recognised for what they are, others prefer them to be identified by the language they're written in, and others still don't want the files counted at all.

I think our current implementation (implemented in #2746 via #2763), combined with manual overrides, is probably the best compromise for everyone.

Jupyter notebooks are also far too prevalent on GitHub to change the default behaviour without major backlash.

@Alhadis (Collaborator) commented Nov 8, 2017

Linguist doesn't support partial file classification and isn't likely to ever do so, as I'd imagine it would be incredibly resource-intensive and probably highly unreliable when it comes to differentiating between human- and computer-written code within the same file.

... which is where an idea of mine may hold the answer. ;) I've written up some sleep-deprived explanations of how averages weighted by specific scopes could yield a more rational measure of Jupyter Notebook usage, e.g., the number of lines the programmer actually penned by their own hand.

@pierluigiferrari commented Nov 8, 2017

@lildude I understand. As you said, it seems the best solution for Jupyter notebook users is a manual override. Thanks for clarifying why it is the way it is and why it will likely remain this way!

@Borda commented Mar 18, 2019

@lildude I understand. As you said, it seems the best solution for Jupyter notebook users is a manual override. Thanks for clarifying why it is the way it is and why it will likely remain this way!

What does the manual override for language statistics on GitHub mean? Is it the .gitattributes file?
In my opinion, it would be fairer if, for .ipynb files, only the source lines were counted, not all the metadata and all the generated outputs...

@pchaigno (Collaborator) commented Mar 19, 2019

@Borda Please see the last paragraph of how Linguist works and Linguist overrides.
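
In short, the overrides live in a .gitattributes file at the root of the repository. A sketch of two of the documented options (pick one per pattern, depending on the preference discussed above):

# Exclude notebooks from the language statistics entirely
*.ipynb linguist-vendored

# Or keep counting them, but attribute them to Python
*.ipynb linguist-language=Python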
