Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Codepod-exported ipynb not compliant to NBFormat 4.0 #411

Open
forrestbao opened this issue Aug 1, 2023 · 3 comments
Open

Codepod-exported ipynb not compliant to NBFormat 4.0 #411

forrestbao opened this issue Aug 1, 2023 · 3 comments

Comments

@forrestbao
Copy link
Collaborator

The attachment of this issue is an ipynb exported from this Codepod repo. It has some format issues incompatible with NBFormat 4.0:

  1. #/metadata has no property called name
  2. id is a mandatory field for each cell under #/cells But it's missing for all cells exported. id is not a field in metadata of a cell.
  3. All cells' output_type is display_data. This is not right. There are four types: execute_result, display_data, error (see below), and stream. display_data has no property called execute_count but execute_result has. In many cases in the attached example ipynb, the output should be steam? For example, this cell
        {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "w1o7rfvk37nkk2p8105s",
        "geoScore": 0.0021
      },
      "source": [
        "'''\n",
        "CodePod Scope structure: \n",
        "'''\n",
        "1+1\n"
      ],
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "2\n"
            ]
          }
        }
      ]
    },
  4. Erroneous cells are handled completely wrong. One of the pods contains error and traceback info. But the corresponding json is incorrectly as
    {
          "cell_type": "code",
          "execution_count": 3,
          "metadata": {
            "id": "xs1x514pd2o089p0uzub",
            "geoScore": 0.0012
          },
          "source": [
            "'''\n",
            "CodePod Scope structure: \n",
            "'''\n",
            "foo(4)\n"
          ],
          "outputs": [
            {
              "output_type": "display_data",
              "data": {
                "text/plain": [
                  "error\n"
                ]
              }
            }
          ]
        },

where the outputs/output_type should have been error, and data is NOT an allowed field for an error. Required fields, expect output_type are all missing. See specs here.
5. What is the geoScore property for each cell?
6. Why is a scope and the first pod under it mixed together? Like this:
json { "cell_type": "code", "execution_count": 0, "metadata": { "id": "yla4ktyffvgxenaivved", "geoScore": 0.0011 }, "source": [ "'''\n", "CodePod Scope structure: Another test scope\n", "'''\n", "def foo(x):\n", " return x * x\n" ], "outputs": [] },
The scope should have become one separate cell of raw or even markdown type. In Markdown, the header level (#, ##, ###, ...) should be proportional to the scope level.
7. And, can we beautify the JSON output?

@senwang86
Copy link
Collaborator

senwang86 commented Aug 1, 2023

The attachment of this issue is an ipynb exported from this Codepod repo. It has some format issues incompatible with NBFormat 4.0:

  1. #/metadata has no property called name

Quoted from nbformat official doc,

Metadata is a place that you can put arbitrary JSONable information about your notebook, cell, or output. Because it is a shared namespace, any custom metadata should use a sufficiently unique namespace, such as metadata.kaylees_md.foo = “bar”.

I use the name field to store the reponame currently, it reminds me that I should also put "repo Id" in the top-level metadata.

  1. id is a mandatory field for each cell under #/cells But it's missing for all cells exported. id is not a field in metadata of a cell.

The "pod id" in metadata is exactly for Codepod use, Colab will create and refer cell.id for its own whatever purpose, and this separation will avoid any future id incompatibility caused by either Colab or Codepod. Additionally, the mappings between "pod Id" and cell.id could be helpful as well, e.g., if a cell doesn't have a "pod id" in metadata, we know the cell is created after importing a Codepod repo.

  1. All cells' output_type is display_data. This is not right. There are four types: execute_result, display_data, error (see below), and stream. display_data has no property called execute_count but execute_result has. In many cases in the attached example ipynb, the output should be steam? For example, this cell
        {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "w1o7rfvk37nkk2p8105s",
        "geoScore": 0.0021
      },
      "source": [
        "'''\n",
        "CodePod Scope structure: \n",
        "'''\n",
        "1+1\n"
      ],
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "2\n"
            ]
          }
        }
      ]
    },

It handles both stream and display_data, and display_data will not overwrite stream, here's the code pointer. However, I do need to find a way to correctly map the 4 output types to our current DB schema, are display_data and execute_result both written into Pod.result, @lihebi ?

  1. Erroneous cells are handled completely wrong. One of the pods contains error and traceback info. But the corresponding json is incorrectly as
    {
          "cell_type": "code",
          "execution_count": 3,
          "metadata": {
            "id": "xs1x514pd2o089p0uzub",
            "geoScore": 0.0012
          },
          "source": [
            "'''\n",
            "CodePod Scope structure: \n",
            "'''\n",
            "foo(4)\n"
          ],
          "outputs": [
            {
              "output_type": "display_data",
              "data": {
                "text/plain": [
                  "error\n"
                ]
              }
            }
          ]
        },

where the outputs/output_type should have been error, and data is NOT an allowed field for an error. Required fields, expect output_type are all missing. See specs here.

Good catch, the error are not added into the result output yet. I will send out a patch later.

  1. What is the geoScore property for each cell?

It's for 2D->1D mapping, geoScore is used to decide the order of exported cells.

  1. Why is a scope and the first pod under it mixed together? Like this: json { "cell_type": "code", "execution_count": 0, "metadata": { "id": "yla4ktyffvgxenaivved", "geoScore": 0.0011 }, "source": [ "'''\n", "CodePod Scope structure: Another test scope\n", "'''\n", "def foo(x):\n", " return x * x\n" ], "outputs": [] }, The scope should have become one separate cell of raw or even markdown type. In Markdown, the header level (#, ##, ###, ...) should be proportional to the scope level.

Please check the discussion about the scope information embedding here.

  1. And, can we beautify the JSON output?

Sure, let me send a patch later.

@forrestbao
Copy link
Collaborator Author

forrestbao commented Aug 2, 2023

  1. #/metadata has no property called name

Quoted from nbformat official doc,

Metadata is a place that you can put arbitrary JSONable information about your notebook, cell, or output. Because it is a shared namespace, any custom metadata should use a sufficiently unique namespace, such as metadata.kaylees_md.foo = “bar”.

I use the name field to store the reponame currently, it reminds me that I should also put "repo Id" in the top-level metadata.

My bad. I didn't notice that #/metadata allows additional properties.

  1. id is a mandatory field for each cell under #/cells But it's missing for all cells exported. id is not a field in metadata of a cell.

The "pod id" in metadata is exactly for Codepod use, Colab will create and refer cell.id for its own whatever purpose, and this separation will avoid any future id incompatibility caused by either Colab or Codepod. Additionally, the mappings between "pod Id" and cell.id could be helpful as well, e.g., if a cell doesn't have a "pod id" in metadata, we know the cell is created after importing a Codepod repo.

But id is a mandatory field for each cell. See here Colab and Deepnote seem to ignore it and put the cell_id into a cell's metadata.
Again, I missed that #/cells/{item}/metadata can have additional fields. So having id there is fine.

  1. All cells' output_type is display_data. This is not right. There are four types: execute_result, display_data, error (see below), and stream. display_data has no property called execute_count but execute_result has. In many cases in the attached example ipynb, the output should be steam? For example, this cell
        {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "w1o7rfvk37nkk2p8105s",
        "geoScore": 0.0021
      },
      "source": [
        "'''\n",
        "CodePod Scope structure: \n",
        "'''\n",
        "1+1\n"
      ],
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "2\n"
            ]
          }
        }
      ]
    },

It handles both stream and display_data, and display_data will not overwrite stream, here's the code pointer. However, I do need to find a way to correctly map the 4 output types to our current DB schema, are display_data and execute_result both written into Pod.result, @lihebi ?

I wonder whether the type info is returned by the Jupyter kernel. Deepnote and Colab both list the content to be displayed on stdout (especially those explicitly triggered via Python's print() function) as stream and those from evaluating an expression (on an interactive Python shell) as execute_output.

Below is an example of stream by Google Colab:

"outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "1/1 [==============================] - 0s 282ms/step\n",
            "(3, 2, 4)\n"
          ]
        }
      ]

This is an example error output (exported by Colab with even color code):

"outputs": [
        {
          "output_type": "error",
          "ename": "TypeError",
          "evalue": "ignored",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
            "\u001b[0;32m<ipython-input-3-e9dd7567ea46>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfoo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
            "\u001b[0;31mTypeError\u001b[0m: foo() takes 1 positional argument but 2 were given"
          ]
        }
      ]
  1. Erroneous cells are handled completely wrong. One of the pods contains error and traceback info. But the corresponding json is incorrectly as
    {
          "cell_type": "code",
          "execution_count": 3,
          "metadata": {
            "id": "xs1x514pd2o089p0uzub",
            "geoScore": 0.0012
          },
          "source": [
            "'''\n",
            "CodePod Scope structure: \n",
            "'''\n",
            "foo(4)\n"
          ],
          "outputs": [
            {
              "output_type": "display_data",
              "data": {
                "text/plain": [
                  "error\n"
                ]
              }
            }
          ]
        },

where the outputs/output_type should have been error, and data is NOT an allowed field for an error. Required fields, expect output_type are all missing. See specs here.

Good catch, the error are not added into the result output yet. I will send out a patch later.

  1. What is the geoScore property for each cell?

It's for 2D->1D mapping, geoScore is used to decide the order of exported cells.

  1. Why is a scope and the first pod under it mixed together? Like this: json { "cell_type": "code", "execution_count": 0, "metadata": { "id": "yla4ktyffvgxenaivved", "geoScore": 0.0011 }, "source": [ "'''\n", "CodePod Scope structure: Another test scope\n", "'''\n", "def foo(x):\n", " return x * x\n" ], "outputs": [] }, The scope should have become one separate cell of raw or even markdown type. In Markdown, the header level (#, ##, ###, ...) should be proportional to the scope level.

Please check the discussion about the scope information embedding here.

I read the two approaches you proposed. Markdown #title ##Title (jupyter) is the way to go. It keeps the hierarchy of the code. The number of cells is not a concern here. Please patch.

@forrestbao
Copy link
Collaborator Author

forrestbao commented Aug 2, 2023

I wonder whether the type info is returned by the Jupyter kernel. Deepnote and Colab both list the content to be displayed on stdout (especially those explicitly triggered via Python's print() function) as stream and those from evaluating an expression (on an interactive Python shell) as execute_output.

I read the messaging protocols in Jupyter notebook
So indeed, the output/execution result type, e.g., display_data, execution_result, stream, or error, is returned by the Jupyter kernel in the msg_type field of the header of every message:

{
    'msg_id' : str, # typically UUID, must be unique per message
    'session' : str, # typically UUID, should be unique per session
    'username' : str,
    # ISO 8601 timestamp for when the message is created
    'date': str,
    # All recognized message type strings are listed below.
    'msg_type' : str,
    # the message protocol version
    'version' : '5.0',
}

Then, all messages for executing code are exchanged on the IOPub channel

The difference between display_data and execute_result is:

  1. Format-wise: "Results of an execution are published as an execute_result. These are identical to display_data messages, with the addition of an execution_count key."
  2. Purpose-wise: Per Jason Grout, a key developer of Jupyter, in this thread, "The execute_result vs display_data difference you see is from how you evaluated things. execute_result basically returns the result of displaying the last line of execution. display_data messages come from explicitly displaying something." Like suspected above, execute_result is from evaluating the last express of a cell/pod in an interactive python shell, e.g., x.shape.
  3. Since version 5.1, display_data can be updated. Maybe that's the reason that execute_count is not included in a display_data.

Additionally, I experimented with a Colab notebook. If you export it to ipynb, pay attention to the output types of the four cells. x.shape is execute_result, print(x.shape) is stream, and more interestingly, plt.plot([1,2,3], [4,5,6]) results in a double-element output array, the first of which being execute_result and the second, the plot, being display_data, and lastly plt.plot([1,2,3], [4,5,6]) results in one display_data mixing the text output <Figure size 640x480 with 1 Axes> and the plot.

@lihebi do we store the message type, returned from Jupyter kernel, in Codepod's database? That will save @senwang86 a lot of effort when exporting and importing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants