
Add Multimodal Output Support for execute_xxx Series Tools #69

@ChengJiale150

Summary

Currently, the tools in jupyter_mcp_server/server.py that return execution results (e.g., append_execute_code_cell, execute_cell_with_progress) return text only; image outputs are replaced by a "[Image Output (PNG)]" placeholder. This issue proposes adding full multimodal output support so that Agents backed by multimodal large models can apply their visual understanding to notebook output.

Core Objectives

  1. Unleash the Potential of Multimodal Large Models: Support direct output and analysis of visual content such as images and charts.
  2. Enhance User Experience: Allow the Agent to directly "see" and understand data visualization results.
  3. Maintain Backward Compatibility: Let LLMs that do not support multimodality keep working correctly by switching image output off through an environment variable.
  4. Unify Output Format: Provide consistent multimodal support for all execution result tools.

Analysis of Current Problems

Affected Functions (5)

Functions that currently return execution results but do not support multimodality:

  • append_execute_code_cell(cell_source: str) -> list[str]
  • insert_execute_code_cell(cell_index: int, cell_source: str) -> list[str]
  • execute_cell_with_progress(cell_index: int, timeout_seconds: int = 300) -> list[str]
  • execute_cell_simple_timeout(cell_index: int, timeout_seconds: int = 300) -> list[str]
  • execute_cell_streaming(cell_index: int, timeout_seconds: int = 300, progress_interval: int = 5) -> list[str]

Limitations of Current Output Processing

In jupyter_mcp_server/utils.py:

elif "image/png" in data:
    return "[Image Output (PNG)]"  # Only returns placeholder text

Proposed Implementation:

elif ("image/png" in output['data']) and ALLOW_IMG:
    raw_image_data = base64.b64decode(output['data']['image/png'])
    processed_image_data = self._preprocess_image(raw_image_data)
    return Image(data=processed_image_data, format="image/png")  # Returns the actual image object
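
A minimal, self-contained sketch of how the flag-gated branch above might look as a standalone function. The `Image` dataclass is a hypothetical stand-in for whatever image type the server ends up returning, `extract_output` and the "[No displayable output]" fallback are illustrative names, and the `_preprocess_image` step is omitted because its behaviour is not specified in this issue.

```python
import base64
from dataclasses import dataclass
from typing import Union


@dataclass
class Image:
    """Hypothetical stand-in for the server's image type (assumption)."""
    data: bytes
    format: str


def extract_output(output: dict, allow_img: bool) -> Union[str, Image]:
    """Convert one Jupyter output dict into either text or an image."""
    data = output.get("data", {})
    if "image/png" in data:
        if not allow_img:
            # Compatibility mode: keep a text placeholder instead of the image.
            return "[Image Output (PNG) - Image display disabled]"
        raw = base64.b64decode(data["image/png"])
        return Image(data=raw, format="image/png")
    if "text/plain" in data:
        return str(data["text/plain"])
    # Illustrative fallback text, not taken from the existing code base.
    return "[No displayable output]"


# Example: the same output dict handled with the flag on and off.
png_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
sample = {"output_type": "display_data", "data": {"image/png": png_b64}}
enabled = extract_output(sample, allow_img=True)
disabled = extract_output(sample, allow_img=False)
```

Passing the flag as a parameter rather than reading a module-level global keeps the function easy to test in both modes.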

Environment Variable Configuration

{
  "mcpServers": {
    "jupyter": {
      ...
      "env": {
        ...
        "ALLOW_IMG_OUTPUT": "true"
      }
    }
  }
}
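
On the server side, the variable could be read once at startup. The sketch below is an assumption about how that parsing might work: the helper name `image_output_enabled` and the set of accepted truthy spellings are illustrative, and the default of "true" follows the "Enabled by Default" point made under Compatibility Considerations.

```python
import os
from typing import Mapping, Optional

# Spellings treated as "enabled"; this set is an assumption, not part of the proposal.
_TRUTHY = {"1", "true", "yes", "on"}


def image_output_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when ALLOW_IMG_OUTPUT is unset or set to a truthy value."""
    env = os.environ if environ is None else environ
    value = env.get("ALLOW_IMG_OUTPUT", "true")
    return value.strip().lower() in _TRUTHY


# Read once at import time, mirroring a module-level ALLOW_IMG flag.
ALLOW_IMG = image_output_enabled()
```

Accepting an optional mapping makes it possible to check the parsing without modifying the real process environment.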

Output Example

Current Output (Text Only)

# Execute code containing a matplotlib chart
result = await execute_cell_with_progress(2)
print(result)
# ['import matplotlib.pyplot as plt\nplt.plot([1,2,3,4])\nplt.show()', '[Image Output (PNG)]']

Improved Output (Multimodal)

# With image output enabled (ALLOW_IMG_OUTPUT=true)
result = await execute_cell_with_progress(2)
print(result)
# [
#   'import matplotlib.pyplot as plt\nplt.plot([1,2,3,4])\nplt.show()',
#   Image(data=b'...', format='image/png')  # Actual image object
# ]

# With image output disabled (ALLOW_IMG_OUTPUT=false, compatibility mode)
result = await execute_cell_with_progress(2)
print(result)
# [
#   'import matplotlib.pyplot as plt\nplt.plot([1,2,3,4])\nplt.show()',
#   '[Image Output (PNG) - Image display disabled]'
# ]

Typical Use Cases

Data Visualization Analysis

# The Agent can "see" and analyze charts
await append_execute_code_cell("""
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('sales_data.csv')
df.plot(kind='bar', x='month', y='revenue')
plt.title('Monthly Revenue')
plt.show()
""")
# Returns: ['DataFrame plotted successfully', Image(...)]
# The Agent can now understand the chart content and provide insights based on visuals

Machine Learning Model Visualization

# The Agent can analyze model training curves
await execute_cell_with_progress(5)  # Cell containing a loss function chart
# Returns: ['Training completed', Image(...)]
# The Agent can evaluate the training effect and suggest optimization strategies

Compatibility Considerations

Backward Compatibility Guarantee

  1. Enabled by Default: ALLOW_IMG_OUTPUT defaults to true, so the new feature is available out of the box.
  2. Graceful Degradation: With image output disabled, LLMs that do not support multimodality still receive a text placeholder in place of each image.
  3. Error Handling: Automatically falls back to text output when image processing fails.
  4. Type Safety: Annotate results as Union[str, Image] (e.g., list[Union[str, Image]] as the return type) so that type checking passes.
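
The error-handling and type-safety points can be sketched together as follows. This is an illustration under stated assumptions rather than the server's actual code: `Image` is again a hypothetical stand-in type, `safe_image` and `collect_outputs` are made-up helper names, and the wording of the fallback placeholder is invented for the example.

```python
import base64
import binascii
from dataclasses import dataclass
from typing import Union


@dataclass
class Image:
    """Hypothetical stand-in for the server's image type (assumption)."""
    data: bytes
    format: str


def safe_image(b64_png: str) -> Union[str, Image]:
    """Decode a base64 PNG payload, degrading to text if decoding fails."""
    try:
        # Jupyter payloads may contain newlines, so drop whitespace before
        # running a strict (validate=True) decode.
        cleaned = "".join(b64_png.split())
        raw = base64.b64decode(cleaned, validate=True)
    except (binascii.Error, ValueError):
        # Placeholder wording is illustrative only.
        return "[Image Output (PNG) - could not be decoded]"
    return Image(data=raw, format="image/png")


def collect_outputs(source: str, png_payloads: list[str]) -> list[Union[str, Image]]:
    """Build a mixed result list: cell source first, then one item per image."""
    results: list[Union[str, Image]] = [source]
    results.extend(safe_image(p) for p in png_payloads)
    return results


good = base64.b64encode(b"abc").decode("ascii") + "\n"
mixed = collect_outputs("plt.show()", [good, "not base64!!"])
```

Because every element is either a `str` or an `Image`, callers can branch with `isinstance` and a type checker can verify both paths.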

Additional Notes

This improvement proposal is inspired by Anthropic's article on writing effective tools for AI Agents. The article emphasizes:

Tools should return meaningful contextual information to the Agent

Tool implementations should prioritize contextual relevance over flexibility, avoiding the return of low-level technical identifiers.

Multimodal output support aligns with this principle:

  • Provides Rich Context: Images contain more information than text descriptions.
  • Supports Advanced Reasoning: Leverages the visual understanding capabilities of the latest multimodal models.
  • Enhances Interactive Experience: The Agent can perform in-depth analysis of visual content.
  • Maintains Flexibility: Supports LLMs with different capabilities through configuration.

With multimodal large models such as Claude 4 and Gemini 2.5 Pro now widely available, Agent tools need visual output support to make full use of those models. This improvement will make the Jupyter MCP Server considerably more useful in visually intensive scenarios such as data science and machine learning.
