Summary
Currently, the execution result tools in `jupyter_mcp_server/server.py` (e.g., `append_execute_code_cell`, `execute_cell_with_progress`, etc.) return text output only, showing a `[Image Output (PNG)]` placeholder for image outputs. This proposal adds full multimodal output support so that the visual understanding capabilities of state-of-the-art multimodal large models can be fully leveraged.
Core Objectives
- Unleash the Potential of Multimodal Large Models: Support direct output and analysis of visual content such as images and charts.
- Enhance User Experience: Allow the Agent to directly "see" and understand data visualization results.
- Maintain Backward Compatibility: Ensure that LLMs that do not support multimodality can still work correctly through environment variable control.
- Unify Output Format: Provide consistent multimodal support for all execution result tools.
Analysis of Current Problems
Affected Functions (5)
Functions that currently return execution results but do not support multimodality:
```python
append_execute_code_cell(cell_source: str) -> list[str]
insert_execute_code_cell(cell_index: int, cell_source: str) -> list[str]
execute_cell_with_progress(cell_index: int, timeout_seconds: int = 300) -> list[str]
execute_cell_simple_timeout(cell_index: int, timeout_seconds: int = 300) -> list[str]
execute_cell_streaming(cell_index: int, timeout_seconds: int = 300, progress_interval: int = 5) -> list[str]
```
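Under this proposal, each of those signatures would widen from `list[str]` to `list[Union[str, Image]]`. A sketch of the new shape (the tool body here is purely illustrative, and `Image` is a stand-in for the actual MCP image content type):

```python
from typing import Union


class Image:
    """Stand-in for the MCP image content type (assumption)."""

    def __init__(self, data: bytes, format: str):
        self.data = data
        self.format = format


async def append_execute_code_cell(cell_source: str) -> list[Union[str, Image]]:
    """Illustrative body only: the real tool executes the cell in the kernel
    and collects its outputs; here we just show the widened return shape."""
    return [cell_source, Image(data=b"\x89PNG", format="image/png")]
```

The key point is that callers must now be prepared to receive a heterogeneous list mixing strings and image objects.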
Limitations of Current Output Processing
In `jupyter_mcp_server/utils.py`:

```python
elif "image/png" in data:
    return "[Image Output (PNG)]"  # Only returns placeholder text
```
Proposed implementation:

```python
elif ("image/png" in output['data']) and ALLOW_IMG:
    raw_image_data = base64.b64decode(output['data']['image/png'])
    processed_image_data = self._preprocess_image(raw_image_data)
    return Image(data=processed_image_data, format="image/png")  # Returns the actual image object
```
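A fuller sketch of how this branch could slot into the output-processing helper. The function name `extract_output`, the module-level `ALLOW_IMG` flag, and the `Image` class are all assumptions for illustration; in practice `Image` would come from the MCP SDK and preprocessing is omitted:

```python
import base64

# Assumption: derived once at startup from the ALLOW_IMG_OUTPUT environment variable
ALLOW_IMG = True


class Image:
    """Stand-in for the MCP image content type (assumption)."""

    def __init__(self, data: bytes, format: str):
        self.data = data
        self.format = format


def extract_output(output: dict):
    """Convert one Jupyter output dict into text or an Image, honoring the flag."""
    data = output.get("data", {})
    if "image/png" in data:
        if ALLOW_IMG:
            raw = base64.b64decode(data["image/png"])
            return Image(data=raw, format="image/png")
        return "[Image Output (PNG) - Image display disabled]"
    if "text/plain" in data:
        return data["text/plain"]
    return ""
```

With the flag disabled, the same input yields the descriptive placeholder string, which is what gives text-only LLMs a graceful fallback.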
Environment Variable Configuration
```json
{
  "mcpServers": {
    "jupyter": {
      ...
      "env": {
        ...
        "ALLOW_IMG_OUTPUT": "true"
      }
    }
  }
}
```
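On the server side, the flag could be read once at startup. A minimal sketch, assuming the variable name from the config above and the `true` default described in the compatibility section (the helper name is hypothetical):

```python
import os


def img_output_enabled() -> bool:
    """Read ALLOW_IMG_OUTPUT, defaulting to enabled for out-of-the-box behavior."""
    return os.getenv("ALLOW_IMG_OUTPUT", "true").strip().lower() in ("1", "true", "yes")
```

Accepting a few truthy spellings keeps the flag forgiving of minor config variations while still defaulting to enabled when unset.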
Output Example
Current Output (Text Only)

```python
# Execute code containing a matplotlib chart
result = await execute_cell_with_progress(2)
print(result)
# ['import matplotlib.pyplot as plt\nplt.plot([1,2,3,4])\nplt.show()', '[Image Output (PNG)]']
```

Improved Output (Multimodal)

```python
# Enable image output
result = await execute_cell_with_progress(2)
print(result)
# [
#     'import matplotlib.pyplot as plt\nplt.plot([1,2,3,4])\nplt.show()',
#     Image(data=b'...', format='image/png')  # Actual image object
# ]

# Disable image output (compatibility mode)
result = await execute_cell_with_progress(2)
print(result)
# [
#     'import matplotlib.pyplot as plt\nplt.plot([1,2,3,4])\nplt.show()',
#     '[Image Output (PNG) - Image display disabled]'
# ]
```
Typical Use Cases
Data Visualization Analysis
```python
# The Agent can "see" and analyze charts
await append_execute_code_cell("""
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('sales_data.csv')
df.plot(kind='bar', x='month', y='revenue')
plt.title('Monthly Revenue')
plt.show()
""")
# Returns: ['DataFrame plotted successfully', Image(...)]
# The Agent can now understand the chart content and provide insights based on visuals
```
Machine Learning Model Visualization
```python
# The Agent can analyze model training curves
await execute_cell_with_progress(5)  # Cell containing a loss function chart
# Returns: ['Training completed', Image(...)]
# The Agent can evaluate training progress and suggest optimization strategies
```
Compatibility Considerations
Backward Compatibility Guarantee
- Enabled by Default: `ALLOW_IMG_OUTPUT=true` ensures that the new feature is available out of the box.
- Graceful Degradation: LLMs that do not support multimodality will still receive text descriptions.
- Error Handling: Automatically degrades to text output when image processing fails.
- Type Safety: Uses `Union[str, Image]` so that type checking passes.
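The error-handling bullet could be realized with a small wrapper that falls back to placeholder text whenever decoding fails. A sketch using the same assumed `Image` stand-in (the function name and fallback message are illustrative):

```python
import base64
from typing import Union


class Image:
    """Stand-in for the MCP image content type (assumption)."""

    def __init__(self, data: bytes, format: str):
        self.data = data
        self.format = format


def safe_image_output(b64_png: str) -> Union[str, Image]:
    """Decode a base64 PNG payload, degrading to a text placeholder on any failure."""
    try:
        raw = base64.b64decode(b64_png, validate=True)
        return Image(data=raw, format="image/png")
    except Exception:
        return "[Image Output (PNG) - decoding failed, image omitted]"
```

Because the fallback is an ordinary string, a failed image never breaks the `list[Union[str, Image]]` contract; the Agent simply sees a description instead of the picture.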
Additional Notes
This improvement proposal is inspired by Anthropic's article on writing effective tools for AI Agents. The article emphasizes:
> Tools should return meaningful contextual information to the Agent. Tool implementations should prioritize contextual relevance over flexibility, avoiding the return of low-level technical identifiers.
Multimodal output support aligns with this principle:
- ✅ Provides Rich Context: Images contain more information than text descriptions.
- ✅ Supports Advanced Reasoning: Leverages the visual understanding capabilities of the latest multimodal models.
- ✅ Enhances Interactive Experience: The Agent can perform in-depth analysis of visual content.
- ✅ Maintains Flexibility: Supports LLMs with different capabilities through configuration.
With the widespread adoption of multimodal large models such as Claude 4 and Gemini 2.5 Pro, adding visual output support to Agent tools has become essential for fully realizing their potential. This improvement will significantly enhance the utility of the Jupyter MCP Server in visually intensive scenarios such as data science and machine learning.