feat(memory2): Space raster backend + experimental memory2 agent #2188
Mgczacki wants to merge 8 commits into
Adds a cv2-based raster renderer for `dimos.memory2.vis.space.Space` so maps can be sent as PNGs to vision LLMs, alongside the existing SVG + Rerun backends. New Space elements (Polygon, Wedge, RasterOverlay) and Point shape/halo variants cover the agent's overlay needs.

`dimos.memory2.experimental.memory2_agent` is a LangChain agent that uses the new rendering surface to answer questions about a recorded memory2 SqliteStore (occupancy maps, FOV cones, room polygons, image recall).

Tests are gated behind a new `experimental` pytest marker so they don't run by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
❌ 1 Tests Failed:
…o memory2-vis-and-agent
…e test fixtures

- Add `describe_room` skill: answers "what's in room X" by reading frames inside the room (composes room_extents) instead of using semantic search, avoiding question bias.
- Add `unexplored_spaces` skill: surfaces exploration frontiers as the unpartitioned orange blobs flagged by verify_room_partition that aren't enclosed by walls.
- Wire MEMORY2_AGENT_DB_HONGKONG fixture + (x, y) parser for content-grounded eval cases bound to the larger Hong Kong office recording.
- Update README skill list (now 7 skills, including count_unique_things).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile Summary
This PR delivers two related features: a cv2 raster backend for the existing
Confidence Score: 4/5
Safe to merge with the room-partition index bug acknowledged; the experimental module is opt-in and the failure is caught and returned as an error string rather than crashing the agent. The raster backend, element additions, and agent wiring are all solid. The one real defect is in
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
Q[User question] --> Agent
subgraph Agent["LangGraph Agent (agent.py)"]
direction TB
PS[Pre-seed list_skills] --> LLM[LLM loop]
LLM -->|tool call| Tools
Tools -->|ToolMessage + HumanMessage| LLM
LLM -->|final text| Answer[Final answer]
end
subgraph Tools["LangChain Tools (tools.py)"]
direction TB
T1[list_streams / summary / recent]
T2[search_semantic / near]
T3[show_image / recall_view]
T4[show_map]
T5[walkthrough / walkthrough_timestamps]
T6[frames_facing]
T7[verify_room_partition]
T8[calc / list_skills / load_skill]
end
subgraph Render["Space Render Pipeline"]
direction LR
Space --> |to_bgr/to_png| Raster[raster.py]
Space --> |to_svg| SVG[svg.py]
Space --> |to_rerun| Rerun[rerun.py]
Raster --> BGRImage[BGR ndarray]
BGRImage --> |base64 encode| MultiModal[LangChain multimodal msg]
end
subgraph Store["SqliteStore"]
direction TB
OdomStream[odom stream]
LidarStream[lidar stream]
ImageStream[color_image stream]
EmbStream[color_image_embedded stream]
end
T4 --> MapRenderer[MapRenderer]
T6 --> MapRenderer
T7 --> MapRenderer
MapRenderer --> |lidar fusion| Occupancy[OccupancyGrid]
Occupancy --> Space
MapRenderer --> Render
T3 --> ImageStream
T2 --> EmbStream
T1 --> Store
Reviews (2): Last reviewed commit: "[autofix.ci] apply automated fixes"
proj = _project_world_xy_to_pixel(cam_pose=cam_pose, query_x=query_x, query_y=query_y)
if proj is None:
    return bgr  # query behind camera; return unaltered
px, py_floor, z_cam = proj
H, W = bgr.shape[:2]
if px < -W // 2 or px > W + W // 2:
    return bgr  # very far off image; not informative
_annotate_query_in_frame is documented as returning None when the query is behind the camera or far off-screen, but it always returns the (possibly unmodified) bgr array instead. The caller in tools.py guards on if annotated is None to fall back to the original image encoding path — that branch is unreachable, so every frame (including those where the red cross was never drawn) gets JPEG-encoded and sent to the model as if it were annotated. Changing the two early returns to return None restores the intended contract.
Suggested change:
  proj = _project_world_xy_to_pixel(cam_pose=cam_pose, query_x=query_x, query_y=query_y)
  if proj is None:
-     return bgr  # query behind camera; return unaltered
+     return None  # query behind camera; caller falls back to unannotated frame
  px, py_floor, z_cam = proj
  H, W = bgr.shape[:2]
  if px < -W // 2 or px > W + W // 2:
-     return bgr  # very far off image; not informative
+     return None  # very far off image; not informative
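The contract this comment asks for can be sketched without OpenCV. This is only an illustration, not the PR's code: `annotate_or_none` and `choose_payload` are hypothetical stand-ins for `_annotate_query_in_frame` and its caller in `tools.py`, and the frame is reduced to its pixel width.

```python
from typing import Optional


def annotate_or_none(px: Optional[int], width: int) -> Optional[str]:
    """Return a marker for an annotated frame, or None when nothing can be drawn.

    px is the projected pixel column; None stands for "query behind the camera".
    """
    if px is None:
        return None  # query behind camera
    if px < -width // 2 or px > width + width // 2:
        return None  # very far off image; not informative
    return f"annotated@{px}"


def choose_payload(px: Optional[int], width: int) -> str:
    """Caller side: the None check is what makes the fallback branch reachable."""
    annotated = annotate_or_none(px, width)
    if annotated is None:
        return "original"  # send the unannotated frame instead
    return annotated
```

If the helper returned the unmodified input instead of `None`, `choose_payload` could never take the `"original"` branch, which is the defect described above.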
img_obs = store.stream(stream).to_list()
if not img_obs:
    return f"walkthrough: stream {stream!r} is empty"

resolved = _resolve_walkthrough_range(
    "walkthrough",
    store,
    t_start,
    t_end,
    step_seconds,
    WALKTHROUGH_FRAMES_MAX,
)
if isinstance(resolved, str):
    return resolved
walkthrough_frames calls store.stream(stream).to_list() (loading every image frame into memory) before the range-validation check. If the agent supplies an oversized range, the entire image stream is loaded and then discarded when _resolve_walkthrough_range returns an error string. For a 60 s recording at 10 fps with 1280×720 RGB frames each frame is ~2.8 MB uncompressed — 600 frames = ~1.7 GB materialized unnecessarily. Moving the range check before the stream fetch avoids this.
Suggested change:
- img_obs = store.stream(stream).to_list()
- if not img_obs:
-     return f"walkthrough: stream {stream!r} is empty"
  resolved = _resolve_walkthrough_range(
      "walkthrough",
      store,
      t_start,
      t_end,
      step_seconds,
      WALKTHROUGH_FRAMES_MAX,
  )
  if isinstance(resolved, str):
      return resolved
+ img_obs = store.stream(stream).to_list()
+ if not img_obs:
+     return f"walkthrough: stream {stream!r} is empty"
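The memory estimate in this comment, and the effect of validating first, can be checked with a few lines. This is a rough sketch under stated assumptions: frames are uncompressed 8-bit RGB, and `walkthrough_sketch` with its `load_frames` callback is a hypothetical simplification, not the PR's `walkthrough_frames`.

```python
FRAME_W, FRAME_H, CHANNELS = 1280, 720, 3  # uncompressed 8-bit RGB (assumption)
FPS, DURATION_S = 10, 60

bytes_per_frame = FRAME_W * FRAME_H * CHANNELS  # 2,764,800 bytes
n_frames = FPS * DURATION_S                     # 600 frames
total_bytes = bytes_per_frame * n_frames        # 1,658,880,000 bytes

print(f"{bytes_per_frame / 1e6:.1f} MB per frame")           # 2.8 MB per frame
print(f"{total_bytes / 1e9:.1f} GB for {n_frames} frames")   # 1.7 GB for 600 frames


def walkthrough_sketch(t_start, t_end, step_seconds, max_frames, load_frames):
    """Validate the requested range before calling the expensive loader."""
    n_requested = int((t_end - t_start) / step_seconds) + 1
    if n_requested > max_frames:
        # Cheap rejection: load_frames is never called on this path.
        return f"walkthrough: range needs {n_requested} frames, max is {max_frames}"
    return load_frames()
```

With the check placed first, an oversized request costs only the arithmetic, and the stream is materialized only for ranges that will actually be rendered.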
def _validate_stream(name: str) -> str | None:
    """Return an error string if the stream name is invalid, else None."""
    if name not in _KNOWN_STREAMS:
        return f"unknown stream {name!r}; available: {sorted(_KNOWN_STREAMS)}"
    return None
Hardcoded stream names may mislead the agent
_KNOWN_STREAMS is a static set of four names. list_streams() queries the actual SQLite store and shows the agent whatever streams actually exist. If a recording uses any stream name not in this set, the agent will be informed (via list_streams) that it exists but will receive "unknown stream 'X'" from every other tool that calls _validate_stream — including summary, recent, search_semantic, near, and show_image. The disconnect makes the agent appear broken rather than incapable. Consider deriving the allowed set dynamically from store.list_streams() at build time.
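One way to derive the allowed set at build time is a small closure over a snapshot of the store's stream names. This is a sketch only: `make_stream_validator` is a hypothetical helper, and it assumes `store.list_streams()` returns an iterable of names, which the source does not confirm.

```python
from typing import Callable, Iterable, Optional


def make_stream_validator(stream_names: Iterable[str]) -> Callable[[str], Optional[str]]:
    """Build a validator from whatever streams the store reports right now."""
    known = frozenset(stream_names)  # snapshot once, when the tools are built

    def validate(name: str) -> Optional[str]:
        """Return an error string if the stream name is invalid, else None."""
        if name not in known:
            return f"unknown stream {name!r}; available: {sorted(known)}"
        return None

    return validate


# Stand-in for store.list_streams(); the real return type is an assumption.
validate = make_stream_validator(["odom", "lidar", "color_image"])
```

Because the same snapshot feeds both the listing and the validation, a stream the agent was told about can no longer be rejected as unknown.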
try:
    all_obs = store.stream(stream).to_list()
    if not all_obs:
        return f"stream {stream!r} is empty"
    obs = min(all_obs, key=lambda o: abs(o.ts - float(ts)))
except Exception as e:
Full image stream materialized on every show_image call
store.stream(stream).to_list() deserializes every observation in the image stream into memory before the single nearest-timestamp entry is picked with min(...). At 10 fps over 60 s, this is ~600 full-resolution images (~2.8 MB each uncompressed) loaded unnecessarily for each tool call. The store already supports ordered queries (see recent which uses .order_by("ts", desc=True).limit(n)). A narrower query or at minimum deferring data decoding would avoid the O(N) memory spike. The same pattern appears in frames_that_could_see_point (loads all color_image frames before filtering by FOV).
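A narrower nearest-timestamp lookup can be sketched with plain `sqlite3`. The `obs` table and its columns are hypothetical and say nothing about the real SqliteStore schema; the point is only that two `LIMIT 1` queries around the target replace a full scan.

```python
import sqlite3
from typing import Optional


def nearest_ts(conn: sqlite3.Connection, ts: float) -> Optional[float]:
    """Return the stored timestamp closest to ts, touching at most two rows."""
    before = conn.execute(
        "SELECT ts FROM obs WHERE ts <= ? ORDER BY ts DESC LIMIT 1", (ts,)
    ).fetchone()
    after = conn.execute(
        "SELECT ts FROM obs WHERE ts >= ? ORDER BY ts ASC LIMIT 1", (ts,)
    ).fetchone()
    candidates = [row[0] for row in (before, after) if row is not None]
    if not candidates:
        return None  # empty stream
    return min(candidates, key=lambda t: abs(t - ts))


# Hypothetical schema: one row per observation, keyed (and indexed) by timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (ts REAL PRIMARY KEY, blob_id INTEGER)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)", [(0.0, 0), (1.0, 1), (2.0, 2), (3.0, 3)]
)
```

Only the winning row's image would then need to be fetched and decoded, so memory use stays flat regardless of recording length.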
- _annotate_query_in_frame: return None (not bgr) on behind-camera / off-screen so the caller's fallback branch is reachable, matching the docstring.
- walkthrough_frames: validate range before materializing the image stream so invalid ranges don't trigger a full stream load.
- build_tools: snapshot store.list_streams() into known_streams instead of a static set, so the agent sees consistent answers between list_streams() and stream-named tool calls.
- show_image: replace full-stream materialization with three indexed pushdown queries (before/at-exact/after). Image streams join blobs eagerly in SqliteObservationStore, so the previous to_list() decoded every JPEG just to pick one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
            continue

        # Numeric index as the on-image label so the legend (text) and the
        # marker (image) line up one-to-one.
        space.add(
            SpacePoint(
                GeoPoint(wx, wy, 0.0),
                color=_POINT_COLOR_TO_MEMORY2[color_name],
                radius=_POINT_OVERLAY_RADIUS_M,
                shape="dot",
                halo=True,
                label=str(idx),
            )
        )

        line = f" {idx:>2} ({color_name:<7s}) at ({wx:+.2f}, {wy:+.2f})"
        if label:
            line += f" — {label}"
        lines.append(line)

    legend = f"Points ({len(pts)}):\n" + "\n".join(lines)
    if dropped:
        legend += f"\n [dropped {dropped} points beyond soft cap of {POINT_SOFT_CAP}]"
    if invalid_colors:
        legend += (
            "\n [invalid color(s) "
            + ", ".join(sorted(set(invalid_colors)))
            + f"; defaulted to '{POINT_DEFAULT_COLOR}'. Allowed: "
            + ", ".join(POINT_COLORS_ALLOWED)
            + "]"
        )
    return legend


def encode_space_as_multimodal(
    space: Space, caption: str, *, width_px: int
) -> list[dict[str, Any]]:
    """Render *space* to PNG and return LangChain multimodal content blocks."""
    png = space.to_png(width_px=width_px, padding_m=0.0)
    b64 = base64.b64encode(png).decode("ascii")
    return [
        {"type": "text", "text": caption},
polys_world / rooms index mismatch in verify_room_partition
When any room dict is missing a valid "polygon" or "rect" key (e.g. a polygon with fewer than 3 vertices fails the `len(r["polygon"]) >= 3` guard), that room is silently skipped and `polys_world` ends up shorter than `rooms`. All subsequent index-based accesses then use the wrong room entry or blow up entirely:
- `overlap_with_per_room` is created with `range(len(rooms))` entries, but the inner loop `room_masks_bool[i]` is indexed up to `len(rooms)-1` while `room_masks_bool` only has `len(polys_world)` items, so an `IndexError` is raised when `i >= len(polys_world)`.
- The per-room `Space.add(SpacePolygon(..., label=f"#{ident} {desc}"))` and `stats.append(PartitionStats(id=room.get("id",…)))` loops both do `room = rooms[i]` where `i` is the `polys_world` index, so the label/metadata drifts for every room that was skipped.

The `except Exception` in the tool wrapper converts the crash to an error string so the agent won't hang, but the partition analysis is entirely unusable. Fix by collecting (room, poly) pairs together and iterating the shorter zipped list throughout.
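The pairing fix might look roughly like the following. It is a sketch under assumptions: `valid_room_polys` is a hypothetical helper, and the `"rect"` layout `(x0, y0, x1, y1)` is a guess made for illustration rather than the PR's actual format.

```python
from typing import Any

Poly = list[tuple[float, float]]


def valid_room_polys(rooms: list[dict[str, Any]]) -> list[tuple[dict[str, Any], Poly]]:
    """Keep each room together with its polygon so indices can never drift."""
    pairs: list[tuple[dict[str, Any], Poly]] = []
    for room in rooms:
        poly = room.get("polygon")
        if poly is not None and len(poly) >= 3:
            pairs.append((room, [tuple(p) for p in poly]))
        elif "rect" in room:
            # Assumed rect layout: (x0, y0, x1, y1) in world coordinates.
            x0, y0, x1, y1 = room["rect"]
            pairs.append((room, [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]))
        # Rooms with neither a usable polygon nor a rect are skipped as a pair.
    return pairs
```

Every later loop would then iterate `for room, poly in pairs` (or `enumerate(pairs)`), so masks, labels, and stats all refer to the same surviving rooms.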
An approach for #1913
What does it do?
Two things in one PR:
- memory2 Space gets a cv2 raster backend. `Space.to_bgr()` / `Space.to_png()` produce the same world view a vision LLM gets, alongside the existing SVG and Rerun renderers. Plus the elements you need to draw on top of an occupancy map: `Polygon`, `Wedge`, `RasterOverlay`, and `Point.shape` / `halo` variants.
- An experimental LangChain agent at `dimos/memory2/experimental/memory2_agent/`. Given a memory2 database, it introspects all streams to perform modality fusion (when the tools are compatible with the modality) in order to generate rich/useful temporal-spatial representations, domain-structure validation, and measurement capabilities on spatial data.

How do we achieve this?
(rendering)
- `space.elements`, accumulates a world-frame `Bounds`, paints onto a BGR ndarray. The `Bounds` class is lifted into `dimos/memory2/vis/space/bounds.py` so both backends share it.
- `Polygon`, `Wedge`, `RasterOverlay` cover what the agent draws on top of occupancy maps: room boundaries, camera FOV cones, arbitrary world-frame masks (overclaim highlights, heatmaps, etc.).
- `Point` grew `shape` ("dot"/"cross"/"x"/"square") and `halo` (black underlay), so markers stay readable on busy maps.
- `resolve_deferred` now walks `color`, `fill`, AND `stroke`, so cmap-based colors work on `Polygon`'s two-color split.
- `PointCloud2` only gets inflated once per render: the bounds pass and the draw pass share a per-call cache keyed by `id(el)`.

(agent)
- `room_extents`), describing a specific room (`describe_room`), finding exploration frontiers (`unexplored_spaces`), counting unique instances of a kind of thing across many frames (`count_unique_things`), distances, object positions, and reasoning from another entity's viewpoint.

Full tool/skill inventory in `dimos/memory2/experimental/memory2_agent/README.md` (15 tools, 7 skills).

Examples (what the agent sees)
The screenshots below are real tool returns from a memory2 agent run.
- `show_map`: top-down lidar map with the robot's pose pinned. The agent uses this to orient itself in world coordinates.
- `frames_facing`: top-down view with viewing cones. Given a world (x, y), the tool overlays the cones of the camera frames whose field of view could contain that point. Used for finding which recorded frames "saw" a target location.
- `verify_room_partition`: map with the agent's room polygons + per-room areas. The agent submits candidate polygons; the tool overlays them and flags issues (overlap, unpartitioned floor blobs, odometry outside any room).
- `walkthrough`: annotated frame strip across a time range. Each tile is captioned with `t`, `robot(x, y)`, and `yaw`. Used for summarising what was visible across a stretch of the walk in one call.
- `frames_facing` (per-frame): recorded camera frame with the query point reprojected as a red X. Used to verify whether a candidate (x, y) actually lands on the target object: if the red cross sits on the body of the thing across multiple views, the position is right.

How do we test the agent?
End-to-end tests live in `dimos/memory2/experimental/test_memory2_agent_ask.py`. They run the real LangChain agent against a recorded SqliteStore + a live OpenAI model, so they're gated behind the new `experimental` marker and excluded from the default `pytest` run.

Each prompt ends with an explicit format directive (`"Reply with only the number, nothing else"`) so the agent commits a clean, parseable final answer instead of dumping its full reasoning chain.

Two recordings, two env vars. The repo doesn't ship the .db files.
Tool-coverage tests (`go2_short.db`)
These assert the agent picked the right kind of tool; they don't grade the answer's content.
- `test_lists_streams`: "How many streams does this memory store have?" → expects `list_streams` to be called and `4` in the answer (`build_memory.py` writes 4 streams).
- `test_visual_question_uses_image_tool`: "At t=22s show me what the robot saw directly forward and describe it in one sentence." → expects at least one of `{show_image, recall_view, walkthrough, show_map, frames_facing}` to be called and a non-empty answer afterwards. Confirms the langgraph Command path is end-to-end functional.

Content-grounded QA on `go2_short.db`: `test_short_recording_qa` (10 cases)
`go2_short.db` is a short go2 walk through an office with two rooms, two white robots, and a long meeting table. Path supplied via `MEMORY2_AGENT_DB`.
- `rooms_count_2`: `2`
- `biggest_room_area_~80m2`
- `start_equals_end_room`
- `closest_to_meeting_table_2m`
- `white_robots_count_2`: `2`
- `white_robots_distance_apart`
- `man_in_black_moved_hand`: `hand` or `finger`
- `multi_choice_letter_B`: `B`
- `exploration_waypoint_roi`
- `passed_through_doorway_top_left`

Content-grounded QA on `go2_hongkong_office.db`: `test_hongkong_recording_qa` (3 cases, new)
A longer recording of the Hong Kong office (elevator room, multiple rooms, richer layout). Path supplied via `MEMORY2_AGENT_DB_HONGKONG`.
- `white_robots_count_2_hk`: `2` (Need to verify)
- `elevator_room_center`
- `total_floor_area`

How to run
(default suite: `experimental` excluded; should stay green for everyone)
```
pytest
```
(unit tests for the new memory2 plotting surface — no LLM, no recording needed)
```
pytest dimos/memory2/vis/space/test_space.py
```
(end-to-end agent tests — opt-in, needs OpenAI + the recording(s); skips cleanly if a recording env var is unset)
```
export OPENAI_API_KEY=...
export MEMORY2_AGENT_DB=/path/to/go2_short.db # required for test_short_recording_qa
export MEMORY2_AGENT_DB_HONGKONG=/path/to/go2_hongkong.db # required for test_hongkong_recording_qa
export MEMORY2_AGENT_MODEL=gpt-4.1-mini # optional, default gpt-5.5
pytest -m experimental dimos/memory2/experimental/ -v
```
(one-shot CLI for ad-hoc questions)
```
python -m dimos.memory2.experimental.memory2_agent.ask \
--db /path/to/recording.db \
--model gpt-4.1-mini \
"Where is the biggest room?"
```
(broader smoke run — 7 mixed questions, no assertions, just prints traces)
```
python -m dimos.memory2.experimental.memory2_agent.run_smoke --db /path/to/recording.db
```
Out of scope