Add capture event transport and server-side write classification#112

Open
jspahn80134 wants to merge 49 commits into
bjones1:mainfrom
jspahn80134:main

Conversation

@jspahn80134
Contributor

@jspahn80134 jspahn80134 commented Apr 22, 2026

Summary:

  • Added privacy-conscious capture configuration with ignored local config, a safe capture_config.example.json, redacted config summaries, PostgreSQL capture writes, and JSONL fallback capture when database capture is missing or unavailable.
  • Added the canonical capture schema/docs in server/scripts/capture_events_schema.sql and linked it from toc.md.
  • Defines typed capture columns for study analysis: event_id, sequence_number, schema_version, user_id, session_id, event_source, language_id, file_hash, event_type, timestamp, client_tz_offset_min, and event-specific data.
  • Reworked capture setup for students around consent plus a status-bar recording toggle, with an auto-generated pseudonymous participant UUID instead of manual course/group/assignment/task fields.
  • Made recording session/window-local so capture starts paused on VS Code startup and separate VS Code windows can enable their own capture sessions independently.
  • Sends VS Code capture events over the existing IDE/server message path; the extension sends the local path only to the local Rust server, which hashes it before DB/fallback storage so raw paths are not persisted.
  • Added capture status UI/output handling and coordinated status/session shutdown across toggle, stop, and deactivate paths.
  • Added Rust-generated TypeScript capture/status types shared by the client and extension, with enum-backed capture event types and capture states.
  • Captures session/settings changes, saves, compile/run start/end, study lifecycle/handoff events, reflection prompt insertion, activity switches, doc-session duration, and server-generated write classification events.
  • Added server-side CodeChat translation classification for documentation vs. code writes and Markdown source diffs; server-generated events now use event_source=server_translation and get stream-local sequence numbers.
  • Removed paste/external-insert heuristic capture events from the recorded data so those classifications can be tuned later in the analysis package from captured write diffs.
  • Added the reflection prompt command/UI support while keeping study automation commands out of normal Command Palette contributions.
  • Addressed VS Code extension review findings: serialized async activity capture sends, added closed_by to activity-ended doc sessions, avoided classification scans while capture is off, tightened Markdown/RST code-block classification, fixed RST reflection prompt output, normalized parsed capture-status counters, and made unsupported DOM cursor updates explicit.
  • Added and expanded docs/comments for capture globals, data structures, event data keys, schema design, path privacy, fallback capture, and capture classification behavior.
  • Removed the temporary dissertation metrics/export utility from this PR scope and kept analysis/export work separate from the server write contract.
  • Stabilized browser/WebDriver tests, including serialized browser tests, improved timeout diagnostics, optional-message helpers, and a query(...).wait(...) Mocha result wait.
  • Updated dependency/config details needed for audit and CI checks, including the rustls advisory update and logging/config cleanup.
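
As a rough illustration of the typed capture columns listed above, the sketch below writes them as a TypeScript interface and adds a small helper that reports which columns a parsed JSON value lacks. The PR's real types are generated from Rust; `CaptureEventSketch`, `CAPTURE_KEYS`, and `missingCaptureKeys` are hypothetical names, and the nullability of each field is an assumption.

```typescript
// Hypothetical TypeScript view of the typed capture columns; the PR's real
// types are generated from Rust, so every name below is illustrative only.
interface CaptureEventSketch {
    event_id: string;
    sequence_number: number | null; // assumption: nullable
    schema_version: number;
    user_id: string;
    session_id: string;
    event_source: string;
    language_id: string | null; // assumption: nullable
    file_hash: string;
    event_type: string;
    timestamp: string; // assumption: ISO 8601 text
    client_tz_offset_min: number;
    data: Record<string, unknown>; // event-specific payload
}

const CAPTURE_KEYS: ReadonlyArray<keyof CaptureEventSketch> = [
    "event_id",
    "sequence_number",
    "schema_version",
    "user_id",
    "session_id",
    "event_source",
    "language_id",
    "file_hash",
    "event_type",
    "timestamp",
    "client_tz_offset_min",
    "data",
];

// Return the names of the typed columns that are absent from a parsed value.
function missingCaptureKeys(value: Record<string, unknown>): string[] {
    return CAPTURE_KEYS.filter((key) => !(key in value));
}
```

A check like this could run in a unit test against serialized events to catch accidental column renames.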

Validation:

  • cargo test export_bindings
  • cargo clippy --manifest-path server/Cargo.toml --all-targets --all-features -- -Dwarnings
  • cargo test --manifest-path server/Cargo.toml --lib -- --test-threads=1
  • Focused capture tests for stable event serialization, server-generated sequence numbers, and JSONL fallback capture
  • Client: pnpm exec tsc -noEmit
  • VS Code extension: pnpm exec tsc -noEmit
  • VS Code extension: pnpm exec eslint src
  • Manual Extension Development Host smoke test confirmed capture events reach the DB/fallback path.

Update the ignored PostgreSQL integration test to assert the rich events schema columns and fix timestamp/JSONB parameter casts used by the capture insert path. Verified against the AWS PostgreSQL database with event_capture_inserts_rich_schema_event_into_db.
macOS CI occasionally delivered the loadfile acknowledgement just after the old two-second harness timeout. Increase the shared browser message wait to five seconds so test_client does not fail on that timing edge.
The first CI rerun passed the original test_client wait but exposed the same timing issue in test_client_updates while waiting for the autosave content update. Use the client response window as the shared browser test wait budget.
The overall browser tests share one WebDriver endpoint and were running concurrently inside the same test binary. This was causing test_client_updates to miss its autosave content update on CI, especially macOS/Safari. Guard the harness with a shared async mutex so each browser session runs in isolation.
Owner

@bjones1 bjones1 left a comment

Here are some initial comments on the PR, mainly questions -- I'd like to hear your thoughts. I'll continue to review.

Comment thread extensions/VSCode/src/extension.ts
Comment thread extensions/VSCode/src/extension.ts Outdated
Comment thread extensions/VSCode/src/extension.ts Outdated
Comment thread extensions/VSCode/src/extension.ts
Comment thread extensions/VSCode/src/extension.ts
Comment thread extensions/VSCode/src/extension.ts
Comment thread extensions/VSCode/src/extension.ts
Comment thread extensions/VSCode/src/extension.ts
Use generated Rust-backed capture wire/status types in the VS Code extension.

Restore the explanatory extension comments and the current-file update after LoadFile.

Keep study lifecycle commands available for automation while removing them from the Command Palette.
Resolve conflicts in the VS Code extension, translation capture path, and overall test harness.

Keep upstream CursorPosition/WebDriver updates while preserving capture instrumentation and serialized browser test timing.
Owner

@bjones1 bjones1 left a comment

More review.

Comment thread extensions/VSCode/src/extension.ts
Comment thread extensions/VSCode/src/extension.ts Outdated
Comment thread extensions/VSCode/src/extension.ts Outdated
Comment thread server/src/webserver.rs Outdated
Comment thread server/src/webserver.rs Outdated
Comment thread server/tests/overall_common/mod.rs
Comment thread server/tests/overall_common/mod.rs
Comment thread server/src/webserver.rs Outdated
Comment thread server/src/webserver.rs
Comment thread server/src/webserver.rs Outdated
Comment thread server/src/webserver.rs Outdated
Comment thread server/tests/overall_1.rs Outdated
Comment thread server/tests/overall_1.rs
Comment thread server/tests/overall_1.rs
Comment thread .gitignore
Comment thread .gitignore Outdated
Comment thread capture_config.example.json
Comment thread extensions/VSCode/package.json Outdated
Comment thread extensions/VSCode/package.json Outdated
Comment thread client/src/CodeMirror-integration.mts Outdated
Comment thread client/src/CodeMirror-integration.mts Outdated
Owner

@bjones1 bjones1 left a comment

Good progress!

If there's some discussion or a question you answer, don't resolve it -- this helps me find and read your responses. When everything's already resolved, it's hard for me to find and think about discussions.

Comment thread server/scripts/capture_events_schema.sql
Comment thread server/src/webserver.rs Outdated
Comment thread server/src/capture.rs Outdated
Comment thread server/src/capture.rs Outdated
Comment thread server/src/capture.rs Outdated
Comment thread server/src/capture.rs Outdated
Comment thread server/src/capture.rs
Comment thread server/src/capture.rs Outdated
Comment thread server/src/capture.rs Outdated
Comment thread server/src/capture.rs
Add a code_external_insert_candidate capture event for code edits that look non-incremental but were not observed as paste operations.

The classifier records only coarse metadata: basis, confidence, size band, block kind, source, and classification basis. Paste markers continue to take precedence so a single edit is not double-counted as both paste and heuristic external insertion.

Include targeted code comments and unit coverage for multi-line, small single-line, and large-block classifier behavior.
Comment thread server/src/webserver.rs Outdated
Comment thread server/src/webserver.rs Outdated
Comment thread server/src/translation.rs Outdated
Comment thread server/src/capture.rs
Comment thread server/scripts/capture_events_schema.sql
Comment thread server/tests/overall_1.rs Outdated
Comment thread server/src/capture.rs
Comment thread server/src/translation.rs Outdated
Comment thread server/src/translation.rs
Comment thread server/src/translation.rs Outdated
Comment thread server/src/translation.rs Outdated
Comment thread server/src/translation.rs Outdated
Comment thread server/src/translation.rs
Comment thread server/src/translation.rs Outdated
Comment thread server/src/translation.rs Outdated
Comment thread server/src/capture.rs
Comment thread server/src/capture.rs Outdated
Comment thread server/src/capture.rs
Comment thread server/src/capture.rs
Comment thread extensions/VSCode/src/extension.ts
Comment thread server/src/capture.rs
@bjones1
Owner

bjones1 commented Jun 4, 2026

I noticed that sequence_number was None in some cases -- is this OK? JSONL:

{
  "event": {
    "client_tz_offset_min": -300,
    "data": {
      "classification_basis": "codemirror_doc_blocks",
      "doc_block_count_after": 3,
      "doc_block_count_before": 3,
      "doc_block_diff": [
        {
          "Update": {
            "contents": [
              {
                "from": 0,
                "insert": "<p>Copyright (C) 2025 Bryan A. Jones.<p>This file is part of the CodeChat Editor.<p>The CodeChat Editor is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.<p>The CodeChat Editor is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.<p>You should have received a copy of the GNU General Public License along with the CodeChat Editor. If not, see <a href=http://www.gnu.org/licenses/>http://www.gnu.org/licenses/</a>.<h1><code>.gitignore</code> -- files for Git to ignore</h1><p>dist build output",
                "to": 835
              }
            ],
            "from": 0
          }
        }
      ],
      "mode": "python",
      "source": "ide"
    },
    "event_id": "server-15356-1780568509808280-2",
    "event_source": "vscode_extension",
    "event_type": "write_doc",
    "file_hash": "c22e65e3f32618653447a821e7e2c35ec4cea7a142f2edc1ec7915b9ca7b3821",
    "language_id": null,
    "schema_version": 2,
    "sequence_number": null,
    "session_id": "8eafb29c-9634-459b-9e4e-a6b55ef5808c",
    "timestamp": "2026-06-04T10:21:49.808236300+00:00",
    "user_id": "1234"
  },
  "fallback_timestamp": "2026-06-04T10:21:49.809424800+00:00"
}
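
One way to spot this in the fallback file is to scan it for records whose sequence_number is null. The sketch below assumes one JSON record per line with the `event` / `fallback_timestamp` wrapper shown above; `FallbackRecordSketch` and `eventsMissingSequence` are hypothetical names, not part of the PR.

```typescript
// Hypothetical subset of a fallback JSONL record (only the fields used here).
interface FallbackRecordSketch {
    event: { event_id: string; sequence_number: number | null };
    fallback_timestamp: string;
}

// Return the event_id of every record whose sequence_number is null or absent.
// Assumes one JSON object per non-blank line.
function eventsMissingSequence(jsonl: string): string[] {
    return jsonl
        .split("\n")
        .filter((line) => line.trim() !== "")
        .map((line) => JSON.parse(line) as FallbackRecordSketch)
        .filter(
            (record) =>
                record.event.sequence_number === null ||
                record.event.sequence_number === undefined,
        )
        .map((record) => record.event.event_id);
}
```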

@bjones1
Owner

bjones1 commented Jun 4, 2026

I noticed that starting capture, quitting VSCode, then restarting it leaves capture on. Turn capture off when the extension first starts up -- I expected to have to re-enable it after each startup.

@bjones1
Owner

bjones1 commented Jun 4, 2026

When I open a second VSCode window, capture doesn't seem to work.

@bjones1
Owner

bjones1 commented Jun 4, 2026

Edits to raw files (ones that the CodeChat Editor doesn't support) are classified as copy/paste even when the edit is minor.

@bjones1
Owner

bjones1 commented Jun 4, 2026

Any edit to a Markdown file puts the entire contents in the log, not just a diff. We probably need to think about the diff format sent.

Remove paste/external-insert heuristic capture rows, make recording session-local, add JSONL fallback capture without DB config, assign server-generated capture sequence numbers, and diff Markdown source writes.
@jspahn80134
Contributor Author

Addressed the June 4 discussion notes in 3fd74f2:

  • Server-generated capture rows now use event_source=server_translation and receive stream-local sequence numbers.
  • Recording is now session/window-local and starts paused on extension activation; consent and participant ID still persist.
  • Separate VS Code windows can enable their own capture sessions independently.
  • Removed paste/external-insert heuristic capture rows so raw edits log diffs for analysis-time classification.
  • Markdown write logging now diffs source Markdown rather than rendered HTML.
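
For illustration only, stream-local numbering can be pictured as one counter per stream. The server does this in Rust; the TypeScript sketch below only shows the idea, and it assumes that a stream is identified by a string key (for example a session ID) and that numbering starts at 0. `SequenceAllocator` is a hypothetical name.

```typescript
// Hypothetical per-stream counter; the key format and the starting value of 0
// are assumptions, not taken from the server code.
class SequenceAllocator {
    private readonly next = new Map<string, number>();

    // Return the next sequence number for the given stream, then advance it.
    allocate(streamKey: string): number {
        const value = this.next.get(streamKey) ?? 0;
        this.next.set(streamKey, value + 1);
        return value;
    }
}
```

Because each stream has its own counter, two windows with separate sessions never share or skip numbers.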

@bjones1
Owner

bjones1 commented Jun 5, 2026

I asked Claude to review extension.ts. I haven't looked in detail, but a lot of these seem valid. Have you done any review of your files using your favorite LLM? I'd be happy to post more Claude reviews, but think you'd get more mileage out of running these yourself and responding to them, then pinging me for a
Here are the findings ranked most-severe first:

Code Review: extensions/VSCode/src/extension.ts

Branch: john-pr vs main


Finding 1 — noteActivity fires capture events without await (line 1086)

sendCaptureEvent is called four times inside the synchronous noteActivity with no await and no .catch(). Rejections are silently dropped, and the emitted events (doc_session, session_end, switch_pane) race to the server.

Failure scenario: A code→doc→code keystroke sequence fires three unawaited sendCaptureEvent calls; any rejection is silently swallowed (docSessionStart is cleared so the session start is internally recorded but no doc_session row reaches the DB), and concurrent sends race so session_end(doc) can arrive before doc_session, corrupting duration analysis.
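
A minimal sketch of one possible fix: chain every send onto a single promise so events go out in call order and each rejection reaches a handler. `SerializedCaptureSender`, `SendFn`, and `ErrorFn` are hypothetical names; the real sendCaptureEvent signature is not shown in this review, so a simplified one is assumed.

```typescript
// Simplified, assumed signatures; the real sendCaptureEvent takes more data.
type SendFn = (eventType: string) => Promise<void>;
type ErrorFn = (eventType: string, err: unknown) => void;

class SerializedCaptureSender {
    // Tail of the chain; every new send waits for the previous one to settle.
    private tail: Promise<void> = Promise.resolve();
    private readonly send: SendFn;
    private readonly onError: ErrorFn;

    constructor(send: SendFn, onError: ErrorFn) {
        this.send = send;
        this.onError = onError;
    }

    // Queue an event. A failure is reported through onError and does not
    // block or reject the sends queued after it.
    enqueue(eventType: string): Promise<void> {
        this.tail = this.tail
            .then(() => this.send(eventType))
            .catch((err: unknown) => this.onError(eventType, err));
        return this.tail;
    }
}
```

A synchronous caller such as noteActivity can call enqueue without awaiting it: ordering is preserved by the chain, and errors are routed to onError instead of being dropped.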


Finding 2 — noteActivity's doc_session omits closed_by (line 1095)

closeDocSession (line 1062) always includes closed_by in the doc_session payload; noteActivity does not. Sessions interrupted by switching to code produce structurally different rows than sessions ended by shutdown.

Failure scenario: Analyst queries closed_by on doc_session rows: sessions ended by switching to code (via noteActivity) have NULL, while sessions ended by extension shutdown have "extension_deactivate". The two populations silently differ, skewing any analysis that joins or filters on closed_by.
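
One way to keep the two paths structurally identical is to build the doc_session payload in a single helper that requires closed_by. In the sketch below, `docSessionData`, the `duration_ms` key, and the "activity_switch" value are hypothetical; only "extension_deactivate" appears in this review.

```typescript
// "activity_switch" is an assumed label for sessions closed by noteActivity.
type ClosedBy = "activity_switch" | "extension_deactivate";

interface DocSessionData {
    duration_ms: number; // hypothetical key name
    closed_by: ClosedBy;
}

// Single construction point, so no caller can omit closed_by.
function docSessionData(startMs: number, endMs: number, closedBy: ClosedBy): DocSessionData {
    return { duration_ms: Math.max(0, endMs - startMs), closed_by: closedBy };
}
```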


Finding 3 — Silent no-op when DomLocation cursor arrives (line 1654)

The old assert("Line" in cursor_position) was replaced by a plain if-guard with no else, no log, and no comment. CursorPosition.ts defines a second variant DomLocation that the server legitimately sends; when it arrives, cursor tracking silently stops.

Failure scenario: Server sends cursor_position = { DomLocation: {...} } (a valid variant per CursorPosition.ts) for a rich-text cursor; the if-body is silently skipped, cursor stops tracking in the VS Code editor, and nothing distinguishes this from "no cursor position provided" in logs.
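
A sketch of making the unsupported variant explicit, assuming a two-variant union. The `CursorPositionSketch` type below is a stand-in for the generated CursorPosition.ts, whose DomLocation fields are not shown here, and `cursorLine` is a hypothetical helper.

```typescript
// Stand-in for the generated type; DomLocation's real fields are unknown here.
type CursorPositionSketch = { Line: number } | { DomLocation: unknown };

// Return the line for a Line cursor. For DomLocation, log the reason and
// return undefined so the skip is visible instead of silent.
function cursorLine(
    cursor: CursorPositionSketch,
    log: (message: string) => void,
): number | undefined {
    if ("Line" in cursor) {
        return cursor.Line;
    }
    log("DomLocation cursor update ignored: unsupported in the text editor");
    return undefined;
}
```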


Finding 4 — isInRstCodeBlock never breaks on non-indented lines that close the block (line 152)

The backward scan finds the nearest .. code-block:: but never stops when it crosses a non-blank, non-indented line (which RST uses to close a block). Any indented line anywhere below an earlier directive is wrongly classified as code activity.

Failure scenario: RST document has a code block on line 10 followed by a closing non-indented paragraph; an indented list item on line 50 — well after the block ends — triggers isInRstCodeBlock returning true, logging the edit as code activity instead of doc activity.
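
A hedged sketch of a backward scan that stops at a closing line. It takes the document as an array of lines rather than a VS Code TextDocument, matches only `.. code-block::`, and treats directive option lines as part of the block. These are simplifications, not the extension's actual behavior.

```typescript
const RST_CODE_DIRECTIVE = /^\s*\.\. code-block::/;

// Number of leading whitespace characters on a line.
function indentOf(text: string): number {
    return text.length - text.trimStart().length;
}

// True when `line` lies in the indented body of the nearest code-block
// directive above it. Blank and non-indented lines are never code.
function isInRstCodeBlock(lines: string[], line: number): boolean {
    const current = lines[line] ?? "";
    if (current.trim() === "" || indentOf(current) === 0) {
        return false;
    }
    let minIndent = indentOf(current); // smallest indent seen so far
    let lastIndent = minIndent; // indent of the line visited most recently
    for (let i = line - 1; i >= 0; i--) {
        const text = lines[i];
        if (text.trim() === "") {
            continue;
        }
        const indent = indentOf(text);
        if (RST_CODE_DIRECTIVE.test(text)) {
            // lastIndent is now the first body line's indent. Every scanned
            // line must be at least that deep and deeper than the directive.
            return minIndent > indent && minIndent === lastIndent;
        }
        if (indent === 0) {
            return false; // a non-indented line closes any block above it
        }
        minIndent = Math.min(minIndent, indent);
        lastIndent = indent;
    }
    return false;
}
```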


Finding 5 — isInMarkdownCodeFence misclassifies the fence-delimiter line itself (line 138)

The loop bound i <= line means when the cursor is on an opening ``` line, inFence toggles to true and the function returns true, logging edits to fence delimiters as code activity.

Failure scenario: User types a language tag on an opening ``` line (e.g., changing ``` to ```python); the edit is logged as code activity even though the fence delimiter is markup, not code content.


Finding 6 — RST reflection prompt inserts a malformed directive, not a comment (line 939)

reflectionPromptText for RST produces \n.. ${prompt}\n. In RST, .. followed by text is parsed as a directive — the first word of the prompt becomes the directive name. A valid RST comment requires a different form.

Failure scenario: User inserts the default prompt "What changed in your understanding of this code?"; the RST output becomes .. What changed…, for which Sphinx/docutils emits an "Unknown directive type 'What'" error and drops the content from rendered output.
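
One way to keep the prompt visible in rendered output is to emit it as ordinary section content rather than behind explicit markup. This is a sketch only: `rstReflectionSection` and the "Reflection" title are hypothetical, and it relies on the RST rule that a section underline must be at least as long as its title.

```typescript
// Build a small RST section: a title, an underline of equal length, a blank
// line, then the prompt as a plain paragraph.
function rstReflectionSection(prompt: string, title: string = "Reflection"): string {
    const underline = "-".repeat(title.length);
    return `\n${title}\n${underline}\n\n${prompt}\n`;
}
```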


Finding 7 — Markdown fence detector miscounts fences inside blockquotes (line 130)

text.trim() strips the > prefix before checking for ```, so blockquoted code samples toggle the outer fence state and invert code/doc classification for all subsequent lines in the document.

Failure scenario: Document contains > ```python (blockquoted code sample); the trim makes it look like a bare fence opener; inFence toggles and all subsequent lines have their code/doc classification inverted for the rest of the file.
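
Findings 5 and 7 can be illustrated together. The sketch below counts fence delimiters strictly before the target line, never treats a delimiter line as code, and does not strip a leading `>`, so blockquoted fences leave the outer state alone. It handles only backtick fences, ignores fence-length matching, and treats blockquoted code as documentation. All of these are simplifications, and it takes an array of lines rather than a VS Code TextDocument.

```typescript
// A fence delimiter: up to three leading spaces, then at least three
// backticks. A leading ">" is not stripped, so "> ```" does not match.
function isFenceDelimiter(text: string): boolean {
    return /^ {0,3}```/.test(text);
}

// True when `line` is inside a fenced block and is not itself a delimiter.
function isInMarkdownCodeFence(lines: string[], line: number): boolean {
    let inFence = false;
    // Toggle only on delimiters strictly before the target line.
    for (let i = 0; i < line; i++) {
        if (isFenceDelimiter(lines[i])) {
            inFence = !inFence;
        }
    }
    // Opening and closing delimiter lines are markup, not code content.
    return inFence && !isFenceDelimiter(lines[line] ?? "");
}
```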


Finding 8 — CaptureStatus bigint fields are actually number at runtime (line 607)

JSON.parse cannot produce bigint; CaptureStatus.ts declares queued_events, persisted_events, fallback_events, and failed_events as bigint, but the as CaptureStatus cast hides the mismatch. Currently harmless (fields are only string-interpolated), but the type is a lie.

Failure scenario: Any future code that performs bigint-specific operations (arithmetic, typeof === "bigint" checks, BigInt() comparisons) on those counter fields will receive a number at runtime and either throw a TypeError or silently misbehave.
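
A minimal sketch of normalizing the counters after JSON.parse so the runtime values match the declared bigint type. The field names come from the finding; `CaptureCountersSketch`, `toBigInt`, and `parseCaptureCounters` are hypothetical, and rejecting missing or non-integer values is an assumed policy.

```typescript
interface CaptureCountersSketch {
    queued_events: bigint;
    persisted_events: bigint;
    fallback_events: bigint;
    failed_events: bigint;
}

// Convert a parsed JSON value to bigint, rejecting anything that is not an
// integer (assumed policy: fail loudly rather than default to zero).
function toBigInt(value: unknown): bigint {
    if (typeof value === "bigint") {
        return value;
    }
    if (typeof value === "number" && Number.isSafeInteger(value)) {
        return BigInt(value);
    }
    if (typeof value === "string" && /^-?\d+$/.test(value)) {
        return BigInt(value);
    }
    throw new TypeError(`not an integer counter: ${String(value)}`);
}

// Parse a status message and convert each counter explicitly instead of
// casting the parsed object to the generated type.
function parseCaptureCounters(json: string): CaptureCountersSketch {
    const raw = JSON.parse(json) as Record<string, unknown>;
    return {
        queued_events: toBigInt(raw["queued_events"]),
        persisted_events: toBigInt(raw["persisted_events"]),
        fallback_events: toBigInt(raw["fallback_events"]),
        failed_events: toBigInt(raw["failed_events"]),
    };
}
```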


Finding 9 — O(N) Markdown fence scan runs on every keystroke even when capture is off (line 1229)

classifyAtPosition is called before any capture-enabled guard in onDidChangeTextDocument, adding a full O(N) line scan to every keystroke event regardless of recording state.

Failure scenario: User has capture off but has a large Markdown file open; every keystroke triggers a full O(N) scan via isInMarkdownCodeFence before reaching any early-return guard, adding measurable latency to typing in large documents.
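
The guard ordering can be sketched as below, assuming the handler can read the recording state cheaply. `handleTextChange` and its parameters are hypothetical stand-ins for the onDidChangeTextDocument handler, classifyAtPosition, and noteActivity.

```typescript
type Activity = "code" | "doc";

// Check the recording state first so the document scan inside `classify`
// never runs while capture is off.
function handleTextChange(
    isRecording: boolean,
    classify: () => Activity,
    note: (activity: Activity) => void,
): void {
    if (!isRecording) {
        return;
    }
    note(classify());
}
```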


Finding 10 — reportCaptureFailure duplicates the timestamp prefix already in captureLog (line 566)

new Date().toISOString() is inlined in both captureLog (line 431) and reportCaptureFailure (line 566); reportCaptureFailure should call captureLog(...) instead.

Failure scenario: A change to the timestamp format in captureLog is not reflected in reportCaptureFailure output, causing inconsistent log entries in the capture output channel.

Serialize capture activity events, add closed_by to activity-ended doc sessions, skip activity classification while capture is off, tighten Markdown/RST code classification, normalize parsed CaptureStatus counters, make DomLocation cursor handling explicit, fix RST reflection prompt output, and reuse captureLog for failure messages.
@jspahn80134
Contributor Author

Addressed the extension.ts Claude review findings in 0a9198d:

  • Serialized async capture activity events so doc/session/switch rows stay ordered and failures are handled.
  • Added closed_by for doc sessions ended by activity changes.
  • Made DomLocation cursor updates explicit instead of silently ignoring them.
  • Tightened Markdown/RST code-block classification, including fence delimiter and blockquote cases.
  • Changed RST reflection prompt insertion to a valid visible RST section.
  • Normalized parsed CaptureStatus counters from JSON numbers to bigint values.
  • Skipped activity classification work while capture is not recording.
  • Reused captureLog for capture failure output formatting.

Verification: pnpm exec tsc -noEmit and pnpm exec eslint src both pass.

2 participants