feat: extract quoted tweet content for quote-tweet bookmarks#30

Closed
mindswim wants to merge 1 commit into afar1:main from mindswim:feat/enrich-quoted-tweets


@mindswim mindswim commented Apr 6, 2026

Summary

Bookmarks that quote another tweet only store the quoted tweet's ID (quotedStatusId). The quoted tweet's text, author, and media are missing, making many bookmarks hard to understand in isolation.

The GraphQL bookmarks API already returns this data nested inside tweet.quoted_status_result.result -- it just wasn't being extracted. This PR reads it during sync and adds a backfill command for existing bookmarks.

Closes #15

What changed

  • src/types.ts -- QuotedTweetSnapshot interface, quotedTweet field on BookmarkRecord
  • src/graphql-bookmarks.ts -- extract quoted_status_result.result in convertTweetToRecord(). Future syncs include quoted tweets automatically.
  • src/bookmarks-db.ts -- schema v4: quoted_tweet_json column, migration, updateQuotedTweets() for the DB layer
  • src/bookmark-enrich.ts -- new ft enrich command that backfills existing bookmarks via X's syndication API. Retry with exponential backoff (matching fetchPageWithRetry pattern), rate limiting, idempotent.
  • src/cli.ts -- register ft enrich command
  • tests/graphql-bookmarks.test.ts -- 2 tests: extraction with quoted tweet present, graceful handling when absent

How it works

New syncs: convertTweetToRecord() now reads the nested quoted tweet from the GraphQL response. No extra API calls, no config. It just works.
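A minimal sketch of the extraction described above, not the PR's actual `convertTweetToRecord()` code. The snapshot fields and the payload field names `rest_id`, `legacy.full_text`, and `legacy.screen_name` are assumptions about the GraphQL response; only the `quoted_status_result.result` path, the optional `tweet` wrapper, and `core.user_results.result` come from this thread.

```typescript
// Hypothetical snapshot shape: only the fields this sketch fills in.
interface QuotedTweetSnapshot {
  id: string;
  text: string;
  authorHandle?: string;
}

// Returns a snapshot of the quoted tweet, or undefined when the bookmark
// quotes nothing or the quoted tweet is missing from the payload.
function extractQuotedTweet(tweet: any): QuotedTweetSnapshot | undefined {
  let quoted = tweet?.quoted_status_result?.result;
  if (!quoted) return undefined;
  // Some results nest the real tweet one level deeper under `tweet`.
  if (quoted.tweet) quoted = quoted.tweet;
  const legacy = quoted.legacy;
  const id: string | undefined = quoted.rest_id ?? legacy?.id_str;
  if (!id || !legacy) return undefined;
  const user = quoted.core?.user_results?.result;
  return {
    id,
    text: legacy.full_text ?? "",
    authorHandle: user?.legacy?.screen_name,
  };
}
```

Returning `undefined` rather than throwing keeps the absent case cheap: the record can still carry `quotedStatusId` even when no snapshot is available.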

Existing bookmarks: ft enrich fetches missing quoted tweets via cdn.syndication.twimg.com (no auth required). Deduplicates by quoted tweet ID, retries on 429/5xx, skips deleted/private tweets. Run once to backfill, then never needed again.

```shell
ft enrich                  # backfill missing quoted tweets
ft enrich --delay-ms 500   # slower rate limit
```
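The retry behavior described for the backfill can be sketched roughly as follows. This is an illustration, not the code in `src/bookmark-enrich.ts`: the `Attempt` type, the `fetchWithRetry` name, the default attempt count and base delay, and the injectable `sleep` are all assumptions made so the example runs without network access.

```typescript
// Outcome of one request attempt. The mapping from HTTP status to kind is an
// assumption: 404/403 -> "unavailable", 429/5xx -> "retry".
type Attempt<T> =
  | { kind: "ok"; value: T }
  | { kind: "unavailable" }
  | { kind: "retry"; status: number };

const realSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry<T>(
  attempt: () => Promise<Attempt<T>>,
  maxAttempts = 4,
  baseDelayMs = 500,
  sleep: (ms: number) => Promise<void> = realSleep,
): Promise<T | null> {
  let lastStatus = 0;
  for (let i = 0; i < maxAttempts; i++) {
    const outcome = await attempt();
    if (outcome.kind === "ok") return outcome.value;
    // Deleted or private tweets will not come back, so do not retry them.
    if (outcome.kind === "unavailable") return null;
    lastStatus = outcome.status;
    // Exponential backoff: baseDelayMs, then 2x, 4x, ...; no wait after the last attempt.
    if (i < maxAttempts - 1) await sleep(baseDelayMs * 2 ** i);
  }
  throw new Error(`gave up after ${maxAttempts} attempts (last status ${lastStatus})`);
}
```

Throwing once retries are exhausted, instead of returning `null`, keeps "tweet is gone" distinguishable from "the request kept failing" for whatever code tallies the results.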

Verification

```shell
npm run build          # clean compile
npm test               # 2 new tests pass (5 pre-existing db test failures on main unchanged)
ft enrich              # tested against 7,641 bookmarks -- 1,164/1,205 quoted tweets fetched
```

Note

Medium Risk
Adds new data extraction and persistence paths (GraphQL parsing + SQLite schema migration) and a networked backfill command, which could affect sync/index correctness and rate-limit behavior but is contained to optional enrichment and a new column.

Overview
Adds first-class support for storing quoted-tweet context on quote-tweet bookmarks.

New GraphQL sync behavior extracts a QuotedTweetSnapshot from quoted_status_result and stores it on BookmarkRecord.quotedTweet, with tests covering presence/absence. The SQLite index schema is bumped to v4 with a new quoted_tweet_json column, insert/migration updates, and a new updateQuotedTweets() helper.

Introduces ft enrich, which backfills missing quotedTweet snapshots for existing bookmarks by fetching them from X’s public syndication endpoint with retry/backoff, deduplicating quoted IDs, and updating both the JSONL cache and the SQLite index.

Reviewed by Cursor Bugbot for commit 5f2d022. Bugbot is set up for automated code reviews on this repo.

…ommand

The GraphQL bookmarks API already returns full quoted tweet data nested
inside tweet.quoted_status_result.result, but it was not being extracted.
This adds extraction in convertTweetToRecord() so future syncs
automatically include quoted tweet text, author, media, and URL.

For existing bookmarks synced without this data, adds `ft enrich` which
fetches missing quoted tweets via X's syndication API with retry and
rate limiting. Idempotent -- safe to run multiple times.

Schema bumped to v4 with a quoted_tweet_json column. DB update logic
lives in bookmarks-db.ts following the existing layer separation.

Closes afar1#15

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread src/bookmarks-db.ts
```typescript
githubUrls.length ? JSON.stringify(githubUrls) : null,
null, // domains — populated by classify-domains pass
null, // primary_domain
r.quotedTweet ? JSON.stringify(r.quotedTweet) : null,
```

Migration skipped due to premature schema version bump

High Severity

In buildIndex, initSchema(db) is called before ensureMigrations(db). initSchema unconditionally sets schema_version to SCHEMA_VERSION (4) in the meta table, even when CREATE TABLE IF NOT EXISTS is a no-op for an existing table. When ensureMigrations runs next, it reads the version as 4, so the version < 4 check is false and the ALTER TABLE ADD COLUMN quoted_tweet_json migration never executes. For any existing database at schema v3, the column is missing, causing SQL errors when insertRecord passes 31 values to a 30-column table or when updateQuotedTweets references the non-existent column.
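The ordering problem can be reproduced with a toy model. This is a sketch, not the code in `src/bookmarks-db.ts`: `FakeDb`, `stampVersion`, `openBuggy`, and `openFixed` are hypothetical stand-ins, and a `Set` of column names replaces the real SQLite table. Reordering the calls is only one possible fix; stamping the version only when none exists would be another.

```typescript
const SCHEMA_VERSION = 4;

// Stand-in for the SQLite index: a version stamp plus the table's column names.
interface FakeDb {
  version: number | null;
  columns: Set<string>;
}

// Models the `version < 4` check guarding ALTER TABLE ... ADD COLUMN quoted_tweet_json.
function ensureMigrations(db: FakeDb): void {
  const current = db.version ?? 0;
  if (current < 4) db.columns.add("quoted_tweet_json");
}

// Models the part of initSchema that unconditionally writes schema_version.
function stampVersion(db: FakeDb): void {
  db.version = SCHEMA_VERSION;
}

// Order reported in the review: the stamp hides the pending migration.
function openBuggy(db: FakeDb): void {
  stampVersion(db);
  ensureMigrations(db);
}

// One possible fix: migrate first, stamp afterwards.
function openFixed(db: FakeDb): void {
  ensureMigrations(db);
  stampVersion(db);
}

const v3a: FakeDb = { version: 3, columns: new Set(["id", "text"]) };
openBuggy(v3a);
console.log(v3a.columns.has("quoted_tweet_json")); // false: column never added

const v3b: FakeDb = { version: 3, columns: new Set(["id", "text"]) };
openFixed(v3b);
console.log(v3b.columns.has("quoted_tweet_json")); // true
```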


@BenevolentFutures

First-Pass Review

Clean, well-structured PR. The GraphQL quoted tweet extraction is correct, the syndication API fallback for backfill is a good design choice, and the schema migration follows existing patterns. One real issue to address.

Must Fix

mediaObjects field mismatch with BookmarkMediaObject type

QuotedTweetSnapshot.mediaObjects is typed as BookmarkMediaObject[], but the objects use url instead of mediaUrl (the field defined on the interface), and expandedUrl which doesn't exist on the interface at all:

```typescript
// graphql-bookmarks.ts (quoted tweet extraction)
mediaObjects: qtMediaEntities.map((m: any) => ({
  url: m.media_url_https ?? m.media_url,       // should be mediaUrl per interface
  expandedUrl: m.expanded_url,                  // not on BookmarkMediaObject
})),
```

Same pattern in bookmark-enrich.ts. Note: This is a pre-existing inconsistency — the main tweet's mediaObjects extraction has the same mismatch. The (m: any) cast hides the type error.

Pragmatic fix: update BookmarkMediaObject to use url instead of mediaUrl (matching what's actually serialized in the JSONL). Lower risk than changing all creation sites.
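As a sketch of how the mismatch could be surfaced by the compiler once the interface is updated: the `BookmarkMediaObject` shape below is an assumption following the suggested `url` rename (the real interface likely carries more fields), and `RawMediaEntity` and `toMediaObject` are hypothetical names.

```typescript
// Assumed shape after the suggested rename; the real interface may differ.
interface BookmarkMediaObject {
  url?: string;
  expandedUrl?: string;
}

// Minimal view of a raw media entity, limited to the fields in the snippet above.
interface RawMediaEntity {
  media_url_https?: string;
  media_url?: string;
  expanded_url?: string;
}

// Typing the parameter instead of using `(m: any)` lets the compiler check the
// mapping: a misspelled source field, or a key that is not on
// BookmarkMediaObject (such as `mediaUrl` here), becomes a compile error.
function toMediaObject(m: RawMediaEntity): BookmarkMediaObject {
  return {
    url: m.media_url_https ?? m.media_url,
    expandedUrl: m.expanded_url,
  };
}
```

Sharing one typed mapper between `graphql-bookmarks.ts` and `bookmark-enrich.ts` would also keep the two creation sites from drifting apart.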

Should Fix

1. Schema version conflict with folder support

This PR uses SCHEMA_VERSION = 4 for quoted_tweet_json. PR #34 (folder support, now ready for review) also uses v4 for folder_ids/folder_names. Whichever merges second will need a rebase to v5, with both migration steps. Not a blocker today but coordinate merge order.
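One way to picture the rebase is as an ordered list of migration steps, each applied only when the stored version is behind it. This is a toy model, not the repository's migration code: the `Migration` type and `migrate` function are hypothetical, and which PR ends up as v4 versus v5 depends on merge order, so the ordering shown is arbitrary.

```typescript
interface Migration {
  version: number;
  addColumns: string[];
}

// Must stay sorted by version. The v4/v5 assignment here is illustrative only.
const MIGRATIONS: Migration[] = [
  { version: 4, addColumns: ["quoted_tweet_json"] },
  { version: 5, addColumns: ["folder_ids", "folder_names"] },
];

// Applies every step newer than currentVersion and returns the new version.
// A Set of column names stands in for the real table.
function migrate(currentVersion: number, columns: Set<string>): number {
  let version = currentVersion;
  for (const step of MIGRATIONS) {
    if (version >= step.version) continue;
    for (const col of step.addColumns) columns.add(col);
    version = step.version;
  }
  return version;
}
```

With this layout a v3 database picks up both steps in order, while a database already at v4 only runs the v5 step.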

2. EnrichResult double-counts unavailable tweets

When a tweet is unavailable (404/403), fetchTweetWithRetry returns null, which increments failed in the fetch loop. Later, when applying snapshots, the same null resolution increments skipped. So unavailable tweets are counted in both failed AND skipped. Suggestion: reserve failed for actual exceptions and skipped for intentionally unavailable tweets.
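A minimal sketch of the suggested split, assuming each quoted tweet ID resolves to exactly one of three outcomes. The `Resolution` type, the `tally` function, and the `enriched` counter name are hypothetical; only `failed` and `skipped` are named in this thread.

```typescript
// Each quoted tweet ID ends in exactly one of these states.
type Resolution =
  | { kind: "fetched" }
  | { kind: "unavailable" }               // 404/403: deleted or private
  | { kind: "error"; message: string };   // exception or retries exhausted

interface EnrichResult {
  enriched: number;
  skipped: number;
  failed: number;
}

// Counts every resolution once, so the three counters always sum to the input length.
function tally(resolutions: Resolution[]): EnrichResult {
  const result: EnrichResult = { enriched: 0, skipped: 0, failed: 0 };
  for (const r of resolutions) {
    switch (r.kind) {
      case "fetched":
        result.enriched++;
        break;
      case "unavailable":
        result.skipped++;
        break;
      case "error":
        result.failed++;
        break;
    }
  }
  return result;
}
```

Counting in a single pass over tagged outcomes avoids the situation where the fetch loop and the apply loop each interpret the same `null` independently.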

3. Consider a --dry-run option

The enrich command mutates both JSONL and SQLite. A dry-run flag that reports how many bookmarks need enrichment without fetching would be useful for scripting and safety.
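A dry run could be little more than the planning step without the fetch. The sketch below is an assumption about how that might look, not an existing flag: `BookmarkLike`, `EnrichPlan`, and `planEnrichment` are hypothetical names, and a bookmark is treated as needing enrichment when it has a `quotedStatusId` but no `quotedTweet`.

```typescript
interface BookmarkLike {
  id: string;
  quotedStatusId?: string;
  quotedTweet?: unknown;
}

interface EnrichPlan {
  bookmarksToEnrich: number;   // bookmarks missing a snapshot
  uniqueQuotedIds: string[];   // distinct tweets that would be fetched
}

// Pure function: reads the bookmarks and reports the work, with no network or disk writes.
function planEnrichment(bookmarks: BookmarkLike[]): EnrichPlan {
  const ids = new Set<string>();
  let bookmarksToEnrich = 0;
  for (const b of bookmarks) {
    if (b.quotedStatusId && b.quotedTweet === undefined) {
      bookmarksToEnrich++;
      ids.add(b.quotedStatusId); // several bookmarks may quote the same tweet
    }
  }
  return { bookmarksToEnrich, uniqueQuotedIds: Array.from(ids) };
}
```

Because the plan already deduplicates by quoted tweet ID, the real command could reuse it as its first step and only differ in whether it goes on to fetch.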

Nitpick

  • spinnerIdx referenced in the enrich command handler — compiles fine since it's module-scoped, but a local index would be slightly cleaner.
  • The token=x parameter in the syndication URL deserves a comment explaining it's a required-but-any-value parameter, not a real auth token.
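For the `token=x` nitpick, the comment could live next to a small URL builder like the one sketched here. The host and the `token=x` parameter come from this thread; the `tweet-result` path, the `id` parameter name, and the `syndicationUrl` function are assumptions for illustration.

```typescript
// Builds the syndication lookup URL for one tweet.
// Assumption: the endpoint path is /tweet-result and the tweet is selected via `id`.
function syndicationUrl(tweetId: string): string {
  const url = new URL("https://cdn.syndication.twimg.com/tweet-result");
  url.searchParams.set("id", tweetId);
  // `token` is not an auth credential. Per the review, the endpoint requires
  // the parameter to be present but accepts any value, hence the placeholder.
  url.searchParams.set("token", "x");
  return url.toString();
}
```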

Verified Correct

  • GraphQL extraction correctly navigates quoted_status_result.result, handles the tweet wrapper, extracts user from core.user_results.result with proper fallbacks
  • Null/missing quoted tweet handling is solid — quotedTweet is undefined, quotedStatusId still preserved
  • Schema migration follows the exact ALTER TABLE + try/catch pattern
  • Deduplication: quoted tweet IDs are deduped before fetching, then applied to all matching bookmarks
  • Retry logic mirrors existing fetchPageWithRetry pattern
  • JSONL + SQLite dual write maintains consistency
  • Test coverage: two tests covering successful extraction and graceful null handling
  • CLI registration follows existing patterns

Verdict: Solid work. Fix the mediaObjects type mismatch, then it's ready. Coordinate schema version with #34.

@afar1
Owner

afar1 commented Apr 7, 2026

Closing in favor of #35 which landed this. @mindswim — your implementation directly shaped how we built this. The syndication API backfill approach, the snapshot type, the test cases — all solid. We ended up folding it into ft sync --gaps instead of a separate command but the core was yours. Thanks for the great work.

@afar1 afar1 closed this Apr 7, 2026
@mindswim
Author

mindswim commented Apr 7, 2026 via email


Development

Successfully merging this pull request may close these issues.

Enrich extracted objects with referenced tweet context (replies, quotes)

3 participants