feat: new app database shape #2445

Merged
mogery merged 26 commits into main from mog/new-app-db-shape
Dec 3, 2025

Conversation

@mogery
Member

@mogery mogery commented Nov 26, 2025

  • removes PostHog for good
  • removes BigQuery data sending in favor of the data pipeline
  • uses the new database shape of requests -> sub-units

@mogery mogery changed the title feat: rework job logging to new schema (draft impl) feat: new app database shape Nov 26, 2025
@mogery mogery force-pushed the mog/new-app-db-shape branch 2 times, most recently from 7cdaa79 to f745f14 Compare November 27, 2025 08:35
@mogery mogery force-pushed the mog/new-app-db-shape branch 2 times, most recently from 8d4811b to 0a249a3 Compare November 27, 2025 09:29
@mogery
Member Author

mogery commented Nov 27, 2025

@cubic-dev-ai review pls

@cubic-dev-ai
Copy link
Copy Markdown
Contributor

cubic-dev-ai Bot commented Nov 27, 2025

@cubic-dev-ai review pls

@mogery I've started the AI code review. It'll take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


6 issues found across 51 files

Prompt for AI agents (all 6 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="apps/api/src/lib/gcs-jobs.ts">

<violation number="1" location="apps/api/src/lib/gcs-jobs.ts:215">
Map results upload retries swallow every error, allowing all attempts to fail without logging or throwing, so the data is never written but the caller still sees success.</violation>
</file>

<file name="apps/api/src/controllers/v2/map.ts">

<violation number="1" location="apps/api/src/controllers/v2/map.ts:58">
logRequest no longer stores `req.body.integration`, so integration-originated map requests cannot be attributed in the new requests table.</violation>
</file>

<file name="apps/api/src/controllers/v1/map.ts">

<violation number="1" location="apps/api/src/controllers/v1/map.ts:465">
logMap returns a promise and is never awaited here, so any failure inside logMap (e.g., saveMapToGCS rejecting) becomes an unhandled rejection and the map log may be lost. Await or explicitly handle the promise.</violation>
</file>

<file name="apps/api/src/services/worker/crawl-logic.ts">

<violation number="1" location="apps/api/src/services/worker/crawl-logic.ts:157">
Batch scrapes now record `credits_cost = 0` whenever `credits_billed` is still null, so missing billing data is indistinguishable from a free job. Preserve the null (or defer logging) instead of forcing zero to avoid corrupting billing records.</violation>
</file>

<file name="apps/api/src/lib/extract/extraction-service.ts">

<violation number="1" location="apps/api/src/lib/extract/extraction-service.ts:682">
Persist the actual exception message when logging the multi-entity transform failure so the stored extract record contains actionable diagnostics instead of a generic string.</violation>

<violation number="2" location="apps/api/src/lib/extract/extraction-service.ts:792">
Log the actual scrape exception message instead of a generic string so extract failures remain diagnosable.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR
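The first violation above (retries in `gcs-jobs.ts` swallowing every error) can be addressed with a retry wrapper that surfaces the final failure. This is a hypothetical sketch, not the PR's actual code; `uploadWithRetry` and its parameters are illustrative names:

```typescript
// Sketch: retry an upload a fixed number of times, but rethrow the last
// error after the final attempt instead of silently swallowing it, so
// the caller never sees a false success when nothing was written.
async function uploadWithRetry(
  upload: () => Promise<void>,
  attempts = 3,
): Promise<void> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      await upload();
      return; // success: stop retrying
    } catch (error) {
      lastError = error; // remember the most recent failure
    }
  }
  // All attempts failed: surface the error to the caller.
  throw lastError;
}
```

With this shape the caller can decide whether an upload failure is fatal or merely logged, rather than the retry loop making that decision implicitly.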

Comment thread apps/api/src/lib/gcs-jobs.ts Outdated
Comment thread apps/api/src/controllers/v2/map.ts
Comment thread apps/api/src/controllers/v1/map.ts
Comment thread apps/api/src/services/worker/crawl-logic.ts
Comment thread apps/api/src/lib/extract/extraction-service.ts
Comment thread apps/api/src/lib/extract/extraction-service.ts
@mogery mogery marked this pull request as ready for review November 28, 2025 12:39
@mogery mogery requested a review from nickscamara as a code owner November 28, 2025 12:39
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 68 files

Prompt for AI agents (all 4 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="apps/api/src/controllers/v0/admin/crawl-check.ts">

<violation number="1" location="apps/api/src/controllers/v0/admin/crawl-check.ts:173">
`status` is now always `finished` because the code treats the mere presence of a `requests` row as completion, but those rows are created when the crawl is enqueued, so active crawls no longer report their real status.</violation>
</file>

<file name="apps/api/src/controllers/v1/crawl-status.ts">

<violation number="1" location="apps/api/src/controllers/v1/crawl-status.ts:71">
`getJobs` no longer provides any persisted document payload when the queue job or GCS blob is missing, so websocket consumers stop receiving completed crawl documents even though the scrapes still exist in Supabase.</violation>
</file>

<file name="apps/api/src/controllers/v0/search.ts">

<violation number="1" location="apps/api/src/controllers/v0/search.ts:245">
`logSearch` is async and can throw, but the controller fire-and-forgets it, so any logging/GCS failure produces an unhandled promise rejection. Await the call (or at least add a `.catch`) so failures are captured instead of crashing the process.</violation>
</file>

<file name="apps/api/src/controllers/v0/admin/zdrcleaner.ts">

<violation number="1" location="apps/api/src/controllers/v0/admin/zdrcleaner.ts:141">
Requests are removed from the cleanup schedule after the first successful blob deletion, so blobs that failed to delete are never retried. Defer clearing `dr_clean_by` until all blobs for a request delete successfully.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

Comment thread apps/api/src/controllers/v0/admin/crawl-check.ts Outdated
Comment thread apps/api/src/controllers/v1/crawl-status.ts
Comment thread apps/api/src/controllers/v0/search.ts Outdated
batch.map(async (row: { request_id: string; blob_id: string }) => {
try {
await cleanUpJob(row.blob_id);
cleanedUpRequestIds.add(row.request_id);
Contributor

@cubic-dev-ai cubic-dev-ai Bot Nov 28, 2025


Requests are removed from the cleanup schedule after the first successful blob deletion, so blobs that failed to delete are never retried. Defer clearing dr_clean_by until all blobs for a request delete successfully.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/controllers/v0/admin/zdrcleaner.ts, line 141:

<comment>Requests are removed from the cleanup schedule after the first successful blob deletion, so blobs that failed to delete are never retried. Defer clearing `dr_clean_by` until all blobs for a request delete successfully.</comment>

<file context>
@@ -97,13 +100,102 @@ async function cleanUp(specificTeamId: string | null, _logger: Logger) {
+          batch.map(async (row: { request_id: string; blob_id: string }) => {
+            try {
+              await cleanUpJob(row.blob_id);
+              cleanedUpRequestIds.add(row.request_id);
+            } catch (error) {
+              logger.error(`Error cleaning up blob`, {
</file context>

✅ Addressed in 6af8ba8
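The fix shape the reviewer asks for — only clearing `dr_clean_by` once every blob of a request deletes — can be sketched as follows. The row type and `cleanUpJob` mirror the snippet above, but this is an illustrative reconstruction, not the code from commit 6af8ba8:

```typescript
type Row = { request_id: string; blob_id: string };

// Sketch: delete all blobs in a batch, and return only the request ids
// whose blobs ALL deleted successfully. A single failed blob keeps its
// request on the cleanup schedule so it is retried on the next pass.
async function cleanBatch(
  batch: Row[],
  cleanUpJob: (blobId: string) => Promise<void>,
): Promise<Set<string>> {
  const seenRequestIds = new Set<string>();
  const failedRequestIds = new Set<string>();

  await Promise.all(
    batch.map(async row => {
      seenRequestIds.add(row.request_id);
      try {
        await cleanUpJob(row.blob_id);
      } catch {
        // Any failed blob poisons the whole request.
        failedRequestIds.add(row.request_id);
      }
    }),
  );

  // Only fully-cleaned requests may have dr_clean_by cleared.
  return new Set(
    [...seenRequestIds].filter(id => !failedRequestIds.has(id)),
  );
}
```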

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (reviewed changes from recent commits).

Prompt for AI agents (all 1 issue)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="apps/api/src/controllers/v0/search.ts">

<violation number="1" location="apps/api/src/controllers/v0/search.ts:245">
P2: Awaiting `logSearch` makes successful searches fail whenever the analytics insert/upload throws, so a logging hiccup now returns HTTP 500 to the client. Keep the logging best-effort instead of blocking the response.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR
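The two reviews together pin down the desired behavior for `logSearch`: not awaited (so a logging hiccup cannot return HTTP 500), but not bare either (so a rejection is observed rather than unhandled). A minimal sketch of that best-effort pattern, with illustrative names:

```typescript
// Sketch: run an async side task without blocking the caller, but
// attach a catch handler so a rejection is recorded instead of
// becoming an unhandled promise rejection.
function fireAndLog(
  task: () => Promise<void>,
  onError: (error: unknown) => void,
): void {
  // Intentionally not awaited: the response proceeds regardless of
  // whether the logging task succeeds.
  task().catch(onError);
}
```

A call site would look like `fireAndLog(() => logSearch(...), err => logger.error("logSearch failed", { err }))`, keeping analytics strictly off the request's critical path.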

Comment thread apps/api/src/controllers/v0/search.ts
@mogery mogery force-pushed the mog/new-app-db-shape branch 2 times, most recently from 794fbdc to 3209e88 Compare December 2, 2025 14:26
@mogery mogery force-pushed the mog/new-app-db-shape branch from 0be3b16 to b82bee6 Compare December 3, 2025 09:41
@mogery
Member Author

mogery commented Dec 3, 2025

@cubic-dev-ai please re-review

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Dec 3, 2025

@cubic-dev-ai please re-review

@mogery I've started the AI code review. It'll take a few minutes to complete.

@mogery mogery merged commit 3086a27 into main Dec 3, 2025
5 of 8 checks passed
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 72 files

Prompt for AI agents (all 4 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="apps/api/src/controllers/v0/admin/crawl-migration.ts">

<violation number="1" location="apps/api/src/controllers/v0/admin/crawl-migration.ts:36">
P1: Using .single() here throws for the normal “not found” case, so every new crawl fails to migrate. Use maybeSingle() (or handle the error) so missing rows don’t abort the migration.</violation>
</file>

<file name="apps/api/src/controllers/v1/map.ts">

<violation number="1" location="apps/api/src/controllers/v1/map.ts:466">
P2: Handle the `logMap` promise so failures writing map logs don’t become unhandled rejections.</violation>
</file>

<file name="apps/api/src/controllers/v0/admin/zdrcleaner.ts">

<violation number="1" location="apps/api/src/controllers/v0/admin/zdrcleaner.ts:141">
P1: Requests are marked cleaned even when some of their blobs fail to delete, so `dr_clean_by` is cleared and the failed blobs are never retried.</violation>
</file>

<file name="apps/api/src/__tests__/snips/zdr-helpers.ts">

<violation number="1" location="apps/api/src/__tests__/snips/zdr-helpers.ts:117">
P2: `batch_scrapes` rows never contain a url column, so this expectation always throws and the helper can’t verify cleanup.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

.from("requests")
.select("id")
.eq("id", crawlId)
.single();
Contributor

@cubic-dev-ai cubic-dev-ai Bot Dec 3, 2025


P1: Using .single() here throws for the normal “not found” case, so every new crawl fails to migrate. Use maybeSingle() (or handle the error) so missing rows don’t abort the migration.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/controllers/v0/admin/crawl-migration.ts, line 36:

<comment>Using .single() here throws for the normal “not found” case, so every new crawl fails to migrate. Use maybeSingle() (or handle the error) so missing rows don’t abort the migration.</comment>

<file context>
@@ -0,0 +1,118 @@
+        .from("requests")
+        .select("id")
+        .eq("id", crawlId)
+        .single();
+
+      if (existingRequest) {
</file context>
Suggested change
.single();
.maybeSingle();
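The distinction behind the suggested change, assuming supabase-js semantics: `.single()` reports zero matching rows as an error, while `.maybeSingle()` returns `data: null` with no error for the normal "not found" case. A hypothetical helper (types and names are illustrative, not the PR's code) makes the branch explicit:

```typescript
// Shape of a maybeSingle() result, simplified for illustration.
type MaybeSingleResult<T> = {
  data: T | null;
  error: { message: string } | null;
};

// Sketch: treat a null row as "not yet migrated" rather than a failure,
// and only throw on a genuine query error.
function alreadyMigrated<T>(result: MaybeSingleResult<T>): boolean {
  if (result.error) {
    // A real query failure: surface it instead of silently skipping.
    throw new Error(result.error.message);
  }
  return result.data !== null; // null data simply means "not found"
}
```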

Comment on lines +466 to 484
logMap({
id: result.job_id,
request_id: result.job_id,
url: req.body.url,
crawlerOptions: {},
scrapeOptions: {},
origin: req.body.origin ?? "api",
integration: req.body.integration,
num_tokens: 0,
credits_billed: 1,
team_id: req.auth.team_id,
options: {
search: req.body.search,
limit: req.body.limit,
ignoreSitemap: req.body.ignoreSitemap,
includeSubdomains: req.body.includeSubdomains,
filterByPath: req.body.filterByPath !== false,
useIndex: req.body.useIndex,
timeout: req.body.timeout,
location: req.body.location,
},
results: result.links,
credits_cost: 1,
zeroDataRetention: false, // not supported
});
Contributor

@cubic-dev-ai cubic-dev-ai Bot Dec 3, 2025


P2: Handle the logMap promise so failures writing map logs don’t become unhandled rejections.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/controllers/v1/map.ts, line 466:

<comment>Handle the `logMap` promise so failures writing map logs don’t become unhandled rejections.</comment>

<file context>
@@ -446,22 +463,23 @@ export async function mapController(
-    time_taken: result.time_taken,
-    team_id: req.auth.team_id,
-    mode: "map",
+  logMap({
+    id: result.job_id,
+    request_id: result.job_id,
</file context>
Suggested change
logMap({
id: result.job_id,
request_id: result.job_id,
url: req.body.url,
crawlerOptions: {},
scrapeOptions: {},
origin: req.body.origin ?? "api",
integration: req.body.integration,
num_tokens: 0,
credits_billed: 1,
team_id: req.auth.team_id,
options: {
search: req.body.search,
limit: req.body.limit,
ignoreSitemap: req.body.ignoreSitemap,
includeSubdomains: req.body.includeSubdomains,
filterByPath: req.body.filterByPath !== false,
useIndex: req.body.useIndex,
timeout: req.body.timeout,
location: req.body.location,
},
results: result.links,
credits_cost: 1,
zeroDataRetention: false, // not supported
});
logMap({
id: result.job_id,
request_id: result.job_id,
url: req.body.url,
team_id: req.auth.team_id,
options: {
search: req.body.search,
limit: req.body.limit,
ignoreSitemap: req.body.ignoreSitemap,
includeSubdomains: req.body.includeSubdomains,
filterByPath: req.body.filterByPath !== false,
useIndex: req.body.useIndex,
timeout: req.body.timeout,
location: req.body.location,
},
results: result.links,
credits_cost: 1,
zeroDataRetention: false, // not supported
}).catch(error => {
logger.error(`Failed to log job for team ${req.auth.team_id}: ${error}`);
});

batch.map(async (row: { request_id: string; blob_id: string }) => {
try {
await cleanUpJob(row.blob_id);
cleanedUpRequestIds.add(row.request_id);
Contributor

@cubic-dev-ai cubic-dev-ai Bot Dec 3, 2025


P1: Requests are marked cleaned even when some of their blobs fail to delete, so dr_clean_by is cleared and the failed blobs are never retried.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/controllers/v0/admin/zdrcleaner.ts, line 141:

<comment>Requests are marked cleaned even when some of their blobs fail to delete, so `dr_clean_by` is cleared and the failed blobs are never retried.</comment>

<file context>
@@ -97,13 +100,102 @@ async function cleanUp(specificTeamId: string | null, _logger: Logger) {
+          batch.map(async (row: { request_id: string; blob_id: string }) => {
+            try {
+              await cleanUpJob(row.blob_id);
+              cleanedUpRequestIds.add(row.request_id);
+            } catch (error) {
+              logger.error(`Error cleaning up blob`, {
</file context>


if (requestData && requestData.length === 1) {
const record = requestData[0];
expect(record.url).not.toContain("://"); // no url stored
Contributor

@cubic-dev-ai cubic-dev-ai Bot Dec 3, 2025


P2: batch_scrapes rows never contain a url column, so this expectation always throws and the helper can’t verify cleanup.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/__tests__/snips/zdr-helpers.ts, line 117:

<comment>`batch_scrapes` rows never contain a url column, so this expectation always throws and the helper can’t verify cleanup.</comment>

<file context>
@@ -76,70 +85,130 @@ export async function getLogs() {
+  
+  if (requestData && requestData.length === 1) {
+    const record = requestData[0];
+    expect(record.url).not.toContain("://"); // no url stored
+    expect(record.kind).toBe("crawl");
+    expect(record.dr_clean_by).not.toBeNull();
</file context>
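Since `batch_scrapes` rows have no `url` column, the helper's fix is to assert the property's absence rather than match against its contents. A minimal sketch of that assertion, with jest-free plain throws and an illustrative name:

```typescript
// Sketch: verify that a cleaned-up row carries no stored URL at all,
// instead of string-matching a column that never exists on the row.
function assertNoUrlStored(record: Record<string, unknown>): void {
  if ("url" in record && record.url != null) {
    throw new Error("expected no url to be stored on the row");
  }
}
```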
