Skip to content

feat: enrich custom companies#3311

Merged
AmarTrebinjac merged 12 commits intomainfrom
ENG-46-enrichment
Dec 2, 2025
Merged

feat: enrich custom companies#3311
AmarTrebinjac merged 12 commits intomainfrom
ENG-46-enrichment

Conversation

@AmarTrebinjac
Copy link
Copy Markdown
Contributor

@AmarTrebinjac AmarTrebinjac commented Nov 28, 2025

Enrich companies/schools using Claude to see if it has details on it.
Added "altName" to company table so that we can preserve both English and Native name. Can use it when searching companies (linkedIn does this too), and potentially other stuff as well.

When enriching, we will replace the user's typed in name with the "English" name. Example below I type in the Norwegian name for my school "Handelshøyskolen BI", but after enrichment, it uses the English name.

Screen.Recording.2025-11-28.at.14.57.08.mov

@pulumi
Copy link
Copy Markdown

pulumi Bot commented Nov 28, 2025

🍹 The Update (preview) for dailydotdev/api/prod (at d6636c0) was successful.

Resource Changes

    Name                                                   Type                           Operation
~   vpc-native-personalized-digest-deployment              kubernetes:apps/v1:Deployment  update
~   vpc-native-daily-digest-cron                           kubernetes:batch/v1:CronJob    update
~   vpc-native-calculate-top-readers-cron                  kubernetes:batch/v1:CronJob    update
~   vpc-native-generate-search-invites-cron                kubernetes:batch/v1:CronJob    update
~   vpc-native-clean-zombie-images-cron                    kubernetes:batch/v1:CronJob    update
~   vpc-native-clean-zombie-users-cron                     kubernetes:batch/v1:CronJob    update
-   vpc-native-api-db-migration-ba1273a5                   kubernetes:batch/v1:Job        delete
+   vpc-native-api-clickhouse-migration-86905be0           kubernetes:batch/v1:Job        create
+   vpc-native-api-db-migration-86905be0                   kubernetes:batch/v1:Job        create
~   vpc-native-update-tags-str-cron                        kubernetes:batch/v1:CronJob    update
~   vpc-native-generic-referral-reminder-cron              kubernetes:batch/v1:CronJob    update
~   vpc-native-temporal-deployment                         kubernetes:apps/v1:Deployment  update
~   vpc-native-clean-gifted-plus-cron                      kubernetes:batch/v1:CronJob    update
~   vpc-native-sync-subscription-with-cio-cron             kubernetes:batch/v1:CronJob    update
~   vpc-native-update-current-streak-cron                  kubernetes:batch/v1:CronJob    update
~   vpc-native-post-analytics-history-day-clickhouse-cron  kubernetes:batch/v1:CronJob    update
~   vpc-native-update-tag-recommendations-cron             kubernetes:batch/v1:CronJob    update
-   vpc-native-api-clickhouse-migration-ba1273a5           kubernetes:batch/v1:Job        delete
~   vpc-native-bg-deployment                               kubernetes:apps/v1:Deployment  update
~   vpc-native-update-source-tag-view-cron                 kubernetes:batch/v1:CronJob    update
~   vpc-native-clean-zombie-user-companies-cron            kubernetes:batch/v1:CronJob    update
~   vpc-native-update-highlighted-views-cron               kubernetes:batch/v1:CronJob    update
~   vpc-native-private-deployment                          kubernetes:apps/v1:Deployment  update
~   vpc-native-clean-stale-user-transactions-cron          kubernetes:batch/v1:CronJob    update
~   vpc-native-hourly-notification-cron                    kubernetes:batch/v1:CronJob    update
~   vpc-native-update-source-public-threshold-cron         kubernetes:batch/v1:CronJob    update
~   vpc-native-personalized-digest-cron                    kubernetes:batch/v1:CronJob    update
~   vpc-native-deployment                                  kubernetes:apps/v1:Deployment  update
~   vpc-native-post-analytics-clickhouse-cron              kubernetes:batch/v1:CronJob    update
~   vpc-native-update-trending-cron                        kubernetes:batch/v1:CronJob    update
~   vpc-native-ws-deployment                               kubernetes:apps/v1:Deployment  update
~   vpc-native-validate-active-users-cron                  kubernetes:batch/v1:CronJob    update
~   vpc-native-clean-zombie-opportunities-cron             kubernetes:batch/v1:CronJob    update
~   vpc-native-update-views-cron                           kubernetes:batch/v1:CronJob    update
~   vpc-native-check-analytics-report-cron                 kubernetes:batch/v1:CronJob    update

Comment thread __tests__/workers/companyEnrichment.ts Fixed
Comment thread src/entity/Company.ts
@Index('IDX_company_name_slugify', { synchronize: false })
@Index('IDX_company_altName_trgm', { synchronize: false })
@Index('IDX_company_altName_lower', { synchronize: false })
@Index('IDX_company_altName_slugify', { synchronize: false })
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the same indexes as we have on name

return new Promise((resolve) => setTimeout(resolve, ms));
}

async function tryFetchDomain(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is the best way to do this, open to suggestions.

Basically some domains don't resolve without www, and some don't resolve with. Had some trouble with the Universities i tested, so the idea is to test both paths.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

funny that those domains do not support automatic www redirect and so on ... so 1999

@AmarTrebinjac AmarTrebinjac marked this pull request as ready for review November 29, 2025 03:29
@AmarTrebinjac AmarTrebinjac requested review from a team and capJavert as code owners November 29, 2025 03:29
Copy link
Copy Markdown
Contributor

@rebelchris rebelchris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe run the actual anthropic stuff via Bragi? it handles things internally better maybe.

Comment on lines +107 to +109
function getGoogleFaviconUrl(domain: string): string {
return `${GOOGLE_FAVICON_URL}?domain=${encodeURIComponent(domain)}&sz=${FAVICON_SIZE}`;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually have this internally in API already a wrapper incase we ever wanna switch to another system.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have /icon=url= endpoint on API that proxies to google favicons

@AmarTrebinjac
Copy link
Copy Markdown
Contributor Author

Should we maybe run the actual anthropic stuff via Bragi? it handles things internally better maybe.

@rebelchris I mean we could, but I've never used Python or worked in Bragi before, so it's going to take me a minute to redo :lolsob:

Copy link
Copy Markdown
Contributor

@capJavert capJavert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing really blocking aside from using the new topic overhead, but overall some general comments on implementation to make it more reliable.

Comment thread src/common/companyEnrichment.ts Outdated
domain: string;
};

logger.info(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and other loggerInfo will spam a lot, let's move them to debug, also see if you really need warn, its ok if you need it for a first deployment, but after that most likely they will just spam without anyone looking at them

Comment on lines +97 to +99
if (attempt < DOMAIN_CHECK_RETRIES) {
await sleep(RETRY_DELAY_MS * attempt);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Garmr has support for retries, I would not bother in detecting what the error is, if we want to check 3 times we just set maxAttempts on garmr and do simple fetch

const existingCompany = await con
.getRepository(Company)
.createQueryBuilder('company')
.where(':domain = ANY(company.domains)', { domain: validatedDomain })
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be read replica I think?

Comment thread src/integrations/anthropic/client.ts Outdated
Comment thread src/integrations/anthropic/client.ts Outdated
},
});

export const anthropicClient = process.env.ANTHROPIC_API_KEY
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does client break on initialization if API key is null, seems error prone that this variable would be null or object depending on env, can it just always be:

new AnthropicClient(process.env.ANTHROPIC_API_KEY, {
      garmr: garmrAnthropicService,
    })

Comment thread src/workers/cdc/primary.ts Outdated
experience.customCompanyName &&
!experience.companyId
) {
await triggerTypedEvent(logger, 'api.v1.company-enrichment', {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we avoid adding new topic all together and just running enrichCompanyForExperience logic here?

Comment on lines +274 to +282
await con.getRepository(Company).save(company);
logger.info(
{ companyId, name: englishName, type: companyType },
'Company created',
);

await con
.getRepository(UserExperience)
.update({ id: experienceId }, { companyId });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all writes in transaction so we don't get partials on errors

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. Why would we get partials on a singular update? And its not a case where we want to cancel all if one fails imo

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want to stop if any of them fail in a batch?

Copy link
Copy Markdown
Contributor Author

@AmarTrebinjac AmarTrebinjac Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to necessarily since if one company/experience doesn't get verified/related, it doesn't really matter for the rest of them. If we stop and retry we'll have to redo API calls to the LLM for all of them again. If we get fails, wouldn't it just be better to do a "full run" at the end that picks up all the stragglers?

Copy link
Copy Markdown
Contributor

@capJavert capJavert Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we understand each other I am talking about writing everything company related in transaction, be that a batch of multiple companies (from script) or this one, to avoid writing parts of data or having incomplete data written when error happens.

You can then control how you want to handle fails, terminate transaction, or continue by catching it.

Either way up to you, try and you will see if its slow or you have issues with enrichment.

Comment thread .infra/common.ts Outdated
subscription: 'api.parse-cv-profile',
},
{
topic: 'api.v1.company-enrichment',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you end up using this you need to add it to streams repo

Comment on lines +42 to +57
const limitArgument = process.argv[2];
const offsetArgument = process.argv[3];

if (!limitArgument || !offsetArgument) {
throw new Error('limit and offset arguments are required');
}

const limit = +limitArgument;
if (Number.isNaN(limit)) {
throw new Error('limit argument is invalid, it should be a number');
}

const offset = +offsetArgument;
if (Number.isNaN(offset)) {
throw new Error('offset argument is invalid, it should be a number');
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocking for this PR, but there is really cool parseArgs and zod pattern so you can easily validate, something for your future bin script, check importProfileFromJSON script

Comment on lines +94 to +96
await con
.getRepository(UserExperience)
.update({ id: experience.id }, { companyId: exactMatch.id });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also see to do a batch in a transaction, also updating user experience can have impact on CDC so start slow and monitor

@AmarTrebinjac AmarTrebinjac merged commit 414d914 into main Dec 2, 2025
9 checks passed
@AmarTrebinjac AmarTrebinjac deleted the ENG-46-enrichment branch December 2, 2025 11:20
@AmarTrebinjac AmarTrebinjac restored the ENG-46-enrichment branch December 2, 2025 11:37
AmarTrebinjac added a commit that referenced this pull request Dec 2, 2025
@AmarTrebinjac AmarTrebinjac deleted the ENG-46-enrichment branch December 2, 2025 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants