feat: enrich custom companies #3311
Conversation
🍹 The Update (preview) for dailydotdev/api/prod (at d6636c0) was successful.
Resource changes (Name, Type, Operation):
~ vpc-native-personalized-digest-deployment kubernetes:apps/v1:Deployment update
~ vpc-native-daily-digest-cron kubernetes:batch/v1:CronJob update
~ vpc-native-calculate-top-readers-cron kubernetes:batch/v1:CronJob update
~ vpc-native-generate-search-invites-cron kubernetes:batch/v1:CronJob update
~ vpc-native-clean-zombie-images-cron kubernetes:batch/v1:CronJob update
~ vpc-native-clean-zombie-users-cron kubernetes:batch/v1:CronJob update
- vpc-native-api-db-migration-ba1273a5 kubernetes:batch/v1:Job delete
+ vpc-native-api-clickhouse-migration-86905be0 kubernetes:batch/v1:Job create
+ vpc-native-api-db-migration-86905be0 kubernetes:batch/v1:Job create
~ vpc-native-update-tags-str-cron kubernetes:batch/v1:CronJob update
~ vpc-native-generic-referral-reminder-cron kubernetes:batch/v1:CronJob update
~ vpc-native-temporal-deployment kubernetes:apps/v1:Deployment update
~ vpc-native-clean-gifted-plus-cron kubernetes:batch/v1:CronJob update
~ vpc-native-sync-subscription-with-cio-cron kubernetes:batch/v1:CronJob update
~ vpc-native-update-current-streak-cron kubernetes:batch/v1:CronJob update
~ vpc-native-post-analytics-history-day-clickhouse-cron kubernetes:batch/v1:CronJob update
~ vpc-native-update-tag-recommendations-cron kubernetes:batch/v1:CronJob update
- vpc-native-api-clickhouse-migration-ba1273a5 kubernetes:batch/v1:Job delete
~ vpc-native-bg-deployment kubernetes:apps/v1:Deployment update
~ vpc-native-update-source-tag-view-cron kubernetes:batch/v1:CronJob update
~ vpc-native-clean-zombie-user-companies-cron kubernetes:batch/v1:CronJob update
~ vpc-native-update-highlighted-views-cron kubernetes:batch/v1:CronJob update
~ vpc-native-private-deployment kubernetes:apps/v1:Deployment update
~ vpc-native-clean-stale-user-transactions-cron kubernetes:batch/v1:CronJob update
~ vpc-native-hourly-notification-cron kubernetes:batch/v1:CronJob update
~ vpc-native-update-source-public-threshold-cron kubernetes:batch/v1:CronJob update
~ vpc-native-personalized-digest-cron kubernetes:batch/v1:CronJob update
~ vpc-native-deployment kubernetes:apps/v1:Deployment update
~ vpc-native-post-analytics-clickhouse-cron kubernetes:batch/v1:CronJob update
~ vpc-native-update-trending-cron kubernetes:batch/v1:CronJob update
~ vpc-native-ws-deployment kubernetes:apps/v1:Deployment update
~ vpc-native-validate-active-users-cron kubernetes:batch/v1:CronJob update
~ vpc-native-clean-zombie-opportunities-cron kubernetes:batch/v1:CronJob update
~ vpc-native-update-views-cron kubernetes:batch/v1:CronJob update
~ vpc-native-check-analytics-report-cron kubernetes:batch/v1:CronJob update
@Index('IDX_company_name_slugify', { synchronize: false })
@Index('IDX_company_altName_trgm', { synchronize: false })
@Index('IDX_company_altName_lower', { synchronize: false })
@Index('IDX_company_altName_slugify', { synchronize: false })
Added the same indexes as we have on name.
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function tryFetchDomain(
Not sure if this is the best way to do this, open to suggestions.
Basically some domains don't resolve without www, and some don't resolve with it. I had some trouble with the universities I tested, so the idea is to test both paths.
funny that those domains do not support automatic www redirect and so on ... so 1999
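The "test both paths" idea above could be sketched roughly like this; the helper names and the HEAD-probe strategy are illustrative, not the PR's actual implementation:

```typescript
// Build both candidate URLs (with and without the www prefix) for a domain.
function buildDomainCandidates(domain: string): string[] {
  const bare = domain.replace(/^www\./, '');
  return [`https://${bare}`, `https://www.${bare}`];
}

// Probe each candidate until one responds OK; return null when none resolve.
async function tryFetchDomain(domain: string): Promise<string | null> {
  for (const url of buildDomainCandidates(domain)) {
    try {
      const res = await fetch(url, { method: 'HEAD', redirect: 'follow' });
      if (res.ok) {
        return url;
      }
    } catch {
      // DNS failure or connection refused: fall through to the next candidate
    }
  }
  return null;
}
```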
rebelchris
left a comment
Should we maybe run the actual Anthropic stuff via Bragi? It handles things internally better, maybe.
function getGoogleFaviconUrl(domain: string): string {
  return `${GOOGLE_FAVICON_URL}?domain=${encodeURIComponent(domain)}&sz=${FAVICON_SIZE}`;
}
We actually have this internally in the API already, a wrapper in case we ever want to switch to another system.
We have an /icon?url= endpoint on the API that proxies to Google favicons.
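If the code went through that internal proxy instead, the URL builder might look something like this; the endpoint path and query parameter name are assumed from the review comment, not verified against the API:

```typescript
// Hypothetical wrapper around the API's internal favicon proxy endpoint,
// as an alternative to calling the Google favicon service directly.
function getProxiedFaviconUrl(apiBase: string, domain: string): string {
  return `${apiBase}/icon?url=${encodeURIComponent(domain)}`;
}
```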
@rebelchris I mean we could, but I've never used Python or worked in Bragi before, so it's going to take me a minute to redo :lolsob:
capJavert
left a comment
Nothing really blocking aside from the overhead of the new topic, but overall some general comments on the implementation to make it more reliable.
  domain: string;
};

logger.info(
This and the other logger.info calls will spam a lot; let's move them to debug. Also check whether you really need warn: it's OK if you need it for a first deployment, but after that they will most likely just spam without anyone looking at them.
if (attempt < DOMAIN_CHECK_RETRIES) {
  await sleep(RETRY_DELAY_MS * attempt);
}
Garmr has support for retries; I would not bother detecting what the error is. If we want to check 3 times, we just set maxAttempts on Garmr and do a simple fetch.
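Garmr's actual API isn't shown in this thread, but the retry-with-maxAttempts idea it presumably wraps looks roughly like this stand-in sketch (names and options are illustrative):

```typescript
// Generic retry helper: no inspection of the error type, just bounded
// attempts with a linear backoff between them.
async function withRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts, delayMs }: { maxAttempts: number; delayMs: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}
```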
const existingCompany = await con
  .getRepository(Company)
  .createQueryBuilder('company')
  .where(':domain = ANY(company.domains)', { domain: validatedDomain })
This can be a read replica, I think?
  },
});

export const anthropicClient = process.env.ANTHROPIC_API_KEY
Does the client break on initialization if the API key is null? It seems error-prone that this variable would be null or an object depending on the env. Can it just always be:
new AnthropicClient(process.env.ANTHROPIC_API_KEY, {
  garmr: garmrAnthropicService,
})
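The "always construct" suggestion keeps the variable's type stable so callers never branch on the environment. A self-contained sketch, with a stand-in stub since the real AnthropicClient and garmr service live elsewhere in the codebase:

```typescript
// Stand-in stub: only illustrates the shape, not the real client.
class AnthropicClient {
  constructor(
    private readonly apiKey: string | undefined,
    private readonly options?: { garmr?: unknown },
  ) {}

  // A missing key surfaces at call time (or via a check like this),
  // rather than by the exported variable being null.
  isConfigured(): boolean {
    return Boolean(this.apiKey);
  }
}

// Always an AnthropicClient, never null or undefined.
const anthropicClient = new AnthropicClient(process.env.ANTHROPIC_API_KEY, {
  garmr: undefined, // garmrAnthropicService in the real code
});
```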
  experience.customCompanyName &&
  !experience.companyId
) {
  await triggerTypedEvent(logger, 'api.v1.company-enrichment', {
Could we avoid adding a new topic altogether and just run the enrichCompanyForExperience logic here?
await con.getRepository(Company).save(company);
logger.info(
  { companyId, name: englishName, type: companyType },
  'Company created',
);

await con
  .getRepository(UserExperience)
  .update({ id: experienceId }, { companyId });
Do all writes in a transaction so we don't get partial data on errors.
I'm not sure I understand. Why would we get partial data on a single update? And it's not a case where we want to cancel all if one fails, imo.
You want to stop if any of them fail in a batch?
I don't think we necessarily need to, since if one company/experience doesn't get verified/related, it doesn't really matter for the rest of them. If we stop and retry, we'll have to redo the LLM API calls for all of them. If we get failures, wouldn't it just be better to do a "full run" at the end that picks up all the stragglers?
I am not sure we understand each other. I am talking about writing everything company-related in a transaction, be that a batch of multiple companies (from the script) or this one, to avoid writing partial or incomplete data when an error happens.
You can then control how you want to handle failures: terminate the transaction, or continue by catching the error.
Either way, up to you; try it and you will see if it's slow or you have issues with enrichment.
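The shape being suggested here is TypeORM's con.transaction, where both writes go through the same transactional manager. A minimal sketch, with a stand-in manager interface so the example is self-contained (the real code would use TypeORM's EntityManager):

```typescript
// Stand-in for TypeORM's transactional EntityManager.
interface TxManager {
  save(entity: string, data: Record<string, unknown>): Promise<void>;
  update(
    entity: string,
    criteria: Record<string, unknown>,
    patch: Record<string, unknown>,
  ): Promise<void>;
}

// Both writes share one manager, so an error in either rolls back the
// other instead of leaving a Company without its UserExperience link.
async function enrichInTransaction(
  manager: TxManager,
  company: { id: string; name: string },
  experienceId: string,
): Promise<void> {
  await manager.save('Company', company);
  await manager.update(
    'UserExperience',
    { id: experienceId },
    { companyId: company.id },
  );
}
```

In the real code this would run as `con.transaction((manager) => enrichInTransaction(...))`, per TypeORM's transaction API.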
  subscription: 'api.parse-cv-profile',
},
{
  topic: 'api.v1.company-enrichment',
If you end up using this, you need to add it to the streams repo.
const limitArgument = process.argv[2];
const offsetArgument = process.argv[3];

if (!limitArgument || !offsetArgument) {
  throw new Error('limit and offset arguments are required');
}

const limit = +limitArgument;
if (Number.isNaN(limit)) {
  throw new Error('limit argument is invalid, it should be a number');
}

const offset = +offsetArgument;
if (Number.isNaN(offset)) {
  throw new Error('offset argument is invalid, it should be a number');
}
Not blocking for this PR, but there is a really cool parseArgs and zod pattern so you can easily validate; something for your future bin scripts, check the importProfileFromJSON script.
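A stdlib-only sketch of that pattern using node:util's parseArgs (the importProfileFromJSON script reportedly pairs it with zod, which is omitted here to avoid the dependency):

```typescript
import { parseArgs } from 'node:util';

// Validate the two positional arguments (limit, offset) up front,
// instead of checking process.argv slots by hand.
function parseLimitOffset(args: string[]): { limit: number; offset: number } {
  const { positionals } = parseArgs({ args, allowPositionals: true });
  const [limitArgument, offsetArgument] = positionals;
  if (!limitArgument || !offsetArgument) {
    throw new Error('limit and offset arguments are required');
  }
  const limit = Number(limitArgument);
  const offset = Number(offsetArgument);
  if (!Number.isInteger(limit) || !Number.isInteger(offset)) {
    throw new Error('limit and offset must be integers');
  }
  return { limit, offset };
}
```

A zod version would replace the manual checks with a schema like `z.coerce.number().int()` parsed over the positionals.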
await con
  .getRepository(UserExperience)
  .update({ id: experience.id }, { companyId: exactMatch.id });
Also see about doing the batch in a transaction. Updating user experience can have an impact on CDC, so start slow and monitor.
This reverts commit 414d914.
Enrich companies/schools using Claude to see if it has details on them.
Added "altName" to the company table so that we can preserve both the English and native names. We can use it when searching companies (LinkedIn does this too), and potentially for other things as well.
When enriching, we replace the user's typed-in name with the "English" name. In the example below, I type in the Norwegian name for my school, "Handelshøyskolen BI", but after enrichment it uses the English name.
Screen.Recording.2025-11-28.at.14.57.08.mov