Problem
When fetching job postings from apply.careers.microsoft.com, the fetcher returns incomplete/truncated content because the site uses React with client-side rendering.
What Happened
| Source | Result |
| --- | --- |
| https://apply.careers.microsoft.com/careers/job/1970393556658235 | Only got responsibilities text; missing location, salary, job type, company name |
| LinkedIn redirect to same job | Only got preview snippet (12 lines) |
The parser then extracted incorrect data:
- company: "1970393556658235" (job ID instead of "Microsoft")
- position: "" (empty)
- No salary, location, or remote status
Root Cause
internal/fetch/fetcher.go has no Microsoft-specific extractor. It falls back to extractGeneric(), which:
- Tries the og:description meta tag (truncated)
- Looks for common containers (<div class="job-description">, etc.)
- Finds none of these in Microsoft's React app, where the job data is in <script id="__NEXT_DATA__"> JSON
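One way to confirm this diagnosis on a saved copy of the page is to check which of the three signals the HTML contains. The following is a minimal sketch only: pageSignals and inspectPage are hypothetical names, the sample page is a made-up fixture, and the substring checks are a rough stand-in for whatever extractGeneric() actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// pageSignals records which extraction hints are present in a page.
// Hypothetical helper type, not part of the existing fetcher.
type pageSignals struct {
	HasOGDescription bool // og:description meta tag
	HasJobContainer  bool // <div class="job-description">
	HasNextData      bool // <script id="__NEXT_DATA__">
}

// inspectPage does plain substring checks; it does not parse the HTML.
func inspectPage(html string) pageSignals {
	return pageSignals{
		HasOGDescription: strings.Contains(html, `property="og:description"`),
		HasJobContainer:  strings.Contains(html, `class="job-description"`),
		HasNextData:      strings.Contains(html, `id="__NEXT_DATA__"`),
	}
}

func main() {
	// Made-up fixture imitating a client-rendered page: a meta tag, an empty
	// app root, and embedded JSON, but no job-description container.
	page := `<head><meta property="og:description" content="Responsibilities..."></head>` +
		`<body><div id="__next"></div>` +
		`<script id="__NEXT_DATA__" type="application/json">{}</script></body>`
	fmt.Printf("%+v\n", inspectPage(page))
}
```

If a real Microsoft Careers page reports the meta tag and the script tag but no container, that matches the behavior described above.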
Proposed Solution
1. Add Microsoft extractor
// In the ExtractJobPosting switch statement.
// A check for "microsoft.com" alone would already match every Microsoft host,
// so match the careers domain only:
case strings.Contains(host, "careers.microsoft.com"):
    return f.extractMicrosoft(html)

// extractMicrosoft handles Microsoft Careers pages, which embed job data
// as JSON in the __NEXT_DATA__ script tag.
func (f *Fetcher) extractMicrosoft(html string) (content, company, position string) {
    // Look for: <script id="__NEXT_DATA__" type="application/json">
    jsonData := extractBetween(html, `<script id="__NEXT_DATA__" type="application/json">`, `</script>`)
    company = "Microsoft" // Always Microsoft for this domain
    // TODO: parse jsonData to extract:
    // - job.title
    // - job.location
    // - job.salary (if present)
    // - job.description
    _ = jsonData // placeholder so the stub compiles until parsing is implemented
    return content, company, position
}
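The JSON parsing step could look roughly like the sketch below. It rests on assumptions: props.pageProps is the usual Next.js wrapper, but the job key and its title, location, salary, and description fields are hypothetical and need to be checked against a saved page. msJob, nextData, and parseMicrosoftJob are illustrative names, extractBetween is a stand-in for the helper the fetcher is assumed to have, and the sample page is a made-up fixture.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// msJob mirrors a HYPOTHETICAL payload shape; verify field names on a real page.
type msJob struct {
	Title       string `json:"title"`
	Location    string `json:"location"`
	Salary      string `json:"salary"` // stays empty if the posting has no salary
	Description string `json:"description"`
}

// nextData models only the path this sketch assumes: props.pageProps.job.
type nextData struct {
	Props struct {
		PageProps struct {
			Job msJob `json:"job"`
		} `json:"pageProps"`
	} `json:"props"`
}

// extractBetween returns the text between the first start marker and the
// next end marker, or "" if either is missing. Stand-in for the real helper.
func extractBetween(s, start, end string) string {
	i := strings.Index(s, start)
	if i < 0 {
		return ""
	}
	rest := s[i+len(start):]
	j := strings.Index(rest, end)
	if j < 0 {
		return ""
	}
	return rest[:j]
}

// parseMicrosoftJob pulls the embedded JSON out of the page and decodes it.
func parseMicrosoftJob(html string) (msJob, error) {
	raw := extractBetween(html, `<script id="__NEXT_DATA__" type="application/json">`, `</script>`)
	if raw == "" {
		return msJob{}, fmt.Errorf("__NEXT_DATA__ script tag not found")
	}
	var data nextData
	if err := json.Unmarshal([]byte(raw), &data); err != nil {
		return msJob{}, fmt.Errorf("decode __NEXT_DATA__: %w", err)
	}
	return data.Props.PageProps.Job, nil
}

func main() {
	// Made-up fixture, not a real Microsoft Careers response.
	page := `<html><script id="__NEXT_DATA__" type="application/json">` +
		`{"props":{"pageProps":{"job":{"title":"Software Engineer","location":"Redmond, WA","description":"Build things."}}}}` +
		`</script></html>`
	job, err := parseMicrosoftJob(page)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s (%s)\n", job.Title, job.Location)
}
```

Returning an error when the script tag is missing lets the caller decide whether to fall back to extractGeneric() or to report the failure.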
2. Add content validation
In the parser agent or the fetcher, reject extractions where:
- Company name is numeric (like job ID)
- Position is empty
- Content length < 500 chars
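The three rejection rules above can be sketched as a single check. This is an illustration under assumptions: validateExtraction, isNumeric, and minContentLen are hypothetical names, and whether the check belongs in the parser agent or the fetcher is left open.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// minContentLen is the 500-character threshold proposed above.
const minContentLen = 500

// isNumeric reports whether s is non-empty and consists only of digits.
func isNumeric(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

// validateExtraction returns an error listing every rule the extraction
// violates, or nil if it passes all three.
func validateExtraction(content, company, position string) error {
	var problems []string
	if isNumeric(strings.TrimSpace(company)) {
		problems = append(problems, "company name is numeric")
	}
	if strings.TrimSpace(position) == "" {
		problems = append(problems, "position is empty")
	}
	if utf8.RuneCountInString(content) < minContentLen {
		problems = append(problems, fmt.Sprintf("content shorter than %d characters", minContentLen))
	}
	if len(problems) == 0 {
		return nil
	}
	return fmt.Errorf("invalid extraction: %s", strings.Join(problems, "; "))
}

func main() {
	// The bad extraction reported in this issue: job ID as company, empty position.
	err := validateExtraction("Responsibilities...", "1970393556658235", "")
	fmt.Println(err)
}
```

Collecting all violations instead of stopping at the first makes the log show everything that went wrong with a single fetch.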
3. Test URLs
Acceptance Criteria
- ghosted fetch <microsoft-careers-url> extracts: company, position, location, salary, remote status, full description
- internal/fetch/fetcher_test.go
Related
- Existing extractors: Lever, Greenhouse, Workday, LinkedIn, Ashby
- File: internal/fetch/fetcher.go:164-181