Add Microsoft Careers site fetcher #18

@celloopa

Problem

When fetching job postings from apply.careers.microsoft.com, the fetcher returns incomplete/truncated content because the site uses React with client-side rendering.

What Happened

Source and result:

  • https://apply.careers.microsoft.com/careers/job/1970393556658235: only the responsibilities text was returned; location, salary, job type, and company name were missing
  • LinkedIn redirect to the same job: only a preview snippet (12 lines) was returned

The parser then extracted incorrect data:

  • company: "1970393556658235" (job ID instead of "Microsoft")
  • position: "" (empty)
  • No salary, location, or remote status

Root Cause

internal/fetch/fetcher.go has no Microsoft-specific extractor, so it falls back to extractGeneric(), which:

  1. Tries the og:description meta tag (truncated)
  2. Looks for common containers (<div class="job-description">, etc.)

Microsoft's React app has neither of these. The job data is in the <script id="__NEXT_DATA__"> JSON instead.

Proposed Solution

1. Add Microsoft extractor

// In ExtractJobPosting switch statement:
case strings.Contains(host, "careers.microsoft.com"):
    return f.extractMicrosoft(html)

// New method on Fetcher:
func (f *Fetcher) extractMicrosoft(html string) (content, company, position string) {
    // Microsoft Careers embeds job data as JSON in the __NEXT_DATA__ script tag.
    // Look for: <script id="__NEXT_DATA__" type="application/json">
    jsonData := extractBetween(html, `<script id="__NEXT_DATA__" type="application/json">`, `</script>`)
    if jsonData == "" {
        return "", "", "" // script tag not found; nothing to extract
    }

    // TODO: parse jsonData to fill content and position from:
    // - job.title
    // - job.location
    // - job.salary (if present)
    // - job.description

    company = "Microsoft" // Always Microsoft for this domain
    return content, company, position
}
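
One way the "parse rest from JSON" step could look, as a rough sketch only. The real shape of the __NEXT_DATA__ payload has not been checked, so the key names (title, location, salary, description) are taken from the comments above and are assumptions, as are the names parseNextData, findJob, and jobFields. Rather than hard-coding a path, the sketch walks the decoded JSON and picks the first object that has both a string title and a string description.

```go
package main

import (
	"encoding/json"
	"errors"
	"regexp"
)

// nextDataRe captures the body of the __NEXT_DATA__ script tag, tolerating
// extra attributes and any attribute order.
var nextDataRe = regexp.MustCompile(`(?s)<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>`)

// jobFields is a hypothetical holder for the values the issue wants extracted.
type jobFields struct {
	Title       string
	Location    string
	Salary      string
	Description string
}

// parseNextData pulls the embedded JSON out of the page and searches it for
// a job-like object. The key names are assumptions, not a verified schema.
func parseNextData(html string) (jobFields, error) {
	m := nextDataRe.FindStringSubmatch(html)
	if m == nil {
		return jobFields{}, errors.New("__NEXT_DATA__ script tag not found")
	}
	var root any
	if err := json.Unmarshal([]byte(m[1]), &root); err != nil {
		return jobFields{}, err
	}
	obj := findJob(root)
	if obj == nil {
		return jobFields{}, errors.New("no job-like object in __NEXT_DATA__")
	}
	// str returns the value at key k if it is a string, else "".
	str := func(k string) string { s, _ := obj[k].(string); return s }
	return jobFields{
		Title:       str("title"),
		Location:    str("location"),
		Salary:      str("salary"),
		Description: str("description"),
	}, nil
}

// findJob walks the decoded JSON depth-first and returns the first object
// that has string values for both "title" and "description".
// Map iteration order is random in Go, so if the payload contains several
// such objects, which one is returned is not deterministic.
func findJob(v any) map[string]any {
	switch t := v.(type) {
	case map[string]any:
		_, hasTitle := t["title"].(string)
		_, hasDesc := t["description"].(string)
		if hasTitle && hasDesc {
			return t
		}
		for _, child := range t {
			if found := findJob(child); found != nil {
				return found
			}
		}
	case []any:
		for _, child := range t {
			if found := findJob(child); found != nil {
				return found
			}
		}
	}
	return nil
}
```

extractMicrosoft could call parseNextData(html) and build content from the returned fields. Once the actual payload has been inspected, replacing the search with a typed struct for the known path would be more robust.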

2. Add content validation

In the parser agent or the fetcher, reject extractions where:

  • Company name is numeric (like job ID)
  • Position is empty
  • Content length < 500 chars

3. Test URLs

Acceptance Criteria

  • ghosted fetch <microsoft-careers-url> extracts: company, position, location, salary, remote status, full description
  • Parser agent correctly identifies Microsoft as company
  • Generated posting file has complete metadata in frontmatter
  • Add test case in internal/fetch/fetcher_test.go

Related

  • Existing extractors: Lever, Greenhouse, Workday, LinkedIn, Ashby
  • File: internal/fetch/fetcher.go:164-181
