<a href="https://colab.research.google.com/github/alexfazio/firecrawl-quickstart/blob/main/job_scraping_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Job Board Scraping with Firecrawl and OpenAI

By Alex Fazio (https://twitter.com/alxfazio)

GitHub repo: https://github.com/alexfazio/firecrawl-cookbook

This Jupyter notebook demonstrates how to build an automated job scraping pipeline using Firecrawl and OpenAI. By combining Firecrawl's web scraping capabilities with OpenAI's Structured Outputs feature, we can efficiently extract and analyze job listings to find the best matches for your skills.

Structured Outputs is a powerful capability that ensures the model will always generate responses that adhere to our specified JSON Schema. This means we can:
- Extract job details with guaranteed schema compliance
- Get reliable, structured responses every time
- Process the data efficiently without worrying about format inconsistencies
- Build robust pipelines for job matching and analysis

By the end of this notebook, you'll be able to:

1. Set up a scraping environment with Firecrawl and OpenAI
2. Extract structured data from job boards using Firecrawl
3. Use OpenAI models with Structured Outputs to analyze job listings and match them with your resume
4. Process job data at scale with reliable, schema-validated outputs
5. Build type-safe applications using Pydantic models with OpenAI's responses

This cookbook is designed for developers who want to automate their job search and leverage AI for better job matching, while maintaining robust data structures and type safety throughout their application.

Note: The Structured Outputs feature requires specific OpenAI models (gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06, or later). The cells in this notebook call gpt-4, which does not support it, so they ask for JSON in the prompt and parse the reply instead.
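
To make the idea concrete, here is a minimal sketch of the `response_format` payload that Structured Outputs expects, built for the recommendation shape used later in this notebook. The helper `build_response_format` and the schema name `job_recommendations` are made up for this example, and the payload is only constructed and printed here, not sent to the API.

```python
import json

def build_response_format(name, fields):
    """Build a strict json_schema response_format for string-only fields.

    Hypothetical helper. Strict mode expects every property to be listed in
    `required` and `additionalProperties` to be false on each object.
    """
    item_schema = {
        "type": "object",
        "properties": {field: {"type": "string"} for field in fields},
        "required": list(fields),
        "additionalProperties": False,
    }
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "recommendations": {"type": "array", "items": item_schema}
                },
                "required": ["recommendations"],
                "additionalProperties": False,
            },
        },
    }

response_format = build_response_format(
    "job_recommendations", ["job_title", "compensation", "apply_link"]
)
print(json.dumps(response_format, indent=2))
```

With a model that supports Structured Outputs, a dictionary like this would be passed as the `response_format` argument of `client.chat.completions.create`.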

## Requirements

Before proceeding, ensure you have:

- Python 3.9 or higher (the Pydantic model in Step 6 uses the built-in `list[...]` annotation)
- API keys for both Firecrawl and OpenAI
- Required Python packages

First, let's install the necessary packages:

In [1]:
%pip install requests python-dotenv openai --quiet


## Step 1: Set Up Your Environment

Let's set up our environment variables. In Google Colab, we'll use direct input for API keys instead of a .env file:

In [7]:
import os
import requests
import json
from getpass import getpass
from openai import OpenAI

# Securely get API keys
firecrawl_api_key = getpass("Enter your Firecrawl API key: ")
openai_api_key = getpass("Enter your OpenAI API key: ")

# Initialize OpenAI client
client = OpenAI(api_key=openai_api_key)

Enter your Firecrawl API key: ··········
Enter your OpenAI API key: ··········
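
Outside Colab, you may prefer to read keys from environment variables and prompt only when they are missing. The sketch below is one possible way to do that; `load_api_key` and the variable name `EXAMPLE_API_KEY` are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass

def load_api_key(env_var, prompt, ask=getpass):
    """Return the key from the environment, or ask for it interactively."""
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    # Fall back to a hidden prompt, as the cell above does
    return ask(prompt).strip()

# Demo with a placeholder value so the example runs without prompting
os.environ["EXAMPLE_API_KEY"] = "placeholder-key"
print(load_api_key("EXAMPLE_API_KEY", "Enter your API key: "))
```

The `ask` parameter exists only so the prompt can be swapped out in tests or scripts.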


## Step 2: Define the Jobs Page URL and Resume

Now, let's specify the job board URL and your resume content:

In [8]:
# @title Job Search Configuration

jobs_page_url = "https://openai.com/careers/search"  # @param {type:"string"}
resume_paste = """
**John Doe**
123 Main Street, Anytown, USA
(123) 456-7890 | john.doe@email.com | [linkedin.com/in/johndoe](https://linkedin.com/in/johndoe) | [github.com/johndoe](https://github.com/johndoe)

---

### Objective
Passionate and motivated Machine Learning Engineer with a strong foundation in computer science, statistics, and mathematics. Eager to apply data-driven solutions to real-world problems. Seeking an entry-level position where I can leverage my skills in machine learning, data analysis, and software development.

---

### Education
**Bachelor of Science in Computer Science**
University of California, Berkeley, CA — May 2024
- Relevant coursework: Machine Learning, Data Structures and Algorithms, Probability and Statistics, Linear Algebra, Artificial Intelligence, Database Systems

---

### Skills
- **Programming Languages:** Python, R, Java, C++
- **Machine Learning:** Scikit-learn, TensorFlow, Keras, PyTorch, XGBoost
- **Data Analysis:** Pandas, NumPy, Matplotlib, Seaborn
- **Tools & Technologies:** Git, Docker, Jupyter Notebooks, SQL, AWS (S3, EC2)
- **Mathematics:** Linear Algebra, Probability, Statistics, Calculus
- **Software Development:** Agile methodologies, version control (Git)

---

### Experience

**Machine Learning Intern**
Tech Innovations Inc., San Francisco, CA — June 2023 to August 2023
- Assisted in building and deploying predictive models for customer segmentation, achieving a 20% increase in targeted marketing efficiency.
- Preprocessed large datasets using Pandas and NumPy to clean, normalize, and handle missing values.
- Developed machine learning algorithms in Python with Scikit-learn and TensorFlow, achieving up to 85% accuracy on classification tasks.
- Created visualizations using Matplotlib and Seaborn to communicate data insights to stakeholders.

**Data Science Project (Academic)**
University of California, Berkeley — January 2024
- Built a recommendation system for an e-commerce platform using collaborative filtering techniques.
- Conducted exploratory data analysis to identify key user behavior trends and improve the model's performance.
- Implemented the model in Python using Scikit-learn, resulting in a 30% increase in user engagement metrics.

---

### Projects

**Spam Detection Using Natural Language Processing (NLP)**
- Implemented a spam detection system using a Naive Bayes classifier and TF-IDF vectorization.
- Achieved an accuracy of 92% on a public spam dataset.
- Utilized NLTK for text preprocessing, including tokenization, stopword removal, and stemming.

**Predicting House Prices Using Regression Analysis**
- Built a multiple linear regression model to predict house prices based on various features such as location, size, and number of bedrooms.
- Utilized Python libraries (Pandas, NumPy, Scikit-learn) to preprocess data and evaluate the model's performance.
- Optimized the model using regularization techniques to prevent overfitting.

---

### Certifications
- **Machine Learning Specialization** – Coursera (Andrew Ng, September 2023)
- **Deep Learning Specialization** – Coursera (Andrew Ng, October 2023)
- **AWS Certified Solutions Architect – Associate** – Amazon Web Services (November 2023)

---

### Additional Activities
- **Hackathons:** Participated in HackUC, developing a real-time object detection application using TensorFlow.
- **AI Club Member:** Actively involved in UC Berkeley's AI Club, organizing workshops and study sessions on machine learning topics.

---

### Languages
- English (Fluent)
- Spanish (Proficient)
"""

# Reject values that are empty or contain only whitespace
assert jobs_page_url.strip(), "Error: jobs_page_url should not be empty"
assert resume_paste.strip(), "Error: resume_paste should not be empty"

## Step 3: Scrape the Jobs Page

Let's use Firecrawl to extract content from the jobs page:

In [9]:
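def scrape_jobs_page(url):
    """Scrape a page with Firecrawl and return its markdown ("" on failure)."""
    try:
        response = requests.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {firecrawl_api_key}"
            },
            json={
                "url": url,
                "formats": ["markdown"]
            },
            timeout=60  # avoid hanging forever on a stalled request
        )
        if response.status_code == 200:
            result = response.json()
            if result.get('success'):
                return result['data']['markdown']
        # Report the failure instead of silently returning an empty string
        print(f"Scrape failed: HTTP {response.status_code}, body: {response.text[:200]}")
        return ""
    except Exception as e:
        print(f"Error scraping jobs page: {e}")
        return ""

# Scrape the jobs page (the variable holds markdown, despite its name)
html_content = scrape_jobs_page(jobs_page_url)
print(f"Scraped content length: {len(html_content)} characters")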

Scraped content length: 409 characters
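
A very short result can be hard to interpret when failures and empty pages both come back as an empty string. As a rough sketch, the response handling could be split into a small function that also says why nothing was returned. The function name `markdown_from_scrape` and the sample payloads are invented for illustration; only the `success` and `data.markdown` keys come from the cell above.

```python
def markdown_from_scrape(status_code, payload):
    """Return (markdown, problem) for a scrape response shaped like the one above."""
    if status_code != 200:
        return "", f"HTTP {status_code}"
    if not payload.get("success"):
        return "", "response did not report success"
    markdown = payload.get("data", {}).get("markdown", "")
    if not markdown:
        return "", "no markdown in response"
    return markdown, None

# Made-up sample responses covering the three outcomes
samples = [
    (200, {"success": True, "data": {"markdown": "# Careers"}}),
    (402, {"success": False}),
    (200, {"success": True, "data": {}}),
]
for status, payload in samples:
    print(markdown_from_scrape(status, payload))
```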


## Step 4: Extract Job Links

Use OpenAI's model to extract application links from the scraped content:

In [10]:
def extract_apply_links(content):
    if not content:
        return []

    prompt = f"""
Extract up to 30 job application links from the given markdown content.
Return the result as a JSON object with a single key 'apply_links' containing an array of strings (the links).
The output should be a valid JSON object, with no additional text.

Markdown content:
{content[:100000]}
"""

    try:
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )
        if completion.choices:
            result = json.loads(completion.choices[0].message.content.strip())
            return result['apply_links']
    except Exception as e:
        print(f"Error extracting links: {str(e)}")
    return []

# Extract the links
apply_links = extract_apply_links(html_content)
print(f"Found {len(apply_links)} job listings")

Found 2 job listings
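
Because this step relies on the prompt alone to get JSON back, the reply may occasionally arrive wrapped in a markdown code fence or as plain text. The following sketch shows one way to tolerate that before calling `json.loads`. The helpers `parse_model_json` and `links_from_reply` are assumptions for this example, not part of the original notebook.

```python
import json

FENCE = "`" * 3  # markdown code fence marker

def parse_model_json(text):
    """Parse JSON from a model reply, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        # Drop the opening fence line (it may carry a language tag)
        # and the closing fence.
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1 : -len(FENCE)].strip()
    return json.loads(cleaned)

def links_from_reply(text):
    """Return the 'apply_links' list, or [] if the reply is not usable."""
    try:
        data = parse_model_json(text)
    except json.JSONDecodeError:
        return []
    links = data.get("apply_links") if isinstance(data, dict) else None
    if not isinstance(links, list):
        return []
    return [link for link in links if isinstance(link, str)]

fenced = FENCE + 'json\n{"apply_links": ["https://example.com/jobs/1"]}\n' + FENCE
print(links_from_reply(fenced))
print(links_from_reply("Sorry, I could not find any links."))
```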


## Step 5: Extract Details from Each Job

Now let's get detailed information about each job using Firecrawl's extraction capabilities:

In [11]:
# Define the extraction schema
schema = {
    "type": "object",
    "properties": {
        "job_title": {"type": "string"},
        "sub_division_of_organization": {"type": "string"},
        "key_skills": {"type": "array", "items": {"type": "string"}},
        "compensation": {"type": "string"},
        "location": {"type": "string"},
        "apply_link": {"type": "string"}
    },
    "required": ["job_title", "sub_division_of_organization", "key_skills", "compensation", "location", "apply_link"]
}

def extract_job_details(url):
    try:
        response = requests.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {firecrawl_api_key}"
            },
            json={
                "url": url,
                "formats": ["extract"],
                "actions": [{
                    "type": "click",
                    "selector": "#job-overview"
                }],
                "extract": {
                    "schema": schema
                }
            },
            timeout=120  # extraction can be slow; avoid waiting indefinitely
        )
        if response.status_code == 200:
            result = response.json()
            if result.get('success'):
                return result['data']['extract']
    except Exception as e:
        print(f"Error extracting job details: {str(e)}")
    return None

# Extract details for each job
extracted_data = []
for link in apply_links:
    job_data = extract_job_details(link)
    if job_data:
        extracted_data.append(job_data)
        print(f"Extracted details for: {job_data['job_title']}")

print(f"Successfully extracted details for {len(extracted_data)} jobs")

Extracted details for: Account Director
Extracted details for: Account Director - Japan
Successfully extracted details for 2 jobs
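
A field listed under `required` in the schema can still come back as an empty string or list. A small check like the one below can flag such records before they reach the matching step. This is a sketch: `empty_required_fields` is a hypothetical helper and the sample record is made up.

```python
def empty_required_fields(job, schema):
    """List required schema fields that are missing or empty in `job`."""
    return [
        field
        for field in schema.get("required", [])
        if job.get(field) in (None, "", [])
    ]

# Same required fields as the schema above
example_schema = {
    "required": [
        "job_title", "sub_division_of_organization", "key_skills",
        "compensation", "location", "apply_link",
    ]
}

# Made-up record with one empty field and one missing field
example_job = {
    "job_title": "Example Role",
    "sub_division_of_organization": "Example Team",
    "key_skills": ["Python"],
    "compensation": "",
    "apply_link": "/example/application",
}
print(empty_required_fields(example_job, example_schema))
```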


In [17]:
import pprint
pprint.pprint(extracted_data)

[{'apply_link': '/openai/e09c71f0-1be2-4141-8ab7-d4f38e583c7e/application',
  'compensation': '',
  'job_title': 'Account Director',
  'key_skills': ['7+ years selling platform-as-a-service and/or '
                 'software-as-a-service',
                 'Native-level Japanese language proficiency',
                 'Achieving revenue targets >$1M per year for more than 3 '
                 'years',
                 'Designing and executing complex deal strategies',
                 'Supporting the growth of fast-growing, high-performance '
                 'companies',
                 'Working directly with c-level executives',
                 'Communicating technical concepts to customers and internal '
                 'stakeholders',
                 'Leading high-visibility customer events (CAB, conferences, '
                 'product launches, etc.)',
                 'Gathering, distilling, and processing complex market '
                 '(industry, competitor, customer, 
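
The `apply_link` in the output above is a relative path, which is not directly usable as a link. One possible fix, sketched here, is to resolve it against the URL of the page it was extracted from using `urllib.parse.urljoin`. The helper name `absolutize_apply_link` and the example URLs are assumptions for illustration.

```python
from urllib.parse import urljoin

def absolutize_apply_link(job, page_url):
    """Return a copy of `job` whose apply_link is resolved against page_url."""
    resolved = dict(job)
    link = job.get("apply_link")
    if link:
        # Absolute links pass through unchanged; relative ones gain the host
        resolved["apply_link"] = urljoin(page_url, link)
    return resolved

job = {"job_title": "Example Role", "apply_link": "/acme/1234/application"}
fixed = absolutize_apply_link(job, "https://jobs.example.com/acme/1234")
print(fixed["apply_link"])
```

In the extraction loop, the `link` variable already holds the page URL for each job, so it could serve as `page_url`.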

## Step 6: Match Jobs to Resume

Finally, let's use OpenAI's model to find the best matches for your resume:

In [15]:
def get_job_recommendations(resume, jobs):
    prompt = f"""
You are a job matching assistant. Analyze the resume and job listings below, and return ONLY a JSON array containing the top 3 roles that best fit the candidate's experience and skills.

The response must be a valid JSON array that can be parsed by json.loads(). Include only the job title, compensation, and apply link for each recommended role.

Example format:
[
  {{
    "job_title": "Senior Software Engineer",
    "compensation": "$150,000 - $180,000",
    "apply_link": "https://example.com/jobs/123"
  }},
  ...
]

Resume:
{resume}

Job Listings:
{json.dumps(jobs, indent=2)}

Remember: Respond ONLY with the JSON array, no additional text or explanations.
"""

    try:
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )

        # Debug: print raw response content
        response_content = completion.choices[0].message.content.strip()
        print("\nRaw model response:")
        print(response_content)

        try:
            parsed_response = json.loads(response_content)
            return parsed_response
        except json.JSONDecodeError as json_err:
            print(f"\nJSON parsing error: {str(json_err)}")
            print(f"Failed to parse content: {response_content}")
            return []

    except Exception as e:
        print(f"\nError getting recommendations: {str(e)}")
        return []

# Get and display recommendations
recommended_jobs = get_job_recommendations(resume_paste, extracted_data)
if recommended_jobs:
    print("\nTop 3 Recommended Jobs:")
    print(json.dumps(recommended_jobs, indent=2))
else:
    print("\nNo job recommendations could be generated.")


Raw model response:
[]

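No job recommendations could be generated.

The first attempt returned an empty array. The next cell retries with a stricter prompt, adds a system message, and validates the reply against a Pydantic model: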


In [22]:
from pydantic import BaseModel
from openai import OpenAI

class JobRecommendation(BaseModel):
    class Job(BaseModel):
        job_title: str
        compensation: str
        apply_link: str

    recommendations: list[Job]

def get_job_recommendations(resume, jobs):
    print("\nDEBUG: Starting job recommendations")
    print(f"DEBUG: Number of jobs to analyze: {len(jobs)}")

    prompt = f"""
You are a job matching expert. Your task is to analyze a resume and available job listings to find the best matches.

Instructions:
1. Analyze the candidate's resume carefully
2. Review each job listing's requirements and details
3. Match the candidate's skills and experience with job requirements
4. Always return at least one recommendation if there are any jobs available
5. Use empty string for compensation if not specified in job listing
6. Use the exact apply_link from the job listing

Return your analysis as a valid JSON object with this exact structure:
{{
    "recommendations": [
        {{
            "job_title": "exact job title from listing",
            "compensation": "compensation from listing or empty string if not specified",
            "apply_link": "exact apply link from listing"
        }}
    ]
}}

Resume to analyze:
{resume}

Available job listings:
{json.dumps(jobs, indent=2)}

Remember: You must return at least one recommendation if any jobs are available, choosing the best match based on the candidate's qualifications.
"""

    try:
        print("DEBUG: Making API call...")
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a job matching expert that ALWAYS responds with valid JSON containing at least one job recommendation if jobs are available."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ]
        )

        print("DEBUG: Got API response")
        response_text = completion.choices[0].message.content
        print(f"DEBUG: Raw response:\n{response_text}")

        try:
            # Pydantic v2 API; on Pydantic v1 use JobRecommendation.parse_raw instead
            parsed = JobRecommendation.model_validate_json(response_text)
            print(f"DEBUG: Successfully parsed response into {len(parsed.recommendations)} recommendations")
            return parsed
        except Exception as parse_err:
            print(f"DEBUG: Error parsing response: {str(parse_err)}")
            raise

    except Exception as e:
        print(f"DEBUG: Error in job recommendations: {str(e)}")
        print(f"DEBUG: Error type: {type(e)}")
        return JobRecommendation(recommendations=[])

# Get and display recommendations
print("\nStarting recommendation process...")
recommended_jobs = get_job_recommendations(resume_paste, extracted_data)
if recommended_jobs and recommended_jobs.recommendations:
    print("\nRecommended Jobs:")
    for job in recommended_jobs.recommendations:
        print(f"\nTitle: {job.job_title}")
        print(f"Compensation: {job.compensation}")
        print(f"Apply Link: {job.apply_link}")
else:
    print("\nNo recommendations were generated. Could you check if the resume content is provided and not empty?")
    print(f"DEBUG: Resume content (first 100 chars): {resume_paste[:100] if resume_paste else 'Empty resume'}")


Starting recommendation process...

DEBUG: Starting job recommendations
DEBUG: Number of jobs to analyze: 2
DEBUG: Making API call...
DEBUG: Got API response
DEBUG: Raw response:
{
    "recommendations": []
}
DEBUG: Successfully parsed response into 0 recommendations

No recommendations were generated. Could you check if the resume content is provided and not empty?
DEBUG: Resume content (first 100 chars):   # @param {type:"string"}
**John Doe**  
123 Main Street, Anytown, USA  
(123) 456-7890 | john.doe@
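
The prompt asks the model to use the exact `apply_link` from each listing, but nothing in the code verifies that it did. As a minimal sketch, recommendations can be checked against the scraped jobs so that invented links are dropped. The helper `keep_known_recommendations` and the sample data are hypothetical, and plain dictionaries are used here instead of the Pydantic objects above.

```python
def keep_known_recommendations(recommendations, jobs):
    """Keep only recommendations whose apply_link appears in the scraped jobs."""
    known_links = {job.get("apply_link") for job in jobs}
    return [
        rec for rec in recommendations
        if rec.get("apply_link") in known_links
    ]

# Made-up listings and model output for illustration
jobs = [
    {"job_title": "Role A", "apply_link": "/a/application"},
    {"job_title": "Role B", "apply_link": "/b/application"},
]
recommendations = [
    {"job_title": "Role A", "compensation": "", "apply_link": "/a/application"},
    {"job_title": "Role C", "compensation": "", "apply_link": "/invented"},
]
print(keep_known_recommendations(recommendations, jobs))
```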
