# Data Analyst Skills Analysis: Job Data Collection (JSearch API)

## 0.1 Project goals and objectives

**Goal:**  
Collect job posting data for *data analyst* and *junior data analyst* roles across major English-speaking markets using the **JSearch API** (via RapidAPI).

**Key objectives:**
- Define a reusable function to search for jobs with pagination;
- Configure target countries, locations, and roles;
- Fetch and convert API responses into DataFrames;
- Combine and save the dataset to CSV;
- Run sanity checks on the collected data.

## 0.2 Data description

**Source:** Job postings from the **JSearch API** ([RapidAPI](https://rapidapi.com/letscrape-6bRBa3QguO5/api/jsearch)), which aggregates listings from Google for Jobs, pulling data from LinkedIn, Indeed, Glassdoor, ZipRecruiter, and other job boards.

**Markets:** USA, United Kingdom, Canada. These countries were chosen because:
- All postings are in English, enabling consistent skill extraction via keyword matching;
- They represent the largest English-speaking job markets for data analysts;
- JSearch provides good coverage for these regions.

**Roles:** *data analyst*, *junior data analyst*.

**Time window:** Only postings from the last month are collected (18 Jan – 17 Feb 2026). Since the goal is to identify *currently* in-demand skills rather than track trends over time, a one-month snapshot provides a representative and up-to-date picture of the market.

**Location granularity:** Queries are split by major cities and states to maximise coverage, since Google for Jobs returns ~50–100 results per unique query.

**Output:**
- `jsearch_all_countries_roles.csv`: the combined, deduplicated dataset with all collected jobs.

## 0.3 Project structure

- [1. Defining the job search function](#section-1)
- [2. Defining target locations, roles and collection settings](#section-2)
- [3. Converting API results into a DataFrame](#section-3)
- [4. Fetching data for all location-role combinations and saving to CSV](#section-4)
- [5. Quick sanity checks on the combined dataset](#section-5)
- [6. Summary](#section-6)

In [None]:
import os
import time
import requests
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

RAPIDAPI_KEY = os.getenv("RAPIDAPI_KEY")

if not RAPIDAPI_KEY:
    raise ValueError(
        "RAPIDAPI_KEY is not set. Add it to your .env file before running this notebook."
    )

JSEARCH_HOST = "jsearch.p.rapidapi.com"
JSEARCH_URL = "https://jsearch.p.rapidapi.com/search"

HEADERS = {
    "X-RapidAPI-Key": RAPIDAPI_KEY,
    "X-RapidAPI-Host": JSEARCH_HOST,
}

print("API key loaded successfully.")

API key loaded successfully.


<a id="section-1"></a>

## 1. Defining the job search function

We define a reusable function `fetch_jsearch_jobs` that sends GET requests to the JSearch `/search` endpoint.

**How it works:**
1. Takes a `query` string (e.g. `"data analyst in New York"`) and an optional ISO `country` code;
2. Requests results in batches of `num_pages` pages per call; each page returns up to 10 jobs and costs 1 quota credit;
3. Advances the `page` offset by `num_pages` after every full batch;
4. Logs the remaining quota after each request;
5. Stops on HTTP 429 (rate limit), on any other non-200 status, on an API-level error, on an empty batch, or when a batch contains fewer than `num_pages * 10` jobs;
6. Returns a flat list of all job dictionaries collected.
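
The stopping rule in the steps above can be tried offline. The sketch below is a simplified illustration, not the notebook's function: `paginate` and `make_fake_api` are hypothetical names, and the fake API simply serves a fixed number of jobs in pages of 10.

```python
from typing import Callable, Dict, List

RESULTS_PER_PAGE = 10  # page size stated in the steps above


def paginate(
    fetch_batch: Callable[[int, int], List[Dict]],
    num_pages: int = 3,
) -> List[Dict]:
    """Collect batches until one is empty or shorter than a full batch."""
    all_jobs: List[Dict] = []
    page = 1
    while True:
        jobs = fetch_batch(page, num_pages)
        if not jobs:
            break
        all_jobs.extend(jobs)
        if len(jobs) < num_pages * RESULTS_PER_PAGE:
            break  # short batch: no further pages expected
        page += num_pages  # next call starts after the pages just fetched
    return all_jobs


def make_fake_api(total_jobs: int) -> Callable[[int, int], List[Dict]]:
    """Hypothetical offline stand-in for the /search endpoint."""
    def fetch_batch(page: int, num_pages: int) -> List[Dict]:
        start = (page - 1) * RESULTS_PER_PAGE
        stop = min(start + num_pages * RESULTS_PER_PAGE, total_jobs)
        return [{"job_id": f"job-{i}"} for i in range(start, stop)]
    return fetch_batch


jobs = paginate(make_fake_api(total_jobs=56), num_pages=3)
print(len(jobs))  # 56: one full batch of 30, then a short batch of 26
```

With 56 available jobs the loop makes two calls (pages 1 and 4) and then stops, because the second batch is shorter than `num_pages * 10`.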

In [2]:
from typing import Optional


def fetch_jsearch_jobs(
    query: str,
    num_pages: int = 3,
    date_posted: str = "month",
    country: Optional[str] = None,
    pause_seconds: float = 2.0,
) -> list:
    """
    Fetch job postings from JSearch API.

    Args:
        query: Full search query, e.g. 'data analyst in New York'.
        num_pages: Pages per API call (10 results/page, each page = 1 credit).
        date_posted: Time filter — 'all', 'today', '3days', 'week', 'month'.
        country: Optional ISO country code for the 'country' API param.
        pause_seconds: Pause between pagination calls.

    Returns:
        List of job posting dictionaries.
    """
    all_jobs = []
    page = 1

    while True:
        params = {
            "query": query,
            "page": str(page),
            "num_pages": str(num_pages),
            "date_posted": date_posted,
        }
        if country:
            params["country"] = country

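        # A timeout keeps one stalled connection from hanging the whole run;
        # network errors end pagination like any other failure below.
        try:
            response = requests.get(
                JSEARCH_URL, headers=HEADERS, params=params, timeout=30
            )
        except requests.RequestException as exc:
            print(f"  [Network error] {exc}")
            break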

        # Log rate-limit headers from RapidAPI
        remaining = response.headers.get("x-ratelimit-requests-remaining", "?")
        limit = response.headers.get("x-ratelimit-requests-limit", "?")
        print(f"  [Quota] {remaining}/{limit} requests remaining")

        # On 429: stop immediately to conserve quota
        if response.status_code == 429:
            print("  [Rate limit 429] Stopping to save quota.")
            break

        if response.status_code != 200:
            print(f"  [HTTP {response.status_code}] {response.text[:200]}")
            break

        data = response.json()

        if data.get("status") != "OK":
            error_msg = data.get("error", {}).get("message", "Unknown error")
            print(f"  [API Error] {error_msg}")
            break

        jobs = data.get("data", [])

        if not jobs:
            print(f"  No more results at page {page}, stopping.")
            break

        all_jobs.extend(jobs)
        print(
            f"  Page {page} (num_pages={num_pages}): "
            f"received {len(jobs)} jobs (total: {len(all_jobs)})"
        )

        # If fewer results than expected, we've reached the end
        if len(jobs) < num_pages * 10:
            break

        # Move to the next batch of pages
        page += num_pages
        time.sleep(pause_seconds)

    return all_jobs

<a id="section-2"></a>

## 2. Defining target locations, roles and collection settings

To maximise the number of collected postings, we query by **individual cities and states** instead of whole countries. Google for Jobs returns ~50–100 results per unique query, so splitting by location increases the total yield. Duplicates across overlapping queries are removed by `job_id` at the end.

**Countries:** USA (10 locations), United Kingdom (5), Canada (3).
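
As a rough illustration of how the location and role lists expand into queries, the sketch below builds the query strings for a shortened, illustrative subset and estimates the minimum credit cost. The variable names are placeholders, and the estimate assumes one call per combination at 1 credit per page.

```python
from itertools import product

# Shortened, illustrative subset of the location and role lists
locations = [("us", "New York"), ("gb", "London"), ("ca", "Toronto")]
roles = ["data analyst", "junior data analyst"]
num_pages = 3  # pages requested per API call, 1 credit each

# One (query, country) pair per location-role combination
queries = [
    (f"{role} in {city}", country)
    for (country, city), role in product(locations, roles)
]

# Lower bound: every combination costs at least one call of num_pages credits
min_credits = len(queries) * num_pages

print(queries[0])                 # ('data analyst in New York', 'us')
print(len(queries), min_credits)  # 6 18
```

By the same arithmetic, the full set of 18 locations and 2 roles needs at least 36 × 3 = 108 credits; combinations that return a full first batch trigger further calls and cost more.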


In [3]:
# Each tuple: (country_code, country_name, query_location, api_country)
#   query_location: city/state appended to role → "data analyst in {location}"
#   api_country:    ISO code passed as the 'country' API parameter

LOCATIONS = [
    # USA — 10 major states / metro areas
    ("us", "USA", "New York", "us"),
    ("us", "USA", "California", "us"),
    ("us", "USA", "Texas", "us"),
    ("us", "USA", "Illinois", "us"),
    ("us", "USA", "Florida", "us"),
    ("us", "USA", "Virginia", "us"),
    ("us", "USA", "Massachusetts", "us"),
    ("us", "USA", "Georgia", "us"),
    ("us", "USA", "Colorado", "us"),
    ("us", "USA", "remote USA", "us"),

    # United Kingdom — 5 key locations
    ("gb", "United Kingdom", "London", "gb"),
    ("gb", "United Kingdom", "Manchester", "gb"),
    ("gb", "United Kingdom", "Birmingham UK", "gb"),
    ("gb", "United Kingdom", "Edinburgh", "gb"),
    ("gb", "United Kingdom", "remote UK", "gb"),

    # Canada — 3 key locations
    ("ca", "Canada", "Toronto", "ca"),
    ("ca", "Canada", "Vancouver", "ca"),
    ("ca", "Canada", "remote Canada", "ca"),
]

ROLES = ["data analyst", "junior data analyst"]

NUM_PAGES = 3        # pages per API call (10 results/page, 1 credit/page)
DATE_POSTED = "month"  # only jobs posted within the last month

# --- Budget summary ---
total_combos = len(LOCATIONS) * len(ROLES)

print(f"Locations:            {len(LOCATIONS)}")
print(f"Roles:                {len(ROLES)}")
print(f"Total combinations:   {total_combos}")
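# Ceiling for the first API call of each combination only: fetch_jsearch_jobs
# keeps paginating while batches come back full, so the final total can exceed it.
print(f"Max possible results: {total_combos * NUM_PAGES * 10}")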

Locations:            18
Roles:                2
Total combinations:   36
Max possible results: 1080


<a id="section-3"></a>

## 3. Converting API results into a DataFrame

We define a `jsearch_to_dataframe` function that converts the raw list of job dictionaries into a pandas DataFrame. It keeps only the fields relevant to our analysis and adds helper columns (`country_code`, `country_name`, `search_role`, `data_source`).

In [4]:
KEEP_FIELDS = [
    "job_id",
    "job_title",
    "job_description",
    "employer_name",
    "employer_website",
    "employer_company_type",
    "job_publisher",
    "job_employment_type",
    "job_is_remote",
    "job_apply_link",
    "job_city",
    "job_state",
    "job_country",
    "job_latitude",
    "job_longitude",
    "job_posted_at_datetime_utc",
    "job_min_salary",
    "job_max_salary",
    "job_salary_currency",
    "job_salary_period",
    "job_required_experience",
    "job_required_skills",
    "job_required_education",
    "job_highlights",
    "job_posting_language",
]


def jsearch_to_dataframe(
    jobs: list,
    country_code: str,
    country_name: str,
    role_query: str,
) -> pd.DataFrame:
    """
    Convert a list of JSearch job results into a pandas DataFrame
    and add helper columns for country and role.
    """
    if not jobs:
        return pd.DataFrame()

    df = pd.DataFrame(jobs)

    # Keep only relevant fields (skip any that are missing)
    available = [col for col in KEEP_FIELDS if col in df.columns]
    df = df[available].copy()

    # Add helper columns
    df["country_code"] = country_code
    df["country_name"] = country_name
    df["search_role"] = role_query
    df["data_source"] = "jsearch"

    return df

<a id="section-4"></a>

## 4. Fetching data for all location-role combinations and saving to CSV

We loop over all location-role combinations, call `fetch_jsearch_jobs` for each, and combine the results. Since different city queries within the same country may return overlapping postings, we deduplicate by `job_id` at the end. The combined dataset is saved as a single CSV file.

A 10-second pause between combos helps avoid hitting the hourly rate limit.
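
The effect of keeping the first occurrence per `job_id` can be seen on a toy example. This is a plain-Python analogue of `drop_duplicates(subset=["job_id"], keep="first")`, not the pandas call itself, and the IDs are made up.

```python
# Hypothetical rows in collection order: the role loop queries
# "data analyst" before "junior data analyst" for each location.
rows = [
    {"job_id": "A", "search_role": "data analyst"},
    {"job_id": "B", "search_role": "data analyst"},
    {"job_id": "A", "search_role": "junior data analyst"},  # same posting again
    {"job_id": "C", "search_role": "junior data analyst"},
]

first_seen = {}
for row in rows:
    # setdefault stores the row only if this job_id has not been seen yet
    first_seen.setdefault(row["job_id"], row)

unique_rows = list(first_seen.values())

print(len(rows) - len(unique_rows))    # 1 duplicate removed
print(first_seen["A"]["search_role"])  # data analyst
```

A posting returned by both role queries therefore keeps the `search_role` of whichever query returned it first, so per-role counts describe the first matching query rather than every matching query.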

In [5]:
RAW_DATA_DIR = "../data/raw"
os.makedirs(RAW_DATA_DIR, exist_ok=True)

SAVE_PATH = os.path.join(RAW_DATA_DIR, "jsearch_all_countries_roles.csv")
PAUSE_BETWEEN_COMBOS = 10

all_dfs = []
total_combos = len(LOCATIONS) * len(ROLES)
combo_num = 0

for country_code, country_name, location, api_country in LOCATIONS:
    for role in ROLES:
        combo_num += 1
        query = f"{role} in {location}"

        print(f"\n{'='*60}")
        print(f"[{combo_num}/{total_combos}] '{role}' in {location} ({country_code})")
        print(f"{'='*60}")

        jobs = fetch_jsearch_jobs(
            query=query,
            num_pages=NUM_PAGES,
            date_posted=DATE_POSTED,
            country=api_country,
        )

        df = jsearch_to_dataframe(
            jobs=jobs,
            country_code=country_code,
            country_name=country_name,
            role_query=role,
        )

        print(f"  Rows received: {len(df)}")
        if not df.empty:
            all_dfs.append(df)

        print(f"  Pausing {PAUSE_BETWEEN_COMBOS}s...")
        time.sleep(PAUSE_BETWEEN_COMBOS)

print(f"\n{'='*60}")
print(f"Collection complete. Combinations processed: {combo_num}")
print(f"{'='*60}")

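# Combine and deduplicate.
# keep="first" retains the row from the earliest query that returned a job,
# so a posting matched by both roles keeps the first role's search_role.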
if all_dfs:
    full_df = pd.concat(all_dfs, ignore_index=True)
    before_dedup = len(full_df)
    full_df = full_df.drop_duplicates(subset=["job_id"], keep="first")
    after_dedup = len(full_df)

    print(f"\nTotal rows before dedup: {before_dedup}")
    print(f"Duplicates removed:     {before_dedup - after_dedup}")
    print(f"Unique jobs kept:       {after_dedup}")

    full_df.to_csv(SAVE_PATH, index=False)
    print(f"\nSaved to: {SAVE_PATH}")
else:
    print("\nNo data collected.")


[1/36] 'data analyst' in New York (us)
  [Quota] 197/200 requests remaining
  Page 1 (num_pages=3): received 29 jobs (total: 29)
  Rows received: 29
  Pausing 10s...

[2/36] 'junior data analyst' in New York (us)
  [Quota] 194/200 requests remaining
  Page 1 (num_pages=3): received 29 jobs (total: 29)
  Rows received: 29
  Pausing 10s...

[3/36] 'data analyst' in California (us)
  [Quota] 191/200 requests remaining
  Page 1 (num_pages=3): received 29 jobs (total: 29)
  Rows received: 29
  Pausing 10s...

[4/36] 'junior data analyst' in California (us)
  [Quota] 188/200 requests remaining
  Page 1 (num_pages=3): received 29 jobs (total: 29)
  Rows received: 29
  Pausing 10s...

[5/36] 'data analyst' in Texas (us)
  [Quota] 185/200 requests remaining
  Page 1 (num_pages=3): received 30 jobs (total: 30)
  [Quota] 179/200 requests remaining
  Page 4 (num_pages=3): received 26 jobs (total: 56)
  Rows received: 56
  Pausing 10s...

[6/36] 'junior data analyst' in Texas (us)
  [Quota] 176/20

<a id="section-5"></a>

## 5. Quick sanity checks on the combined dataset

We load the saved CSV and inspect:
- Shape (rows and columns);
- First few rows;
- Job counts per country and role;
- Date range of collected postings.

In [17]:
combined_path = os.path.join(RAW_DATA_DIR, "jsearch_all_countries_roles.csv")
combined_df = pd.read_csv(combined_path)

print(f"Combined dataset shape: {combined_df.shape}")
print(f"\nColumns: {list(combined_df.columns)}")

Combined dataset shape: (1228, 23)

Columns: ['job_id', 'job_title', 'job_description', 'employer_name', 'employer_website', 'job_publisher', 'job_employment_type', 'job_is_remote', 'job_apply_link', 'job_city', 'job_state', 'job_country', 'job_latitude', 'job_longitude', 'job_posted_at_datetime_utc', 'job_min_salary', 'job_max_salary', 'job_salary_period', 'job_highlights', 'country_code', 'country_name', 'search_role', 'data_source']


In [18]:
combined_df.head(3)

| | job_id | job_title | job_description | employer_name | employer_website | job_publisher | job_employment_type | job_is_remote | job_apply_link | job_city | ... | job_longitude | job_posted_at_datetime_utc | job_min_salary | job_max_salary | job_salary_period | job_highlights | country_code | country_name | search_role | data_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hDcRNAz4ev12li_ZAAAAAA== | Data Analyst | Join a newly created team dedicated to the Dis... | Disney Direct to Consumer | https://disney.fandom.com | Disney Careers | Full-time | False | https://www.disneycareers.com/en/job/new-york/... | New York | ... | -74.005973 | 2026-02-12T00:00:00.000Z | | | | {'Qualifications': ['3+ years of relevant expe... | us | USA | data analyst | jsearch |
| 1 | YBGg4U6cLzfYcc2VAAAAAA== | Entry Level Human Resources & Data Analyst | Top-Tier Bank in Midtown, Manhattan is seeking... | Social Capital Resources | | LinkedIn | Contractor | False | https://www.linkedin.com/jobs/view/entry-level... | New York | ... | -74.005973 | 2026-02-17T17:00:00.000Z | 25.0 | 32.0 | HOUR | {'Qualifications': ['0-1 years previous experi... | us | USA | data analyst | jsearch |
| 2 | wyxH_fGPJReWty01AAAAAA== | Associate Data Analyst, DX Research | Overview\n\nAbout DX\n\nDX is one of the faste... | Atlassian | https://www.atlassian.com | LinkedIn | Full-time | True | https://www.linkedin.com/jobs/view/associate-d... | New York | ... | -74.005973 | 2026-02-14T00:00:00.000Z | | | | {'Qualifications': ['4+ years of experience in... | us | USA | data analyst | jsearch |
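
In the preview, `job_highlights` appears as the text form of a Python dict, which is how nested objects end up in a CSV after a round trip. A minimal sketch of turning such a cell back into a dict follows; the sample value and the `parse_highlights` helper are hypothetical, and `ast.literal_eval` is used because it only evaluates literals.

```python
import ast

# Hypothetical cell value, shaped like the job_highlights preview
raw_value = (
    "{'Qualifications': ['3+ years of relevant experience', 'SQL'], "
    "'Benefits': ['Health insurance']}"
)


def parse_highlights(value):
    """Return the dict encoded in a CSV cell, or {} for missing/invalid cells."""
    if not isinstance(value, str):
        return {}  # missing values read by pandas are floats (NaN), not strings
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


highlights = parse_highlights(raw_value)
print(highlights["Qualifications"])
```

Applied to the loaded data, this would look like `combined_df["job_highlights"].apply(parse_highlights)`.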


In [19]:
# Jobs count per country and role
print("Jobs per country and role:")
print(
    combined_df
    .groupby(["country_code", "search_role"])
    .size()
    .reset_index(name="n_jobs")
    .to_string(index=False)
)

Jobs per country and role:
country_code         search_role  n_jobs
          ca        data analyst     131
          ca junior data analyst      80
          gb        data analyst     138
          gb junior data analyst     139
          us        data analyst     471
          us junior data analyst     269


In [20]:
# Date range of collected postings
combined_df["job_posted_at_datetime_utc"] = pd.to_datetime(
    combined_df["job_posted_at_datetime_utc"]
)
print(f"Earliest posting: {combined_df['job_posted_at_datetime_utc'].min()}")
print(f"Latest posting:   {combined_df['job_posted_at_datetime_utc'].max()}")

Earliest posting: 2026-01-18 00:00:00+00:00
Latest posting:   2026-02-17 18:00:00+00:00
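
The same check can be written as an explicit test against the collection window from the data description (18 Jan to 17 Feb 2026). The sketch below uses only the standard library and three timestamps copied from the preview rows; `parse_utc` is a hypothetical helper, and the upper bound is taken as exclusive midnight on 18 Feb.

```python
from datetime import datetime, timezone

# Window stated in the data description
WINDOW_START = datetime(2026, 1, 18, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 2, 18, tzinfo=timezone.utc)  # exclusive upper bound

# Timestamps copied from the preview rows
samples = [
    "2026-02-12T00:00:00.000Z",
    "2026-02-17T17:00:00.000Z",
    "2026-02-14T00:00:00.000Z",
]


def parse_utc(ts: str) -> datetime:
    """Parse the 'YYYY-MM-DDTHH:MM:SS.fffZ' format of job_posted_at_datetime_utc."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# Collect any timestamp that falls outside the window
outside = [ts for ts in samples if not WINDOW_START <= parse_utc(ts) < WINDOW_END]
print(outside)  # [] when every timestamp is inside the window
```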


<a id="section-6"></a>

## 6. Summary

During the data collection stage, the following steps were performed:

- **Fetch function:** A `fetch_jsearch_jobs` function was defined to request job postings from the JSearch API (via RapidAPI), with pagination, real-time quota logging, and automatic stop on HTTP 429 rate limits.
- **Collection settings:** Target countries (USA, UK, Canada) and roles (*data analyst*, *junior data analyst*) were defined. Queries were split by major cities and states within each country to maximise coverage, since Google for Jobs returns ~50–100 results per unique query.
- **Result conversion:** A `jsearch_to_dataframe` function was defined to convert the raw API response list into a pandas DataFrame and to add columns `country_code`, `country_name`, `search_role`, and `data_source` for later filtering and grouping.
- **Data collection and saving:** For each location-role combination, job postings were fetched, converted to a DataFrame, and combined. Duplicates from overlapping city queries were removed by `job_id`. A single combined file, `jsearch_all_countries_roles.csv`, was saved for use in the next notebook.
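- **Sanity checks:** The combined dataset was loaded and briefly inspected (shape and columns, first rows, counts by country and role, and the date range of postings) to confirm that collection completed as expected. The dataset contains 1,228 unique postings dated between 18 Jan and 17 Feb 2026.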

The raw data is stored in `data/raw/` and is ready for cleaning and skill extraction in the next notebook.