# Job Scraping in Hong Kong using JobSpy

This notebook demonstrates how to scrape job listings from LinkedIn, Glassdoor, and Indeed for Hong Kong positions using the JobSpy library. We'll clean the data and export it to a CSV file for further analysis.

## Overview
- Scrape jobs from LinkedIn, Glassdoor, and Indeed
- Focus on Hong Kong job market
- Clean and preprocess the data
- Export to CSV for analysis

## 1. Install and Import Required Libraries

First, we'll install JobSpy and import all necessary libraries for data scraping, processing, and analysis.

In [2]:
# Install JobSpy if not already installed
# (datetime is part of the Python standard library and must not be pip-installed)
!pip install python-jobspy pandas numpy



In [3]:
# Import required libraries
import pandas as pd
import numpy as np
from jobspy import scrape_jobs
from datetime import datetime, timedelta
import warnings
import os

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Current working directory: {os.getcwd()}")

Libraries imported successfully!
Current working directory: c:\Users\j_fel\Desktop\projects\job-automation


## 2. Configure JobSpy Parameters

Set up common parameters for job scraping including location, search terms, and other configurations.

In [5]:
# Configure common scraping parameters
LOCATION = "Hong Kong"
SEARCH_TERMS = ["data analyst", "software engineer", "graduate trainee", "management trainee", "data scientist"]
RESULTS_WANTED = 100  # Number of jobs per search term per site
HOURS_OLD = 1440  # Jobs posted within the last 1440 hours (60 days)
COUNTRY_INDEED = "Hong Kong"  # Indeed specific country parameter

print(f"Configuration set:")
print(f"Location: {LOCATION}")
print(f"Search terms: {SEARCH_TERMS}")
print(f"Results wanted per search: {RESULTS_WANTED}")
print(f"Jobs from last {HOURS_OLD} hours")

Configuration set:
Location: Hong Kong
Search terms: ['data analyst', 'software engineer', 'graduate trainee', 'management trainee', 'data scientist']
Results wanted per search: 100
Jobs from last 1440 hours
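
A look-back window expressed in raw hours is easy to misread (1440 hours is 60 days, not one day). As a small sketch, the value can be derived from a number of days instead; `hours_from_days` is a hypothetical helper name, not part of JobSpy.

```python
from datetime import timedelta


def hours_from_days(days: int) -> int:
    """Convert a look-back window in days to whole hours."""
    return int(timedelta(days=days).total_seconds() // 3600)


HOURS_OLD = hours_from_days(60)
print(HOURS_OLD)  # 1440
```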


## 2.1 Diagnostic: Test Location Formats for Glassdoor

Let's test different location formats to see which ones work with each job site, particularly Glassdoor.

In [None]:
# Test different location formats for each site
import time

# Different location formats to test for Hong Kong
location_formats = [
    "Hong Kong",
    "Hong Kong, Hong Kong", 
    "Hong Kong SAR",
    "Central, Hong Kong",
    "Hong Kong SAR, China",
    "Kowloon, Hong Kong"
]

sites_to_test = ["linkedin", "glassdoor", "indeed"]

print("Testing location formats for each job site...")
print("=" * 60)

for site in sites_to_test:
    print(f"\n🔍 Testing {site.upper()}:")
    
    for location in location_formats:
        try:
            print(f"  Testing location: '{location}'... ", end="")
            
            # Test with minimal parameters to reduce load
            test_jobs = scrape_jobs(
                site_name=[site],
                search_term="data analyst",  # Simple search term
                location=location,
                results_wanted=5,  # Very small number for testing
                hours_old=7200,  # Longer time range
                # Keep this a string for every site: an explicit None replaces
                # JobSpy's default country and may not be accepted
                country_indeed="Hong Kong"
            )
            
            if not test_jobs.empty:
                print(f"✅ SUCCESS ({len(test_jobs)} jobs found)")
                break  # Found working format, move to next site
            else:
                print("⚠️  No jobs returned")
                
        except Exception as e:
            error_msg = str(e)
            if "status code 400" in error_msg or "location not parsed" in error_msg:
                print("❌ Location format rejected")
            else:
                print(f"❌ Error: {error_msg[:50]}...")
        
        # Small delay between tests
        time.sleep(1)
    
    print(f"  Finished testing {site}")

print("\n" + "=" * 60)
print("Diagnostic complete. Check results above to see which location formats work.")
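
The diagnostic above only prints its findings. One possible way to capture the outcome for later use is sketched below: a helper that returns the first location string a probe accepts. `first_working_location`, `probe`, and `fake_probe` are hypothetical names introduced for illustration; the stub stands in for a real `scrape_jobs` call so the sketch runs without network access.

```python
from typing import Callable, Iterable, Optional


def first_working_location(
    probe: Callable[[str], int],
    candidates: Iterable[str],
) -> Optional[str]:
    """Return the first candidate for which probe() reports at least one job.

    probe is a hypothetical callable: it takes a location string and returns
    the number of jobs found, raising an exception if the site rejects it.
    """
    for location in candidates:
        try:
            if probe(location) > 0:
                return location
        except Exception:
            # Treat a rejected location like an empty result and keep going.
            continue
    return None


# Stub probe standing in for a real scrape_jobs call (no network access).
def fake_probe(location: str) -> int:
    if location == "Hong Kong SAR":
        raise ValueError("location not parsed")
    return 5 if location == "Central, Hong Kong" else 0


chosen = first_working_location(
    fake_probe, ["Hong Kong SAR", "Hong Kong", "Central, Hong Kong"]
)
print(chosen)  # Central, Hong Kong
```

In the notebook, the probe could wrap `scrape_jobs(site_name=[site], ..., results_wanted=5)` and return `len()` of the resulting DataFrame; a `None` result would suggest disabling that site.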

## 2.2 Updated Configuration with Site-Specific Settings

Based on the diagnostic results above, configure different settings for each site to handle their specific requirements.

In [None]:
# Site-specific configuration based on what works
SEARCH_TERMS = ["data analyst", "software engineer", "graduate trainee", "management trainee", "data scientist"]
RESULTS_WANTED = 50  # Reduced from 100 to be more conservative
HOURS_OLD = 1440

# Site-specific location formats (update these based on diagnostic results above)
LOCATION_CONFIGS = {
    "linkedin": {
        "location": "Hong Kong",  # Usually works with simple format
        "enabled": True
    },
    "indeed": {
        "location": "Hong Kong", 
        "country_indeed": "Hong Kong",
        "enabled": True
    },
    "glassdoor": {
        "location": "Hong Kong, Hong Kong",  # Try the city, country format first
        "enabled": True  # Set to False if it keeps failing
    }
}

print("Site-specific configurations:")
for site, config in LOCATION_CONFIGS.items():
    status = "✅ Enabled" if config["enabled"] else "❌ Disabled"
    print(f"  {site.capitalize()}: {config['location']} - {status}")

print(f"\nSearch terms: {SEARCH_TERMS}")
print(f"Results wanted per search per site: {RESULTS_WANTED}")
print(f"Jobs from last {HOURS_OLD} hours")
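
The configuration above is only useful if each site is then scraped with its own settings. The sketch below shows one way to turn a per-site entry into keyword arguments for a single-site `scrape_jobs` call, skipping disabled sites. `build_scrape_kwargs` is a hypothetical helper, and the dictionary is a trimmed copy of the one above with Glassdoor disabled purely for illustration.

```python
def build_scrape_kwargs(site, term, site_config, results_wanted, hours_old):
    """Assemble keyword arguments for one scrape_jobs call on one site."""
    kwargs = {
        "site_name": [site],
        "search_term": term,
        "location": site_config["location"],
        "results_wanted": results_wanted,
        "hours_old": hours_old,
    }
    # Only pass country_indeed when the site's config defines it.
    if "country_indeed" in site_config:
        kwargs["country_indeed"] = site_config["country_indeed"]
    return kwargs


# Trimmed copy of the configuration, repeated so the sketch is self-contained.
# Glassdoor is disabled here only to demonstrate the filtering.
LOCATION_CONFIGS = {
    "linkedin": {"location": "Hong Kong", "enabled": True},
    "indeed": {"location": "Hong Kong", "country_indeed": "Hong Kong", "enabled": True},
    "glassdoor": {"location": "Hong Kong, Hong Kong", "enabled": False},
}

enabled = {site: cfg for site, cfg in LOCATION_CONFIGS.items() if cfg["enabled"]}
calls = [
    build_scrape_kwargs(site, "data analyst", cfg, 50, 1440)
    for site, cfg in enabled.items()
]
for kwargs in calls:
    print(kwargs)
```

Each dictionary could then be unpacked as `scrape_jobs(**kwargs)` inside a `try` block, so a failure on one site does not affect the others.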

## 3. Scrape Jobs from All Sites

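Use JobSpy to scrape job listings from LinkedIn, Glassdoor, and Indeed in a single call per search term, with Hong Kong as the target location. The run shown below used the settings from section 2 (`LOCATION`, `RESULTS_WANTED = 100`); the site-specific configuration from section 2.2 was not applied.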

In [6]:
# Scrape jobs from all sites simultaneously
print("Starting job scraping from LinkedIn, Glassdoor, and Indeed...")
all_jobs = pd.DataFrame()

for term in SEARCH_TERMS:
    try:
        print(f"Scraping all sites for: {term}")
        
        jobs = scrape_jobs(
            site_name=["linkedin", "glassdoor", "indeed"],
            search_term=term,
            location=LOCATION,
            results_wanted=RESULTS_WANTED,
            hours_old=HOURS_OLD,
            country_indeed=COUNTRY_INDEED
        )
        
        if not jobs.empty:
            jobs['search_term'] = term
            all_jobs = pd.concat([all_jobs, jobs], ignore_index=True)
            print(f"Found {len(jobs)} jobs for '{term}' across all sites")
            
            # Show breakdown by site if 'site' column exists
            if 'site' in jobs.columns:
                site_counts = jobs['site'].value_counts()
                for site, count in site_counts.items():
                    print(f"  - {site}: {count} jobs")
        else:
            print(f"No jobs found for '{term}' on any site")
            
    except Exception as e:
        print(f"Error scraping for '{term}': {str(e)}")

print(f"\nTotal jobs scraped: {len(all_jobs)}")
if not all_jobs.empty:
    print(f"Columns available: {list(all_jobs.columns)}")
    
    # Show overall breakdown by site
    if 'site' in all_jobs.columns:
        print(f"\nJobs by site:")
        print(all_jobs['site'].value_counts())

2025-09-08 06:02:05,990 - ERROR - JobSpy:Glassdoor - Glassdoor response status code 400
2025-09-08 06:02:05,997 - ERROR - JobSpy:Glassdoor - Glassdoor: location not parsed


Starting job scraping from LinkedIn, Glassdoor, and Indeed...
Scraping all sites for: data analyst


2025-09-08 06:03:03,162 - INFO - JobSpy:Linkedin - finished scraping
2025-09-08 06:03:04,714 - ERROR - JobSpy:Glassdoor - Glassdoor response status code 400
2025-09-08 06:03:04,721 - ERROR - JobSpy:Glassdoor - Glassdoor: location not parsed


Found 200 jobs for 'data analyst' across all sites
  - indeed: 100 jobs
  - linkedin: 100 jobs
Scraping all sites for: software engineer
Found 200 jobs for 'software engineer' across all sites
  - indeed: 100 jobs
  - linkedin: 100 jobs
Scraping all sites for: graduate trainee


2025-09-08 06:04:05,877 - ERROR - JobSpy:Glassdoor - Glassdoor response status code 400
2025-09-08 06:04:05,885 - ERROR - JobSpy:Glassdoor - Glassdoor: location not parsed
2025-09-08 06:04:46,084 - ERROR - JobSpy:Glassdoor - Glassdoor response status code 400
2025-09-08 06:04:46,089 - ERROR - JobSpy:Glassdoor - Glassdoor: location not parsed


Found 145 jobs for 'graduate trainee' across all sites
  - indeed: 76 jobs
  - linkedin: 69 jobs
Scraping all sites for: management trainee


2025-09-08 06:05:14,251 - ERROR - JobSpy:Glassdoor - Glassdoor response status code 400
2025-09-08 06:05:14,257 - ERROR - JobSpy:Glassdoor - Glassdoor: location not parsed
2025-09-08 06:05:14,257 - ERROR - JobSpy:Glassdoor - Glassdoor: location not parsed


Found 110 jobs for 'management trainee' across all sites
  - indeed: 70 jobs
  - linkedin: 40 jobs
Scraping all sites for: data scientist
Found 170 jobs for 'data scientist' across all sites
  - linkedin: 100 jobs
  - indeed: 70 jobs

Total jobs scraped: 825
Columns available: ['id', 'site', 'job_url', 'job_url_direct', 'title', 'company', 'location', 'date_posted', 'job_type', 'salary_source', 'interval', 'min_amount', 'max_amount', 'currency', 'is_remote', 'job_level', 'job_function', 'listing_type', 'emails', 'description', 'company_industry', 'company_url', 'company_logo', 'company_url_direct', 'company_addresses', 'company_num_employees', 'company_revenue', 'company_description', 'skills', 'experience_range', 'company_rating', 'company_reviews_count', 'vacancy_count', 'work_from_home_type', 'search_term']

Jobs by site:
site
indeed      416
linkedin    409
Name: count, dtype: int64
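
The loop above grows `all_jobs` with `pd.concat` on every iteration. A common alternative, sketched here under the assumption that the scraping call is wrapped in a function, is to collect the per-term frames in a list and concatenate once at the end. `collect_jobs` and `fake_scrape` are hypothetical names; the stub replaces `scrape_jobs` so the example runs offline.

```python
import pandas as pd


def collect_jobs(scrape_fn, terms):
    """Run scrape_fn once per term and combine the results in one concat."""
    frames = []
    for term in terms:
        try:
            jobs = scrape_fn(term)
        except Exception as exc:
            print(f"Error scraping for '{term}': {exc}")
            continue
        if not jobs.empty:
            frames.append(jobs.assign(search_term=term))
    # pd.concat raises on an empty list, so fall back to an empty frame.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Stub standing in for scrape_jobs (no network access).
def fake_scrape(term):
    if term == "broken":
        raise RuntimeError("simulated failure")
    n = {"data analyst": 2, "data scientist": 1}.get(term, 0)
    return pd.DataFrame({"title": [f"{term} {i}" for i in range(n)]})


combined = collect_jobs(
    fake_scrape, ["data analyst", "broken", "nothing", "data scientist"]
)
print(len(combined))  # 3
```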

## 4. Review Scraped Data

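Review the scraped data collected in the previous step. Glassdoor rejected the location in every request (status code 400, "location not parsed"), so the results come from Indeed and LinkedIn only.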

In [7]:
# Review the scraped data
if not all_jobs.empty:
    print(f"Total jobs scraped: {len(all_jobs)}")
    print(f"Available columns: {list(all_jobs.columns)}")
    
    # Display basic statistics
    if 'site' in all_jobs.columns:
        print("\nJobs by site:")
        print(all_jobs['site'].value_counts())
    
    print("\nJobs by search term:")
    print(all_jobs['search_term'].value_counts())
    
    # Show sample data
    print("\nSample of scraped data:")
    display_columns = ['title', 'company', 'location', 'site'] if 'site' in all_jobs.columns else ['title', 'company', 'location']
    available_display_columns = [col for col in display_columns if col in all_jobs.columns]
    if available_display_columns:
        print(all_jobs[available_display_columns].head())
    
else:
    print("No jobs were scraped from any source.")
    print("This could be due to:")
    print("- Network connectivity issues")
    print("- Site blocking or rate limiting")
    print("- No jobs matching the search criteria")
    print("- API changes in JobSpy or the job sites")

Total jobs scraped: 825
Available columns: ['id', 'site', 'job_url', 'job_url_direct', 'title', 'company', 'location', 'date_posted', 'job_type', 'salary_source', 'interval', 'min_amount', 'max_amount', 'currency', 'is_remote', 'job_level', 'job_function', 'listing_type', 'emails', 'description', 'company_industry', 'company_url', 'company_logo', 'company_url_direct', 'company_addresses', 'company_num_employees', 'company_revenue', 'company_description', 'skills', 'experience_range', 'company_rating', 'company_reviews_count', 'vacancy_count', 'work_from_home_type', 'search_term']

Jobs by site:
site
indeed      416
linkedin    409
Name: count, dtype: int64

Jobs by search term:
search_term
data analyst          200
software engineer     200
data scientist        170
graduate trainee      145
management trainee    110
Name: count, dtype: int64

Sample of scraped data:
                                               title  \
0  Analyst to Associate - Risk Management, Prime ...   
1       

## 5. Clean and Preprocess Data

Clean the combined data by handling missing values, removing duplicates, standardizing formats, and filtering relevant columns.

In [8]:
# Data cleaning and preprocessing
if not all_jobs.empty:
    print("Starting data cleaning...")
    
    # Display initial data info
    print(f"Initial dataset shape: {all_jobs.shape}")
    print(f"\nMissing values per column:")
    print(all_jobs.isnull().sum())
    
    # Remove duplicates based on title, company, and location
    initial_count = len(all_jobs)
    if all(col in all_jobs.columns for col in ['title', 'company', 'location']):
        all_jobs = all_jobs.drop_duplicates(
            subset=['title', 'company', 'location'], 
            keep='first'
        )
        duplicates_removed = initial_count - len(all_jobs)
        print(f"\nRemoved {duplicates_removed} duplicate jobs")
    
    # Clean and standardize text fields
    text_columns = ['title', 'company', 'location', 'description']
    for col in text_columns:
        if col in all_jobs.columns:
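            # Strip whitespace from real values only, so missing values stay
            # missing instead of becoming the strings 'nan' or 'None'
            present = all_jobs[col].notna()
            all_jobs.loc[present, col] = all_jobs.loc[present, col].astype(str).str.strip()
            # Treat empty strings as missing
            all_jobs[col] = all_jobs[col].replace('', np.nan)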
    
    # Clean salary information if available
    salary_columns = ['min_amount', 'max_amount', 'salary']
    for col in salary_columns:
        if col in all_jobs.columns:
            # Convert to numeric, handling any string values
            all_jobs[col] = pd.to_numeric(all_jobs[col], errors='coerce')
    
    # Convert date columns if available
    date_columns = ['date_posted']
    for col in date_columns:
        if col in all_jobs.columns:
            all_jobs[col] = pd.to_datetime(all_jobs[col], errors='coerce')
    
    # Filter out jobs with missing essential information
    essential_columns = ['title', 'company']
    for col in essential_columns:
        if col in all_jobs.columns:
            before_filter = len(all_jobs)
            all_jobs = all_jobs.dropna(subset=[col])
            after_filter = len(all_jobs)
            if before_filter != after_filter:
                print(f"Removed {before_filter - after_filter} jobs with missing {col}")
    
    print(f"\nCleaned dataset shape: {all_jobs.shape}")
    
else:
    print("No data to clean.")

Starting data cleaning...
Initial dataset shape: (825, 35)

Missing values per column:
id                         0
site                       0
job_url                    0
job_url_direct           409
title                      0
company                    7
location                   0
date_posted                4
job_type                 519
salary_source            825
interval                 825
min_amount               825
max_amount               825
currency                 825
is_remote                  0
job_level                416
job_function             825
listing_type             825
emails                   809
description              409
company_industry         785
company_url                6
company_logo             607
company_url_direct       582
company_addresses        618
company_num_employees    633
company_revenue          670
company_description      675
skills                   825
experience_range         825
company_rating           825
company_review
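
Exact-match deduplication can miss the same posting when two sites format it differently (different casing or spacing, or "HK" versus a longer location string). The following is a minimal sketch of a looser match on title and company only, assuming that pair is enough to identify a posting. `drop_near_duplicates` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def drop_near_duplicates(df, key_columns=("title", "company")):
    """Drop rows whose key columns match after case and whitespace normalization."""
    keys = pd.DataFrame(index=df.index)
    for col in key_columns:
        keys[col] = (
            df[col]
            .astype("string")                      # keeps missing values as <NA>
            .str.strip()
            .str.lower()
            .str.replace(r"\s+", " ", regex=True)  # collapse internal whitespace
            .fillna("")                            # compare missing values as equal
        )
    return df.loc[~keys.duplicated(keep="first")]


# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "title": ["Data Analyst", "data  analyst ", "Data Scientist"],
    "company": ["Acme Ltd", "ACME LTD", "Acme Ltd"],
    "site": ["indeed", "linkedin", "indeed"],
})
deduped = drop_near_duplicates(sample)
print(len(deduped))  # 2
```

The original columns are left untouched; only the temporary comparison keys are normalized.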

## 6. Create Final DataFrame

Structure the cleaned data into a well-organized pandas DataFrame with appropriate column names and data types.

In [None]:
# Create final structured DataFrame
if not all_jobs.empty:
    print("Creating final structured DataFrame...")
    
    # Define the columns we want in our final dataset
    desired_columns = [
        'title', 'company', 'location', 'description', 
        'min_amount', 'max_amount', 'currency',
        'date_posted', 'job_url', 'site', 'search_term'
    ]
    
    # Select only available columns
    available_columns = [col for col in desired_columns if col in all_jobs.columns]
    final_jobs_df = all_jobs[available_columns].copy()
    
    # Add a unique job ID
    final_jobs_df['job_id'] = range(1, len(final_jobs_df) + 1)
    
    # Add scraping timestamp
    final_jobs_df['scraped_at'] = datetime.now()
    
    # Reorder columns for better readability
    column_order = ['job_id', 'title', 'company', 'location', 'site', 'search_term']
    remaining_columns = [col for col in final_jobs_df.columns if col not in column_order]
    final_column_order = column_order + remaining_columns
    
    final_jobs_df = final_jobs_df[[col for col in final_column_order if col in final_jobs_df.columns]]
    
    print(f"Final DataFrame shape: {final_jobs_df.shape}")
    print(f"Final columns: {list(final_jobs_df.columns)}")
    
    # Display sample data
    print("\nSample of final data:")
    print(final_jobs_df.head())
    
    # Display summary statistics
    print("\nDataset Summary:")
    print(f"Total jobs: {len(final_jobs_df)}")
    print(f"Unique companies: {final_jobs_df['company'].nunique() if 'company' in final_jobs_df.columns else 'N/A'}")
    print(f"Sites: {final_jobs_df['site'].unique().tolist() if 'site' in final_jobs_df.columns else 'N/A'}")
    print(f"Search terms: {final_jobs_df['search_term'].unique().tolist() if 'search_term' in final_jobs_df.columns else 'N/A'}")
    
else:
    print("No data available to create final DataFrame.")
    final_jobs_df = pd.DataFrame()


Creating final structured DataFrame...
Final DataFrame shape: (689, 13)
Final columns: ['job_id', 'title', 'company', 'location', 'site', 'search_term', 'description', 'min_amount', 'max_amount', 'currency', 'date_posted', 'job_url', 'scraped_at']

Sample of final data:
   job_id                                              title  \
0       1  Analyst to Associate - Risk Management, Prime ...   
1       2                 Analyst / Officer, Risk Management   
2       3  Quant Model Risk - Equities and eTrading - Ana...   
3       4                              Risk Controls Analyst   
4       5                         Analyst, Consumer Insights   

                                             company location    site  \
0  Haitong International Management Services Comp...       HK  indeed   
1                 MIB Securities (Hong Kong) Limited       HK  indeed   
2                                      JPMorganChase       HK  indeed   
3                              Millennium Management
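
The sequential `job_id` above depends on row order, so the same posting can receive a different ID on the next run. If IDs need to be comparable across runs, one option is to derive them from `job_url`, as sketched here. `stable_job_id` is a hypothetical helper, and the URL is a placeholder.

```python
import hashlib


def stable_job_id(job_url: str, length: int = 12) -> str:
    """Derive a short, repeatable identifier from a job URL."""
    digest = hashlib.sha256(job_url.strip().encode("utf-8")).hexdigest()
    return digest[:length]


url = "https://example.com/jobs/123"  # placeholder URL
print(stable_job_id(url))
```

Applied to the DataFrame, this could look like `final_jobs_df['job_url'].map(stable_job_id)`, stored in a column of your choosing.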

## 7. Export to CSV File

Save the final cleaned DataFrame to a CSV file for future use and analysis.

In [10]:
# Export to CSV file
if not final_jobs_df.empty:
    # Generate filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"hongkong_jobs_{timestamp}.csv"
    filepath = os.path.join(os.getcwd(), filename)
    
    try:
        # Export to CSV
        final_jobs_df.to_csv(filepath, index=False, encoding='utf-8')
        print(f"✅ Successfully exported {len(final_jobs_df)} jobs to: {filepath}")
        
        # Display file information
        file_size = os.path.getsize(filepath) / 1024  # Size in KB
        print(f"File size: {file_size:.2f} KB")
        
        # Display what was exported
        print(f"\nExported data contains:")
        print(f"- {len(final_jobs_df)} job listings")
        print(f"- {len(final_jobs_df.columns)} columns")
        print(f"- Jobs from: {', '.join(final_jobs_df['site'].unique()) if 'site' in final_jobs_df.columns else 'Various sources'}")
        
        # Show column names
        print(f"\nColumns exported: {', '.join(final_jobs_df.columns)}")
        
    except Exception as e:
        print(f"❌ Error exporting to CSV: {str(e)}")
        
else:
    print("❌ No data to export. Please check the scraping results above.")

✅ Successfully exported 689 jobs to: c:\Users\j_fel\Desktop\projects\job-automation\hongkong_jobs_20250908_060716.csv
File size: 1117.76 KB

Exported data contains:
- 689 job listings
- 13 columns
- Jobs from: indeed, linkedin

Columns exported: job_id, title, company, location, site, search_term, description, min_amount, max_amount, currency, date_posted, job_url, scraped_at
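
The export above writes plain UTF-8. If the file will be opened in Excel and contains Chinese text, the `utf-8-sig` encoding (UTF-8 with a byte-order mark) is often easier for Excel to detect; `DataFrame.to_csv` accepts it through the same `encoding` argument. The standard-library sketch below illustrates the round trip with made-up rows and a temporary file.

```python
import csv
import os
import tempfile

# Made-up rows for illustration only.
rows = [
    {"title": "數據分析師 (Data Analyst)", "company": "Example Co"},
    {"title": "Software Engineer", "company": "Example Co"},
]

path = os.path.join(tempfile.mkdtemp(), "hongkong_jobs_sample.csv")

# utf-8-sig prepends a byte-order mark, which Excel uses to detect UTF-8.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "company"])
    writer.writeheader()
    writer.writerows(rows)

# Reading with utf-8-sig strips the mark again, so the header stays clean.
with open(path, newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))

print(len(loaded), loaded[0]["title"])
```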


## Summary

This notebook has successfully:

1. ✅ Installed and imported the JobSpy library and required dependencies
2. ✅ Configured scraping parameters for Hong Kong job market
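3. ⚠️ Requested job listings from LinkedIn, Glassdoor, and Indeed; Glassdoor returned status code 400 ("location not parsed"), so only LinkedIn and Indeed returned jobs
4. ✅ Combined data from the two sources that returned results (689 jobs after cleaning)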
5. ✅ Cleaned and preprocessed the data (removed duplicates, handled missing values)
6. ✅ Created a structured final DataFrame
7. ✅ Exported the cleaned data to a CSV file

### Next Steps

You can now:
- Analyze the CSV file in Excel or other tools
- Use the data for job market analysis
- Modify search terms to target specific roles
- Schedule regular scraping to track job market trends
- Add more data cleaning or analysis steps as needed

### Notes

- The scraping results may vary based on the current job market and website availability
- Some sites may have rate limiting or anti-scraping measures
- Always respect the terms of service of the job sites
- Consider adding delays between requests for large-scale scraping
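
As a sketch of the delay advice above, the helper below retries a failing call with exponentially growing, slightly randomized pauses. `call_with_backoff` and `flaky` are hypothetical names; the `sleep` parameter is injectable so the example runs instantly, and the simulated failure stands in for a rate-limited request.

```python
import random
import time


def call_with_backoff(fn, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing, jittered delays."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            # 2s, 4s, 8s ... plus up to 1s of random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))


# Simulated call that fails twice before succeeding.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))  # ok 2
```

In the notebook, `fn` could be a `lambda` wrapping one `scrape_jobs` call for a single search term.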