Trendwork: Job Market Intelligence Platform

Trendwork is a data-driven platform designed to aggregate, process, and analyze job listings from multiple job boards. It provides a centralized view of market trends, skill demands, and salary insights by leveraging modern data engineering practices and Artificial Intelligence.

Core Objective

The primary goal of Trendwork is to automate the extraction of job market data and transform it into actionable intelligence. By using advanced NLP and geocoding, it identifies emerging skills and geographic hotspots for various roles.

Architecture

The platform follows a Medallion Architecture (Bronze, Silver, Gold) implemented on Google Cloud Platform to ensure data quality and lineage.

Bronze Layer (Raw Data)

  • Scrapers running on Cloud Run extract raw JSON data from job boards.
  • Data is stored as immutable objects in Google Cloud Storage (see the sketch after this list).
  • Cloud Scheduler triggers automated scraping cycles.
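
As a rough illustration of this flow, the sketch below writes one scrape cycle's raw JSON to Cloud Storage with the search keyword attached as object metadata. The bucket name and keyword are placeholder assumptions, not the repo's actual scraper code.

    # bronze_upload.py -- illustrative sketch of a Bronze-layer write, not the repo's scraper.
    # Assumes google-cloud-storage is installed and Application Default Credentials
    # are available (e.g. via the Cloud Run service account).
    import json
    from datetime import datetime, timezone

    from google.cloud import storage

    BUCKET_NAME = "trendwork-bronze"   # hypothetical; set via terraform.tfvars in practice
    KEYWORD = "data engineer"          # search keyword for this scrape cycle


    def upload_raw_listings(listings: list[dict]) -> str:
        """Write one scrape cycle's raw JSON to GCS as an immutable object."""
        client = storage.Client()
        bucket = client.bucket(BUCKET_NAME)

        # Timestamped object name keeps every cycle's output immutable and traceable.
        ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        blob = bucket.blob(f"raw/{KEYWORD.replace(' ', '_')}/{ts}.json")

        # The keyword is also attached as object metadata so the Silver processor
        # can recover it even if it is missing from the payload itself.
        blob.metadata = {"keyword": KEYWORD}
        blob.upload_from_string(json.dumps(listings), content_type="application/json")
        return blob.name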

Silver Layer (Cleaned Data)

  • A Cloud Function (Processor) is triggered by new files in Cloud Storage.
  • Performs data cleaning, normalization, and deduplication.
  • Loads structured data into BigQuery silver tables.
  • Implements a fail-safe that recovers missing fields (e.g. the search keyword) from file metadata, as sketched below.
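
The sketch below illustrates this flow under assumed names: a CloudEvent-triggered function that reads the new object, falls back to the object's keyword metadata when a listing lacks one, deduplicates by job id, and appends rows to a silver table. It is not the repo's actual processor code.

    # processor_sketch.py -- illustrative Silver-layer processor.
    # Table and field names are assumptions; the repo's actual schema may differ.
    import json

    import functions_framework
    from google.cloud import bigquery, storage

    SILVER_TABLE = "trendwork.silver.job_listings"   # hypothetical table id


    @functions_framework.cloud_event
    def process_raw_file(cloud_event):
        """Triggered by a new object in the Bronze bucket (GCS finalize event)."""
        data = cloud_event.data
        bucket_name, object_name = data["bucket"], data["name"]

        storage_client = storage.Client()
        blob = storage_client.bucket(bucket_name).get_blob(object_name)
        listings = json.loads(blob.download_as_text())

        # Fail-safe: if a listing is missing its keyword, fall back to object metadata.
        fallback_keyword = (blob.metadata or {}).get("keyword", "unknown")

        rows, seen_ids = [], set()
        for item in listings:
            job_id = item.get("id")
            if not job_id or job_id in seen_ids:     # basic deduplication
                continue
            seen_ids.add(job_id)
            rows.append({
                "job_id": job_id,
                "title": (item.get("title") or "").strip(),
                "location": (item.get("location") or "").strip(),
                "keyword": item.get("keyword") or fallback_keyword,
                "source_object": object_name,
            })

        if rows:
            bigquery.Client().insert_rows_json(SILVER_TABLE, rows)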

Gold Layer (Enriched Insights)

  • An Enrichment service uses Vertex AI (Gemini 2.0 Flash) to extract specific information (see the sketch after this list):
    • Skill requirements.
    • Salary ranges.
    • Role summaries.
  • Geocoding services convert location strings into latitude and longitude.
  • Final enriched data is stored in BigQuery gold tables for high-performance querying.
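
A hedged sketch of the Gemini extraction step using the Vertex AI Python SDK is shown below; the prompt, project id, model id, and response handling are illustrative assumptions rather than the repo's exact enrichment service.

    # enrich_sketch.py -- illustrative Gold-layer enrichment call, not the repo's code.
    # Assumes the google-cloud-aiplatform SDK (vertexai) and ADC credentials.
    import json

    import vertexai
    from vertexai.generative_models import GenerationConfig, GenerativeModel

    vertexai.init(project="my-project-id", location="us-central1")  # hypothetical values
    model = GenerativeModel("gemini-2.0-flash-001")

    PROMPT = (
        "From the job description below, return JSON with keys "
        "'skills' (list of strings), 'salary_min', 'salary_max', and 'summary'.\n\n{description}"
    )


    def enrich(description: str) -> dict:
        """Ask Gemini for structured skills, a salary range, and a short summary."""
        response = model.generate_content(
            PROMPT.format(description=description),
            generation_config=GenerationConfig(response_mime_type="application/json"),
        )
        return json.loads(response.text)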

Technical Stack

  • Infrastructure: Terraform for reproducible Infrastructure as Code.
  • Compute: Cloud Run and Cloud Run Functions (Python 3.11).
  • Storage: Google Cloud Storage and BigQuery.
  • AI/ML: Vertex AI (Gemini) for text extraction and analysis.
  • Visualization: Streamlit for interactive dashboards.
  • Observability: OpenTelemetry and Google Cloud Trace for distributed tracing (see the sketch below).
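
As one concrete example from this stack, a service can export OpenTelemetry spans to Cloud Trace via the GCP exporter. The snippet below is a minimal sketch assuming the opentelemetry-sdk and opentelemetry-exporter-gcp-trace packages; the instrumentation and span names are placeholders.

    # tracing_sketch.py -- minimal OpenTelemetry -> Cloud Trace setup (illustrative).
    from opentelemetry import trace
    from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Register a tracer provider that batches spans and ships them to Cloud Trace.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("trendwork.scraper")  # hypothetical instrumentation name

    with tracer.start_as_current_span("scrape_cycle"):
        pass  # scraping / processing work happens inside the span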

Key Features

  • Automated Scraping: Bypasses modern bot detection mechanisms.
  • AI-Powered Extraction: Transforms unstructured job descriptions into structured skill lists and summaries.
  • Geographic Mapping: Visualizes job density across regions using interactive 3D maps (see the sketch after this list).
  • Real-time Analysis: Dynamic filtering by job keywords and automatic metric calculations.
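
A minimal sketch of how such a 3D density map can be rendered in Streamlit with pydeck is shown below; the sample coordinates and column names are assumptions, not the dashboard's actual schema.

    # map_sketch.py -- illustrative Streamlit + pydeck job-density map.
    # Assumes a DataFrame with 'lat' and 'lon' columns loaded from the gold table.
    import pandas as pd
    import pydeck as pdk
    import streamlit as st

    jobs = pd.DataFrame({"lat": [3.1390, 1.3521], "lon": [101.6869, 103.8198]})  # sample rows

    # HexagonLayer aggregates nearby points into extruded 3D columns of job density.
    layer = pdk.Layer(
        "HexagonLayer",
        data=jobs,
        get_position="[lon, lat]",
        radius=5000,
        elevation_scale=50,
        extruded=True,
    )
    view = pdk.ViewState(latitude=3.1390, longitude=101.6869, zoom=5, pitch=45)
    st.pydeck_chart(pdk.Deck(layers=[layer], initial_view_state=view))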

Business Questions Answered

  • Which job titles are experiencing the highest growth in demand?
  • What are the top technical skills required for specific roles in the current market? (An example query is sketched after this list.)
  • How do salary ranges vary across different geographic locations?
  • Are there emerging roles or skills appearing suddenly in recent postings?
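
Each of these questions maps onto a query over the gold tables. For instance, the skills question could be answered along the lines of the sketch below; the project, dataset, table, and column names are assumptions rather than the actual schema.

    # top_skills_sketch.py -- example gold-table query; schema names are assumptions.
    from google.cloud import bigquery

    QUERY = """
        SELECT skill, COUNT(*) AS postings
        FROM `my-project-id.trendwork_gold.job_listings`, UNNEST(skills) AS skill
        WHERE keyword = @keyword
        GROUP BY skill
        ORDER BY postings DESC
        LIMIT 20
    """

    client = bigquery.Client()
    job = client.query(
        QUERY,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("keyword", "STRING", "data engineer")]
        ),
    )
    for row in job.result():
        print(row.skill, row.postings)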

Deployment Sequence

To successfully deploy the platform using Terraform, the following order must be observed:

  1. Initial Infrastructure: Configure terraform.tfvars with your project_id and bucket_name.
  2. Container Image: The dashboard service depends on a pre-existing container image. Before applying the dashboard infrastructure, build and push the image to Google Container Registry:
    cd dashboard
    docker build --platform linux/amd64 -t gcr.io/[PROJECT_ID]/jobstreet-dashboard:latest .
    docker push gcr.io/[PROJECT_ID]/jobstreet-dashboard:latest
  3. Terraform Apply: Run terraform apply to provision the Cloud Run services, BigQuery tables, and necessary IAM permissions.

Local Development

  • The dashboard can be run locally using streamlit run dashboard/app.py.
  • Ensure Google Cloud credentials are configured via gcloud auth login.
  • Secrets are managed through Streamlit secrets or environment variables (see the sketch below).
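
A small sketch of how such a lookup might work, with hypothetical key names, is shown below; the actual dashboard may organize its configuration differently.

    # config_sketch.py -- illustrative settings lookup for local dashboard runs.
    import os

    import streamlit as st


    def get_setting(name: str, default: str | None = None) -> str | None:
        """Prefer .streamlit/secrets.toml, fall back to environment variables."""
        try:
            return st.secrets[name]
        except (KeyError, FileNotFoundError):
            return os.environ.get(name, default)


    project_id = get_setting("GCP_PROJECT_ID")  # hypothetical key name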

About

Scrapes job platforms (Jobstreet for now) to obtain job insights.
