# MyCareersFuture Job Scraper

## Configure your search


- Generating the search parameters is easy.
- Go to [mycareersfuture.gov.sg](https://www.mycareersfuture.gov.sg) and run a search with the filters you want (e.g. minimum salary, full time/contract, etc.).
- Open the browser's web developer tools and go to the Network tab. Refresh the page. (These instructions are for Firefox, but Chrome should be similar.)
- Find the row that shows `GET`, `api.mycareersfuture.gov.sg` and `search?search=data&salary=...` (the query string reflects whatever you specified).
- Right-click it, choose Copy Value, then Copy URL Parameters. My example is below.
- You could also copy the request as a cURL command and ask ChatGPT to convert it into a Python request for you.

```
      search=data
      salary=6000
      positionLevel=Executive
      positionLevel=Junior%20Executive
      positionLevel=Fresh%2Fentry%20level
      sortBy=relevancy
      page=0
```
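
The copied parameters still have to be turned into the `data` dict used in the next cell. A minimal sketch of that step using only the standard library; `params_to_payload` is a hypothetical helper (not part of this repo), and the mapping of the repeated `positionLevel` parameter onto a `positionLevels` list is inferred from the example payload below:

```python
from urllib.parse import parse_qs

# Parameters as copied from the browser, one per line (same values as the example above).
copied = """
search=data
salary=6000
positionLevel=Executive
positionLevel=Junior%20Executive
positionLevel=Fresh%2Fentry%20level
sortBy=relevancy
page=0
"""

def params_to_payload(text):
    """Turn copied URL parameters into a search payload dict (hypothetical helper)."""
    # Accept either one parameter per line or a single '&'-joined query string.
    query_string = "&".join(line.strip() for line in text.splitlines() if line.strip())
    # parse_qs percent-decodes values and groups repeated keys into lists.
    params = parse_qs(query_string)
    return {
        "sessionId": "",
        "search": params.get("search", [""])[0],
        "salary": int(params.get("salary", ["0"])[0]),
        "positionLevels": params.get("positionLevel", []),
        "postingCompany": [],
    }

payload = params_to_payload(copied)
print(payload)
```

The `sortBy` and `page` parameters are left out here because the example payload below does not include them.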


In [1]:
from dotenv import load_dotenv
import os

# Load environment variables from .env
load_dotenv()

HF_TOKEN = os.getenv("HF_TOKEN")


# change these
data = {
    "sessionId": "",
    "search": "data",
    "salary": 6000,
    "positionLevels": ["Executive", "Junior Executive", "Fresh/entry level"],
    "postingCompany": []
}

start_url = "https://api.mycareersfuture.gov.sg/v2/search?limit=20&page=0"

json_save_file = "./jobslist.json"

SLEEP_DELAY = 0.5 # secs

## Run the scrape and save to file

In [2]:
import logging
import requests

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

from src.mycareersfuture import MyCareersFutureListings


lst = MyCareersFutureListings(sleep_delay=SLEEP_DELAY)
listings = lst.scrape_listings(data=data, start_url=start_url)
listings = lst.expand_listings()
lst.save_json(json_save_file=json_save_file)
# listings = lst.load_json(json_load_file=json_save_file)

print(listings[0])

{'metadata': {'jobPostId': 'MCF-2023-0769409', 'updatedAt': '2023-10-09T08:29:36', 'newPostingDate': '2023-10-09', 'totalNumberJobApplication': 0, 'isPostedOnBehalf': False, 'isHideSalary': False, 'isHideHiringEmployerName': False, 'jobDetailsUrl': 'https://www.mycareersfuture.gov.sg/job/information-technology/data-engineer-nsearch-global-f9a24b54305ae4c1bb7bcf9974f4e44b'}, 'hiringCompany': None, 'address': {'overseasCountry': None, 'foreignAddress1': None, 'foreignAddress2': None, 'block': None, 'street': None, 'floor': None, 'unit': None, 'building': None, 'postalCode': None, 'isOverseas': False, 'districts': [{'id': 998, 'location': 'Islandwide', 'region': 'Islandwide', 'sectors': [], 'regionId': 'Islandwide'}]}, 'positionLevels': [{'id': 9, 'position': 'Executive'}], 'schemes': [], 'postedCompany': {'uen': '200805822M', 'name': 'NSEARCH GLOBAL PTE. LTD.', 'logoFileName': 'ef210e1b98ea89c1ccc396f7f36a493b/NSEARCH GLOBAL PTE. LTD..jpg', 'logoUploadPath': 'https://static.mycareersfutu

`listings` still carries a lot of metadata. I am still deciding which fields are relevant, but for now each listing is reduced to the following:

In [3]:
reduced = []
for listing in listings:
    reduced.append({
        'url' : listing['metadata']['jobDetailsUrl'],
        'job_title' : listing['title'],
        'job_desc' : listing['job_desc'],
        'company' : listing['postedCompany']['name'],
        'salary_min' : listing['salary']['minimum'],
        'salary_max' : listing['salary']['maximum'],
        'skills' : ', '.join([skill['skill'] for skill in listing['skills']]),
    })

reduced[:2]


[{'url': 'https://www.mycareersfuture.gov.sg/job/information-technology/data-engineer-nsearch-global-f9a24b54305ae4c1bb7bcf9974f4e44b',
  'job_title': 'Data Engineer',
  'job_desc': 'Our client, one of the leading organisations in Asia-Pacific is looking for:\nData Engineer\nResponsibilities:\n\n  Design, Architect, Deploy, and maintain solutions on AWS and Databricks to provide secure and governed access to data for data scientist, data analysts and business users.\n  Manage the full life-cycle of a data lakehouse project from requirement gathering to data modelling, design of the data architecture and deployment.\n  Collaborate with data stewards, data analysts and data scientists to build data pipelines to ingest data from enterprise systems for both batch and real-time streaming data.\n  Establish and manage the complete machine learning lifecycle using MLFlow.\n\nRequirements:\n\n  Minimum 2 to 3 years of relevant work experience.\n  Degree in Computer Science or Information Techn

In [4]:
for listing in reduced:
    print(listing['job_desc'])
    print("\n\n\n")

Our client, one of the leading organisations in Asia-Pacific is looking for:
Data Engineer
Responsibilities:

  Design, Architect, Deploy, and maintain solutions on AWS and Databricks to provide secure and governed access to data for data scientist, data analysts and business users.
  Manage the full life-cycle of a data lakehouse project from requirement gathering to data modelling, design of the data architecture and deployment.
  Collaborate with data stewards, data analysts and data scientists to build data pipelines to ingest data from enterprise systems for both batch and real-time streaming data.
  Establish and manage the complete machine learning lifecycle using MLFlow.

Requirements:

  Minimum 2 to 3 years of relevant work experience.
  Degree in Computer Science or Information Technology or related disciplines
  Hands-on experience in implementing Data Lake/Data Warehouse with technologies like – Databricks, Azure Synapse Analytics, SQL Database, AWS Lake formation.
  Under

# Employing LLMs for Semantic Similarity/RAG/Summarization

- Some experimentation is needed to find the best approach, given the following constraints:
- Each JD is probably 3-4 paragraphs of text.
- The resume a user may wish to supply is also at least a page of text.
- Do we summarize each of those first, then compare the embeddings of the summaries for semantic similarity?
  - Summarization quality varies, though: some models I've tested, when asked to summarize only the skills required, just returned 'data engineer'.
- Or do we split both texts into sentences, embed each one, score every resume sentence against every JD sentence, and keep only the maximum similarities? (This sounds complicated.)
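
The sentence-by-sentence idea in the last bullet is less complicated than it sounds. Below is a rough sketch of the scoring logic only: it uses bag-of-words count vectors as a stand-in for real sentence embeddings (an assumption made so the example runs without a model), and all function names are hypothetical. Swapping `embed` for a call to an embedding model would leave the rest unchanged.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter on '.', '!', '?' and newlines; drops empty pieces."""
    return [p.strip() for p in re.split(r"[.!?\n]+", text) if p.strip()]

def embed(sentence):
    """Stand-in 'embedding': a bag-of-words count vector (placeholder for a real model)."""
    return Counter(re.findall(r"[a-z0-9+#]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def max_similarity_score(resume, job_desc):
    """For each JD sentence keep its best-matching resume sentence, then average."""
    resume_vecs = [embed(s) for s in split_sentences(resume)]
    jd_vecs = [embed(s) for s in split_sentences(job_desc)]
    if not resume_vecs or not jd_vecs:
        return 0.0
    best = [max(cosine(jd, r) for r in resume_vecs) for jd in jd_vecs]
    return sum(best) / len(best)

resume = "Built data pipelines on AWS. Wrote SQL daily."
print(max_similarity_score(resume, "Build data pipelines on AWS and Databricks"))
```

Averaging the per-sentence maxima is only one possible aggregation; whether it ranks JDs sensibly still needs the experimentation mentioned above.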


### Flan T5 XXL : "Extract the skills required for the below job description"

In [5]:
# API_URL = "https://api-inference.huggingface.co/models/togethercomputer/RedPajama-INCITE-Chat-3B-v1"
# headers = {"Authorization": f"Bearer {HF_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xxl"
headers = {"Authorization": f"Bearer {HF_TOKEN}"}

def query(payload):
    """POST the payload to the Hugging Face Inference API and return the parsed JSON."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

for listing in reduced[:4]:
    output = query({
        "inputs": f"Extract the skills required for the below job description: \n{listing['job_desc']}",
    })

    print(output[0]['generated_text'])


Data Architecture, Docker, S3, Git, PySpark, Kubernetes
Data Engineer
Data Analysis, Catalogs, Data Management, Data Quality, SQL, SAP, Data Migration, Attention
Data Engineer
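
The loop above indexes `output[0]` directly, which raises if the API returns anything other than a non-empty list, for instance a JSON object carrying an `error` message. A small defensive sketch; `extract_generated_text` is a hypothetical helper that is not part of the original notebook, and the exact shape of an error response is an assumption:

```python
def extract_generated_text(output):
    """Return the generated text from a parsed API response, or None if it is missing."""
    # Assumed error shape: a dict such as {"error": "..."} instead of a list.
    if isinstance(output, dict):
        return None
    # Expected success shape (as seen above): [{"generated_text": "..."}]
    if isinstance(output, list) and output and isinstance(output[0], dict):
        return output[0].get("generated_text")
    return None

print(extract_generated_text([{"generated_text": "Data Architecture, Docker"}]))
print(extract_generated_text({"error": "model is loading"}))
```

In the loop, `text = extract_generated_text(query(...))` followed by a check for `None` would let the run skip a failed listing instead of stopping.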


### Flan T5 XXL : "Summarize the job skills requirements in 200 words"

In [6]:
for listing in reduced[:4]:
    output = query({
        "inputs": f"Summarize the job skills requirements in 200 words: \n{listing['job_desc']}",
    })

    print(output[0]['generated_text'])

Data Engineer - Design, Architect, Deploy, and maintain solutions on AWS and
Data Engineer - Microsoft Azure - Town Area - MNC, good corporate culture and 5-
Data Analyst for a global mining company. In the capacity of a Data Analyst, your primary
Data Engineer - Fintech - New York, NY - 5+ years of experience in
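
Every summary above stops mid-sentence at a similar short length, which suggests a cap on the number of generated tokens rather than the model choosing to stop. One thing to try is asking for a longer generation. The sketch below assumes the endpoint honors a `parameters.max_new_tokens` field, as Hugging Face text-generation endpoints commonly do; this is not verified for this model, and `build_payload` is a hypothetical helper:

```python
def build_payload(prompt, max_new_tokens=250):
    """Build a request body that asks for a longer generation (assumed parameter name)."""
    return {
        "inputs": prompt,
        # Assumption: the API reads generation settings from a "parameters" object.
        "parameters": {"max_new_tokens": max_new_tokens},
    }

payload = build_payload("Summarize the job skills requirements in 200 words: \n...")
print(payload["parameters"])
```

The result would be passed to the existing function as `query(build_payload(prompt))`.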
