# Collect Job Data with Generative AI

This notebook demonstrates how to collect job posts from [USAJOBS](https://developer.usajobs.gov/). 

Please note:
- If a data source offers a direct download, downloading the data is the easiest option.
- Otherwise, use an API, with the assistance of AI, to collect the data.
- Please avoid web crawling with AI, and always check a website's [robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/intro) file before crawling it.

## Set up a Database and Request API Keys

Create a [MongoDB](https://www.mongodb.com) cluster and store the connection string in a safe place, such as AWS Secrets Manager.
- key name: `connection_string`
- key value: <`the connection string`>, with your password typed in
- secret name: `mongodb`

Request a [USAJOBS API key](https://developer.usajobs.gov/apirequest/) and store the key in a safe place, such as AWS Secrets Manager. 
- key name: `api_key`
- key value: <`the API key you received in email`>
- secret name: `usajobs`

The notebook also reads an OpenAI API key from AWS Secrets Manager:
- key name: `api_key`
- key value: <`your OpenAI API key`>
- secret name: `openai`

You also need to store your email in AWS Secrets Manager:
- key name: `address`
- key value: <`the email you used when applying for the API key`>
- secret name: `email`
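
The MongoDB, USAJOBS, and email secrets could also be created programmatically rather than through the AWS console. The following is a minimal sketch, assuming the boto3 Secrets Manager client's `create_secret` call, the secret names listed above, and the key names that the later cells read. `build_secret_payloads` and `store_secrets` are hypothetical helper names, and all values shown are placeholders.

```python
import json

def build_secret_payloads(mongodb_uri, usajobs_key, email_address):
    """Map each secret name to the JSON string stored as its SecretString."""
    return {
        "mongodb": json.dumps({"connection_string": mongodb_uri}),
        "usajobs": json.dumps({"api_key": usajobs_key}),
        "email": json.dumps({"address": email_address}),
    }

def store_secrets(client, payloads):
    """Create one secret per entry; `client` is a boto3 Secrets Manager client."""
    for name, secret_string in payloads.items():
        client.create_secret(Name=name, SecretString=secret_string)

# Usage (requires AWS credentials; replace the placeholder values):
# import boto3
# client = boto3.client("secretsmanager", region_name="us-east-1")
# store_secrets(client, build_secret_payloads("<connection string>", "<API key>", "<email>"))
```

Passing the client in as an argument keeps the helpers independent of any particular AWS session or region.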

## Install Python Packages

- jupyter-ai: the JupyterLab extension to call Generative AI models
- langchain-openai: the LangChain package to interact with OpenAI
- pymongo: manage the MongoDB database

In [21]:
pip install jupyter-ai~=1.0 # Because I am using JupyterLab V3, I need to use Jupyter-ai V1.0

Note: you may need to restart the kernel to use updated packages.


In [22]:
pip install jupyter-ai[all] # run this cell if your AI model is not in the %ai list output



In [23]:
pip install langchain-openai # skip this if you installed jupyter-ai[all]



In [24]:
pip install pymongo



## Secrets Manager Function

In [8]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [12]:
import pymongo
from pymongo import MongoClient
import json
import re
import os

os.environ["OPENAI_API_KEY"] = get_secret('openai')['api_key']
email = get_secret('email')['address']
mongodb_connect = get_secret('mongodb')['connection_string']
usa_jobs_key = get_secret('usajobs')['api_key']

## Connect to the MongoDB Cluster

In [13]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo  # use or create a database named demo
job_collection = db.job_collection  # use or create a collection named job_collection


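If this cell raises `InvalidURI: Invalid URI scheme: URI must begin with 'mongodb://' or 'mongodb+srv://'`, the value stored in the `mongodb` secret is not a valid connection string. Update the secret and rerun the cell.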
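
The URI scheme requirement can be checked before the value is handed to `MongoClient`. The following is a minimal sketch, assuming only the two schemes named in the error message; `has_valid_mongo_scheme` is a hypothetical helper, not part of pymongo.

```python
# The two schemes named in the InvalidURI message
VALID_SCHEMES = ("mongodb://", "mongodb+srv://")

def has_valid_mongo_scheme(uri):
    """Return True if `uri` is a string that starts with an accepted scheme."""
    return isinstance(uri, str) and uri.startswith(VALID_SCHEMES)

# Usage in the notebook, before creating the client:
# if not has_valid_mongo_scheme(mongodb_connect):
#     raise ValueError("The mongodb secret does not hold a valid connection string")
```

This only checks the prefix. A string that passes can still fail to connect, for example because of a wrong password or host.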

## Load the Jupyter AI Magic Commands

In [11]:
%load_ext jupyter_ai_magics



Optionally, list the available AI models:

In [None]:
%ai list

## Example Prompt
Below is code produced by an example prompt. It may or may not work as-is.

In [14]:
import requests

def search_usajobs(agent, auth_key, location, keywords, collection):
    base_url = "https://data.usajobs.gov/api/search"
    headers = {
        "User-Agent": agent, 
        "Authorization-Key": auth_key,
        "Host": "data.usajobs.gov"
    }
    params = {
        "LocationName": location,
        "Keyword": keywords,
        "ResultsPerPage": 500
    }

    response = requests.get(base_url, headers=headers, params=params)
    data = response.json()
    num_pages = int(data['SearchResult']['UserArea']['NumberOfPages'])

    for page in range(1, num_pages + 1):
        params["Page"] = page
        response = requests.get(base_url, headers=headers, params=params)
        data = response.json()
        jobs = data['SearchResult']['SearchResultItems']

        for job in jobs:
            collection.insert_one(job)

## Example Code
Below is AI-generated code that works:

In [18]:
import requests

def search_jobs(agent, auth_key, job_location, job_keywords, collection):
    base_url = 'https://data.usajobs.gov/api/search'
    # The email registered with the API key is sent as the User-Agent
    headers = {'User-Agent': agent, 'Authorization-Key': auth_key}
    params = {'LocationName': job_location, 'Keyword': job_keywords, 'ResultsPerPage': 500}

    page = 1
    while page <= 10:  # request at most 10 pages
        params['Page'] = page
        response = requests.get(base_url, headers=headers, params=params)
        if response.status_code != 200:
            break

        job_data = response.json()
        jobs = job_data['SearchResult']['SearchResultItems']
        if not jobs:  # no results on this page, so stop paging
            break

        for job in jobs:
            # Store only the job description, not the search metadata
            job_info = job['MatchedObjectDescriptor']
            collection.insert_one(job_info)

        page += 1
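
Because the function calls `insert_one` for every result, rerunning the same search stores each job again. One possible alternative is an upsert keyed on a job identifier. The following is a minimal sketch, assuming that each `MatchedObjectDescriptor` carries a `PositionID` field that identifies the posting; `upsert_job` is a hypothetical helper, and `update_one` with `upsert=True` is the standard pymongo call.

```python
def upsert_job(collection, job_info):
    """Insert the job, or overwrite the stored copy with the same PositionID."""
    # Assumption: 'PositionID' identifies a posting in MatchedObjectDescriptor
    collection.update_one(
        {"PositionID": job_info["PositionID"]},  # match an existing copy
        {"$set": job_info},                      # replace its fields
        upsert=True,                             # insert if nothing matched
    )
```

In the loop above, `upsert_job(collection, job_info)` would take the place of `collection.insert_one(job_info)`.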

Use the AI-generated code to collect `geospatial` jobs in `Fairfax, VA`. We also pass `job_collection`, `usa_jobs_key`, and `email` to the function.

In [19]:
search_jobs(collection=job_collection,
            auth_key=usa_jobs_key,
            agent=email,
            job_keywords='geospatial',
            job_location='fairfax, va')

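If this cell raises `NameError: name 'job_collection' is not defined`, the MongoDB connection cell did not complete. Fix the connection string and rerun that cell first.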

Display the number of collected jobs:

In [20]:
job_collection.estimated_document_count()

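
Note that `estimated_document_count()` returns an estimate based on collection metadata, while `count_documents({})` returns an exact count. To spot-check what was stored, a few titles can be read back. The following is a minimal sketch, assuming the stored documents carry a `PositionTitle` field; `sample_titles` is a hypothetical helper built on pymongo's `find` and `limit`.

```python
def sample_titles(collection, limit=5):
    """Return up to `limit` job titles from the collection."""
    # Project only the title and drop MongoDB's _id field
    cursor = collection.find({}, {"PositionTitle": 1, "_id": 0}).limit(limit)
    # Documents without a title yield None
    return [doc.get("PositionTitle") for doc in cursor]

# Usage in the notebook:
# sample_titles(job_collection)
```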