---
format:
    html:
        embed-resources: true
---

# Crawling 

## Overview 

In this portion of the homework, we will crawl Google Jobs to collect job descriptions for later processing.

We will use the SerpApi API (`serpapi`) to crawl Google Jobs. SerpApi is a paid service, but its free tier should be more than enough for this homework. The API lets you search Google programmatically, which has a wealth of practical applications.

We will need an API key from SerpApi. Signing up is free, and you shouldn't need to enter any payment information.

https://serpapi.com/manage-api-key

We will use the following Python wrapper for the API, which can be installed with

`pip install google-search-results`

For instructions on the API and additional reference material, see the following:

* [https://serpapi.com/google-jobs-api](https://serpapi.com/google-jobs-api)
* [https://serpapi.com/blog/scrape-google-jobs-organic-results-with-python/](https://serpapi.com/blog/scrape-google-jobs-organic-results-with-python/)
* [https://serpapi.com/integrations/python](https://serpapi.com/integrations/python)

## Starter code 

Here is some starter code:

`Note: uule parameter`

The uule parameter is an encoded location parameter used in Google search queries. It stands for "Unique User Location Encoding" and is used to specify the geographic location from which the search is being conducted. This can influence the search results to be more relevant to the specified location.

This can be set to `'w+CAIQICINVW5pdGVkIFN0YXRlcw'`, which is an encoded string representing a specific location, i.e. the United States. This encoding helps simulate searches as if they are being conducted from that location, which can be useful for testing or gathering location-specific data.
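
As a quick sanity check, the string above can be decoded to see which location it carries. This is only a sketch: the helper name `decode_uule` is hypothetical, and the byte layout it relies on (a short header, one length byte, then the location name) is an assumption based on this particular string, not a documented format.

```python
import base64

def decode_uule(uule: str) -> str:
    """Return the location name embedded in a `w+...` style uule string."""
    # Drop the "w+" prefix; the remainder is base64 without padding.
    payload = uule.split("+", 1)[1]
    padded = payload + "=" * (-len(payload) % 4)
    raw = base64.urlsafe_b64decode(padded)
    # Assumed layout: a 5-byte header, one length byte, then the UTF-8 name.
    name_length = raw[5]
    return raw[6:6 + name_length].decode("utf-8")

print(decode_uule("w+CAIQICINVW5pdGVkIFN0YXRlcw"))  # United States
```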

In [99]:
from serpapi import GoogleSearch
import json

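Save your API key in a centralized location, e.g. `~/.api-keys.json`, under the name `serpapi`.

Read it in with the `json` module. The code below expects the file one directory above the notebook, so adjust the path to wherever you saved it.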

In [100]:
with open('../.api-keys.json') as f:
    keys = json.load(f)
API_KEY = keys['serpapi']
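
If the key file lives in your home directory instead, a small helper along these lines could load it and fail with a readable message when the file or the entry is missing. The function name `load_api_key` and its default path are illustrative assumptions, not part of the starter code.

```python
import json
from pathlib import Path

def load_api_key(service: str, key_file: str = "~/.api-keys.json") -> str:
    """Read one API key from a JSON file that maps service names to keys."""
    path = Path(key_file).expanduser()   # expand "~" to the home directory
    if not path.exists():
        raise FileNotFoundError(f"No key file found at {path}")
    with open(path) as f:
        keys = json.load(f)
    if service not in keys:
        raise KeyError(f"No entry named {service!r} in {path}")
    return keys[service]
```

With this in place, `API_KEY = load_api_key("serpapi")` would replace the three lines above.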

Be careful: don't run this too many times while debugging and prototyping, or you will use up all your free searches.

In [101]:
search_query = 'data science'
params = {
	'api_key':API_KEY,                          # https://serpapi.com/manage-api-key
	'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',		# encoded location (USA)
	'q': search_query,              			# search query
    'hl': 'en',                         		# language of the search
    'gl': 'us',                         		# country of the search
	'engine': 'google_jobs',					# SerpApi search engine
}

Let's do one search and explore the output.

In [102]:
search = GoogleSearch(params)   			# where data extraction happens on the SerpApi backend
result_dict = search.get_dict() 			# JSON -> Python dict

if 'error' in result_dict:
    print("ERROR FOUND IN SEARCH:", result_dict['error'])   # show the message returned by the API

In [None]:
for result in result_dict['jobs_results']:
    print(result)
    # google_jobs_results.append(result)
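
Printing whole results is noisy, so one option is to reduce each job to a couple of fields first. The sketch below runs on a hand-made dictionary rather than a real response. `summarize_jobs` is a hypothetical helper, and the `title` and `company_name` keys are the ones this notebook prints further down.

```python
def summarize_jobs(result_dict: dict) -> list[tuple[str, str]]:
    """Reduce each job in a result dictionary to a (title, company) pair."""
    rows = []
    # .get() keeps this safe when a key is missing from the response
    for job in result_dict.get("jobs_results", []):
        rows.append((job.get("title", ""), job.get("company_name", "")))
    return rows

# Hand-made stand-in for a real response (not actual API output)
example = {
    "jobs_results": [
        {"title": "Data Scientist", "company_name": "Example Corp"},
        {"title": "Data Analyst"},
    ]
}

for title, company in summarize_jobs(example):
    print(f"{title} | {company}")
```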

In web crawling, pagination involves retrieving data across multiple pages of a website. It helps manage large datasets by fetching a limited number of results per request, enabling efficient data extraction without overwhelming the server or exceeding resource limits.

You can use pagination to get more results for a given search; however, each new search needs to start where the last one left off.

You can do this by adding the `next_page_token` to the `params` dictionary.

You get the latest `next_page_token` from `result_dict["serpapi_pagination"]`.

If you don't use pagination and search 10 times, you will just get the first 10 results over and over again.
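
The token chaining can be tried out without spending any searches. In this sketch, `collect_pages` and `fake_fetch` are hypothetical names, and the fake pages only imitate the `serpapi_pagination` / `next_page_token` structure described above.

```python
from typing import Callable, Optional

def collect_pages(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 3) -> list[dict]:
    """Fetch up to max_pages pages, passing each page's token to the next request."""
    pages = []
    token = None                      # the first request has no token
    for _ in range(max_pages):
        page = fetch_page(token)
        pages.append(page)
        # The last page carries no token, which ends the loop early
        token = page.get("serpapi_pagination", {}).get("next_page_token")
        if not token:
            break
    return pages

# Fake two-page result set standing in for the real API
FAKE_PAGES = {
    None: {"jobs_results": ["job 1", "job 2"],
           "serpapi_pagination": {"next_page_token": "tok-2"}},
    "tok-2": {"jobs_results": ["job 3"]},
}

def fake_fetch(token):
    return FAKE_PAGES[token]

pages = collect_pages(fake_fetch)
print(len(pages))  # 2
```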

In [104]:
print(result_dict["serpapi_pagination"],"\n")
print(result_dict["serpapi_pagination"]["next_page_token"])

{'next_page_token': 'eyJmYyI6IkVvd0ZDc3dFUVVwSE9VcHJUekJxV0MxQ1FrbHViVXB3UkcxSVlYUmFaR2RUWkhSU1ZFRm5ja1l5TlhGVmRsOTVhVTlGZEROS09GRTRZbW90VkVSWWNHaFVhRlJQZVRsaVdteFNVMnR5WlVaclJuWjNTbk13YWxneVRVNUlSeko2WXkxQ2IzZEhTVVZCVkV4MWNFOW1OR0ozWTE5dlpUTkRlV0pCZFdOME5sWlJaamgzV25OZlkxVTNkVmhoUlRsVE5sRnBZakk1YTNCTk1WSlVla3g0VGpRd1RqWTVXWGxNU2t0SE5uSnJkRVJJVEUxTVNIQnJXWFExYld3MGRWaFljVFpqTFRFNWJXeHlaRGR6UlRSUmIzaHZUalptVTNaZlpVbFphVVY0UlhWdVNrSmpjVlp5UmpJM1YyMXhaVTFYTjFvM2VGZHVZa2hTVmxrM05uUXhjbTVOTlZKNllYbHRiVlJDY1ROTmJUSnllazF5WVRKbFoybzNhR05uYlUxT1RVcFZNR05RVG5ReFR6QjBiRUpFTVdwS2VUaGxiVTlRV1hOWVVVc3dObTQyZDBJek9WWXlOMEpmTWxsb01EUmxaRmRQZEVKbU5tVTBZbmxsV2xaclRXZElRV2x1VDNkelVYSnhOelZ3Y0ZkVFZGVnZNMjFMUmkxS1VuVmhRVXBDWTNreGFsTjRZM2hmYzJsUWVFTkdaemw0U2pkVk0yTmlVMmRIU1VJNE1HRm9Wa2M0Tm14S1ZrZFZUV3R4Y3pkaFFVTk1NazVyYkRGNFIwOW5VVEpuTmtkNVkxTk5OV2MyU0hab2RDMWthMmgyVDFWeWNtRnVRMU5RYW05V0xYaFdTMlphY2s5YWJIWjZhV2hJVTJSSVR6WkxjM0ZFZVhJemJGWkpVVEpuUldGMlVuVlpjVTkzTVdOMU9ISlJaa001VjFremVpMUxkMFpGTnpWeFVVbzRZMGh

Let's do one more search, but this time with pagination, starting where the last search left off.

In [105]:
search_query = 'data science'
params = {
	'api_key':API_KEY,                          # https://serpapi.com/manage-api-key
	'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',		# encoded location (USA)
	'q': search_query,              			# search query
    'hl': 'en',                         		# language of the search
    'gl': 'us',                         		# country of the search
    "num": 10,									# number of results per page
	'engine': 'google_jobs',					# SerpApi search engine
    'next_page_token': result_dict["serpapi_pagination"]["next_page_token"]
}

In [106]:
search = GoogleSearch(params)   			# where data extraction happens on the SerpApi backend
result_dict = search.get_dict() 			# JSON -> Python dict

if 'error' in result_dict:
    print("ERROR FOUND IN SEARCH:", result_dict['error'])   # show the message returned by the API

In [None]:
for result in result_dict['jobs_results']:
    print(result)
    # google_jobs_results.append(result)

## Utility function

Create a utility function to search Google Jobs and save the results to a file.

Here is one sketch of what the function might look like:

- Imports the current date and time using `datetime`.
- Defines `search_google_jobs` to perform a Google Jobs search with a default or custom query.
- Accepts parameters for the search query, pagination token, and verbosity.
- Sets search parameters like API key, location, language, and search engine.
- Appends the pagination token if provided.
- Creates a timestamped output filename based on the query and time.
- Does a search and data extraction.
- Optionally prints the data if `verbose` is `True` and saves results to a JSON file.
- Returns the `next_page_token` for pagination or handles errors.

In [108]:
# INSERT CODE HERE
import datetime

def search_google_jobs(search_query='data science', pagination_token=None, verbose=False):
    # Timestamp used to give each output file a unique name
    current_time = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

    params = {
        'api_key': API_KEY,                         # https://serpapi.com/manage-api-key
        'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',     # encoded location (USA)
        'q': search_query,                          # search query
        'hl': 'en',                                 # language of the search
        'gl': 'us',                                 # country of the search
        "num": 10,                                  # number of results per page
        'engine': 'google_jobs',                    # SerpApi search engine
    }

    # Continue from a previous search if a token was provided
    if pagination_token:
        params['next_page_token'] = pagination_token

    search = GoogleSearch(params)
    result_dict = search.get_dict()

    if 'error' in result_dict:
        print("ERROR FOUND IN SEARCH:", result_dict['error'])
        return None, None

    if verbose:
        print(json.dumps(result_dict, indent=2))

    # Save the raw results to a timestamped JSON file
    output_filename = f"googlejobs_{search_query.replace(' ', '_')}_{current_time}.json"
    with open(output_filename, 'w') as f:
        json.dump(result_dict, f, indent=2)

    # Always return a (results, token) pair; the token is None on the last page
    next_page_token = result_dict.get('serpapi_pagination', {}).get('next_page_token')
    return result_dict, next_page_token


In [None]:
_, next_page_token = search_google_jobs(search_query="machine learning engineer", verbose=True)

In [110]:
next_page_token

'eyJmYyI6IkVxSUZDdUlFUVVwSE9VcHJUVVZHZW1OTldqQmZVVTR4YUc5RE9VaFNhVVZxT0hsaVltZFdUMHhVT1VSMFRtSm1RbUZYUVdaclJ5MUlZbll4U1ZFMVdFWm1SVWxRUkUweVNHNU1jMmhQYVRaVFVFMWhhRWRHUnpaNGJWWktTbGRKZVdWVGQzcHlVak15VEhscmNtaG5ZM2hSYzNsbVdEZEZkV2g1WkhZMmVERkljMjEyZFVneGRVWkhZVlF4U0dKNVVIaFdiV1ZJTjJsTFJreG1OMUl0ZFcwNFQwaE9MWGxMV201UlpXZ3ROV1ZmZGtwYWJTMVlNVUpIY2poeVpXOWFZMjR6UkZKRk1FZEhWelJWTlY5RVZHdEtXSEJJZUMwNVZXOUNRVm80TjFKcVgyVjBlR2xQTkd0UVJHVkdiek4wTTI5S05FMXlZV1J1WDJOQlJIRXdhV3BDV1RaV1MybG1USGxUWW1jMk0yaGhUVWxsTm14eFdWVkNUbWs0TTNoa1IwRTBlRzVvZDFSa2FsWkNWRTFGWlZsa1VHYzBXWE5GVTI4MGExZFNUVmQyYVhKd05FSkdPRXBDTm1SNWNqWTBjWEZHZEdsNlNtMWxhSGcxWlZrMWN6bDVla3RJY2toaGVHaFJja3RPV1ZvMVgzWXpPRkUzYVhkWlpucFliR3czTms5NU5tMVZXbTVJYTNOMFdVbG5VVXhqYVV3MmRVeHNVM2MwZVZJd09IZG9OR0phVFZkbGFtUlBjV0pVWkZZMldESjRka1J4ZUhsdmVWOUZNME5LWDNFdFp6aHNaMHB4VEhCbVNGZEpkMWxFYkV0U01qbE9iWHBHWlhKb1ZsUm9iMTlwWkd4SlpuaGxaMFUyWW1aRFRGSTFhbWd5YmpsT2NtdE5iMnRyUVY5cVkybG9iVmRUYmxaNmJFMTFNa05ZY21wU1lrdGxTV04yVmtWMFVXbElWbDkxU0RoQlRGQlJNelZMV1d

In [None]:
search_google_jobs(search_query="machine learning engineer", pagination_token=next_page_token, verbose=True)

In [112]:
for result in result_dict['jobs_results']:
    print(result["title"], result["company_name"])

Manager Data Scientist Capital One
Senior Data Scientist, ASE iCloud Data Organization [Executive Communications] Apple
Data Scientist Harris-Stowe State University
Senior Computational Biology - Real World Data Science Tempus
Lead Data Scientist Cox Communications
Senior Data Scientist / Machine Learning Engineer - GenAI & LLM Databricks
Sr. data science (Artificial Intelligence) Pyromis
Data Scientist, Paramount Advertising Paramount
Senior Scientist, Machine Learning 50056740 - Senior Scientist
BP Data Science Analyst - NY, NJ, Or PA Visions Federal Credit Union


## Iterate over job titles

These titles reflect a wide range of roles that leverage data science and machine learning skills in various industries and specialties.

- Data Scientist
- Machine Learning Engineer
- Artificial Intelligence Specialist
- Data Analyst
- Business Intelligence Analyst
- Research Scientist (AI/ML)
- Deep Learning Engineer
- NLP Engineer (Natural Language Processing)
- Computer Vision Engineer
- Data Engineer
- Applied Scientist
- Quantitative Analyst (Quant)
- Predictive Modeler
- AI Solutions Architect
- Statistician
- Big Data Engineer
- Data Science Consultant
- Automation Engineer
- Analytics Manager
- Decision Scientist
- Operations Research Analyst
- Robotics Engineer
- Bioinformatics Data Scientist
- Healthcare Data Analyst
- Financial Data Scientist
- Customer Insights Analyst
- Marketing Data Analyst
- Data Strategy Manager
- Cloud AI Engineer
- Computational Scientist
- Fraud Detection Specialist
- Risk Analyst
- Data Architect
- Algorithm Engineer

For each keyword, do three searches using pagination. This will yield around 30 jobs per keyword (assuming there are at least 30 jobs for that keyword). Save the results of each search to a file.

Note: just to be safe, wait one second between requests, e.g. using `time.sleep(1)`.

In [113]:
job_titles = [
    "Data Scientist",
    "Machine Learning Engineer",
    "Artificial Intelligence Specialist",
    "Data Analyst",
    "Business Intelligence Analyst",
    "Research Scientist (AI-ML)",
    "Deep Learning Engineer",
    "NLP Engineer (Natural Language Processing)",
    "Computer Vision Engineer",
    "Data Engineer",
    "Applied Scientist",
    "Quantitative Analyst (Quant)",
    "AI Solutions Architect",
    "Statistician",
    "Big Data Engineer",
    "Data Science Consultant",
    "Automation Engineer",
    "Analytics Manager",
    "Operations Research Analyst",
    "Robotics Engineer",
    "Bioinformatics Data Scientist",
    "Financial Data Scientist",
    "Customer Insights Analyst",
    "Marketing Data Analyst",
    "Data Strategy Manager",
    "Cloud AI Engineer",
    "Computational Scientist",
    "Fraud Detection Specialist",
    "Risk Analyst",
    "Data Architect"
]

print(len(job_titles)*3)

90


Now insert code to iterate over the job titles, and perform the searches.

Be very careful, this needs to be 100% correct before running it, otherwise you will burn through your free searches.

I would recommend doing just one iteration of the loop as a trial run. If that looks good, do the next iteration and carefully check the results. If everything still looks good, do the remaining 28 iterations.
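
One way to check the budget before spending anything is to list the requests a run would make, without calling the API. `plan_requests` and its `limit` argument are hypothetical names used only for this sketch.

```python
def plan_requests(job_titles, pages_per_title=3, limit=None):
    """List the (title, page) requests a run would make, without calling the API."""
    selected = job_titles[:limit] if limit is not None else job_titles
    return [(title, page)
            for title in selected
            for page in range(1, pages_per_title + 1)]

titles = ["Data Scientist", "Machine Learning Engineer", "Data Analyst"]

trial = plan_requests(titles, limit=1)   # trial run: first title only
full = plan_requests(titles)             # every title

print(len(trial), len(full))  # 3 9
```

Slicing the real `job_titles` list the same way (for example `job_titles[:1]`) keeps a trial run to three searches.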

Note: sometimes pagination will return fewer than 10 results, so you may end up with slightly fewer than 30 results per keyword, e.g. 25 to 30.

Remember to clean the job titles to remove any characters like spaces, `/` or `()`.
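
A regular expression is one possible way to do this cleaning. In the sketch below, `clean_title` is a hypothetical helper; collapsing every run of non-alphanumeric characters into a single underscore and lowercasing the result are choices made for the example, not requirements of the assignment.

```python
import re

def clean_title(title: str) -> str:
    """Turn a job title into a string that is safe to use in a filename."""
    # Replace each run of spaces, slashes, parentheses, etc. with one underscore
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", title)
    # Trim underscores left at the ends, e.g. by a trailing ")"
    return cleaned.strip("_").lower()

print(clean_title("Research Scientist (AI/ML)"))                  # research_scientist_ai_ml
print(clean_title("NLP Engineer (Natural Language Processing)"))  # nlp_engineer_natural_language_processing
```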

In [None]:
# INSERT CODE HERE
import time

all_results = []

for title in job_titles:
    page_count = 0
    pagination_token = None

    # Up to three pages (roughly 30 jobs) per title
    while page_count < 3:
        result_dict, next_token = search_google_jobs(search_query=title, pagination_token=pagination_token, verbose=True)
        time.sleep(1)   # wait one second between requests, just to be safe

        if result_dict is None:
            print(f"Search failed for {title}")
            break

        page_count += 1

        all_results.append({
            "query": title,
            "page": page_count,     # 1-based page number
            "results": result_dict
        })

        if not next_token:
            print(f"No more pages for {title}")
            break

        pagination_token = next_token

# Save everything collected across all titles to one timestamped file
current_time = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
output_filename = f"googlejobs_alltitles_{current_time}.json"

with open(output_filename, 'w') as f:
    json.dump(all_results, f, indent=2)
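
After a run, it may be worth checking how many jobs were actually collected per title. This sketch assumes the `query` / `page` / `results` structure built in the loop above and uses a hand-made sample instead of real output. `jobs_per_query` is a hypothetical helper.

```python
from collections import Counter

def jobs_per_query(all_results):
    """Count the jobs collected for each query across all of its pages."""
    counts = Counter()
    for entry in all_results:
        # A page without 'jobs_results' contributes zero jobs
        counts[entry["query"]] += len(entry["results"].get("jobs_results", []))
    return dict(counts)

# Hand-made sample with the same shape as all_results (not real data)
sample = [
    {"query": "Data Scientist", "page": 1, "results": {"jobs_results": [{}, {}]}},
    {"query": "Data Scientist", "page": 2, "results": {"jobs_results": [{}]}},
    {"query": "Data Analyst", "page": 1, "results": {}},
]

print(jobs_per_query(sample))  # {'Data Scientist': 3, 'Data Analyst': 0}
```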
